* [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup
@ 2020-07-02  9:24 Jakub Sitnicki
  2020-07-02  9:24 ` [PATCH bpf-next v3 01/16] bpf, netns: Handle multiple link attachments Jakub Sitnicki
                   ` (16 more replies)
  0 siblings, 17 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Andrii Nakryiko, Lorenz Bauer,
	Marek Majkowski, Martin KaFai Lau

Overview
========

(Same as in v2. Please skip to the next section if you've read it.)

This series proposes a new BPF program type named BPF_PROG_TYPE_SK_LOOKUP,
or BPF sk_lookup for short.

A BPF sk_lookup program runs when the transport layer is looking up a
listening socket for a new connection request (TCP), or an unconnected
socket for a packet (UDP).

This serves as a mechanism to overcome the limits of what the bind() API
can express. Two use-cases driving this work are:

 (1) steer packets destined to an IP range, fixed port to a single socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, any port to a single socket

     198.51.100.1, any port -> L7 proxy socket
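For illustration, the two steering rules above boil down to match logic
along these lines (a userspace C sketch; the steer() helper and target
names are made up for the example, this is not BPF program code):

```c
#include <assert.h>
#include <stdint.h>

/* Placeholders for the target sockets in the two use cases above. */
enum target { NO_MATCH, NGINX_SOCK, L7_PROXY_SOCK };

/* Use case (1): 192.0.2.0/24, port 80  -> NGINX socket.
 * Use case (2): 198.51.100.1, any port -> L7 proxy socket.
 * addr is an IPv4 address in host byte order.
 */
static enum target steer(uint32_t addr, uint16_t port)
{
	if ((addr & 0xffffff00u) == 0xc0000200u && port == 80)
		return NGINX_SOCK;	/* 192.0.2.0/24, port 80 */
	if (addr == 0xc6336401u)
		return L7_PROXY_SOCK;	/* 198.51.100.1, any port */
	return NO_MATCH;
}
```

Neither rule is expressible with a plain bind(): (1) would need one
bind() per address in the /24, and (2) one bind() per port.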

In its context, the program receives information about the packet that
triggered the socket lookup: namely the IP version, L4 protocol
identifier, and address 4-tuple.

To select a socket, the BPF program fetches it from a map holding socket
references, such as SOCKMAP or SOCKHASH, calls the bpf_sk_assign(ctx, sk,
...) helper to record the selection, and returns the BPF_REDIRECT code.
The transport layer then uses the selected socket as the result of the
socket lookup.

Alternatively, the program can fail the lookup (BPF_DROP), or let the
lookup continue as usual (BPF_OK).

This lets the user match packets with listening (TCP) or receiving (UDP)
sockets freely at the last possible point on the receive path, where we
know that packets are destined for local delivery after undergoing
policing, filtering, and routing.
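The way the transport layer consumes the three verdicts can be modeled in
plain userspace C (enum and function names here are illustrative, not
kernel identifiers, and the real numeric codes come from the UAPI):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the BPF_OK/BPF_DROP/BPF_REDIRECT codes. */
enum lookup_verdict { VERDICT_OK, VERDICT_DROP, VERDICT_REDIRECT };

struct sock { int id; };

/* BPF_REDIRECT -> use the socket recorded with bpf_sk_assign(),
 * BPF_DROP     -> fail the lookup (no socket found),
 * BPF_OK       -> continue with the regular hashtable lookup.
 */
static struct sock *resolve_lookup(enum lookup_verdict verdict,
				   struct sock *selected,
				   struct sock *htab_result)
{
	switch (verdict) {
	case VERDICT_REDIRECT:
		return selected;
	case VERDICT_DROP:
		return NULL;
	case VERDICT_OK:
	default:
		return htab_result;
	}
}
```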

The program is attached to a network namespace, similar to BPF
flow_dissector. We add a new attach type, BPF_SK_LOOKUP, for this.

Series structure
================

Patches are organized as follows:

 1: enables multiple link-based prog attachments for bpf-netns
 2: introduces sk_lookup program type
 3-4: hook up the program to run on ipv4/tcp socket lookup
 5-6: hook up the program to run on ipv6/tcp socket lookup
 7-8: hook up the program to run on ipv4/udp socket lookup
 9-10: hook up the program to run on ipv6/udp socket lookup
 11-13: libbpf & bpftool support for sk_lookup
 14-16: verifier and selftests for sk_lookup

Patches are also available on GH:

  https://github.com/jsitnicki/linux/commits/bpf-inet-lookup-v3

Performance considerations
==========================

I'm re-running the udp6 small-packet flood test, the scenario for which
we had performance concerns in [v2], to measure the pps hit after the
changes called out in the changelog below.

I will follow up with the results, but I'm posting the patches early for
review since there is a fair amount of code change.

Further work
============

- user docs for new prog type, Documentation/bpf/prog_sk_lookup.rst
  I'm looking for consensus on multi-prog semantics outlined in patch #4
  description before drafting the document.

- timeout on accept() in tests
  I need to extract a helper for it into network_helpers in
  selftests/bpf/. Didn't want to make this series any longer.

Note to maintainers
===================

This patch series depends on the bpf-netns multi-prog changes that went
recently into 'bpf' [0]. It won't apply to 'bpf-next' until 'bpf' gets
merged into 'bpf-next'.

Changelog
=========

v3 brings the following changes based on feedback:

1. switch to link-based program attachment,
2. support for multi-prog attachment,
3. ability to skip reuseport socket selection,
4. code on RX path is guarded by a static key,
5. struct in6_addr's are no longer copied into BPF prog context,
6. BPF prog context is initialized as late as possible.

v2 -> v3:
- Changes called out in patches 1-2, 4, 6, 8, 10-14, 16
- Patches dropped:
  01/17 flow_dissector: Extract attach/detach/query helpers
  03/17 inet: Store layer 4 protocol in inet_hashinfo
  08/17 udp: Store layer 4 protocol in udp_table

v1 -> v2:
- Changes called out in patches 2, 13-15, 17
- Rebase to recent bpf-next (b4563facdcae)

RFCv2 -> v1:

- Switch to fetching a socket from a map and selecting a socket with
  bpf_sk_assign, instead of having a dedicated helper that does both.
- Run reuseport logic on sockets selected by BPF sk_lookup.
- Allow BPF sk_lookup to fail the lookup with no match.
- Go back to having just 2 hash table lookups in UDP.

RFCv1 -> RFCv2:

- Make socket lookup redirection map-based. BPF program now uses a
  dedicated helper and a SOCKARRAY map to select the socket to redirect to.
  A consequence of this change is that bpf_inet_lookup context is now
  read-only.
- Look for connected UDP sockets before allowing redirection from BPF.
  This makes connected UDP socket work as expected in the presence of
  inet_lookup prog.
- Share the code for BPF_PROG_{ATTACH,DETACH,QUERY} with flow_dissector,
  the only other per-netns BPF prog type.

[RFCv1] https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/
[RFCv2] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
[v1] https://lore.kernel.org/bpf/20200511185218.1422406-18-jakub@cloudflare.com/
[v2] https://lore.kernel.org/bpf/20200506125514.1020829-1-jakub@cloudflare.com/
[0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=951f38cf08350884e72e0936adf147a8d764cc5d

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Cc: Marek Majkowski <marek@cloudflare.com>
Cc: Martin KaFai Lau <kafai@fb.com>

Jakub Sitnicki (16):
  bpf, netns: Handle multiple link attachments
  bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  inet: Extract helper for selecting socket from reuseport group
  inet: Run SK_LOOKUP BPF program on socket lookup
  inet6: Extract helper for selecting socket from reuseport group
  inet6: Run SK_LOOKUP BPF program on socket lookup
  udp: Extract helper for selecting socket from reuseport group
  udp: Run SK_LOOKUP BPF program on socket lookup
  udp6: Extract helper for selecting socket from reuseport group
  udp6: Run SK_LOOKUP BPF program on socket lookup
  bpf: Sync linux/bpf.h to tools/
  libbpf: Add support for SK_LOOKUP program type
  tools/bpftool: Add name mappings for SK_LOOKUP prog and attach type
  selftests/bpf: Add verifier tests for bpf_sk_lookup context access
  selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c
  selftests/bpf: Tests for BPF_SK_LOOKUP attach point

 include/linux/bpf-netns.h                     |    3 +
 include/linux/bpf.h                           |   33 +
 include/linux/bpf_types.h                     |    2 +
 include/linux/filter.h                        |   99 ++
 include/uapi/linux/bpf.h                      |   74 +
 kernel/bpf/core.c                             |   22 +
 kernel/bpf/net_namespace.c                    |  125 +-
 kernel/bpf/syscall.c                          |    9 +
 net/core/filter.c                             |  188 +++
 net/ipv4/inet_hashtables.c                    |   60 +-
 net/ipv4/udp.c                                |   93 +-
 net/ipv6/inet6_hashtables.c                   |   66 +-
 net/ipv6/udp.c                                |   97 +-
 scripts/bpf_helpers_doc.py                    |    9 +-
 tools/bpf/bpftool/common.c                    |    1 +
 tools/bpf/bpftool/prog.c                      |    3 +-
 tools/include/uapi/linux/bpf.h                |   74 +
 tools/lib/bpf/libbpf.c                        |    3 +
 tools/lib/bpf/libbpf.h                        |    2 +
 tools/lib/bpf/libbpf.map                      |    2 +
 tools/lib/bpf/libbpf_probes.c                 |    3 +
 .../bpf/prog_tests/reference_tracking.c       |    2 +-
 .../selftests/bpf/prog_tests/sk_lookup.c      | 1353 +++++++++++++++++
 .../selftests/bpf/progs/test_ref_track_kern.c |  181 +++
 .../selftests/bpf/progs/test_sk_lookup_kern.c |  462 ++++--
 .../selftests/bpf/verifier/ctx_sk_lookup.c    |  219 +++
 26 files changed, 2995 insertions(+), 190 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_ref_track_kern.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c

-- 
2.25.4



* [PATCH bpf-next v3 01/16] bpf, netns: Handle multiple link attachments
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
@ 2020-07-02  9:24 ` Jakub Sitnicki
  2020-07-09  3:44   ` Andrii Nakryiko
  2020-07-02  9:24 ` [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Extend the BPF netns link callbacks to rebuild (grow/shrink) or update the
prog_array at a given position when a link gets attached/updated/released.

This lets us lift the limit of having just one link attached for the new
attach type introduced by a subsequent patch.

No functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - New in v3 to support multi-prog attachments. (Alexei)

 include/linux/bpf.h        |  4 ++
 kernel/bpf/core.c          | 22 ++++++++++
 kernel/bpf/net_namespace.c | 88 +++++++++++++++++++++++++++++++++++---
 3 files changed, 107 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3d2ade703a35..26bc70533db0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -928,6 +928,10 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
 
 void bpf_prog_array_delete_safe(struct bpf_prog_array *progs,
 				struct bpf_prog *old_prog);
+void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
+				   unsigned int index);
+void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
+			      struct bpf_prog *prog);
 int bpf_prog_array_copy_info(struct bpf_prog_array *array,
 			     u32 *prog_ids, u32 request_cnt,
 			     u32 *prog_cnt);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9df4cc9a2907..d4b3b9ee6bf1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1958,6 +1958,28 @@ void bpf_prog_array_delete_safe(struct bpf_prog_array *array,
 		}
 }
 
+void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
+				   unsigned int index)
+{
+	bpf_prog_array_update_at(array, index, &dummy_bpf_prog.prog);
+}
+
+void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
+			      struct bpf_prog *prog)
+{
+	struct bpf_prog_array_item *item;
+
+	for (item = array->items; item->prog; item++) {
+		if (item->prog == &dummy_bpf_prog.prog)
+			continue;
+		if (!index) {
+			WRITE_ONCE(item->prog, prog);
+			break;
+		}
+		index--;
+	}
+}
+
 int bpf_prog_array_copy(struct bpf_prog_array *old_array,
 			struct bpf_prog *exclude_prog,
 			struct bpf_prog *include_prog,
diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
index 247543380fa6..6011122c35b6 100644
--- a/kernel/bpf/net_namespace.c
+++ b/kernel/bpf/net_namespace.c
@@ -36,11 +36,51 @@ static void netns_bpf_run_array_detach(struct net *net,
 	bpf_prog_array_free(run_array);
 }
 
+static unsigned int link_index(struct net *net,
+			       enum netns_bpf_attach_type type,
+			       struct bpf_netns_link *link)
+{
+	struct bpf_netns_link *pos;
+	unsigned int i = 0;
+
+	list_for_each_entry(pos, &net->bpf.links[type], node) {
+		if (pos == link)
+			return i;
+		i++;
+	}
+	return UINT_MAX;
+}
+
+static unsigned int link_count(struct net *net,
+			       enum netns_bpf_attach_type type)
+{
+	struct list_head *pos;
+	unsigned int i = 0;
+
+	list_for_each(pos, &net->bpf.links[type])
+		i++;
+	return i;
+}
+
+static void fill_prog_array(struct net *net, enum netns_bpf_attach_type type,
+			    struct bpf_prog_array *prog_array)
+{
+	struct bpf_netns_link *pos;
+	unsigned int i = 0;
+
+	list_for_each_entry(pos, &net->bpf.links[type], node) {
+		prog_array->items[i].prog = pos->link.prog;
+		i++;
+	}
+}
+
 static void bpf_netns_link_release(struct bpf_link *link)
 {
 	struct bpf_netns_link *net_link =
 		container_of(link, struct bpf_netns_link, link);
 	enum netns_bpf_attach_type type = net_link->netns_type;
+	struct bpf_prog_array *old_array, *new_array;
+	unsigned int cnt, idx;
 	struct net *net;
 
 	mutex_lock(&netns_bpf_mutex);
@@ -53,9 +93,27 @@ static void bpf_netns_link_release(struct bpf_link *link)
 	if (!net)
 		goto out_unlock;
 
-	netns_bpf_run_array_detach(net, type);
+	/* Remember link position in case of safe delete */
+	idx = link_index(net, type, net_link);
 	list_del(&net_link->node);
 
+	cnt = link_count(net, type);
+	if (!cnt) {
+		netns_bpf_run_array_detach(net, type);
+		goto out_unlock;
+	}
+
+	old_array = rcu_dereference_protected(net->bpf.run_array[type],
+					      lockdep_is_held(&netns_bpf_mutex));
+	new_array = bpf_prog_array_alloc(cnt, GFP_KERNEL);
+	if (!new_array) {
+		bpf_prog_array_delete_safe_at(old_array, idx);
+		goto out_unlock;
+	}
+	fill_prog_array(net, type, new_array);
+	rcu_assign_pointer(net->bpf.run_array[type], new_array);
+	bpf_prog_array_free(old_array);
+
 out_unlock:
 	mutex_unlock(&netns_bpf_mutex);
 }
@@ -76,6 +134,7 @@ static int bpf_netns_link_update_prog(struct bpf_link *link,
 		container_of(link, struct bpf_netns_link, link);
 	enum netns_bpf_attach_type type = net_link->netns_type;
 	struct bpf_prog_array *run_array;
+	unsigned int idx;
 	struct net *net;
 	int ret = 0;
 
@@ -95,7 +154,8 @@ static int bpf_netns_link_update_prog(struct bpf_link *link,
 
 	run_array = rcu_dereference_protected(net->bpf.run_array[type],
 					      lockdep_is_held(&netns_bpf_mutex));
-	WRITE_ONCE(run_array->items[0].prog, new_prog);
+	idx = link_index(net, type, net_link);
+	bpf_prog_array_update_at(run_array, idx, new_prog);
 
 	old_prog = xchg(&link->prog, new_prog);
 	bpf_prog_put(old_prog);
@@ -295,18 +355,29 @@ int netns_bpf_prog_detach(const union bpf_attr *attr)
 	return ret;
 }
 
+static int netns_bpf_max_progs(enum netns_bpf_attach_type type)
+{
+	switch (type) {
+	case NETNS_BPF_FLOW_DISSECTOR:
+		return 1;
+	default:
+		return 0;
+	}
+}
+
 static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
 				 enum netns_bpf_attach_type type)
 {
 	struct bpf_netns_link *net_link =
 		container_of(link, struct bpf_netns_link, link);
 	struct bpf_prog_array *run_array;
+	unsigned int cnt;
 	int err;
 
 	mutex_lock(&netns_bpf_mutex);
 
-	/* Allow attaching only one prog or link for now */
-	if (!list_empty(&net->bpf.links[type])) {
+	cnt = link_count(net, type);
+	if (cnt >= netns_bpf_max_progs(type)) {
 		err = -E2BIG;
 		goto out_unlock;
 	}
@@ -327,16 +398,19 @@ static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
 	if (err)
 		goto out_unlock;
 
-	run_array = bpf_prog_array_alloc(1, GFP_KERNEL);
+	run_array = bpf_prog_array_alloc(cnt + 1, GFP_KERNEL);
 	if (!run_array) {
 		err = -ENOMEM;
 		goto out_unlock;
 	}
-	run_array->items[0].prog = link->prog;
-	rcu_assign_pointer(net->bpf.run_array[type], run_array);
 
 	list_add_tail(&net_link->node, &net->bpf.links[type]);
 
+	fill_prog_array(net, type, run_array);
+	run_array = rcu_replace_pointer(net->bpf.run_array[type], run_array,
+					lockdep_is_held(&netns_bpf_mutex));
+	bpf_prog_array_free(run_array);
+
 out_unlock:
 	mutex_unlock(&netns_bpf_mutex);
 	return err;
-- 
2.25.4



* [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
  2020-07-02  9:24 ` [PATCH bpf-next v3 01/16] bpf, netns: Handle multiple link attachments Jakub Sitnicki
@ 2020-07-02  9:24 ` Jakub Sitnicki
  2020-07-04 18:42   ` Yonghong Song
                     ` (4 more replies)
  2020-07-02  9:24 ` [PATCH bpf-next v3 03/16] inet: Extract helper for selecting socket from reuseport group Jakub Sitnicki
                   ` (14 subsequent siblings)
  16 siblings, 5 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Marek Majkowski

Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach
type, BPF_SK_LOOKUP. The new program kind is to be invoked by the
transport layer when looking up a listening socket for a new connection
request (connection-oriented protocols), or an unconnected socket for a
packet (connectionless protocols).

When called, an SK_LOOKUP BPF program can select a socket that will
receive the packet. This serves as a mechanism to overcome the limits of
what the bind() API can express. Two use-cases driving this work are:

 (1) steer packets destined to an IP range, on fixed port to a socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, on any port to a socket

     198.51.100.1, any port -> L7 proxy socket

In its run-time context, the program receives information about the
packet that triggered the socket lookup: namely the IP version, L4
protocol identifier, and address 4-tuple. The context can be further
extended to include the ingress interface identifier.

To select a socket, the BPF program fetches it from a map holding socket
references, such as SOCKMAP or SOCKHASH, and calls the
bpf_sk_assign(ctx, sk, ...) helper to record the selection. The
transport layer then uses the selected socket as the result of the
socket lookup.

This patch only enables the user to attach an SK_LOOKUP program to a
network namespace. Subsequent patches hook it up to run on the local
delivery path in the ipv4 and ipv6 stacks.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Allow the bpf_sk_assign helper to replace a previously selected socket
      only when the BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution
      against multiple programs running in series accidentally overriding each
      other's verdict.
    - Let BPF program decide that load-balancing within a reuseport socket group
      should be skipped for the socket selected with bpf_sk_assign() by passing
      BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
    - Extend struct bpf_sk_lookup program context with an 'sk' field containing
      the selected socket, with the intention that multiple attached programs
      running in series can see each other's choices. However, currently the
      verifier doesn't allow checking if the pointer is set.
    - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
    - Get rid of macros in convert_ctx_access to make it easier to read.
    - Disallow 1-,2-byte access to context fields containing IP addresses.
    
    v2:
    - Make bpf_sk_assign reject sockets that don't use RCU freeing.
      Update bpf_sk_assign docs accordingly. (Martin)
    - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
    - Fix broken build when CONFIG_INET is not selected. (Martin)
    - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
    - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)

 include/linux/bpf-netns.h  |   3 +
 include/linux/bpf_types.h  |   2 +
 include/linux/filter.h     |  19 ++++
 include/uapi/linux/bpf.h   |  74 +++++++++++++++
 kernel/bpf/net_namespace.c |   5 +
 kernel/bpf/syscall.c       |   9 ++
 net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
 scripts/bpf_helpers_doc.py |   9 +-
 8 files changed, 306 insertions(+), 1 deletion(-)
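As a rough model of the checks bpf_sk_lookup_assign() performs in this
patch (userspace sketch with invented struct names; the refcount and
TCP-state checks are omitted for brevity):

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>		/* IPPROTO_TCP, IPPROTO_UDP */
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>		/* AF_INET, AF_INET6 */

#define BPF_SK_LOOKUP_F_REPLACE		(1ULL << 0)
#define BPF_SK_LOOKUP_F_NO_REUSEPORT	(1ULL << 1)

struct model_sock { int family; int protocol; bool v6only; };
struct model_ctx  {
	int family;
	int protocol;
	struct model_sock *selected_sk;
	bool no_reuseport;
};

/* Model of the helper's compatibility and flag checks: exact L4 protocol
 * match; IP family must be compatible (a non-v6only IPv6 socket may take
 * IPv4 packets); a previous selection is only replaced with F_REPLACE.
 */
static int sk_assign(struct model_ctx *ctx, struct model_sock *sk,
		     unsigned long long flags)
{
	if (flags & ~(BPF_SK_LOOKUP_F_REPLACE | BPF_SK_LOOKUP_F_NO_REUSEPORT))
		return -EINVAL;
	if (sk->protocol != ctx->protocol)
		return -EPROTOTYPE;
	if (sk->family != ctx->family &&
	    (sk->family == AF_INET || sk->v6only))
		return -EAFNOSUPPORT;
	if (ctx->selected_sk && !(flags & BPF_SK_LOOKUP_F_REPLACE))
		return -EEXIST;

	ctx->selected_sk = sk;
	ctx->no_reuseport = flags & BPF_SK_LOOKUP_F_NO_REUSEPORT;
	return 0;
}
```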

diff --git a/include/linux/bpf-netns.h b/include/linux/bpf-netns.h
index 4052d649f36d..cb1d849c5d4f 100644
--- a/include/linux/bpf-netns.h
+++ b/include/linux/bpf-netns.h
@@ -8,6 +8,7 @@
 enum netns_bpf_attach_type {
 	NETNS_BPF_INVALID = -1,
 	NETNS_BPF_FLOW_DISSECTOR = 0,
+	NETNS_BPF_SK_LOOKUP,
 	MAX_NETNS_BPF_ATTACH_TYPE
 };
 
@@ -17,6 +18,8 @@ to_netns_bpf_attach_type(enum bpf_attach_type attach_type)
 	switch (attach_type) {
 	case BPF_FLOW_DISSECTOR:
 		return NETNS_BPF_FLOW_DISSECTOR;
+	case BPF_SK_LOOKUP:
+		return NETNS_BPF_SK_LOOKUP;
 	default:
 		return NETNS_BPF_INVALID;
 	}
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index a18ae82a298a..a52a5688418e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
 	      struct sk_reuseport_md, struct sk_reuseport_kern)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
+	      struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
 #endif
 #if defined(CONFIG_BPF_JIT)
 BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 259377723603..ba4f8595fa54 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1278,4 +1278,23 @@ struct bpf_sockopt_kern {
 	s32		retval;
 };
 
+struct bpf_sk_lookup_kern {
+	u16		family;
+	u16		protocol;
+	union {
+		struct {
+			__be32 saddr;
+			__be32 daddr;
+		} v4;
+		struct {
+			const struct in6_addr *saddr;
+			const struct in6_addr *daddr;
+		} v6;
+	};
+	__be16		sport;
+	u16		dport;
+	struct sock	*selected_sk;
+	bool		no_reuseport;
+};
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0cb8ec948816..8dd6e6ce5de9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -189,6 +189,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_STRUCT_OPS,
 	BPF_PROG_TYPE_EXT,
 	BPF_PROG_TYPE_LSM,
+	BPF_PROG_TYPE_SK_LOOKUP,
 };
 
 enum bpf_attach_type {
@@ -226,6 +227,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_INET4_GETSOCKNAME,
 	BPF_CGROUP_INET6_GETSOCKNAME,
 	BPF_XDP_DEVMAP,
+	BPF_SK_LOOKUP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3067,6 +3069,10 @@ union bpf_attr {
  *
  * long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
  *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
+ *		**BPF_PROG_TYPE_SCHED_ACT** programs.
+ *
  *		Assign the *sk* to the *skb*. When combined with appropriate
  *		routing configuration to receive the packet towards the socket,
  *		will cause *skb* to be delivered to the specified socket.
@@ -3092,6 +3098,53 @@ union bpf_attr {
  *		**-ESOCKTNOSUPPORT** if the socket type is not supported
  *		(reuseport).
  *
+ * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
+ *	Description
+ *		Helper is overloaded depending on BPF program type. This
+ *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
+ *
+ *		Select the *sk* as a result of a socket lookup.
+ *
+ *		For the operation to succeed, the passed socket must be
+ *		compatible with the packet description in the *ctx* object.
+ *
+ *		L4 protocol (**IPPROTO_TCP** or **IPPROTO_UDP**) must
+ *		be an exact match. The IP family (**AF_INET** or
+ *		**AF_INET6**) must be compatible; that is, IPv6 sockets
+ *		that are not v6-only can be selected for IPv4 packets.
+ *
+ *		Only TCP listeners and UDP unconnected sockets can be
+ *		selected.
+ *
+ *		The *flags* argument can be a combination of the following values:
+ *
+ *		* **BPF_SK_LOOKUP_F_REPLACE** to override the previous
+ *		  socket selection, potentially done by a BPF program
+ *		  that ran before us.
+ *
+ *		* **BPF_SK_LOOKUP_F_NO_REUSEPORT** to skip
+ *		  load-balancing within reuseport group for the socket
+ *		  being selected.
+ *
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EAFNOSUPPORT** if socket family (*sk->family*) is
+ *		  not compatible with packet family (*ctx->family*).
+ *
+ *		* **-EEXIST** if socket has been already selected,
+ *		  potentially by another program, and
+ *		  **BPF_SK_LOOKUP_F_REPLACE** flag was not specified.
+ *
+ *		* **-EINVAL** if unsupported flags were specified.
+ *
+ *		* **-EPROTOTYPE** if socket L4 protocol
+ *		  (*sk->protocol*) doesn't match packet protocol
+ *		  (*ctx->protocol*).
+ *
+ *		* **-ESOCKTNOSUPPORT** if socket is not in allowed
+ *		  state (TCP listening or UDP unconnected).
+ *
  * u64 bpf_ktime_get_boot_ns(void)
  * 	Description
  * 		Return the time elapsed since system boot, in nanoseconds.
@@ -3569,6 +3622,12 @@ enum {
 	BPF_RINGBUF_HDR_SZ		= 8,
 };
 
+/* BPF_FUNC_sk_assign flags in bpf_sk_lookup context. */
+enum {
+	BPF_SK_LOOKUP_F_REPLACE		= (1ULL << 0),
+	BPF_SK_LOOKUP_F_NO_REUSEPORT	= (1ULL << 1),
+};
+
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
 	BPF_ADJ_ROOM_NET,
@@ -4298,4 +4357,19 @@ struct bpf_pidns_info {
 	__u32 pid;
 	__u32 tgid;
 };
+
+/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
+struct bpf_sk_lookup {
+	__u32 family;		/* Protocol family (AF_INET, AF_INET6) */
+	__u32 protocol;		/* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
+	__u32 remote_ip4;	/* Network byte order */
+	__u32 remote_ip6[4];	/* Network byte order */
+	__u32 remote_port;	/* Network byte order */
+	__u32 local_ip4;	/* Network byte order */
+	__u32 local_ip6[4];	/* Network byte order */
+	__u32 local_port;	/* Host byte order */
+
+	__bpf_md_ptr(struct bpf_sock *, sk); /* Selected socket */
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
index 6011122c35b6..090166824ca4 100644
--- a/kernel/bpf/net_namespace.c
+++ b/kernel/bpf/net_namespace.c
@@ -360,6 +360,8 @@ static int netns_bpf_max_progs(enum netns_bpf_attach_type type)
 	switch (type) {
 	case NETNS_BPF_FLOW_DISSECTOR:
 		return 1;
+	case NETNS_BPF_SK_LOOKUP:
+		return 64;
 	default:
 		return 0;
 	}
@@ -391,6 +393,9 @@ static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
 	case NETNS_BPF_FLOW_DISSECTOR:
 		err = flow_dissector_bpf_prog_attach_check(net, link->prog);
 		break;
+	case NETNS_BPF_SK_LOOKUP:
+		err = 0; /* nothing to check */
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8da159936bab..e7d49959340e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2021,6 +2021,10 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_SK_LOOKUP:
+		if (expected_attach_type == BPF_SK_LOOKUP)
+			return 0;
+		return -EINVAL;
 	case BPF_PROG_TYPE_EXT:
 		if (expected_attach_type)
 			return -EINVAL;
@@ -2755,6 +2759,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+	case BPF_PROG_TYPE_SK_LOOKUP:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
 		if (!capable(CAP_NET_ADMIN))
@@ -2815,6 +2820,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
 		return BPF_PROG_TYPE_CGROUP_SOCKOPT;
 	case BPF_TRACE_ITER:
 		return BPF_PROG_TYPE_TRACING;
+	case BPF_SK_LOOKUP:
+		return BPF_PROG_TYPE_SK_LOOKUP;
 	default:
 		return BPF_PROG_TYPE_UNSPEC;
 	}
@@ -2952,6 +2959,7 @@ static int bpf_prog_query(const union bpf_attr *attr,
 	case BPF_LIRC_MODE2:
 		return lirc_prog_query(attr, uattr);
 	case BPF_FLOW_DISSECTOR:
+	case BPF_SK_LOOKUP:
 		return netns_bpf_prog_query(attr, uattr);
 	default:
 		return -EINVAL;
@@ -3885,6 +3893,7 @@ static int link_create(union bpf_attr *attr)
 		ret = tracing_bpf_link_attach(attr, prog);
 		break;
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+	case BPF_PROG_TYPE_SK_LOOKUP:
 		ret = netns_bpf_link_create(attr, prog);
 		break;
 	default:
diff --git a/net/core/filter.c b/net/core/filter.c
index c796e141ea8e..286f90e0c824 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9219,6 +9219,192 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
 
 const struct bpf_prog_ops sk_reuseport_prog_ops = {
 };
+
+BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
+	   struct sock *, sk, u64, flags)
+{
+	if (unlikely(flags & ~(BPF_SK_LOOKUP_F_REPLACE |
+			       BPF_SK_LOOKUP_F_NO_REUSEPORT)))
+		return -EINVAL;
+	if (unlikely(sk_is_refcounted(sk)))
+		return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
+	if (unlikely(sk->sk_state == TCP_ESTABLISHED))
+		return -ESOCKTNOSUPPORT; /* reject connected sockets */
+
+	/* Check if socket is suitable for packet L3/L4 protocol */
+	if (sk->sk_protocol != ctx->protocol)
+		return -EPROTOTYPE;
+	if (sk->sk_family != ctx->family &&
+	    (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
+		return -EAFNOSUPPORT;
+
+	if (ctx->selected_sk && !(flags & BPF_SK_LOOKUP_F_REPLACE))
+		return -EEXIST;
+
+	/* Select socket as lookup result */
+	ctx->selected_sk = sk;
+	ctx->no_reuseport = flags & BPF_SK_LOOKUP_F_NO_REUSEPORT;
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sk_lookup_assign_proto = {
+	.func		= bpf_sk_lookup_assign,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_SOCKET,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *
+sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_sk_assign:
+		return &bpf_sk_lookup_assign_proto;
+	case BPF_FUNC_sk_release:
+		return &bpf_sk_release_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static bool sk_lookup_is_valid_access(int off, int size,
+				      enum bpf_access_type type,
+				      const struct bpf_prog *prog,
+				      struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (type != BPF_READ)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct bpf_sk_lookup, family):
+	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
+	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
+	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
+	case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
+	case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
+	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
+	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
+		return size == sizeof(__u32);
+
+	case offsetof(struct bpf_sk_lookup, sk):
+		info->reg_type = PTR_TO_SOCKET;
+		return size == sizeof(__u64);
+
+	default:
+		return false;
+	}
+}
+
+static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
+					const struct bpf_insn *si,
+					struct bpf_insn *insn_buf,
+					struct bpf_prog *prog,
+					u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+#if IS_ENABLED(CONFIG_IPV6)
+	int off;
+#endif
+
+	switch (si->off) {
+	case offsetof(struct bpf_sk_lookup, family):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, family) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, family));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, protocol):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, protocol) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, protocol));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, remote_ip4):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.saddr) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, v4.saddr));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, local_ip4):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.daddr) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, v4.daddr));
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				remote_ip6[0], remote_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
+
+		off = si->off;
+		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
+		off += offsetof(struct in6_addr, s6_addr32[0]);
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, v6.saddr));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case bpf_ctx_range_till(struct bpf_sk_lookup,
+				local_ip6[0], local_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
+
+		off = si->off;
+		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
+		off += offsetof(struct in6_addr, s6_addr32[0]);
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, v6.daddr));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case offsetof(struct bpf_sk_lookup, remote_port):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, sport) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, sport));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, local_port):
+		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, dport) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, dport));
+		break;
+
+	case offsetof(struct bpf_sk_lookup, sk):
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sk_lookup_kern, selected_sk));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+const struct bpf_prog_ops sk_lookup_prog_ops = {
+};
+
+const struct bpf_verifier_ops sk_lookup_verifier_ops = {
+	.get_func_proto		= sk_lookup_func_proto,
+	.is_valid_access	= sk_lookup_is_valid_access,
+	.convert_ctx_access	= sk_lookup_convert_ctx_access,
+};
+
 #endif /* CONFIG_INET */
 
 DEFINE_BPF_DISPATCHER(xdp)
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index 6bab40ff442e..ea21e86a807c 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -404,6 +404,7 @@ class PrinterHelpers(Printer):
 
     type_fwds = [
             'struct bpf_fib_lookup',
+            'struct bpf_sk_lookup',
             'struct bpf_perf_event_data',
             'struct bpf_perf_event_value',
             'struct bpf_pidns_info',
@@ -449,6 +450,7 @@ class PrinterHelpers(Printer):
             'struct bpf_perf_event_data',
             'struct bpf_perf_event_value',
             'struct bpf_pidns_info',
+            'struct bpf_sk_lookup',
             'struct bpf_sock',
             'struct bpf_sock_addr',
             'struct bpf_sock_ops',
@@ -485,6 +487,11 @@ class PrinterHelpers(Printer):
             'struct sk_msg_buff': 'struct sk_msg_md',
             'struct xdp_buff': 'struct xdp_md',
     }
+    # Helpers overloaded for different context types.
+    overloaded_helpers = [
+        'bpf_get_socket_cookie',
+        'bpf_sk_assign',
+    ]
 
     def print_header(self):
         header = '''\
@@ -541,7 +548,7 @@ class PrinterHelpers(Printer):
         for i, a in enumerate(proto['args']):
             t = a['type']
             n = a['name']
-            if proto['name'] == 'bpf_get_socket_cookie' and i == 0:
+            if proto['name'] in self.overloaded_helpers and i == 0:
                     t = 'void'
                     n = 'ctx'
             one_arg = '{}{}'.format(comma, self.map_type(t))
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH bpf-next v3 03/16] inet: Extract helper for selecting socket from reuseport group
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Prepare for calling into reuseport from __inet_lookup_listener as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv4/inet_hashtables.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2bbaaf0c7176..ab64834837c8 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -246,6 +246,21 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb, int doff,
+					    __be32 saddr, __be16 sport,
+					    __be32 daddr, unsigned short hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 phash;
+
+	if (sk->sk_reuseport) {
+		phash = inet_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
+	}
+	return reuse_sk;
+}
+
 /*
  * Here are some nice properties to exploit here. The BSD API
  * does not allow a listening sock to specify the remote port nor the
@@ -265,21 +280,17 @@ static struct sock *inet_lhash2_lookup(struct net *net,
 	struct inet_connection_sock *icsk;
 	struct sock *sk, *result = NULL;
 	int score, hiscore = 0;
-	u32 phash = 0;
 
 	inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) {
 		sk = (struct sock *)icsk;
 		score = compute_score(sk, net, hnum, daddr,
 				      dif, sdif, exact_dif);
 		if (score > hiscore) {
-			if (sk->sk_reuseport) {
-				phash = inet_ehashfn(net, daddr, hnum,
-						     saddr, sport);
-				result = reuseport_select_sock(sk, phash,
-							       skb, doff);
-				if (result)
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb, doff,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			result = sk;
 			hiscore = score;
 		}
-- 
2.25.4



* [PATCH bpf-next v3 04/16] inet: Run SK_LOOKUP BPF program on socket lookup
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Marek Majkowski

Run a BPF program before looking up a listening socket on the receive path.
The program selects a listening socket to yield as the result of socket
lookup by calling the bpf_sk_assign() helper and returning the
BPF_REDIRECT (7) code.

Alternatively, the program can fail the lookup by returning BPF_DROP (1),
or let the lookup continue as usual by returning BPF_OK (0). Other return
values are treated the same as BPF_OK.

This lets the user match packets with listening sockets freely at the last
possible point on the receive path, where we know that packets are destined
for local delivery after undergoing policing, filtering, and routing.

With BPF code selecting the socket, directing packets destined to an IP
range or to a port range to a single socket becomes possible.

When multiple programs are attached, they are run in series in the order
in which they were attached. The end result is determined from the return
code of each program according to the following rules.

 1. If any program returned BPF_REDIRECT and selected a valid socket, this
    socket will be used as result of the lookup.
 2. If more than one program returned BPF_REDIRECT and selected a socket,
    last selection takes effect.
 3. If any program returned BPF_DROP and none returned BPF_REDIRECT, the
    socket lookup will fail with -ECONNREFUSED.
 4. If no program returned either BPF_DROP or BPF_REDIRECT, socket lookup
    continues on to htable-based lookup.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Use a static_key to minimize the hook overhead when not used. (Alexei)
    - Adapt for running an array of attached programs. (Alexei)
    - Adapt for optionally skipping reuseport selection. (Martin)

 include/linux/bpf.h        | 29 ++++++++++++++++++++++++++++
 include/linux/filter.h     | 39 ++++++++++++++++++++++++++++++++++++++
 kernel/bpf/net_namespace.c | 32 ++++++++++++++++++++++++++++++-
 net/core/filter.c          |  2 ++
 net/ipv4/inet_hashtables.c | 31 ++++++++++++++++++++++++++++++
 5 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 26bc70533db0..98f79d39eaa1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1013,6 +1013,35 @@ _out:							\
 		_ret;					\
 	})
 
+/* Runner for BPF_SK_LOOKUP programs to invoke on socket lookup.
+ *
+ * Valid return codes for SK_LOOKUP programs are:
+ * - BPF_REDIRECT (7) to use selected socket as result of the lookup,
+ * - BPF_DROP (1) to fail the socket lookup with no result,
+ * - BPF_OK (0) to continue on to regular htable-based socket lookup.
+ *
+ * The runner returns a u32 value that has a bit set for each code
+ * returned by any of the programs. Bit position corresponds to the
+ * return code.
+ *
+ * Caller must ensure that array is non-NULL.
+ */
+#define BPF_PROG_SK_LOOKUP_RUN_ARRAY(array, ctx, func)		\
+	({							\
+		struct bpf_prog_array_item *_item;		\
+		struct bpf_prog *_prog;				\
+		u32 _bit, _ret = 0;				\
+		migrate_disable();				\
+		_item = &(array)->items[0];			\
+		while ((_prog = READ_ONCE(_item->prog))) {	\
+			_bit = func(_prog, ctx);		\
+			_ret |= 1U << (_bit & 31);		\
+			_item++;				\
+		}						\
+		migrate_enable();				\
+		_ret;						\
+	 })
+
 #define BPF_PROG_RUN_ARRAY(array, ctx, func)		\
 	__BPF_PROG_RUN_ARRAY(array, ctx, func, false)
 
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ba4f8595fa54..ff7721d862c2 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1297,4 +1297,43 @@ struct bpf_sk_lookup_kern {
 	bool		no_reuseport;
 };
 
+extern struct static_key_false bpf_sk_lookup_enabled;
+
+static inline bool bpf_sk_lookup_run_v4(struct net *net, int protocol,
+					const __be32 saddr, const __be16 sport,
+					const __be32 daddr, const u16 dport,
+					struct sock **psk)
+{
+	struct bpf_prog_array *run_array;
+	bool do_reuseport = false;
+	struct sock *sk = NULL;
+
+	rcu_read_lock();
+	run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]);
+	if (run_array) {
+		const struct bpf_sk_lookup_kern ctx = {
+			.family		= AF_INET,
+			.protocol	= protocol,
+			.v4.saddr	= saddr,
+			.v4.daddr	= daddr,
+			.sport		= sport,
+			.dport		= dport,
+		};
+		u32 ret;
+
+		ret = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, &ctx,
+						   BPF_PROG_RUN);
+		if (ret & (1U << BPF_REDIRECT)) {
+			sk = ctx.selected_sk;
+			do_reuseport = sk && !ctx.no_reuseport;
+		} else if (ret & (1U << BPF_DROP)) {
+			sk = ERR_PTR(-ECONNREFUSED);
+		}
+	}
+	rcu_read_unlock();
+
+	*psk = sk;
+	return do_reuseport;
+}
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
index 090166824ca4..a7768feb3ade 100644
--- a/kernel/bpf/net_namespace.c
+++ b/kernel/bpf/net_namespace.c
@@ -25,6 +25,28 @@ struct bpf_netns_link {
 /* Protects updates to netns_bpf */
 DEFINE_MUTEX(netns_bpf_mutex);
 
+static void netns_bpf_attach_type_disable(enum netns_bpf_attach_type type)
+{
+	switch (type) {
+	case NETNS_BPF_SK_LOOKUP:
+		static_branch_dec(&bpf_sk_lookup_enabled);
+		break;
+	default:
+		break;
+	}
+}
+
+static void netns_bpf_attach_type_enable(enum netns_bpf_attach_type type)
+{
+	switch (type) {
+	case NETNS_BPF_SK_LOOKUP:
+		static_branch_inc(&bpf_sk_lookup_enabled);
+		break;
+	default:
+		break;
+	}
+}
+
 /* Must be called with netns_bpf_mutex held. */
 static void netns_bpf_run_array_detach(struct net *net,
 				       enum netns_bpf_attach_type type)
@@ -93,6 +115,9 @@ static void bpf_netns_link_release(struct bpf_link *link)
 	if (!net)
 		goto out_unlock;
 
+	/* Mark attach point as unused */
+	netns_bpf_attach_type_disable(type);
+
 	/* Remember link position in case of safe delete */
 	idx = link_index(net, type, net_link);
 	list_del(&net_link->node);
@@ -416,6 +441,9 @@ static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
 					lockdep_is_held(&netns_bpf_mutex));
 	bpf_prog_array_free(run_array);
 
+	/* Mark attach point as used */
+	netns_bpf_attach_type_enable(type);
+
 out_unlock:
 	mutex_unlock(&netns_bpf_mutex);
 	return err;
@@ -491,8 +519,10 @@ static void __net_exit netns_bpf_pernet_pre_exit(struct net *net)
 	mutex_lock(&netns_bpf_mutex);
 	for (type = 0; type < MAX_NETNS_BPF_ATTACH_TYPE; type++) {
 		netns_bpf_run_array_detach(net, type);
-		list_for_each_entry(net_link, &net->bpf.links[type], node)
+		list_for_each_entry(net_link, &net->bpf.links[type], node) {
 			net_link->net = NULL; /* auto-detach link */
+			netns_bpf_attach_type_disable(type);
+		}
 		if (net->bpf.progs[type])
 			bpf_prog_put(net->bpf.progs[type]);
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index 286f90e0c824..c0146977a6d1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9220,6 +9220,8 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
 const struct bpf_prog_ops sk_reuseport_prog_ops = {
 };
 
+DEFINE_STATIC_KEY_FALSE(bpf_sk_lookup_enabled);
+
 BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
 	   struct sock *, sk, u64, flags)
 {
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ab64834837c8..2b1fc194efaf 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -299,6 +299,29 @@ static struct sock *inet_lhash2_lookup(struct net *net,
 	return result;
 }
 
+static inline struct sock *inet_lookup_run_bpf(struct net *net,
+					       struct inet_hashinfo *hashinfo,
+					       struct sk_buff *skb, int doff,
+					       __be32 saddr, __be16 sport,
+					       __be32 daddr, u16 hnum)
+{
+	struct sock *sk, *reuse_sk;
+	bool do_reuseport;
+
+	if (hashinfo != &tcp_hashinfo)
+		return NULL; /* only TCP is supported */
+
+	do_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_TCP,
+					    saddr, sport, daddr, hnum, &sk);
+	if (do_reuseport) {
+		reuse_sk = lookup_reuseport(net, sk, skb, doff,
+					    saddr, sport, daddr, hnum);
+		if (reuse_sk)
+			sk = reuse_sk;
+	}
+	return sk;
+}
+
 struct sock *__inet_lookup_listener(struct net *net,
 				    struct inet_hashinfo *hashinfo,
 				    struct sk_buff *skb, int doff,
@@ -310,6 +333,14 @@ struct sock *__inet_lookup_listener(struct net *net,
 	struct sock *result = NULL;
 	unsigned int hash2;
 
+	/* Lookup redirect from BPF */
+	if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
+		result = inet_lookup_run_bpf(net, hashinfo, skb, doff,
+					     saddr, sport, daddr, hnum);
+		if (result)
+			goto done;
+	}
+
 	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
-- 
2.25.4



* [PATCH bpf-next v3 05/16] inet6: Extract helper for selecting socket from reuseport group
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Prepare for calling into reuseport from inet6_lookup_listener as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv6/inet6_hashtables.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index fbe9d4295eac..03942eef8ab6 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -111,6 +111,23 @@ static inline int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb, int doff,
+					    const struct in6_addr *saddr,
+					    __be16 sport,
+					    const struct in6_addr *daddr,
+					    unsigned short hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 phash;
+
+	if (sk->sk_reuseport) {
+		phash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, phash, skb, doff);
+	}
+	return reuse_sk;
+}
+
 /* called with rcu_read_lock() */
 static struct sock *inet6_lhash2_lookup(struct net *net,
 		struct inet_listen_hashbucket *ilb2,
@@ -123,21 +140,17 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
 	struct inet_connection_sock *icsk;
 	struct sock *sk, *result = NULL;
 	int score, hiscore = 0;
-	u32 phash = 0;
 
 	inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) {
 		sk = (struct sock *)icsk;
 		score = compute_score(sk, net, hnum, daddr, dif, sdif,
 				      exact_dif);
 		if (score > hiscore) {
-			if (sk->sk_reuseport) {
-				phash = inet6_ehashfn(net, daddr, hnum,
-						      saddr, sport);
-				result = reuseport_select_sock(sk, phash,
-							       skb, doff);
-				if (result)
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb, doff,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			result = sk;
 			hiscore = score;
 		}
-- 
2.25.4



* [PATCH bpf-next v3 06/16] inet6: Run SK_LOOKUP BPF program on socket lookup
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Marek Majkowski

Following the ipv4 stack changes, run a BPF program attached to the netns
before looking up a listening socket. The program can return a listening
socket to use as the result of socket lookup, fail the lookup, or take no
action.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Use a static_key to minimize the hook overhead when not used. (Alexei)
    - Don't copy struct in6_addr when populating BPF prog context. (Martin)
    - Adapt for running an array of attached programs. (Alexei)
    - Adapt for optionally skipping reuseport selection. (Martin)

 include/linux/filter.h      | 41 +++++++++++++++++++++++++++++++++++++
 net/ipv6/inet6_hashtables.c | 35 +++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index ff7721d862c2..e7462f178213 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1336,4 +1336,45 @@ static inline bool bpf_sk_lookup_run_v4(struct net *net, int protocol,
 	return do_reuseport;
 }
 
+#if IS_ENABLED(CONFIG_IPV6)
+static inline bool bpf_sk_lookup_run_v6(struct net *net, int protocol,
+					const struct in6_addr *saddr,
+					const __be16 sport,
+					const struct in6_addr *daddr,
+					const u16 dport,
+					struct sock **psk)
+{
+	struct bpf_prog_array *run_array;
+	bool do_reuseport = false;
+	struct sock *sk = NULL;
+
+	rcu_read_lock();
+	run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]);
+	if (run_array) {
+		const struct bpf_sk_lookup_kern ctx = {
+			.family		= AF_INET6,
+			.protocol	= protocol,
+			.v6.saddr	= saddr,
+			.v6.daddr	= daddr,
+			.sport		= sport,
+			.dport		= dport,
+		};
+		u32 ret;
+
+		ret = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, &ctx,
+						   BPF_PROG_RUN);
+		if (ret & (1U << BPF_REDIRECT)) {
+			sk = ctx.selected_sk;
+			do_reuseport = sk && !ctx.no_reuseport;
+		} else if (ret & (1U << BPF_DROP)) {
+			sk = ERR_PTR(-ECONNREFUSED);
+		}
+	}
+	rcu_read_unlock();
+
+	*psk = sk;
+	return do_reuseport;
+}
+#endif /* IS_ENABLED(CONFIG_IPV6) */
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 03942eef8ab6..b63583d2aa76 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -21,6 +21,8 @@
 #include <net/ip.h>
 #include <net/sock_reuseport.h>
 
+extern struct inet_hashinfo tcp_hashinfo;
+
 u32 inet6_ehashfn(const struct net *net,
 		  const struct in6_addr *laddr, const u16 lport,
 		  const struct in6_addr *faddr, const __be16 fport)
@@ -159,6 +161,31 @@ static struct sock *inet6_lhash2_lookup(struct net *net,
 	return result;
 }
 
+static inline struct sock *inet6_lookup_run_bpf(struct net *net,
+						struct inet_hashinfo *hashinfo,
+						struct sk_buff *skb, int doff,
+						const struct in6_addr *saddr,
+						const __be16 sport,
+						const struct in6_addr *daddr,
+						const u16 hnum)
+{
+	struct sock *sk, *reuse_sk;
+	bool do_reuseport;
+
+	if (hashinfo != &tcp_hashinfo)
+		return NULL; /* only TCP is supported */
+
+	do_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_TCP,
+					    saddr, sport, daddr, hnum, &sk);
+	if (do_reuseport) {
+		reuse_sk = lookup_reuseport(net, sk, skb, doff,
+					    saddr, sport, daddr, hnum);
+		if (reuse_sk)
+			sk = reuse_sk;
+	}
+	return sk;
+}
+
 struct sock *inet6_lookup_listener(struct net *net,
 		struct inet_hashinfo *hashinfo,
 		struct sk_buff *skb, int doff,
@@ -170,6 +197,14 @@ struct sock *inet6_lookup_listener(struct net *net,
 	struct sock *result = NULL;
 	unsigned int hash2;
 
+	/* Lookup redirect from BPF */
+	if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
+		result = inet6_lookup_run_bpf(net, hashinfo, skb, doff,
+					      saddr, sport, daddr, hnum);
+		if (result)
+			goto done;
+	}
+
 	hash2 = ipv6_portaddr_hash(net, daddr, hnum);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
-- 
2.25.4



* [PATCH bpf-next v3 07/16] udp: Extract helper for selecting socket from reuseport group
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Prepare for calling into reuseport from __udp4_lib_lookup as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv4/udp.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 31530129f137..0d03e0277263 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -408,6 +408,25 @@ static u32 udp_ehashfn(const struct net *net, const __be32 laddr,
 			      udp_ehash_secret + net_hash_mix(net));
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb,
+					    __be32 saddr, __be16 sport,
+					    __be32 daddr, unsigned short hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 hash;
+
+	if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
+		hash = udp_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, hash, skb,
+						 sizeof(struct udphdr));
+		/* Fall back to scoring if group has connections */
+		if (reuseport_has_conns(sk, false))
+			return NULL;
+	}
+	return reuse_sk;
+}
+
 /* called with rcu_read_lock() */
 static struct sock *udp4_lib_lookup2(struct net *net,
 				     __be32 saddr, __be16 sport,
@@ -418,7 +437,6 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
-	u32 hash = 0;
 
 	result = NULL;
 	badness = 0;
@@ -426,15 +444,11 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 		score = compute_score(sk, net, saddr, sport,
 				      daddr, hnum, dif, sdif);
 		if (score > badness) {
-			if (sk->sk_reuseport &&
-			    sk->sk_state != TCP_ESTABLISHED) {
-				hash = udp_ehashfn(net, daddr, hnum,
-						   saddr, sport);
-				result = reuseport_select_sock(sk, hash, skb,
-							sizeof(struct udphdr));
-				if (result && !reuseport_has_conns(sk, false))
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			badness = score;
 			result = sk;
 		}
-- 
2.25.4



* [PATCH bpf-next v3 08/16] udp: Run SK_LOOKUP BPF program on socket lookup
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Marek Majkowski

Following the INET/TCP socket lookup changes, modify UDP socket lookup to
let a BPF program select a receiving socket before searching for a socket
by destination address and port as usual.

Lookup of connected sockets that match the packet 4-tuple is unaffected by
this change. The BPF program runs, and potentially overrides the lookup
result, only if a 4-tuple match was not found.
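The resulting precedence between the first-pass lookup, the BPF hook, and
the wildcard fallback can be sketched as a small decision function (a
userspace sketch with hypothetical stub types, not the kernel
implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's socket and TCP states. */
struct sock { int state; };

enum { TCP_ESTABLISHED = 1, TCP_CLOSE = 7 };

/* Lookup order after this change: a connected (established) 4-tuple
 * match always wins; otherwise a BPF-selected socket overrides both
 * the non-wildcard first-pass result and the wildcard fallback.
 * Any argument may be NULL when the corresponding step found nothing. */
static struct sock *udp_lookup_order(struct sock *exact,
				     struct sock *bpf_sk,
				     struct sock *wildcard)
{
	if (exact && exact->state == TCP_ESTABLISHED)
		return exact;		/* connected sockets unaffected */
	if (bpf_sk)
		return bpf_sk;		/* BPF redirect */
	if (exact)
		return exact;		/* non-wildcard first-pass match */
	return wildcard;		/* htable wildcard lookup */
}
```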

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Use a static_key to minimize the hook overhead when not used. (Alexei)
    - Adapt for running an array of attached programs. (Alexei)
    - Adapt for optionally skipping reuseport selection. (Martin)

 net/ipv4/udp.c | 59 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0d03e0277263..c8f88b113f82 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -456,6 +456,29 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 	return result;
 }
 
+static inline struct sock *udp4_lookup_run_bpf(struct net *net,
+					       struct udp_table *udptable,
+					       struct sk_buff *skb,
+					       __be32 saddr, __be16 sport,
+					       __be32 daddr, u16 hnum)
+{
+	struct sock *sk, *reuse_sk;
+	bool do_reuseport;
+
+	if (udptable != &udp_table)
+		return NULL; /* only UDP is supported */
+
+	do_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_UDP,
+					    saddr, sport, daddr, hnum, &sk);
+	if (do_reuseport) {
+		reuse_sk = lookup_reuseport(net, sk, skb,
+					    saddr, sport, daddr, hnum);
+		if (reuse_sk)
+			sk = reuse_sk;
+	}
+	return sk;
+}
+
 /* UDP is nearly always wildcards out the wazoo, it makes no sense to try
  * harder than this. -DaveM
  */
@@ -463,27 +486,45 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 		__be16 sport, __be32 daddr, __be16 dport, int dif,
 		int sdif, struct udp_table *udptable, struct sk_buff *skb)
 {
-	struct sock *result;
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash2, slot2;
 	struct udp_hslot *hslot2;
+	struct sock *result, *sk;
 
 	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
 
+	/* Lookup connected or non-wildcard socket */
 	result = udp4_lib_lookup2(net, saddr, sport,
 				  daddr, hnum, dif, sdif,
 				  hslot2, skb);
-	if (!result) {
-		hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
-		slot2 = hash2 & udptable->mask;
-		hslot2 = &udptable->hash2[slot2];
-
-		result = udp4_lib_lookup2(net, saddr, sport,
-					  htonl(INADDR_ANY), hnum, dif, sdif,
-					  hslot2, skb);
+	if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
+		goto done;
+
+	/* Lookup redirect from BPF */
+	if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
+		sk = udp4_lookup_run_bpf(net, udptable, skb,
+					 saddr, sport, daddr, hnum);
+		if (sk) {
+			result = sk;
+			goto done;
+		}
 	}
+
+	/* Got non-wildcard socket or error on first lookup */
+	if (result)
+		goto done;
+
+	/* Lookup wildcard sockets */
+	hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
+	slot2 = hash2 & udptable->mask;
+	hslot2 = &udptable->hash2[slot2];
+
+	result = udp4_lib_lookup2(net, saddr, sport,
+				  htonl(INADDR_ANY), hnum, dif, sdif,
+				  hslot2, skb);
+done:
 	if (IS_ERR(result))
 		return NULL;
 	return result;
-- 
2.25.4



* [PATCH bpf-next v3 09/16] udp6: Extract helper for selecting socket from reuseport group
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Prepare for calling into reuseport from __udp6_lib_lookup as well.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/ipv6/udp.c | 37 ++++++++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7d4151747340..65b843e7acde 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -141,6 +141,27 @@ static int compute_score(struct sock *sk, struct net *net,
 	return score;
 }
 
+static inline struct sock *lookup_reuseport(struct net *net, struct sock *sk,
+					    struct sk_buff *skb,
+					    const struct in6_addr *saddr,
+					    __be16 sport,
+					    const struct in6_addr *daddr,
+					    unsigned int hnum)
+{
+	struct sock *reuse_sk = NULL;
+	u32 hash;
+
+	if (sk->sk_reuseport && sk->sk_state != TCP_ESTABLISHED) {
+		hash = udp6_ehashfn(net, daddr, hnum, saddr, sport);
+		reuse_sk = reuseport_select_sock(sk, hash, skb,
+						 sizeof(struct udphdr));
+		/* Fall back to scoring if group has connections */
+		if (reuseport_has_conns(sk, false))
+			return NULL;
+	}
+	return reuse_sk;
+}
+
 /* called with rcu_read_lock() */
 static struct sock *udp6_lib_lookup2(struct net *net,
 		const struct in6_addr *saddr, __be16 sport,
@@ -150,7 +171,6 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
-	u32 hash = 0;
 
 	result = NULL;
 	badness = -1;
@@ -158,16 +178,11 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 		score = compute_score(sk, net, saddr, sport,
 				      daddr, hnum, dif, sdif);
 		if (score > badness) {
-			if (sk->sk_reuseport &&
-			    sk->sk_state != TCP_ESTABLISHED) {
-				hash = udp6_ehashfn(net, daddr, hnum,
-						    saddr, sport);
-
-				result = reuseport_select_sock(sk, hash, skb,
-							sizeof(struct udphdr));
-				if (result && !reuseport_has_conns(sk, false))
-					return result;
-			}
+			result = lookup_reuseport(net, sk, skb,
+						  saddr, sport, daddr, hnum);
+			if (result)
+				return result;
+
 			result = sk;
 			badness = score;
 		}
-- 
2.25.4



* [PATCH bpf-next v3 10/16] udp6: Run SK_LOOKUP BPF program on socket lookup
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
                   ` (8 preceding siblings ...)
  2020-07-02  9:24 ` [PATCH bpf-next v3 09/16] udp6: Extract helper for selecting socket from reuseport group Jakub Sitnicki
@ 2020-07-02  9:24 ` Jakub Sitnicki
  2020-07-02 14:51     ` kernel test robot
  2020-07-02  9:24 ` [PATCH bpf-next v3 11/16] bpf: Sync linux/bpf.h to tools/ Jakub Sitnicki
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Marek Majkowski

Same as for udp4, let the BPF program override the socket lookup result by
selecting a receiving socket of its choice or failing the lookup, if no
connected UDP socket matched the packet 4-tuple.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Use a static_key to minimize the hook overhead when not used. (Alexei)
    - Adapt for running an array of attached programs. (Alexei)
    - Adapt for optionally skipping reuseport selection. (Martin)

 net/ipv6/udp.c | 60 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 51 insertions(+), 9 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 65b843e7acde..c4338cfe7a8c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -190,6 +190,31 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 	return result;
 }
 
+static inline struct sock *udp6_lookup_run_bpf(struct net *net,
+					       struct udp_table *udptable,
+					       struct sk_buff *skb,
+					       const struct in6_addr *saddr,
+					       __be16 sport,
+					       const struct in6_addr *daddr,
+					       u16 hnum)
+{
+	struct sock *sk, *reuse_sk;
+	bool do_reuseport;
+
+	if (udptable != &udp_table)
+		return NULL; /* only UDP is supported */
+
+	do_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_UDP,
+					    saddr, sport, daddr, hnum, &sk);
+	if (do_reuseport) {
+		reuse_sk = lookup_reuseport(net, sk, skb,
+					    saddr, sport, daddr, hnum);
+		if (reuse_sk)
+			sk = reuse_sk;
+	}
+	return sk;
+}
+
 /* rcu_read_lock() must be held */
 struct sock *__udp6_lib_lookup(struct net *net,
 			       const struct in6_addr *saddr, __be16 sport,
@@ -200,25 +225,42 @@ struct sock *__udp6_lib_lookup(struct net *net,
 	unsigned short hnum = ntohs(dport);
 	unsigned int hash2, slot2;
 	struct udp_hslot *hslot2;
-	struct sock *result;
+	struct sock *result, *sk;
 
 	hash2 = ipv6_portaddr_hash(net, daddr, hnum);
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
 
+	/* Lookup connected or non-wildcard sockets */
 	result = udp6_lib_lookup2(net, saddr, sport,
 				  daddr, hnum, dif, sdif,
 				  hslot2, skb);
-	if (!result) {
-		hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
-		slot2 = hash2 & udptable->mask;
+	if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
+		goto done;
+
+	/* Lookup redirect from BPF */
+	if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
+		sk = udp6_lookup_run_bpf(net, udptable, skb,
+					 saddr, sport, daddr, hnum);
+		if (sk) {
+			result = sk;
+			goto done;
+		}
+	}
 
-		hslot2 = &udptable->hash2[slot2];
+	/* Got non-wildcard socket or error on first lookup */
+	if (result)
+		goto done;
 
-		result = udp6_lib_lookup2(net, saddr, sport,
-					  &in6addr_any, hnum, dif, sdif,
-					  hslot2, skb);
-	}
+	/* Lookup wildcard sockets */
+	hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
+	slot2 = hash2 & udptable->mask;
+	hslot2 = &udptable->hash2[slot2];
+
+	result = udp6_lib_lookup2(net, saddr, sport,
+				  &in6addr_any, hnum, dif, sdif,
+				  hslot2, skb);
+done:
 	if (IS_ERR(result))
 		return NULL;
 	return result;
-- 
2.25.4



* [PATCH bpf-next v3 11/16] bpf: Sync linux/bpf.h to tools/
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
                   ` (9 preceding siblings ...)
  2020-07-02  9:24 ` [PATCH bpf-next v3 10/16] udp6: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
@ 2020-07-02  9:24 ` Jakub Sitnicki
  2020-07-02  9:24 ` [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type Jakub Sitnicki
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

The newly added program type, context type, and helper are used by tests in
a subsequent patch. Synchronize the header file.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Update after changes to bpf.h in earlier patch.
    
    v2:
    - Update after changes to bpf.h in earlier patch.

 tools/include/uapi/linux/bpf.h | 74 ++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0cb8ec948816..8dd6e6ce5de9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -189,6 +189,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_STRUCT_OPS,
 	BPF_PROG_TYPE_EXT,
 	BPF_PROG_TYPE_LSM,
+	BPF_PROG_TYPE_SK_LOOKUP,
 };
 
 enum bpf_attach_type {
@@ -226,6 +227,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_INET4_GETSOCKNAME,
 	BPF_CGROUP_INET6_GETSOCKNAME,
 	BPF_XDP_DEVMAP,
+	BPF_SK_LOOKUP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3067,6 +3069,10 @@ union bpf_attr {
  *
  * long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
  *	Description
 + *		The helper is overloaded depending on the BPF program type.
 + *		This description applies to **BPF_PROG_TYPE_SCHED_CLS** and
 + *		**BPF_PROG_TYPE_SCHED_ACT** programs.
+ *
  *		Assign the *sk* to the *skb*. When combined with appropriate
  *		routing configuration to receive the packet towards the socket,
  *		will cause *skb* to be delivered to the specified socket.
@@ -3092,6 +3098,53 @@ union bpf_attr {
  *		**-ESOCKTNOSUPPORT** if the socket type is not supported
  *		(reuseport).
  *
+ * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
+ *	Description
 + *		The helper is overloaded depending on the BPF program type.
 + *		This description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
+ *
+ *		Select the *sk* as a result of a socket lookup.
+ *
 + *		For the operation to succeed, the passed socket must be
 + *		compatible with the packet description provided by the *ctx*
 + *		object.
+ *
 + *		The L4 protocol (**IPPROTO_TCP** or **IPPROTO_UDP**) must
 + *		be an exact match, while the IP family (**AF_INET** or
 + *		**AF_INET6**) must be compatible; that is, IPv6 sockets
 + *		that are not v6-only can be selected for IPv4 packets.
+ *
+ *		Only TCP listeners and UDP unconnected sockets can be
+ *		selected.
+ *
 + *		The *flags* argument can be a combination of the following
 + *		values:
+ *
+ *		* **BPF_SK_LOOKUP_F_REPLACE** to override the previous
+ *		  socket selection, potentially done by a BPF program
+ *		  that ran before us.
+ *
+ *		* **BPF_SK_LOOKUP_F_NO_REUSEPORT** to skip
+ *		  load-balancing within reuseport group for the socket
+ *		  being selected.
+ *
+ *	Return
+ *		0 on success, or a negative errno in case of failure.
+ *
+ *		* **-EAFNOSUPPORT** if socket family (*sk->family*) is
+ *		  not compatible with packet family (*ctx->family*).
+ *
+ *		* **-EEXIST** if socket has been already selected,
+ *		  potentially by another program, and
+ *		  **BPF_SK_LOOKUP_F_REPLACE** flag was not specified.
+ *
+ *		* **-EINVAL** if unsupported flags were specified.
+ *
+ *		* **-EPROTOTYPE** if socket L4 protocol
+ *		  (*sk->protocol*) doesn't match packet protocol
+ *		  (*ctx->protocol*).
+ *
 + *		* **-ESOCKTNOSUPPORT** if the socket is not in an allowed
 + *		  state (TCP listening or UDP unconnected).
+ *
  * u64 bpf_ktime_get_boot_ns(void)
  * 	Description
  * 		Return the time elapsed since system boot, in nanoseconds.
@@ -3569,6 +3622,12 @@ enum {
 	BPF_RINGBUF_HDR_SZ		= 8,
 };
 
+/* BPF_FUNC_sk_assign flags in bpf_sk_lookup context. */
+enum {
+	BPF_SK_LOOKUP_F_REPLACE		= (1ULL << 0),
+	BPF_SK_LOOKUP_F_NO_REUSEPORT	= (1ULL << 1),
+};
+
 /* Mode for BPF_FUNC_skb_adjust_room helper. */
 enum bpf_adj_room_mode {
 	BPF_ADJ_ROOM_NET,
@@ -4298,4 +4357,19 @@ struct bpf_pidns_info {
 	__u32 pid;
 	__u32 tgid;
 };
+
+/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
+struct bpf_sk_lookup {
+	__u32 family;		/* Protocol family (AF_INET, AF_INET6) */
+	__u32 protocol;		/* IP protocol (IPPROTO_TCP, IPPROTO_UDP) */
+	__u32 remote_ip4;	/* Network byte order */
+	__u32 remote_ip6[4];	/* Network byte order */
+	__u32 remote_port;	/* Network byte order */
+	__u32 local_ip4;	/* Network byte order */
+	__u32 local_ip6[4];	/* Network byte order */
+	__u32 local_port;	/* Host byte order */
+
+	__bpf_md_ptr(struct bpf_sock *, sk); /* Selected socket */
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.25.4



* [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
                   ` (10 preceding siblings ...)
  2020-07-02  9:24 ` [PATCH bpf-next v3 11/16] bpf: Sync linux/bpf.h to tools/ Jakub Sitnicki
@ 2020-07-02  9:24 ` Jakub Sitnicki
  2020-07-09  4:23   ` Andrii Nakryiko
  2020-07-02  9:24 ` [PATCH bpf-next v3 13/16] tools/bpftool: Add name mappings for SK_LOOKUP prog and attach type Jakub Sitnicki
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Make libbpf aware of the newly added program type, and assign it a
section name.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Move new libbpf symbols to version 0.1.0.
    - Set expected_attach_type in probe_load for new prog type.
    
    v2:
    - Add new libbpf symbols to version 0.0.9. (Andrii)

 tools/lib/bpf/libbpf.c        | 3 +++
 tools/lib/bpf/libbpf.h        | 2 ++
 tools/lib/bpf/libbpf.map      | 2 ++
 tools/lib/bpf/libbpf_probes.c | 3 +++
 4 files changed, 10 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4ea7f4f1a691..ddcbb5dd78df 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -6793,6 +6793,7 @@ BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
 BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
 BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
 BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
+BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
 
 enum bpf_attach_type
 bpf_program__get_expected_attach_type(struct bpf_program *prog)
@@ -6969,6 +6970,8 @@ static const struct bpf_sec_def section_defs[] = {
 	BPF_EAPROG_SEC("cgroup/setsockopt",	BPF_PROG_TYPE_CGROUP_SOCKOPT,
 						BPF_CGROUP_SETSOCKOPT),
 	BPF_PROG_SEC("struct_ops",		BPF_PROG_TYPE_STRUCT_OPS),
+	BPF_EAPROG_SEC("sk_lookup",		BPF_PROG_TYPE_SK_LOOKUP,
+						BPF_SK_LOOKUP),
 };
 
 #undef BPF_PROG_SEC_IMPL
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 2335971ed0bd..c2272132e929 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -350,6 +350,7 @@ LIBBPF_API int bpf_program__set_perf_event(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_tracing(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_struct_ops(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_extension(struct bpf_program *prog);
+LIBBPF_API int bpf_program__set_sk_lookup(struct bpf_program *prog);
 
 LIBBPF_API enum bpf_prog_type bpf_program__get_type(struct bpf_program *prog);
 LIBBPF_API void bpf_program__set_type(struct bpf_program *prog,
@@ -377,6 +378,7 @@ LIBBPF_API bool bpf_program__is_perf_event(const struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_tracing(const struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_struct_ops(const struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_extension(const struct bpf_program *prog);
+LIBBPF_API bool bpf_program__is_sk_lookup(const struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 6544d2cd1ed6..04b99f63a45c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -287,5 +287,7 @@ LIBBPF_0.1.0 {
 		bpf_map__type;
 		bpf_map__value_size;
 		bpf_program__autoload;
+		bpf_program__is_sk_lookup;
 		bpf_program__set_autoload;
+		bpf_program__set_sk_lookup;
 } LIBBPF_0.0.9;
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 10cd8d1891f5..5a3d3f078408 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -78,6 +78,9 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 		xattr.expected_attach_type = BPF_CGROUP_INET4_CONNECT;
 		break;
+	case BPF_PROG_TYPE_SK_LOOKUP:
+		xattr.expected_attach_type = BPF_SK_LOOKUP;
+		break;
 	case BPF_PROG_TYPE_KPROBE:
 		xattr.kern_version = get_kernel_version();
 		break;
-- 
2.25.4



* [PATCH bpf-next v3 13/16] tools/bpftool: Add name mappings for SK_LOOKUP prog and attach type
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
                   ` (11 preceding siblings ...)
  2020-07-02  9:24 ` [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type Jakub Sitnicki
@ 2020-07-02  9:24 ` Jakub Sitnicki
  2020-07-02  9:24 ` [PATCH bpf-next v3 14/16] selftests/bpf: Add verifier tests for bpf_sk_lookup context access Jakub Sitnicki
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Make bpftool show human-friendly identifiers for newly introduced program
and attach type, BPF_PROG_TYPE_SK_LOOKUP and BPF_SK_LOOKUP, respectively.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - New patch in v3.

 tools/bpf/bpftool/common.c | 1 +
 tools/bpf/bpftool/prog.c   | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index 18e5604fe260..c254f6f5a3d6 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -63,6 +63,7 @@ const char * const attach_type_name[__MAX_BPF_ATTACH_TYPE] = {
 	[BPF_TRACE_FEXIT]		= "fexit",
 	[BPF_MODIFY_RETURN]		= "mod_ret",
 	[BPF_LSM_MAC]			= "lsm_mac",
+	[BPF_SK_LOOKUP]			= "sk_lookup",
 };
 
 void p_err(const char *fmt, ...)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 6863c57effd0..3e6ecc6332e2 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -59,6 +59,7 @@ const char * const prog_type_name[] = {
 	[BPF_PROG_TYPE_TRACING]			= "tracing",
 	[BPF_PROG_TYPE_STRUCT_OPS]		= "struct_ops",
 	[BPF_PROG_TYPE_EXT]			= "ext",
+	[BPF_PROG_TYPE_SK_LOOKUP]		= "sk_lookup",
 };
 
 const size_t prog_type_name_size = ARRAY_SIZE(prog_type_name);
@@ -1905,7 +1906,7 @@ static int do_help(int argc, char **argv)
 		"                 cgroup/getsockname4 | cgroup/getsockname6 | cgroup/sendmsg4 |\n"
 		"                 cgroup/sendmsg6 | cgroup/recvmsg4 | cgroup/recvmsg6 |\n"
 		"                 cgroup/getsockopt | cgroup/setsockopt |\n"
-		"                 struct_ops | fentry | fexit | freplace }\n"
+		"                 struct_ops | fentry | fexit | freplace | sk_lookup }\n"
 		"       ATTACH_TYPE := { msg_verdict | stream_verdict | stream_parser |\n"
 		"                        flow_dissector }\n"
 		"       METRIC := { cycles | instructions | l1d_loads | llc_misses }\n"
-- 
2.25.4



* [PATCH bpf-next v3 14/16] selftests/bpf: Add verifier tests for bpf_sk_lookup context access
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
                   ` (12 preceding siblings ...)
  2020-07-02  9:24 ` [PATCH bpf-next v3 13/16] tools/bpftool: Add name mappings for SK_LOOKUP prog and attach type Jakub Sitnicki
@ 2020-07-02  9:24 ` Jakub Sitnicki
  2020-07-02  9:24 ` [PATCH bpf-next v3 15/16] selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c Jakub Sitnicki
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Exercise verifier access checks for bpf_sk_lookup context fields.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Consolidate ACCEPT tests into one.
    - Deduplicate REJECT tests and arrange them into logical groups.
    - Add tests for out-of-bounds and unaligned access.
    - Cover access to newly introduced 'sk' field.
    
    v2:
     - Adjust for fields renames in struct bpf_sk_lookup.

 .../selftests/bpf/verifier/ctx_sk_lookup.c    | 219 ++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c

diff --git a/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c b/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c
new file mode 100644
index 000000000000..9542b07892ad
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c
@@ -0,0 +1,219 @@
+{
+	"valid 4-byte read from bpf_sk_lookup",
+	.insns = {
+		/* 4-byte read from family field */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, family)),
+		/* 4-byte read from protocol field */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, protocol)),
+		/* 4-byte read from remote_ip4 field */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip4)),
+		/* 4-byte read from remote_ip6 field */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[0])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[1])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[2])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_ip6[3])),
+		/* 4-byte read from remote_port field */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, remote_port)),
+		/* 4-byte read from local_ip4 field */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip4)),
+		/* 4-byte read from local_ip6 field */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[0])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[1])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[2])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_ip6[3])),
+		/* 4-byte read from local_port field */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, local_port)),
+		/* 8-byte read from sk field */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, sk)),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+/* invalid size reads from a 4-byte field in bpf_sk_lookup */
+{
+	"invalid 8-byte read from bpf_sk_lookup family field",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 2-byte read from bpf_sk_lookup family field",
+	.insns = {
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 1-byte read from bpf_sk_lookup family field",
+	.insns = {
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, family)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+/* invalid size reads from an 8-byte field in bpf_sk_lookup */
+{
+	"invalid 4-byte read from bpf_sk_lookup sk field",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, sk)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 2-byte read from bpf_sk_lookup sk field",
+	.insns = {
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, sk)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 1-byte read from bpf_sk_lookup sk field",
+	.insns = {
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_sk_lookup, sk)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+/* out of bounds and unaligned reads from bpf_sk_lookup */
+{
+	"invalid 4-byte read past end of bpf_sk_lookup",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    sizeof(struct bpf_sk_lookup)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 4-byte unaligned read from bpf_sk_lookup at odd offset",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 1),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 4-byte unaligned read from bpf_sk_lookup at even offset",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 2),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+/* writes to and out of bounds of bpf_sk_lookup */
+{
+	"invalid 8-byte write to bpf_sk_lookup",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0xcafe4a11U),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write to bpf_sk_lookup",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0xcafe4a11U),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 2-byte write to bpf_sk_lookup",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0xcafe4a11U),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 1-byte write to bpf_sk_lookup",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0xcafe4a11U),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
+{
+	"invalid 4-byte write past end of bpf_sk_lookup",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0xcafe4a11U),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    sizeof(struct bpf_sk_lookup)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_SK_LOOKUP,
+	.expected_attach_type = BPF_SK_LOOKUP,
+},
-- 
2.25.4



* [PATCH bpf-next v3 15/16] selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
                   ` (13 preceding siblings ...)
  2020-07-02  9:24 ` [PATCH bpf-next v3 14/16] selftests/bpf: Add verifier tests for bpf_sk_lookup context access Jakub Sitnicki
@ 2020-07-02  9:24 ` Jakub Sitnicki
  2020-07-02  9:24 ` [PATCH bpf-next v3 16/16] selftests/bpf: Tests for BPF_SK_LOOKUP attach point Jakub Sitnicki
  2020-07-02 11:05 ` [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Lorenz Bauer
  16 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Name the BPF C file after the test case that uses it.

This frees up "test_sk_lookup" namespace for BPF sk_lookup program tests
introduced by the following patch.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/bpf/prog_tests/reference_tracking.c     | 2 +-
 .../bpf/progs/{test_sk_lookup_kern.c => test_ref_track_kern.c}  | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename tools/testing/selftests/bpf/progs/{test_sk_lookup_kern.c => test_ref_track_kern.c} (100%)

diff --git a/tools/testing/selftests/bpf/prog_tests/reference_tracking.c b/tools/testing/selftests/bpf/prog_tests/reference_tracking.c
index fc0d7f4f02cf..106ca8bb2a8f 100644
--- a/tools/testing/selftests/bpf/prog_tests/reference_tracking.c
+++ b/tools/testing/selftests/bpf/prog_tests/reference_tracking.c
@@ -3,7 +3,7 @@
 
 void test_reference_tracking(void)
 {
-	const char *file = "test_sk_lookup_kern.o";
+	const char *file = "test_ref_track_kern.o";
 	const char *obj_name = "ref_track";
 	DECLARE_LIBBPF_OPTS(bpf_object_open_opts, open_opts,
 		.object_name = obj_name,
diff --git a/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c b/tools/testing/selftests/bpf/progs/test_ref_track_kern.c
similarity index 100%
rename from tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
rename to tools/testing/selftests/bpf/progs/test_ref_track_kern.c
-- 
2.25.4



* [PATCH bpf-next v3 16/16] selftests/bpf: Tests for BPF_SK_LOOKUP attach point
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
                   ` (14 preceding siblings ...)
  2020-07-02  9:24 ` [PATCH bpf-next v3 15/16] selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c Jakub Sitnicki
@ 2020-07-02  9:24 ` Jakub Sitnicki
  2020-07-02 11:01   ` Lorenz Bauer
  2020-07-02 11:05 ` [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Lorenz Bauer
  16 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02  9:24 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski

Add tests to test_progs that exercise:

 - attaching/detaching/querying programs to BPF_SK_LOOKUP hook,
 - redirecting socket lookup to a socket selected by BPF program,
 - failing a socket lookup on BPF program's request,
 - error scenarios for selecting a socket from BPF program,
 - accessing BPF program context,
 - attaching and running multiple BPF programs.

Run log:
| # ./test_progs -n 68
| #68/1 query lookup prog:OK
| #68/2 TCP IPv4 redir port:OK
| #68/3 TCP IPv4 redir addr:OK
| #68/4 TCP IPv4 redir with reuseport:OK
| #68/5 TCP IPv4 redir skip reuseport:OK
| #68/6 TCP IPv6 redir port:OK
| #68/7 TCP IPv6 redir addr:OK
| #68/8 TCP IPv4->IPv6 redir port:OK
| #68/9 TCP IPv6 redir with reuseport:OK
| #68/10 TCP IPv6 redir skip reuseport:OK
| #68/11 UDP IPv4 redir port:OK
| #68/12 UDP IPv4 redir addr:OK
| #68/13 UDP IPv4 redir with reuseport:OK
| #68/14 UDP IPv4 redir skip reuseport:OK
| #68/15 UDP IPv6 redir port:OK
| #68/16 UDP IPv6 redir addr:OK
| #68/17 UDP IPv4->IPv6 redir port:OK
| #68/18 UDP IPv6 redir and reuseport:OK
| #68/19 UDP IPv6 redir skip reuseport:OK
| #68/20 TCP IPv4 drop on lookup:OK
| #68/21 TCP IPv6 drop on lookup:OK
| #68/22 UDP IPv4 drop on lookup:OK
| #68/23 UDP IPv6 drop on lookup:OK
| #68/24 TCP IPv4 drop on reuseport:OK
| #68/25 TCP IPv6 drop on reuseport:OK
| #68/26 UDP IPv4 drop on reuseport:OK
| #68/27 UDP IPv6 drop on reuseport:OK
| #68/28 sk_assign returns EEXIST:OK
| #68/29 sk_assign honors F_REPLACE:OK
| #68/30 access ctx->sk:OK
| #68/31 sk_assign rejects TCP established:OK
| #68/32 sk_assign rejects UDP connected:OK
| #68/33 multi prog - pass, pass:OK
| #68/34 multi prog - pass, inval:OK
| #68/35 multi prog - inval, pass:OK
| #68/36 multi prog - drop, drop:OK
| #68/37 multi prog - pass, drop:OK
| #68/38 multi prog - drop, pass:OK
| #68/39 multi prog - drop, inval:OK
| #68/40 multi prog - inval, drop:OK
| #68/41 multi prog - pass, redir:OK
| #68/42 multi prog - redir, pass:OK
| #68/43 multi prog - drop, redir:OK
| #68/44 multi prog - redir, drop:OK
| #68/45 multi prog - inval, redir:OK
| #68/46 multi prog - redir, inval:OK
| #68/47 multi prog - redir, redir:OK
| #68 sk_lookup:OK
| Summary: 1/47 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---

Notes:
    v3:
    - Extend tests to cover new functionality in v3:
      - multi-prog attachments (query, running, verdict precedence)
      - selecting a socket for a second time with bpf_sk_assign
      - skipping over reuseport load-balancing
    
    v2:
     - Adjust for field renames in struct bpf_sk_lookup.

 .../selftests/bpf/prog_tests/sk_lookup.c      | 1353 +++++++++++++++++
 .../selftests/bpf/progs/test_sk_lookup_kern.c |  399 +++++
 2 files changed, 1752 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
new file mode 100644
index 000000000000..2859dc7e65b0
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
@@ -0,0 +1,1353 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2020 Cloudflare
+/*
+ * Test BPF attach point for INET socket lookup (BPF_SK_LOOKUP).
+ *
+ * Tests exercise:
+ *  - attaching/detaching/querying programs to BPF_SK_LOOKUP hook,
+ *  - redirecting socket lookup to a socket selected by BPF program,
+ *  - failing a socket lookup on BPF program's request,
+ *  - error scenarios for selecting a socket from BPF program,
+ *  - accessing BPF program context,
+ *  - attaching and running multiple BPF programs.
+ *
+ * Tests run in a dedicated network namespace.
+ */
+
+#define _GNU_SOURCE
+#include <arpa/inet.h>
+#include <assert.h>
+#include <errno.h>
+#include <error.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+
+#include "bpf_rlimit.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+#include "test_sk_lookup_kern.skel.h"
+#include "test_progs.h"
+
+/* External (address, port) pairs the client sends packets to. */
+#define EXT_IP4		"127.0.0.1"
+#define EXT_IP6		"fd00::1"
+#define EXT_PORT	7007
+
+/* Internal (address, port) pairs the server listens/receives at. */
+#define INT_IP4		"127.0.0.2"
+#define INT_IP4_V6	"::ffff:127.0.0.2"
+#define INT_IP6		"fd00::2"
+#define INT_PORT	8008
+
+#define IO_TIMEOUT_SEC	3
+
+enum server {
+	SERVER_A = 0,
+	SERVER_B = 1,
+	MAX_SERVERS,
+};
+
+enum {
+	PROG1 = 0,
+	PROG2,
+};
+
+struct inet_addr {
+	const char *ip;
+	unsigned short port;
+};
+
+struct test {
+	const char *desc;
+	struct bpf_program *lookup_prog;
+	struct bpf_program *reuseport_prog;
+	struct bpf_map *sock_map;
+	int sotype;
+	struct inet_addr connect_to;
+	struct inet_addr listen_at;
+	enum server accept_on;
+};
+
+static bool is_ipv6(const char *ip)
+{
+	return !!strchr(ip, ':');
+}
+
+static int make_addr(const char *ip, int port, struct sockaddr_storage *addr)
+{
+	struct sockaddr_in6 *addr6 = (void *)addr;
+	struct sockaddr_in *addr4 = (void *)addr;
+	int ret;
+
+	errno = 0;
+	if (is_ipv6(ip)) {
+		ret = inet_pton(AF_INET6, ip, &addr6->sin6_addr);
+		if (CHECK_FAIL(ret <= 0)) {
+			log_err("failed to convert IPv6 address '%s'", ip);
+			return -1;
+		}
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_port = htons(port);
+	} else {
+		ret = inet_pton(AF_INET, ip, &addr4->sin_addr);
+		if (CHECK_FAIL(ret <= 0)) {
+			log_err("failed to convert IPv4 address '%s'", ip);
+			return -1;
+		}
+		addr4->sin_family = AF_INET;
+		addr4->sin_port = htons(port);
+	}
+	return 0;
+}
+
+static int setup_reuseport_prog(int sock_fd, struct bpf_program *reuseport_prog)
+{
+	int err, prog_fd;
+
+	prog_fd = bpf_program__fd(reuseport_prog);
+	if (prog_fd < 0) {
+		errno = -prog_fd;
+		log_err("failed to get fd for program '%s'",
+			bpf_program__name(reuseport_prog));
+		return -1;
+	}
+
+	err = setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
+			 &prog_fd, sizeof(prog_fd));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to ATTACH_REUSEPORT_EBPF");
+		return -1;
+	}
+
+	return 0;
+}
+
+static socklen_t inetaddr_len(const struct sockaddr_storage *addr)
+{
+	return (addr->ss_family == AF_INET ? sizeof(struct sockaddr_in) :
+		addr->ss_family == AF_INET6 ? sizeof(struct sockaddr_in6) : 0);
+}
+
+static int make_socket_with_addr(int sotype, const char *ip, int port,
+				 struct sockaddr_storage *addr)
+{
+	struct timeval timeo = { .tv_sec = IO_TIMEOUT_SEC };
+	int err, fd;
+
+	err = make_addr(ip, port, addr);
+	if (err)
+		return -1;
+
+	fd = socket(addr->ss_family, sotype, 0);
+	if (CHECK_FAIL(fd < 0)) {
+		log_err("failed to create listen socket");
+		return -1;
+	}
+
+	err = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo, sizeof(timeo));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to set SO_SNDTIMEO");
+		return -1;
+	}
+
+	err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeo, sizeof(timeo));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to set SO_RCVTIMEO");
+		return -1;
+	}
+
+	return fd;
+}
+
+static int make_server(int sotype, const char *ip, int port,
+		       struct bpf_program *reuseport_prog)
+{
+	struct sockaddr_storage addr = {0};
+	const int one = 1;
+	int err, fd = -1;
+
+	fd = make_socket_with_addr(sotype, ip, port, &addr);
+	if (fd < 0)
+		return -1;
+
+	/* Enabled also on UDPv6 sockets, for IPv4-mapped IPv6 to work. */
+	if (sotype == SOCK_DGRAM) {
+		err = setsockopt(fd, SOL_IP, IP_RECVORIGDSTADDR, &one,
+				 sizeof(one));
+		if (CHECK_FAIL(err)) {
+			log_err("failed to enable IP_RECVORIGDSTADDR");
+			goto fail;
+		}
+	}
+
+	if (sotype == SOCK_DGRAM && addr.ss_family == AF_INET6) {
+		err = setsockopt(fd, SOL_IPV6, IPV6_RECVORIGDSTADDR, &one,
+				 sizeof(one));
+		if (CHECK_FAIL(err)) {
+			log_err("failed to enable IPV6_RECVORIGDSTADDR");
+			goto fail;
+		}
+	}
+
+	if (sotype == SOCK_STREAM) {
+		err = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one,
+				 sizeof(one));
+		if (CHECK_FAIL(err)) {
+			log_err("failed to enable SO_REUSEADDR");
+			goto fail;
+		}
+	}
+
+	if (reuseport_prog) {
+		err = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one,
+				 sizeof(one));
+		if (CHECK_FAIL(err)) {
+			log_err("failed to enable SO_REUSEPORT");
+			goto fail;
+		}
+	}
+
+	err = bind(fd, (void *)&addr, inetaddr_len(&addr));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to bind listen socket");
+		goto fail;
+	}
+
+	if (sotype == SOCK_STREAM) {
+		err = listen(fd, SOMAXCONN);
+		if (CHECK_FAIL(err)) {
+			log_err("failed to listen on port %d", port);
+			goto fail;
+		}
+	}
+
+	/* Late attach reuseport prog so we can have one init path */
+	if (reuseport_prog) {
+		err = setup_reuseport_prog(fd, reuseport_prog);
+		if (err)
+			goto fail;
+	}
+
+	return fd;
+fail:
+	close(fd);
+	return -1;
+}
+
+static int make_client(int sotype, const char *ip, int port)
+{
+	struct sockaddr_storage addr = {0};
+	int err, fd;
+
+	fd = make_socket_with_addr(sotype, ip, port, &addr);
+	if (fd < 0)
+		return -1;
+
+	err = connect(fd, (void *)&addr, inetaddr_len(&addr));
+	if (CHECK_FAIL(err)) {
+		log_err("failed to connect client socket");
+		goto fail;
+	}
+
+	return fd;
+fail:
+	close(fd);
+	return -1;
+}
+
+static int send_byte(int fd)
+{
+	ssize_t n;
+
+	errno = 0;
+	n = send(fd, "a", 1, 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed/partial send");
+		return -1;
+	}
+	return 0;
+}
+
+static int recv_byte(int fd)
+{
+	char buf[1];
+	ssize_t n;
+
+	n = recv(fd, buf, sizeof(buf), 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed/partial recv");
+		return -1;
+	}
+	return 0;
+}
+
+static int tcp_recv_send(int server_fd)
+{
+	char buf[1];
+	int ret, fd;
+	ssize_t n;
+
+	fd = accept(server_fd, NULL, NULL);
+	if (CHECK_FAIL(fd < 0)) {
+		log_err("failed to accept");
+		return -1;
+	}
+
+	n = recv(fd, buf, sizeof(buf), 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed/partial recv");
+		ret = -1;
+		goto close;
+	}
+
+	n = send(fd, buf, n, 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed/partial send");
+		ret = -1;
+		goto close;
+	}
+
+	ret = 0;
+close:
+	close(fd);
+	return ret;
+}
+
+static void v4_to_v6(struct sockaddr_storage *ss)
+{
+	struct sockaddr_in6 *v6 = (struct sockaddr_in6 *)ss;
+	struct sockaddr_in v4 = *(struct sockaddr_in *)ss;
+
+	v6->sin6_family = AF_INET6;
+	v6->sin6_port = v4.sin_port;
+	v6->sin6_addr.s6_addr[10] = 0xff;
+	v6->sin6_addr.s6_addr[11] = 0xff;
+	memcpy(&v6->sin6_addr.s6_addr[12], &v4.sin_addr.s_addr, 4);
+}
+
+static int udp_recv_send(int server_fd)
+{
+	char cmsg_buf[CMSG_SPACE(sizeof(struct sockaddr_storage))];
+	struct sockaddr_storage _src_addr = { 0 };
+	struct sockaddr_storage *src_addr = &_src_addr;
+	struct sockaddr_storage *dst_addr = NULL;
+	struct msghdr msg = { 0 };
+	struct iovec iov = { 0 };
+	struct cmsghdr *cm;
+	char buf[1];
+	int ret, fd;
+	ssize_t n;
+
+	iov.iov_base = buf;
+	iov.iov_len = sizeof(buf);
+
+	msg.msg_name = src_addr;
+	msg.msg_namelen = sizeof(*src_addr);
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = cmsg_buf;
+	msg.msg_controllen = sizeof(cmsg_buf);
+
+	errno = 0;
+	n = recvmsg(server_fd, &msg, 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed to receive");
+		return -1;
+	}
+	if (CHECK_FAIL(msg.msg_flags & MSG_CTRUNC)) {
+		log_err("truncated cmsg");
+		return -1;
+	}
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if ((cm->cmsg_level == SOL_IP &&
+		     cm->cmsg_type == IP_ORIGDSTADDR) ||
+		    (cm->cmsg_level == SOL_IPV6 &&
+		     cm->cmsg_type == IPV6_ORIGDSTADDR)) {
+			dst_addr = (struct sockaddr_storage *)CMSG_DATA(cm);
+			break;
+		}
+		log_err("warning: ignored cmsg at level %d type %d",
+			cm->cmsg_level, cm->cmsg_type);
+	}
+	if (CHECK_FAIL(!dst_addr)) {
+		log_err("failed to get destination address");
+		return -1;
+	}
+
+	/* Server socket bound to IPv4-mapped IPv6 address */
+	if (src_addr->ss_family == AF_INET6 &&
+	    dst_addr->ss_family == AF_INET) {
+		v4_to_v6(dst_addr);
+	}
+
+	/* Reply from original destination address. */
+	fd = socket(dst_addr->ss_family, SOCK_DGRAM, 0);
+	if (CHECK_FAIL(fd < 0)) {
+		log_err("failed to create tx socket");
+		return -1;
+	}
+
+	ret = bind(fd, (struct sockaddr *)dst_addr, sizeof(*dst_addr));
+	if (CHECK_FAIL(ret)) {
+		log_err("failed to bind tx socket");
+		goto out;
+	}
+
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	n = sendmsg(fd, &msg, 0);
+	if (CHECK_FAIL(n <= 0)) {
+		log_err("failed to send echo reply");
+		ret = -1;
+		goto out;
+	}
+
+	ret = 0;
+out:
+	close(fd);
+	return ret;
+}
+
+static int tcp_echo_test(int client_fd, int server_fd)
+{
+	int err;
+
+	err = send_byte(client_fd);
+	if (err)
+		return -1;
+	err = tcp_recv_send(server_fd);
+	if (err)
+		return -1;
+	err = recv_byte(client_fd);
+	if (err)
+		return -1;
+
+	return 0;
+}
+
+static int udp_echo_test(int client_fd, int server_fd)
+{
+	int err;
+
+	err = send_byte(client_fd);
+	if (err)
+		return -1;
+	err = udp_recv_send(server_fd);
+	if (err)
+		return -1;
+	err = recv_byte(client_fd);
+	if (err)
+		return -1;
+
+	return 0;
+}
+
+static struct bpf_link *attach_lookup_prog(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+	int net_fd;
+
+	net_fd = open("/proc/self/ns/net", O_RDONLY);
+	if (CHECK_FAIL(net_fd < 0)) {
+		log_err("failed to open /proc/self/ns/net");
+		return NULL;
+	}
+
+	link = bpf_program__attach_netns(prog, net_fd);
+	if (CHECK_FAIL(IS_ERR(link))) {
+		errno = -PTR_ERR(link);
+		log_err("failed to attach program '%s' to netns",
+			bpf_program__name(prog));
+		link = NULL;
+	}
+
+	close(net_fd);
+	return link;
+}
+
+static int update_lookup_map(struct bpf_map *map, int index, int sock_fd)
+{
+	int err, map_fd;
+	uint64_t value;
+
+	map_fd = bpf_map__fd(map);
+	if (CHECK_FAIL(map_fd < 0)) {
+		errno = -map_fd;
+		log_err("failed to get map FD");
+		return -1;
+	}
+
+	value = (uint64_t)sock_fd;
+	err = bpf_map_update_elem(map_fd, &index, &value, BPF_NOEXIST);
+	if (CHECK_FAIL(err)) {
+		log_err("failed to update redir_map @ %d", index);
+		return -1;
+	}
+
+	return 0;
+}
+
+static __u32 link_info_prog_id(struct bpf_link *link)
+{
+	struct bpf_link_info info = {};
+	__u32 info_len = sizeof(info);
+	int link_fd, err;
+
+	link_fd = bpf_link__fd(link);
+	if (CHECK_FAIL(link_fd < 0)) {
+		errno = -link_fd;
+		log_err("bpf_link__fd failed");
+		return 0;
+	}
+
+	err = bpf_obj_get_info_by_fd(link_fd, &info, &info_len);
+	if (CHECK_FAIL(err || info_len != sizeof(info))) {
+		log_err("bpf_obj_get_info_by_fd");
+		return 0;
+	}
+
+	return info.prog_id;
+}
+
+static void query_lookup_prog(struct test_sk_lookup_kern *skel)
+{
+	struct bpf_link *link[3] = {};
+	__u32 attach_flags = 0;
+	__u32 prog_ids[3] = {};
+	__u32 prog_cnt = 3;
+	__u32 prog_id;
+	int net_fd;
+	int err;
+
+	net_fd = open("/proc/self/ns/net", O_RDONLY);
+	if (CHECK_FAIL(net_fd < 0)) {
+		log_err("failed to open /proc/self/ns/net");
+		return;
+	}
+
+	link[0] = attach_lookup_prog(skel->progs.lookup_pass);
+	if (!link[0])
+		goto close;
+	link[1] = attach_lookup_prog(skel->progs.lookup_pass);
+	if (!link[1])
+		goto detach;
+	link[2] = attach_lookup_prog(skel->progs.lookup_drop);
+	if (!link[2])
+		goto detach;
+
+	err = bpf_prog_query(net_fd, BPF_SK_LOOKUP, 0 /* query flags */,
+			     &attach_flags, prog_ids, &prog_cnt);
+	if (CHECK_FAIL(err)) {
+		log_err("failed to query lookup prog");
+		goto detach;
+	}
+
+	errno = 0;
+	if (CHECK_FAIL(attach_flags != 0)) {
+		log_err("wrong attach_flags on query: %u", attach_flags);
+		goto detach;
+	}
+	if (CHECK_FAIL(prog_cnt != 3)) {
+		log_err("wrong program count on query: %u", prog_cnt);
+		goto detach;
+	}
+	prog_id = link_info_prog_id(link[0]);
+	if (CHECK_FAIL(prog_ids[0] != prog_id)) {
+		log_err("invalid program id on query: %u != %u",
+			prog_ids[0], prog_id);
+		goto detach;
+	}
+	prog_id = link_info_prog_id(link[1]);
+	if (CHECK_FAIL(prog_ids[1] != prog_id)) {
+		log_err("invalid program id on query: %u != %u",
+			prog_ids[1], prog_id);
+		goto detach;
+	}
+	prog_id = link_info_prog_id(link[2]);
+	if (CHECK_FAIL(prog_ids[2] != prog_id)) {
+		log_err("invalid program id on query: %u != %u",
+			prog_ids[2], prog_id);
+		goto detach;
+	}
+
+detach:
+	if (link[2])
+		bpf_link__destroy(link[2]);
+	if (link[1])
+		bpf_link__destroy(link[1]);
+	if (link[0])
+		bpf_link__destroy(link[0]);
+close:
+	close(net_fd);
+}
+
+static void run_lookup_prog(const struct test *t)
+{
+	int client_fd, server_fds[] = { [0 ... MAX_SERVERS - 1] = -1 };
+	struct bpf_link *lookup_link;
+	int i, err;
+
+	lookup_link = attach_lookup_prog(t->lookup_prog);
+	if (!lookup_link)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(server_fds); i++) {
+		server_fds[i] = make_server(t->sotype, t->listen_at.ip,
+					    t->listen_at.port,
+					    t->reuseport_prog);
+		if (server_fds[i] < 0)
+			goto close;
+
+		err = update_lookup_map(t->sock_map, i, server_fds[i]);
+		if (err)
+			goto close;
+
+		/* want just one server for non-reuseport test */
+		if (!t->reuseport_prog)
+			break;
+	}
+
+	client_fd = make_client(t->sotype, t->connect_to.ip, t->connect_to.port);
+	if (client_fd < 0)
+		goto close;
+
+	if (t->sotype == SOCK_STREAM)
+		tcp_echo_test(client_fd, server_fds[t->accept_on]);
+	else
+		udp_echo_test(client_fd, server_fds[t->accept_on]);
+
+	close(client_fd);
+close:
+	for (i = 0; i < ARRAY_SIZE(server_fds); i++) {
+		if (server_fds[i] != -1)
+			close(server_fds[i]);
+	}
+	bpf_link__destroy(lookup_link);
+}
+
+static void test_redirect_lookup(struct test_sk_lookup_kern *skel)
+{
+	const struct test tests[] = {
+		{
+			.desc		= "TCP IPv4 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { EXT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "TCP IPv4 redir addr",
+			.lookup_prog	= skel->progs.redir_ip4,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "TCP IPv4 redir with reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.select_sock_b,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4, INT_PORT },
+			.accept_on	= SERVER_B,
+		},
+		{
+			.desc		= "TCP IPv4 redir skip reuseport",
+			.lookup_prog	= skel->progs.select_sock_a_no_reuseport,
+			.reuseport_prog	= skel->progs.select_sock_b,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4, INT_PORT },
+			.accept_on	= SERVER_A,
+		},
+		{
+			.desc		= "TCP IPv6 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { EXT_IP6, INT_PORT },
+		},
+		{
+			.desc		= "TCP IPv6 redir addr",
+			.lookup_prog	= skel->progs.redir_ip6,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { INT_IP6, EXT_PORT },
+		},
+		{
+			.desc		= "TCP IPv4->IPv6 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4_V6, INT_PORT },
+		},
+		{
+			.desc		= "TCP IPv6 redir with reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.select_sock_b,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { INT_IP6, INT_PORT },
+			.accept_on	= SERVER_B,
+		},
+		{
+			.desc		= "TCP IPv6 redir skip reuseport",
+			.lookup_prog	= skel->progs.select_sock_a_no_reuseport,
+			.reuseport_prog	= skel->progs.select_sock_b,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { INT_IP6, INT_PORT },
+			.accept_on	= SERVER_A,
+		},
+		{
+			.desc		= "UDP IPv4 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { EXT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4 redir addr",
+			.lookup_prog	= skel->progs.redir_ip4,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4 redir with reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.select_sock_b,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4, INT_PORT },
+			.accept_on	= SERVER_B,
+		},
+		{
+			.desc		= "UDP IPv4 redir skip reuseport",
+			.lookup_prog	= skel->progs.select_sock_a_no_reuseport,
+			.reuseport_prog	= skel->progs.select_sock_b,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4, INT_PORT },
+			.accept_on	= SERVER_A,
+		},
+		{
+			.desc		= "UDP IPv6 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { EXT_IP6, INT_PORT },
+		},
+		{
+			.desc		= "UDP IPv6 redir addr",
+			.lookup_prog	= skel->progs.redir_ip6,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { INT_IP6, EXT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4->IPv6 redir port",
+			.lookup_prog	= skel->progs.redir_port,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4_V6, INT_PORT },
+		},
+		{
+			.desc		= "UDP IPv6 redir and reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.select_sock_b,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { INT_IP6, INT_PORT },
+			.accept_on	= SERVER_B,
+		},
+		{
+			.desc		= "UDP IPv6 redir skip reuseport",
+			.lookup_prog	= skel->progs.select_sock_a_no_reuseport,
+			.reuseport_prog	= skel->progs.select_sock_b,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { INT_IP6, INT_PORT },
+			.accept_on	= SERVER_A,
+		},
+	};
+	const struct test *t;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		if (test__start_subtest(t->desc))
+			run_lookup_prog(t);
+	}
+}
+
+static void drop_on_lookup(const struct test *t)
+{
+	struct sockaddr_storage dst = {};
+	int client_fd, server_fd, err;
+	struct bpf_link *lookup_link;
+	ssize_t n;
+
+	lookup_link = attach_lookup_prog(t->lookup_prog);
+	if (!lookup_link)
+		return;
+
+	server_fd = make_server(t->sotype, t->listen_at.ip, t->listen_at.port,
+				t->reuseport_prog);
+	if (server_fd < 0)
+		goto detach;
+
+	client_fd = make_socket_with_addr(t->sotype, t->connect_to.ip,
+					  t->connect_to.port, &dst);
+	if (client_fd < 0)
+		goto close_srv;
+
+	err = connect(client_fd, (void *)&dst, inetaddr_len(&dst));
+	if (t->sotype == SOCK_DGRAM) {
+		err = send_byte(client_fd);
+		if (err)
+			goto close_all;
+
+		/* Read out asynchronous error */
+		n = recv(client_fd, NULL, 0, 0);
+		err = n == -1;
+	}
+	if (CHECK_FAIL(!err || errno != ECONNREFUSED))
+		log_err("expected ECONNREFUSED on connect");
+
+close_all:
+	close(client_fd);
+close_srv:
+	close(server_fd);
+detach:
+	bpf_link__destroy(lookup_link);
+}
+
+static void test_drop_on_lookup(struct test_sk_lookup_kern *skel)
+{
+	const struct test tests[] = {
+		{
+			.desc		= "TCP IPv4 drop on lookup",
+			.lookup_prog	= skel->progs.lookup_drop,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { EXT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "TCP IPv6 drop on lookup",
+			.lookup_prog	= skel->progs.lookup_drop,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { EXT_IP6, EXT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4 drop on lookup",
+			.lookup_prog	= skel->progs.lookup_drop,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { EXT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "UDP IPv6 drop on lookup",
+			.lookup_prog	= skel->progs.lookup_drop,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { EXT_IP6, INT_PORT },
+		},
+	};
+	const struct test *t;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		if (test__start_subtest(t->desc))
+			drop_on_lookup(t);
+	}
+}
+
+static void drop_on_reuseport(const struct test *t)
+{
+	struct sockaddr_storage dst = { 0 };
+	int client, server1, server2, err;
+	struct bpf_link *lookup_link;
+	ssize_t n;
+
+	lookup_link = attach_lookup_prog(t->lookup_prog);
+	if (!lookup_link)
+		return;
+
+	server1 = make_server(t->sotype, t->listen_at.ip, t->listen_at.port,
+			      t->reuseport_prog);
+	if (server1 < 0)
+		goto detach;
+
+	err = update_lookup_map(t->sock_map, SERVER_A, server1);
+	if (err)
+		goto detach;
+
+	/* Second server on destination address we must never reach */
+	server2 = make_server(t->sotype, t->connect_to.ip, t->connect_to.port,
+			      NULL /* reuseport prog */);
+	if (server2 < 0)
+		goto close_srv1;
+
+	client = make_socket_with_addr(t->sotype, t->connect_to.ip,
+				       t->connect_to.port, &dst);
+	if (client < 0)
+		goto close_srv2;
+
+	err = connect(client, (void *)&dst, inetaddr_len(&dst));
+	if (t->sotype == SOCK_DGRAM) {
+		err = send_byte(client);
+		if (err)
+			goto close_all;
+
+		/* Read out asynchronous error */
+		n = recv(client, NULL, 0, 0);
+		err = n == -1;
+	}
+	if (CHECK_FAIL(!err || errno != ECONNREFUSED))
+		log_err("expected ECONNREFUSED on connect");
+
+close_all:
+	close(client);
+close_srv2:
+	close(server2);
+close_srv1:
+	close(server1);
+detach:
+	bpf_link__destroy(lookup_link);
+}
+
+static void test_drop_on_reuseport(struct test_sk_lookup_kern *skel)
+{
+	const struct test tests[] = {
+		{
+			.desc		= "TCP IPv4 drop on reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.reuseport_drop,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "TCP IPv6 drop on reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.reuseport_drop,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_STREAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { INT_IP6, INT_PORT },
+		},
+		{
+			.desc		= "UDP IPv4 drop on reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.reuseport_drop,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP4, EXT_PORT },
+			.listen_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "UDP IPv6 drop on reuseport",
+			.lookup_prog	= skel->progs.select_sock_a,
+			.reuseport_prog	= skel->progs.reuseport_drop,
+			.sock_map	= skel->maps.redir_map,
+			.sotype		= SOCK_DGRAM,
+			.connect_to	= { EXT_IP6, EXT_PORT },
+			.listen_at	= { INT_IP6, INT_PORT },
+		},
+	};
+	const struct test *t;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		if (test__start_subtest(t->desc))
+			drop_on_reuseport(t);
+	}
+}
+
+static void run_sk_assign(struct test_sk_lookup_kern *skel,
+			  struct bpf_program *lookup_prog)
+{
+	int client_fd, peer_fd, server_fds[] = { [0 ... MAX_SERVERS - 1] = -1 };
+	struct bpf_link *lookup_link;
+	int i, err;
+
+	lookup_link = attach_lookup_prog(lookup_prog);
+	if (!lookup_link)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(server_fds); i++) {
+		server_fds[i] = make_server(SOCK_STREAM, INT_IP4, 0, NULL);
+		if (server_fds[i] < 0)
+			goto close_servers;
+
+		err = update_lookup_map(skel->maps.redir_map, i,
+					server_fds[i]);
+		if (err)
+			goto close_servers;
+	}
+
+	client_fd = make_client(SOCK_STREAM, EXT_IP4, EXT_PORT);
+	if (client_fd < 0)
+		goto close_servers;
+
+	peer_fd = accept(server_fds[SERVER_B], NULL, NULL);
+	if (CHECK_FAIL(peer_fd < 0))
+		goto close_client;
+
+	close(peer_fd);
+close_client:
+	close(client_fd);
+close_servers:
+	for (i = 0; i < ARRAY_SIZE(server_fds); i++) {
+		if (server_fds[i] != -1)
+			close(server_fds[i]);
+	}
+	bpf_link__destroy(lookup_link);
+}
+
+static void run_sk_assign_connected(struct test_sk_lookup_kern *skel,
+				    int sotype)
+{
+	int err, client_fd, connected_fd, server_fd;
+	struct bpf_link *lookup_link;
+
+	server_fd = make_server(sotype, EXT_IP4, EXT_PORT, NULL);
+	if (server_fd < 0)
+		return;
+
+	connected_fd = make_client(sotype, EXT_IP4, EXT_PORT);
+	if (connected_fd < 0)
+		goto out_close_server;
+
+	/* Put a connected socket in redirect map */
+	err = update_lookup_map(skel->maps.redir_map, SERVER_A, connected_fd);
+	if (err)
+		goto out_close_connected;
+
+	lookup_link = attach_lookup_prog(skel->progs.sk_assign_esocknosupport);
+	if (!lookup_link)
+		goto out_close_connected;
+
+	/* Try to redirect TCP SYN / UDP packet to a connected socket */
+	client_fd = make_client(sotype, EXT_IP4, EXT_PORT);
+	if (client_fd < 0)
+		goto out_unlink_prog;
+	if (sotype == SOCK_DGRAM) {
+		send_byte(client_fd);
+		recv_byte(server_fd);
+	}
+
+	close(client_fd);
+out_unlink_prog:
+	bpf_link__destroy(lookup_link);
+out_close_connected:
+	close(connected_fd);
+out_close_server:
+	close(server_fd);
+}
+
+static void test_sk_assign_helper(struct test_sk_lookup_kern *skel)
+{
+	if (test__start_subtest("sk_assign returns EEXIST"))
+		run_sk_assign(skel, skel->progs.sk_assign_eexist);
+	if (test__start_subtest("sk_assign honors F_REPLACE"))
+		run_sk_assign(skel, skel->progs.sk_assign_replace_flag);
+	if (test__start_subtest("access ctx->sk"))
+		run_sk_assign(skel, skel->progs.access_ctx_sk);
+	if (test__start_subtest("sk_assign rejects TCP established"))
+		run_sk_assign_connected(skel, SOCK_STREAM);
+	if (test__start_subtest("sk_assign rejects UDP connected"))
+		run_sk_assign_connected(skel, SOCK_DGRAM);
+}
+
+struct test_multi_prog {
+	const char *desc;
+	struct bpf_program *prog1;
+	struct bpf_program *prog2;
+	struct bpf_map *redir_map;
+	struct bpf_map *run_map;
+	int expect_errno;
+	struct inet_addr listen_at;
+};
+
+static void run_multi_prog_lookup(const struct test_multi_prog *t)
+{
+	struct sockaddr_storage dst = {};
+	int map_fd, server_fd, client_fd;
+	struct bpf_link *link1, *link2;
+	int prog_idx, done, err;
+
+	map_fd = bpf_map__fd(t->run_map);
+
+	done = 0;
+	prog_idx = PROG1;
+	CHECK_FAIL(bpf_map_update_elem(map_fd, &prog_idx, &done, BPF_ANY));
+	prog_idx = PROG2;
+	CHECK_FAIL(bpf_map_update_elem(map_fd, &prog_idx, &done, BPF_ANY));
+
+	link1 = attach_lookup_prog(t->prog1);
+	if (!link1)
+		return;
+	link2 = attach_lookup_prog(t->prog2);
+	if (!link2)
+		goto out_unlink1;
+
+	server_fd = make_server(SOCK_STREAM, t->listen_at.ip,
+				t->listen_at.port, NULL);
+	if (server_fd < 0)
+		goto out_unlink2;
+
+	err = update_lookup_map(t->redir_map, SERVER_A, server_fd);
+	if (err)
+		goto out_close_server;
+
+	client_fd = make_socket_with_addr(SOCK_STREAM, EXT_IP4, EXT_PORT,
+					  &dst);
+	if (client_fd < 0)
+		goto out_close_server;
+
+	err = connect(client_fd, (void *)&dst, inetaddr_len(&dst));
+	if (CHECK_FAIL(err && !t->expect_errno))
+		goto out_close_client;
+	if (CHECK_FAIL(err && t->expect_errno && errno != t->expect_errno))
+		goto out_close_client;
+
+	done = 0;
+	prog_idx = PROG1;
+	CHECK_FAIL(bpf_map_lookup_elem(map_fd, &prog_idx, &done));
+	CHECK_FAIL(!done);
+
+	done = 0;
+	prog_idx = PROG2;
+	CHECK_FAIL(bpf_map_lookup_elem(map_fd, &prog_idx, &done));
+	CHECK_FAIL(!done);
+
+out_close_client:
+	close(client_fd);
+out_close_server:
+	close(server_fd);
+out_unlink2:
+	bpf_link__destroy(link2);
+out_unlink1:
+	bpf_link__destroy(link1);
+}
+
+static void test_multi_prog_lookup(struct test_sk_lookup_kern *skel)
+{
+	struct test_multi_prog tests[] = {
+		{
+			.desc		= "multi prog - pass, pass",
+			.prog1		= skel->progs.multi_prog_pass1,
+			.prog2		= skel->progs.multi_prog_pass2,
+			.listen_at	= { EXT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "multi prog - pass, inval",
+			.prog1		= skel->progs.multi_prog_pass1,
+			.prog2		= skel->progs.multi_prog_inval2,
+			.listen_at	= { EXT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "multi prog - inval, pass",
+			.prog1		= skel->progs.multi_prog_inval1,
+			.prog2		= skel->progs.multi_prog_pass2,
+			.listen_at	= { EXT_IP4, EXT_PORT },
+		},
+		{
+			.desc		= "multi prog - drop, drop",
+			.prog1		= skel->progs.multi_prog_drop1,
+			.prog2		= skel->progs.multi_prog_drop2,
+			.listen_at	= { EXT_IP4, EXT_PORT },
+			.expect_errno	= ECONNREFUSED,
+		},
+		{
+			.desc		= "multi prog - pass, drop",
+			.prog1		= skel->progs.multi_prog_pass1,
+			.prog2		= skel->progs.multi_prog_drop2,
+			.listen_at	= { EXT_IP4, EXT_PORT },
+			.expect_errno	= ECONNREFUSED,
+		},
+		{
+			.desc		= "multi prog - drop, pass",
+			.prog1		= skel->progs.multi_prog_drop1,
+			.prog2		= skel->progs.multi_prog_pass2,
+			.listen_at	= { EXT_IP4, EXT_PORT },
+			.expect_errno	= ECONNREFUSED,
+		},
+		{
+			.desc		= "multi prog - drop, inval",
+			.prog1		= skel->progs.multi_prog_drop1,
+			.prog2		= skel->progs.multi_prog_inval2,
+			.listen_at	= { EXT_IP4, EXT_PORT },
+			.expect_errno	= ECONNREFUSED,
+		},
+		{
+			.desc		= "multi prog - inval, drop",
+			.prog1		= skel->progs.multi_prog_inval1,
+			.prog2		= skel->progs.multi_prog_drop2,
+			.listen_at	= { EXT_IP4, EXT_PORT },
+			.expect_errno	= ECONNREFUSED,
+		},
+		{
+			.desc		= "multi prog - pass, redir",
+			.prog1		= skel->progs.multi_prog_pass1,
+			.prog2		= skel->progs.multi_prog_redir2,
+			.listen_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "multi prog - redir, pass",
+			.prog1		= skel->progs.multi_prog_redir1,
+			.prog2		= skel->progs.multi_prog_pass2,
+			.listen_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "multi prog - drop, redir",
+			.prog1		= skel->progs.multi_prog_drop1,
+			.prog2		= skel->progs.multi_prog_redir2,
+			.listen_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "multi prog - redir, drop",
+			.prog1		= skel->progs.multi_prog_redir1,
+			.prog2		= skel->progs.multi_prog_drop2,
+			.listen_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "multi prog - inval, redir",
+			.prog1		= skel->progs.multi_prog_inval1,
+			.prog2		= skel->progs.multi_prog_redir2,
+			.listen_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "multi prog - redir, inval",
+			.prog1		= skel->progs.multi_prog_redir1,
+			.prog2		= skel->progs.multi_prog_inval2,
+			.listen_at	= { INT_IP4, INT_PORT },
+		},
+		{
+			.desc		= "multi prog - redir, redir",
+			.prog1		= skel->progs.multi_prog_redir1,
+			.prog2		= skel->progs.multi_prog_redir2,
+			.listen_at	= { INT_IP4, INT_PORT },
+		},
+	};
+	struct test_multi_prog *t;
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
+		t->redir_map = skel->maps.redir_map;
+		t->run_map = skel->maps.run_map;
+		if (test__start_subtest(t->desc))
+			run_multi_prog_lookup(t);
+	}
+}
+
+static void run_tests(struct test_sk_lookup_kern *skel)
+{
+	if (test__start_subtest("query lookup prog"))
+		query_lookup_prog(skel);
+	test_redirect_lookup(skel);
+	test_drop_on_lookup(skel);
+	test_drop_on_reuseport(skel);
+	test_sk_assign_helper(skel);
+	test_multi_prog_lookup(skel);
+}
+
+static int switch_netns(int *saved_net)
+{
+	static const char * const setup_script[] = {
+		"ip -6 addr add dev lo " EXT_IP6 "/128 nodad",
+		"ip -6 addr add dev lo " INT_IP6 "/128 nodad",
+		"ip link set dev lo up",
+		NULL,
+	};
+	const char * const *cmd;
+	int net_fd, err;
+
+	net_fd = open("/proc/self/ns/net", O_RDONLY);
+	if (CHECK_FAIL(net_fd < 0)) {
+		log_err("open(/proc/self/ns/net)");
+		return -1;
+	}
+
+	err = unshare(CLONE_NEWNET);
+	if (CHECK_FAIL(err)) {
+		log_err("unshare(CLONE_NEWNET)");
+		goto close;
+	}
+
+	for (cmd = setup_script; *cmd; cmd++) {
+		err = system(*cmd);
+		if (CHECK_FAIL(err)) {
+			log_err("system(%s)", *cmd);
+			goto close;
+		}
+	}
+
+	*saved_net = net_fd;
+	return 0;
+
+close:
+	close(net_fd);
+	return -1;
+}
+
+static void restore_netns(int saved_net)
+{
+	int err;
+
+	err = setns(saved_net, CLONE_NEWNET);
+	if (CHECK_FAIL(err))
+		log_err("setns(CLONE_NEWNET)");
+
+	close(saved_net);
+}
+
+void test_sk_lookup(void)
+{
+	struct test_sk_lookup_kern *skel;
+	int err, saved_net;
+
+	err = switch_netns(&saved_net);
+	if (err)
+		return;
+
+	skel = test_sk_lookup_kern__open_and_load();
+	if (CHECK_FAIL(!skel)) {
+		errno = 0;
+		log_err("failed to open and load BPF skeleton");
+		goto restore_netns;
+	}
+
+	run_tests(skel);
+
+	test_sk_lookup_kern__destroy(skel);
+restore_netns:
+	restore_netns(saved_net);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c b/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
new file mode 100644
index 000000000000..75745898fd3b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
@@ -0,0 +1,399 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2020 Cloudflare
+
+#include <errno.h>
+#include <linux/bpf.h>
+#include <sys/socket.h>
+
+#include <bpf/bpf_endian.h>
+#include <bpf/bpf_helpers.h>
+
+#define IP4(a, b, c, d)					\
+	bpf_htonl((((__u32)(a) & 0xffU) << 24) |	\
+		  (((__u32)(b) & 0xffU) << 16) |	\
+		  (((__u32)(c) & 0xffU) <<  8) |	\
+		  (((__u32)(d) & 0xffU) <<  0))
+#define IP6(aaaa, bbbb, cccc, dddd)			\
+	{ bpf_htonl(aaaa), bpf_htonl(bbbb), bpf_htonl(cccc), bpf_htonl(dddd) }
+
+#define MAX_SOCKS 32
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, MAX_SOCKS);
+	__type(key, __u32);
+	__type(value, __u64);
+} redir_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 2);
+	__type(key, int);
+	__type(value, int);
+} run_map SEC(".maps");
+
+enum {
+	PROG1 = 0,
+	PROG2,
+};
+
+enum {
+	SERVER_A = 0,
+	SERVER_B,
+};
+
+/* Addressable key/value constants for convenience */
+static const int KEY_PROG1 = PROG1;
+static const int KEY_PROG2 = PROG2;
+static const int PROG_DONE = 1;
+
+static const __u32 KEY_SERVER_A = SERVER_A;
+static const __u32 KEY_SERVER_B = SERVER_B;
+
+static const __u32 DST_PORT = 7007;
+static const __u32 DST_IP4 = IP4(127, 0, 0, 1);
+static const __u32 DST_IP6[] = IP6(0xfd000000, 0x0, 0x0, 0x00000001);
+
+SEC("sk_lookup/lookup_pass")
+int lookup_pass(struct bpf_sk_lookup *ctx)
+{
+	return BPF_OK;
+}
+
+SEC("sk_lookup/lookup_drop")
+int lookup_drop(struct bpf_sk_lookup *ctx)
+{
+	return BPF_DROP;
+}
+
+SEC("sk_reuseport/reuse_pass")
+int reuseport_pass(struct sk_reuseport_md *ctx)
+{
+	return SK_PASS;
+}
+
+SEC("sk_reuseport/reuse_drop")
+int reuseport_drop(struct sk_reuseport_md *ctx)
+{
+	return SK_DROP;
+}
+
+/* Redirect packets destined for port DST_PORT to socket at redir_map[0]. */
+SEC("sk_lookup/redir_port")
+int redir_port(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err;
+
+	if (ctx->local_port != DST_PORT)
+		return BPF_OK;
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		return BPF_OK;
+
+	err = bpf_sk_assign(ctx, sk, 0);
+	bpf_sk_release(sk);
+	return err ? BPF_DROP : BPF_REDIRECT;
+}
+
+/* Redirect packets destined for DST_IP4 address to socket at redir_map[0]. */
+SEC("sk_lookup/redir_ip4")
+int redir_ip4(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err;
+
+	if (ctx->family != AF_INET)
+		return BPF_OK;
+	if (ctx->local_port != DST_PORT)
+		return BPF_OK;
+	if (ctx->local_ip4 != DST_IP4)
+		return BPF_OK;
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		return BPF_OK;
+
+	err = bpf_sk_assign(ctx, sk, 0);
+	bpf_sk_release(sk);
+	return err ? BPF_DROP : BPF_REDIRECT;
+}
+
+/* Redirect packets destined for DST_IP6 address to socket at redir_map[0]. */
+SEC("sk_lookup/redir_ip6")
+int redir_ip6(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err;
+
+	if (ctx->family != AF_INET6)
+		return BPF_OK;
+	if (ctx->local_port != DST_PORT)
+		return BPF_OK;
+	if (ctx->local_ip6[0] != DST_IP6[0] ||
+	    ctx->local_ip6[1] != DST_IP6[1] ||
+	    ctx->local_ip6[2] != DST_IP6[2] ||
+	    ctx->local_ip6[3] != DST_IP6[3])
+		return BPF_OK;
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		return BPF_OK;
+
+	err = bpf_sk_assign(ctx, sk, 0);
+	bpf_sk_release(sk);
+	return err ? BPF_DROP : BPF_REDIRECT;
+}
+
+SEC("sk_lookup/select_sock_a")
+int select_sock_a(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err;
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		return BPF_OK;
+
+	err = bpf_sk_assign(ctx, sk, 0);
+	bpf_sk_release(sk);
+	return err ? BPF_DROP : BPF_REDIRECT;
+}
+
+SEC("sk_lookup/select_sock_a_no_reuseport")
+int select_sock_a_no_reuseport(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err;
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		return BPF_DROP;
+
+	err = bpf_sk_assign(ctx, sk, BPF_SK_LOOKUP_F_NO_REUSEPORT);
+	bpf_sk_release(sk);
+	return err ? BPF_DROP : BPF_REDIRECT;
+}
+
+SEC("sk_reuseport/select_sock_b")
+int select_sock_b(struct sk_reuseport_md *ctx)
+{
+	__u32 key = KEY_SERVER_B;
+	int err;
+
+	err = bpf_sk_select_reuseport(ctx, &redir_map, &key, 0);
+	return err ? SK_DROP : SK_PASS;
+}
+
+/* Check that bpf_sk_assign() returns -EEXIST if socket already selected. */
+SEC("sk_lookup/sk_assign_eexist")
+int sk_assign_eexist(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err, ret;
+
+	ret = BPF_DROP;
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_B);
+	if (!sk)
+		goto out;
+	err = bpf_sk_assign(ctx, sk, 0);
+	if (err)
+		goto out;
+	bpf_sk_release(sk);
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		goto out;
+	err = bpf_sk_assign(ctx, sk, 0);
+	if (err != -EEXIST) {
+		bpf_printk("sk_assign returned %d, expected %d\n",
+			   err, -EEXIST);
+		goto out;
+	}
+
+	ret = BPF_REDIRECT; /* Success, redirect to KEY_SERVER_B */
+out:
+	if (sk)
+		bpf_sk_release(sk);
+	return ret;
+}
+
+/* Check that bpf_sk_assign(BPF_SK_LOOKUP_F_REPLACE) can override selection. */
+SEC("sk_lookup/sk_assign_replace_flag")
+int sk_assign_replace_flag(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err, ret;
+
+	ret = BPF_DROP;
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		goto out;
+	err = bpf_sk_assign(ctx, sk, 0);
+	if (err)
+		goto out;
+	bpf_sk_release(sk);
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_B);
+	if (!sk)
+		goto out;
+	err = bpf_sk_assign(ctx, sk, BPF_SK_LOOKUP_F_REPLACE);
+	if (err) {
+		bpf_printk("sk_assign returned %d, expected 0\n", err);
+		goto out;
+	}
+
+	ret = BPF_REDIRECT; /* Success, redirect to KEY_SERVER_B */
+out:
+	if (sk)
+		bpf_sk_release(sk);
+	return ret;
+}
+
+/* Check that the selected sk is accessible through the context. */
+SEC("sk_lookup/access_ctx_sk")
+int access_ctx_sk(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err, ret;
+
+	ret = BPF_DROP;
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		goto out;
+	err = bpf_sk_assign(ctx, sk, 0);
+	if (err)
+		goto out;
+	if (sk != ctx->sk) {
+		bpf_printk("expected ctx->sk == KEY_SERVER_A\n");
+		goto out;
+	}
+	bpf_sk_release(sk);
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_B);
+	if (!sk)
+		goto out;
+	err = bpf_sk_assign(ctx, sk, BPF_SK_LOOKUP_F_REPLACE);
+	if (err)
+		goto out;
+	if (sk != ctx->sk) {
+		bpf_printk("expected ctx->sk == KEY_SERVER_B\n");
+		goto out;
+	}
+
+	ret = BPF_REDIRECT; /* Success, redirect to KEY_SERVER_B */
+out:
+	if (sk)
+		bpf_sk_release(sk);
+	return ret;
+}
+
+/* Check that sk_assign rejects the KEY_SERVER_A socket with -ESOCKTNOSUPPORT */
+SEC("sk_lookup/sk_assign_esocknosupport")
+int sk_assign_esocknosupport(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err, ret;
+
+	ret = BPF_DROP;
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		goto out;
+
+	err = bpf_sk_assign(ctx, sk, 0);
+	if (err != -ESOCKTNOSUPPORT) {
+		bpf_printk("sk_assign returned %d, expected %d\n",
+			   err, -ESOCKTNOSUPPORT);
+		goto out;
+	}
+
+	ret = BPF_OK; /* Success, pass to regular lookup */
+out:
+	if (sk)
+		bpf_sk_release(sk);
+	return ret;
+}
+
+SEC("sk_lookup/multi_prog_pass1")
+int multi_prog_pass1(struct bpf_sk_lookup *ctx)
+{
+	bpf_map_update_elem(&run_map, &KEY_PROG1, &PROG_DONE, BPF_ANY);
+	return BPF_OK;
+}
+
+SEC("sk_lookup/multi_prog_pass2")
+int multi_prog_pass2(struct bpf_sk_lookup *ctx)
+{
+	bpf_map_update_elem(&run_map, &KEY_PROG2, &PROG_DONE, BPF_ANY);
+	return BPF_OK;
+}
+
+SEC("sk_lookup/multi_prog_drop1")
+int multi_prog_drop1(struct bpf_sk_lookup *ctx)
+{
+	bpf_map_update_elem(&run_map, &KEY_PROG1, &PROG_DONE, BPF_ANY);
+	return BPF_DROP;
+}
+
+SEC("sk_lookup/multi_prog_drop2")
+int multi_prog_drop2(struct bpf_sk_lookup *ctx)
+{
+	bpf_map_update_elem(&run_map, &KEY_PROG2, &PROG_DONE, BPF_ANY);
+	return BPF_DROP;
+}
+
+SEC("sk_lookup/multi_prog_inval1")
+int multi_prog_inval1(struct bpf_sk_lookup *ctx)
+{
+	bpf_map_update_elem(&run_map, &KEY_PROG1, &PROG_DONE, BPF_ANY);
+	return -1;
+}
+
+SEC("sk_lookup/multi_prog_inval2")
+int multi_prog_inval2(struct bpf_sk_lookup *ctx)
+{
+	bpf_map_update_elem(&run_map, &KEY_PROG2, &PROG_DONE, BPF_ANY);
+	return -1;
+}
+
+SEC("sk_lookup/multi_prog_redir1")
+int multi_prog_redir1(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err;
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		return BPF_DROP;
+
+	err = bpf_sk_assign(ctx, sk, 0);
+	bpf_sk_release(sk);
+	if (err)
+		return BPF_DROP;
+
+	bpf_map_update_elem(&run_map, &KEY_PROG1, &PROG_DONE, BPF_ANY);
+	return BPF_REDIRECT;
+}
+
+SEC("sk_lookup/multi_prog_redir2")
+int multi_prog_redir2(struct bpf_sk_lookup *ctx)
+{
+	struct bpf_sock *sk;
+	int err;
+
+	sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
+	if (!sk)
+		return BPF_DROP;
+
+	err = bpf_sk_assign(ctx, sk, BPF_SK_LOOKUP_F_REPLACE);
+	bpf_sk_release(sk);
+	if (err)
+		return BPF_DROP;
+
+	bpf_map_update_elem(&run_map, &KEY_PROG2, &PROG_DONE, BPF_ANY);
+	return BPF_REDIRECT;
+}
+
+char _license[] SEC("license") = "Dual BSD/GPL";
+__u32 _version SEC("version") = 1;
-- 
2.25.4



* Re: [PATCH bpf-next v3 04/16] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-07-02  9:24 ` [PATCH bpf-next v3 04/16] inet: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
@ 2020-07-02 10:27   ` Lorenz Bauer
  2020-07-02 12:46     ` Jakub Sitnicki
  2020-07-06 12:06   ` Jakub Sitnicki
  1 sibling, 1 reply; 51+ messages in thread
From: Lorenz Bauer @ 2020-07-02 10:27 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Marek Majkowski

On Thu, 2 Jul 2020 at 10:24, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Run a BPF program before looking up a listening socket on the receive path.
> The program selects a listening socket to yield as the result of the socket
> lookup by calling the bpf_sk_assign() helper and returning the
> BPF_REDIRECT (7) code.
>
> Alternatively, the program can fail the lookup by returning BPF_DROP (1), or
> let the lookup continue as usual by returning BPF_OK (0). Other return values
> are treated the same as BPF_OK.

I'd prefer if other values were treated as BPF_DROP, with other semantics
unchanged. Otherwise we won't be able to introduce new semantics
without potentially breaking user code.

>
> This lets the user match packets with listening sockets freely at the last
> possible point on the receive path, where we know that packets are destined
> for local delivery after undergoing policing, filtering, and routing.
>
> With BPF code selecting the socket, directing packets destined to an IP
> range or to a port range to a single socket becomes possible.
>
> In case multiple programs are attached, they are run in series in the order
> in which they were attached. The end result is determined from the return
> code of each program according to the following rules.
>
>  1. If any program returned BPF_REDIRECT and selected a valid socket, this
>     socket will be used as result of the lookup.
>  2. If more than one program returned BPF_REDIRECT and selected a socket,
>     last selection takes effect.
>  3. If any program returned BPF_DROP and none returned BPF_REDIRECT, the
>     socket lookup will fail with -ECONNREFUSED.
>  4. If no program returned either BPF_DROP or BPF_REDIRECT, the socket
>     lookup continues on to the htable-based lookup.
>
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>
> Notes:
>     v3:
>     - Use a static_key to minimize the hook overhead when not used. (Alexei)
>     - Adapt for running an array of attached programs. (Alexei)
>     - Adapt for optionally skipping reuseport selection. (Martin)
>
>  include/linux/bpf.h        | 29 ++++++++++++++++++++++++++++
>  include/linux/filter.h     | 39 ++++++++++++++++++++++++++++++++++++++
>  kernel/bpf/net_namespace.c | 32 ++++++++++++++++++++++++++++++-
>  net/core/filter.c          |  2 ++
>  net/ipv4/inet_hashtables.c | 31 ++++++++++++++++++++++++++++++
>  5 files changed, 132 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 26bc70533db0..98f79d39eaa1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1013,6 +1013,35 @@ _out:                                                    \
>                 _ret;                                   \
>         })
>
> +/* Runner for BPF_SK_LOOKUP programs to invoke on socket lookup.
> + *
> + * Valid return codes for SK_LOOKUP programs are:
> + * - BPF_REDIRECT (7) to use selected socket as result of the lookup,
> + * - BPF_DROP (1) to fail the socket lookup with no result,
> + * - BPF_OK (0) to continue on to regular htable-based socket lookup.
> + *
> + * The runner returns a u32 value that has a bit set for each code
> + * returned by any of the programs. Bit position corresponds to the
> + * return code.
> + *
> + * Caller must ensure that array is non-NULL.
> + */
> +#define BPF_PROG_SK_LOOKUP_RUN_ARRAY(array, ctx, func)         \
> +       ({                                                      \
> +               struct bpf_prog_array_item *_item;              \
> +               struct bpf_prog *_prog;                         \
> +               u32 _bit, _ret = 0;                             \
> +               migrate_disable();                              \
> +               _item = &(array)->items[0];                     \
> +               while ((_prog = READ_ONCE(_item->prog))) {      \
> +                       _bit = func(_prog, ctx);                \
> +                       _ret |= 1U << (_bit & 31);              \
> +                       _item++;                                \
> +               }                                               \
> +               migrate_enable();                               \
> +               _ret;                                           \
> +        })
> +
>  #define BPF_PROG_RUN_ARRAY(array, ctx, func)           \
>         __BPF_PROG_RUN_ARRAY(array, ctx, func, false)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index ba4f8595fa54..ff7721d862c2 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1297,4 +1297,43 @@ struct bpf_sk_lookup_kern {
>         bool            no_reuseport;
>  };
>
> +extern struct static_key_false bpf_sk_lookup_enabled;
> +
> +static inline bool bpf_sk_lookup_run_v4(struct net *net, int protocol,
> +                                       const __be32 saddr, const __be16 sport,
> +                                       const __be32 daddr, const u16 dport,
> +                                       struct sock **psk)
> +{
> +       struct bpf_prog_array *run_array;
> +       bool do_reuseport = false;
> +       struct sock *sk = NULL;
> +
> +       rcu_read_lock();
> +       run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]);
> +       if (run_array) {
> +               const struct bpf_sk_lookup_kern ctx = {
> +                       .family         = AF_INET,
> +                       .protocol       = protocol,
> +                       .v4.saddr       = saddr,
> +                       .v4.daddr       = daddr,
> +                       .sport          = sport,
> +                       .dport          = dport,
> +               };
> +               u32 ret;
> +
> +               ret = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, &ctx,
> +                                                  BPF_PROG_RUN);
> +               if (ret & (1U << BPF_REDIRECT)) {
> +                       sk = ctx.selected_sk;
> +                       do_reuseport = sk && !ctx.no_reuseport;
> +               } else if (ret & (1U << BPF_DROP)) {
> +                       sk = ERR_PTR(-ECONNREFUSED);
> +               }
> +       }
> +       rcu_read_unlock();
> +
> +       *psk = sk;
> +       return do_reuseport;
> +}
> +
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
> index 090166824ca4..a7768feb3ade 100644
> --- a/kernel/bpf/net_namespace.c
> +++ b/kernel/bpf/net_namespace.c
> @@ -25,6 +25,28 @@ struct bpf_netns_link {
>  /* Protects updates to netns_bpf */
>  DEFINE_MUTEX(netns_bpf_mutex);
>
> +static void netns_bpf_attach_type_disable(enum netns_bpf_attach_type type)

Nit: maybe netns_bpf_attach_type_dec()? Disable sounds like it happens
unconditionally.

> +{
> +       switch (type) {
> +       case NETNS_BPF_SK_LOOKUP:
> +               static_branch_dec(&bpf_sk_lookup_enabled);
> +               break;
> +       default:
> +               break;
> +       }
> +}
> +
> +static void netns_bpf_attach_type_enable(enum netns_bpf_attach_type type)
> +{
> +       switch (type) {
> +       case NETNS_BPF_SK_LOOKUP:
> +               static_branch_inc(&bpf_sk_lookup_enabled);
> +               break;
> +       default:
> +               break;
> +       }
> +}
> +
>  /* Must be called with netns_bpf_mutex held. */
>  static void netns_bpf_run_array_detach(struct net *net,
>                                        enum netns_bpf_attach_type type)
> @@ -93,6 +115,9 @@ static void bpf_netns_link_release(struct bpf_link *link)
>         if (!net)
>                 goto out_unlock;
>
> +       /* Mark attach point as unused */
> +       netns_bpf_attach_type_disable(type);
> +
>         /* Remember link position in case of safe delete */
>         idx = link_index(net, type, net_link);
>         list_del(&net_link->node);
> @@ -416,6 +441,9 @@ static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
>                                         lockdep_is_held(&netns_bpf_mutex));
>         bpf_prog_array_free(run_array);
>
> +       /* Mark attach point as used */
> +       netns_bpf_attach_type_enable(type);
> +
>  out_unlock:
>         mutex_unlock(&netns_bpf_mutex);
>         return err;
> @@ -491,8 +519,10 @@ static void __net_exit netns_bpf_pernet_pre_exit(struct net *net)
>         mutex_lock(&netns_bpf_mutex);
>         for (type = 0; type < MAX_NETNS_BPF_ATTACH_TYPE; type++) {
>                 netns_bpf_run_array_detach(net, type);
> -               list_for_each_entry(net_link, &net->bpf.links[type], node)
> +               list_for_each_entry(net_link, &net->bpf.links[type], node) {
>                         net_link->net = NULL; /* auto-detach link */
> +                       netns_bpf_attach_type_disable(type);
> +               }
>                 if (net->bpf.progs[type])
>                         bpf_prog_put(net->bpf.progs[type]);
>         }
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 286f90e0c824..c0146977a6d1 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9220,6 +9220,8 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
>  const struct bpf_prog_ops sk_reuseport_prog_ops = {
>  };
>
> +DEFINE_STATIC_KEY_FALSE(bpf_sk_lookup_enabled);
> +
>  BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
>            struct sock *, sk, u64, flags)
>  {
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index ab64834837c8..2b1fc194efaf 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -299,6 +299,29 @@ static struct sock *inet_lhash2_lookup(struct net *net,
>         return result;
>  }
>
> +static inline struct sock *inet_lookup_run_bpf(struct net *net,
> +                                              struct inet_hashinfo *hashinfo,
> +                                              struct sk_buff *skb, int doff,
> +                                              __be32 saddr, __be16 sport,
> +                                              __be32 daddr, u16 hnum)
> +{
> +       struct sock *sk, *reuse_sk;
> +       bool do_reuseport;
> +
> +       if (hashinfo != &tcp_hashinfo)
> +               return NULL; /* only TCP is supported */
> +
> +       do_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_TCP,
> +                                           saddr, sport, daddr, hnum, &sk);
> +       if (do_reuseport) {
> +               reuse_sk = lookup_reuseport(net, sk, skb, doff,
> +                                           saddr, sport, daddr, hnum);
> +               if (reuse_sk)
> +                       sk = reuse_sk;
> +       }
> +       return sk;
> +}
> +
>  struct sock *__inet_lookup_listener(struct net *net,
>                                     struct inet_hashinfo *hashinfo,
>                                     struct sk_buff *skb, int doff,
> @@ -310,6 +333,14 @@ struct sock *__inet_lookup_listener(struct net *net,
>         struct sock *result = NULL;
>         unsigned int hash2;
>
> +       /* Lookup redirect from BPF */
> +       if (static_branch_unlikely(&bpf_sk_lookup_enabled)) {
> +               result = inet_lookup_run_bpf(net, hashinfo, skb, doff,
> +                                            saddr, sport, daddr, hnum);
> +               if (result)
> +                       goto done;
> +       }
> +
>         hash2 = ipv4_portaddr_hash(net, daddr, hnum);
>         ilb2 = inet_lhash2_bucket(hashinfo, hash2);
>
> --
> 2.25.4
>


-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: [PATCH bpf-next v3 16/16] selftests/bpf: Tests for BPF_SK_LOOKUP attach point
  2020-07-02  9:24 ` [PATCH bpf-next v3 16/16] selftests/bpf: Tests for BPF_SK_LOOKUP attach point Jakub Sitnicki
@ 2020-07-02 11:01   ` Lorenz Bauer
  2020-07-02 12:59     ` Jakub Sitnicki
  0 siblings, 1 reply; 51+ messages in thread
From: Lorenz Bauer @ 2020-07-02 11:01 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, 2 Jul 2020 at 10:24, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Add tests to test_progs that exercise:
>
>  - attaching/detaching/querying programs to BPF_SK_LOOKUP hook,
>  - redirecting socket lookup to a socket selected by BPF program,
>  - failing a socket lookup on BPF program's request,
>  - error scenarios for selecting a socket from BPF program,
>  - accessing BPF program context,
>  - attaching and running multiple BPF programs.
>
> Run log:
> | # ./test_progs -n 68
> | #68/1 query lookup prog:OK
> | #68/2 TCP IPv4 redir port:OK
> | #68/3 TCP IPv4 redir addr:OK
> | #68/4 TCP IPv4 redir with reuseport:OK
> | #68/5 TCP IPv4 redir skip reuseport:OK
> | #68/6 TCP IPv6 redir port:OK
> | #68/7 TCP IPv6 redir addr:OK
> | #68/8 TCP IPv4->IPv6 redir port:OK
> | #68/9 TCP IPv6 redir with reuseport:OK
> | #68/10 TCP IPv6 redir skip reuseport:OK
> | #68/11 UDP IPv4 redir port:OK
> | #68/12 UDP IPv4 redir addr:OK
> | #68/13 UDP IPv4 redir with reuseport:OK
> | #68/14 UDP IPv4 redir skip reuseport:OK
> | #68/15 UDP IPv6 redir port:OK
> | #68/16 UDP IPv6 redir addr:OK
> | #68/17 UDP IPv4->IPv6 redir port:OK
> | #68/18 UDP IPv6 redir and reuseport:OK
> | #68/19 UDP IPv6 redir skip reuseport:OK
> | #68/20 TCP IPv4 drop on lookup:OK
> | #68/21 TCP IPv6 drop on lookup:OK
> | #68/22 UDP IPv4 drop on lookup:OK
> | #68/23 UDP IPv6 drop on lookup:OK
> | #68/24 TCP IPv4 drop on reuseport:OK
> | #68/25 TCP IPv6 drop on reuseport:OK
> | #68/26 UDP IPv4 drop on reuseport:OK
> | #68/27 TCP IPv6 drop on reuseport:OK
> | #68/28 sk_assign returns EEXIST:OK
> | #68/29 sk_assign honors F_REPLACE:OK
> | #68/30 access ctx->sk:OK
> | #68/31 sk_assign rejects TCP established:OK
> | #68/32 sk_assign rejects UDP connected:OK
> | #68/33 multi prog - pass, pass:OK
> | #68/34 multi prog - pass, inval:OK
> | #68/35 multi prog - inval, pass:OK
> | #68/36 multi prog - drop, drop:OK
> | #68/37 multi prog - pass, drop:OK
> | #68/38 multi prog - drop, pass:OK
> | #68/39 multi prog - drop, inval:OK
> | #68/40 multi prog - inval, drop:OK
> | #68/41 multi prog - pass, redir:OK
> | #68/42 multi prog - redir, pass:OK
> | #68/43 multi prog - drop, redir:OK
> | #68/44 multi prog - redir, drop:OK
> | #68/45 multi prog - inval, redir:OK
> | #68/46 multi prog - redir, inval:OK
> | #68/47 multi prog - redir, redir:OK
> | #68 sk_lookup:OK
> | Summary: 1/47 PASSED, 0 SKIPPED, 0 FAILED
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>
> Notes:
>     v3:
>     - Extend tests to cover new functionality in v3:
>       - multi-prog attachments (query, running, verdict precedence)
>       - socket selecting for the second time with bpf_sk_assign
>       - skipping over reuseport load-balancing
>
>     v2:
>      - Adjust for fields renames in struct bpf_sk_lookup.
>
>  .../selftests/bpf/prog_tests/sk_lookup.c      | 1353 +++++++++++++++++
>  .../selftests/bpf/progs/test_sk_lookup_kern.c |  399 +++++
>  2 files changed, 1752 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> new file mode 100644
> index 000000000000..2859dc7e65b0
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> @@ -0,0 +1,1353 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> +// Copyright (c) 2020 Cloudflare
> +/*
> + * Test BPF attach point for INET socket lookup (BPF_SK_LOOKUP).
> + *
> + * Tests exercise:
> + *  - attaching/detaching/querying programs to BPF_SK_LOOKUP hook,
> + *  - redirecting socket lookup to a socket selected by BPF program,
> + *  - failing a socket lookup on BPF program's request,
> + *  - error scenarios for selecting a socket from BPF program,
> + *  - accessing BPF program context,
> + *  - attaching and running multiple BPF programs.
> + *
> + * Tests run in a dedicated network namespace.
> + */
> +
> +#define _GNU_SOURCE
> +#include <arpa/inet.h>
> +#include <assert.h>
> +#include <errno.h>
> +#include <error.h>
> +#include <fcntl.h>
> +#include <sched.h>
> +#include <stdio.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <unistd.h>
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/bpf.h>
> +
> +#include "bpf_rlimit.h"
> +#include "bpf_util.h"
> +#include "cgroup_helpers.h"
> +#include "test_sk_lookup_kern.skel.h"
> +#include "test_progs.h"
> +
> +/* External (address, port) pairs the client sends packets to. */
> +#define EXT_IP4                "127.0.0.1"
> +#define EXT_IP6                "fd00::1"
> +#define EXT_PORT       7007
> +
> +/* Internal (address, port) pairs the server listens/receives at. */
> +#define INT_IP4                "127.0.0.2"
> +#define INT_IP4_V6     "::ffff:127.0.0.2"
> +#define INT_IP6                "fd00::2"
> +#define INT_PORT       8008
> +
> +#define IO_TIMEOUT_SEC 3
> +
> +enum server {
> +       SERVER_A = 0,
> +       SERVER_B = 1,
> +       MAX_SERVERS,
> +};
> +
> +enum {
> +       PROG1 = 0,
> +       PROG2,
> +};
> +
> +struct inet_addr {
> +       const char *ip;
> +       unsigned short port;
> +};
> +
> +struct test {
> +       const char *desc;
> +       struct bpf_program *lookup_prog;
> +       struct bpf_program *reuseport_prog;
> +       struct bpf_map *sock_map;
> +       int sotype;
> +       struct inet_addr connect_to;
> +       struct inet_addr listen_at;
> +       enum server accept_on;
> +};
> +
> +static bool is_ipv6(const char *ip)
> +{
> +       return !!strchr(ip, ':');
> +}
> +
> +static int make_addr(const char *ip, int port, struct sockaddr_storage *addr)
> +{
> +       struct sockaddr_in6 *addr6 = (void *)addr;
> +       struct sockaddr_in *addr4 = (void *)addr;
> +       int ret;
> +
> +       errno = 0;
> +       if (is_ipv6(ip)) {
> +               ret = inet_pton(AF_INET6, ip, &addr6->sin6_addr);
> +               if (CHECK_FAIL(ret <= 0)) {
> +                       log_err("failed to convert IPv6 address '%s'", ip);
> +                       return -1;
> +               }
> +               addr6->sin6_family = AF_INET6;
> +               addr6->sin6_port = htons(port);
> +       } else {
> +               ret = inet_pton(AF_INET, ip, &addr4->sin_addr);
> +               if (CHECK_FAIL(ret <= 0)) {
> +                       log_err("failed to convert IPv4 address '%s'", ip);
> +                       return -1;
> +               }
> +               addr4->sin_family = AF_INET;
> +               addr4->sin_port = htons(port);
> +       }
> +       return 0;
> +}
> +
> +static int setup_reuseport_prog(int sock_fd, struct bpf_program *reuseport_prog)
> +{
> +       int err, prog_fd;
> +
> +       prog_fd = bpf_program__fd(reuseport_prog);
> +       if (prog_fd < 0) {
> +               errno = -prog_fd;
> +               log_err("failed to get fd for program '%s'",
> +                       bpf_program__name(reuseport_prog));
> +               return -1;
> +       }
> +
> +       err = setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
> +                        &prog_fd, sizeof(prog_fd));
> +       if (CHECK_FAIL(err)) {
> +               log_err("failed to ATTACH_REUSEPORT_EBPF");
> +               return -1;
> +       }
> +
> +       return 0;
> +}
> +
> +static socklen_t inetaddr_len(const struct sockaddr_storage *addr)
> +{
> +       return (addr->ss_family == AF_INET ? sizeof(struct sockaddr_in) :
> +               addr->ss_family == AF_INET6 ? sizeof(struct sockaddr_in6) : 0);
> +}
> +
> +static int make_socket_with_addr(int sotype, const char *ip, int port,
> +                                struct sockaddr_storage *addr)
> +{
> +       struct timeval timeo = { .tv_sec = IO_TIMEOUT_SEC };
> +       int err, fd;
> +
> +       err = make_addr(ip, port, addr);
> +       if (err)
> +               return -1;
> +
> +       fd = socket(addr->ss_family, sotype, 0);
> +       if (CHECK_FAIL(fd < 0)) {
> +               log_err("failed to create socket");
> +               return -1;
> +       }
> +
> +       err = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeo, sizeof(timeo));
> +       if (CHECK_FAIL(err)) {
> +               log_err("failed to set SO_SNDTIMEO");
> +               close(fd);
> +               return -1;
> +       }
> +
> +       err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeo, sizeof(timeo));
> +       if (CHECK_FAIL(err)) {
> +               log_err("failed to set SO_RCVTIMEO");
> +               close(fd);
> +               return -1;
> +       }
> +
> +       return fd;
> +}
> +
> +static int make_server(int sotype, const char *ip, int port,
> +                      struct bpf_program *reuseport_prog)
> +{
> +       struct sockaddr_storage addr = {0};
> +       const int one = 1;
> +       int err, fd = -1;
> +
> +       fd = make_socket_with_addr(sotype, ip, port, &addr);
> +       if (fd < 0)
> +               return -1;
> +
> +       /* Enabled also on UDPv6 sockets, so that IPv4-mapped IPv6
> +        * traffic reports its original destination address.
> +        */
> +       if (sotype == SOCK_DGRAM) {
> +               err = setsockopt(fd, SOL_IP, IP_RECVORIGDSTADDR, &one,
> +                                sizeof(one));
> +               if (CHECK_FAIL(err)) {
> +                       log_err("failed to enable IP_RECVORIGDSTADDR");
> +                       goto fail;
> +               }
> +       }
> +
> +       if (sotype == SOCK_DGRAM && addr.ss_family == AF_INET6) {
> +               err = setsockopt(fd, SOL_IPV6, IPV6_RECVORIGDSTADDR, &one,
> +                                sizeof(one));
> +               if (CHECK_FAIL(err)) {
> +                       log_err("failed to enable IPV6_RECVORIGDSTADDR");
> +                       goto fail;
> +               }
> +       }
> +
> +       if (sotype == SOCK_STREAM) {
> +               err = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one,
> +                                sizeof(one));
> +               if (CHECK_FAIL(err)) {
> +                       log_err("failed to enable SO_REUSEADDR");
> +                       goto fail;
> +               }
> +       }
> +
> +       if (reuseport_prog) {
> +               err = setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one,
> +                                sizeof(one));
> +               if (CHECK_FAIL(err)) {
> +                       log_err("failed to enable SO_REUSEPORT");
> +                       goto fail;
> +               }
> +       }
> +
> +       err = bind(fd, (void *)&addr, inetaddr_len(&addr));
> +       if (CHECK_FAIL(err)) {
> +               log_err("failed to bind socket");
> +               goto fail;
> +       }
> +
> +       if (sotype == SOCK_STREAM) {
> +               err = listen(fd, SOMAXCONN);
> +               if (CHECK_FAIL(err)) {
> +                       log_err("failed to listen on port %d", port);
> +                       goto fail;
> +               }
> +       }
> +
> +       /* Late attach reuseport prog so we can have one init path */
> +       if (reuseport_prog) {
> +               err = setup_reuseport_prog(fd, reuseport_prog);
> +               if (err)
> +                       goto fail;
> +       }
> +
> +       return fd;
> +fail:
> +       close(fd);
> +       return -1;
> +}
> +
> +static int make_client(int sotype, const char *ip, int port)
> +{
> +       struct sockaddr_storage addr = {0};
> +       int err, fd;
> +
> +       fd = make_socket_with_addr(sotype, ip, port, &addr);
> +       if (fd < 0)
> +               return -1;
> +
> +       err = connect(fd, (void *)&addr, inetaddr_len(&addr));
> +       if (CHECK_FAIL(err)) {
> +               log_err("failed to connect client socket");
> +               goto fail;
> +       }
> +
> +       return fd;
> +fail:
> +       close(fd);
> +       return -1;
> +}
> +
> +static int send_byte(int fd)
> +{
> +       ssize_t n;
> +
> +       errno = 0;
> +       n = send(fd, "a", 1, 0);
> +       if (CHECK_FAIL(n <= 0)) {
> +               log_err("failed/partial send");
> +               return -1;
> +       }
> +       return 0;
> +}
> +
> +static int recv_byte(int fd)
> +{
> +       char buf[1];
> +       ssize_t n;
> +
> +       n = recv(fd, buf, sizeof(buf), 0);
> +       if (CHECK_FAIL(n <= 0)) {
> +               log_err("failed/partial recv");
> +               return -1;
> +       }
> +       return 0;
> +}
> +
> +static int tcp_recv_send(int server_fd)
> +{
> +       char buf[1];
> +       int ret, fd;
> +       ssize_t n;
> +
> +       fd = accept(server_fd, NULL, NULL);
> +       if (CHECK_FAIL(fd < 0)) {
> +               log_err("failed to accept");
> +               return -1;
> +       }
> +
> +       n = recv(fd, buf, sizeof(buf), 0);
> +       if (CHECK_FAIL(n <= 0)) {
> +               log_err("failed/partial recv");
> +               ret = -1;
> +               goto close;
> +       }
> +
> +       n = send(fd, buf, n, 0);
> +       if (CHECK_FAIL(n <= 0)) {
> +               log_err("failed/partial send");
> +               ret = -1;
> +               goto close;
> +       }
> +
> +       ret = 0;
> +close:
> +       close(fd);
> +       return ret;
> +}
> +
> +static void v4_to_v6(struct sockaddr_storage *ss)
> +{
> +       struct sockaddr_in6 *v6 = (struct sockaddr_in6 *)ss;
> +       struct sockaddr_in v4 = *(struct sockaddr_in *)ss;
> +
> +       v6->sin6_family = AF_INET6;
> +       v6->sin6_port = v4.sin_port;
> +       v6->sin6_addr.s6_addr[10] = 0xff;
> +       v6->sin6_addr.s6_addr[11] = 0xff;
> +       memcpy(&v6->sin6_addr.s6_addr[12], &v4.sin_addr.s_addr, 4);
> +}
> +
> +static int udp_recv_send(int server_fd)
> +{
> +       char cmsg_buf[CMSG_SPACE(sizeof(struct sockaddr_storage))];
> +       struct sockaddr_storage _src_addr = { 0 };
> +       struct sockaddr_storage *src_addr = &_src_addr;
> +       struct sockaddr_storage *dst_addr = NULL;
> +       struct msghdr msg = { 0 };
> +       struct iovec iov = { 0 };
> +       struct cmsghdr *cm;
> +       char buf[1];
> +       int ret, fd;
> +       ssize_t n;
> +
> +       iov.iov_base = buf;
> +       iov.iov_len = sizeof(buf);
> +
> +       msg.msg_name = src_addr;
> +       msg.msg_namelen = sizeof(*src_addr);
> +       msg.msg_iov = &iov;
> +       msg.msg_iovlen = 1;
> +       msg.msg_control = cmsg_buf;
> +       msg.msg_controllen = sizeof(cmsg_buf);
> +
> +       errno = 0;
> +       n = recvmsg(server_fd, &msg, 0);
> +       if (CHECK_FAIL(n <= 0)) {
> +               log_err("failed to receive");
> +               return -1;
> +       }
> +       if (CHECK_FAIL(msg.msg_flags & MSG_CTRUNC)) {
> +               log_err("truncated cmsg");
> +               return -1;
> +       }
> +
> +       for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
> +               if ((cm->cmsg_level == SOL_IP &&
> +                    cm->cmsg_type == IP_ORIGDSTADDR) ||
> +                   (cm->cmsg_level == SOL_IPV6 &&
> +                    cm->cmsg_type == IPV6_ORIGDSTADDR)) {
> +                       dst_addr = (struct sockaddr_storage *)CMSG_DATA(cm);
> +                       break;
> +               }
> +               log_err("warning: ignored cmsg at level %d type %d",
> +                       cm->cmsg_level, cm->cmsg_type);
> +       }
> +       if (CHECK_FAIL(!dst_addr)) {
> +               log_err("failed to get destination address");
> +               return -1;
> +       }
> +
> +       /* Server socket bound to IPv4-mapped IPv6 address */
> +       if (src_addr->ss_family == AF_INET6 &&
> +           dst_addr->ss_family == AF_INET) {
> +               v4_to_v6(dst_addr);
> +       }
> +
> +       /* Reply from original destination address. */
> +       fd = socket(dst_addr->ss_family, SOCK_DGRAM, 0);
> +       if (CHECK_FAIL(fd < 0)) {
> +               log_err("failed to create tx socket");
> +               return -1;
> +       }
> +
> +       ret = bind(fd, (struct sockaddr *)dst_addr, sizeof(*dst_addr));
> +       if (CHECK_FAIL(ret)) {
> +               log_err("failed to bind tx socket");
> +               goto out;
> +       }
> +
> +       msg.msg_control = NULL;
> +       msg.msg_controllen = 0;
> +       n = sendmsg(fd, &msg, 0);
> +       if (CHECK_FAIL(n <= 0)) {
> +               log_err("failed to send echo reply");
> +               ret = -1;
> +               goto out;
> +       }
> +
> +       ret = 0;
> +out:
> +       close(fd);
> +       return ret;
> +}
> +
> +static int tcp_echo_test(int client_fd, int server_fd)
> +{
> +       int err;
> +
> +       err = send_byte(client_fd);
> +       if (err)
> +               return -1;
> +       err = tcp_recv_send(server_fd);
> +       if (err)
> +               return -1;
> +       err = recv_byte(client_fd);
> +       if (err)
> +               return -1;
> +
> +       return 0;
> +}
> +
> +static int udp_echo_test(int client_fd, int server_fd)
> +{
> +       int err;
> +
> +       err = send_byte(client_fd);
> +       if (err)
> +               return -1;
> +       err = udp_recv_send(server_fd);
> +       if (err)
> +               return -1;
> +       err = recv_byte(client_fd);
> +       if (err)
> +               return -1;
> +
> +       return 0;
> +}
> +
> +static struct bpf_link *attach_lookup_prog(struct bpf_program *prog)
> +{
> +       struct bpf_link *link;
> +       int net_fd;
> +
> +       net_fd = open("/proc/self/ns/net", O_RDONLY);
> +       if (CHECK_FAIL(net_fd < 0)) {
> +               log_err("failed to open /proc/self/ns/net");
> +               return NULL;
> +       }
> +
> +       link = bpf_program__attach_netns(prog, net_fd);
> +       if (CHECK_FAIL(IS_ERR(link))) {
> +               errno = -PTR_ERR(link);
> +               log_err("failed to attach program '%s' to netns",
> +                       bpf_program__name(prog));
> +               link = NULL;
> +       }
> +
> +       close(net_fd);
> +       return link;
> +}
> +
> +static int update_lookup_map(struct bpf_map *map, int index, int sock_fd)
> +{
> +       int err, map_fd;
> +       uint64_t value;
> +
> +       map_fd = bpf_map__fd(map);
> +       if (CHECK_FAIL(map_fd < 0)) {
> +               errno = -map_fd;
> +               log_err("failed to get map FD");
> +               return -1;
> +       }
> +
> +       value = (uint64_t)sock_fd;
> +       err = bpf_map_update_elem(map_fd, &index, &value, BPF_NOEXIST);
> +       if (CHECK_FAIL(err)) {
> +               log_err("failed to update redir_map @ %d", index);
> +               return -1;
> +       }
> +
> +       return 0;
> +}
> +
> +static __u32 link_info_prog_id(struct bpf_link *link)
> +{
> +       struct bpf_link_info info = {};
> +       __u32 info_len = sizeof(info);
> +       int link_fd, err;
> +
> +       link_fd = bpf_link__fd(link);
> +       if (CHECK_FAIL(link_fd < 0)) {
> +               errno = -link_fd;
> +               log_err("bpf_link__fd failed");
> +               return 0;
> +       }
> +
> +       err = bpf_obj_get_info_by_fd(link_fd, &info, &info_len);
> +       if (CHECK_FAIL(err || info_len != sizeof(info))) {
> +               log_err("bpf_obj_get_info_by_fd");
> +               return 0;
> +       }
> +
> +       return info.prog_id;
> +}
> +
> +static void query_lookup_prog(struct test_sk_lookup_kern *skel)
> +{
> +       struct bpf_link *link[3] = {};
> +       __u32 attach_flags = 0;
> +       __u32 prog_ids[3] = {};
> +       __u32 prog_cnt = 3;
> +       __u32 prog_id;
> +       int net_fd;
> +       int err;
> +
> +       net_fd = open("/proc/self/ns/net", O_RDONLY);
> +       if (CHECK_FAIL(net_fd < 0)) {
> +               log_err("failed to open /proc/self/ns/net");
> +               return;
> +       }
> +
> +       link[0] = attach_lookup_prog(skel->progs.lookup_pass);
> +       if (!link[0])
> +               goto close;
> +       link[1] = attach_lookup_prog(skel->progs.lookup_pass);
> +       if (!link[1])
> +               goto detach;
> +       link[2] = attach_lookup_prog(skel->progs.lookup_drop);
> +       if (!link[2])
> +               goto detach;
> +
> +       err = bpf_prog_query(net_fd, BPF_SK_LOOKUP, 0 /* query flags */,
> +                            &attach_flags, prog_ids, &prog_cnt);
> +       if (CHECK_FAIL(err)) {
> +               log_err("failed to query lookup prog");
> +               goto detach;
> +       }
> +
> +       system("/home/jkbs/src/linux/tools/bpf/bpftool/bpftool link show");

This is to make sure that I read all of the tests as well? ;P

> +
> +       errno = 0;
> +       if (CHECK_FAIL(attach_flags != 0)) {
> +               log_err("wrong attach_flags on query: %u", attach_flags);
> +               goto detach;
> +       }
> +       if (CHECK_FAIL(prog_cnt != 3)) {
> +               log_err("wrong program count on query: %u", prog_cnt);
> +               goto detach;
> +       }
> +       prog_id = link_info_prog_id(link[0]);
> +       if (CHECK_FAIL(prog_ids[0] != prog_id)) {
> +               log_err("invalid program id on query: %u != %u",
> +                       prog_ids[0], prog_id);
> +               goto detach;
> +       }
> +       prog_id = link_info_prog_id(link[1]);
> +       if (CHECK_FAIL(prog_ids[1] != prog_id)) {
> +               log_err("invalid program id on query: %u != %u",
> +                       prog_ids[1], prog_id);
> +               goto detach;
> +       }
> +       prog_id = link_info_prog_id(link[2]);
> +       if (CHECK_FAIL(prog_ids[2] != prog_id)) {
> +               log_err("invalid program id on query: %u != %u",
> +                       prog_ids[2], prog_id);
> +               goto detach;
> +       }
> +
> +detach:
> +       if (link[2])
> +               bpf_link__destroy(link[2]);
> +       if (link[1])
> +               bpf_link__destroy(link[1]);
> +       if (link[0])
> +               bpf_link__destroy(link[0]);
> +close:
> +       close(net_fd);
> +}
> +
> +static void run_lookup_prog(const struct test *t)
> +{
> +       int server_fds[MAX_SERVERS] = { [0 ... MAX_SERVERS - 1] = -1 };
> +       int client_fd;
> +       struct bpf_link *lookup_link;
> +       int i, err;
> +
> +       lookup_link = attach_lookup_prog(t->lookup_prog);
> +       if (!lookup_link)

Why doesn't this fail the test? Same for the other error paths in the
function, and the other helpers.

> +               return;
> +
> +       for (i = 0; i < ARRAY_SIZE(server_fds); i++) {
> +               server_fds[i] = make_server(t->sotype, t->listen_at.ip,
> +                                           t->listen_at.port,
> +                                           t->reuseport_prog);
> +               if (server_fds[i] < 0)
> +                       goto close;
> +
> +               err = update_lookup_map(t->sock_map, i, server_fds[i]);
> +               if (err)
> +                       goto close;
> +
> +               /* want just one server for non-reuseport test */
> +               if (!t->reuseport_prog)
> +                       break;
> +       }
> +
> +       client_fd = make_client(t->sotype, t->connect_to.ip, t->connect_to.port);
> +       if (client_fd < 0)
> +               goto close;
> +
> +       if (t->sotype == SOCK_STREAM)
> +               tcp_echo_test(client_fd, server_fds[t->accept_on]);
> +       else
> +               udp_echo_test(client_fd, server_fds[t->accept_on]);
> +
> +       close(client_fd);
> +close:
> +       for (i = 0; i < ARRAY_SIZE(server_fds); i++) {
> +               if (server_fds[i] != -1)
> +                       close(server_fds[i]);
> +       }
> +       bpf_link__destroy(lookup_link);
> +}
> +
> +static void test_redirect_lookup(struct test_sk_lookup_kern *skel)
> +{
> +       const struct test tests[] = {
> +               {
> +                       .desc           = "TCP IPv4 redir port",
> +                       .lookup_prog    = skel->progs.redir_port,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { EXT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "TCP IPv4 redir addr",
> +                       .lookup_prog    = skel->progs.redir_ip4,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "TCP IPv4 redir with reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a,
> +                       .reuseport_prog = skel->progs.select_sock_b,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +                       .accept_on      = SERVER_B,
> +               },
> +               {
> +                       .desc           = "TCP IPv4 redir skip reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a_no_reuseport,
> +                       .reuseport_prog = skel->progs.select_sock_b,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +                       .accept_on      = SERVER_A,
> +               },
> +               {
> +                       .desc           = "TCP IPv6 redir port",
> +                       .lookup_prog    = skel->progs.redir_port,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { EXT_IP6, INT_PORT },
> +               },
> +               {
> +                       .desc           = "TCP IPv6 redir addr",
> +                       .lookup_prog    = skel->progs.redir_ip6,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { INT_IP6, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "TCP IPv4->IPv6 redir port",
> +                       .lookup_prog    = skel->progs.redir_port,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4_V6, INT_PORT },
> +               },
> +               {
> +                       .desc           = "TCP IPv6 redir with reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a,
> +                       .reuseport_prog = skel->progs.select_sock_b,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { INT_IP6, INT_PORT },
> +                       .accept_on      = SERVER_B,
> +               },
> +               {
> +                       .desc           = "TCP IPv6 redir skip reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a_no_reuseport,
> +                       .reuseport_prog = skel->progs.select_sock_b,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { INT_IP6, INT_PORT },
> +                       .accept_on      = SERVER_A,
> +               },
> +               {
> +                       .desc           = "UDP IPv4 redir port",
> +                       .lookup_prog    = skel->progs.redir_port,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { EXT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "UDP IPv4 redir addr",
> +                       .lookup_prog    = skel->progs.redir_ip4,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "UDP IPv4 redir with reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a,
> +                       .reuseport_prog = skel->progs.select_sock_b,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +                       .accept_on      = SERVER_B,
> +               },
> +               {
> +                       .desc           = "UDP IPv4 redir skip reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a_no_reuseport,
> +                       .reuseport_prog = skel->progs.select_sock_b,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +                       .accept_on      = SERVER_A,
> +               },
> +               {
> +                       .desc           = "UDP IPv6 redir port",
> +                       .lookup_prog    = skel->progs.redir_port,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { EXT_IP6, INT_PORT },
> +               },
> +               {
> +                       .desc           = "UDP IPv6 redir addr",
> +                       .lookup_prog    = skel->progs.redir_ip6,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { INT_IP6, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "UDP IPv4->IPv6 redir port",
> +                       .lookup_prog    = skel->progs.redir_port,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4_V6, INT_PORT },
> +               },
> +               {
> +                       .desc           = "UDP IPv6 redir and reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a,
> +                       .reuseport_prog = skel->progs.select_sock_b,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { INT_IP6, INT_PORT },
> +                       .accept_on      = SERVER_B,
> +               },
> +               {
> +                       .desc           = "UDP IPv6 redir skip reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a_no_reuseport,
> +                       .reuseport_prog = skel->progs.select_sock_b,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { INT_IP6, INT_PORT },
> +                       .accept_on      = SERVER_A,
> +               },
> +       };
> +       const struct test *t;
> +
> +       for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
> +               if (test__start_subtest(t->desc))
> +                       run_lookup_prog(t);
> +       }
> +}
> +
> +static void drop_on_lookup(const struct test *t)
> +{
> +       struct sockaddr_storage dst = {};
> +       int client_fd, server_fd, err;
> +       struct bpf_link *lookup_link;
> +       ssize_t n;
> +
> +       lookup_link = attach_lookup_prog(t->lookup_prog);
> +       if (!lookup_link)
> +               return;
> +
> +       server_fd = make_server(t->sotype, t->listen_at.ip, t->listen_at.port,
> +                               t->reuseport_prog);
> +       if (server_fd < 0)
> +               goto detach;
> +
> +       client_fd = make_socket_with_addr(t->sotype, t->connect_to.ip,
> +                                         t->connect_to.port, &dst);
> +       if (client_fd < 0)
> +               goto close_srv;
> +
> +       err = connect(client_fd, (void *)&dst, inetaddr_len(&dst));
> +       if (t->sotype == SOCK_DGRAM) {
> +               err = send_byte(client_fd);
> +               if (err)
> +                       goto close_all;
> +
> +               /* Read out asynchronous error */
> +               n = recv(client_fd, NULL, 0, 0);
> +               err = n == -1;
> +       }
> +       if (CHECK_FAIL(!err || errno != ECONNREFUSED))
> +               log_err("expected ECONNREFUSED on connect");
> +
> +close_all:
> +       close(client_fd);
> +close_srv:
> +       close(server_fd);
> +detach:
> +       bpf_link__destroy(lookup_link);
> +}
> +
> +static void test_drop_on_lookup(struct test_sk_lookup_kern *skel)
> +{
> +       const struct test tests[] = {
> +               {
> +                       .desc           = "TCP IPv4 drop on lookup",
> +                       .lookup_prog    = skel->progs.lookup_drop,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "TCP IPv6 drop on lookup",
> +                       .lookup_prog    = skel->progs.lookup_drop,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { EXT_IP6, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "UDP IPv4 drop on lookup",
> +                       .lookup_prog    = skel->progs.lookup_drop,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "UDP IPv6 drop on lookup",
> +                       .lookup_prog    = skel->progs.lookup_drop,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { EXT_IP6, EXT_PORT },
> +               },
> +       };
> +       const struct test *t;
> +
> +       for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
> +               if (test__start_subtest(t->desc))
> +                       drop_on_lookup(t);
> +       }
> +}
> +
> +static void drop_on_reuseport(const struct test *t)
> +{
> +       struct sockaddr_storage dst = { 0 };
> +       int client, server1, server2, err;
> +       struct bpf_link *lookup_link;
> +       ssize_t n;
> +
> +       lookup_link = attach_lookup_prog(t->lookup_prog);
> +       if (!lookup_link)
> +               return;
> +
> +       server1 = make_server(t->sotype, t->listen_at.ip, t->listen_at.port,
> +                             t->reuseport_prog);
> +       if (server1 < 0)
> +               goto detach;
> +
> +       err = update_lookup_map(t->sock_map, SERVER_A, server1);
> +       if (err)
> +               goto detach;
> +
> +       /* second server on destination address we should never reach */
> +       server2 = make_server(t->sotype, t->connect_to.ip, t->connect_to.port,
> +                             NULL /* reuseport prog */);
> +       if (server2 < 0)
> +               goto close_srv1;
> +
> +       client = make_socket_with_addr(t->sotype, t->connect_to.ip,
> +                                      t->connect_to.port, &dst);
> +       if (client < 0)
> +               goto close_srv2;
> +
> +       err = connect(client, (void *)&dst, inetaddr_len(&dst));
> +       if (t->sotype == SOCK_DGRAM) {
> +               err = send_byte(client);
> +               if (err)
> +                       goto close_all;
> +
> +               /* Read out asynchronous error */
> +               n = recv(client, NULL, 0, 0);
> +               err = n == -1;
> +       }
> +       if (CHECK_FAIL(!err || errno != ECONNREFUSED))
> +               log_err("expected ECONNREFUSED on connect");
> +
> +close_all:
> +       close(client);
> +close_srv2:
> +       close(server2);
> +close_srv1:
> +       close(server1);
> +detach:
> +       bpf_link__destroy(lookup_link);
> +}
> +
> +static void test_drop_on_reuseport(struct test_sk_lookup_kern *skel)
> +{
> +       const struct test tests[] = {
> +               {
> +                       .desc           = "TCP IPv4 drop on reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a,
> +                       .reuseport_prog = skel->progs.reuseport_drop,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "TCP IPv6 drop on reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a,
> +                       .reuseport_prog = skel->progs.reuseport_drop,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_STREAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { INT_IP6, INT_PORT },
> +               },
> +               {
> +                       .desc           = "UDP IPv4 drop on reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a,
> +                       .reuseport_prog = skel->progs.reuseport_drop,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP4, EXT_PORT },
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "UDP IPv6 drop on reuseport",
> +                       .lookup_prog    = skel->progs.select_sock_a,
> +                       .reuseport_prog = skel->progs.reuseport_drop,
> +                       .sock_map       = skel->maps.redir_map,
> +                       .sotype         = SOCK_DGRAM,
> +                       .connect_to     = { EXT_IP6, EXT_PORT },
> +                       .listen_at      = { INT_IP6, INT_PORT },
> +               },
> +       };
> +       const struct test *t;
> +
> +       for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
> +               if (test__start_subtest(t->desc))
> +                       drop_on_reuseport(t);
> +       }
> +}
> +
> +static void run_sk_assign(struct test_sk_lookup_kern *skel,
> +                         struct bpf_program *lookup_prog)
> +{
> +       int server_fds[] = { [0 ... MAX_SERVERS - 1] = -1 };
> +       int client_fd, peer_fd;
> +       struct bpf_link *lookup_link;
> +       int i, err;
> +
> +       lookup_link = attach_lookup_prog(lookup_prog);
> +       if (!lookup_link)
> +               return;
> +
> +       for (i = 0; i < ARRAY_SIZE(server_fds); i++) {
> +               server_fds[i] = make_server(SOCK_STREAM, INT_IP4, 0, NULL);
> +               if (server_fds[i] < 0)
> +                       goto close_servers;
> +
> +               err = update_lookup_map(skel->maps.redir_map, i,
> +                                       server_fds[i]);
> +               if (err)
> +                       goto close_servers;
> +       }
> +
> +       client_fd = make_client(SOCK_STREAM, EXT_IP4, EXT_PORT);
> +       if (client_fd < 0)
> +               goto close_servers;
> +
> +       peer_fd = accept(server_fds[SERVER_B], NULL, NULL);
> +       if (CHECK_FAIL(peer_fd < 0))
> +               goto close_client;
> +
> +       close(peer_fd);
> +close_client:
> +       close(client_fd);
> +close_servers:
> +       for (i = 0; i < ARRAY_SIZE(server_fds); i++) {
> +               if (server_fds[i] != -1)
> +                       close(server_fds[i]);
> +       }
> +       bpf_link__destroy(lookup_link);
> +}
> +
> +static void run_sk_assign_connected(struct test_sk_lookup_kern *skel,
> +                                   int sotype)
> +{
> +       int err, client_fd, connected_fd, server_fd;
> +       struct bpf_link *lookup_link;
> +
> +       server_fd = make_server(sotype, EXT_IP4, EXT_PORT, NULL);
> +       if (server_fd < 0)
> +               return;
> +
> +       connected_fd = make_client(sotype, EXT_IP4, EXT_PORT);
> +       if (connected_fd < 0)
> +               goto out_close_server;
> +
> +       /* Put a connected socket in redirect map */
> +       err = update_lookup_map(skel->maps.redir_map, SERVER_A, connected_fd);
> +       if (err)
> +               goto out_close_connected;
> +
> +       lookup_link = attach_lookup_prog(skel->progs.sk_assign_esocknosupport);
> +       if (!lookup_link)
> +               goto out_close_connected;
> +
> +       /* Try to redirect TCP SYN / UDP packet to a connected socket */
> +       client_fd = make_client(sotype, EXT_IP4, EXT_PORT);
> +       if (client_fd < 0)
> +               goto out_unlink_prog;
> +       if (sotype == SOCK_DGRAM) {
> +               send_byte(client_fd);
> +               recv_byte(server_fd);
> +       }
> +
> +       close(client_fd);
> +out_unlink_prog:
> +       bpf_link__destroy(lookup_link);
> +out_close_connected:
> +       close(connected_fd);
> +out_close_server:
> +       close(server_fd);
> +}
> +
> +static void test_sk_assign_helper(struct test_sk_lookup_kern *skel)
> +{
> +       if (test__start_subtest("sk_assign returns EEXIST"))
> +               run_sk_assign(skel, skel->progs.sk_assign_eexist);
> +       if (test__start_subtest("sk_assign honors F_REPLACE"))
> +               run_sk_assign(skel, skel->progs.sk_assign_replace_flag);
> +       if (test__start_subtest("access ctx->sk"))
> +               run_sk_assign(skel, skel->progs.access_ctx_sk);
> +       if (test__start_subtest("sk_assign rejects TCP established"))
> +               run_sk_assign_connected(skel, SOCK_STREAM);
> +       if (test__start_subtest("sk_assign rejects UDP connected"))
> +               run_sk_assign_connected(skel, SOCK_DGRAM);
> +}
> +
> +struct test_multi_prog {
> +       const char *desc;
> +       struct bpf_program *prog1;
> +       struct bpf_program *prog2;
> +       struct bpf_map *redir_map;
> +       struct bpf_map *run_map;
> +       int expect_errno;
> +       struct inet_addr listen_at;
> +};
> +
> +static void run_multi_prog_lookup(const struct test_multi_prog *t)
> +{
> +       struct sockaddr_storage dst = {};
> +       int map_fd, server_fd, client_fd;
> +       struct bpf_link *link1, *link2;
> +       int prog_idx, done, err;
> +
> +       map_fd = bpf_map__fd(t->run_map);
> +
> +       done = 0;
> +       prog_idx = PROG1;
> +       CHECK_FAIL(bpf_map_update_elem(map_fd, &prog_idx, &done, BPF_ANY));
> +       prog_idx = PROG2;
> +       CHECK_FAIL(bpf_map_update_elem(map_fd, &prog_idx, &done, BPF_ANY));
> +
> +       link1 = attach_lookup_prog(t->prog1);
> +       if (!link1)
> +               return;
> +       link2 = attach_lookup_prog(t->prog2);
> +       if (!link2)
> +               goto out_unlink1;
> +
> +       server_fd = make_server(SOCK_STREAM, t->listen_at.ip,
> +                               t->listen_at.port, NULL);
> +       if (server_fd < 0)
> +               goto out_unlink2;
> +
> +       err = update_lookup_map(t->redir_map, SERVER_A, server_fd);
> +       if (err)
> +               goto out_close_server;
> +
> +       client_fd = make_socket_with_addr(SOCK_STREAM, EXT_IP4, EXT_PORT,
> +                                         &dst);
> +       if (client_fd < 0)
> +               goto out_close_server;
> +
> +       err = connect(client_fd, (void *)&dst, inetaddr_len(&dst));
> +       if (CHECK_FAIL(err && !t->expect_errno))
> +               goto out_close_client;
> +       if (CHECK_FAIL(err && t->expect_errno && errno != t->expect_errno))
> +               goto out_close_client;
> +
> +       done = 0;
> +       prog_idx = PROG1;
> +       CHECK_FAIL(bpf_map_lookup_elem(map_fd, &prog_idx, &done));
> +       CHECK_FAIL(!done);
> +
> +       done = 0;
> +       prog_idx = PROG2;
> +       CHECK_FAIL(bpf_map_lookup_elem(map_fd, &prog_idx, &done));
> +       CHECK_FAIL(!done);
> +
> +out_close_client:
> +       close(client_fd);
> +out_close_server:
> +       close(server_fd);
> +out_unlink2:
> +       bpf_link__destroy(link2);
> +out_unlink1:
> +       bpf_link__destroy(link1);
> +}
> +
> +static void test_multi_prog_lookup(struct test_sk_lookup_kern *skel)
> +{
> +       struct test_multi_prog tests[] = {
> +               {
> +                       .desc           = "multi prog - pass, pass",
> +                       .prog1          = skel->progs.multi_prog_pass1,
> +                       .prog2          = skel->progs.multi_prog_pass2,
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "multi prog - pass, inval",
> +                       .prog1          = skel->progs.multi_prog_pass1,
> +                       .prog2          = skel->progs.multi_prog_inval2,
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "multi prog - inval, pass",
> +                       .prog1          = skel->progs.multi_prog_inval1,
> +                       .prog2          = skel->progs.multi_prog_pass2,
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +               },
> +               {
> +                       .desc           = "multi prog - drop, drop",
> +                       .prog1          = skel->progs.multi_prog_drop1,
> +                       .prog2          = skel->progs.multi_prog_drop2,
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +                       .expect_errno   = ECONNREFUSED,
> +               },
> +               {
> +                       .desc           = "multi prog - pass, drop",
> +                       .prog1          = skel->progs.multi_prog_pass1,
> +                       .prog2          = skel->progs.multi_prog_drop2,
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +                       .expect_errno   = ECONNREFUSED,
> +               },
> +               {
> +                       .desc           = "multi prog - drop, pass",
> +                       .prog1          = skel->progs.multi_prog_drop1,
> +                       .prog2          = skel->progs.multi_prog_pass2,
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +                       .expect_errno   = ECONNREFUSED,
> +               },
> +               {
> +                       .desc           = "multi prog - drop, inval",
> +                       .prog1          = skel->progs.multi_prog_drop1,
> +                       .prog2          = skel->progs.multi_prog_inval2,
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +                       .expect_errno   = ECONNREFUSED,
> +               },
> +               {
> +                       .desc           = "multi prog - inval, drop",
> +                       .prog1          = skel->progs.multi_prog_inval1,
> +                       .prog2          = skel->progs.multi_prog_drop2,
> +                       .listen_at      = { EXT_IP4, EXT_PORT },
> +                       .expect_errno   = ECONNREFUSED,
> +               },
> +               {
> +                       .desc           = "multi prog - pass, redir",
> +                       .prog1          = skel->progs.multi_prog_pass1,
> +                       .prog2          = skel->progs.multi_prog_redir2,
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "multi prog - redir, pass",
> +                       .prog1          = skel->progs.multi_prog_redir1,
> +                       .prog2          = skel->progs.multi_prog_pass2,
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "multi prog - drop, redir",
> +                       .prog1          = skel->progs.multi_prog_drop1,
> +                       .prog2          = skel->progs.multi_prog_redir2,
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "multi prog - redir, drop",
> +                       .prog1          = skel->progs.multi_prog_redir1,
> +                       .prog2          = skel->progs.multi_prog_drop2,
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "multi prog - inval, redir",
> +                       .prog1          = skel->progs.multi_prog_inval1,
> +                       .prog2          = skel->progs.multi_prog_redir2,
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "multi prog - redir, inval",
> +                       .prog1          = skel->progs.multi_prog_redir1,
> +                       .prog2          = skel->progs.multi_prog_inval2,
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +               },
> +               {
> +                       .desc           = "multi prog - redir, redir",
> +                       .prog1          = skel->progs.multi_prog_redir1,
> +                       .prog2          = skel->progs.multi_prog_redir2,
> +                       .listen_at      = { INT_IP4, INT_PORT },
> +               },
> +       };
> +       struct test_multi_prog *t;
> +
> +       for (t = tests; t < tests + ARRAY_SIZE(tests); t++) {
> +               t->redir_map = skel->maps.redir_map;
> +               t->run_map = skel->maps.run_map;
> +               if (test__start_subtest(t->desc))
> +                       run_multi_prog_lookup(t);
> +       }
> +}
> +
> +static void run_tests(struct test_sk_lookup_kern *skel)
> +{
> +       if (test__start_subtest("query lookup prog"))
> +               query_lookup_prog(skel);
> +       test_redirect_lookup(skel);
> +       test_drop_on_lookup(skel);
> +       test_drop_on_reuseport(skel);
> +       test_sk_assign_helper(skel);
> +       test_multi_prog_lookup(skel);
> +}
> +
> +static int switch_netns(int *saved_net)
> +{
> +       static const char * const setup_script[] = {
> +               "ip -6 addr add dev lo " EXT_IP6 "/128 nodad",
> +               "ip -6 addr add dev lo " INT_IP6 "/128 nodad",
> +               "ip link set dev lo up",
> +               NULL,
> +       };
> +       const char * const *cmd;
> +       int net_fd, err;
> +
> +       net_fd = open("/proc/self/ns/net", O_RDONLY);
> +       if (CHECK_FAIL(net_fd < 0)) {
> +               log_err("open(/proc/self/ns/net)");
> +               return -1;
> +       }
> +
> +       err = unshare(CLONE_NEWNET);
> +       if (CHECK_FAIL(err)) {
> +               log_err("unshare(CLONE_NEWNET)");
> +               goto close;
> +       }
> +
> +       for (cmd = setup_script; *cmd; cmd++) {
> +               err = system(*cmd);
> +               if (CHECK_FAIL(err)) {
> +                       log_err("system(%s)", *cmd);
> +                       goto close;
> +               }
> +       }
> +
> +       *saved_net = net_fd;
> +       return 0;
> +
> +close:
> +       close(net_fd);
> +       return -1;
> +}
> +
> +static void restore_netns(int saved_net)
> +{
> +       int err;
> +
> +       err = setns(saved_net, CLONE_NEWNET);
> +       if (CHECK_FAIL(err))
> +               log_err("setns(CLONE_NEWNET)");
> +
> +       close(saved_net);
> +}
> +
> +void test_sk_lookup(void)
> +{
> +       struct test_sk_lookup_kern *skel;
> +       int err, saved_net;
> +
> +       err = switch_netns(&saved_net);
> +       if (err)
> +               return;
> +
> +       skel = test_sk_lookup_kern__open_and_load();
> +       if (CHECK_FAIL(!skel)) {
> +               errno = 0;
> +               log_err("failed to open and load BPF skeleton");
> +               goto restore_netns;
> +       }
> +
> +       run_tests(skel);
> +
> +       test_sk_lookup_kern__destroy(skel);
> +restore_netns:
> +       restore_netns(saved_net);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c b/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
> new file mode 100644
> index 000000000000..75745898fd3b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
> @@ -0,0 +1,399 @@
> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
> +// Copyright (c) 2020 Cloudflare
> +
> +#include <errno.h>
> +#include <linux/bpf.h>
> +#include <sys/socket.h>
> +
> +#include <bpf/bpf_endian.h>
> +#include <bpf/bpf_helpers.h>
> +
> +#define IP4(a, b, c, d)                                        \
> +       bpf_htonl((((__u32)(a) & 0xffU) << 24) |        \
> +                 (((__u32)(b) & 0xffU) << 16) |        \
> +                 (((__u32)(c) & 0xffU) <<  8) |        \
> +                 (((__u32)(d) & 0xffU) <<  0))
> +#define IP6(aaaa, bbbb, cccc, dddd)                    \
> +       { bpf_htonl(aaaa), bpf_htonl(bbbb), bpf_htonl(cccc), bpf_htonl(dddd) }
> +
> +#define MAX_SOCKS 32
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_SOCKMAP);
> +       __uint(max_entries, MAX_SOCKS);
> +       __type(key, __u32);
> +       __type(value, __u64);
> +} redir_map SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_ARRAY);
> +       __uint(max_entries, 2);
> +       __type(key, int);
> +       __type(value, int);
> +} run_map SEC(".maps");
> +
> +enum {
> +       PROG1 = 0,
> +       PROG2,
> +};
> +
> +enum {
> +       SERVER_A = 0,
> +       SERVER_B,
> +};
> +
> +/* Addressable key/value constants for convenience */
> +static const int KEY_PROG1 = PROG1;
> +static const int KEY_PROG2 = PROG2;
> +static const int PROG_DONE = 1;
> +
> +static const __u32 KEY_SERVER_A = SERVER_A;
> +static const __u32 KEY_SERVER_B = SERVER_B;
> +
> +static const __u32 DST_PORT = 7007;
> +static const __u32 DST_IP4 = IP4(127, 0, 0, 1);
> +static const __u32 DST_IP6[] = IP6(0xfd000000, 0x0, 0x0, 0x00000001);
> +
> +SEC("sk_lookup/lookup_pass")
> +int lookup_pass(struct bpf_sk_lookup *ctx)
> +{
> +       return BPF_OK;
> +}
> +
> +SEC("sk_lookup/lookup_drop")
> +int lookup_drop(struct bpf_sk_lookup *ctx)
> +{
> +       return BPF_DROP;
> +}
> +
> +SEC("sk_reuseport/reuse_pass")
> +int reuseport_pass(struct sk_reuseport_md *ctx)
> +{
> +       return SK_PASS;
> +}
> +
> +SEC("sk_reuseport/reuse_drop")
> +int reuseport_drop(struct sk_reuseport_md *ctx)
> +{
> +       return SK_DROP;
> +}
> +
> +/* Redirect packets destined for port DST_PORT to socket at redir_map[0]. */
> +SEC("sk_lookup/redir_port")
> +int redir_port(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err;
> +
> +       if (ctx->local_port != DST_PORT)
> +               return BPF_OK;
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               return BPF_OK;
> +
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       bpf_sk_release(sk);
> +       return err ? BPF_DROP : BPF_REDIRECT;
> +}
> +
> +/* Redirect packets destined for DST_IP4 address to socket at redir_map[0]. */
> +SEC("sk_lookup/redir_ip4")
> +int redir_ip4(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err;
> +
> +       if (ctx->family != AF_INET)
> +               return BPF_OK;
> +       if (ctx->local_port != DST_PORT)
> +               return BPF_OK;
> +       if (ctx->local_ip4 != DST_IP4)
> +               return BPF_OK;
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               return BPF_OK;
> +
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       bpf_sk_release(sk);
> +       return err ? BPF_DROP : BPF_REDIRECT;
> +}
> +
> +/* Redirect packets destined for DST_IP6 address to socket at redir_map[0]. */
> +SEC("sk_lookup/redir_ip6")
> +int redir_ip6(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err;
> +
> +       if (ctx->family != AF_INET6)
> +               return BPF_OK;
> +       if (ctx->local_port != DST_PORT)
> +               return BPF_OK;
> +       if (ctx->local_ip6[0] != DST_IP6[0] ||
> +           ctx->local_ip6[1] != DST_IP6[1] ||
> +           ctx->local_ip6[2] != DST_IP6[2] ||
> +           ctx->local_ip6[3] != DST_IP6[3])
> +               return BPF_OK;
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               return BPF_OK;
> +
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       bpf_sk_release(sk);
> +       return err ? BPF_DROP : BPF_REDIRECT;
> +}
> +
> +SEC("sk_lookup/select_sock_a")
> +int select_sock_a(struct bpf_sk_lookup *ctx)

Nit: you could have a function marked __always_inline, say
select_sock_helper(ctx, key, flags),
and then call that from select_sock_a, select_sock_a_no_reuseport, etc.
That would help cut down on code duplication.
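For illustration, the suggested shared helper might look like the sketch below (names hypothetical; assumes the selftest's redir_map and the usual bpf_helpers.h definitions; not standalone-buildable outside the selftest):

```c
/* Hypothetical shared helper; __always_inline keeps the verifier happy
 * since it sees one flat program per section. */
static __always_inline int
select_sock_helper(struct bpf_sk_lookup *ctx, const __u32 *key, __u64 flags)
{
	struct bpf_sock *sk;
	int err;

	sk = bpf_map_lookup_elem(&redir_map, (void *)key);
	if (!sk)
		return BPF_OK;

	err = bpf_sk_assign(ctx, sk, flags);
	bpf_sk_release(sk);
	return err ? BPF_DROP : BPF_REDIRECT;
}

SEC("sk_lookup/select_sock_a")
int select_sock_a(struct bpf_sk_lookup *ctx)
{
	return select_sock_helper(ctx, &KEY_SERVER_A, 0);
}
```

One wrinkle: select_sock_a_no_reuseport returns BPF_DROP rather than BPF_OK when the map lookup misses, so the helper would also need a miss-verdict parameter (or a second variant) to cover all callers.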

> +{
> +       struct bpf_sock *sk;
> +       int err;
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               return BPF_OK;
> +
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       bpf_sk_release(sk);
> +       return err ? BPF_DROP : BPF_REDIRECT;
> +}
> +
> +SEC("sk_lookup/select_sock_a_no_reuseport")
> +int select_sock_a_no_reuseport(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err;
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               return BPF_DROP;
> +
> +       err = bpf_sk_assign(ctx, sk, BPF_SK_LOOKUP_F_NO_REUSEPORT);
> +       bpf_sk_release(sk);
> +       return err ? BPF_DROP : BPF_REDIRECT;
> +}
> +
> +SEC("sk_reuseport/select_sock_b")
> +int select_sock_b(struct sk_reuseport_md *ctx)
> +{
> +       __u32 key = KEY_SERVER_B;
> +       int err;
> +
> +       err = bpf_sk_select_reuseport(ctx, &redir_map, &key, 0);
> +       return err ? SK_DROP : SK_PASS;
> +}
> +
> +/* Check that bpf_sk_assign() returns -EEXIST if socket already selected. */
> +SEC("sk_lookup/sk_assign_eexist")
> +int sk_assign_eexist(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err, ret;
> +
> +       ret = BPF_DROP;
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_B);
> +       if (!sk)
> +               goto out;
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       if (err)
> +               goto out;
> +       bpf_sk_release(sk);
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               goto out;
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       if (err != -EEXIST) {
> +               bpf_printk("sk_assign returned %d, expected %d\n",
> +                          err, -EEXIST);
> +               goto out;
> +       }
> +
> +       ret = BPF_REDIRECT; /* Success, redirect to KEY_SERVER_B */
> +out:
> +       if (sk)
> +               bpf_sk_release(sk);
> +       return ret;
> +}
> +
> +/* Check that bpf_sk_assign(BPF_SK_LOOKUP_F_REPLACE) can override selection. */
> +SEC("sk_lookup/sk_assign_replace_flag")
> +int sk_assign_replace_flag(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err, ret;
> +
> +       ret = BPF_DROP;
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               goto out;
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       if (err)
> +               goto out;
> +       bpf_sk_release(sk);
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_B);
> +       if (!sk)
> +               goto out;
> +       err = bpf_sk_assign(ctx, sk, BPF_SK_LOOKUP_F_REPLACE);
> +       if (err) {
> +               bpf_printk("sk_assign returned %d, expected 0\n", err);
> +               goto out;
> +       }
> +
> +       ret = BPF_REDIRECT; /* Success, redirect to KEY_SERVER_B */
> +out:
> +       if (sk)
> +               bpf_sk_release(sk);
> +       return ret;
> +}
> +
> +/* Check that the selected socket is accessible through the context. */
> +SEC("sk_lookup/access_ctx_sk")
> +int access_ctx_sk(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err, ret;
> +
> +       ret = BPF_DROP;
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               goto out;
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       if (err)
> +               goto out;
> +       if (sk != ctx->sk) {
> +               bpf_printk("expected ctx->sk == KEY_SERVER_A\n");
> +               goto out;
> +       }
> +       bpf_sk_release(sk);
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_B);
> +       if (!sk)
> +               goto out;
> +       err = bpf_sk_assign(ctx, sk, BPF_SK_LOOKUP_F_REPLACE);
> +       if (err)
> +               goto out;
> +       if (sk != ctx->sk) {
> +               bpf_printk("expected ctx->sk == KEY_SERVER_B\n");
> +               goto out;
> +       }
> +
> +       ret = BPF_REDIRECT; /* Success, redirect to KEY_SERVER_B */
> +out:
> +       if (sk)
> +               bpf_sk_release(sk);
> +       return ret;
> +}
> +
> +/* Check that sk_assign rejects KEY_SERVER_A socket with -ESOCKTNOSUPPORT */
> +SEC("sk_lookup/sk_assign_esocknosupport")
> +int sk_assign_esocknosupport(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err, ret;
> +
> +       ret = BPF_DROP;
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               goto out;
> +
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       if (err != -ESOCKTNOSUPPORT) {
> +               bpf_printk("sk_assign returned %d, expected %d\n",
> +                          err, -ESOCKTNOSUPPORT);
> +               goto out;
> +       }
> +
> +       ret = BPF_OK; /* Success, pass to regular lookup */
> +out:
> +       if (sk)
> +               bpf_sk_release(sk);
> +       return ret;
> +}
> +
> +SEC("sk_lookup/multi_prog_pass1")
> +int multi_prog_pass1(struct bpf_sk_lookup *ctx)
> +{
> +       bpf_map_update_elem(&run_map, &KEY_PROG1, &PROG_DONE, BPF_ANY);
> +       return BPF_OK;
> +}
> +
> +SEC("sk_lookup/multi_prog_pass2")
> +int multi_prog_pass2(struct bpf_sk_lookup *ctx)
> +{
> +       bpf_map_update_elem(&run_map, &KEY_PROG2, &PROG_DONE, BPF_ANY);
> +       return BPF_OK;
> +}
> +
> +SEC("sk_lookup/multi_prog_drop1")
> +int multi_prog_drop1(struct bpf_sk_lookup *ctx)
> +{
> +       bpf_map_update_elem(&run_map, &KEY_PROG1, &PROG_DONE, BPF_ANY);
> +       return BPF_DROP;
> +}
> +
> +SEC("sk_lookup/multi_prog_drop2")
> +int multi_prog_drop2(struct bpf_sk_lookup *ctx)
> +{
> +       bpf_map_update_elem(&run_map, &KEY_PROG2, &PROG_DONE, BPF_ANY);
> +       return BPF_DROP;
> +}
> +
> +SEC("sk_lookup/multi_prog_inval1")
> +int multi_prog_inval1(struct bpf_sk_lookup *ctx)
> +{
> +       bpf_map_update_elem(&run_map, &KEY_PROG1, &PROG_DONE, BPF_ANY);
> +       return -1;
> +}
> +
> +SEC("sk_lookup/multi_prog_inval2")
> +int multi_prog_inval2(struct bpf_sk_lookup *ctx)
> +{
> +       bpf_map_update_elem(&run_map, &KEY_PROG2, &PROG_DONE, BPF_ANY);
> +       return -1;
> +}
> +
> +SEC("sk_lookup/multi_prog_redir1")
> +int multi_prog_redir1(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err;
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               return BPF_DROP;
> +
> +       err = bpf_sk_assign(ctx, sk, 0);
> +       bpf_sk_release(sk);
> +       if (err)
> +               return BPF_DROP;
> +
> +       bpf_map_update_elem(&run_map, &KEY_PROG1, &PROG_DONE, BPF_ANY);
> +       return BPF_REDIRECT;
> +}
> +
> +SEC("sk_lookup/multi_prog_redir2")
> +int multi_prog_redir2(struct bpf_sk_lookup *ctx)
> +{
> +       struct bpf_sock *sk;
> +       int err;
> +
> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
> +       if (!sk)
> +               return BPF_DROP;
> +
> +       err = bpf_sk_assign(ctx, sk, BPF_SK_LOOKUP_F_REPLACE);
> +       bpf_sk_release(sk);
> +       if (err)
> +               return BPF_DROP;
> +
> +       bpf_map_update_elem(&run_map, &KEY_PROG2, &PROG_DONE, BPF_ANY);
> +       return BPF_REDIRECT;
> +}
> +
> +char _license[] SEC("license") = "Dual BSD/GPL";
> +__u32 _version SEC("version") = 1;
> --
> 2.25.4
>


-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup
  2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
                   ` (15 preceding siblings ...)
  2020-07-02  9:24 ` [PATCH bpf-next v3 16/16] selftests/bpf: Tests for BPF_SK_LOOKUP attach point Jakub Sitnicki
@ 2020-07-02 11:05 ` Lorenz Bauer
  16 siblings, 0 replies; 51+ messages in thread
From: Lorenz Bauer @ 2020-07-02 11:05 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Andrii Nakryiko, Marek Majkowski, Martin KaFai Lau

On Thu, 2 Jul 2020 at 10:24, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Overview
> ========
>
> (Same as in v2. Please skip to next section if you've read it.)
>
> This series proposes a new BPF program type named BPF_PROG_TYPE_SK_LOOKUP,
> or BPF sk_lookup for short.
>
> BPF sk_lookup program runs when transport layer is looking up a listening
> socket for a new connection request (TCP), or when looking up an
> unconnected socket for a packet (UDP).
>
> This serves as a mechanism to overcome the limits of what bind() API allows
> to express. Two use-cases driving this work are:
>
>  (1) steer packets destined to an IP range, fixed port to a single socket
>
>      192.0.2.0/24, port 80 -> NGINX socket
>
>  (2) steer packets destined to an IP address, any port to a single socket
>
>      198.51.100.1, any port -> L7 proxy socket
>
> In its context, program receives information about the packet that
> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> address 4-tuple.
>
> To select a socket BPF program fetches it from a map holding socket
> references, like SOCKMAP or SOCKHASH, calls bpf_sk_assign(ctx, sk, ...)
> helper to record the selection, and returns BPF_REDIRECT code. Transport
> layer then uses the selected socket as a result of socket lookup.
>
> Alternatively, program can also fail the lookup (BPF_DROP), or let the
> lookup continue as usual (BPF_OK).
>
> This lets the user match packets with listening (TCP) or receiving (UDP)
> sockets freely at the last possible point on the receive path, where we
> know that packets are destined for local delivery after undergoing
> policing, filtering, and routing.
>
> Program is attached to a network namespace, similar to BPF flow_dissector.
> We add a new attach type, BPF_SK_LOOKUP, for this.
>
> Series structure
> ================
>
> Patches are organized as so:
>
>  1: enabled multiple link-based prog attachments for bpf-netns
>  2: introduces sk_lookup program type
>  3-4: hook up the program to run on ipv4/tcp socket lookup
>  5-6: hook up the program to run on ipv6/tcp socket lookup
>  7-8: hook up the program to run on ipv4/udp socket lookup
>  9-10: hook up the program to run on ipv6/udp socket lookup
>  11-13: libbpf & bpftool support for sk_lookup
>  14-16: verifier and selftests for sk_lookup
>
> Patches are also available on GH:
>
>   https://github.com/jsitnicki/linux/commits/bpf-inet-lookup-v3
>
> Performance considerations
> ==========================
>
> I'm re-running udp6 small packet flood test, the scenario for which we had
> performance concerns in [v2], to measure pps hit after the changes called
> out in change log below.
>
> Will follow up with results. But I'm posting the patches early for review
> since there is a fair amount of code changes.
>
> Further work
> ============
>
> - user docs for new prog type, Documentation/bpf/prog_sk_lookup.rst
>   I'm looking for consensus on multi-prog semantics outlined in patch #4
>   description before drafting the document.
>
> - timeout on accept() in tests
>   I need to extract a helper for it into network_helpers in
>   selftests/bpf/. Didn't want to make this series any longer.
>
> Note to maintainers
> ===================
>
> This patch series depends on bpf-netns multi-prog changes that went
> recently into 'bpf' [0]. It won't apply onto 'bpf-next' until 'bpf' gets
> merged into 'bpf-next'.
>
> Changelog
> =========
>
> v3 brings the following changes based on feedback:
>
> 1. switch to link-based program attachment,
> 2. support for multi-prog attachment,
> 3. ability to skip reuseport socket selection,
> 4. code on RX path is guarded by a static key,
> 5. struct in6_addr's are no longer copied into BPF prog context,
> 6. BPF prog context is initialized as late as possible.
>
> v2 -> v3:
> - Changes called out in patches 1-2, 4, 6, 8, 10-14, 16
> - Patches dropped:
>   01/17 flow_dissector: Extract attach/detach/query helpers
>   03/17 inet: Store layer 4 protocol in inet_hashinfo
>   08/17 udp: Store layer 4 protocol in udp_table
>
> v1 -> v2:
> - Changes called out in patches 2, 13-15, 17
> - Rebase to recent bpf-next (b4563facdcae)
>
> RFCv2 -> v1:
>
> - Switch to fetching a socket from a map and selecting a socket with
>   bpf_sk_assign, instead of having a dedicated helper that does both.
> - Run reuseport logic on sockets selected by BPF sk_lookup.
> - Allow BPF sk_lookup to fail the lookup with no match.
> - Go back to having just 2 hash table lookups in UDP.
>
> RFCv1 -> RFCv2:
>
> - Make socket lookup redirection map-based. BPF program now uses a
>   dedicated helper and a SOCKARRAY map to select the socket to redirect to.
>   A consequence of this change is that bpf_inet_lookup context is now
>   read-only.
> - Look for connected UDP sockets before allowing redirection from BPF.
>   This makes connected UDP sockets work as expected in the presence of
>   an inet_lookup prog.
> - Share the code for BPF_PROG_{ATTACH,DETACH,QUERY} with flow_dissector,
>   the only other per-netns BPF prog type.
>
> [RFCv1] https://lore.kernel.org/bpf/20190618130050.8344-1-jakub@cloudflare.com/
> [RFCv2] https://lore.kernel.org/bpf/20190828072250.29828-1-jakub@cloudflare.com/
> [v1] https://lore.kernel.org/bpf/20200511185218.1422406-18-jakub@cloudflare.com/
> [v2] https://lore.kernel.org/bpf/20200506125514.1020829-1-jakub@cloudflare.com/
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=951f38cf08350884e72e0936adf147a8d764cc5d
>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrii Nakryiko <andriin@fb.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Cc: Marek Majkowski <marek@cloudflare.com>
> Cc: Martin KaFai Lau <kafai@fb.com>
>
> Jakub Sitnicki (16):
>   bpf, netns: Handle multiple link attachments
>   bpf: Introduce SK_LOOKUP program type with a dedicated attach point
>   inet: Extract helper for selecting socket from reuseport group
>   inet: Run SK_LOOKUP BPF program on socket lookup
>   inet6: Extract helper for selecting socket from reuseport group
>   inet6: Run SK_LOOKUP BPF program on socket lookup
>   udp: Extract helper for selecting socket from reuseport group
>   udp: Run SK_LOOKUP BPF program on socket lookup
>   udp6: Extract helper for selecting socket from reuseport group
>   udp6: Run SK_LOOKUP BPF program on socket lookup
>   bpf: Sync linux/bpf.h to tools/
>   libbpf: Add support for SK_LOOKUP program type
>   tools/bpftool: Add name mappings for SK_LOOKUP prog and attach type
>   selftests/bpf: Add verifier tests for bpf_sk_lookup context access
>   selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c
>   selftests/bpf: Tests for BPF_SK_LOOKUP attach point

For the series:
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>


>
>  include/linux/bpf-netns.h                     |    3 +
>  include/linux/bpf.h                           |   33 +
>  include/linux/bpf_types.h                     |    2 +
>  include/linux/filter.h                        |   99 ++
>  include/uapi/linux/bpf.h                      |   74 +
>  kernel/bpf/core.c                             |   22 +
>  kernel/bpf/net_namespace.c                    |  125 +-
>  kernel/bpf/syscall.c                          |    9 +
>  net/core/filter.c                             |  188 +++
>  net/ipv4/inet_hashtables.c                    |   60 +-
>  net/ipv4/udp.c                                |   93 +-
>  net/ipv6/inet6_hashtables.c                   |   66 +-
>  net/ipv6/udp.c                                |   97 +-
>  scripts/bpf_helpers_doc.py                    |    9 +-
>  tools/bpf/bpftool/common.c                    |    1 +
>  tools/bpf/bpftool/prog.c                      |    3 +-
>  tools/include/uapi/linux/bpf.h                |   74 +
>  tools/lib/bpf/libbpf.c                        |    3 +
>  tools/lib/bpf/libbpf.h                        |    2 +
>  tools/lib/bpf/libbpf.map                      |    2 +
>  tools/lib/bpf/libbpf_probes.c                 |    3 +
>  .../bpf/prog_tests/reference_tracking.c       |    2 +-
>  .../selftests/bpf/prog_tests/sk_lookup.c      | 1353 +++++++++++++++++
>  .../selftests/bpf/progs/test_ref_track_kern.c |  181 +++
>  .../selftests/bpf/progs/test_sk_lookup_kern.c |  462 ++++--
>  .../selftests/bpf/verifier/ctx_sk_lookup.c    |  219 +++
>  26 files changed, 2995 insertions(+), 190 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_ref_track_kern.c
>  create mode 100644 tools/testing/selftests/bpf/verifier/ctx_sk_lookup.c
>
> --
> 2.25.4
>


--
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 04/16] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-07-02 10:27   ` Lorenz Bauer
@ 2020-07-02 12:46     ` Jakub Sitnicki
  2020-07-02 13:19       ` Lorenz Bauer
  0 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02 12:46 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Marek Majkowski

On Thu, Jul 02, 2020 at 12:27 PM CEST, Lorenz Bauer wrote:
> On Thu, 2 Jul 2020 at 10:24, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Run a BPF program before looking up a listening socket on the receive path.
>> The program selects a listening socket to yield as the result of the socket
>> lookup by calling the bpf_sk_assign() helper and returning the
>> BPF_REDIRECT (7) code.
>>
>> Alternatively, the program can fail the lookup by returning BPF_DROP (1),
>> or let the lookup continue as usual by returning BPF_OK (0). Other return
>> values are treated the same as BPF_OK.
>
> I'd prefer if other values were treated as BPF_DROP, with other semantics
> unchanged. Otherwise we won't be able to introduce new semantics
> without potentially breaking user code.

That might be surprising or even risky. If you attach a badly written
program that, say, returns a negative value, it will drop all TCP SYNs and
UDP traffic.

>
>>
>> This lets the user match packets with listening sockets freely at the last
>> possible point on the receive path, where we know that packets are destined
>> for local delivery after undergoing policing, filtering, and routing.
>>
>> With BPF code selecting the socket, directing packets destined to an IP
>> range or to a port range to a single socket becomes possible.
>>
>> In case multiple programs are attached, they are run in series in the order
>> in which they were attached. The end result is determined from the return
>> codes of all programs according to the following rules.
>>
>>  1. If any program returned BPF_REDIRECT and selected a valid socket, this
>>     socket will be used as result of the lookup.
>>  2. If more than one program returned BPF_REDIRECT and selected a socket,
>>     last selection takes effect.
>>  3. If any program returned BPF_DROP and none returned BPF_REDIRECT, the
>>     socket lookup will fail with -ECONNREFUSED.
>>  4. If no program returned either BPF_DROP or BPF_REDIRECT, socket lookup
>>     continues with the htable-based lookup.
>>
>> Suggested-by: Marek Majkowski <marek@cloudflare.com>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>
>> Notes:
>>     v3:
>>     - Use a static_key to minimize the hook overhead when not used. (Alexei)
>>     - Adapt for running an array of attached programs. (Alexei)
>>     - Adapt for optionally skipping reuseport selection. (Martin)
>>
>>  include/linux/bpf.h        | 29 ++++++++++++++++++++++++++++
>>  include/linux/filter.h     | 39 ++++++++++++++++++++++++++++++++++++++
>>  kernel/bpf/net_namespace.c | 32 ++++++++++++++++++++++++++++++-
>>  net/core/filter.c          |  2 ++
>>  net/ipv4/inet_hashtables.c | 31 ++++++++++++++++++++++++++++++
>>  5 files changed, 132 insertions(+), 1 deletion(-)
>>

[...]

>> diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
>> index 090166824ca4..a7768feb3ade 100644
>> --- a/kernel/bpf/net_namespace.c
>> +++ b/kernel/bpf/net_namespace.c
>> @@ -25,6 +25,28 @@ struct bpf_netns_link {
>>  /* Protects updates to netns_bpf */
>>  DEFINE_MUTEX(netns_bpf_mutex);
>>
>> +static void netns_bpf_attach_type_disable(enum netns_bpf_attach_type type)
>
> Nit: maybe netns_bpf_attach_type_dec()? Disable sounds like it happens
> unconditionally.

attach_type_dec()/_inc() seems a bit cryptic, since it's not the attach
type we are incrementing/decrementing.

But I was considering _need()/_unneed(), which would follow an existing
example, if you think that improves things.

>
>> +{
>> +       switch (type) {
>> +       case NETNS_BPF_SK_LOOKUP:
>> +               static_branch_dec(&bpf_sk_lookup_enabled);
>> +               break;
>> +       default:
>> +               break;
>> +       }
>> +}
>> +
>> +static void netns_bpf_attach_type_enable(enum netns_bpf_attach_type type)
>> +{
>> +       switch (type) {
>> +       case NETNS_BPF_SK_LOOKUP:
>> +               static_branch_inc(&bpf_sk_lookup_enabled);
>> +               break;
>> +       default:
>> +               break;
>> +       }
>> +}
>> +
>>  /* Must be called with netns_bpf_mutex held. */
>>  static void netns_bpf_run_array_detach(struct net *net,
>>                                        enum netns_bpf_attach_type type)
>> @@ -93,6 +115,9 @@ static void bpf_netns_link_release(struct bpf_link *link)
>>         if (!net)
>>                 goto out_unlock;
>>
>> +       /* Mark attach point as unused */
>> +       netns_bpf_attach_type_disable(type);
>> +
>>         /* Remember link position in case of safe delete */
>>         idx = link_index(net, type, net_link);
>>         list_del(&net_link->node);
>> @@ -416,6 +441,9 @@ static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
>>                                         lockdep_is_held(&netns_bpf_mutex));
>>         bpf_prog_array_free(run_array);
>>
>> +       /* Mark attach point as used */
>> +       netns_bpf_attach_type_enable(type);
>> +
>>  out_unlock:
>>         mutex_unlock(&netns_bpf_mutex);
>>         return err;
>> @@ -491,8 +519,10 @@ static void __net_exit netns_bpf_pernet_pre_exit(struct net *net)
>>         mutex_lock(&netns_bpf_mutex);
>>         for (type = 0; type < MAX_NETNS_BPF_ATTACH_TYPE; type++) {
>>                 netns_bpf_run_array_detach(net, type);
>> -               list_for_each_entry(net_link, &net->bpf.links[type], node)
>> +               list_for_each_entry(net_link, &net->bpf.links[type], node) {
>>                         net_link->net = NULL; /* auto-detach link */
>> +                       netns_bpf_attach_type_disable(type);
>> +               }
>>                 if (net->bpf.progs[type])
>>                         bpf_prog_put(net->bpf.progs[type]);
>>         }
>> diff --git a/net/core/filter.c b/net/core/filter.c

[...]


* Re: [PATCH bpf-next v3 16/16] selftests/bpf: Tests for BPF_SK_LOOKUP attach point
  2020-07-02 11:01   ` Lorenz Bauer
@ 2020-07-02 12:59     ` Jakub Sitnicki
  2020-07-09  4:28       ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-02 12:59 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, Jul 02, 2020 at 01:01 PM CEST, Lorenz Bauer wrote:
> On Thu, 2 Jul 2020 at 10:24, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Add tests to test_progs that exercise:
>>
>>  - attaching/detaching/querying programs to BPF_SK_LOOKUP hook,
>>  - redirecting socket lookup to a socket selected by BPF program,
>>  - failing a socket lookup on BPF program's request,
>>  - error scenarios for selecting a socket from BPF program,
>>  - accessing BPF program context,
>>  - attaching and running multiple BPF programs.
>>
>> Run log:
>> | # ./test_progs -n 68
>> | #68/1 query lookup prog:OK
>> | #68/2 TCP IPv4 redir port:OK
>> | #68/3 TCP IPv4 redir addr:OK
>> | #68/4 TCP IPv4 redir with reuseport:OK
>> | #68/5 TCP IPv4 redir skip reuseport:OK
>> | #68/6 TCP IPv6 redir port:OK
>> | #68/7 TCP IPv6 redir addr:OK
>> | #68/8 TCP IPv4->IPv6 redir port:OK
>> | #68/9 TCP IPv6 redir with reuseport:OK
>> | #68/10 TCP IPv6 redir skip reuseport:OK
>> | #68/11 UDP IPv4 redir port:OK
>> | #68/12 UDP IPv4 redir addr:OK
>> | #68/13 UDP IPv4 redir with reuseport:OK
>> | #68/14 UDP IPv4 redir skip reuseport:OK
>> | #68/15 UDP IPv6 redir port:OK
>> | #68/16 UDP IPv6 redir addr:OK
>> | #68/17 UDP IPv4->IPv6 redir port:OK
>> | #68/18 UDP IPv6 redir and reuseport:OK
>> | #68/19 UDP IPv6 redir skip reuseport:OK
>> | #68/20 TCP IPv4 drop on lookup:OK
>> | #68/21 TCP IPv6 drop on lookup:OK
>> | #68/22 UDP IPv4 drop on lookup:OK
>> | #68/23 UDP IPv6 drop on lookup:OK
>> | #68/24 TCP IPv4 drop on reuseport:OK
>> | #68/25 TCP IPv6 drop on reuseport:OK
>> | #68/26 UDP IPv4 drop on reuseport:OK
>> | #68/27 TCP IPv6 drop on reuseport:OK
>> | #68/28 sk_assign returns EEXIST:OK
>> | #68/29 sk_assign honors F_REPLACE:OK
>> | #68/30 access ctx->sk:OK
>> | #68/31 sk_assign rejects TCP established:OK
>> | #68/32 sk_assign rejects UDP connected:OK
>> | #68/33 multi prog - pass, pass:OK
>> | #68/34 multi prog - pass, inval:OK
>> | #68/35 multi prog - inval, pass:OK
>> | #68/36 multi prog - drop, drop:OK
>> | #68/37 multi prog - pass, drop:OK
>> | #68/38 multi prog - drop, pass:OK
>> | #68/39 multi prog - drop, inval:OK
>> | #68/40 multi prog - inval, drop:OK
>> | #68/41 multi prog - pass, redir:OK
>> | #68/42 multi prog - redir, pass:OK
>> | #68/43 multi prog - drop, redir:OK
>> | #68/44 multi prog - redir, drop:OK
>> | #68/45 multi prog - inval, redir:OK
>> | #68/46 multi prog - redir, inval:OK
>> | #68/47 multi prog - redir, redir:OK
>> | #68 sk_lookup:OK
>> | Summary: 1/47 PASSED, 0 SKIPPED, 0 FAILED
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>
>> Notes:
>>     v3:
>>     - Extend tests to cover new functionality in v3:
>>       - multi-prog attachments (query, running, verdict precedence)
>>       - socket selecting for the second time with bpf_sk_assign
>>       - skipping over reuseport load-balancing
>>
>>     v2:
>>      - Adjust for fields renames in struct bpf_sk_lookup.
>>
>>  .../selftests/bpf/prog_tests/sk_lookup.c      | 1353 +++++++++++++++++
>>  .../selftests/bpf/progs/test_sk_lookup_kern.c |  399 +++++
>>  2 files changed, 1752 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
>> new file mode 100644
>> index 000000000000..2859dc7e65b0
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c

[...]

>> +static void query_lookup_prog(struct test_sk_lookup_kern *skel)
>> +{
>> +       struct bpf_link *link[3] = {};
>> +       __u32 attach_flags = 0;
>> +       __u32 prog_ids[3] = {};
>> +       __u32 prog_cnt = 3;
>> +       __u32 prog_id;
>> +       int net_fd;
>> +       int err;
>> +
>> +       net_fd = open("/proc/self/ns/net", O_RDONLY);
>> +       if (CHECK_FAIL(net_fd < 0)) {
>> +               log_err("failed to open /proc/self/ns/net");
>> +               return;
>> +       }
>> +
>> +       link[0] = attach_lookup_prog(skel->progs.lookup_pass);
>> +       if (!link[0])
>> +               goto close;
>> +       link[1] = attach_lookup_prog(skel->progs.lookup_pass);
>> +       if (!link[1])
>> +               goto detach;
>> +       link[2] = attach_lookup_prog(skel->progs.lookup_drop);
>> +       if (!link[2])
>> +               goto detach;
>> +
>> +       err = bpf_prog_query(net_fd, BPF_SK_LOOKUP, 0 /* query flags */,
>> +                            &attach_flags, prog_ids, &prog_cnt);
>> +       if (CHECK_FAIL(err)) {
>> +               log_err("failed to query lookup prog");
>> +               goto detach;
>> +       }
>> +
>> +       system("/home/jkbs/src/linux/tools/bpf/bpftool/bpftool link show");
>
> This is to make sure that I read all of the tests as well? ;P

Ha! Yes!

Of course, my bad. A left-over from debugging a test I extended last
minute to cover prog query when multiple programs are attached.

Thanks for reading through it all, though.

>
>> +
>> +       errno = 0;
>> +       if (CHECK_FAIL(attach_flags != 0)) {
>> +               log_err("wrong attach_flags on query: %u", attach_flags);
>> +               goto detach;
>> +       }
>> +       if (CHECK_FAIL(prog_cnt != 3)) {
>> +               log_err("wrong program count on query: %u", prog_cnt);
>> +               goto detach;
>> +       }
>> +       prog_id = link_info_prog_id(link[0]);
>> +       if (CHECK_FAIL(prog_ids[0] != prog_id)) {
>> +               log_err("invalid program id on query: %u != %u",
>> +                       prog_ids[0], prog_id);
>> +               goto detach;
>> +       }
>> +       prog_id = link_info_prog_id(link[1]);
>> +       if (CHECK_FAIL(prog_ids[1] != prog_id)) {
>> +               log_err("invalid program id on query: %u != %u",
>> +                       prog_ids[1], prog_id);
>> +               goto detach;
>> +       }
>> +       prog_id = link_info_prog_id(link[2]);
>> +       if (CHECK_FAIL(prog_ids[2] != prog_id)) {
>> +               log_err("invalid program id on query: %u != %u",
>> +                       prog_ids[2], prog_id);
>> +               goto detach;
>> +       }
>> +
>> +detach:
>> +       if (link[2])
>> +               bpf_link__destroy(link[2]);
>> +       if (link[1])
>> +               bpf_link__destroy(link[1]);
>> +       if (link[0])
>> +               bpf_link__destroy(link[0]);
>> +close:
>> +       close(net_fd);
>> +}
>> +
>> +static void run_lookup_prog(const struct test *t)
>> +{
>> +       int client_fd, server_fds[MAX_SERVERS] = { -1 };
>> +       struct bpf_link *lookup_link;
>> +       int i, err;
>> +
>> +       lookup_link = attach_lookup_prog(t->lookup_prog);
>> +       if (!lookup_link)
>
> Why doesn't this fail the test? Same for the other error paths in the
> function, and the other helpers.

I took the approach of placing CHECK_FAIL checks only right after the
failure point, that is, after a syscall or a call to libbpf.

This way, if I'm calling a helper, I know it already fails the test if
anything goes wrong, and I can have fewer CHECK_FAILs peppered over the
code.

[...]

>> diff --git a/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c b/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
>> new file mode 100644
>> index 000000000000..75745898fd3b
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
>> @@ -0,0 +1,399 @@
>> +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
>> +// Copyright (c) 2020 Cloudflare
>> +
>> +#include <errno.h>
>> +#include <linux/bpf.h>
>> +#include <sys/socket.h>
>> +
>> +#include <bpf/bpf_endian.h>
>> +#include <bpf/bpf_helpers.h>
>> +
>> +#define IP4(a, b, c, d)                                        \
>> +       bpf_htonl((((__u32)(a) & 0xffU) << 24) |        \
>> +                 (((__u32)(b) & 0xffU) << 16) |        \
>> +                 (((__u32)(c) & 0xffU) <<  8) |        \
>> +                 (((__u32)(d) & 0xffU) <<  0))
>> +#define IP6(aaaa, bbbb, cccc, dddd)                    \
>> +       { bpf_htonl(aaaa), bpf_htonl(bbbb), bpf_htonl(cccc), bpf_htonl(dddd) }
>> +
>> +#define MAX_SOCKS 32
>> +
>> +struct {
>> +       __uint(type, BPF_MAP_TYPE_SOCKMAP);
>> +       __uint(max_entries, MAX_SOCKS);
>> +       __type(key, __u32);
>> +       __type(value, __u64);
>> +} redir_map SEC(".maps");
>> +
>> +struct {
>> +       __uint(type, BPF_MAP_TYPE_ARRAY);
>> +       __uint(max_entries, 2);
>> +       __type(key, int);
>> +       __type(value, int);
>> +} run_map SEC(".maps");
>> +
>> +enum {
>> +       PROG1 = 0,
>> +       PROG2,
>> +};
>> +
>> +enum {
>> +       SERVER_A = 0,
>> +       SERVER_B,
>> +};
>> +
>> +/* Addressable key/value constants for convenience */
>> +static const int KEY_PROG1 = PROG1;
>> +static const int KEY_PROG2 = PROG2;
>> +static const int PROG_DONE = 1;
>> +
>> +static const __u32 KEY_SERVER_A = SERVER_A;
>> +static const __u32 KEY_SERVER_B = SERVER_B;
>> +
>> +static const __u32 DST_PORT = 7007;
>> +static const __u32 DST_IP4 = IP4(127, 0, 0, 1);
>> +static const __u32 DST_IP6[] = IP6(0xfd000000, 0x0, 0x0, 0x00000001);
>> +
>> +SEC("sk_lookup/lookup_pass")
>> +int lookup_pass(struct bpf_sk_lookup *ctx)
>> +{
>> +       return BPF_OK;
>> +}
>> +
>> +SEC("sk_lookup/lookup_drop")
>> +int lookup_drop(struct bpf_sk_lookup *ctx)
>> +{
>> +       return BPF_DROP;
>> +}
>> +
>> +SEC("sk_reuseport/reuse_pass")
>> +int reuseport_pass(struct sk_reuseport_md *ctx)
>> +{
>> +       return SK_PASS;
>> +}
>> +
>> +SEC("sk_reuseport/reuse_drop")
>> +int reuseport_drop(struct sk_reuseport_md *ctx)
>> +{
>> +       return SK_DROP;
>> +}
>> +
>> +/* Redirect packets destined for port DST_PORT to socket at redir_map[0]. */
>> +SEC("sk_lookup/redir_port")
>> +int redir_port(struct bpf_sk_lookup *ctx)
>> +{
>> +       struct bpf_sock *sk;
>> +       int err;
>> +
>> +       if (ctx->local_port != DST_PORT)
>> +               return BPF_OK;
>> +
>> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
>> +       if (!sk)
>> +               return BPF_OK;
>> +
>> +       err = bpf_sk_assign(ctx, sk, 0);
>> +       bpf_sk_release(sk);
>> +       return err ? BPF_DROP : BPF_REDIRECT;
>> +}
>> +
>> +/* Redirect packets destined for DST_IP4 address to socket at redir_map[0]. */
>> +SEC("sk_lookup/redir_ip4")
>> +int redir_ip4(struct bpf_sk_lookup *ctx)
>> +{
>> +       struct bpf_sock *sk;
>> +       int err;
>> +
>> +       if (ctx->family != AF_INET)
>> +               return BPF_OK;
>> +       if (ctx->local_port != DST_PORT)
>> +               return BPF_OK;
>> +       if (ctx->local_ip4 != DST_IP4)
>> +               return BPF_OK;
>> +
>> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
>> +       if (!sk)
>> +               return BPF_OK;
>> +
>> +       err = bpf_sk_assign(ctx, sk, 0);
>> +       bpf_sk_release(sk);
>> +       return err ? BPF_DROP : BPF_REDIRECT;
>> +}
>> +
>> +/* Redirect packets destined for DST_IP6 address to socket at redir_map[0]. */
>> +SEC("sk_lookup/redir_ip6")
>> +int redir_ip6(struct bpf_sk_lookup *ctx)
>> +{
>> +       struct bpf_sock *sk;
>> +       int err;
>> +
>> +       if (ctx->family != AF_INET6)
>> +               return BPF_OK;
>> +       if (ctx->local_port != DST_PORT)
>> +               return BPF_OK;
>> +       if (ctx->local_ip6[0] != DST_IP6[0] ||
>> +           ctx->local_ip6[1] != DST_IP6[1] ||
>> +           ctx->local_ip6[2] != DST_IP6[2] ||
>> +           ctx->local_ip6[3] != DST_IP6[3])
>> +               return BPF_OK;
>> +
>> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
>> +       if (!sk)
>> +               return BPF_OK;
>> +
>> +       err = bpf_sk_assign(ctx, sk, 0);
>> +       bpf_sk_release(sk);
>> +       return err ? BPF_DROP : BPF_REDIRECT;
>> +}
>> +
>> +SEC("sk_lookup/select_sock_a")
>> +int select_sock_a(struct bpf_sk_lookup *ctx)
>
> Nit: you could have a function __force_inline__
> select_sock_helper(ctx, key, flags)
> and then call that from select_sock_a, select_sock_a_no_reuseport, etc.
> That might help cut down on code duplication.

I will play with that. Thanks for the idea.

Overall, I realize the tests could use more polishing. I was focusing on
coverage first to demonstrate correctness, but I am planning to improve
code sharing.

>
>> +{
>> +       struct bpf_sock *sk;
>> +       int err;
>> +
>> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
>> +       if (!sk)
>> +               return BPF_OK;
>> +
>> +       err = bpf_sk_assign(ctx, sk, 0);
>> +       bpf_sk_release(sk);
>> +       return err ? BPF_DROP : BPF_REDIRECT;
>> +}
>> +
>> +SEC("sk_lookup/select_sock_a_no_reuseport")
>> +int select_sock_a_no_reuseport(struct bpf_sk_lookup *ctx)
>> +{
>> +       struct bpf_sock *sk;
>> +       int err;
>> +
>> +       sk = bpf_map_lookup_elem(&redir_map, &KEY_SERVER_A);
>> +       if (!sk)
>> +               return BPF_DROP;
>> +
>> +       err = bpf_sk_assign(ctx, sk, BPF_SK_LOOKUP_F_NO_REUSEPORT);
>> +       bpf_sk_release(sk);
>> +       return err ? BPF_DROP : BPF_REDIRECT;
>> +}
>> +
>> +SEC("sk_reuseport/select_sock_b")
>> +int select_sock_b(struct sk_reuseport_md *ctx)
>> +{
>> +       __u32 key = KEY_SERVER_B;
>> +       int err;
>> +
>> +       err = bpf_sk_select_reuseport(ctx, &redir_map, &key, 0);
>> +       return err ? SK_DROP : SK_PASS;
>> +}
>> +

[...]


* Re: [PATCH bpf-next v3 04/16] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-07-02 12:46     ` Jakub Sitnicki
@ 2020-07-02 13:19       ` Lorenz Bauer
  2020-07-06 11:24         ` Jakub Sitnicki
  0 siblings, 1 reply; 51+ messages in thread
From: Lorenz Bauer @ 2020-07-02 13:19 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Marek Majkowski

On Thu, 2 Jul 2020 at 13:46, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Thu, Jul 02, 2020 at 12:27 PM CEST, Lorenz Bauer wrote:
> > On Thu, 2 Jul 2020 at 10:24, Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> Run a BPF program before looking up a listening socket on the receive path.
> >> The program selects a listening socket to yield as the result of the socket
> >> lookup by calling the bpf_sk_assign() helper and returning the
> >> BPF_REDIRECT (7) code.
> >>
> >> Alternatively, the program can fail the lookup by returning BPF_DROP (1),
> >> or let the lookup continue as usual by returning BPF_OK (0). Other return
> >> values are treated the same as BPF_OK.
> >
> > I'd prefer if other values were treated as BPF_DROP, with other semantics
> > unchanged. Otherwise we won't be able to introduce new semantics
> > without potentially breaking user code.
>
> That might be surprising or even risky. If you attach a badly written
> program that, say, returns a negative value, it will drop all TCP SYNs and
> UDP traffic.

I think if you do that, all bets are off anyway; there is no use in trying to
stagger on. Being stricter here will actually make it easier for a developer
to ensure that their program is doing the right thing.

My point about future extensions also still stands.

>
> >
> >>
> >> This lets the user match packets with listening sockets freely at the last
> >> possible point on the receive path, where we know that packets are destined
> >> for local delivery after undergoing policing, filtering, and routing.
> >>
> >> With BPF code selecting the socket, directing packets destined to an IP
> >> range or to a port range to a single socket becomes possible.
> >>
> >> In case multiple programs are attached, they are run in series in the order
> >> in which they were attached. The end result is determined from the return
> >> codes of all programs according to the following rules.
> >>
> >>  1. If any program returned BPF_REDIRECT and selected a valid socket, this
> >>     socket will be used as result of the lookup.
> >>  2. If more than one program returned BPF_REDIRECT and selected a socket,
> >>     last selection takes effect.
> >>  3. If any program returned BPF_DROP and none returned BPF_REDIRECT, the
> >>     socket lookup will fail with -ECONNREFUSED.
> >>  4. If no program returned either BPF_DROP or BPF_REDIRECT, socket lookup
> >>     continues with the htable-based lookup.
> >>
> >> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >>
> >> Notes:
> >>     v3:
> >>     - Use a static_key to minimize the hook overhead when not used. (Alexei)
> >>     - Adapt for running an array of attached programs. (Alexei)
> >>     - Adapt for optionally skipping reuseport selection. (Martin)
> >>
> >>  include/linux/bpf.h        | 29 ++++++++++++++++++++++++++++
> >>  include/linux/filter.h     | 39 ++++++++++++++++++++++++++++++++++++++
> >>  kernel/bpf/net_namespace.c | 32 ++++++++++++++++++++++++++++++-
> >>  net/core/filter.c          |  2 ++
> >>  net/ipv4/inet_hashtables.c | 31 ++++++++++++++++++++++++++++++
> >>  5 files changed, 132 insertions(+), 1 deletion(-)
> >>
>
> [...]
>
> >> diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
> >> index 090166824ca4..a7768feb3ade 100644
> >> --- a/kernel/bpf/net_namespace.c
> >> +++ b/kernel/bpf/net_namespace.c
> >> @@ -25,6 +25,28 @@ struct bpf_netns_link {
> >>  /* Protects updates to netns_bpf */
> >>  DEFINE_MUTEX(netns_bpf_mutex);
> >>
> >> +static void netns_bpf_attach_type_disable(enum netns_bpf_attach_type type)
> >
> > Nit: maybe netns_bpf_attach_type_dec()? Disable sounds like it happens
> > unconditionally.
>
> attach_type_dec()/_inc() seems a bit cryptic, since it's not the attach
> type we are incrementing/decrementing.
>
> But I was considering _need()/_unneed(), which would follow an existing
> example, if you think that improves things.

SGTM!

>
> >
> >> +{
> >> +       switch (type) {
> >> +       case NETNS_BPF_SK_LOOKUP:
> >> +               static_branch_dec(&bpf_sk_lookup_enabled);
> >> +               break;
> >> +       default:
> >> +               break;
> >> +       }
> >> +}
> >> +
> >> +static void netns_bpf_attach_type_enable(enum netns_bpf_attach_type type)
> >> +{
> >> +       switch (type) {
> >> +       case NETNS_BPF_SK_LOOKUP:
> >> +               static_branch_inc(&bpf_sk_lookup_enabled);
> >> +               break;
> >> +       default:
> >> +               break;
> >> +       }
> >> +}
> >> +
> >>  /* Must be called with netns_bpf_mutex held. */
> >>  static void netns_bpf_run_array_detach(struct net *net,
> >>                                        enum netns_bpf_attach_type type)
> >> @@ -93,6 +115,9 @@ static void bpf_netns_link_release(struct bpf_link *link)
> >>         if (!net)
> >>                 goto out_unlock;
> >>
> >> +       /* Mark attach point as unused */
> >> +       netns_bpf_attach_type_disable(type);
> >> +
> >>         /* Remember link position in case of safe delete */
> >>         idx = link_index(net, type, net_link);
> >>         list_del(&net_link->node);
> >> @@ -416,6 +441,9 @@ static int netns_bpf_link_attach(struct net *net, struct bpf_link *link,
> >>                                         lockdep_is_held(&netns_bpf_mutex));
> >>         bpf_prog_array_free(run_array);
> >>
> >> +       /* Mark attach point as used */
> >> +       netns_bpf_attach_type_enable(type);
> >> +
> >>  out_unlock:
> >>         mutex_unlock(&netns_bpf_mutex);
> >>         return err;
> >> @@ -491,8 +519,10 @@ static void __net_exit netns_bpf_pernet_pre_exit(struct net *net)
> >>         mutex_lock(&netns_bpf_mutex);
> >>         for (type = 0; type < MAX_NETNS_BPF_ATTACH_TYPE; type++) {
> >>                 netns_bpf_run_array_detach(net, type);
> >> -               list_for_each_entry(net_link, &net->bpf.links[type], node)
> >> +               list_for_each_entry(net_link, &net->bpf.links[type], node) {
> >>                         net_link->net = NULL; /* auto-detach link */
> >> +                       netns_bpf_attach_type_disable(type);
> >> +               }
> >>                 if (net->bpf.progs[type])
> >>                         bpf_prog_put(net->bpf.progs[type]);
> >>         }
> >> diff --git a/net/core/filter.c b/net/core/filter.c
>
> [...]



-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: [PATCH bpf-next v3 10/16] udp6: Run SK_LOOKUP BPF program on socket lookup
  2020-07-02  9:24 ` [PATCH bpf-next v3 10/16] udp6: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
@ 2020-07-02 14:51     ` kernel test robot
  0 siblings, 0 replies; 51+ messages in thread
From: kernel test robot @ 2020-07-02 14:51 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: kbuild-all, netdev, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Marek Majkowski

[-- Attachment #1: Type: text/plain, Size: 1352 bytes --]

Hi Jakub,

I love your patch! Yet something to improve:

[auto build test ERROR on next-20200702]
[cannot apply to bpf-next/master bpf/master net/master vhost/linux-next ipvs/master net-next/master linus/master v5.8-rc3 v5.8-rc2 v5.8-rc1 v5.8-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest specifying the base tree as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jakub-Sitnicki/Run-a-BPF-program-on-socket-lookup/20200702-173127
base:    d37d57041350dff35dd17cbdf9aef4011acada38
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=m68k 

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "bpf_sk_lookup_enabled" [net/ipv6/ipv6.ko] undefined!

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 13270 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 10/16] udp6: Run SK_LOOKUP BPF program on socket lookup
  2020-07-02 14:51     ` kernel test robot
  (?)
@ 2020-07-03 13:04     ` Jakub Sitnicki
  -1 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-03 13:04 UTC (permalink / raw)
  To: kernel test robot
  Cc: bpf, kbuild-all, netdev, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Marek Majkowski

On Thu, Jul 02, 2020 at 04:51 PM CEST, kernel test robot wrote:

[...]

> All errors (new ones prefixed by >>, old ones prefixed by <<):
>
>>> ERROR: modpost: "bpf_sk_lookup_enabled" [net/ipv6/ipv6.ko] undefined!
>

We're missing an EXPORT_SYMBOL for the CONFIG_IPV6=m build. Will fix in v4.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-07-02  9:24 ` [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
@ 2020-07-04 18:42   ` Yonghong Song
  2020-07-06 11:44     ` Jakub Sitnicki
  2020-07-05  9:20     ` kernel test robot
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 51+ messages in thread
From: Yonghong Song @ 2020-07-04 18:42 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Marek Majkowski



On 7/2/20 2:24 AM, Jakub Sitnicki wrote:
> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
> when looking up a listening socket for a new connection request for
> connection oriented protocols, or when looking up an unconnected socket for
> a packet for connection-less protocols.
> 
> When called, SK_LOOKUP BPF program can select a socket that will receive
> the packet. This serves as a mechanism to overcome the limits of what the
> bind() API allows one to express. Two use-cases driving this work are:
> 
>   (1) steer packets destined to an IP range, on a fixed port, to a socket
> 
>       192.0.2.0/24, port 80 -> NGINX socket
> 
>   (2) steer packets destined to an IP address, on any port, to a socket
> 
>       198.51.100.1, any port -> L7 proxy socket
> 
> In its run-time context, the program receives information about the packet that
> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> address 4-tuple. Context can be further extended to include ingress
> interface identifier.
> 
> To select a socket, the BPF program fetches it from a map holding socket
> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> helper to record the selection. Transport layer then uses the selected
> socket as a result of socket lookup.
> 
> This patch only enables the user to attach an SK_LOOKUP program to a
> network namespace. Subsequent patches hook it up to run on local delivery
> path in ipv4 and ipv6 stacks.
> 
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
> 
> Notes:
>      v3:
>      - Allow bpf_sk_assign helper to replace previously selected socket only
>        when the BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution against multiple
>        programs running in series accidentally overriding each other's verdict.
>      - Let BPF program decide that load-balancing within a reuseport socket group
>        should be skipped for the socket selected with bpf_sk_assign() by passing
>        BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
>      - Extend struct bpf_sk_lookup program context with an 'sk' field containing
>        the selected socket, with the intention that multiple attached programs
>        running in series can see each other's choices. However, currently the
>        verifier doesn't allow checking if pointer is set.
>      - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
>      - Get rid of macros in convert_ctx_access to make it easier to read.
>      - Disallow 1-,2-byte access to context fields containing IP addresses.
>      
>      v2:
>      - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>        Update bpf_sk_assign docs accordingly. (Martin)
>      - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>      - Fix broken build when CONFIG_INET is not selected. (Martin)
>      - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
>      - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
> 
>   include/linux/bpf-netns.h  |   3 +
>   include/linux/bpf_types.h  |   2 +
>   include/linux/filter.h     |  19 ++++
>   include/uapi/linux/bpf.h   |  74 +++++++++++++++
>   kernel/bpf/net_namespace.c |   5 +
>   kernel/bpf/syscall.c       |   9 ++
>   net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
>   scripts/bpf_helpers_doc.py |   9 +-
>   8 files changed, 306 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/bpf-netns.h b/include/linux/bpf-netns.h
> index 4052d649f36d..cb1d849c5d4f 100644
> --- a/include/linux/bpf-netns.h
> +++ b/include/linux/bpf-netns.h
> @@ -8,6 +8,7 @@
>   enum netns_bpf_attach_type {
>   	NETNS_BPF_INVALID = -1,
>   	NETNS_BPF_FLOW_DISSECTOR = 0,
> +	NETNS_BPF_SK_LOOKUP,
>   	MAX_NETNS_BPF_ATTACH_TYPE
>   };
>   
> @@ -17,6 +18,8 @@ to_netns_bpf_attach_type(enum bpf_attach_type attach_type)
>   	switch (attach_type) {
>   	case BPF_FLOW_DISSECTOR:
>   		return NETNS_BPF_FLOW_DISSECTOR;
> +	case BPF_SK_LOOKUP:
> +		return NETNS_BPF_SK_LOOKUP;
>   	default:
>   		return NETNS_BPF_INVALID;
>   	}
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index a18ae82a298a..a52a5688418e 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
>   #ifdef CONFIG_INET
>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
>   	      struct sk_reuseport_md, struct sk_reuseport_kern)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
> +	      struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
>   #endif
>   #if defined(CONFIG_BPF_JIT)
>   BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 259377723603..ba4f8595fa54 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1278,4 +1278,23 @@ struct bpf_sockopt_kern {
>   	s32		retval;
>   };
>   
> +struct bpf_sk_lookup_kern {
> +	u16		family;
> +	u16		protocol;
> +	union {
> +		struct {
> +			__be32 saddr;
> +			__be32 daddr;
> +		} v4;
> +		struct {
> +			const struct in6_addr *saddr;
> +			const struct in6_addr *daddr;
> +		} v6;
> +	};
> +	__be16		sport;
> +	u16		dport;
> +	struct sock	*selected_sk;
> +	bool		no_reuseport;
> +};
> +
>   #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0cb8ec948816..8dd6e6ce5de9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -189,6 +189,7 @@ enum bpf_prog_type {
>   	BPF_PROG_TYPE_STRUCT_OPS,
>   	BPF_PROG_TYPE_EXT,
>   	BPF_PROG_TYPE_LSM,
> +	BPF_PROG_TYPE_SK_LOOKUP,
>   };
>   
>   enum bpf_attach_type {
> @@ -226,6 +227,7 @@ enum bpf_attach_type {
>   	BPF_CGROUP_INET4_GETSOCKNAME,
>   	BPF_CGROUP_INET6_GETSOCKNAME,
>   	BPF_XDP_DEVMAP,
> +	BPF_SK_LOOKUP,
>   	__MAX_BPF_ATTACH_TYPE
>   };
>   
> @@ -3067,6 +3069,10 @@ union bpf_attr {
>    *
>    * long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
>    *	Description
> + *		The helper is overloaded depending on the BPF program type. This
> + *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
> + *		**BPF_PROG_TYPE_SCHED_ACT** programs.
> + *
>    *		Assign the *sk* to the *skb*. When combined with appropriate
>    *		routing configuration to receive the packet towards the socket,
>    *		will cause *skb* to be delivered to the specified socket.
> @@ -3092,6 +3098,53 @@ union bpf_attr {
>    *		**-ESOCKTNOSUPPORT** if the socket type is not supported
>    *		(reuseport).
>    *
> + * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)

Recently, we changed the return value from "int" to "long" for helpers
that intend to return a negative error. See above:
    long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)

> + *	Description
> + *		Helper is overloaded depending on BPF program type. This
> + *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
> + *
> + *		Select the *sk* as a result of a socket lookup.
> + *
> + *		For the operation to succeed, the passed socket must be compatible
> + *		with the packet description provided by the *ctx* object.
> + *
> + *		The L4 protocol (**IPPROTO_TCP** or **IPPROTO_UDP**) must
> + *		be an exact match, while the IP family (**AF_INET** or
> + *		**AF_INET6**) must be compatible; that is, IPv6 sockets
> + *		that are not v6-only can be selected for IPv4 packets.
> + *
> + *		Only TCP listeners and UDP unconnected sockets can be
> + *		selected.
> + *
> + *		The *flags* argument can be a combination of the following values:
> + *
> + *		* **BPF_SK_LOOKUP_F_REPLACE** to override the previous
> + *		  socket selection, potentially done by a BPF program
> + *		  that ran before us.
> + *
> + *		* **BPF_SK_LOOKUP_F_NO_REUSEPORT** to skip
> + *		  load-balancing within reuseport group for the socket
> + *		  being selected.
> + *
> + *	Return
> + *		0 on success, or a negative errno in case of failure.
> + *
> + *		* **-EAFNOSUPPORT** if socket family (*sk->family*) is
> + *		  not compatible with packet family (*ctx->family*).
> + *
> + *		* **-EEXIST** if a socket has already been selected,
> + *		  potentially by another program, and
> + *		  **BPF_SK_LOOKUP_F_REPLACE** flag was not specified.
> + *
> + *		* **-EINVAL** if unsupported flags were specified.
> + *
> + *		* **-EPROTOTYPE** if socket L4 protocol
> + *		  (*sk->protocol*) doesn't match packet protocol
> + *		  (*ctx->protocol*).
> + *
> + *		* **-ESOCKTNOSUPPORT** if socket is not in allowed
> + *		  state (TCP listening or UDP unconnected).
> + *
[...]
> +static bool sk_lookup_is_valid_access(int off, int size,
> +				      enum bpf_access_type type,
> +				      const struct bpf_prog *prog,
> +				      struct bpf_insn_access_aux *info)
> +{
> +	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
> +		return false;
> +	if (off % size != 0)
> +		return false;
> +	if (type != BPF_READ)
> +		return false;
> +
> +	switch (off) {
> +	case bpf_ctx_range(struct bpf_sk_lookup, family):
> +	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
> +	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
> +	case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
> +	case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
> +	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
> +		return size == sizeof(__u32);

Maybe forcing 4-byte access for some of the above is too restrictive?
For example, if the user did
    __u16 remote_port = ctx->remote_port;
    __u16 local_port = ctx->local_port;
the compiler is likely to generate a 2-byte load and the verifier
will reject the program. The same goes for protocol, family, ...
Even for local_ip4, the user may just want to read one byte to
do something ...

One example: bpf_sock_addr->user_port.

We have numerous instances like this, and the kernel had to be
patched later to permit such accesses.

I think for reads we should allow 1/2/4 byte accesses
whenever possible. Pointer access, of course, is not allowed.

> +
> +	case offsetof(struct bpf_sk_lookup, sk):
> +		info->reg_type = PTR_TO_SOCKET;
> +		return size == sizeof(__u64);
> +
> +	default:
> +		return false;
> +	}
> +}
> +
> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
> +					const struct bpf_insn *si,
> +					struct bpf_insn *insn_buf,
> +					struct bpf_prog *prog,
> +					u32 *target_size)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +#if IS_ENABLED(CONFIG_IPV6)
> +	int off;
> +#endif
> +
> +	switch (si->off) {
> +	case offsetof(struct bpf_sk_lookup, family):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, family) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, family));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, protocol):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, protocol) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, protocol));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, remote_ip4):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.saddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v4.saddr));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, local_ip4):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.daddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v4.daddr));
> +		break;
> +
> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
> +				remote_ip6[0], remote_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
> +		off += offsetof(struct in6_addr, s6_addr32[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v6.saddr));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
> +				local_ip6[0], local_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
> +		off += offsetof(struct in6_addr, s6_addr32[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v6.daddr));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, remote_port):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, sport) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, sport));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, local_port):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, dport) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, dport));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, sk):
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, selected_sk));
> +		break;
> +	}
> +
> +	return insn - insn_buf;
> +}
> +
> +const struct bpf_prog_ops sk_lookup_prog_ops = {
> +};
> +
> +const struct bpf_verifier_ops sk_lookup_verifier_ops = {
> +	.get_func_proto		= sk_lookup_func_proto,
> +	.is_valid_access	= sk_lookup_is_valid_access,
> +	.convert_ctx_access	= sk_lookup_convert_ctx_access,
> +};
> +
>   #endif /* CONFIG_INET */
>   
>   DEFINE_BPF_DISPATCHER(xdp)
[...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-07-02  9:24 ` [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
@ 2020-07-05  9:20     ` kernel test robot
  2020-07-05  9:20     ` kernel test robot
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 51+ messages in thread
From: kernel test robot @ 2020-07-05  9:20 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: kbuild-all, netdev, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Marek Majkowski

[-- Attachment #1: Type: text/plain, Size: 12222 bytes --]

Hi Jakub,

I love your patch! Perhaps something to improve:

[auto build test WARNING on next-20200702]
[cannot apply to bpf-next/master bpf/master net/master vhost/linux-next ipvs/master net-next/master linus/master v5.8-rc3 v5.8-rc2 v5.8-rc1 v5.8-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest specifying the base tree as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jakub-Sitnicki/Run-a-BPF-program-on-socket-lookup/20200702-173127
base:    d37d57041350dff35dd17cbdf9aef4011acada38
config: x86_64-randconfig-s021-20200705 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-14) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-3-gfa153962-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

   net/core/filter.c:402:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:405:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:408:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:411:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:414:33: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:488:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:491:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:494:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:1382:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sock_filter const *filter @@     got struct sock_filter [noderef] __user *filter @@
   net/core/filter.c:1382:39: sparse:     expected struct sock_filter const *filter
   net/core/filter.c:1382:39: sparse:     got struct sock_filter [noderef] __user *filter
   net/core/filter.c:1460:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sock_filter const *filter @@     got struct sock_filter [noderef] __user *filter @@
   net/core/filter.c:1460:39: sparse:     expected struct sock_filter const *filter
   net/core/filter.c:1460:39: sparse:     got struct sock_filter [noderef] __user *filter
   net/core/filter.c:7044:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:7047:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:7050:27: sparse: sparse: subtraction of functions? Share your drugs
   net/core/filter.c:8770:31: sparse: sparse: symbol 'sk_filter_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8777:27: sparse: sparse: symbol 'sk_filter_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8781:31: sparse: sparse: symbol 'tc_cls_act_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8789:27: sparse: sparse: symbol 'tc_cls_act_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8793:31: sparse: sparse: symbol 'xdp_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8804:31: sparse: sparse: symbol 'cg_skb_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8810:27: sparse: sparse: symbol 'cg_skb_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8814:31: sparse: sparse: symbol 'lwt_in_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8820:27: sparse: sparse: symbol 'lwt_in_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8824:31: sparse: sparse: symbol 'lwt_out_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8830:27: sparse: sparse: symbol 'lwt_out_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8834:31: sparse: sparse: symbol 'lwt_xmit_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8841:27: sparse: sparse: symbol 'lwt_xmit_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8845:31: sparse: sparse: symbol 'lwt_seg6local_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8851:27: sparse: sparse: symbol 'lwt_seg6local_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8855:31: sparse: sparse: symbol 'cg_sock_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8861:27: sparse: sparse: symbol 'cg_sock_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8864:31: sparse: sparse: symbol 'cg_sock_addr_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8870:27: sparse: sparse: symbol 'cg_sock_addr_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8873:31: sparse: sparse: symbol 'sock_ops_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8879:27: sparse: sparse: symbol 'sock_ops_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8882:31: sparse: sparse: symbol 'sk_skb_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8889:27: sparse: sparse: symbol 'sk_skb_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8892:31: sparse: sparse: symbol 'sk_msg_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8899:27: sparse: sparse: symbol 'sk_msg_prog_ops' was not declared. Should it be static?
   net/core/filter.c:8902:31: sparse: sparse: symbol 'flow_dissector_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:8908:27: sparse: sparse: symbol 'flow_dissector_prog_ops' was not declared. Should it be static?
   net/core/filter.c:9214:31: sparse: sparse: symbol 'sk_reuseport_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:9220:27: sparse: sparse: symbol 'sk_reuseport_prog_ops' was not declared. Should it be static?
>> net/core/filter.c:9399:27: sparse: sparse: symbol 'sk_lookup_prog_ops' was not declared. Should it be static?
>> net/core/filter.c:9402:31: sparse: sparse: symbol 'sk_lookup_verifier_ops' was not declared. Should it be static?
   net/core/filter.c:217:32: sparse: sparse: cast to restricted __be16
   net/core/filter.c:217:32: sparse: sparse: cast to restricted __be16
   net/core/filter.c:217:32: sparse: sparse: cast to restricted __be16
   net/core/filter.c:217:32: sparse: sparse: cast to restricted __be16
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:244:32: sparse: sparse: cast to restricted __be32
   net/core/filter.c:1884:43: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __wsum [usertype] diff @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1884:43: sparse:     expected restricted __wsum [usertype] diff
   net/core/filter.c:1884:43: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1887:36: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __be16 [usertype] old @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1887:36: sparse:     expected restricted __be16 [usertype] old
   net/core/filter.c:1887:36: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1887:42: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be16 [usertype] new @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1887:42: sparse:     expected restricted __be16 [usertype] new
   net/core/filter.c:1887:42: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1890:36: sparse: sparse: incorrect type in argument 2 (different base types) @@     expected restricted __be32 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1890:36: sparse:     expected restricted __be32 [usertype] from
   net/core/filter.c:1890:36: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1890:42: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be32 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1890:42: sparse:     expected restricted __be32 [usertype] to
   net/core/filter.c:1890:42: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1935:59: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __wsum [usertype] diff @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1935:59: sparse:     expected restricted __wsum [usertype] diff
   net/core/filter.c:1935:59: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1938:52: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be16 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1938:52: sparse:     expected restricted __be16 [usertype] from
   net/core/filter.c:1938:52: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1938:58: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __be16 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1938:58: sparse:     expected restricted __be16 [usertype] to
   net/core/filter.c:1938:58: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1941:52: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected restricted __be32 [usertype] from @@     got unsigned long long [usertype] from @@
   net/core/filter.c:1941:52: sparse:     expected restricted __be32 [usertype] from
   net/core/filter.c:1941:52: sparse:     got unsigned long long [usertype] from
   net/core/filter.c:1941:58: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted __be32 [usertype] to @@     got unsigned long long [usertype] to @@
   net/core/filter.c:1941:58: sparse:     expected restricted __be32 [usertype] to
   net/core/filter.c:1941:58: sparse:     got unsigned long long [usertype] to
   net/core/filter.c:1987:28: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __wsum @@
   net/core/filter.c:1987:28: sparse:     expected unsigned long long
   net/core/filter.c:1987:28: sparse:     got restricted __wsum
   net/core/filter.c:2009:35: sparse: sparse: incorrect type in return expression (different base types) @@     expected unsigned long long @@     got restricted __wsum [usertype] csum @@
   net/core/filter.c:2009:35: sparse:     expected unsigned long long
   net/core/filter.c:2009:35: sparse:     got restricted __wsum [usertype] csum
   net/core/filter.c:4730:17: sparse: sparse: incorrect type in assignment (different base types) @@     expected unsigned int [usertype] spi @@     got restricted __be32 const [usertype] spi @@
   net/core/filter.c:4730:17: sparse:     expected unsigned int [usertype] spi
   net/core/filter.c:4730:17: sparse:     got restricted __be32 const [usertype] spi
   net/core/filter.c:4738:33: sparse: sparse: incorrect type in assignment (different base types) @@     expected unsigned int [usertype] remote_ipv4 @@     got restricted __be32 const [usertype] a4 @@
   net/core/filter.c:4738:33: sparse:     expected unsigned int [usertype] remote_ipv4
   net/core/filter.c:4738:33: sparse:     got restricted __be32 const [usertype] a4

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33143 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [RFC PATCH] bpf: sk_lookup_prog_ops can be static
  2020-07-02  9:24 ` [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
@ 2020-07-05  9:20     ` kernel test robot
  2020-07-05  9:20     ` kernel test robot
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 51+ messages in thread
From: kernel test robot @ 2020-07-05  9:20 UTC (permalink / raw)
  To: Jakub Sitnicki, bpf
  Cc: kbuild-all, netdev, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Marek Majkowski


Signed-off-by: kernel test robot <lkp@intel.com>
---
 filter.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 477f3bb440c4c..d8153d217ca8e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9396,10 +9396,10 @@ static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
-const struct bpf_prog_ops sk_lookup_prog_ops = {
+static const struct bpf_prog_ops sk_lookup_prog_ops = {
 };
 
-const struct bpf_verifier_ops sk_lookup_verifier_ops = {
+static const struct bpf_verifier_ops sk_lookup_verifier_ops = {
 	.get_func_proto		= sk_lookup_func_proto,
 	.is_valid_access	= sk_lookup_is_valid_access,
 	.convert_ctx_access	= sk_lookup_convert_ctx_access,

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 04/16] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-07-02 13:19       ` Lorenz Bauer
@ 2020-07-06 11:24         ` Jakub Sitnicki
  0 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-06 11:24 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Marek Majkowski

On Thu, Jul 02, 2020 at 03:19 PM CEST, Lorenz Bauer wrote:
> On Thu, 2 Jul 2020 at 13:46, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Thu, Jul 02, 2020 at 12:27 PM CEST, Lorenz Bauer wrote:
>> > On Thu, 2 Jul 2020 at 10:24, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >>
>> >> Run a BPF program before looking up a listening socket on the receive path.
>> >> Program selects a listening socket to yield as result of socket lookup by
>> >> calling bpf_sk_assign() helper and returning BPF_REDIRECT (7) code.
>> >>
>> >> Alternatively, program can also fail the lookup by returning with
>> >> BPF_DROP (1), or let the lookup continue as usual with BPF_OK (0) on
>> >> return. Other return values are treated the same as BPF_OK.
>> >
>> > I'd prefer if other values were treated as BPF_DROP, with other semantics
>> > unchanged. Otherwise we won't be able to introduce new semantics
>> > without potentially breaking user code.
>>
>> That might be surprising or even risky. If you attach a badly written
>> program that say returns a negative value, it will drop all TCP SYNs and
>> UDP traffic.
>
> I think if you do that all bets are off anyways. No use in trying to stagger on.
> Being stricter here will actually make it easier to for a developer to ensure
> that their program is doing the right thing.
>
> My point about future extensions also still stands.

We've chatted with Lorenz off-list about pros & cons of defaulting to
drop on illegal return code from a BPF program.

On the upside, it is consistent with XDP, SK_REUSEPORT, and SK_SKB
(sockmap) program types.

TC BPF ignores illegal return values: an unspecified action means no
action, so no drop. CGROUP_INET_INGRESS and SOCKET_FILTER, in turn, look
only at the lowest bit ("ret & 1"), so the outcome is a coin toss.

Then there is also the extensibility argument. If we allow traffic to
pass to regular socket lookup on illegal return code from BPF, and users
start to rely on that, then it will be hard or impossible to repurpose
an illegal return value for something else.

Downside of defaulting to drop is that you can accidentally lock
yourself out, e.g. lose SSH access, by attaching a buggy program.


Being consistent with other existing program types is what convinces me
most to set default to drop, so I'll make the change in v4 unless there
are objections.

[...]
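[Editor's note: the verdict handling being settled above can be sketched as a
plain userspace C model. This is illustrative only — the enum and function
names are made up, and the treatment of BPF_REDIRECT without a prior
bpf_sk_assign() selection is an assumption, not the kernel implementation.
The return-code values (BPF_OK = 0, BPF_DROP = 1, BPF_REDIRECT = 7) are the
ones quoted earlier in the thread.]

```c
#include <stdbool.h>

/* Return codes as quoted in the thread (illustrative constants). */
enum { SK_BPF_OK = 0, SK_BPF_DROP = 1, SK_BPF_REDIRECT = 7 };

enum lookup_result { LOOKUP_FALLBACK, LOOKUP_REDIRECT, LOOKUP_DROP };

/* Verdict handling as agreed for v4: BPF_OK falls back to the regular
 * socket lookup, BPF_REDIRECT uses the socket recorded via bpf_sk_assign(),
 * and anything else -- including BPF_DROP and illegal values -- drops the
 * packet. Dropping on REDIRECT-without-selection is an assumption here. */
static enum lookup_result handle_verdict(int prog_ret, bool sk_selected)
{
	switch (prog_ret) {
	case SK_BPF_OK:
		return LOOKUP_FALLBACK;
	case SK_BPF_REDIRECT:
		return sk_selected ? LOOKUP_REDIRECT : LOOKUP_DROP;
	default: /* BPF_DROP and any illegal return value */
		return LOOKUP_DROP;
	}
}
```

Note how this model captures the consistency argument: like XDP, SK_REUSEPORT,
and SK_SKB, an unknown return value fails closed rather than open.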

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-07-04 18:42   ` Yonghong Song
@ 2020-07-06 11:44     ` Jakub Sitnicki
  0 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-06 11:44 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Marek Majkowski

On Sat, Jul 04, 2020 at 08:42 PM CEST, Yonghong Song wrote:
> On 7/2/20 2:24 AM, Jakub Sitnicki wrote:
>> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
>> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
>> when looking up a listening socket for a new connection request for
>> connection oriented protocols, or when looking up an unconnected socket for
>> a packet for connection-less protocols.
>>
>> When called, SK_LOOKUP BPF program can select a socket that will receive
>> the packet. This serves as a mechanism to overcome the limits of what
>> bind() API allows to express. Two use-cases driving this work are:
>>
>>   (1) steer packets destined to an IP range, on fixed port to a socket
>>
>>       192.0.2.0/24, port 80 -> NGINX socket
>>
>>   (2) steer packets destined to an IP address, on any port to a socket
>>
>>       198.51.100.1, any port -> L7 proxy socket
>>
>> In its run-time context program receives information about the packet that
>> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
>> address 4-tuple. Context can be further extended to include ingress
>> interface identifier.
>>
>> To select a socket BPF program fetches it from a map holding socket
>> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
>> helper to record the selection. Transport layer then uses the selected
>> socket as a result of socket lookup.
>>
>> This patch only enables the user to attach an SK_LOOKUP program to a
>> network namespace. Subsequent patches hook it up to run on local delivery
>> path in ipv4 and ipv6 stacks.
>>
>> Suggested-by: Marek Majkowski <marek@cloudflare.com>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>
>> Notes:
>>      v3:
>>      - Allow bpf_sk_assign helper to replace previously selected socket only
>>        when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
>>        programs running in series to accidentally override each other's verdict.
>>      - Let BPF program decide that load-balancing within a reuseport socket group
>>        should be skipped for the socket selected with bpf_sk_assign() by passing
>>        BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
>>      - Extend struct bpf_sk_lookup program context with an 'sk' field containing
>>        the selected socket with an intention for multiple attached program
>>        running in series to see each other's choices. However, currently the
>>        verifier doesn't allow checking if pointer is set.
>>      - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
>>      - Get rid of macros in convert_ctx_access to make it easier to read.
>>      - Disallow 1-,2-byte access to context fields containing IP addresses.
>>           v2:
>>      - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>>        Update bpf_sk_assign docs accordingly. (Martin)
>>      - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>>      - Fix broken build when CONFIG_INET is not selected. (Martin)
>>      - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
>>      - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
>>
>>   include/linux/bpf-netns.h  |   3 +
>>   include/linux/bpf_types.h  |   2 +
>>   include/linux/filter.h     |  19 ++++
>>   include/uapi/linux/bpf.h   |  74 +++++++++++++++
>>   kernel/bpf/net_namespace.c |   5 +
>>   kernel/bpf/syscall.c       |   9 ++
>>   net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
>>   scripts/bpf_helpers_doc.py |   9 +-
>>   8 files changed, 306 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/bpf-netns.h b/include/linux/bpf-netns.h
>> index 4052d649f36d..cb1d849c5d4f 100644
>> --- a/include/linux/bpf-netns.h
>> +++ b/include/linux/bpf-netns.h
>> @@ -8,6 +8,7 @@
>>   enum netns_bpf_attach_type {
>>   	NETNS_BPF_INVALID = -1,
>>   	NETNS_BPF_FLOW_DISSECTOR = 0,
>> +	NETNS_BPF_SK_LOOKUP,
>>   	MAX_NETNS_BPF_ATTACH_TYPE
>>   };
>>   @@ -17,6 +18,8 @@ to_netns_bpf_attach_type(enum bpf_attach_type attach_type)
>>   	switch (attach_type) {
>>   	case BPF_FLOW_DISSECTOR:
>>   		return NETNS_BPF_FLOW_DISSECTOR;
>> +	case BPF_SK_LOOKUP:
>> +		return NETNS_BPF_SK_LOOKUP;
>>   	default:
>>   		return NETNS_BPF_INVALID;
>>   	}
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index a18ae82a298a..a52a5688418e 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
>>   #ifdef CONFIG_INET
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
>>   	      struct sk_reuseport_md, struct sk_reuseport_kern)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
>> +	      struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
>>   #endif
>>   #if defined(CONFIG_BPF_JIT)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 259377723603..ba4f8595fa54 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1278,4 +1278,23 @@ struct bpf_sockopt_kern {
>>   	s32		retval;
>>   };
>>   +struct bpf_sk_lookup_kern {
>> +	u16		family;
>> +	u16		protocol;
>> +	union {
>> +		struct {
>> +			__be32 saddr;
>> +			__be32 daddr;
>> +		} v4;
>> +		struct {
>> +			const struct in6_addr *saddr;
>> +			const struct in6_addr *daddr;
>> +		} v6;
>> +	};
>> +	__be16		sport;
>> +	u16		dport;
>> +	struct sock	*selected_sk;
>> +	bool		no_reuseport;
>> +};
>> +
>>   #endif /* __LINUX_FILTER_H__ */
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 0cb8ec948816..8dd6e6ce5de9 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -189,6 +189,7 @@ enum bpf_prog_type {
>>   	BPF_PROG_TYPE_STRUCT_OPS,
>>   	BPF_PROG_TYPE_EXT,
>>   	BPF_PROG_TYPE_LSM,
>> +	BPF_PROG_TYPE_SK_LOOKUP,
>>   };
>>     enum bpf_attach_type {
>> @@ -226,6 +227,7 @@ enum bpf_attach_type {
>>   	BPF_CGROUP_INET4_GETSOCKNAME,
>>   	BPF_CGROUP_INET6_GETSOCKNAME,
>>   	BPF_XDP_DEVMAP,
>> +	BPF_SK_LOOKUP,
>>   	__MAX_BPF_ATTACH_TYPE
>>   };
>>   @@ -3067,6 +3069,10 @@ union bpf_attr {
>>    *
>>    * long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
>>    *	Description
>> + *		Helper is overloaded depending on BPF program type. This
>> + *		description applies to **BPF_PROG_TYPE_SCHED_CLS** and
>> + *		**BPF_PROG_TYPE_SCHED_ACT** programs.
>> + *
>>    *		Assign the *sk* to the *skb*. When combined with appropriate
>>    *		routing configuration to receive the packet towards the socket,
>>    *		will cause *skb* to be delivered to the specified socket.
>> @@ -3092,6 +3098,53 @@ union bpf_attr {
>>    *		**-ESOCKTNOSUPPORT** if the socket type is not supported
>>    *		(reuseport).
>>    *
>> + * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
>
> recently, we have changed return value from "int" to "long" if the helper
> intends to return a negative error. See above
>    long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)

Thanks. I missed that one. Will fix in v4.

>
>> + *	Description
>> + *		Helper is overloaded depending on BPF program type. This
>> + *		description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
>> + *
>> + *		Select the *sk* as a result of a socket lookup.
>> + *
>> + *		For the operation to succeed, the passed socket must be
>> + *		compatible with the packet description provided by the
>> + *		*ctx* object.
>> + *
>> + *		L4 protocol (**IPPROTO_TCP** or **IPPROTO_UDP**) must
>> + *		be an exact match, while IP family (**AF_INET** or
>> + *		**AF_INET6**) must be compatible; that is, IPv6 sockets
>> + *		that are not v6-only can be selected for IPv4 packets.
>> + *
>> + *		Only TCP listeners and UDP unconnected sockets can be
>> + *		selected.
>> + *
>> + *		*flags* argument can be a combination of the following values:
>> + *
>> + *		* **BPF_SK_LOOKUP_F_REPLACE** to override the previous
>> + *		  socket selection, potentially done by a BPF program
>> + *		  that ran before us.
>> + *
>> + *		* **BPF_SK_LOOKUP_F_NO_REUSEPORT** to skip
>> + *		  load-balancing within reuseport group for the socket
>> + *		  being selected.
>> + *
>> + *	Return
>> + *		0 on success, or a negative errno in case of failure.
>> + *
>> + *		* **-EAFNOSUPPORT** if socket family (*sk->family*) is
>> + *		  not compatible with packet family (*ctx->family*).
>> + *
>> + *		* **-EEXIST** if socket has been already selected,
>> + *		  potentially by another program, and
>> + *		  **BPF_SK_LOOKUP_F_REPLACE** flag was not specified.
>> + *
>> + *		* **-EINVAL** if unsupported flags were specified.
>> + *
>> + *		* **-EPROTOTYPE** if socket L4 protocol
>> + *		  (*sk->protocol*) doesn't match packet protocol
>> + *		  (*ctx->protocol*).
>> + *
>> + *		* **-ESOCKTNOSUPPORT** if socket is not in allowed
>> + *		  state (TCP listening or UDP unconnected).
>> + *
> [...]
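[Editor's note: the helper contract quoted above — the flags and the
documented errno values — can be modeled in plain userspace C. This is a
sketch of the documented checks only; the struct names, field layout, and
check ordering are assumptions, and the real helper additionally validates
socket state (TCP listening / UDP unconnected) and RCU freeing.]

```c
#include <errno.h>
#include <stdbool.h>

#define BPF_SK_LOOKUP_F_REPLACE      (1ULL << 0)
#define BPF_SK_LOOKUP_F_NO_REUSEPORT (1ULL << 1)

/* Hypothetical stand-ins for bpf_sk_lookup_kern and struct sock. */
struct fake_ctx { bool selected; int family, protocol; };
struct fake_sk  { int family, protocol; bool v6only; };

static long model_sk_assign(struct fake_ctx *ctx, const struct fake_sk *sk,
			    unsigned long long flags)
{
	/* -EINVAL: unsupported flags were specified. */
	if (flags & ~(BPF_SK_LOOKUP_F_REPLACE | BPF_SK_LOOKUP_F_NO_REUSEPORT))
		return -EINVAL;
	/* -EEXIST: a socket was already selected and F_REPLACE not given. */
	if (ctx->selected && !(flags & BPF_SK_LOOKUP_F_REPLACE))
		return -EEXIST;
	/* -EPROTOTYPE: L4 protocol must be an exact match. */
	if (sk->protocol != ctx->protocol)
		return -EPROTOTYPE;
	/* -EAFNOSUPPORT: family must be compatible; an IPv6 socket that is
	 * not v6-only may still take IPv4 packets. */
	if (sk->family != ctx->family &&
	    !(sk->family == 10 /* AF_INET6 */ && !sk->v6only))
		return -EAFNOSUPPORT;
	ctx->selected = true;
	return 0;
}
```

The F_REPLACE branch shows why the flag exists: with multiple programs
attached in series, a later program must opt in explicitly before it can
override an earlier program's selection.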
>> +static bool sk_lookup_is_valid_access(int off, int size,
>> +				      enum bpf_access_type type,
>> +				      const struct bpf_prog *prog,
>> +				      struct bpf_insn_access_aux *info)
>> +{
>> +	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
>> +		return false;
>> +	if (off % size != 0)
>> +		return false;
>> +	if (type != BPF_READ)
>> +		return false;
>> +
>> +	switch (off) {
>> +	case bpf_ctx_range(struct bpf_sk_lookup, family):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
>> +	case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
>> +	case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
>> +	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
>> +		return size == sizeof(__u32);
>
> Maybe some of the above forcing 4-byte access is too restrictive?
> For example, if user did
>    __u16 *remote_port = ctx->remote_port;
>    __u16 *local_port = ctx->local_port;
> compiler is likely to generate a 2-byte load and the verifier
> will reject the program. The same for protocol, family, ...
> Even for local_ip4, user may just want to read one byte to
> do something ...
>
> One example, bpf_sock_addr->user_port.
>
> We have numerous instances like this and kernel has to be
> patched to permit it later.
>
> I think for read we should allow 1/2/4 byte accesses
> whenever possible. pointer of course not allowed.

You have a point. I've tried to keep it simple, but did not consider
that this is creating a pain-point for users and can lead to fights with
the compiler.

Will revert to having 1,2,4-byte reads in v4.

Thanks for comments.
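[Editor's note: the difference between the v3 check and the relaxed
1/2/4-byte reads agreed for v4 can be sketched in plain C. This is a
userspace model with an illustrative field offset, not the kernel's
sk_lookup_is_valid_access(); the alignment rule for narrow loads is an
assumption based on how such checks are commonly written.]

```c
#include <stdbool.h>

/* Offsets of a hypothetical 4-byte context field, e.g. local_ip4. */
#define FIELD_OFF  8
#define FIELD_END (FIELD_OFF + 4)

/* Restrictive check as in v3: only a full 4-byte load passes. */
static bool valid_access_v3(int off, int size)
{
	return off == FIELD_OFF && size == 4;
}

/* Relaxed check planned for v4: any 1-, 2-, or 4-byte load that stays
 * inside the field and is aligned to its own size. */
static bool valid_access_v4(int off, int size)
{
	if (size != 1 && size != 2 && size != 4)
		return false;
	return off >= FIELD_OFF && off + size <= FIELD_END &&
	       off % size == 0;
}
```

Under the v3 rule a program doing `__u16 port = ctx->remote_port;` is
rejected, because the compiler emits a 2-byte load; the v4 rule accepts it.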

>
>> +
>> +	case offsetof(struct bpf_sk_lookup, sk):
>> +		info->reg_type = PTR_TO_SOCKET;
>> +		return size == sizeof(__u64);
>> +
>> +	default:
>> +		return false;
>> +	}
>> +}
>> +
>> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
>> +					const struct bpf_insn *si,
>> +					struct bpf_insn *insn_buf,
>> +					struct bpf_prog *prog,
>> +					u32 *target_size)
>> +{
>> +	struct bpf_insn *insn = insn_buf;
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +	int off;
>> +#endif
>> +
>> +	switch (si->off) {
>> +	case offsetof(struct bpf_sk_lookup, family):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, family) != 2);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, family));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, protocol):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, protocol) != 2);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, protocol));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, remote_ip4):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.saddr) != 4);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, v4.saddr));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, local_ip4):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.daddr) != 4);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, v4.daddr));
>> +		break;
>> +
>> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
>> +				remote_ip6[0], remote_ip6[3]):
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
>> +
>> +		off = si->off;
>> +		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
>> +		off += offsetof(struct in6_addr, s6_addr32[0]);
>> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, v6.saddr));
>> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
>> +#else
>> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
>> +#endif
>> +		break;
>> +
>> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
>> +				local_ip6[0], local_ip6[3]):
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
>> +
>> +		off = si->off;
>> +		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
>> +		off += offsetof(struct in6_addr, s6_addr32[0]);
>> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, v6.daddr));
>> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
>> +#else
>> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
>> +#endif
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, remote_port):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, sport) != 2);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, sport));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, local_port):
>> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, dport) != 2);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, dport));
>> +		break;
>> +
>> +	case offsetof(struct bpf_sk_lookup, sk):
>> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
>> +				      offsetof(struct bpf_sk_lookup_kern, selected_sk));
>> +		break;
>> +	}
>> +
>> +	return insn - insn_buf;
>> +}
>> +
>> +const struct bpf_prog_ops sk_lookup_prog_ops = {
>> +};
>> +
>> +const struct bpf_verifier_ops sk_lookup_verifier_ops = {
>> +	.get_func_proto		= sk_lookup_func_proto,
>> +	.is_valid_access	= sk_lookup_is_valid_access,
>> +	.convert_ctx_access	= sk_lookup_convert_ctx_access,
>> +};
>> +
>>   #endif /* CONFIG_INET */
>>     DEFINE_BPF_DISPATCHER(xdp)
> [...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 04/16] inet: Run SK_LOOKUP BPF program on socket lookup
  2020-07-02  9:24 ` [PATCH bpf-next v3 04/16] inet: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
  2020-07-02 10:27   ` Lorenz Bauer
@ 2020-07-06 12:06   ` Jakub Sitnicki
  1 sibling, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-06 12:06 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Marek Majkowski

On Thu, Jul 02, 2020 at 11:24 AM CEST, Jakub Sitnicki wrote:
> Run a BPF program before looking up a listening socket on the receive path.
> Program selects a listening socket to yield as result of socket lookup by
> calling bpf_sk_assign() helper and returning BPF_REDIRECT (7) code.
>
> Alternatively, program can also fail the lookup by returning with
> BPF_DROP (1), or let the lookup continue as usual with BPF_OK (0) on
> return. Other return values are treated the same as BPF_OK.
>
> This lets the user match packets with listening sockets freely at the last
> possible point on the receive path, where we know that packets are destined
> for local delivery after undergoing policing, filtering, and routing.
>
> With BPF code selecting the socket, directing packets destined to an IP
> range or to a port range to a single socket becomes possible.
>
> In case multiple programs are attached, they are run in series in the order
> in which they were attached. The end result gets determined from return
> code from each program according to following rules.
>
>  1. If any program returned BPF_REDIRECT and selected a valid socket, this
>     socket will be used as result of the lookup.
>  2. If more than one program returned BPF_REDIRECT and selected a socket,
>     last selection takes effect.
>  3. If any program returned BPF_DROP and none returned BPF_REDIRECT, the
>     socket lookup will fail with -ECONNREFUSED.
>  4. If no program returned either BPF_DROP or BPF_REDIRECT, socket lookup
>     continues to htable-based lookup.

Lorenz suggested that we cut down the allowed return values to just
BPF_OK (pass) or BPF_DROP, and get rid of BPF_REDIRECT.

Instead of returning BPF_REDIRECT, BPF program will select a socket with
bpf_sk_assign() and return BPF_OK.

Also, a program will be able to discard the socket it has selected by
passing NULL to bpf_sk_assign(). This requires a slight change to the
verifier in order to support an argument type that is a pointer to a
full socket or NULL.

These simplified semantics seem very attractive. In its basic form, the
new program type behaves like a filter that can simply pass / drop
connection requests, with the key added ability to select an
alternative socket to handle the connection request when
bpf_sk_assign() gets called.

It is also closer to how redirection works in TC BPF, SK_SKB and
SK_REUSEPORT programs. There is no REDIRECT return code expectation
there.

We can even go a step further and adopt SK_PASS / SK_DROP as return
values, instead of BPF_OK / BPF_DROP, as they are already in use by
SK_SKB and SK_REUSEPORT programs.

[...]


* Re: [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-07-02  9:24 ` [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
                     ` (2 preceding siblings ...)
  2020-07-05  9:20     ` kernel test robot
@ 2020-07-07  9:21   ` Jakub Sitnicki
  2020-07-09  4:08   ` Andrii Nakryiko
  4 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-07  9:21 UTC (permalink / raw)
  To: bpf
  Cc: netdev, kernel-team, Alexei Starovoitov, Daniel Borkmann,
	David S. Miller, Jakub Kicinski, Marek Majkowski

On Thu, Jul 02, 2020 at 11:24 AM CEST, Jakub Sitnicki wrote:
> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
> when looking up a listening socket for a new connection request for
> connection oriented protocols, or when looking up an unconnected socket for
> a packet for connection-less protocols.
>
> When called, SK_LOOKUP BPF program can select a socket that will receive
> the packet. This serves as a mechanism to overcome the limits of what
> bind() API allows to express. Two use-cases driving this work are:
>
>  (1) steer packets destined to an IP range, on fixed port to a socket
>
>      192.0.2.0/24, port 80 -> NGINX socket
>
>  (2) steer packets destined to an IP address, on any port to a socket
>
>      198.51.100.1, any port -> L7 proxy socket
>
> In its run-time context program receives information about the packet that
> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> address 4-tuple. Context can be further extended to include ingress
> interface identifier.
>
> To select a socket BPF program fetches it from a map holding socket
> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> helper to record the selection. Transport layer then uses the selected
> socket as a result of socket lookup.
>
> This patch only enables the user to attach an SK_LOOKUP program to a
> network namespace. Subsequent patches hook it up to run on local delivery
> path in ipv4 and ipv6 stacks.
>
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>
> Notes:
>     v3:
>     - Allow bpf_sk_assign helper to replace previously selected socket only
>       when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
>       programs running in series to accidentally override each other's verdict.
>     - Let BPF program decide that load-balancing within a reuseport socket group
>       should be skipped for the socket selected with bpf_sk_assign() by passing
>       BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
>     - Extend struct bpf_sk_lookup program context with an 'sk' field containing
>       the selected socket with an intention for multiple attached program
>       running in series to see each other's choices. However, currently the
>       verifier doesn't allow checking if pointer is set.
>     - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
>     - Get rid of macros in convert_ctx_access to make it easier to read.
>     - Disallow 1-,2-byte access to context fields containing IP addresses.
>
>     v2:
>     - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>       Update bpf_sk_assign docs accordingly. (Martin)
>     - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>     - Fix broken build when CONFIG_INET is not selected. (Martin)
>     - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
>     - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
>
>  include/linux/bpf-netns.h  |   3 +
>  include/linux/bpf_types.h  |   2 +
>  include/linux/filter.h     |  19 ++++
>  include/uapi/linux/bpf.h   |  74 +++++++++++++++
>  kernel/bpf/net_namespace.c |   5 +
>  kernel/bpf/syscall.c       |   9 ++
>  net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
>  scripts/bpf_helpers_doc.py |   9 +-
>  8 files changed, 306 insertions(+), 1 deletion(-)
>

[...]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index c796e141ea8e..286f90e0c824 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9219,6 +9219,192 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
>
>  const struct bpf_prog_ops sk_reuseport_prog_ops = {
>  };
> +
> +BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
> +	   struct sock *, sk, u64, flags)
> +{
> +	if (unlikely(flags & ~(BPF_SK_LOOKUP_F_REPLACE |
> +			       BPF_SK_LOOKUP_F_NO_REUSEPORT)))
> +		return -EINVAL;
> +	if (unlikely(sk_is_refcounted(sk)))
> +		return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> +	if (unlikely(sk->sk_state == TCP_ESTABLISHED))
> +		return -ESOCKTNOSUPPORT; /* reject connected sockets */
> +
> +	/* Check if socket is suitable for packet L3/L4 protocol */
> +	if (sk->sk_protocol != ctx->protocol)
> +		return -EPROTOTYPE;
> +	if (sk->sk_family != ctx->family &&
> +	    (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
> +		return -EAFNOSUPPORT;
> +
> +	if (ctx->selected_sk && !(flags & BPF_SK_LOOKUP_F_REPLACE))
> +		return -EEXIST;
> +
> +	/* Select socket as lookup result */
> +	ctx->selected_sk = sk;
> +	ctx->no_reuseport = flags & BPF_SK_LOOKUP_F_NO_REUSEPORT;
> +	return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_sk_lookup_assign_proto = {
> +	.func		= bpf_sk_lookup_assign,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg2_type	= ARG_PTR_TO_SOCKET,
> +	.arg3_type	= ARG_ANYTHING,
> +};
> +
> +static const struct bpf_func_proto *
> +sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	switch (func_id) {
> +	case BPF_FUNC_sk_assign:
> +		return &bpf_sk_lookup_assign_proto;
> +	case BPF_FUNC_sk_release:
> +		return &bpf_sk_release_proto;
> +	default:
> +		return bpf_base_func_proto(func_id);
> +	}
> +}
> +
> +static bool sk_lookup_is_valid_access(int off, int size,
> +				      enum bpf_access_type type,
> +				      const struct bpf_prog *prog,
> +				      struct bpf_insn_access_aux *info)
> +{
> +	if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
> +		return false;
> +	if (off % size != 0)
> +		return false;
> +	if (type != BPF_READ)
> +		return false;
> +
> +	switch (off) {
> +	case bpf_ctx_range(struct bpf_sk_lookup, family):
> +	case bpf_ctx_range(struct bpf_sk_lookup, protocol):
> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_ip4):
> +	case bpf_ctx_range(struct bpf_sk_lookup, local_ip4):
> +	case bpf_ctx_range_till(struct bpf_sk_lookup, remote_ip6[0], remote_ip6[3]):
> +	case bpf_ctx_range_till(struct bpf_sk_lookup, local_ip6[0], local_ip6[3]):
> +	case bpf_ctx_range(struct bpf_sk_lookup, remote_port):
> +	case bpf_ctx_range(struct bpf_sk_lookup, local_port):
> +		return size == sizeof(__u32);
> +
> +	case offsetof(struct bpf_sk_lookup, sk):
> +		info->reg_type = PTR_TO_SOCKET;

There's a bug here. bpf_sk_lookup 'sk' field is initially NULL.
reg_type should be PTR_TO_SOCKET_OR_NULL to inform the verifier.
Will fix in v4.

> +		return size == sizeof(__u64);
> +
> +	default:
> +		return false;
> +	}
> +}
> +
> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
> +					const struct bpf_insn *si,
> +					struct bpf_insn *insn_buf,
> +					struct bpf_prog *prog,
> +					u32 *target_size)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +#if IS_ENABLED(CONFIG_IPV6)
> +	int off;
> +#endif
> +
> +	switch (si->off) {
> +	case offsetof(struct bpf_sk_lookup, family):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, family) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, family));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, protocol):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, protocol) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, protocol));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, remote_ip4):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.saddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v4.saddr));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, local_ip4):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, v4.daddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v4.daddr));
> +		break;
> +
> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
> +				remote_ip6[0], remote_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sk_lookup, remote_ip6[0]);
> +		off += offsetof(struct in6_addr, s6_addr32[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v6.saddr));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case bpf_ctx_range_till(struct bpf_sk_lookup,
> +				local_ip6[0], local_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(sizeof_field(struct in6_addr, s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sk_lookup, local_ip6[0]);
> +		off += offsetof(struct in6_addr, s6_addr32[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, v6.daddr));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, off);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, remote_port):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, sport) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, sport));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, local_port):
> +		BUILD_BUG_ON(sizeof_field(struct bpf_sk_lookup_kern, dport) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, dport));
> +		break;
> +
> +	case offsetof(struct bpf_sk_lookup, sk):
> +		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sk_lookup_kern, selected_sk));
> +		break;
> +	}
> +
> +	return insn - insn_buf;
> +}
> +
> +const struct bpf_prog_ops sk_lookup_prog_ops = {
> +};
> +
> +const struct bpf_verifier_ops sk_lookup_verifier_ops = {
> +	.get_func_proto		= sk_lookup_func_proto,
> +	.is_valid_access	= sk_lookup_is_valid_access,
> +	.convert_ctx_access	= sk_lookup_convert_ctx_access,
> +};
> +
>  #endif /* CONFIG_INET */
>
>  DEFINE_BPF_DISPATCHER(xdp)

[...]


* Re: [PATCH bpf-next v3 01/16] bpf, netns: Handle multiple link attachments
  2020-07-02  9:24 ` [PATCH bpf-next v3 01/16] bpf, netns: Handle multiple link attachments Jakub Sitnicki
@ 2020-07-09  3:44   ` Andrii Nakryiko
  2020-07-09 12:49     ` Jakub Sitnicki
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2020-07-09  3:44 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, Jul 2, 2020 at 2:24 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Extend the BPF netns link callbacks to rebuild (grow/shrink) or update the
> prog_array at given position when link gets attached/updated/released.
>
> This lets us lift the limit of having just one link attached for the new
> attach type introduced by a subsequent patch.
>
> No functional changes intended.
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>
> Notes:
>     v3:
>     - New in v3 to support multi-prog attachments. (Alexei)
>
>  include/linux/bpf.h        |  4 ++
>  kernel/bpf/core.c          | 22 ++++++++++
>  kernel/bpf/net_namespace.c | 88 +++++++++++++++++++++++++++++++++++---
>  3 files changed, 107 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 3d2ade703a35..26bc70533db0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -928,6 +928,10 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
>
>  void bpf_prog_array_delete_safe(struct bpf_prog_array *progs,
>                                 struct bpf_prog *old_prog);
> +void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
> +                                  unsigned int index);
> +void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
> +                             struct bpf_prog *prog);
>  int bpf_prog_array_copy_info(struct bpf_prog_array *array,
>                              u32 *prog_ids, u32 request_cnt,
>                              u32 *prog_cnt);
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 9df4cc9a2907..d4b3b9ee6bf1 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1958,6 +1958,28 @@ void bpf_prog_array_delete_safe(struct bpf_prog_array *array,
>                 }
>  }
>
> +void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
> +                                  unsigned int index)
> +{
> +       bpf_prog_array_update_at(array, index, &dummy_bpf_prog.prog);
> +}
> +
> +void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
> +                             struct bpf_prog *prog)

It's a good idea to mention in a comment, for both delete_safe_at
and update_at, that slots with dummy entries are ignored.

Also, given that index can be out of bounds, should these functions
actually return error if the slot is not found?

> +{
> +       struct bpf_prog_array_item *item;
> +
> +       for (item = array->items; item->prog; item++) {
> +               if (item->prog == &dummy_bpf_prog.prog)
> +                       continue;
> +               if (!index) {
> +                       WRITE_ONCE(item->prog, prog);
> +                       break;
> +               }
> +               index--;
> +       }
> +}
> +
>  int bpf_prog_array_copy(struct bpf_prog_array *old_array,
>                         struct bpf_prog *exclude_prog,
>                         struct bpf_prog *include_prog,
> diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
> index 247543380fa6..6011122c35b6 100644
> --- a/kernel/bpf/net_namespace.c
> +++ b/kernel/bpf/net_namespace.c
> @@ -36,11 +36,51 @@ static void netns_bpf_run_array_detach(struct net *net,
>         bpf_prog_array_free(run_array);
>  }
>
> +static unsigned int link_index(struct net *net,
> +                              enum netns_bpf_attach_type type,
> +                              struct bpf_netns_link *link)
> +{
> +       struct bpf_netns_link *pos;
> +       unsigned int i = 0;
> +
> +       list_for_each_entry(pos, &net->bpf.links[type], node) {
> +               if (pos == link)
> +                       return i;
> +               i++;
> +       }
> +       return UINT_MAX;

Why not return a negative error, if the slot is not found? Feels a bit
unusual as far as error reporting goes.

> +}
> +

[...]


* Re: [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-07-02  9:24 ` [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
                     ` (3 preceding siblings ...)
  2020-07-07  9:21   ` [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
@ 2020-07-09  4:08   ` Andrii Nakryiko
  2020-07-09 13:25     ` Jakub Sitnicki
  4 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2020-07-09  4:08 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Marek Majkowski

On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
> when looking up a listening socket for a new connection request for
> connection oriented protocols, or when looking up an unconnected socket for
> a packet for connection-less protocols.
>
> When called, SK_LOOKUP BPF program can select a socket that will receive
> the packet. This serves as a mechanism to overcome the limits of what
> bind() API allows to express. Two use-cases driving this work are:
>
>  (1) steer packets destined to an IP range, on fixed port to a socket
>
>      192.0.2.0/24, port 80 -> NGINX socket
>
>  (2) steer packets destined to an IP address, on any port to a socket
>
>      198.51.100.1, any port -> L7 proxy socket
>
> In its run-time context program receives information about the packet that
> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> address 4-tuple. Context can be further extended to include ingress
> interface identifier.
>
> To select a socket BPF program fetches it from a map holding socket
> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> helper to record the selection. Transport layer then uses the selected
> socket as a result of socket lookup.
>
> This patch only enables the user to attach an SK_LOOKUP program to a
> network namespace. Subsequent patches hook it up to run on local delivery
> path in ipv4 and ipv6 stacks.
>
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>
> Notes:
>     v3:
>     - Allow bpf_sk_assign helper to replace previously selected socket only
>       when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
>       programs running in series to accidentally override each other's verdict.
>     - Let BPF program decide that load-balancing within a reuseport socket group
>       should be skipped for the socket selected with bpf_sk_assign() by passing
>       BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
>     - Extend struct bpf_sk_lookup program context with an 'sk' field containing
>       the selected socket with an intention for multiple attached program
>       running in series to see each other's choices. However, currently the
>       verifier doesn't allow checking if pointer is set.
>     - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
>     - Get rid of macros in convert_ctx_access to make it easier to read.
>     - Disallow 1-,2-byte access to context fields containing IP addresses.
>
>     v2:
>     - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>       Update bpf_sk_assign docs accordingly. (Martin)
>     - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>     - Fix broken build when CONFIG_INET is not selected. (Martin)
>     - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
>     - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
>
>  include/linux/bpf-netns.h  |   3 +
>  include/linux/bpf_types.h  |   2 +
>  include/linux/filter.h     |  19 ++++
>  include/uapi/linux/bpf.h   |  74 +++++++++++++++
>  kernel/bpf/net_namespace.c |   5 +
>  kernel/bpf/syscall.c       |   9 ++
>  net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
>  scripts/bpf_helpers_doc.py |   9 +-
>  8 files changed, 306 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf-netns.h b/include/linux/bpf-netns.h
> index 4052d649f36d..cb1d849c5d4f 100644
> --- a/include/linux/bpf-netns.h
> +++ b/include/linux/bpf-netns.h
> @@ -8,6 +8,7 @@
>  enum netns_bpf_attach_type {
>         NETNS_BPF_INVALID = -1,
>         NETNS_BPF_FLOW_DISSECTOR = 0,
> +       NETNS_BPF_SK_LOOKUP,
>         MAX_NETNS_BPF_ATTACH_TYPE
>  };
>

[...]

> +struct bpf_sk_lookup_kern {
> +       u16             family;
> +       u16             protocol;
> +       union {
> +               struct {
> +                       __be32 saddr;
> +                       __be32 daddr;
> +               } v4;
> +               struct {
> +                       const struct in6_addr *saddr;
> +                       const struct in6_addr *daddr;
> +               } v6;
> +       };
> +       __be16          sport;
> +       u16             dport;
> +       struct sock     *selected_sk;
> +       bool            no_reuseport;
> +};
> +
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0cb8ec948816..8dd6e6ce5de9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -189,6 +189,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_STRUCT_OPS,
>         BPF_PROG_TYPE_EXT,
>         BPF_PROG_TYPE_LSM,
> +       BPF_PROG_TYPE_SK_LOOKUP,
>  };
>
>  enum bpf_attach_type {
> @@ -226,6 +227,7 @@ enum bpf_attach_type {
>         BPF_CGROUP_INET4_GETSOCKNAME,
>         BPF_CGROUP_INET6_GETSOCKNAME,
>         BPF_XDP_DEVMAP,
> +       BPF_SK_LOOKUP,


Point not specific to your changes, but I wanted to bring it up for a
while now, so I thought this one might be as good an opportunity as any.

It seems like enum bpf_attach_type originally was intended for only
cgroup BPF programs. To that end, cgroup_bpf has a bunch of fields
with sizes proportional to MAX_BPF_ATTACH_TYPE. It costs at least
8+4+16=28 bytes for each different type *per each cgroup*. At this
point, we have 22 cgroup-specific attach types, and this will be the
13th non-cgroup attach type. So cgroups pay a price each time we
extend bpf_attach_type with a new non-cgroup attach type. cgroup_bpf
is now 336 bytes bigger than it needs to be.

So I wanted to propose that we do the same thing for cgroup_bpf as you
did for net_ns with netns_bpf_attach_type: have a densely-packed enum
just for cgroup attach types and translate now generic bpf_attach_type
to cgroup-specific cgroup_bpf_attach_type.

I wonder what people think? Is that a good idea? Is anyone up for doing this?

>         __MAX_BPF_ATTACH_TYPE
>  };
>

[...]

> +
> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
> +                                       const struct bpf_insn *si,
> +                                       struct bpf_insn *insn_buf,
> +                                       struct bpf_prog *prog,
> +                                       u32 *target_size)

Would it be too extreme to rely on BTF and direct memory access
(similar to tp_raw, fentry/fexit, etc) for accessing context fields,
instead of all this assembly rewrites? So instead of having
bpf_sk_lookup and bpf_sk_lookup_kern, it will always be a full variant
(bpf_sk_lookup_kern, or however we'd want to name it then) and
verifier will just ensure that direct memory reads go to the right
field boundaries?

> +{
> +       struct bpf_insn *insn = insn_buf;
> +#if IS_ENABLED(CONFIG_IPV6)
> +       int off;
> +#endif
> +

[...]


* Re: [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type
  2020-07-02  9:24 ` [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type Jakub Sitnicki
@ 2020-07-09  4:23   ` Andrii Nakryiko
  2020-07-09 15:51     ` Jakub Sitnicki
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2020-07-09  4:23 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Make libbpf aware of the newly added program type, and assign it a
> section name.
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>
> Notes:
>     v3:
>     - Move new libbpf symbols to version 0.1.0.
>     - Set expected_attach_type in probe_load for new prog type.
>
>     v2:
>     - Add new libbpf symbols to version 0.0.9. (Andrii)
>
>  tools/lib/bpf/libbpf.c        | 3 +++
>  tools/lib/bpf/libbpf.h        | 2 ++
>  tools/lib/bpf/libbpf.map      | 2 ++
>  tools/lib/bpf/libbpf_probes.c | 3 +++
>  4 files changed, 10 insertions(+)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 4ea7f4f1a691..ddcbb5dd78df 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -6793,6 +6793,7 @@ BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
>  BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
>  BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
>  BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
> +BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
>
>  enum bpf_attach_type
>  bpf_program__get_expected_attach_type(struct bpf_program *prog)
> @@ -6969,6 +6970,8 @@ static const struct bpf_sec_def section_defs[] = {
>         BPF_EAPROG_SEC("cgroup/setsockopt",     BPF_PROG_TYPE_CGROUP_SOCKOPT,
>                                                 BPF_CGROUP_SETSOCKOPT),
>         BPF_PROG_SEC("struct_ops",              BPF_PROG_TYPE_STRUCT_OPS),
> +       BPF_EAPROG_SEC("sk_lookup",             BPF_PROG_TYPE_SK_LOOKUP,
> +                                               BPF_SK_LOOKUP),

So it's a BPF_PROG_TYPE_SK_LOOKUP with attach type BPF_SK_LOOKUP. What
other potential attach types could there be for
BPF_PROG_TYPE_SK_LOOKUP? What would the section name look like in that
case?

>  };
>
>  #undef BPF_PROG_SEC_IMPL
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 2335971ed0bd..c2272132e929 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -350,6 +350,7 @@ LIBBPF_API int bpf_program__set_perf_event(struct bpf_program *prog);
>  LIBBPF_API int bpf_program__set_tracing(struct bpf_program *prog);
>  LIBBPF_API int bpf_program__set_struct_ops(struct bpf_program *prog);
>  LIBBPF_API int bpf_program__set_extension(struct bpf_program *prog);
> +LIBBPF_API int bpf_program__set_sk_lookup(struct bpf_program *prog);
>
>  LIBBPF_API enum bpf_prog_type bpf_program__get_type(struct bpf_program *prog);
>  LIBBPF_API void bpf_program__set_type(struct bpf_program *prog,
> @@ -377,6 +378,7 @@ LIBBPF_API bool bpf_program__is_perf_event(const struct bpf_program *prog);
>  LIBBPF_API bool bpf_program__is_tracing(const struct bpf_program *prog);
>  LIBBPF_API bool bpf_program__is_struct_ops(const struct bpf_program *prog);
>  LIBBPF_API bool bpf_program__is_extension(const struct bpf_program *prog);
> +LIBBPF_API bool bpf_program__is_sk_lookup(const struct bpf_program *prog);
>
>  /*
>   * No need for __attribute__((packed)), all members of 'bpf_map_def'
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 6544d2cd1ed6..04b99f63a45c 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -287,5 +287,7 @@ LIBBPF_0.1.0 {
>                 bpf_map__type;
>                 bpf_map__value_size;
>                 bpf_program__autoload;
> +               bpf_program__is_sk_lookup;
>                 bpf_program__set_autoload;
> +               bpf_program__set_sk_lookup;
>  } LIBBPF_0.0.9;
> diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
> index 10cd8d1891f5..5a3d3f078408 100644
> --- a/tools/lib/bpf/libbpf_probes.c
> +++ b/tools/lib/bpf/libbpf_probes.c
> @@ -78,6 +78,9 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
>         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
>                 xattr.expected_attach_type = BPF_CGROUP_INET4_CONNECT;
>                 break;
> +       case BPF_PROG_TYPE_SK_LOOKUP:
> +               xattr.expected_attach_type = BPF_SK_LOOKUP;
> +               break;
>         case BPF_PROG_TYPE_KPROBE:
>                 xattr.kern_version = get_kernel_version();
>                 break;
> --
> 2.25.4
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 16/16] selftests/bpf: Tests for BPF_SK_LOOKUP attach point
  2020-07-02 12:59     ` Jakub Sitnicki
@ 2020-07-09  4:28       ` Andrii Nakryiko
  2020-07-09 15:54         ` Jakub Sitnicki
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2020-07-09  4:28 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Lorenz Bauer, bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, Jul 2, 2020 at 6:00 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Thu, Jul 02, 2020 at 01:01 PM CEST, Lorenz Bauer wrote:
> > On Thu, 2 Jul 2020 at 10:24, Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> Add tests to test_progs that exercise:
> >>
> >>  - attaching/detaching/querying programs to BPF_SK_LOOKUP hook,
> >>  - redirecting socket lookup to a socket selected by BPF program,
> >>  - failing a socket lookup on BPF program's request,
> >>  - error scenarios for selecting a socket from BPF program,
> >>  - accessing BPF program context,
> >>  - attaching and running multiple BPF programs.
> >>
> >> Run log:
> >> | # ./test_progs -n 68
> >> | #68/1 query lookup prog:OK
> >> | #68/2 TCP IPv4 redir port:OK
> >> | #68/3 TCP IPv4 redir addr:OK
> >> | #68/4 TCP IPv4 redir with reuseport:OK
> >> | #68/5 TCP IPv4 redir skip reuseport:OK
> >> | #68/6 TCP IPv6 redir port:OK
> >> | #68/7 TCP IPv6 redir addr:OK
> >> | #68/8 TCP IPv4->IPv6 redir port:OK
> >> | #68/9 TCP IPv6 redir with reuseport:OK
> >> | #68/10 TCP IPv6 redir skip reuseport:OK
> >> | #68/11 UDP IPv4 redir port:OK
> >> | #68/12 UDP IPv4 redir addr:OK
> >> | #68/13 UDP IPv4 redir with reuseport:OK
> >> | #68/14 UDP IPv4 redir skip reuseport:OK
> >> | #68/15 UDP IPv6 redir port:OK
> >> | #68/16 UDP IPv6 redir addr:OK
> >> | #68/17 UDP IPv4->IPv6 redir port:OK
> >> | #68/18 UDP IPv6 redir and reuseport:OK
> >> | #68/19 UDP IPv6 redir skip reuseport:OK
> >> | #68/20 TCP IPv4 drop on lookup:OK
> >> | #68/21 TCP IPv6 drop on lookup:OK
> >> | #68/22 UDP IPv4 drop on lookup:OK
> >> | #68/23 UDP IPv6 drop on lookup:OK
> >> | #68/24 TCP IPv4 drop on reuseport:OK
> >> | #68/25 TCP IPv6 drop on reuseport:OK
> >> | #68/26 UDP IPv4 drop on reuseport:OK
> >> | #68/27 TCP IPv6 drop on reuseport:OK
> >> | #68/28 sk_assign returns EEXIST:OK
> >> | #68/29 sk_assign honors F_REPLACE:OK
> >> | #68/30 access ctx->sk:OK
> >> | #68/31 sk_assign rejects TCP established:OK
> >> | #68/32 sk_assign rejects UDP connected:OK
> >> | #68/33 multi prog - pass, pass:OK
> >> | #68/34 multi prog - pass, inval:OK
> >> | #68/35 multi prog - inval, pass:OK
> >> | #68/36 multi prog - drop, drop:OK
> >> | #68/37 multi prog - pass, drop:OK
> >> | #68/38 multi prog - drop, pass:OK
> >> | #68/39 multi prog - drop, inval:OK
> >> | #68/40 multi prog - inval, drop:OK
> >> | #68/41 multi prog - pass, redir:OK
> >> | #68/42 multi prog - redir, pass:OK
> >> | #68/43 multi prog - drop, redir:OK
> >> | #68/44 multi prog - redir, drop:OK
> >> | #68/45 multi prog - inval, redir:OK
> >> | #68/46 multi prog - redir, inval:OK
> >> | #68/47 multi prog - redir, redir:OK
> >> | #68 sk_lookup:OK
> >> | Summary: 1/47 PASSED, 0 SKIPPED, 0 FAILED
> >>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >>
> >> Notes:
> >>     v3:
> >>     - Extend tests to cover new functionality in v3:
> >>       - multi-prog attachments (query, running, verdict precedence)
> >>       - socket selecting for the second time with bpf_sk_assign
> >>       - skipping over reuseport load-balancing
> >>
> >>     v2:
> >>      - Adjust for fields renames in struct bpf_sk_lookup.
> >>
> >>  .../selftests/bpf/prog_tests/sk_lookup.c      | 1353 +++++++++++++++++
> >>  .../selftests/bpf/progs/test_sk_lookup_kern.c |  399 +++++
> >>  2 files changed, 1752 insertions(+)
> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
> >>
> >> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
> >> new file mode 100644
> >> index 000000000000..2859dc7e65b0
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
>
> [...]
>

[...]

> >> +static void run_lookup_prog(const struct test *t)
> >> +{
> >> +       int client_fd, server_fds[MAX_SERVERS] = { -1 };
> >> +       struct bpf_link *lookup_link;
> >> +       int i, err;
> >> +
> >> +       lookup_link = attach_lookup_prog(t->lookup_prog);
> >> +       if (!lookup_link)
> >
> > Why doesn't this fail the test? Same for the other error paths in the
> > function, and the other helpers.
>
> I took the approach of placing CHECK_FAIL checks only right after the
> failure point. So a syscall or a call to libbpf.
>
> This way if I'm calling a helper, I know it already fails the test if
> anything goes wrong, and I can have less CHECK_FAILs peppered over the
> code.

Please prefer CHECK() over CHECK_FAIL(), unless you are making
hundreds of checks and it's extremely unlikely they will ever fail.
Using CHECK_FAIL() makes it hard to even tell where the test fails.
CHECK() leaves a trail, so it's easier to pinpoint what failed and why.


[...]


* Re: [PATCH bpf-next v3 01/16] bpf, netns: Handle multiple link attachments
  2020-07-09  3:44   ` Andrii Nakryiko
@ 2020-07-09 12:49     ` Jakub Sitnicki
  2020-07-09 22:02       ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-09 12:49 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, Jul 09, 2020 at 05:44 AM CEST, Andrii Nakryiko wrote:
> On Thu, Jul 2, 2020 at 2:24 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Extend the BPF netns link callbacks to rebuild (grow/shrink) or update the
>> prog_array at given position when link gets attached/updated/released.
>>
>> This lets us lift the limit of having just one link attached for the new
>> attach type introduced by a subsequent patch.
>>
>> No functional changes intended.
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>
>> Notes:
>>     v3:
>>     - New in v3 to support multi-prog attachments. (Alexei)
>>
>>  include/linux/bpf.h        |  4 ++
>>  kernel/bpf/core.c          | 22 ++++++++++
>>  kernel/bpf/net_namespace.c | 88 +++++++++++++++++++++++++++++++++++---
>>  3 files changed, 107 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 3d2ade703a35..26bc70533db0 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -928,6 +928,10 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
>>
>>  void bpf_prog_array_delete_safe(struct bpf_prog_array *progs,
>>                                 struct bpf_prog *old_prog);
>> +void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
>> +                                  unsigned int index);
>> +void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
>> +                             struct bpf_prog *prog);
>>  int bpf_prog_array_copy_info(struct bpf_prog_array *array,
>>                              u32 *prog_ids, u32 request_cnt,
>>                              u32 *prog_cnt);
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 9df4cc9a2907..d4b3b9ee6bf1 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -1958,6 +1958,28 @@ void bpf_prog_array_delete_safe(struct bpf_prog_array *array,
>>                 }
>>  }
>>
>> +void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
>> +                                  unsigned int index)
>> +{
>> +       bpf_prog_array_update_at(array, index, &dummy_bpf_prog.prog);
>> +}
>> +
>> +void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
>> +                             struct bpf_prog *prog)
>
> it's a good idea to mention it in a comment for both delete_safe_at
> and update_at that slots with dummy entries are ignored.

I agree. These two need doc comments. update_at doesn't even hint that
this is not a regular update operation. Will add in v4.

>
> Also, given that index can be out of bounds, should these functions
> actually return error if the slot is not found?

That won't hurt. I mean, from bpf-netns PoV getting such an error would
indicate that there is a bug in the code that manages prog_array. But
perhaps other future users of this new prog_array API can benefit.

>
>> +{
>> +       struct bpf_prog_array_item *item;
>> +
>> +       for (item = array->items; item->prog; item++) {
>> +               if (item->prog == &dummy_bpf_prog.prog)
>> +                       continue;
>> +               if (!index) {
>> +                       WRITE_ONCE(item->prog, prog);
>> +                       break;
>> +               }
>> +               index--;
>> +       }
>> +}
>> +
>>  int bpf_prog_array_copy(struct bpf_prog_array *old_array,
>>                         struct bpf_prog *exclude_prog,
>>                         struct bpf_prog *include_prog,
>> diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
>> index 247543380fa6..6011122c35b6 100644
>> --- a/kernel/bpf/net_namespace.c
>> +++ b/kernel/bpf/net_namespace.c
>> @@ -36,11 +36,51 @@ static void netns_bpf_run_array_detach(struct net *net,
>>         bpf_prog_array_free(run_array);
>>  }
>>
>> +static unsigned int link_index(struct net *net,
>> +                              enum netns_bpf_attach_type type,
>> +                              struct bpf_netns_link *link)
>> +{
>> +       struct bpf_netns_link *pos;
>> +       unsigned int i = 0;
>> +
>> +       list_for_each_entry(pos, &net->bpf.links[type], node) {
>> +               if (pos == link)
>> +                       return i;
>> +               i++;
>> +       }
>> +       return UINT_MAX;
>
> Why not return a negative error, if the slot is not found? Feels a bit
> unusual as far as error reporting goes.

Returning uint played well with the consumer of link_index()'s return
value, bpf_prog_array_update_at(). update_at takes an index into the
array, which must not be negative.

But I don't have strong feelings toward it. Will switch to -ENOENT in
v4.

>
>> +}
>> +
>
> [...]


* Re: [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-07-09  4:08   ` Andrii Nakryiko
@ 2020-07-09 13:25     ` Jakub Sitnicki
  2020-07-09 23:09       ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-09 13:25 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Marek Majkowski

On Thu, Jul 09, 2020 at 06:08 AM CEST, Andrii Nakryiko wrote:
> On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
>> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
>> when looking up a listening socket for a new connection request for
>> connection oriented protocols, or when looking up an unconnected socket for
>> a packet for connection-less protocols.
>>
>> When called, SK_LOOKUP BPF program can select a socket that will receive
>> the packet. This serves as a mechanism to overcome the limits of what
>> bind() API allows to express. Two use-cases driving this work are:
>>
>>  (1) steer packets destined to an IP range, on fixed port to a socket
>>
>>      192.0.2.0/24, port 80 -> NGINX socket
>>
>>  (2) steer packets destined to an IP address, on any port to a socket
>>
>>      198.51.100.1, any port -> L7 proxy socket
>>
>> In its run-time context program receives information about the packet that
>> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
>> address 4-tuple. Context can be further extended to include ingress
>> interface identifier.
>>
>> To select a socket BPF program fetches it from a map holding socket
>> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
>> helper to record the selection. Transport layer then uses the selected
>> socket as a result of socket lookup.
>>
>> This patch only enables the user to attach an SK_LOOKUP program to a
>> network namespace. Subsequent patches hook it up to run on local delivery
>> path in ipv4 and ipv6 stacks.
>>
>> Suggested-by: Marek Majkowski <marek@cloudflare.com>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>
>> Notes:
>>     v3:
>>     - Allow bpf_sk_assign helper to replace previously selected socket only
>>       when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
>>       programs running in series to accidentally override each other's verdict.
>>     - Let BPF program decide that load-balancing within a reuseport socket group
>>       should be skipped for the socket selected with bpf_sk_assign() by passing
>>       BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
>>     - Extend struct bpf_sk_lookup program context with an 'sk' field containing
>>       the selected socket with an intention for multiple attached program
>>       running in series to see each other's choices. However, currently the
>>       verifier doesn't allow checking if pointer is set.
>>     - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
>>     - Get rid of macros in convert_ctx_access to make it easier to read.
>>     - Disallow 1-,2-byte access to context fields containing IP addresses.
>>
>>     v2:
>>     - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>>       Update bpf_sk_assign docs accordingly. (Martin)
>>     - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>>     - Fix broken build when CONFIG_INET is not selected. (Martin)
>>     - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
>>     - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
>>
>>  include/linux/bpf-netns.h  |   3 +
>>  include/linux/bpf_types.h  |   2 +
>>  include/linux/filter.h     |  19 ++++
>>  include/uapi/linux/bpf.h   |  74 +++++++++++++++
>>  kernel/bpf/net_namespace.c |   5 +
>>  kernel/bpf/syscall.c       |   9 ++
>>  net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
>>  scripts/bpf_helpers_doc.py |   9 +-
>>  8 files changed, 306 insertions(+), 1 deletion(-)
>>

[...]

>> +
>> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
>> +                                       const struct bpf_insn *si,
>> +                                       struct bpf_insn *insn_buf,
>> +                                       struct bpf_prog *prog,
>> +                                       u32 *target_size)
>
> Would it be too extreme to rely on BTF and direct memory access
> (similar to tp_raw, fentry/fexit, etc) for accessing context fields,
> instead of all this assembly rewrites? So instead of having
> bpf_sk_lookup and bpf_sk_lookup_kern, it will always be a full variant
> (bpf_sk_lookup_kern, or however we'd want to name it then) and
> verifier will just ensure that direct memory reads go to the right
> field boundaries?

Sounds like a decision related to long-term vision. I'd appreciate input
from maintainers if this is the direction we want to go in.

From an implementation PoV - hard for me to say what would be needed to
get it working. I'm not familiar with how BPF_TRACE_* attach types
provide access to context, so I'd need to look around and prototype it
first. (Actually, I'm not sure if you're asking if it is doable or you
already know?)

Off the top of my head, I have one concern, I'm exposing the selected
socket in the context. This is for the benefit of one program being
aware of other program's selection, if multiple programs are attached.

I understand that any piece of data reachable from struct sock *, would
be readable by SK_LOOKUP prog (writes can be blocked in
is_valid_access). And that this is a desired property for tracing. Not
sure how to limit it for a network program that doesn't need all that
info.

>
>> +{
>> +       struct bpf_insn *insn = insn_buf;
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +       int off;
>> +#endif
>> +
>
> [...]


* Re: [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type
  2020-07-09  4:23   ` Andrii Nakryiko
@ 2020-07-09 15:51     ` Jakub Sitnicki
  2020-07-09 23:13       ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-09 15:51 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, Jul 09, 2020 at 06:23 AM CEST, Andrii Nakryiko wrote:
> On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Make libbpf aware of the newly added program type, and assign it a
>> section name.
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>
>> Notes:
>>     v3:
>>     - Move new libbpf symbols to version 0.1.0.
>>     - Set expected_attach_type in probe_load for new prog type.
>>
>>     v2:
>>     - Add new libbpf symbols to version 0.0.9. (Andrii)
>>
>>  tools/lib/bpf/libbpf.c        | 3 +++
>>  tools/lib/bpf/libbpf.h        | 2 ++
>>  tools/lib/bpf/libbpf.map      | 2 ++
>>  tools/lib/bpf/libbpf_probes.c | 3 +++
>>  4 files changed, 10 insertions(+)
>>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 4ea7f4f1a691..ddcbb5dd78df 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -6793,6 +6793,7 @@ BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
>>  BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
>>  BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
>>  BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
>> +BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
>>
>>  enum bpf_attach_type
>>  bpf_program__get_expected_attach_type(struct bpf_program *prog)
>> @@ -6969,6 +6970,8 @@ static const struct bpf_sec_def section_defs[] = {
>>         BPF_EAPROG_SEC("cgroup/setsockopt",     BPF_PROG_TYPE_CGROUP_SOCKOPT,
>>                                                 BPF_CGROUP_SETSOCKOPT),
>>         BPF_PROG_SEC("struct_ops",              BPF_PROG_TYPE_STRUCT_OPS),
>> +       BPF_EAPROG_SEC("sk_lookup",             BPF_PROG_TYPE_SK_LOOKUP,
>> +                                               BPF_SK_LOOKUP),
>
> So it's a BPF_PROG_TYPE_SK_LOOKUP with attach type BPF_SK_LOOKUP. What
> other potential attach types could there be for
> BPF_PROG_TYPE_SK_LOOKUP? How the section name will look like in that
> case?

BPF_PROG_TYPE_SK_LOOKUP won't have any other attach types that I can
foresee. There is a single attach type shared by the tcp4, tcp6, udp4,
and udp6 hook points. If we hook it up in the future, say to sctp, I
expect the same attach point will be reused.

>
>>  };
>>
>>  #undef BPF_PROG_SEC_IMPL
>> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
>> index 2335971ed0bd..c2272132e929 100644
>> --- a/tools/lib/bpf/libbpf.h
>> +++ b/tools/lib/bpf/libbpf.h
>> @@ -350,6 +350,7 @@ LIBBPF_API int bpf_program__set_perf_event(struct bpf_program *prog);
>>  LIBBPF_API int bpf_program__set_tracing(struct bpf_program *prog);
>>  LIBBPF_API int bpf_program__set_struct_ops(struct bpf_program *prog);
>>  LIBBPF_API int bpf_program__set_extension(struct bpf_program *prog);
>> +LIBBPF_API int bpf_program__set_sk_lookup(struct bpf_program *prog);
>>
>>  LIBBPF_API enum bpf_prog_type bpf_program__get_type(struct bpf_program *prog);
>>  LIBBPF_API void bpf_program__set_type(struct bpf_program *prog,
>> @@ -377,6 +378,7 @@ LIBBPF_API bool bpf_program__is_perf_event(const struct bpf_program *prog);
>>  LIBBPF_API bool bpf_program__is_tracing(const struct bpf_program *prog);
>>  LIBBPF_API bool bpf_program__is_struct_ops(const struct bpf_program *prog);
>>  LIBBPF_API bool bpf_program__is_extension(const struct bpf_program *prog);
>> +LIBBPF_API bool bpf_program__is_sk_lookup(const struct bpf_program *prog);
>>
>>  /*
>>   * No need for __attribute__((packed)), all members of 'bpf_map_def'
>> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
>> index 6544d2cd1ed6..04b99f63a45c 100644
>> --- a/tools/lib/bpf/libbpf.map
>> +++ b/tools/lib/bpf/libbpf.map
>> @@ -287,5 +287,7 @@ LIBBPF_0.1.0 {
>>                 bpf_map__type;
>>                 bpf_map__value_size;
>>                 bpf_program__autoload;
>> +               bpf_program__is_sk_lookup;
>>                 bpf_program__set_autoload;
>> +               bpf_program__set_sk_lookup;
>>  } LIBBPF_0.0.9;
>> diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
>> index 10cd8d1891f5..5a3d3f078408 100644
>> --- a/tools/lib/bpf/libbpf_probes.c
>> +++ b/tools/lib/bpf/libbpf_probes.c
>> @@ -78,6 +78,9 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
>>         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
>>                 xattr.expected_attach_type = BPF_CGROUP_INET4_CONNECT;
>>                 break;
>> +       case BPF_PROG_TYPE_SK_LOOKUP:
>> +               xattr.expected_attach_type = BPF_SK_LOOKUP;
>> +               break;
>>         case BPF_PROG_TYPE_KPROBE:
>>                 xattr.kern_version = get_kernel_version();
>>                 break;
>> --
>> 2.25.4
>>


* Re: [PATCH bpf-next v3 16/16] selftests/bpf: Tests for BPF_SK_LOOKUP attach point
  2020-07-09  4:28       ` Andrii Nakryiko
@ 2020-07-09 15:54         ` Jakub Sitnicki
  0 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-09 15:54 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Lorenz Bauer, bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, Jul 09, 2020 at 06:28 AM CEST, Andrii Nakryiko wrote:
> On Thu, Jul 2, 2020 at 6:00 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Thu, Jul 02, 2020 at 01:01 PM CEST, Lorenz Bauer wrote:
>> > On Thu, 2 Jul 2020 at 10:24, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >>
>> >> Add tests to test_progs that exercise:
>> >>
>> >>  - attaching/detaching/querying programs to BPF_SK_LOOKUP hook,
>> >>  - redirecting socket lookup to a socket selected by BPF program,
>> >>  - failing a socket lookup on BPF program's request,
>> >>  - error scenarios for selecting a socket from BPF program,
>> >>  - accessing BPF program context,
>> >>  - attaching and running multiple BPF programs.
>> >>
>> >> Run log:
>> >> | # ./test_progs -n 68
>> >> | #68/1 query lookup prog:OK
>> >> | #68/2 TCP IPv4 redir port:OK
>> >> | #68/3 TCP IPv4 redir addr:OK
>> >> | #68/4 TCP IPv4 redir with reuseport:OK
>> >> | #68/5 TCP IPv4 redir skip reuseport:OK
>> >> | #68/6 TCP IPv6 redir port:OK
>> >> | #68/7 TCP IPv6 redir addr:OK
>> >> | #68/8 TCP IPv4->IPv6 redir port:OK
>> >> | #68/9 TCP IPv6 redir with reuseport:OK
>> >> | #68/10 TCP IPv6 redir skip reuseport:OK
>> >> | #68/11 UDP IPv4 redir port:OK
>> >> | #68/12 UDP IPv4 redir addr:OK
>> >> | #68/13 UDP IPv4 redir with reuseport:OK
>> >> | #68/14 UDP IPv4 redir skip reuseport:OK
>> >> | #68/15 UDP IPv6 redir port:OK
>> >> | #68/16 UDP IPv6 redir addr:OK
>> >> | #68/17 UDP IPv4->IPv6 redir port:OK
>> >> | #68/18 UDP IPv6 redir and reuseport:OK
>> >> | #68/19 UDP IPv6 redir skip reuseport:OK
>> >> | #68/20 TCP IPv4 drop on lookup:OK
>> >> | #68/21 TCP IPv6 drop on lookup:OK
>> >> | #68/22 UDP IPv4 drop on lookup:OK
>> >> | #68/23 UDP IPv6 drop on lookup:OK
>> >> | #68/24 TCP IPv4 drop on reuseport:OK
>> >> | #68/25 TCP IPv6 drop on reuseport:OK
>> >> | #68/26 UDP IPv4 drop on reuseport:OK
>> >> | #68/27 TCP IPv6 drop on reuseport:OK
>> >> | #68/28 sk_assign returns EEXIST:OK
>> >> | #68/29 sk_assign honors F_REPLACE:OK
>> >> | #68/30 access ctx->sk:OK
>> >> | #68/31 sk_assign rejects TCP established:OK
>> >> | #68/32 sk_assign rejects UDP connected:OK
>> >> | #68/33 multi prog - pass, pass:OK
>> >> | #68/34 multi prog - pass, inval:OK
>> >> | #68/35 multi prog - inval, pass:OK
>> >> | #68/36 multi prog - drop, drop:OK
>> >> | #68/37 multi prog - pass, drop:OK
>> >> | #68/38 multi prog - drop, pass:OK
>> >> | #68/39 multi prog - drop, inval:OK
>> >> | #68/40 multi prog - inval, drop:OK
>> >> | #68/41 multi prog - pass, redir:OK
>> >> | #68/42 multi prog - redir, pass:OK
>> >> | #68/43 multi prog - drop, redir:OK
>> >> | #68/44 multi prog - redir, drop:OK
>> >> | #68/45 multi prog - inval, redir:OK
>> >> | #68/46 multi prog - redir, inval:OK
>> >> | #68/47 multi prog - redir, redir:OK
>> >> | #68 sk_lookup:OK
>> >> | Summary: 1/47 PASSED, 0 SKIPPED, 0 FAILED
>> >>
>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> ---
>> >>
>> >> Notes:
>> >>     v3:
>> >>     - Extend tests to cover new functionality in v3:
>> >>       - multi-prog attachments (query, running, verdict precedence)
>> >>       - socket selecting for the second time with bpf_sk_assign
>> >>       - skipping over reuseport load-balancing
>> >>
>> >>     v2:
>> >>      - Adjust for fields renames in struct bpf_sk_lookup.
>> >>
>> >>  .../selftests/bpf/prog_tests/sk_lookup.c      | 1353 +++++++++++++++++
>> >>  .../selftests/bpf/progs/test_sk_lookup_kern.c |  399 +++++
>> >>  2 files changed, 1752 insertions(+)
>> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sk_lookup.c
>> >>  create mode 100644 tools/testing/selftests/bpf/progs/test_sk_lookup_kern.c
>> >>
>> >> diff --git a/tools/testing/selftests/bpf/prog_tests/sk_lookup.c b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
>> >> new file mode 100644
>> >> index 000000000000..2859dc7e65b0
>> >> --- /dev/null
>> >> +++ b/tools/testing/selftests/bpf/prog_tests/sk_lookup.c
>>
>> [...]
>>
>
> [...]
>
>> >> +static void run_lookup_prog(const struct test *t)
>> >> +{
>> >> +       int client_fd, server_fds[MAX_SERVERS] = { -1 };
>> >> +       struct bpf_link *lookup_link;
>> >> +       int i, err;
>> >> +
>> >> +       lookup_link = attach_lookup_prog(t->lookup_prog);
>> >> +       if (!lookup_link)
>> >
>> > Why doesn't this fail the test? Same for the other error paths in the
>> > function, and the other helpers.
>>
>> I took the approach of placing CHECK_FAIL checks only right after the
>> failure point. So a syscall or a call to libbpf.
>>
>> This way if I'm calling a helper, I know it already fails the test if
>> anything goes wrong, and I can have less CHECK_FAILs peppered over the
>> code.
>
> Please prefer CHECK() over CHECK_FAIL(), unless you are making
> hundreds of checks and it's extremely unlikely they will ever fail.
> Using CHECK_FAIL() makes it hard to even tell where the test fails.
> CHECK() leaves a trail, so it's easier to pinpoint what failed and why.

I'll convert it in v4. I wrote most of these tests before we chatted
about CHECK vs CHECK_FAIL some time ago and just haven't gotten around
to it so far.

>
>
> [...]



* Re: [PATCH bpf-next v3 01/16] bpf, netns: Handle multiple link attachments
  2020-07-09 12:49     ` Jakub Sitnicki
@ 2020-07-09 22:02       ` Andrii Nakryiko
  2020-07-10 19:23         ` Jakub Sitnicki
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2020-07-09 22:02 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, Jul 9, 2020 at 5:49 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Thu, Jul 09, 2020 at 05:44 AM CEST, Andrii Nakryiko wrote:
> > On Thu, Jul 2, 2020 at 2:24 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> Extend the BPF netns link callbacks to rebuild (grow/shrink) or update the
> >> prog_array at given position when link gets attached/updated/released.
> >>
> >> This lets us lift the limit of having just one link attached for the new
> >> attach type introduced by a subsequent patch.
> >>
> >> No functional changes intended.
> >>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >>
> >> Notes:
> >>     v3:
> >>     - New in v3 to support multi-prog attachments. (Alexei)
> >>
> >>  include/linux/bpf.h        |  4 ++
> >>  kernel/bpf/core.c          | 22 ++++++++++
> >>  kernel/bpf/net_namespace.c | 88 +++++++++++++++++++++++++++++++++++---
> >>  3 files changed, 107 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index 3d2ade703a35..26bc70533db0 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -928,6 +928,10 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
> >>
> >>  void bpf_prog_array_delete_safe(struct bpf_prog_array *progs,
> >>                                 struct bpf_prog *old_prog);
> >> +void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
> >> +                                  unsigned int index);
> >> +void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
> >> +                             struct bpf_prog *prog);
> >>  int bpf_prog_array_copy_info(struct bpf_prog_array *array,
> >>                              u32 *prog_ids, u32 request_cnt,
> >>                              u32 *prog_cnt);
> >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> >> index 9df4cc9a2907..d4b3b9ee6bf1 100644
> >> --- a/kernel/bpf/core.c
> >> +++ b/kernel/bpf/core.c
> >> @@ -1958,6 +1958,28 @@ void bpf_prog_array_delete_safe(struct bpf_prog_array *array,
> >>                 }
> >>  }
> >>
> >> +void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
> >> +                                  unsigned int index)
> >> +{
> >> +       bpf_prog_array_update_at(array, index, &dummy_bpf_prog.prog);
> >> +}
> >> +
> >> +void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
> >> +                             struct bpf_prog *prog)
> >
> > it's a good idea to mention it in a comment for both delete_safe_at
> > and update_at that slots with dummy entries are ignored.
>
> I agree. These two need doc comments. update_at doesn't even hint that
> this is not a regular update operation. Will add in v4.
>
> >
> > Also, given that index can be out of bounds, should these functions
> > actually return error if the slot is not found?
>
> That won't hurt. I mean, from bpf-netns PoV getting such an error would
> indicate that there is a bug in the code that manages prog_array. But
> perhaps other future users of this new prog_array API can benefit.
>
> >
> >> +{
> >> +       struct bpf_prog_array_item *item;
> >> +
> >> +       for (item = array->items; item->prog; item++) {
> >> +               if (item->prog == &dummy_bpf_prog.prog)
> >> +                       continue;
> >> +               if (!index) {
> >> +                       WRITE_ONCE(item->prog, prog);
> >> +                       break;
> >> +               }
> >> +               index--;
> >> +       }
> >> +}
> >> +
> >>  int bpf_prog_array_copy(struct bpf_prog_array *old_array,
> >>                         struct bpf_prog *exclude_prog,
> >>                         struct bpf_prog *include_prog,
> >> diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
> >> index 247543380fa6..6011122c35b6 100644
> >> --- a/kernel/bpf/net_namespace.c
> >> +++ b/kernel/bpf/net_namespace.c
> >> @@ -36,11 +36,51 @@ static void netns_bpf_run_array_detach(struct net *net,
> >>         bpf_prog_array_free(run_array);
> >>  }
> >>
> >> +static unsigned int link_index(struct net *net,
> >> +                              enum netns_bpf_attach_type type,
> >> +                              struct bpf_netns_link *link)
> >> +{
> >> +       struct bpf_netns_link *pos;
> >> +       unsigned int i = 0;
> >> +
> >> +       list_for_each_entry(pos, &net->bpf.links[type], node) {
> >> +               if (pos == link)
> >> +                       return i;
> >> +               i++;
> >> +       }
> >> +       return UINT_MAX;
> >
> > Why not return a negative error, if the slot is not found? Feels a bit
> > unusual as far as error reporting goes.
>
> Returning uint played well with the consumer of link_index() return
> value, that is bpf_prog_array_update_at(). update_at takes an index into
> the array, which must not be negative.

Yeah, it did, but it's also quite implicit. I think just doing
BUG_ON() for update_at or delete_at would be good enough there.

>
> But I don't have strong feelings toward it. Will switch to -ENOENT in
> v4.
>
> >
> >> +}
> >> +
> >
> > [...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-07-09 13:25     ` Jakub Sitnicki
@ 2020-07-09 23:09       ` Andrii Nakryiko
  2020-07-10  8:55         ` Jakub Sitnicki
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2020-07-09 23:09 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Marek Majkowski

On Thu, Jul 9, 2020 at 6:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Thu, Jul 09, 2020 at 06:08 AM CEST, Andrii Nakryiko wrote:
> > On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
> >> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
> >> when looking up a listening socket for a new connection request for
> >> connection oriented protocols, or when looking up an unconnected socket for
> >> a packet for connection-less protocols.
> >>
> >> When called, SK_LOOKUP BPF program can select a socket that will receive
> >> the packet. This serves as a mechanism to overcome the limits of what
> >> bind() API allows to express. Two use-cases driving this work are:
> >>
> >>  (1) steer packets destined to an IP range, on fixed port to a socket
> >>
> >>      192.0.2.0/24, port 80 -> NGINX socket
> >>
> >>  (2) steer packets destined to an IP address, on any port to a socket
> >>
> >>      198.51.100.1, any port -> L7 proxy socket
> >>
> >> In its run-time context program receives information about the packet that
> >> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> >> address 4-tuple. Context can be further extended to include ingress
> >> interface identifier.
> >>
> >> To select a socket BPF program fetches it from a map holding socket
> >> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> >> helper to record the selection. Transport layer then uses the selected
> >> socket as a result of socket lookup.
> >>
> >> This patch only enables the user to attach an SK_LOOKUP program to a
> >> network namespace. Subsequent patches hook it up to run on local delivery
> >> path in ipv4 and ipv6 stacks.
> >>
> >> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >>
> >> Notes:
> >>     v3:
> >>     - Allow bpf_sk_assign helper to replace previously selected socket only
> >>       when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
> >>       programs running in series to accidentally override each other's verdict.
> >>     - Let BPF program decide that load-balancing within a reuseport socket group
> >>       should be skipped for the socket selected with bpf_sk_assign() by passing
> >>       BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
> >>     - Extend struct bpf_sk_lookup program context with an 'sk' field containing
> >>       the selected socket with an intention for multiple attached program
> >>       running in series to see each other's choices. However, currently the
> >>       verifier doesn't allow checking if pointer is set.
> >>     - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
> >>     - Get rid of macros in convert_ctx_access to make it easier to read.
> >>     - Disallow 1-,2-byte access to context fields containing IP addresses.
> >>
> >>     v2:
> >>     - Make bpf_sk_assign reject sockets that don't use RCU freeing.
> >>       Update bpf_sk_assign docs accordingly. (Martin)
> >>     - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
> >>     - Fix broken build when CONFIG_INET is not selected. (Martin)
> >>     - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
> >>     - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
> >>
> >>  include/linux/bpf-netns.h  |   3 +
> >>  include/linux/bpf_types.h  |   2 +
> >>  include/linux/filter.h     |  19 ++++
> >>  include/uapi/linux/bpf.h   |  74 +++++++++++++++
> >>  kernel/bpf/net_namespace.c |   5 +
> >>  kernel/bpf/syscall.c       |   9 ++
> >>  net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
> >>  scripts/bpf_helpers_doc.py |   9 +-
> >>  8 files changed, 306 insertions(+), 1 deletion(-)
> >>
>
> [...]
>
> >> +
> >> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
> >> +                                       const struct bpf_insn *si,
> >> +                                       struct bpf_insn *insn_buf,
> >> +                                       struct bpf_prog *prog,
> >> +                                       u32 *target_size)
> >
> > Would it be too extreme to rely on BTF and direct memory access
> > (similar to tp_raw, fentry/fexit, etc) for accessing context fields,
> > instead of all this assembly rewrites? So instead of having
> > bpf_sk_lookup and bpf_sk_lookup_kern, it will always be a full variant
> > (bpf_sk_lookup_kern, or however we'd want to name it then) and
> > verifier will just ensure that direct memory reads go to the right
> > field boundaries?
>
> Sounds like a decision related to long-term vision. I'd appreciate input
> from maintainers if this is the direction we want to go in.
>
> From implementation PoV - hard for me to say what would be needed to get
> it working, I'm not familiar how BPF_TRACE_* attach types provide access
> to context, so I'd need to look around and prototype it
> first. (Actually, I'm not sure if you're asking if it is doable or you
> already know?)

I'm pretty sure it's doable with what we have in verifier, but I'm not
sure about all the details and amount of work. So consider this an
initiation of a medium-term discussion. I was also curious to hear an
opinion from Alexei and Daniel whether that would be the right way
to do this moving forward (not necessarily with your changes, though).

>
> Off the top of my head, I have one concern, I'm exposing the selected
> socket in the context. This is for the benefit of one program being
> aware of other program's selection, if multiple programs are attached.
>
> I understand that any piece of data reachable from struct sock *, would
> be readable by SK_LOOKUP prog (writes can be blocked in
> is_valid_access). And that this is a desired property for tracing. Not
> sure how to limit it for a network program that doesn't need all that
> info.
>
> >
> >> +{
> >> +       struct bpf_insn *insn = insn_buf;
> >> +#if IS_ENABLED(CONFIG_IPV6)
> >> +       int off;
> >> +#endif
> >> +
> >
> > [...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type
  2020-07-09 15:51     ` Jakub Sitnicki
@ 2020-07-09 23:13       ` Andrii Nakryiko
  2020-07-10  8:37         ` Jakub Sitnicki
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2020-07-09 23:13 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Thu, Jul 9, 2020 at 8:51 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Thu, Jul 09, 2020 at 06:23 AM CEST, Andrii Nakryiko wrote:
> > On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> Make libbpf aware of the newly added program type, and assign it a
> >> section name.
> >>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >>
> >> Notes:
> >>     v3:
> >>     - Move new libbpf symbols to version 0.1.0.
> >>     - Set expected_attach_type in probe_load for new prog type.
> >>
> >>     v2:
> >>     - Add new libbpf symbols to version 0.0.9. (Andrii)
> >>
> >>  tools/lib/bpf/libbpf.c        | 3 +++
> >>  tools/lib/bpf/libbpf.h        | 2 ++
> >>  tools/lib/bpf/libbpf.map      | 2 ++
> >>  tools/lib/bpf/libbpf_probes.c | 3 +++
> >>  4 files changed, 10 insertions(+)
> >>
> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> >> index 4ea7f4f1a691..ddcbb5dd78df 100644
> >> --- a/tools/lib/bpf/libbpf.c
> >> +++ b/tools/lib/bpf/libbpf.c
> >> @@ -6793,6 +6793,7 @@ BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
> >>  BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
> >>  BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
> >>  BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
> >> +BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
> >>
> >>  enum bpf_attach_type
> >>  bpf_program__get_expected_attach_type(struct bpf_program *prog)
> >> @@ -6969,6 +6970,8 @@ static const struct bpf_sec_def section_defs[] = {
> >>         BPF_EAPROG_SEC("cgroup/setsockopt",     BPF_PROG_TYPE_CGROUP_SOCKOPT,
> >>                                                 BPF_CGROUP_SETSOCKOPT),
> >>         BPF_PROG_SEC("struct_ops",              BPF_PROG_TYPE_STRUCT_OPS),
> >> +       BPF_EAPROG_SEC("sk_lookup",             BPF_PROG_TYPE_SK_LOOKUP,
> >> +                                               BPF_SK_LOOKUP),
> >
> > So it's a BPF_PROG_TYPE_SK_LOOKUP with attach type BPF_SK_LOOKUP. What
> > other potential attach types could there be for
> > BPF_PROG_TYPE_SK_LOOKUP? How the section name will look like in that
> > case?
>
> BPF_PROG_TYPE_SK_LOOKUP won't have any other attach types that I can
> forsee. There is a single attach type shared by tcp4, tcp6, udp4, and
> udp6 hook points. If we hook it up in the future say to sctp, I expect
> the same attach point will be reused.

So you needed to add to bpf_attach_type just to fit into link_create
model of attach_type -> prog_type, right? As I mentioned extending
bpf_attach_type has a real cost on each cgroup, so we either need to
solve that problem (and I think that would be the best) or we can
change link_create logic to not require attach_type for programs like
SK_LOOKUP, where it's clear without attach type.

Second order question was if we have another attach type, having
SEC("sk_lookup/just_kidding_something_else") would be a bit weird :)
But it seems like that's not a concern.

>
> >
> >>  };
> >>

[...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type
  2020-07-09 23:13       ` Andrii Nakryiko
@ 2020-07-10  8:37         ` Jakub Sitnicki
  2020-07-10 18:55           ` Andrii Nakryiko
  0 siblings, 1 reply; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-10  8:37 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Fri, Jul 10, 2020 at 01:13 AM CEST, Andrii Nakryiko wrote:
> On Thu, Jul 9, 2020 at 8:51 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Thu, Jul 09, 2020 at 06:23 AM CEST, Andrii Nakryiko wrote:
>> > On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >>
>> >> Make libbpf aware of the newly added program type, and assign it a
>> >> section name.
>> >>
>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> ---
>> >>
>> >> Notes:
>> >>     v3:
>> >>     - Move new libbpf symbols to version 0.1.0.
>> >>     - Set expected_attach_type in probe_load for new prog type.
>> >>
>> >>     v2:
>> >>     - Add new libbpf symbols to version 0.0.9. (Andrii)
>> >>
>> >>  tools/lib/bpf/libbpf.c        | 3 +++
>> >>  tools/lib/bpf/libbpf.h        | 2 ++
>> >>  tools/lib/bpf/libbpf.map      | 2 ++
>> >>  tools/lib/bpf/libbpf_probes.c | 3 +++
>> >>  4 files changed, 10 insertions(+)
>> >>
>> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> >> index 4ea7f4f1a691..ddcbb5dd78df 100644
>> >> --- a/tools/lib/bpf/libbpf.c
>> >> +++ b/tools/lib/bpf/libbpf.c
>> >> @@ -6793,6 +6793,7 @@ BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
>> >>  BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
>> >>  BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
>> >>  BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
>> >> +BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
>> >>
>> >>  enum bpf_attach_type
>> >>  bpf_program__get_expected_attach_type(struct bpf_program *prog)
>> >> @@ -6969,6 +6970,8 @@ static const struct bpf_sec_def section_defs[] = {
>> >>         BPF_EAPROG_SEC("cgroup/setsockopt",     BPF_PROG_TYPE_CGROUP_SOCKOPT,
>> >>                                                 BPF_CGROUP_SETSOCKOPT),
>> >>         BPF_PROG_SEC("struct_ops",              BPF_PROG_TYPE_STRUCT_OPS),
>> >> +       BPF_EAPROG_SEC("sk_lookup",             BPF_PROG_TYPE_SK_LOOKUP,
>> >> +                                               BPF_SK_LOOKUP),
>> >
>> > So it's a BPF_PROG_TYPE_SK_LOOKUP with attach type BPF_SK_LOOKUP. What
>> > other potential attach types could there be for
>> > BPF_PROG_TYPE_SK_LOOKUP? How the section name will look like in that
>> > case?
>>
>> BPF_PROG_TYPE_SK_LOOKUP won't have any other attach types that I can
>> foresee. There is a single attach type shared by tcp4, tcp6, udp4, and
>> udp6 hook points. If we hook it up in the future say to sctp, I expect
>> the same attach point will be reused.
>
> So you needed to add to bpf_attach_type just to fit into link_create
> model of attach_type -> prog_type, right? As I mentioned extending
> bpf_attach_type has a real cost on each cgroup, so we either need to
> solve that problem (and I think that would be the best) or we can
> change link_create logic to not require attach_type for programs like
> SK_LOOKUP, where it's clear without attach type.

Right. I was thinking about that a bit. For prog types that map 1:1 to an
attach type, like flow_dissector or the proposed sk_lookup, we don't
really need to know the attach type to attach the program.

PROG_QUERY is more problematic though. But I imagine we could introduce
a flag like BPF_QUERY_F_BY_PROG_TYPE that would make the kernel
interpret attr->query.attach_type as prog type.

PROG_DETACH is yet another story but sk_lookup uses only link-based
attachment, so I'm ignoring it here.

What also might get in the way is the fact that there is no
bpf_attach_type value reserved for an unspecified attach type at the
moment. We would have to ensure that the first enum,
BPF_CGROUP_INET_INGRESS, is not treated as an exact attach type.

>
> Second order question was if we have another attach type, having
> SEC("sk_lookup/just_kidding_something_else") would be a bit weird :)
> But it seems like that's not a concern.

Yes. Sorry, I didn't mean to leave it unanswered. Just assumed that it
was obvious that it's not the case.

I've been happily using the part of section name following "sk_lookup"
prefix to name the programs just to make section names in ELF object
unique:

  SEC("sk_lookup/lookup_pass")
  SEC("sk_lookup/lookup_drop")
  SEC("sk_lookup/redir_port")

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
  2020-07-09 23:09       ` Andrii Nakryiko
@ 2020-07-10  8:55         ` Jakub Sitnicki
  0 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-10  8:55 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski,
	Marek Majkowski

On Fri, Jul 10, 2020 at 01:09 AM CEST, Andrii Nakryiko wrote:
> On Thu, Jul 9, 2020 at 6:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Thu, Jul 09, 2020 at 06:08 AM CEST, Andrii Nakryiko wrote:
>> > On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >>
>> >> Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
>> >> BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
>> >> when looking up a listening socket for a new connection request for
>> >> connection oriented protocols, or when looking up an unconnected socket for
>> >> a packet for connection-less protocols.
>> >>
>> >> When called, SK_LOOKUP BPF program can select a socket that will receive
>> >> the packet. This serves as a mechanism to overcome the limits of what
>> >> bind() API allows to express. Two use-cases driving this work are:
>> >>
>> >>  (1) steer packets destined to an IP range, on fixed port to a socket
>> >>
>> >>      192.0.2.0/24, port 80 -> NGINX socket
>> >>
>> >>  (2) steer packets destined to an IP address, on any port to a socket
>> >>
>> >>      198.51.100.1, any port -> L7 proxy socket
>> >>
>> >> In its run-time context program receives information about the packet that
>> >> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
>> >> address 4-tuple. Context can be further extended to include ingress
>> >> interface identifier.
>> >>
>> >> To select a socket BPF program fetches it from a map holding socket
>> >> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
>> >> helper to record the selection. Transport layer then uses the selected
>> >> socket as a result of socket lookup.
>> >>
>> >> This patch only enables the user to attach an SK_LOOKUP program to a
>> >> network namespace. Subsequent patches hook it up to run on local delivery
>> >> path in ipv4 and ipv6 stacks.
>> >>
>> >> Suggested-by: Marek Majkowski <marek@cloudflare.com>
>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> ---
>> >>
>> >> Notes:
>> >>     v3:
>> >>     - Allow bpf_sk_assign helper to replace previously selected socket only
>> >>       when BPF_SK_LOOKUP_F_REPLACE flag is set, as a precaution for multiple
>> >>       programs running in series to accidentally override each other's verdict.
>> >>     - Let BPF program decide that load-balancing within a reuseport socket group
>> >>       should be skipped for the socket selected with bpf_sk_assign() by passing
>> >>       BPF_SK_LOOKUP_F_NO_REUSEPORT flag. (Martin)
>> >>     - Extend struct bpf_sk_lookup program context with an 'sk' field containing
>> >>       the selected socket with an intention for multiple attached program
>> >>       running in series to see each other's choices. However, currently the
>> >>       verifier doesn't allow checking if pointer is set.
>> >>     - Use bpf-netns infra for link-based multi-program attachment. (Alexei)
>> >>     - Get rid of macros in convert_ctx_access to make it easier to read.
>> >>     - Disallow 1-,2-byte access to context fields containing IP addresses.
>> >>
>> >>     v2:
>> >>     - Make bpf_sk_assign reject sockets that don't use RCU freeing.
>> >>       Update bpf_sk_assign docs accordingly. (Martin)
>> >>     - Change bpf_sk_assign proto to take PTR_TO_SOCKET as argument. (Martin)
>> >>     - Fix broken build when CONFIG_INET is not selected. (Martin)
>> >>     - Rename bpf_sk_lookup{} src_/dst_* fields remote_/local_*. (Martin)
>> >>     - Enforce BPF_SK_LOOKUP attach point on load & attach. (Martin)
>> >>
>> >>  include/linux/bpf-netns.h  |   3 +
>> >>  include/linux/bpf_types.h  |   2 +
>> >>  include/linux/filter.h     |  19 ++++
>> >>  include/uapi/linux/bpf.h   |  74 +++++++++++++++
>> >>  kernel/bpf/net_namespace.c |   5 +
>> >>  kernel/bpf/syscall.c       |   9 ++
>> >>  net/core/filter.c          | 186 +++++++++++++++++++++++++++++++++++++
>> >>  scripts/bpf_helpers_doc.py |   9 +-
>> >>  8 files changed, 306 insertions(+), 1 deletion(-)
>> >>
>>
>> [...]
>>
>> >> +
>> >> +static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
>> >> +                                       const struct bpf_insn *si,
>> >> +                                       struct bpf_insn *insn_buf,
>> >> +                                       struct bpf_prog *prog,
>> >> +                                       u32 *target_size)
>> >
>> > Would it be too extreme to rely on BTF and direct memory access
>> > (similar to tp_raw, fentry/fexit, etc) for accessing context fields,
>> > instead of all this assembly rewrites? So instead of having
>> > bpf_sk_lookup and bpf_sk_lookup_kern, it will always be a full variant
>> > (bpf_sk_lookup_kern, or however we'd want to name it then) and
>> > verifier will just ensure that direct memory reads go to the right
>> > field boundaries?
>>
>> Sounds like a decision related to long-term vision. I'd appreciate input
>> from maintainers if this is the direction we want to go in.
>>
>> From implementation PoV - hard for me to say what would be needed to get
>> it working, I'm not familiar how BPF_TRACE_* attach types provide access
>> to context, so I'd need to look around and prototype it
>> first. (Actually, I'm not sure if you're asking if it is doable or you
>> already know?)
>
> I'm pretty sure it's doable with what we have in verifier, but I'm not
> sure about all the details and amount of work. So consider this an
> initiation of a medium-term discussion. I was also curious to hear an
> opinion from Alexei and Daniel whether that would be the right way
> to do this moving forward (not necessarily with your changes, though).

From my side I can vouch that convert_ctx_access is not easy to get
right (at least for me) when the backing structure is non-trivial,
e.g. has pointers or unions.

v4 will contain two fixes exactly in this area. I also have a patch for
how verifier handles narrow loads when load size <= target field size <
ctx field size.

That is to say, any alternative approach that "automates" this would be
very welcome.

I've accumulated quite a few changes already since v3, so I was planning
to roll out v4 to keep things moving while we continue the discussion.

>
>>
>> Off the top of my head, I have one concern, I'm exposing the selected
>> socket in the context. This is for the benefit of one program being
>> aware of other program's selection, if multiple programs are attached.
>>
>> I understand that any piece of data reachable from struct sock *, would
>> be readable by SK_LOOKUP prog (writes can be blocked in
>> is_valid_access). And that this is a desired property for tracing. Not
>> sure how to limit it for a network program that doesn't need all that
>> info.
>>
>> >
>> >> +{
>> >> +       struct bpf_insn *insn = insn_buf;
>> >> +#if IS_ENABLED(CONFIG_IPV6)
>> >> +       int off;
>> >> +#endif
>> >> +
>> >
>> > [...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type
  2020-07-10  8:37         ` Jakub Sitnicki
@ 2020-07-10 18:55           ` Andrii Nakryiko
  2020-07-10 19:24             ` Jakub Sitnicki
  0 siblings, 1 reply; 51+ messages in thread
From: Andrii Nakryiko @ 2020-07-10 18:55 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Fri, Jul 10, 2020 at 1:37 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Fri, Jul 10, 2020 at 01:13 AM CEST, Andrii Nakryiko wrote:
> > On Thu, Jul 9, 2020 at 8:51 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> On Thu, Jul 09, 2020 at 06:23 AM CEST, Andrii Nakryiko wrote:
> >> > On Thu, Jul 2, 2020 at 2:25 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >>
> >> >> Make libbpf aware of the newly added program type, and assign it a
> >> >> section name.
> >> >>
> >> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> >> ---
> >> >>
> >> >> Notes:
> >> >>     v3:
> >> >>     - Move new libbpf symbols to version 0.1.0.
> >> >>     - Set expected_attach_type in probe_load for new prog type.
> >> >>
> >> >>     v2:
> >> >>     - Add new libbpf symbols to version 0.0.9. (Andrii)
> >> >>
> >> >>  tools/lib/bpf/libbpf.c        | 3 +++
> >> >>  tools/lib/bpf/libbpf.h        | 2 ++
> >> >>  tools/lib/bpf/libbpf.map      | 2 ++
> >> >>  tools/lib/bpf/libbpf_probes.c | 3 +++
> >> >>  4 files changed, 10 insertions(+)
> >> >>
> >> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> >> >> index 4ea7f4f1a691..ddcbb5dd78df 100644
> >> >> --- a/tools/lib/bpf/libbpf.c
> >> >> +++ b/tools/lib/bpf/libbpf.c
> >> >> @@ -6793,6 +6793,7 @@ BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
> >> >>  BPF_PROG_TYPE_FNS(tracing, BPF_PROG_TYPE_TRACING);
> >> >>  BPF_PROG_TYPE_FNS(struct_ops, BPF_PROG_TYPE_STRUCT_OPS);
> >> >>  BPF_PROG_TYPE_FNS(extension, BPF_PROG_TYPE_EXT);
> >> >> +BPF_PROG_TYPE_FNS(sk_lookup, BPF_PROG_TYPE_SK_LOOKUP);
> >> >>
> >> >>  enum bpf_attach_type
> >> >>  bpf_program__get_expected_attach_type(struct bpf_program *prog)
> >> >> @@ -6969,6 +6970,8 @@ static const struct bpf_sec_def section_defs[] = {
> >> >>         BPF_EAPROG_SEC("cgroup/setsockopt",     BPF_PROG_TYPE_CGROUP_SOCKOPT,
> >> >>                                                 BPF_CGROUP_SETSOCKOPT),
> >> >>         BPF_PROG_SEC("struct_ops",              BPF_PROG_TYPE_STRUCT_OPS),
> >> >> +       BPF_EAPROG_SEC("sk_lookup",             BPF_PROG_TYPE_SK_LOOKUP,
> >> >> +                                               BPF_SK_LOOKUP),
> >> >
> >> > So it's a BPF_PROG_TYPE_SK_LOOKUP with attach type BPF_SK_LOOKUP. What
> >> > other potential attach types could there be for
> >> > BPF_PROG_TYPE_SK_LOOKUP? How the section name will look like in that
> >> > case?
> >>
> >> BPF_PROG_TYPE_SK_LOOKUP won't have any other attach types that I can
> >> foresee. There is a single attach type shared by tcp4, tcp6, udp4, and
> >> udp6 hook points. If we hook it up in the future say to sctp, I expect
> >> the same attach point will be reused.
> >
> > So you needed to add to bpf_attach_type just to fit into link_create
> > model of attach_type -> prog_type, right? As I mentioned extending
> > bpf_attach_type has a real cost on each cgroup, so we either need to
> > solve that problem (and I think that would be the best) or we can
> > change link_create logic to not require attach_type for programs like
> > SK_LOOKUP, where it's clear without attach type.
>
> Right. I was thinking about that a bit. For prog types that map 1:1 to an
> attach type, like flow_dissector or the proposed sk_lookup, we don't
> really need to know the attach type to attach the program.
>
> PROG_QUERY is more problematic though. But I imagine we could introduce
> a flag like BPF_QUERY_F_BY_PROG_TYPE that would make the kernel
> interpret attr->query.attach_type as prog type.
>
> PROG_DETACH is yet another story but sk_lookup uses only link-based
> attachment, so I'm ignoring it here.
>
> What also might get in the way is the fact that there is no
> bpf_attach_type value reserved for unspecified attach type at the
> moment. We would have to ensure that the first enum,
> BPF_CGROUP_INET_INGRESS, is not treated as an exact attach type.
>

I think we should just solve this for cgroup the same way you did it
for netns. We'll keep adding new attach types regardless, so better
solve the problem, rather than artificially avoid it.


> >
> > Second order question was if we have another attach type, having
> > SEC("sk_lookup/just_kidding_something_else") would be a bit weird :)
> > But it seems like that's not a concern.
>
> Yes. Sorry, I didn't mean to leave it unanswered. Just assumed that it
> was obvious that it's not the case.
>
> I've been happily using the part of section name following "sk_lookup"
> prefix to name the programs just to make section names in ELF object
> unique:
>
>   SEC("sk_lookup/lookup_pass")
>   SEC("sk_lookup/lookup_drop")
>   SEC("sk_lookup/redir_port")

oh, right, which reminds me: how about adding / to sk_lookup in that
libbpf table, so that it's always sk_lookup/<something> for section
name? We did a similar change to xdp_devmap recently, and it seems like
a good trend overall to have / separation between program type and
whatever extra name user wants to give?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 01/16] bpf, netns: Handle multiple link attachments
  2020-07-09 22:02       ` Andrii Nakryiko
@ 2020-07-10 19:23         ` Jakub Sitnicki
  0 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-10 19:23 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Fri, Jul 10, 2020 at 12:02 AM CEST, Andrii Nakryiko wrote:
> On Thu, Jul 9, 2020 at 5:49 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Thu, Jul 09, 2020 at 05:44 AM CEST, Andrii Nakryiko wrote:
>> > On Thu, Jul 2, 2020 at 2:24 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >>
>> >> Extend the BPF netns link callbacks to rebuild (grow/shrink) or update the
>> >> prog_array at given position when link gets attached/updated/released.
>> >>
>> >> This let's us lift the limit of having just one link attached for the new
>> >> attach type introduced by subsequent patch.
>> >>
>> >> No functional changes intended.
>> >>
>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> ---
>> >>
>> >> Notes:
>> >>     v3:
>> >>     - New in v3 to support multi-prog attachments. (Alexei)
>> >>
>> >>  include/linux/bpf.h        |  4 ++
>> >>  kernel/bpf/core.c          | 22 ++++++++++
>> >>  kernel/bpf/net_namespace.c | 88 +++++++++++++++++++++++++++++++++++---
>> >>  3 files changed, 107 insertions(+), 7 deletions(-)
>> >>
>> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> >> index 3d2ade703a35..26bc70533db0 100644
>> >> --- a/include/linux/bpf.h
>> >> +++ b/include/linux/bpf.h
>> >> @@ -928,6 +928,10 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
>> >>
>> >>  void bpf_prog_array_delete_safe(struct bpf_prog_array *progs,
>> >>                                 struct bpf_prog *old_prog);
>> >> +void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
>> >> +                                  unsigned int index);
>> >> +void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
>> >> +                             struct bpf_prog *prog);
>> >>  int bpf_prog_array_copy_info(struct bpf_prog_array *array,
>> >>                              u32 *prog_ids, u32 request_cnt,
>> >>                              u32 *prog_cnt);
>> >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> >> index 9df4cc9a2907..d4b3b9ee6bf1 100644
>> >> --- a/kernel/bpf/core.c
>> >> +++ b/kernel/bpf/core.c
>> >> @@ -1958,6 +1958,28 @@ void bpf_prog_array_delete_safe(struct bpf_prog_array *array,
>> >>                 }
>> >>  }
>> >>
>> >> +void bpf_prog_array_delete_safe_at(struct bpf_prog_array *array,
>> >> +                                  unsigned int index)
>> >> +{
>> >> +       bpf_prog_array_update_at(array, index, &dummy_bpf_prog.prog);
>> >> +}
>> >> +
>> >> +void bpf_prog_array_update_at(struct bpf_prog_array *array, unsigned int index,
>> >> +                             struct bpf_prog *prog)
>> >
>> > it's a good idea to mention it in a comment for both delete_safe_at
>> > and update_at that slots with dummy entries are ignored.
>>
>> I agree. These two need doc comments. update_at doesn't even hint that
>> this is not a regular update operation. Will add in v4.
>>
>> >
>> > Also, given that index can be out of bounds, should these functions
>> > actually return error if the slot is not found?
>>
>> That won't hurt. I mean, from bpf-netns PoV getting such an error would
>> indicate that there is a bug in the code that manages prog_array. But
>> perhaps other future users of this new prog_array API can benefit.
>>
>> >
>> >> +{
>> >> +       struct bpf_prog_array_item *item;
>> >> +
>> >> +       for (item = array->items; item->prog; item++) {
>> >> +               if (item->prog == &dummy_bpf_prog.prog)
>> >> +                       continue;
>> >> +               if (!index) {
>> >> +                       WRITE_ONCE(item->prog, prog);
>> >> +                       break;
>> >> +               }
>> >> +               index--;
>> >> +       }
>> >> +}
>> >> +
>> >>  int bpf_prog_array_copy(struct bpf_prog_array *old_array,
>> >>                         struct bpf_prog *exclude_prog,
>> >>                         struct bpf_prog *include_prog,
>> >> diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
>> >> index 247543380fa6..6011122c35b6 100644
>> >> --- a/kernel/bpf/net_namespace.c
>> >> +++ b/kernel/bpf/net_namespace.c
>> >> @@ -36,11 +36,51 @@ static void netns_bpf_run_array_detach(struct net *net,
>> >>         bpf_prog_array_free(run_array);
>> >>  }
>> >>
>> >> +static unsigned int link_index(struct net *net,
>> >> +                              enum netns_bpf_attach_type type,
>> >> +                              struct bpf_netns_link *link)
>> >> +{
>> >> +       struct bpf_netns_link *pos;
>> >> +       unsigned int i = 0;
>> >> +
>> >> +       list_for_each_entry(pos, &net->bpf.links[type], node) {
>> >> +               if (pos == link)
>> >> +                       return i;
>> >> +               i++;
>> >> +       }
>> >> +       return UINT_MAX;
>> >
>> > Why not return a negative error, if the slot is not found? Feels a bit
>> > unusual as far as error reporting goes.
>>
>> Returning uint played well with the consumer of link_index()'s return
>> value, bpf_prog_array_update_at(), which takes an index into the array
>> that must not be negative.
>
> Yeah, it did, but it's also quite implicit. I think just doing
> BUG_ON() for update_at or delete_at would be good enough there.

BUG_ON has been deprecated [0], so I will add a WARN instead.

[0] https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on

[...]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type
  2020-07-10 18:55           ` Andrii Nakryiko
@ 2020-07-10 19:24             ` Jakub Sitnicki
  0 siblings, 0 replies; 51+ messages in thread
From: Jakub Sitnicki @ 2020-07-10 19:24 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, kernel-team, Alexei Starovoitov,
	Daniel Borkmann, David S. Miller, Jakub Kicinski

On Fri, Jul 10, 2020 at 08:55 PM CEST, Andrii Nakryiko wrote:
> On Fri, Jul 10, 2020 at 1:37 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:

[...]

>> I've been happily using the part of section name following "sk_lookup"
>> prefix to name the programs just to make section names in ELF object
>> unique:
>>
>>   SEC("sk_lookup/lookup_pass")
>>   SEC("sk_lookup/lookup_drop")
>>   SEC("sk_lookup/redir_port")
>
> oh, right, which reminds me: how about adding / to sk_lookup in that
> libbpf table, so that it's always sk_lookup/<something> for the section
> name? We did a similar change to xdp_devmap recently, and it seems like
> a good trend overall to have / separation between the program type and
> whatever extra name the user wants to give.

Will do. Thanks for pointing it out. I didn't pick up on it.

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2020-07-10 19:24 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-02  9:24 [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 01/16] bpf, netns: Handle multiple link attachments Jakub Sitnicki
2020-07-09  3:44   ` Andrii Nakryiko
2020-07-09 12:49     ` Jakub Sitnicki
2020-07-09 22:02       ` Andrii Nakryiko
2020-07-10 19:23         ` Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
2020-07-04 18:42   ` Yonghong Song
2020-07-06 11:44     ` Jakub Sitnicki
2020-07-05  9:20   ` kernel test robot
2020-07-05  9:20     ` kernel test robot
2020-07-05  9:20   ` [RFC PATCH] bpf: sk_lookup_prog_ops can be static kernel test robot
2020-07-05  9:20     ` kernel test robot
2020-07-07  9:21   ` [PATCH bpf-next v3 02/16] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
2020-07-09  4:08   ` Andrii Nakryiko
2020-07-09 13:25     ` Jakub Sitnicki
2020-07-09 23:09       ` Andrii Nakryiko
2020-07-10  8:55         ` Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 03/16] inet: Extract helper for selecting socket from reuseport group Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 04/16] inet: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
2020-07-02 10:27   ` Lorenz Bauer
2020-07-02 12:46     ` Jakub Sitnicki
2020-07-02 13:19       ` Lorenz Bauer
2020-07-06 11:24         ` Jakub Sitnicki
2020-07-06 12:06   ` Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 05/16] inet6: Extract helper for selecting socket from reuseport group Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 06/16] inet6: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 07/16] udp: Extract helper for selecting socket from reuseport group Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 08/16] udp: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 09/16] udp6: Extract helper for selecting socket from reuseport group Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 10/16] udp6: Run SK_LOOKUP BPF program on socket lookup Jakub Sitnicki
2020-07-02 14:51   ` kernel test robot
2020-07-02 14:51     ` kernel test robot
2020-07-03 13:04     ` Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 11/16] bpf: Sync linux/bpf.h to tools/ Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 12/16] libbpf: Add support for SK_LOOKUP program type Jakub Sitnicki
2020-07-09  4:23   ` Andrii Nakryiko
2020-07-09 15:51     ` Jakub Sitnicki
2020-07-09 23:13       ` Andrii Nakryiko
2020-07-10  8:37         ` Jakub Sitnicki
2020-07-10 18:55           ` Andrii Nakryiko
2020-07-10 19:24             ` Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 13/16] tools/bpftool: Add name mappings for SK_LOOKUP prog and attach type Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 14/16] selftests/bpf: Add verifier tests for bpf_sk_lookup context access Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 15/16] selftests/bpf: Rename test_sk_lookup_kern.c to test_ref_track_kern.c Jakub Sitnicki
2020-07-02  9:24 ` [PATCH bpf-next v3 16/16] selftests/bpf: Tests for BPF_SK_LOOKUP attach point Jakub Sitnicki
2020-07-02 11:01   ` Lorenz Bauer
2020-07-02 12:59     ` Jakub Sitnicki
2020-07-09  4:28       ` Andrii Nakryiko
2020-07-09 15:54         ` Jakub Sitnicki
2020-07-02 11:05 ` [PATCH bpf-next v3 00/16] Run a BPF program on socket lookup Lorenz Bauer
