bpf.vger.kernel.org archive mirror
* [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt
@ 2021-06-25 20:04 Martin KaFai Lau
  2021-06-25 20:04 ` [PATCH bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos Martin KaFai Lau
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-25 20:04 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yonghong Song, Yuchung Cheng

This set allows the bpf tcp iter to call bpf_setsockopt().

With bpf-tcp-cc, new congestion-control algorithms are rolled out more
often.  Instead of restarting the applications to pick up the new
tcp-cc, this set allows the bpf tcp iter, running with CAP_NET_ADMIN,
to call bpf_setsockopt(TCP_CONGESTION).  It is not limited to
TCP_CONGESTION; the bpf tcp iter can call bpf_setsockopt() with other
options as well.  The bpf tcp iter can read all the fields of a
tcp_sock, so there is a lot of flexibility in selecting the desired sk
for setsockopt(): e.g. it can act on TCP_LISTEN sockets only and leave
the established connections untouched, or check the addr/port, or
check the current tcp-cc name, etc.
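
For example, to switch all listeners to a new tcp-cc while leaving the
established connections alone, the bpf prog could look roughly like
the following.  This is only a simplified sketch of the selftest added
in patch 8; it reuses the selftests' bpf_iter.h and bpf_tracing_net.h
(for the SOL_TCP/TCP_CONGESTION defines) and skips error handling:

#include "bpf_iter.h"
#include "bpf_tracing_net.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

SEC("iter/tcp")
int change_tcp_cc(struct bpf_iter__tcp *ctx)
{
	struct sock_common *skc = ctx->sk_common;
	struct tcp_sock *tp;

	/* ctx->sk_common is NULL-able, so a NULL check is required */
	if (!skc)
		return 0;

	/* Get a tcp_sock pointer accepted by the helpers */
	tp = bpf_skc_to_tcp_sock(skc);
	if (!tp)
		return 0;

	/* Only act on the listeners and leave the rest alone */
	if (skc->skc_state != TCP_LISTEN)
		return 0;

	bpf_setsockopt(tp, SOL_TCP, TCP_CONGESTION,
		       "bpf_dctcp", sizeof("bpf_dctcp"));
	return 0;
}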

Patches 1-4 are cleanup and prep work in the tcp and bpf seq_file code.

Patch 5 makes the tcp seq_file iterate on the port+addr lhash2 instead
of the port-only listening_hash.

Patch 6 makes the bpf tcp iter batch the sockets, which then allows
lock_sock().  lock_sock() is needed for setsockopt().

Patch 7 allows the bpf tcp iter to call bpf_setsockopt().  Patch 8
adds a selftest that exercises the batching and bpf_setsockopt() in
the bpf tcp iter.
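
On the userspace side, applying the new tcp-cc to the existing sockets
is then roughly a matter of attaching the iter prog and reading the
iter fd, e.g. (following what the selftest in patch 8 does, error
handling omitted):

	struct bpf_iter_setsockopt *skel;
	int iter_fd, err;
	char buf;

	skel = bpf_iter_setsockopt__open_and_load();
	skel->links.change_tcp_cc =
		bpf_program__attach_iter(skel->progs.change_tcp_cc, NULL);
	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.change_tcp_cc));

	/* Trigger the iteration.  The bpf prog does the bpf_setsockopt()
	 * while the kernel walks the sockets.
	 */
	while ((err = read(iter_fd, &buf, sizeof(buf))) == -1 &&
	       errno == EAGAIN)
		;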

Martin KaFai Lau (8):
  tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos
  tcp: seq_file: Refactor net and family matching
  bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state
  tcp: seq_file: Add listening_get_first()
  tcp: seq_file: Replace listening_hash with lhash2
  bpf: tcp: bpf iter batching and lock_sock
  bpf: tcp: Support bpf_setsockopt in bpf tcp iter
  bpf: selftest: Test batching and bpf_setsockopt in bpf tcp iter

 include/linux/bpf.h                           |   7 +
 include/net/inet_hashtables.h                 |   6 +
 include/net/tcp.h                             |   1 -
 kernel/bpf/bpf_iter.c                         |  22 +
 kernel/trace/bpf_trace.c                      |   7 +-
 net/core/filter.c                             |  17 +
 net/ipv4/tcp_ipv4.c                           | 409 ++++++++++++++----
 tools/testing/selftests/bpf/network_helpers.c |  85 +++-
 tools/testing/selftests/bpf/network_helpers.h |   4 +
 .../bpf/prog_tests/bpf_iter_setsockopt.c      | 226 ++++++++++
 .../selftests/bpf/progs/bpf_iter_setsockopt.c |  76 ++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   4 +
 12 files changed, 767 insertions(+), 97 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c

-- 
2.30.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos
  2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
@ 2021-06-25 20:04 ` Martin KaFai Lau
  2021-06-25 20:04 ` [PATCH bpf-next 2/8] tcp: seq_file: Refactor net and family matching Martin KaFai Lau
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-25 20:04 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yonghong Song, Yuchung Cheng

st->bucket stores the current bucket number.
st->offset stores the offset within this bucket of the sk to be
seq_show()'d.  Thus, st->offset only makes sense within the same
st->bucket.

These two variables are an optimization for the common no-lseek case.
When resuming the seq_file iteration (i.e. seq_start()),
tcp_seek_last_pos() tries to continue from the st->offset
at bucket st->bucket.

However, it is possible that the bucket pointed to by st->bucket has
changed and st->offset may end up skipping the whole st->bucket
without finding a sk.  In this case, tcp_seek_last_pos() currently
continues to satisfy the offset condition in the next (and incorrect)
bucket: e.g. if st->offset is 5 but only two sk are left in st->bucket
when the iteration resumes, the remaining offset is consumed from the
next bucket.  Instead, regardless of the offset value, the first sk of
the next bucket should be returned.  Thus, a "bucket == st->bucket"
check is added to tcp_seek_last_pos().

The chance of hitting this is small and the issue is a decade old,
so it is targeted for the -next tree.

Fixes: a8b690f98baf ("tcp: Fix slowness in read /proc/net/tcp")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/tcp_ipv4.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6cb8e269f1ab..7a49427eefbf 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2451,6 +2451,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
 static void *tcp_seek_last_pos(struct seq_file *seq)
 {
 	struct tcp_iter_state *st = seq->private;
+	int bucket = st->bucket;
 	int offset = st->offset;
 	int orig_num = st->num;
 	void *rc = NULL;
@@ -2461,7 +2462,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 			break;
 		st->state = TCP_SEQ_STATE_LISTENING;
 		rc = listening_get_next(seq, NULL);
-		while (offset-- && rc)
+		while (offset-- && rc && bucket == st->bucket)
 			rc = listening_get_next(seq, rc);
 		if (rc)
 			break;
@@ -2472,7 +2473,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 		if (st->bucket > tcp_hashinfo.ehash_mask)
 			break;
 		rc = established_get_first(seq);
-		while (offset-- && rc)
+		while (offset-- && rc && bucket == st->bucket)
 			rc = established_get_next(seq, rc);
 	}
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH bpf-next 2/8] tcp: seq_file: Refactor net and family matching
  2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
  2021-06-25 20:04 ` [PATCH bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos Martin KaFai Lau
@ 2021-06-25 20:04 ` Martin KaFai Lau
  2021-06-25 20:05 ` [PATCH bpf-next 3/8] bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state Martin KaFai Lau
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-25 20:04 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yonghong Song, Yuchung Cheng

This patch refactors the net and family matching into
two new helpers, seq_sk_match() and seq_file_family().

seq_file_family() is placed in the later part of the file to prepare
for a change in a following patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/tcp_ipv4.c | 68 ++++++++++++++++++++-------------------------
 1 file changed, 30 insertions(+), 38 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7a49427eefbf..13a8b6e8d6bc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2277,6 +2277,17 @@ EXPORT_SYMBOL(tcp_v4_destroy_sock);
 #ifdef CONFIG_PROC_FS
 /* Proc filesystem TCP sock list dumping. */
 
+static unsigned short seq_file_family(const struct seq_file *seq);
+
+static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
+{
+	unsigned short family = seq_file_family(seq);
+
+	/* AF_UNSPEC is used as a match all */
+	return ((family == AF_UNSPEC || family == sk->sk_family) &&
+		net_eq(sock_net(sk), seq_file_net(seq)));
+}
+
 /*
  * Get next listener socket follow cur.  If cur is NULL, get first socket
  * starting from bucket given in st->bucket; when st->bucket is zero the
@@ -2284,18 +2295,11 @@ EXPORT_SYMBOL(tcp_v4_destroy_sock);
  */
 static void *listening_get_next(struct seq_file *seq, void *cur)
 {
-	struct tcp_seq_afinfo *afinfo;
 	struct tcp_iter_state *st = seq->private;
-	struct net *net = seq_file_net(seq);
 	struct inet_listen_hashbucket *ilb;
 	struct hlist_nulls_node *node;
 	struct sock *sk = cur;
 
-	if (st->bpf_seq_afinfo)
-		afinfo = st->bpf_seq_afinfo;
-	else
-		afinfo = PDE_DATA(file_inode(seq->file));
-
 	if (!sk) {
 get_head:
 		ilb = &tcp_hashinfo.listening_hash[st->bucket];
@@ -2311,10 +2315,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	sk = sk_nulls_next(sk);
 get_sk:
 	sk_nulls_for_each_from(sk, node) {
-		if (!net_eq(sock_net(sk), net))
-			continue;
-		if (afinfo->family == AF_UNSPEC ||
-		    sk->sk_family == afinfo->family)
+		if (seq_sk_match(seq, sk))
 			return sk;
 	}
 	spin_unlock(&ilb->lock);
@@ -2351,15 +2352,7 @@ static inline bool empty_bucket(const struct tcp_iter_state *st)
  */
 static void *established_get_first(struct seq_file *seq)
 {
-	struct tcp_seq_afinfo *afinfo;
 	struct tcp_iter_state *st = seq->private;
-	struct net *net = seq_file_net(seq);
-	void *rc = NULL;
-
-	if (st->bpf_seq_afinfo)
-		afinfo = st->bpf_seq_afinfo;
-	else
-		afinfo = PDE_DATA(file_inode(seq->file));
 
 	st->offset = 0;
 	for (; st->bucket <= tcp_hashinfo.ehash_mask; ++st->bucket) {
@@ -2373,32 +2366,20 @@ static void *established_get_first(struct seq_file *seq)
 
 		spin_lock_bh(lock);
 		sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
-			if ((afinfo->family != AF_UNSPEC &&
-			     sk->sk_family != afinfo->family) ||
-			    !net_eq(sock_net(sk), net)) {
-				continue;
-			}
-			rc = sk;
-			goto out;
+			if (seq_sk_match(seq, sk))
+				return sk;
 		}
 		spin_unlock_bh(lock);
 	}
-out:
-	return rc;
+
+	return NULL;
 }
 
 static void *established_get_next(struct seq_file *seq, void *cur)
 {
-	struct tcp_seq_afinfo *afinfo;
 	struct sock *sk = cur;
 	struct hlist_nulls_node *node;
 	struct tcp_iter_state *st = seq->private;
-	struct net *net = seq_file_net(seq);
-
-	if (st->bpf_seq_afinfo)
-		afinfo = st->bpf_seq_afinfo;
-	else
-		afinfo = PDE_DATA(file_inode(seq->file));
 
 	++st->num;
 	++st->offset;
@@ -2406,9 +2387,7 @@ static void *established_get_next(struct seq_file *seq, void *cur)
 	sk = sk_nulls_next(sk);
 
 	sk_nulls_for_each_from(sk, node) {
-		if ((afinfo->family == AF_UNSPEC ||
-		     sk->sk_family == afinfo->family) &&
-		    net_eq(sock_net(sk), net))
+		if (seq_sk_match(seq, sk))
 			return sk;
 	}
 
@@ -2754,6 +2733,19 @@ static const struct seq_operations bpf_iter_tcp_seq_ops = {
 	.stop		= bpf_iter_tcp_seq_stop,
 };
 #endif
+static unsigned short seq_file_family(const struct seq_file *seq)
+{
+	const struct tcp_iter_state *st = seq->private;
+	const struct tcp_seq_afinfo *afinfo = st->bpf_seq_afinfo;
+
+	/* Iterated from bpf_iter.  Let the bpf prog to filter instead. */
+	if (afinfo)
+		return AF_UNSPEC;
+
+	/* Iterated from proc fs */
+	afinfo = PDE_DATA(file_inode(seq->file));
+	return afinfo->family;
+}
 
 static const struct seq_operations tcp4_seq_ops = {
 	.show		= tcp4_seq_show,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH bpf-next 3/8] bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state
  2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
  2021-06-25 20:04 ` [PATCH bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos Martin KaFai Lau
  2021-06-25 20:04 ` [PATCH bpf-next 2/8] tcp: seq_file: Refactor net and family matching Martin KaFai Lau
@ 2021-06-25 20:05 ` Martin KaFai Lau
  2021-06-25 20:05 ` [PATCH bpf-next 4/8] tcp: seq_file: Add listening_get_first() Martin KaFai Lau
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-25 20:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yonghong Song, Yuchung Cheng

A following patch will create a separate struct to store extra
bpf_iter state and it will embed the existing tcp_iter_state like this:
struct bpf_tcp_iter_state {
	struct tcp_iter_state state;
	/* More bpf_iter specific states here ... */
};

As prep work, this patch removes
"struct tcp_seq_afinfo *bpf_seq_afinfo", whose purpose is to tell
whether the iteration comes from bpf_iter instead of proc fs.
Currently, if "*bpf_seq_afinfo" is not NULL, it is iterating from
bpf_iter; in that case, the kernel should not filter by the addr
family and should leave this filtering decision to the bpf prog.

Instead of keeping a "*bpf_seq_afinfo" pointer around, this patch uses
the "seq->op == &bpf_iter_tcp_seq_ops" test to tell whether it is
iterating from the bpf iter.

bpf_iter_(init|fini)_tcp() are left in place to prepare for a change
in a following patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/tcp.h   |  1 -
 net/ipv4/tcp_ipv4.c | 25 +++++--------------------
 2 files changed, 5 insertions(+), 21 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e668f1bf780d..06ce38967890 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1955,7 +1955,6 @@ struct tcp_iter_state {
 	struct seq_net_private	p;
 	enum tcp_seq_states	state;
 	struct sock		*syn_wait_sk;
-	struct tcp_seq_afinfo	*bpf_seq_afinfo;
 	int			bucket, offset, sbucket, num;
 	loff_t			last_pos;
 };
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 13a8b6e8d6bc..ca55e87f6cc9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2735,12 +2735,13 @@ static const struct seq_operations bpf_iter_tcp_seq_ops = {
 #endif
 static unsigned short seq_file_family(const struct seq_file *seq)
 {
-	const struct tcp_iter_state *st = seq->private;
-	const struct tcp_seq_afinfo *afinfo = st->bpf_seq_afinfo;
+	const struct tcp_seq_afinfo *afinfo;
 
+#ifdef CONFIG_BPF_SYSCALL
 	/* Iterated from bpf_iter.  Let the bpf prog to filter instead. */
-	if (afinfo)
+	if (seq->op == &bpf_iter_tcp_seq_ops)
 		return AF_UNSPEC;
+#endif
 
 	/* Iterated from proc fs */
 	afinfo = PDE_DATA(file_inode(seq->file));
@@ -2998,27 +2999,11 @@ DEFINE_BPF_ITER_FUNC(tcp, struct bpf_iter_meta *meta,
 
 static int bpf_iter_init_tcp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	struct tcp_iter_state *st = priv_data;
-	struct tcp_seq_afinfo *afinfo;
-	int ret;
-
-	afinfo = kmalloc(sizeof(*afinfo), GFP_USER | __GFP_NOWARN);
-	if (!afinfo)
-		return -ENOMEM;
-
-	afinfo->family = AF_UNSPEC;
-	st->bpf_seq_afinfo = afinfo;
-	ret = bpf_iter_init_seq_net(priv_data, aux);
-	if (ret)
-		kfree(afinfo);
-	return ret;
+	return bpf_iter_init_seq_net(priv_data, aux);
 }
 
 static void bpf_iter_fini_tcp(void *priv_data)
 {
-	struct tcp_iter_state *st = priv_data;
-
-	kfree(st->bpf_seq_afinfo);
 	bpf_iter_fini_seq_net(priv_data);
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH bpf-next 4/8] tcp: seq_file: Add listening_get_first()
  2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2021-06-25 20:05 ` [PATCH bpf-next 3/8] bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state Martin KaFai Lau
@ 2021-06-25 20:05 ` Martin KaFai Lau
  2021-06-25 20:05 ` [PATCH bpf-next 5/8] tcp: seq_file: Replace listening_hash with lhash2 Martin KaFai Lau
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-25 20:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yonghong Song, Yuchung Cheng

The current listening_get_next() is overloaded: passing NULL as the
2nd arg, i.e. listening_get_next(seq, NULL), means get_first().

This patch moves some logic from listening_get_next() into a new
function listening_get_first().  It will be equivalent to the current
established_get_first() and established_get_next() setup:
get_first() finds a non-empty bucket and returns its first sk, and
get_next() finds the next sk of the current bucket and resorts to
get_first() if the current bucket is exhausted.

The next patch will move the listener seq_file iteration from
listening_hash (port only) to lhash2 (port+addr).
Separating listening_get_first() from listening_get_next()
here makes the following lhash2 changes cleaner and easier to
follow.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/tcp_ipv4.c | 59 ++++++++++++++++++++++++++++++---------------
 1 file changed, 39 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ca55e87f6cc9..693bda155876 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2288,10 +2288,38 @@ static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
 		net_eq(sock_net(sk), seq_file_net(seq)));
 }
 
-/*
- * Get next listener socket follow cur.  If cur is NULL, get first socket
- * starting from bucket given in st->bucket; when st->bucket is zero the
- * very first socket in the hash table is returned.
+/* Find a non empty bucket (starting from st->bucket)
+ * and return the first sk from it.
+ */
+static void *listening_get_first(struct seq_file *seq)
+{
+	struct tcp_iter_state *st = seq->private;
+
+	st->offset = 0;
+	for (; st->bucket < INET_LHTABLE_SIZE; st->bucket++) {
+		struct inet_listen_hashbucket *ilb;
+		struct hlist_nulls_node *node;
+		struct sock *sk;
+
+		ilb = &tcp_hashinfo.listening_hash[st->bucket];
+		if (hlist_nulls_empty(&ilb->nulls_head))
+			continue;
+
+		spin_lock(&ilb->lock);
+		sk_nulls_for_each(sk, node, &ilb->nulls_head) {
+			if (seq_sk_match(seq, sk))
+				return sk;
+		}
+		spin_unlock(&ilb->lock);
+	}
+
+	return NULL;
+}
+
+/* Find the next sk of "cur" within the same bucket (i.e. st->bucket).
+ * If "cur" is the last one in the st->bucket,
+ * call listening_get_first() to return the first sk of the next
+ * non empty bucket.
  */
 static void *listening_get_next(struct seq_file *seq, void *cur)
 {
@@ -2300,29 +2328,20 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	struct hlist_nulls_node *node;
 	struct sock *sk = cur;
 
-	if (!sk) {
-get_head:
-		ilb = &tcp_hashinfo.listening_hash[st->bucket];
-		spin_lock(&ilb->lock);
-		sk = sk_nulls_head(&ilb->nulls_head);
-		st->offset = 0;
-		goto get_sk;
-	}
-	ilb = &tcp_hashinfo.listening_hash[st->bucket];
 	++st->num;
 	++st->offset;
 
 	sk = sk_nulls_next(sk);
-get_sk:
+
 	sk_nulls_for_each_from(sk, node) {
 		if (seq_sk_match(seq, sk))
 			return sk;
 	}
+
+	ilb = &tcp_hashinfo.listening_hash[st->bucket];
 	spin_unlock(&ilb->lock);
-	st->offset = 0;
-	if (++st->bucket < INET_LHTABLE_SIZE)
-		goto get_head;
-	return NULL;
+	++st->bucket;
+	return listening_get_first(seq);
 }
 
 static void *listening_get_idx(struct seq_file *seq, loff_t *pos)
@@ -2332,7 +2351,7 @@ static void *listening_get_idx(struct seq_file *seq, loff_t *pos)
 
 	st->bucket = 0;
 	st->offset = 0;
-	rc = listening_get_next(seq, NULL);
+	rc = listening_get_first(seq);
 
 	while (rc && *pos) {
 		rc = listening_get_next(seq, rc);
@@ -2440,7 +2459,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 		if (st->bucket >= INET_LHTABLE_SIZE)
 			break;
 		st->state = TCP_SEQ_STATE_LISTENING;
-		rc = listening_get_next(seq, NULL);
+		rc = listening_get_first(seq);
 		while (offset-- && rc && bucket == st->bucket)
 			rc = listening_get_next(seq, rc);
 		if (rc)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH bpf-next 5/8] tcp: seq_file: Replace listening_hash with lhash2
  2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2021-06-25 20:05 ` [PATCH bpf-next 4/8] tcp: seq_file: Add listening_get_first() Martin KaFai Lau
@ 2021-06-25 20:05 ` Martin KaFai Lau
  2021-06-25 20:05 ` [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock Martin KaFai Lau
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-25 20:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yonghong Song, Yuchung Cheng

This patch moves the tcp seq_file iteration on listeners
from the port-only listening_hash to the port+addr lhash2.

When iterating from the bpf iter, the next patch will need to
lock the socket so that a bpf iter with CAP_NET_ADMIN can do
setsockopt (e.g. to change TCP_CONGESTION).  To avoid holding the
bucket lock while locking the sock, the bpf iter will first batch some
sockets from the same bucket and then unlock the bucket.  If the
bucket size is small (which it usually is), it is easier to batch the
whole bucket so that a setsockopt on a socket is less likely to be
missed due to changes in the bucket.

However, the port-only listening_hash could have many listeners
hashed to one bucket (e.g. many individual VIP(s):443, further
multiplied by the number of SO_REUSEPORT listeners).  We have seen
bucket sizes in the tens-of-thousands range.  Also, the chance of
changes happening in a popular port bucket (e.g. 443) is high.

The port+addr lhash2 was introduced to solve this large listener
bucket issue.  The listening_hash usage has also already been replaced
with lhash2 in the fast path inet[6]_lookup_listener().  This patch
follows the same direction and iterates lhash2 instead of
listening_hash.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/inet_hashtables.h |  6 ++++++
 net/ipv4/tcp_ipv4.c           | 35 ++++++++++++++++++-----------------
 2 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index ca6a3ea9057e..f72ec113ae56 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -160,6 +160,12 @@ struct inet_hashinfo {
 					____cacheline_aligned_in_smp;
 };
 
+#define inet_lhash2_for_each_icsk_continue(__icsk) \
+	hlist_for_each_entry_continue(__icsk, icsk_listen_portaddr_node)
+
+#define inet_lhash2_for_each_icsk(__icsk, list) \
+	hlist_for_each_entry(__icsk, list, icsk_listen_portaddr_node)
+
 #define inet_lhash2_for_each_icsk_rcu(__icsk, list) \
 	hlist_for_each_entry_rcu(__icsk, list, icsk_listen_portaddr_node)
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 693bda155876..0d851289a89e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2296,21 +2296,22 @@ static void *listening_get_first(struct seq_file *seq)
 	struct tcp_iter_state *st = seq->private;
 
 	st->offset = 0;
-	for (; st->bucket < INET_LHTABLE_SIZE; st->bucket++) {
-		struct inet_listen_hashbucket *ilb;
-		struct hlist_nulls_node *node;
+	for (; st->bucket <= tcp_hashinfo.lhash2_mask; st->bucket++) {
+		struct inet_listen_hashbucket *ilb2;
+		struct inet_connection_sock *icsk;
 		struct sock *sk;
 
-		ilb = &tcp_hashinfo.listening_hash[st->bucket];
-		if (hlist_nulls_empty(&ilb->nulls_head))
+		ilb2 = &tcp_hashinfo.lhash2[st->bucket];
+		if (hlist_empty(&ilb2->head))
 			continue;
 
-		spin_lock(&ilb->lock);
-		sk_nulls_for_each(sk, node, &ilb->nulls_head) {
+		spin_lock(&ilb2->lock);
+		inet_lhash2_for_each_icsk(icsk, &ilb2->head) {
+			sk = (struct sock *)icsk;
 			if (seq_sk_match(seq, sk))
 				return sk;
 		}
-		spin_unlock(&ilb->lock);
+		spin_unlock(&ilb2->lock);
 	}
 
 	return NULL;
@@ -2324,22 +2325,22 @@ static void *listening_get_first(struct seq_file *seq)
 static void *listening_get_next(struct seq_file *seq, void *cur)
 {
 	struct tcp_iter_state *st = seq->private;
-	struct inet_listen_hashbucket *ilb;
-	struct hlist_nulls_node *node;
+	struct inet_listen_hashbucket *ilb2;
+	struct inet_connection_sock *icsk;
 	struct sock *sk = cur;
 
 	++st->num;
 	++st->offset;
 
-	sk = sk_nulls_next(sk);
-
-	sk_nulls_for_each_from(sk, node) {
+	icsk = inet_csk(sk);
+	inet_lhash2_for_each_icsk_continue(icsk) {
+		sk = (struct sock *)icsk;
 		if (seq_sk_match(seq, sk))
 			return sk;
 	}
 
-	ilb = &tcp_hashinfo.listening_hash[st->bucket];
-	spin_unlock(&ilb->lock);
+	ilb2 = &tcp_hashinfo.lhash2[st->bucket];
+	spin_unlock(&ilb2->lock);
 	++st->bucket;
 	return listening_get_first(seq);
 }
@@ -2456,7 +2457,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 
 	switch (st->state) {
 	case TCP_SEQ_STATE_LISTENING:
-		if (st->bucket >= INET_LHTABLE_SIZE)
+		if (st->bucket > tcp_hashinfo.lhash2_mask)
 			break;
 		st->state = TCP_SEQ_STATE_LISTENING;
 		rc = listening_get_first(seq);
@@ -2541,7 +2542,7 @@ void tcp_seq_stop(struct seq_file *seq, void *v)
 	switch (st->state) {
 	case TCP_SEQ_STATE_LISTENING:
 		if (v != SEQ_START_TOKEN)
-			spin_unlock(&tcp_hashinfo.listening_hash[st->bucket].lock);
+			spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock);
 		break;
 	case TCP_SEQ_STATE_ESTABLISHED:
 		if (v)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock
  2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2021-06-25 20:05 ` [PATCH bpf-next 5/8] tcp: seq_file: Replace listening_hash with lhash2 Martin KaFai Lau
@ 2021-06-25 20:05 ` Martin KaFai Lau
  2021-06-29 17:27   ` Yonghong Song
  2021-06-25 20:05 ` [PATCH bpf-next 7/8] bpf: tcp: Support bpf_setsockopt in bpf tcp iter Martin KaFai Lau
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-25 20:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yonghong Song, Yuchung Cheng

This patch does batching and lock_sock for the bpf tcp iter.
It does not affect the proc fs iteration.

With bpf-tcp-cc, new congestion-control algorithms are rolled out more
often.  Instead of restarting the application to pick up the new
tcp-cc, the next patch will allow a bpf iter with CAP_NET_ADMIN to do
setsockopt(TCP_CONGESTION).  This requires locking the sock.

Also, unlike the proc iteration (cat /proc/net/tcp[6]), the bpf iter
can inspect all fields of a tcp_sock.  It will be useful to have a
consistent view on some of the fields (e.g. the ones reported by
tcp_get_info(), which also acquires the sock lock).

Double lock: locking the bucket first and then locking the sock could
lead to deadlock.  This patch takes a batching approach similar to
inet_diag.  While holding the bucket lock, it batches a number of
sockets into an array first and then unlocks the bucket.  Before doing
show() on each socket, it then calls lock_sock_fast().
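
Roughly, the per-bucket flow is (pseudo code only, the actual code is
in the diff below):

	spin_lock(bucket lock);
	/* hold a refcnt on each matching sk and store it in iter->batch[] */
	batch the sk(s) of this bucket;
	spin_unlock(bucket lock);

	for each sk in the batch {
		slow = lock_sock_fast(sk);
		seq_show(sk);	/* the bpf prog may bpf_setsockopt() here */
		unlock_sock_fast(sk, slow);
		sock_put(sk);
	}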

On a machine with ~400k connections, the maximum number of
sk in a bucket of the established hashtable is 7.  0.02% of
the established connections fall into buckets of this size.

For the listen hash (port+addr lhash2), the bucket is usually also
very small, except for the SO_REUSEPORT use case where the userspace
could have one SO_REUSEPORT socket per thread.

Batching also minimizes the chance of missing a sock in the
setsockopt use case if the whole bucket is batched.  This patch starts
with a batch array of INIT_BATCH_SZ (16), which is enough for the most
common cases.  bpf_iter_tcp_batch() will try to realloc to a larger
array to handle the exceptional cases (e.g. the SO_REUSEPORT case in
lhash2).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/tcp_ipv4.c | 236 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 230 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0d851289a89e..856144d33f52 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2687,6 +2687,15 @@ static int tcp4_seq_show(struct seq_file *seq, void *v)
 }
 
 #ifdef CONFIG_BPF_SYSCALL
+struct bpf_tcp_iter_state {
+	struct tcp_iter_state state;
+	unsigned int cur_sk;
+	unsigned int end_sk;
+	unsigned int max_sk;
+	struct sock **batch;
+	bool st_bucket_done;
+};
+
 struct bpf_iter__tcp {
 	__bpf_md_ptr(struct bpf_iter_meta *, meta);
 	__bpf_md_ptr(struct sock_common *, sk_common);
@@ -2705,16 +2714,203 @@ static int tcp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
 	return bpf_iter_run_prog(prog, &ctx);
 }
 
+static void bpf_iter_tcp_put_batch(struct bpf_tcp_iter_state *iter)
+{
+	while (iter->cur_sk < iter->end_sk)
+		sock_put(iter->batch[iter->cur_sk++]);
+}
+
+static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
+				      unsigned int new_batch_sz)
+{
+	struct sock **new_batch;
+
+	new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz, GFP_USER);
+	if (!new_batch)
+		return -ENOMEM;
+
+	bpf_iter_tcp_put_batch(iter);
+	kvfree(iter->batch);
+	iter->batch = new_batch;
+	iter->max_sk = new_batch_sz;
+
+	return 0;
+}
+
+static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq,
+						 struct sock *start_sk)
+{
+	struct bpf_tcp_iter_state *iter = seq->private;
+	struct tcp_iter_state *st = &iter->state;
+	struct inet_connection_sock *icsk;
+	unsigned int expected = 1;
+	struct sock *sk;
+
+	sock_hold(start_sk);
+	iter->batch[iter->end_sk++] = start_sk;
+
+	icsk = inet_csk(start_sk);
+	inet_lhash2_for_each_icsk_continue(icsk) {
+		sk = (struct sock *)icsk;
+		if (seq_sk_match(seq, sk)) {
+			if (iter->end_sk < iter->max_sk) {
+				sock_hold(sk);
+				iter->batch[iter->end_sk++] = sk;
+			}
+			expected++;
+		}
+	}
+	spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock);
+
+	return expected;
+}
+
+static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq,
+						   struct sock *start_sk)
+{
+	struct bpf_tcp_iter_state *iter = seq->private;
+	struct tcp_iter_state *st = &iter->state;
+	struct hlist_nulls_node *node;
+	unsigned int expected = 1;
+	struct sock *sk;
+
+	sock_hold(start_sk);
+	iter->batch[iter->end_sk++] = start_sk;
+
+	sk = sk_nulls_next(start_sk);
+	sk_nulls_for_each_from(sk, node) {
+		if (seq_sk_match(seq, sk)) {
+			if (iter->end_sk < iter->max_sk) {
+				sock_hold(sk);
+				iter->batch[iter->end_sk++] = sk;
+			}
+			expected++;
+		}
+	}
+	spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket));
+
+	return expected;
+}
+
+static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
+{
+	struct bpf_tcp_iter_state *iter = seq->private;
+	struct tcp_iter_state *st = &iter->state;
+	unsigned int expected;
+	bool resized = false;
+	struct sock *sk;
+
+	/* The st->bucket is done.  Directly advance to the next
+	 * bucket instead of having the tcp_seek_last_pos() to skip
+	 * one by one in the current bucket and eventually find out
+	 * it has to advance to the next bucket.
+	 */
+	if (iter->st_bucket_done) {
+		st->offset = 0;
+		st->bucket++;
+		if (st->state == TCP_SEQ_STATE_LISTENING &&
+		    st->bucket > tcp_hashinfo.lhash2_mask) {
+			st->state = TCP_SEQ_STATE_ESTABLISHED;
+			st->bucket = 0;
+		}
+	}
+
+again:
+	/* Get a new batch */
+	iter->cur_sk = 0;
+	iter->end_sk = 0;
+	iter->st_bucket_done = false;
+
+	sk = tcp_seek_last_pos(seq);
+	if (!sk)
+		return NULL; /* Done */
+
+	if (st->state == TCP_SEQ_STATE_LISTENING)
+		expected = bpf_iter_tcp_listening_batch(seq, sk);
+	else
+		expected = bpf_iter_tcp_established_batch(seq, sk);
+
+	if (iter->end_sk == expected) {
+		iter->st_bucket_done = true;
+		return sk;
+	}
+
+	if (!resized && !bpf_iter_tcp_realloc_batch(iter, expected * 3 / 2)) {
+		resized = true;
+		goto again;
+	}
+
+	return sk;
+}
+
+static void *bpf_iter_tcp_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* bpf iter does not support lseek, so it always
+	 * continue from where it was stop()-ped.
+	 */
+	if (*pos)
+		return bpf_iter_tcp_batch(seq);
+
+	return SEQ_START_TOKEN;
+}
+
+static void *bpf_iter_tcp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_tcp_iter_state *iter = seq->private;
+	struct tcp_iter_state *st = &iter->state;
+	struct sock *sk;
+
+	/* Whenever seq_next() is called, the iter->cur_sk is
+	 * done with seq_show(), so advance to the next sk in
+	 * the batch.
+	 */
+	if (iter->cur_sk < iter->end_sk) {
+		/* Keeping st->num consistent in tcp_iter_state.
+		 * bpf_iter_tcp does not use st->num.
+		 * meta.seq_num is used instead.
+		 */
+		st->num++;
+		/* Move st->offset to the next sk in the bucket such that
+		 * the future start() will resume at st->offset in
+		 * st->bucket.  See tcp_seek_last_pos().
+		 */
+		st->offset++;
+		sock_put(iter->batch[iter->cur_sk++]);
+	}
+
+	if (iter->cur_sk < iter->end_sk)
+		sk = iter->batch[iter->cur_sk];
+	else
+		sk = bpf_iter_tcp_batch(seq);
+
+	++*pos;
+	/* Keeping st->last_pos consistent in tcp_iter_state.
+	 * bpf iter does not do lseek, so st->last_pos always equals to *pos.
+	 */
+	st->last_pos = *pos;
+	return sk;
+}
+
 static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 {
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 	struct sock *sk = v;
+	bool slow;
 	uid_t uid;
+	int ret;
 
 	if (v == SEQ_START_TOKEN)
 		return 0;
 
+	if (sk_fullsock(sk))
+		slow = lock_sock_fast(sk);
+
+	if (unlikely(sk_unhashed(sk))) {
+		ret = SEQ_SKIP;
+		goto unlock;
+	}
+
 	if (sk->sk_state == TCP_TIME_WAIT) {
 		uid = 0;
 	} else if (sk->sk_state == TCP_NEW_SYN_RECV) {
@@ -2728,11 +2924,18 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 
 	meta.seq = seq;
 	prog = bpf_iter_get_info(&meta, false);
-	return tcp_prog_seq_show(prog, &meta, v, uid);
+	ret = tcp_prog_seq_show(prog, &meta, v, uid);
+
+unlock:
+	if (sk_fullsock(sk))
+		unlock_sock_fast(sk, slow);
+	return ret;
+
 }
 
 static void bpf_iter_tcp_seq_stop(struct seq_file *seq, void *v)
 {
+	struct bpf_tcp_iter_state *iter = seq->private;
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 
@@ -2743,13 +2946,16 @@ static void bpf_iter_tcp_seq_stop(struct seq_file *seq, void *v)
 			(void)tcp_prog_seq_show(prog, &meta, v, 0);
 	}
 
-	tcp_seq_stop(seq, v);
+	if (iter->cur_sk < iter->end_sk) {
+		bpf_iter_tcp_put_batch(iter);
+		iter->st_bucket_done = false;
+	}
 }
 
 static const struct seq_operations bpf_iter_tcp_seq_ops = {
 	.show		= bpf_iter_tcp_seq_show,
-	.start		= tcp_seq_start,
-	.next		= tcp_seq_next,
+	.start		= bpf_iter_tcp_seq_start,
+	.next		= bpf_iter_tcp_seq_next,
 	.stop		= bpf_iter_tcp_seq_stop,
 };
 #endif
@@ -3017,21 +3223,39 @@ static struct pernet_operations __net_initdata tcp_sk_ops = {
 DEFINE_BPF_ITER_FUNC(tcp, struct bpf_iter_meta *meta,
 		     struct sock_common *sk_common, uid_t uid)
 
+#define INIT_BATCH_SZ 16
+
 static int bpf_iter_init_tcp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	return bpf_iter_init_seq_net(priv_data, aux);
+	struct bpf_tcp_iter_state *iter = priv_data;
+	int err;
+
+	err = bpf_iter_init_seq_net(priv_data, aux);
+	if (err)
+		return err;
+
+	err = bpf_iter_tcp_realloc_batch(iter, INIT_BATCH_SZ);
+	if (err) {
+		bpf_iter_fini_seq_net(priv_data);
+		return err;
+	}
+
+	return 0;
 }
 
 static void bpf_iter_fini_tcp(void *priv_data)
 {
+	struct bpf_tcp_iter_state *iter = priv_data;
+
 	bpf_iter_fini_seq_net(priv_data);
+	kvfree(iter->batch);
 }
 
 static const struct bpf_iter_seq_info tcp_seq_info = {
 	.seq_ops		= &bpf_iter_tcp_seq_ops,
 	.init_seq_private	= bpf_iter_init_tcp,
 	.fini_seq_private	= bpf_iter_fini_tcp,
-	.seq_priv_size		= sizeof(struct tcp_iter_state),
+	.seq_priv_size		= sizeof(struct bpf_tcp_iter_state),
 };
 
 static struct bpf_iter_reg tcp_reg_info = {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH bpf-next 7/8] bpf: tcp: Support bpf_setsockopt in bpf tcp iter
  2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2021-06-25 20:05 ` [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock Martin KaFai Lau
@ 2021-06-25 20:05 ` Martin KaFai Lau
  2021-06-25 20:05 ` [PATCH bpf-next 8/8] bpf: selftest: Test batching and " Martin KaFai Lau
  2021-06-29 19:04 ` [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Yonghong Song
  8 siblings, 0 replies; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-25 20:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yonghong Song, Yuchung Cheng

This patch allows a bpf tcp iter with CAP_NET_ADMIN to call
bpf_setsockopt.  To allow a specific bpf iter (tcp here) to call a set
of helpers, a get_func_proto function pointer is added to
bpf_iter_reg.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h      |  7 +++++++
 kernel/bpf/bpf_iter.c    | 22 ++++++++++++++++++++++
 kernel/trace/bpf_trace.c |  7 ++++++-
 net/core/filter.c        | 17 +++++++++++++++++
 net/ipv4/tcp_ipv4.c      | 15 +++++++++++++++
 5 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f309fc1509f2..31f51a601925 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1397,6 +1397,9 @@ typedef void (*bpf_iter_show_fdinfo_t) (const struct bpf_iter_aux_info *aux,
 					struct seq_file *seq);
 typedef int (*bpf_iter_fill_link_info_t)(const struct bpf_iter_aux_info *aux,
 					 struct bpf_link_info *info);
+typedef const struct bpf_func_proto *
+(*bpf_iter_get_func_proto_t)(enum bpf_func_id func_id,
+			     const struct bpf_prog *prog);
 
 enum bpf_iter_feature {
 	BPF_ITER_RESCHED	= BIT(0),
@@ -1409,6 +1412,7 @@ struct bpf_iter_reg {
 	bpf_iter_detach_target_t detach_target;
 	bpf_iter_show_fdinfo_t show_fdinfo;
 	bpf_iter_fill_link_info_t fill_link_info;
+	bpf_iter_get_func_proto_t get_func_proto;
 	u32 ctx_arg_info_size;
 	u32 feature;
 	struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
@@ -1431,6 +1435,8 @@ struct bpf_iter__bpf_map_elem {
 int bpf_iter_reg_target(const struct bpf_iter_reg *reg_info);
 void bpf_iter_unreg_target(const struct bpf_iter_reg *reg_info);
 bool bpf_iter_prog_supported(struct bpf_prog *prog);
+const struct bpf_func_proto *
+bpf_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
 int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr, struct bpf_prog *prog);
 int bpf_iter_new_fd(struct bpf_link *link);
 bool bpf_link_is_iter(struct bpf_link *link);
@@ -1997,6 +2003,7 @@ extern const struct bpf_func_proto bpf_task_storage_get_proto;
 extern const struct bpf_func_proto bpf_task_storage_delete_proto;
 extern const struct bpf_func_proto bpf_for_each_map_elem_proto;
 extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
+extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
 
 const struct bpf_func_proto *bpf_tracing_func_proto(
 	enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 2d4fbdbb194e..2e9d47bb40ff 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -360,6 +360,28 @@ bool bpf_iter_prog_supported(struct bpf_prog *prog)
 	return supported;
 }
 
+const struct bpf_func_proto *
+bpf_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	const struct bpf_iter_target_info *tinfo;
+	const struct bpf_func_proto *fn = NULL;
+
+	mutex_lock(&targets_mutex);
+	list_for_each_entry(tinfo, &targets, list) {
+		if (tinfo->btf_id == prog->aux->attach_btf_id) {
+			const struct bpf_iter_reg *reg_info;
+
+			reg_info = tinfo->reg_info;
+			if (reg_info->get_func_proto)
+				fn = reg_info->get_func_proto(func_id, prog);
+			break;
+		}
+	}
+	mutex_unlock(&targets_mutex);
+
+	return fn;
+}
+
 static void bpf_iter_link_release(struct bpf_link *link)
 {
 	struct bpf_iter_link *iter_link =
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 7a52bc172841..11e69734d67b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1428,6 +1428,8 @@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 const struct bpf_func_proto *
 tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
+	const struct bpf_func_proto *fn;
+
 	switch (func_id) {
 #ifdef CONFIG_NET
 	case BPF_FUNC_skb_output:
@@ -1468,7 +1470,10 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_d_path:
 		return &bpf_d_path_proto;
 	default:
-		return raw_tp_prog_func_proto(func_id, prog);
+		fn = raw_tp_prog_func_proto(func_id, prog);
+		if (!fn && prog->expected_attach_type == BPF_TRACE_ITER)
+			fn = bpf_iter_get_func_proto(func_id, prog);
+		return fn;
 	}
 }
 
diff --git a/net/core/filter.c b/net/core/filter.c
index d22895caa164..ab4d60c043d4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5063,6 +5063,23 @@ static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_5(bpf_sk_setsockopt, struct sock *, sk, int, level,
+	   int, optname, char *, optval, int, optlen)
+{
+	return _bpf_setsockopt(sk, level, optname, optval, optlen);
+}
+
+const struct bpf_func_proto bpf_sk_setsockopt_proto = {
+	.func		= bpf_sk_setsockopt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
 static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock,
 				int optname, const u8 **start)
 {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 856144d33f52..5bfa6233fff1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3258,6 +3258,20 @@ static const struct bpf_iter_seq_info tcp_seq_info = {
 	.seq_priv_size		= sizeof(struct bpf_tcp_iter_state),
 };
 
+static const struct bpf_func_proto *
+bpf_iter_tcp_get_func_proto(enum bpf_func_id func_id,
+			    const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_setsockopt:
+		if (capable(CAP_NET_ADMIN))
+			return &bpf_sk_setsockopt_proto;
+		fallthrough;
+	default:
+		return NULL;
+	}
+}
+
 static struct bpf_iter_reg tcp_reg_info = {
 	.target			= "tcp",
 	.ctx_arg_info_size	= 1,
@@ -3265,6 +3279,7 @@ static struct bpf_iter_reg tcp_reg_info = {
 		{ offsetof(struct bpf_iter__tcp, sk_common),
 		  PTR_TO_BTF_ID_OR_NULL },
 	},
+	.get_func_proto		= bpf_iter_tcp_get_func_proto,
 	.seq_info		= &tcp_seq_info,
 };
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH bpf-next 8/8] bpf: selftest: Test batching and bpf_setsockopt in bpf tcp iter
  2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
                   ` (6 preceding siblings ...)
  2021-06-25 20:05 ` [PATCH bpf-next 7/8] bpf: tcp: Support bpf_setsockopt in bpf tcp iter Martin KaFai Lau
@ 2021-06-25 20:05 ` Martin KaFai Lau
  2021-06-29 19:00   ` Yonghong Song
  2021-06-29 19:04 ` [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Yonghong Song
  8 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-25 20:05 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yonghong Song, Yuchung Cheng

This patch adds tests for the batching and bpf_setsockopt in bpf tcp iter.

It first creates:
a) 1 non-SO_REUSEPORT listener in lhash2.
b) 256 passive and active fds connected to the listener in (a).
c) 256 SO_REUSEPORT listeners in one lhash2 bucket.

The test sets all listeners and connections to bpf_cubic before
running the bpf iter.

The bpf iter then calls setsockopt(TCP_CONGESTION) to switch
each listener and connection from bpf_cubic to bpf_dctcp.

The bpf iter has a random_retry mode such that it can return EAGAIN
to the userspace in the middle of a batch.
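
The test can be run with the usual test_progs runner, e.g. something
like:

	./test_progs -t bpf_iter_setsockopt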

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/network_helpers.c |  85 ++++++-
 tools/testing/selftests/bpf/network_helpers.h |   4 +
 .../bpf/prog_tests/bpf_iter_setsockopt.c      | 226 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_setsockopt.c |  76 ++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   4 +
 5 files changed, 386 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 2060bc122c53..26468a8f44f3 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -66,17 +66,13 @@ int settimeo(int fd, int timeout_ms)
 
 #define save_errno_close(fd) ({ int __save = errno; close(fd); errno = __save; })
 
-int start_server(int family, int type, const char *addr_str, __u16 port,
-		 int timeout_ms)
+static int __start_server(int type, const struct sockaddr *addr,
+			  socklen_t addrlen, int timeout_ms, bool reuseport)
 {
-	struct sockaddr_storage addr = {};
-	socklen_t len;
+	int on = 1;
 	int fd;
 
-	if (make_sockaddr(family, addr_str, port, &addr, &len))
-		return -1;
-
-	fd = socket(family, type, 0);
+	fd = socket(addr->sa_family, type, 0);
 	if (fd < 0) {
 		log_err("Failed to create server socket");
 		return -1;
@@ -85,7 +81,13 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
 	if (settimeo(fd, timeout_ms))
 		goto error_close;
 
-	if (bind(fd, (const struct sockaddr *)&addr, len) < 0) {
+	if (reuseport &&
+	    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on))) {
+		log_err("Failed to set SO_REUSEPORT");
+		return -1;
+	}
+
+	if (bind(fd, addr, addrlen) < 0) {
 		log_err("Failed to bind socket");
 		goto error_close;
 	}
@@ -104,6 +106,69 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
 	return -1;
 }
 
+int start_server(int family, int type, const char *addr_str, __u16 port,
+		 int timeout_ms)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen;
+
+	if (make_sockaddr(family, addr_str, port, &addr, &addrlen))
+		return -1;
+
+	return __start_server(type, (struct sockaddr *)&addr,
+			      addrlen, timeout_ms, false);
+}
+
+int *start_reuseport_server(int family, int type, const char *addr_str,
+			    __u16 port, int timeout_ms, unsigned int nr_listens)
+{
+	struct sockaddr_storage addr;
+	unsigned int nr_fds = 0;
+	socklen_t addrlen;
+	int *fds;
+
+	if (!nr_listens)
+		return NULL;
+
+	if (make_sockaddr(family, addr_str, port, &addr, &addrlen))
+		return NULL;
+
+	fds = malloc(sizeof(*fds) * nr_listens);
+	if (!fds)
+		return NULL;
+
+	fds[0] = __start_server(type, (struct sockaddr *)&addr, addrlen,
+				timeout_ms, true);
+	if (fds[0] == -1)
+		goto close_fds;
+	nr_fds = 1;
+
+	if (getsockname(fds[0], (struct sockaddr *)&addr, &addrlen))
+		goto close_fds;
+
+	for (; nr_fds < nr_listens; nr_fds++) {
+		fds[nr_fds] = __start_server(type, (struct sockaddr *)&addr,
+					     addrlen, timeout_ms, true);
+		if (fds[nr_fds] == -1)
+			goto close_fds;
+	}
+
+	return fds;
+
+close_fds:
+	free_fds(fds, nr_fds);
+	return NULL;
+}
+
+void free_fds(int *fds, unsigned int nr_close_fds)
+{
+	if (fds) {
+		while (nr_close_fds)
+			close(fds[--nr_close_fds]);
+		free(fds);
+	}
+}
+
 int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
 		     int timeout_ms)
 {
@@ -217,6 +282,7 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
 	if (family == AF_INET) {
 		struct sockaddr_in *sin = (void *)addr;
 
+		memset(addr, 0, sizeof(*sin));
 		sin->sin_family = AF_INET;
 		sin->sin_port = htons(port);
 		if (addr_str &&
@@ -230,6 +296,7 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
 	} else if (family == AF_INET6) {
 		struct sockaddr_in6 *sin6 = (void *)addr;
 
+		memset(addr, 0, sizeof(*sin6));
 		sin6->sin6_family = AF_INET6;
 		sin6->sin6_port = htons(port);
 		if (addr_str &&
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 5e0d51c07b63..d60bc2897770 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -36,6 +36,10 @@ extern struct ipv6_packet pkt_v6;
 int settimeo(int fd, int timeout_ms);
 int start_server(int family, int type, const char *addr, __u16 port,
 		 int timeout_ms);
+int *start_reuseport_server(int family, int type, const char *addr_str,
+			    __u16 port, int timeout_ms,
+			    unsigned int nr_listens);
+void free_fds(int *fds, unsigned int nr_close_fds);
 int connect_to_fd(int server_fd, int timeout_ms);
 int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms);
 int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c
new file mode 100644
index 000000000000..85babb0487b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c
@@ -0,0 +1,226 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#define _GNU_SOURCE
+#include <sched.h>
+#include <test_progs.h>
+#include "network_helpers.h"
+#include "bpf_dctcp.skel.h"
+#include "bpf_cubic.skel.h"
+#include "bpf_iter_setsockopt.skel.h"
+
+static int create_netns(void)
+{
+	if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns"))
+		return -1;
+
+	if (!ASSERT_OK(system("ip link set dev lo up"), "bring up lo"))
+		return -1;
+
+	return 0;
+}
+
+static unsigned int set_bpf_cubic(int *fds, unsigned int nr_fds)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr_fds; i++) {
+		if (setsockopt(fds[i], SOL_TCP, TCP_CONGESTION, "bpf_cubic",
+			       sizeof("bpf_cubic")))
+			return i;
+	}
+
+	return nr_fds;
+}
+
+static unsigned int check_bpf_dctcp(int *fds, unsigned int nr_fds)
+{
+	char tcp_cc[16];
+	socklen_t optlen = sizeof(tcp_cc);
+	unsigned int i;
+
+	for (i = 0; i < nr_fds; i++) {
+		if (getsockopt(fds[i], SOL_TCP, TCP_CONGESTION,
+			       tcp_cc, &optlen) ||
+		    strcmp(tcp_cc, "bpf_dctcp"))
+			return i;
+	}
+
+	return nr_fds;
+}
+
+static int *make_established(int listen_fd, unsigned int nr_est,
+			     int **paccepted_fds)
+{
+	int *est_fds, *accepted_fds;
+	unsigned int i;
+
+	est_fds = malloc(sizeof(*est_fds) * nr_est);
+	if (!est_fds)
+		return NULL;
+
+	accepted_fds = malloc(sizeof(*accepted_fds) * nr_est);
+	if (!accepted_fds) {
+		free(est_fds);
+		return NULL;
+	}
+
+	for (i = 0; i < nr_est; i++) {
+		est_fds[i] = connect_to_fd(listen_fd, 0);
+		if (est_fds[i] == -1)
+			break;
+		if (set_bpf_cubic(&est_fds[i], 1) != 1) {
+			close(est_fds[i]);
+			break;
+		}
+
+		accepted_fds[i] = accept(listen_fd, NULL, 0);
+		if (accepted_fds[i] == -1) {
+			close(est_fds[i]);
+			break;
+		}
+	}
+
+	if (!ASSERT_EQ(i, nr_est, "create established fds")) {
+		free_fds(accepted_fds, i);
+		free_fds(est_fds, i);
+		return NULL;
+	}
+
+	*paccepted_fds = accepted_fds;
+	return est_fds;
+}
+
+static unsigned short get_local_port(int fd)
+{
+	struct sockaddr_in6 addr;
+	socklen_t addrlen = sizeof(addr);
+
+	if (!getsockname(fd, &addr, &addrlen))
+		return ntohs(addr.sin6_port);
+
+	return 0;
+}
+
+static void do_bpf_iter_setsockopt(struct bpf_iter_setsockopt *iter_skel,
+				   bool random_retry)
+{
+	int *reuse_listen_fds = NULL, *accepted_fds = NULL, *est_fds = NULL;
+	unsigned int nr_reuse_listens = 256, nr_est = 256;
+	int err, iter_fd = -1, listen_fd = -1;
+	char buf;
+
+	/* Prepare non-reuseport listen_fd */
+	listen_fd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0);
+	if (!ASSERT_GE(listen_fd, 0, "start_server"))
+		return;
+	if (!ASSERT_EQ(set_bpf_cubic(&listen_fd, 1), 1,
+		       "set listen_fd to cubic"))
+		goto done;
+	iter_skel->bss->listen_hport = get_local_port(listen_fd);
+	if (!ASSERT_NEQ(iter_skel->bss->listen_hport, 0,
+			"get_local_port(listen_fd)"))
+		goto done;
+
+	/* Connect to non-reuseport listen_fd */
+	est_fds = make_established(listen_fd, nr_est, &accepted_fds);
+	if (!ASSERT_OK_PTR(est_fds, "create established"))
+		goto done;
+
+	/* Prepare reuseport listen fds */
+	reuse_listen_fds = start_reuseport_server(AF_INET6, SOCK_STREAM,
+						  "::1", 0, 0,
+						  nr_reuse_listens);
+	if (!ASSERT_OK_PTR(reuse_listen_fds, "start_reuseport_server"))
+		goto done;
+	if (!ASSERT_EQ(set_bpf_cubic(reuse_listen_fds, nr_reuse_listens),
+		       nr_reuse_listens, "set reuse_listen_fds to cubic"))
+		goto done;
+	iter_skel->bss->reuse_listen_hport = get_local_port(reuse_listen_fds[0]);
+	if (!ASSERT_NEQ(iter_skel->bss->reuse_listen_hport, 0,
+			"get_local_port(reuse_listen_fds[0])"))
+		goto done;
+
+	/* Run bpf tcp iter to switch from bpf_cubic to bpf_dctcp */
+	iter_skel->bss->random_retry = random_retry;
+	iter_fd = bpf_iter_create(bpf_link__fd(iter_skel->links.change_tcp_cc));
+	if (!ASSERT_GE(iter_fd, 0, "create iter_fd"))
+		goto done;
+
+	while ((err = read(iter_fd, &buf, sizeof(buf))) == -1 &&
+	       errno == EAGAIN)
+		;
+	if (!ASSERT_OK(err, "read iter error"))
+		goto done;
+
+	/* Check reuseport listen fds for dctcp */
+	ASSERT_EQ(check_bpf_dctcp(reuse_listen_fds, nr_reuse_listens),
+		  nr_reuse_listens,
+		  "check reuse_listen_fds dctcp");
+
+	/* Check non reuseport listen fd for dctcp */
+	ASSERT_EQ(check_bpf_dctcp(&listen_fd, 1), 1,
+		  "check listen_fd dctcp");
+
+	/* Check established fds for dctcp */
+	ASSERT_EQ(check_bpf_dctcp(est_fds, nr_est), nr_est,
+		  "check est_fds dctcp");
+
+	/* Check accepted fds for dctcp */
+	ASSERT_EQ(check_bpf_dctcp(accepted_fds, nr_est), nr_est,
+		  "check accepted_fds dctcp");
+
+done:
+	if (iter_fd != -1)
+		close(iter_fd);
+	if (listen_fd != -1)
+		close(listen_fd);
+	free_fds(reuse_listen_fds, nr_reuse_listens);
+	free_fds(accepted_fds, nr_est);
+	free_fds(est_fds, nr_est);
+}
+
+void test_bpf_iter_setsockopt(void)
+{
+	struct bpf_iter_setsockopt *iter_skel = NULL;
+	struct bpf_cubic *cubic_skel = NULL;
+	struct bpf_dctcp *dctcp_skel = NULL;
+	struct bpf_link *cubic_link = NULL;
+	struct bpf_link *dctcp_link = NULL;
+
+	if (create_netns())
+		return;
+
+	/* Load iter_skel */
+	iter_skel = bpf_iter_setsockopt__open_and_load();
+	if (!ASSERT_OK_PTR(iter_skel, "iter_skel"))
+		return;
+	iter_skel->links.change_tcp_cc = bpf_program__attach_iter(iter_skel->progs.change_tcp_cc, NULL);
+	if (!ASSERT_OK_PTR(iter_skel->links.change_tcp_cc, "attach iter"))
+		goto done;
+
+	/* Load bpf_cubic */
+	cubic_skel = bpf_cubic__open_and_load();
+	if (!ASSERT_OK_PTR(cubic_skel, "cubic_skel"))
+		goto done;
+	cubic_link = bpf_map__attach_struct_ops(cubic_skel->maps.cubic);
+	if (!ASSERT_OK_PTR(cubic_link, "cubic_link"))
+		goto done;
+
+	/* Load bpf_dctcp */
+	dctcp_skel = bpf_dctcp__open_and_load();
+	if (!ASSERT_OK_PTR(dctcp_skel, "dctcp_skel"))
+		goto done;
+	dctcp_link = bpf_map__attach_struct_ops(dctcp_skel->maps.dctcp);
+	if (!ASSERT_OK_PTR(dctcp_link, "dctcp_link"))
+		goto done;
+
+	do_bpf_iter_setsockopt(iter_skel, true);
+	do_bpf_iter_setsockopt(iter_skel, false);
+
+done:
+	bpf_link__destroy(cubic_link);
+	bpf_link__destroy(dctcp_link);
+	bpf_cubic__destroy(cubic_skel);
+	bpf_dctcp__destroy(dctcp_skel);
+	bpf_iter_setsockopt__destroy(iter_skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c b/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c
new file mode 100644
index 000000000000..72cc4c1c681e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include "bpf_iter.h"
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define sk_num			__sk_common.skc_num
+#define sk_dport		__sk_common.skc_dport
+#define sk_state		__sk_common.skc_state
+#define sk_family		__sk_common.skc_family
+
+#define bpf_tcp_sk(skc)	({				\
+	struct sock_common *_skc = skc;			\
+	sk = NULL;					\
+	tp = NULL;					\
+	if (_skc) {					\
+		tp = bpf_skc_to_tcp_sock(_skc);		\
+		sk = (struct sock *)tp;			\
+	}						\
+	tp;						\
+})
+
+unsigned short reuse_listen_hport = 0;
+unsigned short listen_hport = 0;
+char old_cc[TCP_CA_NAME_MAX] = "bpf_cubic";
+char new_cc[TCP_CA_NAME_MAX] = "bpf_dctcp";
+bool random_retry = false;
+
+static bool tcp_ca_eq(const char *a, const char *b)
+{
+	int i;
+
+	for (i = 0; i < TCP_CA_NAME_MAX; i++) {
+		if (a[i] != b[i])
+			return false;
+		if (!a[i])
+			break;
+	}
+
+	return true;
+}
+
+SEC("iter/tcp")
+int change_tcp_cc(struct bpf_iter__tcp *ctx)
+{
+	struct bpf_iter_meta *meta = ctx->meta;
+	struct inet_connection_sock *icsk;
+	struct seq_file *seq = meta->seq;
+	struct tcp_sock *tp;
+	struct sock *sk;
+	int ret;
+
+	if (!bpf_tcp_sk(ctx->sk_common))
+		return 0;
+
+	if (sk->sk_family != AF_INET6 ||
+	    (sk->sk_state != TCP_LISTEN &&
+	     sk->sk_state != TCP_ESTABLISHED) ||
+	    (sk->sk_num != reuse_listen_hport &&
+	     sk->sk_num != listen_hport &&
+	     bpf_ntohs(sk->sk_dport) != listen_hport))
+		return 0;
+
+	icsk = (struct inet_connection_sock *)tp;
+	if (!tcp_ca_eq(icsk->icsk_ca_ops->name, old_cc))
+		return 0;
+
+	if (random_retry && bpf_get_prandom_u32() % 4 == 1)
+		return 1;
+
+	bpf_setsockopt(tp, SOL_TCP, TCP_CONGESTION, new_cc, sizeof(new_cc));
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 01378911252b..d66c4caaeec1 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -5,6 +5,10 @@
 #define AF_INET			2
 #define AF_INET6		10
 
+#define SOL_TCP			6
+#define TCP_CONGESTION		13
+#define TCP_CA_NAME_MAX		16
+
 #define ICSK_TIME_RETRANS	1
 #define ICSK_TIME_PROBE0	3
 #define ICSK_TIME_LOSS_PROBE	5
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock
  2021-06-25 20:05 ` [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock Martin KaFai Lau
@ 2021-06-29 17:27   ` Yonghong Song
  2021-06-29 17:44     ` Martin KaFai Lau
  0 siblings, 1 reply; 16+ messages in thread
From: Yonghong Song @ 2021-06-29 17:27 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yuchung Cheng



On 6/25/21 1:05 PM, Martin KaFai Lau wrote:
> This patch does batching and lock_sock for the bpf tcp iter.
> It does not affect the proc fs iteration.
> 
> With bpf-tcp-cc, new algo rollout happens more often.  Instead of
> restarting the application to pick up the new tcp-cc, the next patch
> will allow bpf iter with CAP_NET_ADMIN to do setsockopt(TCP_CONGESTION).
> This requires locking the sock.
> 
> Also, unlike the proc iteration (cat /proc/net/tcp[6]), the bpf iter
> can inspect all fields of a tcp_sock.  It will be useful to have a
> consistent view on some of the fields (e.g. the ones reported in
> tcp_get_info() that also acquires the sock lock).
> 
> Double lock: locking the bucket first and then locking the sock could
> lead to deadlock.  This patch takes a batching approach similar to
> inet_diag.  While holding the bucket lock, it batches a number of sockets
> into an array first and then unlocks the bucket.  Before doing show(),
> it then calls lock_sock_fast().
> 
> In a machine with ~400k connections, the maximum number of
> sk in a bucket of the established hashtable is 7.  0.02% of
> the established connections fall into this bucket size.
> 
> For the listen hash (port+addr lhash2), the bucket is usually also very
> small, except for the SO_REUSEPORT use case where the
> userspace could have one SO_REUSEPORT socket per thread.
> 
> While batching is used, it can also minimize the chance of missing
> sock in the setsockopt use case if the whole bucket is batched.
> This patch will start with a batch array of INIT_BATCH_SZ (16),
> which will be enough for the most common cases.  bpf_iter_tcp_batch()
> will try to realloc to a larger array to handle the exceptional case
> (e.g. the SO_REUSEPORT case in the lhash2).
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   net/ipv4/tcp_ipv4.c | 236 ++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 230 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 0d851289a89e..856144d33f52 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2687,6 +2687,15 @@ static int tcp4_seq_show(struct seq_file *seq, void *v)
>   }
>   
>   #ifdef CONFIG_BPF_SYSCALL
> +struct bpf_tcp_iter_state {
> +	struct tcp_iter_state state;
> +	unsigned int cur_sk;
> +	unsigned int end_sk;
> +	unsigned int max_sk;
> +	struct sock **batch;
> +	bool st_bucket_done;
> +};
> +
>   struct bpf_iter__tcp {
>   	__bpf_md_ptr(struct bpf_iter_meta *, meta);
>   	__bpf_md_ptr(struct sock_common *, sk_common);
> @@ -2705,16 +2714,203 @@ static int tcp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
>   	return bpf_iter_run_prog(prog, &ctx);
>   }
>   
> +static void bpf_iter_tcp_put_batch(struct bpf_tcp_iter_state *iter)
> +{
> +	while (iter->cur_sk < iter->end_sk)
> +		sock_put(iter->batch[iter->cur_sk++]);
> +}
> +
> +static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
> +				      unsigned int new_batch_sz)
> +{
> +	struct sock **new_batch;
> +
> +	new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz, GFP_USER);

Since we return -ENOMEM below, should we have __GFP_NOWARN in kvmalloc 
flags?
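e.g. something like (untested):

	new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz,
			     GFP_USER | __GFP_NOWARN);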

> +	if (!new_batch)
> +		return -ENOMEM;
> +
> +	bpf_iter_tcp_put_batch(iter);
> +	kvfree(iter->batch);
> +	iter->batch = new_batch;
> +	iter->max_sk = new_batch_sz;
> +
> +	return 0;
> +}
> +
[...]
> +
>   static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>   {
>   	struct bpf_iter_meta meta;
>   	struct bpf_prog *prog;
>   	struct sock *sk = v;
> +	bool slow;
>   	uid_t uid;
> +	int ret;
>   
>   	if (v == SEQ_START_TOKEN)
>   		return 0;
>   
> +	if (sk_fullsock(sk))
> +		slow = lock_sock_fast(sk);
> +
> +	if (unlikely(sk_unhashed(sk))) {
> +		ret = SEQ_SKIP;
> +		goto unlock;
> +	}

I am not a tcp expert. Maybe a dummy question.
Is it possible to do setsockopt() for listening socket?
What will happen if the listening sock is unhashed after the
above check?

> +
>   	if (sk->sk_state == TCP_TIME_WAIT) {
>   		uid = 0;
>   	} else if (sk->sk_state == TCP_NEW_SYN_RECV) {
> @@ -2728,11 +2924,18 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>   
>   	meta.seq = seq;
>   	prog = bpf_iter_get_info(&meta, false);
> -	return tcp_prog_seq_show(prog, &meta, v, uid);
> +	ret = tcp_prog_seq_show(prog, &meta, v, uid);
> +
> +unlock:
> +	if (sk_fullsock(sk))
> +		unlock_sock_fast(sk, slow);
> +	return ret;
> +
>   }
>   
>   static void bpf_iter_tcp_seq_stop(struct seq_file *seq, void *v)
>   {
> +	struct bpf_tcp_iter_state *iter = seq->private;
>   	struct bpf_iter_meta meta;
>   	struct bpf_prog *prog;
>   
> @@ -2743,13 +2946,16 @@ static void bpf_iter_tcp_seq_stop(struct seq_file *seq, void *v)
>   			(void)tcp_prog_seq_show(prog, &meta, v, 0);
>   	}
>   
> -	tcp_seq_stop(seq, v);
> +	if (iter->cur_sk < iter->end_sk) {
> +		bpf_iter_tcp_put_batch(iter);
> +		iter->st_bucket_done = false;
> +	}
>   }
>   
[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock
  2021-06-29 17:27   ` Yonghong Song
@ 2021-06-29 17:44     ` Martin KaFai Lau
  2021-06-29 17:57       ` Yonghong Song
  0 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-29 17:44 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	kernel-team, Neal Cardwell, netdev, Yuchung Cheng

On Tue, Jun 29, 2021 at 10:27:17AM -0700, Yonghong Song wrote:
[ ... ]

> > +static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
> > +				      unsigned int new_batch_sz)
> > +{
> > +	struct sock **new_batch;
> > +
> > +	new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz, GFP_USER);
> 
> Since we return -ENOMEM below, should we have __GFP_NOWARN in kvmalloc
> flags?
will add in v2.

> 
> > +	if (!new_batch)
> > +		return -ENOMEM;
> > +
> > +	bpf_iter_tcp_put_batch(iter);
> > +	kvfree(iter->batch);
> > +	iter->batch = new_batch;
> > +	iter->max_sk = new_batch_sz;
> > +
> > +	return 0;
> > +}
> > +
> [...]
> > +
> >   static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
> >   {
> >   	struct bpf_iter_meta meta;
> >   	struct bpf_prog *prog;
> >   	struct sock *sk = v;
> > +	bool slow;
> >   	uid_t uid;
> > +	int ret;
> >   	if (v == SEQ_START_TOKEN)
> >   		return 0;
> > +	if (sk_fullsock(sk))
> > +		slow = lock_sock_fast(sk);
> > +
> > +	if (unlikely(sk_unhashed(sk))) {
> > +		ret = SEQ_SKIP;
> > +		goto unlock;
> > +	}
> 
> I am not a tcp expert. Maybe a dummy question.
> Is it possible to do setsockopt() for listening socket?
> What will happen if the listening sock is unhashed after the
> above check?
It won't happen because the sk has been locked before doing the
unhashed check.

Thanks for the review.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock
  2021-06-29 17:44     ` Martin KaFai Lau
@ 2021-06-29 17:57       ` Yonghong Song
  2021-06-29 18:06         ` Martin KaFai Lau
  0 siblings, 1 reply; 16+ messages in thread
From: Yonghong Song @ 2021-06-29 17:57 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	kernel-team, Neal Cardwell, netdev, Yuchung Cheng



On 6/29/21 10:44 AM, Martin KaFai Lau wrote:
> On Tue, Jun 29, 2021 at 10:27:17AM -0700, Yonghong Song wrote:
> [ ... ]
> 
>>> +static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
>>> +				      unsigned int new_batch_sz)
>>> +{
>>> +	struct sock **new_batch;
>>> +
>>> +	new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz, GFP_USER);
>>
>> Since we return -ENOMEM below, should we have __GFP_NOWARN in kvmalloc
>> flags?
> will add in v2.
> 
>>
>>> +	if (!new_batch)
>>> +		return -ENOMEM;
>>> +
>>> +	bpf_iter_tcp_put_batch(iter);
>>> +	kvfree(iter->batch);
>>> +	iter->batch = new_batch;
>>> +	iter->max_sk = new_batch_sz;
>>> +
>>> +	return 0;
>>> +}
>>> +
>> [...]
>>> +
>>>    static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>>    {
>>>    	struct bpf_iter_meta meta;
>>>    	struct bpf_prog *prog;
>>>    	struct sock *sk = v;
>>> +	bool slow;
>>>    	uid_t uid;
>>> +	int ret;
>>>    	if (v == SEQ_START_TOKEN)
>>>    		return 0;
>>> +	if (sk_fullsock(sk))
>>> +		slow = lock_sock_fast(sk);
>>> +
>>> +	if (unlikely(sk_unhashed(sk))) {
>>> +		ret = SEQ_SKIP;
>>> +		goto unlock;
>>> +	}
>>
>> I am not a tcp expert. Maybe a dummy question.
>> Is it possible to do setsockopt() for listening socket?
>> What will happen if the listening sock is unhashed after the
>> above check?
> It won't happen because the sk has been locked before doing the
> unhashed check.

Ya, that is true. I guess I probably mean TCP_TIME_WAIT and
TCP_NEW_SYN_RECV sockets. We cannot do setsockopt() for
TCP_TIME_WAIT sockets since user space shouldn't be able
to access the socket any more.

But how about TCP_NEW_SYN_RECV sockets?

> 
> Thanks for the review.
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock
  2021-06-29 17:57       ` Yonghong Song
@ 2021-06-29 18:06         ` Martin KaFai Lau
  2021-06-29 18:55           ` Yonghong Song
  0 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2021-06-29 18:06 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	kernel-team, Neal Cardwell, netdev, Yuchung Cheng

On Tue, Jun 29, 2021 at 10:57:46AM -0700, Yonghong Song wrote:
> 
> 
> On 6/29/21 10:44 AM, Martin KaFai Lau wrote:
> > On Tue, Jun 29, 2021 at 10:27:17AM -0700, Yonghong Song wrote:
> > [ ... ]
> > 
> > > > +static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
> > > > +				      unsigned int new_batch_sz)
> > > > +{
> > > > +	struct sock **new_batch;
> > > > +
> > > > +	new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz, GFP_USER);
> > > 
> > > Since we return -ENOMEM below, should we have __GFP_NOWARN in kvmalloc
> > > flags?
> > will add in v2.
> > 
> > > 
> > > > +	if (!new_batch)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	bpf_iter_tcp_put_batch(iter);
> > > > +	kvfree(iter->batch);
> > > > +	iter->batch = new_batch;
> > > > +	iter->max_sk = new_batch_sz;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > [...]
> > > > +
> > > >    static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
> > > >    {
> > > >    	struct bpf_iter_meta meta;
> > > >    	struct bpf_prog *prog;
> > > >    	struct sock *sk = v;
> > > > +	bool slow;
> > > >    	uid_t uid;
> > > > +	int ret;
> > > >    	if (v == SEQ_START_TOKEN)
> > > >    		return 0;
> > > > +	if (sk_fullsock(sk))
> > > > +		slow = lock_sock_fast(sk);
> > > > +
> > > > +	if (unlikely(sk_unhashed(sk))) {
> > > > +		ret = SEQ_SKIP;
> > > > +		goto unlock;
> > > > +	}
> > > 
> > > I am not a tcp expert. Maybe a dummy question.
> > > Is it possible to do setsockopt() for listening socket?
> > > What will happen if the listening sock is unhashed after the
> > > above check?
> > It won't happen because the sk has been locked before doing the
> > unhashed check.
> 
> Ya, that is true. I guess I probably mean TCP_TIME_WAIT and
> TCP_NEW_SYN_RECV sockets. We cannot do setsockopt() for
> TCP_TIME_WAIT sockets since user space shouldn't be able
> to access the socket any more.
> 
> But how about TCP_NEW_SYN_RECV sockets?
_bpf_setsockopt() will return -EINVAL for non fullsock.
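i.e. roughly (paraphrasing, not the exact code in net/core/filter.c):

	if (!sk_fullsock(sk))
		return -EINVAL;

so a TCP_NEW_SYN_RECV request sock never reaches the option handling.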

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock
  2021-06-29 18:06         ` Martin KaFai Lau
@ 2021-06-29 18:55           ` Yonghong Song
  0 siblings, 0 replies; 16+ messages in thread
From: Yonghong Song @ 2021-06-29 18:55 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	kernel-team, Neal Cardwell, netdev, Yuchung Cheng



On 6/29/21 11:06 AM, Martin KaFai Lau wrote:
> On Tue, Jun 29, 2021 at 10:57:46AM -0700, Yonghong Song wrote:
>>
>>
>> On 6/29/21 10:44 AM, Martin KaFai Lau wrote:
>>> On Tue, Jun 29, 2021 at 10:27:17AM -0700, Yonghong Song wrote:
>>> [ ... ]
>>>
>>>>> +static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
>>>>> +				      unsigned int new_batch_sz)
>>>>> +{
>>>>> +	struct sock **new_batch;
>>>>> +
>>>>> +	new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz, GFP_USER);
>>>>
>>>> Since we return -ENOMEM below, should we have __GFP_NOWARN in kvmalloc
>>>> flags?
>>> will add in v2.
>>>
>>>>
>>>>> +	if (!new_batch)
>>>>> +		return -ENOMEM;
>>>>> +
>>>>> +	bpf_iter_tcp_put_batch(iter);
>>>>> +	kvfree(iter->batch);
>>>>> +	iter->batch = new_batch;
>>>>> +	iter->max_sk = new_batch_sz;
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>> [...]
>>>>> +
>>>>>     static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>>>>     {
>>>>>     	struct bpf_iter_meta meta;
>>>>>     	struct bpf_prog *prog;
>>>>>     	struct sock *sk = v;
>>>>> +	bool slow;
>>>>>     	uid_t uid;
>>>>> +	int ret;
>>>>>     	if (v == SEQ_START_TOKEN)
>>>>>     		return 0;
>>>>> +	if (sk_fullsock(sk))
>>>>> +		slow = lock_sock_fast(sk);
>>>>> +
>>>>> +	if (unlikely(sk_unhashed(sk))) {
>>>>> +		ret = SEQ_SKIP;
>>>>> +		goto unlock;
>>>>> +	}
>>>>
>>>> I am not a tcp expert. Maybe a dummy question.
>>>> Is it possible to do setsockopt() for listening socket?
>>>> What will happen if the listening sock is unhashed after the
>>>> above check?
>>> It won't happen because the sk has been locked before doing the
>>> unhashed check.
>>
>> Ya, that is true. I guess I probably mean TCP_TIME_WAIT and
>> TCP_NEW_SYN_RECV sockets. We cannot do setsockopt() for
>> TCP_TIME_WAIT sockets since user space shouldn't be able
>> to access the socket any more.
>>
>> But how about TCP_NEW_SYN_RECV sockets?
> _bpf_setsockopt() will return -EINVAL for non fullsock.

That makes sense. I was wondering whether we could block calling
bpf_setsockopt() for unsupported sockets outside the bpf program.
But indeed letting bpf do the filtering in such cases should
be simpler.
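
e.g. the iter program can skip them up front, as the selftest already
does (rough sketch):

	if (sk->sk_state != TCP_LISTEN && sk->sk_state != TCP_ESTABLISHED)
		return 0;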

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 8/8] bpf: selftest: Test batching and bpf_setsockopt in bpf tcp iter
  2021-06-25 20:05 ` [PATCH bpf-next 8/8] bpf: selftest: Test batching and " Martin KaFai Lau
@ 2021-06-29 19:00   ` Yonghong Song
  0 siblings, 0 replies; 16+ messages in thread
From: Yonghong Song @ 2021-06-29 19:00 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yuchung Cheng



On 6/25/21 1:05 PM, Martin KaFai Lau wrote:
> This patch adds tests for the batching and bpf_setsockopt in bpf tcp iter.
> 
> It first creates:
> a) 1 non SO_REUSEPORT listener in lhash2.
> b) 256 passive and active fds connected to the listener in (a).
> c) 256 SO_REUSEPORT listeners in one of the lhash2 bucket.
> 
> The test sets all listeners and connections to bpf_cubic before
> running the bpf iter.
> 
> The bpf iter then calls setsockopt(TCP_CONGESTION) to switch
> each listener and connection from bpf_cubic to bpf_dctcp.
> 
> The bpf iter has a random_retry mode such that it can return EAGAIN
> to the userspace in the middle of a batch.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>   tools/testing/selftests/bpf/network_helpers.c |  85 ++++++-
>   tools/testing/selftests/bpf/network_helpers.h |   4 +
>   .../bpf/prog_tests/bpf_iter_setsockopt.c      | 226 ++++++++++++++++++
>   .../selftests/bpf/progs/bpf_iter_setsockopt.c |  76 ++++++
>   .../selftests/bpf/progs/bpf_tracing_net.h     |   4 +
>   5 files changed, 386 insertions(+), 9 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c
>   create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c
> 
> diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> index 2060bc122c53..26468a8f44f3 100644
> --- a/tools/testing/selftests/bpf/network_helpers.c
> +++ b/tools/testing/selftests/bpf/network_helpers.c
> @@ -66,17 +66,13 @@ int settimeo(int fd, int timeout_ms)
>   
>   #define save_errno_close(fd) ({ int __save = errno; close(fd); errno = __save; })
>   
> -int start_server(int family, int type, const char *addr_str, __u16 port,
> -		 int timeout_ms)
> +static int __start_server(int type, const struct sockaddr *addr,
> +			  socklen_t addrlen, int timeout_ms, bool reuseport)
>   {
> -	struct sockaddr_storage addr = {};
> -	socklen_t len;
> +	int on = 1;
>   	int fd;
>   
> -	if (make_sockaddr(family, addr_str, port, &addr, &len))
> -		return -1;
> -
> -	fd = socket(family, type, 0);
> +	fd = socket(addr->sa_family, type, 0);
>   	if (fd < 0) {
>   		log_err("Failed to create server socket");
>   		return -1;
> @@ -85,7 +81,13 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
>   	if (settimeo(fd, timeout_ms))
>   		goto error_close;
>   
> -	if (bind(fd, (const struct sockaddr *)&addr, len) < 0) {
> +	if (reuseport &&
> +	    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on))) {
> +		log_err("Failed to set SO_REUSEPORT");
> +		return -1;
> +	}
> +
> +	if (bind(fd, addr, addrlen) < 0) {
>   		log_err("Failed to bind socket");
>   		goto error_close;
>   	}
> @@ -104,6 +106,69 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
>   	return -1;
>   }
>   
[...]
> +void test_bpf_iter_setsockopt(void)
> +{
> +	struct bpf_iter_setsockopt *iter_skel = NULL;
> +	struct bpf_cubic *cubic_skel = NULL;
> +	struct bpf_dctcp *dctcp_skel = NULL;
> +	struct bpf_link *cubic_link = NULL;
> +	struct bpf_link *dctcp_link = NULL;
> +
> +	if (create_netns())
> +		return;
> +
> +	/* Load iter_skel */
> +	iter_skel = bpf_iter_setsockopt__open_and_load();
> +	if (!ASSERT_OK_PTR(iter_skel, "iter_skel"))
> +		return;
> +	iter_skel->links.change_tcp_cc = bpf_program__attach_iter(iter_skel->progs.change_tcp_cc, NULL);
> +	if (!ASSERT_OK_PTR(iter_skel->links.change_tcp_cc, "attach iter"))
> +		goto done;
> +
> +	/* Load bpf_cubic */
> +	cubic_skel = bpf_cubic__open_and_load();
> +	if (!ASSERT_OK_PTR(cubic_skel, "cubic_skel"))
> +		goto done;
> +	cubic_link = bpf_map__attach_struct_ops(cubic_skel->maps.cubic);
> +	if (!ASSERT_OK_PTR(cubic_link, "cubic_link"))
> +		goto done;
> +
> +	/* Load bpf_dctcp */
> +	dctcp_skel = bpf_dctcp__open_and_load();
> +	if (!ASSERT_OK_PTR(dctcp_skel, "dctcp_skel"))
> +		goto done;
> +	dctcp_link = bpf_map__attach_struct_ops(dctcp_skel->maps.dctcp);
> +	if (!ASSERT_OK_PTR(dctcp_link, "dctcp_link"))
> +		goto done;
> +
> +	do_bpf_iter_setsockopt(iter_skel, true);
> +	do_bpf_iter_setsockopt(iter_skel, false);
> +
> +done:
> +	bpf_link__destroy(cubic_link);
> +	bpf_link__destroy(dctcp_link);
> +	bpf_cubic__destroy(cubic_skel);
> +	bpf_dctcp__destroy(dctcp_skel);
> +	bpf_iter_setsockopt__destroy(iter_skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c b/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c
> new file mode 100644
> index 000000000000..72cc4c1c681e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c
> @@ -0,0 +1,76 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021 Facebook */
> +#include "bpf_iter.h"
> +#include "bpf_tracing_net.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +#define sk_num			__sk_common.skc_num
> +#define sk_dport		__sk_common.skc_dport
> +#define sk_state		__sk_common.skc_state
> +#define sk_family		__sk_common.skc_family

The above macros can be defined in bpf_tracing_net.h. For example,
sk_state and sk_family are already defined in bpf_tracing_net.h:

#define sk_family               __sk_common.skc_family
#define sk_rmem_alloc           sk_backlog.rmem_alloc
#define sk_refcnt               __sk_common.skc_refcnt
#define sk_state                __sk_common.skc_state
#define sk_v6_daddr             __sk_common.skc_v6_daddr
#define sk_v6_rcv_saddr         __sk_common.skc_v6_rcv_saddr
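
so only the two missing ones would need to be added there, e.g.:

#define sk_num			__sk_common.skc_num
#define sk_dport		__sk_common.skc_dport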

> +
> +#define bpf_tcp_sk(skc)	({				\
> +	struct sock_common *_skc = skc;			\
> +	sk = NULL;					\
> +	tp = NULL;					\
> +	if (_skc) {					\
> +		tp = bpf_skc_to_tcp_sock(_skc);		\
> +		sk = (struct sock *)tp;			\
> +	}						\
> +	tp;						\
> +})
[...]
> +
> +char _license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> index 01378911252b..d66c4caaeec1 100644
> --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> @@ -5,6 +5,10 @@
>   #define AF_INET			2
>   #define AF_INET6		10
>   
> +#define SOL_TCP			6
> +#define TCP_CONGESTION		13
> +#define TCP_CA_NAME_MAX		16
> +
>   #define ICSK_TIME_RETRANS	1
>   #define ICSK_TIME_PROBE0	3
>   #define ICSK_TIME_LOSS_PROBE	5
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt
  2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
                   ` (7 preceding siblings ...)
  2021-06-25 20:05 ` [PATCH bpf-next 8/8] bpf: selftest: Test batching and " Martin KaFai Lau
@ 2021-06-29 19:04 ` Yonghong Song
  8 siblings, 0 replies; 16+ messages in thread
From: Yonghong Song @ 2021-06-29 19:04 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, kernel-team,
	Neal Cardwell, netdev, Yuchung Cheng



On 6/25/21 1:04 PM, Martin KaFai Lau wrote:
> This set is to allow bpf tcp iter to call bpf_setsockopt.
> 
> With bpf-tcp-cc, new algo rollout happens more often.  Instead of
> restarting the applications to pick up the new tcp-cc, this set
> allows the bpf tcp iter with the netadmin cap to call
> bpf_setsockopt(TCP_CONGESTION).  It is not limited to TCP_CONGESTION
> and the bpf tcp iter can call bpf_setsockopt() with other options.
> The bpf tcp iter can read into all the fields of a tcp_sock, so
> there is a lot of flexibility to select the desired sk to do
> setsockopt(), e.g. it can test for TCP_LISTEN only and leave
> the established connections untouched, or check the addr/port,
> or check the current tcp-cc name, ...etc.
> 
> Patch 1-4 are some cleanup and prep work in the tcp and bpf seq_file.
> 
> Patch 5 is to have the tcp seq_file iterate on the
> port+addr lhash2 instead of the port only listening_hash.
> 
> Patch 6 is to have the bpf tcp iter doing batching which
> then allows lock_sock.  lock_sock is needed for setsockopt.
> 
> Patch 7 allows the bpf tcp iter to call bpf_setsockopt.
> 
> Martin KaFai Lau (8):
>    tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos
>    tcp: seq_file: Refactor net and family matching
>    bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state
>    tcp: seq_file: Add listening_get_first()
>    tcp: seq_file: Replace listening_hash with lhash2
>    bpf: tcp: bpf iter batching and lock_sock
>    bpf: tcp: Support bpf_setsockopt in bpf tcp iter
>    bpf: selftest: Test batching and bpf_setsockopt in bpf tcp iter
> 
>   include/linux/bpf.h                           |   7 +
>   include/net/inet_hashtables.h                 |   6 +
>   include/net/tcp.h                             |   1 -
>   kernel/bpf/bpf_iter.c                         |  22 +
>   kernel/trace/bpf_trace.c                      |   7 +-
>   net/core/filter.c                             |  17 +
>   net/ipv4/tcp_ipv4.c                           | 409 ++++++++++++++----
>   tools/testing/selftests/bpf/network_helpers.c |  85 +++-
>   tools/testing/selftests/bpf/network_helpers.h |   4 +
>   .../bpf/prog_tests/bpf_iter_setsockopt.c      | 226 ++++++++++
>   .../selftests/bpf/progs/bpf_iter_setsockopt.c |  76 ++++
>   .../selftests/bpf/progs/bpf_tracing_net.h     |   4 +
>   12 files changed, 767 insertions(+), 97 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c
>   create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c

I have a few minor comments (replying to individual commits). But 
overall LGTM.

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-06-29 19:04 UTC | newest]

Thread overview: 16+ messages
2021-06-25 20:04 [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Martin KaFai Lau
2021-06-25 20:04 ` [PATCH bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos Martin KaFai Lau
2021-06-25 20:04 ` [PATCH bpf-next 2/8] tcp: seq_file: Refactor net and family matching Martin KaFai Lau
2021-06-25 20:05 ` [PATCH bpf-next 3/8] bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state Martin KaFai Lau
2021-06-25 20:05 ` [PATCH bpf-next 4/8] tcp: seq_file: Add listening_get_first() Martin KaFai Lau
2021-06-25 20:05 ` [PATCH bpf-next 5/8] tcp: seq_file: Replace listening_hash with lhash2 Martin KaFai Lau
2021-06-25 20:05 ` [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock Martin KaFai Lau
2021-06-29 17:27   ` Yonghong Song
2021-06-29 17:44     ` Martin KaFai Lau
2021-06-29 17:57       ` Yonghong Song
2021-06-29 18:06         ` Martin KaFai Lau
2021-06-29 18:55           ` Yonghong Song
2021-06-25 20:05 ` [PATCH bpf-next 7/8] bpf: tcp: Support bpf_setsockopt in bpf tcp iter Martin KaFai Lau
2021-06-25 20:05 ` [PATCH bpf-next 8/8] bpf: selftest: Test batching and " Martin KaFai Lau
2021-06-29 19:00   ` Yonghong Song
2021-06-29 19:04 ` [PATCH bpf-next 0/8] bpf: Allow bpf tcp iter to do bpf_setsockopt Yonghong Song
