bpf.vger.kernel.org archive mirror
* [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability
@ 2023-05-19 22:51 Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 1/9] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
                   ` (9 more replies)
  0 siblings, 10 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag

This patch set adds the capability to destroy sockets in BPF. We plan to
use the capability in Cilium to force client sockets to reconnect when
their remote load-balancing backends are deleted. The other use case is
on-the-fly policy enforcement where existing socket connections
prevented by policies need to be terminated.

The use cases, and more details around
the selected approach were presented at LPC 2022 -
https://lpc.events/event/16/contributions/1358/.
RFC discussion -
https://lore.kernel.org/netdev/CABG=zsBEh-P4NXk23eBJw7eajB5YJeRS7oPXnTAzs=yob4EMoQ@mail.gmail.com/T/#u.
v8 patch series -
https://lore.kernel.org/bpf/20230517175359.527917-1-aditi.ghag@isovalent.com/

v9 highlights:
Address review comments:
Martin:
- Rearranged the kfunc filter patch, and added the missing break
  statement.
- Squashed the extended selftest/bpf patch.
Yonghong:
- Revised commit message for patch 1.

(Below notes are same as v8 patch series that are still relevant. Refer to
earlier patch series versions for other notes.)
- I hit a snag while writing the kfunc where the verifier complained about
  the `sock_common` type passed from the TCP iterator. With kfuncs, there
  don't seem to be any options available to pass BTF type hints to the
  verifier (equivalent of `ARG_PTR_TO_BTF_ID_SOCK_COMMON`, as was the case
  with the helper). As a result, I changed the argument type of the
  sock_destroy kfunc to `sock_common`.
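
To illustrate the resulting BPF-side usage, here is a minimal sketch of an
iterator program invoking the kfunc with a `sock_common` argument. It
mirrors the selftests added later in this series; the program name and the
unconditional destroy are only illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* Kfunc added by this series; declared as a ksym on the BPF side. */
int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/tcp")
int iter_tcp6_destroy(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk_common = ctx->sk_common;

	if (!sk_common)
		return 0;

	/* A real program would filter further here (e.g. by socket cookie
	 * or port, as in the selftests) before destroying the socket.
	 */
	bpf_sock_destroy(sk_common);

	return 0;
}
```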


Aditi Ghag (9):
  bpf: tcp: Avoid taking fast sock lock in iterator
  udp: seq_file: Helper function to match socket attributes
  bpf: udp: Encapsulate logic to get udp table
  udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
  bpf: udp: Implement batching for sockets iterator
  bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
  bpf: Add bpf_sock_destroy kfunc
  selftests/bpf: Add helper to get port using getsockname
  selftests/bpf: Test bpf_sock_destroy

 include/linux/btf.h                           |  18 +-
 include/net/udp.h                             |   1 -
 kernel/bpf/btf.c                              |  61 +++-
 kernel/bpf/verifier.c                         |   7 +-
 net/core/filter.c                             |  63 ++++
 net/ipv4/tcp.c                                |   9 +-
 net/ipv4/tcp_ipv4.c                           |   7 +-
 net/ipv4/udp.c                                | 291 +++++++++++++++---
 tools/testing/selftests/bpf/network_helpers.c |  23 ++
 tools/testing/selftests/bpf/network_helpers.h |   1 +
 .../selftests/bpf/prog_tests/sock_destroy.c   | 221 +++++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 145 +++++++++
 .../bpf/progs/sock_destroy_prog_fail.c        |  22 ++
 13 files changed, 791 insertions(+), 78 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c

-- 
2.34.1



* [PATCH v9 bpf-next 1/9] bpf: tcp: Avoid taking fast sock lock in iterator
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
@ 2023-05-19 22:51 ` Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 2/9] udp: seq_file: Helper function to match socket attributes Aditi Ghag
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag, Yonghong Song

This is a preparatory commit to replace `lock_sock_fast` with
`lock_sock`, and allow BPF programs executed from the TCP sockets
iterator to destroy TCP sockets using the bpf_sock_destroy
kfunc (implemented in follow-up commits).

Previously, the BPF TCP iterator was acquiring the sock lock with BH
disabled. This led to scenarios where the sockets hash table bucket lock
could be acquired with BH enabled in some paths and disabled in others.
In such situations, the kernel issued a warning since it thought that in
the BH-enabled path the same bucket lock *might* be acquired again in
softirq context (BH disabled), which would lead to a potential deadlock.
Since bpf_sock_destroy also happens in a process context, the potential
deadlock warning is likely a false alarm.

Here is a snippet of annotated stack trace that motivated this change:

```

Possible interrupt unsafe locking scenario:

      CPU0                    CPU1
      ----                    ----
 lock(&h->lhash2[i].lock);
                              local_bh_disable();
                              lock(&h->lhash2[i].lock);
kernel imagined possible scenario:
  local_bh_disable();  /* Possible softirq */
  lock(&h->lhash2[i].lock);
*** Potential Deadlock ***

process context:

lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock with BH enabled
__inet_hash+0x4b/0x210
inet_csk_listen_start+0xe6/0x100
inet_listen+0x95/0x1d0
__sys_listen+0x69/0xb0
__x64_sys_listen+0x14/0x20
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc


bpf_sock_destroy run from iterator:

lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock with BH disabled
inet_unhash+0x9a/0x110
tcp_set_state+0x6a/0x210
tcp_abort+0x10d/0x200
bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
bpf_iter_run_prog+0x1ff/0x340
------> lock_sock_fast that acquires sock lock with BH disabled
bpf_iter_tcp_seq_show+0xca/0x190
bpf_seq_read+0x177/0x450

```

Also, Yonghong reported a deadlock for non-listening TCP sockets that
this change resolves. Previously, `lock_sock_fast` held the sock spin
lock with BH disabled, and the lock was again being acquired in `tcp_abort`:

```
watchdog: BUG: soft lockup - CPU#0 stuck for 86s! [test_progs:2331]
RIP: 0010:queued_spin_lock_slowpath+0xd8/0x500
Call Trace:
 <TASK>
 _raw_spin_lock+0x84/0x90
 tcp_abort+0x13c/0x1f0
 bpf_prog_88539c5453a9dd47_iter_tcp6_client+0x82/0x89
 bpf_iter_run_prog+0x1aa/0x2c0
 ? preempt_count_sub+0x1c/0xd0
 ? from_kuid_munged+0x1c8/0x210
 bpf_iter_tcp_seq_show+0x14e/0x1b0
 bpf_seq_read+0x36c/0x6a0


bpf_iter_tcp_seq_show
   lock_sock_fast
     __lock_sock_fast
       spin_lock_bh(&sk->sk_lock.slock);
	/* Fast path return with bottom halves disabled and
	 * sock::sk_lock.slock held.
	 */

 ...
 tcp_abort
   local_bh_disable();
   spin_lock(&((sk)->sk_lock.slock)); // from bh_lock_sock(sk)

```

With the switch to `lock_sock`, it calls `spin_unlock_bh` before returning:

```
lock_sock
    lock_sock_nested
       spin_lock_bh(&sk->sk_lock.slock);
       :
       spin_unlock_bh(&sk->sk_lock.slock);
```

Acked-by: Yonghong Song <yhs@meta.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/tcp_ipv4.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea370afa70ed..f2d370a9450f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2962,7 +2962,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 	struct sock *sk = v;
-	bool slow;
 	uid_t uid;
 	int ret;
 
@@ -2970,7 +2969,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 		return 0;
 
 	if (sk_fullsock(sk))
-		slow = lock_sock_fast(sk);
+		lock_sock(sk);
 
 	if (unlikely(sk_unhashed(sk))) {
 		ret = SEQ_SKIP;
@@ -2994,7 +2993,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 
 unlock:
 	if (sk_fullsock(sk))
-		unlock_sock_fast(sk, slow);
+		release_sock(sk);
 	return ret;
 
 }
-- 
2.34.1



* [PATCH v9 bpf-next 2/9] udp: seq_file: Helper function to match socket attributes
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 1/9] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
@ 2023-05-19 22:51 ` Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 3/9] bpf: udp: Encapsulate logic to get udp table Aditi Ghag
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag

This is a preparatory commit that refactors the code matching socket
attributes in iterators into a helper function, and uses it in the
proc fs iterator.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/udp.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..71e3fef44fd5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2983,6 +2983,16 @@ EXPORT_SYMBOL(udp_prot);
 /* ------------------------------------------------------------------------ */
 #ifdef CONFIG_PROC_FS
 
+static unsigned short seq_file_family(const struct seq_file *seq);
+static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
+{
+	unsigned short family = seq_file_family(seq);
+
+	/* AF_UNSPEC is used as a match all */
+	return ((family == AF_UNSPEC || family == sk->sk_family) &&
+		net_eq(sock_net(sk), seq_file_net(seq)));
+}
+
 static struct udp_table *udp_get_table_afinfo(struct udp_seq_afinfo *afinfo,
 					      struct net *net)
 {
@@ -3013,10 +3023,7 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 
 		spin_lock_bh(&hslot->lock);
 		sk_for_each(sk, &hslot->head) {
-			if (!net_eq(sock_net(sk), net))
-				continue;
-			if (afinfo->family == AF_UNSPEC ||
-			    sk->sk_family == afinfo->family)
+			if (seq_sk_match(seq, sk))
 				goto found;
 		}
 		spin_unlock_bh(&hslot->lock);
@@ -3040,9 +3047,7 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
 
 	do {
 		sk = sk_next(sk);
-	} while (sk && (!net_eq(sock_net(sk), net) ||
-			(afinfo->family != AF_UNSPEC &&
-			 sk->sk_family != afinfo->family)));
+	} while (sk && !seq_sk_match(seq, sk));
 
 	if (!sk) {
 		udptable = udp_get_table_afinfo(afinfo, net);
@@ -3205,6 +3210,21 @@ static const struct seq_operations bpf_iter_udp_seq_ops = {
 };
 #endif
 
+static unsigned short seq_file_family(const struct seq_file *seq)
+{
+	const struct udp_seq_afinfo *afinfo;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* BPF iterator: bpf programs to filter sockets. */
+	if (seq->op == &bpf_iter_udp_seq_ops)
+		return AF_UNSPEC;
+#endif
+
+	/* Proc fs iterator */
+	afinfo = pde_data(file_inode(seq->file));
+	return afinfo->family;
+}
+
 const struct seq_operations udp_seq_ops = {
 	.start		= udp_seq_start,
 	.next		= udp_seq_next,
-- 
2.34.1



* [PATCH v9 bpf-next 3/9] bpf: udp: Encapsulate logic to get udp table
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 1/9] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 2/9] udp: seq_file: Helper function to match socket attributes Aditi Ghag
@ 2023-05-19 22:51 ` Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 4/9] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state Aditi Ghag
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag, Martin KaFai Lau

This is a preparatory commit that encapsulates the logic to get the udp
table in the iterator inside udp_get_table_afinfo, and renames the
function to `udp_get_table_seq` accordingly.

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/udp.c | 35 ++++++++++++-----------------------
 1 file changed, 12 insertions(+), 23 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 71e3fef44fd5..c426ebafeb13 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2993,9 +2993,16 @@ static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
 		net_eq(sock_net(sk), seq_file_net(seq)));
 }
 
-static struct udp_table *udp_get_table_afinfo(struct udp_seq_afinfo *afinfo,
-					      struct net *net)
+static struct udp_table *udp_get_table_seq(struct seq_file *seq,
+					   struct net *net)
 {
+	const struct udp_iter_state *state = seq->private;
+	const struct udp_seq_afinfo *afinfo;
+
+	if (state->bpf_seq_afinfo)
+		return net->ipv4.udp_table;
+
+	afinfo = pde_data(file_inode(seq->file));
 	return afinfo->udp_table ? : net->ipv4.udp_table;
 }
 
@@ -3003,16 +3010,10 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 {
 	struct udp_iter_state *state = seq->private;
 	struct net *net = seq_file_net(seq);
-	struct udp_seq_afinfo *afinfo;
 	struct udp_table *udptable;
 	struct sock *sk;
 
-	if (state->bpf_seq_afinfo)
-		afinfo = state->bpf_seq_afinfo;
-	else
-		afinfo = pde_data(file_inode(seq->file));
-
-	udptable = udp_get_table_afinfo(afinfo, net);
+	udptable = udp_get_table_seq(seq, net);
 
 	for (state->bucket = start; state->bucket <= udptable->mask;
 	     ++state->bucket) {
@@ -3037,20 +3038,14 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
 {
 	struct udp_iter_state *state = seq->private;
 	struct net *net = seq_file_net(seq);
-	struct udp_seq_afinfo *afinfo;
 	struct udp_table *udptable;
 
-	if (state->bpf_seq_afinfo)
-		afinfo = state->bpf_seq_afinfo;
-	else
-		afinfo = pde_data(file_inode(seq->file));
-
 	do {
 		sk = sk_next(sk);
 	} while (sk && !seq_sk_match(seq, sk));
 
 	if (!sk) {
-		udptable = udp_get_table_afinfo(afinfo, net);
+		udptable = udp_get_table_seq(seq, net);
 
 		if (state->bucket <= udptable->mask)
 			spin_unlock_bh(&udptable->hash[state->bucket].lock);
@@ -3096,15 +3091,9 @@ EXPORT_SYMBOL(udp_seq_next);
 void udp_seq_stop(struct seq_file *seq, void *v)
 {
 	struct udp_iter_state *state = seq->private;
-	struct udp_seq_afinfo *afinfo;
 	struct udp_table *udptable;
 
-	if (state->bpf_seq_afinfo)
-		afinfo = state->bpf_seq_afinfo;
-	else
-		afinfo = pde_data(file_inode(seq->file));
-
-	udptable = udp_get_table_afinfo(afinfo, seq_file_net(seq));
+	udptable = udp_get_table_seq(seq, seq_file_net(seq));
 
 	if (state->bucket <= udptable->mask)
 		spin_unlock_bh(&udptable->hash[state->bucket].lock);
-- 
2.34.1



* [PATCH v9 bpf-next 4/9] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
                   ` (2 preceding siblings ...)
  2023-05-19 22:51 ` [PATCH v9 bpf-next 3/9] bpf: udp: Encapsulate logic to get udp table Aditi Ghag
@ 2023-05-19 22:51 ` Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator Aditi Ghag
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag, Martin KaFai Lau

This is a preparatory commit that removes the bpf_seq_afinfo field. The
field was previously shared between the proc fs and BPF UDP socket
iterators. As the follow-up commits will decouple the implementation for
the iterators, the field is no longer needed. As for the BPF socket
iterator, filtering of sockets is expected to be done in BPF programs.

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 include/net/udp.h |  1 -
 net/ipv4/udp.c    | 27 +++++++--------------------
 2 files changed, 7 insertions(+), 21 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index de4b528522bb..5cad44318d71 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -437,7 +437,6 @@ struct udp_seq_afinfo {
 struct udp_iter_state {
 	struct seq_net_private  p;
 	int			bucket;
-	struct udp_seq_afinfo	*bpf_seq_afinfo;
 };
 
 void *udp_seq_start(struct seq_file *seq, loff_t *pos);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c426ebafeb13..289ef05b5c15 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2993,14 +2993,18 @@ static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
 		net_eq(sock_net(sk), seq_file_net(seq)));
 }
 
+#ifdef CONFIG_BPF_SYSCALL
+static const struct seq_operations bpf_iter_udp_seq_ops;
+#endif
 static struct udp_table *udp_get_table_seq(struct seq_file *seq,
 					   struct net *net)
 {
-	const struct udp_iter_state *state = seq->private;
 	const struct udp_seq_afinfo *afinfo;
 
-	if (state->bpf_seq_afinfo)
+#ifdef CONFIG_BPF_SYSCALL
+	if (seq->op == &bpf_iter_udp_seq_ops)
 		return net->ipv4.udp_table;
+#endif
 
 	afinfo = pde_data(file_inode(seq->file));
 	return afinfo->udp_table ? : net->ipv4.udp_table;
@@ -3424,28 +3428,11 @@ DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 
 static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	struct udp_iter_state *st = priv_data;
-	struct udp_seq_afinfo *afinfo;
-	int ret;
-
-	afinfo = kmalloc(sizeof(*afinfo), GFP_USER | __GFP_NOWARN);
-	if (!afinfo)
-		return -ENOMEM;
-
-	afinfo->family = AF_UNSPEC;
-	afinfo->udp_table = NULL;
-	st->bpf_seq_afinfo = afinfo;
-	ret = bpf_iter_init_seq_net(priv_data, aux);
-	if (ret)
-		kfree(afinfo);
-	return ret;
+	return bpf_iter_init_seq_net(priv_data, aux);
 }
 
 static void bpf_iter_fini_udp(void *priv_data)
 {
-	struct udp_iter_state *st = priv_data;
-
-	kfree(st->bpf_seq_afinfo);
 	bpf_iter_fini_seq_net(priv_data);
 }
 
-- 
2.34.1



* [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
                   ` (3 preceding siblings ...)
  2023-05-19 22:51 ` [PATCH v9 bpf-next 4/9] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state Aditi Ghag
@ 2023-05-19 22:51 ` Aditi Ghag
  2023-09-20  0:38   ` Martin KaFai Lau
  2023-05-19 22:51 ` [PATCH v9 bpf-next 6/9] bpf: Add kfunc filter function to 'struct btf_kfunc_id_set' Aditi Ghag
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag, Martin KaFai Lau

Batch UDP sockets from the BPF iterator to allow for overlapping locking
semantics in BPF/kernel helpers executed in BPF programs. This allows the
BPF socket destroy kfunc (introduced by follow-up patches) to execute from
BPF iterator programs.

Previously, BPF iterators acquired the sock lock and the sockets hash
table bucket lock while executing BPF programs. This prevented BPF helpers
that again acquire these locks from being executed from BPF iterators.
With the batching approach, we acquire the bucket lock, batch all the
sockets in the bucket, and then release the bucket lock. This enables BPF
or kernel helpers to skip sock locking when invoked in the supported BPF
contexts.

The batching logic is similar to the logic implemented in TCP iterator:
https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/udp.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 199 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 289ef05b5c15..8fe2fd6255cc 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3150,6 +3150,143 @@ struct bpf_iter__udp {
 	int bucket __aligned(8);
 };
 
+struct bpf_udp_iter_state {
+	struct udp_iter_state state;
+	unsigned int cur_sk;
+	unsigned int end_sk;
+	unsigned int max_sk;
+	int offset;
+	struct sock **batch;
+	bool st_bucket_done;
+};
+
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz);
+static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct udp_iter_state *state = &iter->state;
+	struct net *net = seq_file_net(seq);
+	struct udp_table *udptable;
+	unsigned int batch_sks = 0;
+	bool resized = false;
+	struct sock *sk;
+
+	/* The current batch is done, so advance the bucket. */
+	if (iter->st_bucket_done) {
+		state->bucket++;
+		iter->offset = 0;
+	}
+
+	udptable = udp_get_table_seq(seq, net);
+
+again:
+	/* New batch for the next bucket.
+	 * Iterate over the hash table to find a bucket with sockets matching
+	 * the iterator attributes, and return the first matching socket from
+	 * the bucket. The remaining matched sockets from the bucket are batched
+	 * before releasing the bucket lock. This allows BPF programs that are
+	 * called in seq_show to acquire the bucket lock if needed.
+	 */
+	iter->cur_sk = 0;
+	iter->end_sk = 0;
+	iter->st_bucket_done = false;
+	batch_sks = 0;
+
+	for (; state->bucket <= udptable->mask; state->bucket++) {
+		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
+
+		if (hlist_empty(&hslot2->head)) {
+			iter->offset = 0;
+			continue;
+		}
+
+		spin_lock_bh(&hslot2->lock);
+		udp_portaddr_for_each_entry(sk, &hslot2->head) {
+			if (seq_sk_match(seq, sk)) {
+				/* Resume from the last iterated socket at the
+				 * offset in the bucket before iterator was stopped.
+				 */
+				if (iter->offset) {
+					--iter->offset;
+					continue;
+				}
+				if (iter->end_sk < iter->max_sk) {
+					sock_hold(sk);
+					iter->batch[iter->end_sk++] = sk;
+				}
+				batch_sks++;
+			}
+		}
+		spin_unlock_bh(&hslot2->lock);
+
+		if (iter->end_sk)
+			break;
+
+		/* Reset the current bucket's offset before moving to the next bucket. */
+		iter->offset = 0;
+	}
+
+	/* All done: no batch made. */
+	if (!iter->end_sk)
+		return NULL;
+
+	if (iter->end_sk == batch_sks) {
+		/* Batching is done for the current bucket; return the first
+		 * socket to be iterated from the batch.
+		 */
+		iter->st_bucket_done = true;
+		goto done;
+	}
+	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
+		resized = true;
+		/* After allocating a larger batch, retry one more time to grab
+		 * the whole bucket.
+		 */
+		state->bucket--;
+		goto again;
+	}
+done:
+	return iter->batch[0];
+}
+
+static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct sock *sk;
+
+	/* Whenever seq_next() is called, the iter->cur_sk is
+	 * done with seq_show(), so unref the iter->cur_sk.
+	 */
+	if (iter->cur_sk < iter->end_sk) {
+		sock_put(iter->batch[iter->cur_sk++]);
+		++iter->offset;
+	}
+
+	/* After updating iter->cur_sk, check if there are more sockets
+	 * available in the current bucket batch.
+	 */
+	if (iter->cur_sk < iter->end_sk)
+		sk = iter->batch[iter->cur_sk];
+	else
+		/* Prepare a new batch. */
+		sk = bpf_iter_udp_batch(seq);
+
+	++*pos;
+	return sk;
+}
+
+static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* bpf iter does not support lseek, so it always
+	 * continue from where it was stop()-ped.
+	 */
+	if (*pos)
+		return bpf_iter_udp_batch(seq);
+
+	return SEQ_START_TOKEN;
+}
+
 static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
 			     struct udp_sock *udp_sk, uid_t uid, int bucket)
 {
@@ -3170,18 +3307,37 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 	struct bpf_prog *prog;
 	struct sock *sk = v;
 	uid_t uid;
+	int ret;
 
 	if (v == SEQ_START_TOKEN)
 		return 0;
 
+	lock_sock(sk);
+
+	if (unlikely(sk_unhashed(sk))) {
+		ret = SEQ_SKIP;
+		goto unlock;
+	}
+
 	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
 	meta.seq = seq;
 	prog = bpf_iter_get_info(&meta, false);
-	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+	ret = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+
+unlock:
+	release_sock(sk);
+	return ret;
+}
+
+static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
+{
+	while (iter->cur_sk < iter->end_sk)
+		sock_put(iter->batch[iter->cur_sk++]);
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 {
+	struct bpf_udp_iter_state *iter = seq->private;
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 
@@ -3192,12 +3348,15 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
 	}
 
-	udp_seq_stop(seq, v);
+	if (iter->cur_sk < iter->end_sk) {
+		bpf_iter_udp_put_batch(iter);
+		iter->st_bucket_done = false;
+	}
 }
 
 static const struct seq_operations bpf_iter_udp_seq_ops = {
-	.start		= udp_seq_start,
-	.next		= udp_seq_next,
+	.start		= bpf_iter_udp_seq_start,
+	.next		= bpf_iter_udp_seq_next,
 	.stop		= bpf_iter_udp_seq_stop,
 	.show		= bpf_iter_udp_seq_show,
 };
@@ -3426,21 +3585,55 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
 DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 		     struct udp_sock *udp_sk, uid_t uid, int bucket)
 
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz)
+{
+	struct sock **new_batch;
+
+	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
+				   GFP_USER | __GFP_NOWARN);
+	if (!new_batch)
+		return -ENOMEM;
+
+	bpf_iter_udp_put_batch(iter);
+	kvfree(iter->batch);
+	iter->batch = new_batch;
+	iter->max_sk = new_batch_sz;
+
+	return 0;
+}
+
+#define INIT_BATCH_SZ 16
+
 static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	return bpf_iter_init_seq_net(priv_data, aux);
+	struct bpf_udp_iter_state *iter = priv_data;
+	int ret;
+
+	ret = bpf_iter_init_seq_net(priv_data, aux);
+	if (ret)
+		return ret;
+
+	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
+	if (ret)
+		bpf_iter_fini_seq_net(priv_data);
+
+	return ret;
 }
 
 static void bpf_iter_fini_udp(void *priv_data)
 {
+	struct bpf_udp_iter_state *iter = priv_data;
+
 	bpf_iter_fini_seq_net(priv_data);
+	kvfree(iter->batch);
 }
 
 static const struct bpf_iter_seq_info udp_seq_info = {
 	.seq_ops		= &bpf_iter_udp_seq_ops,
 	.init_seq_private	= bpf_iter_init_udp,
 	.fini_seq_private	= bpf_iter_fini_udp,
-	.seq_priv_size		= sizeof(struct udp_iter_state),
+	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
 };
 
 static struct bpf_iter_reg udp_reg_info = {
-- 
2.34.1



* [PATCH v9 bpf-next 6/9] bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
                   ` (4 preceding siblings ...)
  2023-05-19 22:51 ` [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator Aditi Ghag
@ 2023-05-19 22:51 ` Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 7/9] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag, Martin KaFai Lau

This commit adds the ability to filter kfuncs to certain BPF program
types. This is required to limit the bpf_sock_destroy kfunc, implemented
in follow-up commits, to programs with attach type 'BPF_TRACE_ITER'.

The commit adds a callback filter to 'struct btf_kfunc_id_set'. The
filter has access to the `bpf_prog` construct, including properties
such as `expected_attach_type`.
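
For context, a minimal sketch of how a subsystem would register a kfunc
id set together with such a filter (the kfunc and symbol names here are
illustrative; the concrete user, bpf_sock_destroy, is added in a
follow-up patch):

```c
BTF_SET8_START(example_kfunc_set)
BTF_ID_FLAGS(func, bpf_example_kfunc, KF_TRUSTED_ARGS)
BTF_SET8_END(example_kfunc_set)

/* Reject the kfuncs in the set for all but BPF_TRACE_ITER programs. */
static int example_filter(const struct bpf_prog *prog, u32 kfunc_id)
{
	if (btf_id_set8_contains(&example_kfunc_set, kfunc_id) &&
	    prog->expected_attach_type != BPF_TRACE_ITER)
		return -EACCES;
	return 0;
}

static const struct btf_kfunc_id_set example_id_set = {
	.owner  = THIS_MODULE,
	.set    = &example_kfunc_set,
	.filter = example_filter,
};

static int __init example_kfunc_init(void)
{
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &example_id_set);
}
late_initcall(example_kfunc_init);
```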

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 include/linux/btf.h   | 18 ++++++++-----
 kernel/bpf/btf.c      | 61 ++++++++++++++++++++++++++++++++++++-------
 kernel/bpf/verifier.c |  7 ++---
 3 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 495250162422..918a0b6379bd 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -99,10 +99,14 @@ struct btf_type;
 union bpf_attr;
 struct btf_show;
 struct btf_id_set;
+struct bpf_prog;
+
+typedef int (*btf_kfunc_filter_t)(const struct bpf_prog *prog, u32 kfunc_id);
 
 struct btf_kfunc_id_set {
 	struct module *owner;
 	struct btf_id_set8 *set;
+	btf_kfunc_filter_t filter;
 };
 
 struct btf_id_dtor_kfunc {
@@ -482,7 +486,6 @@ static inline void *btf_id_set8_contains(const struct btf_id_set8 *set, u32 id)
 	return bsearch(&id, set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func);
 }
 
-struct bpf_prog;
 struct bpf_verifier_log;
 
 #ifdef CONFIG_BPF_SYSCALL
@@ -490,10 +493,10 @@ const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
 const char *btf_name_by_offset(const struct btf *btf, u32 offset);
 struct btf *btf_parse_vmlinux(void);
 struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog);
-u32 *btf_kfunc_id_set_contains(const struct btf *btf,
-			       enum bpf_prog_type prog_type,
-			       u32 kfunc_btf_id);
-u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id);
+u32 *btf_kfunc_id_set_contains(const struct btf *btf, u32 kfunc_btf_id,
+			       const struct bpf_prog *prog);
+u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
+				const struct bpf_prog *prog);
 int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
 			      const struct btf_kfunc_id_set *s);
 int register_btf_fmodret_id_set(const struct btf_kfunc_id_set *kset);
@@ -520,8 +523,9 @@ static inline const char *btf_name_by_offset(const struct btf *btf,
 	return NULL;
 }
 static inline u32 *btf_kfunc_id_set_contains(const struct btf *btf,
-					     enum bpf_prog_type prog_type,
-					     u32 kfunc_btf_id)
+					     u32 kfunc_btf_id,
+					     struct bpf_prog *prog)
+
 {
 	return NULL;
 }
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 913b9d717a4a..c22325daf4a5 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -218,10 +218,17 @@ enum btf_kfunc_hook {
 enum {
 	BTF_KFUNC_SET_MAX_CNT = 256,
 	BTF_DTOR_KFUNC_MAX_CNT = 256,
+	BTF_KFUNC_FILTER_MAX_CNT = 16,
+};
+
+struct btf_kfunc_hook_filter {
+	btf_kfunc_filter_t filters[BTF_KFUNC_FILTER_MAX_CNT];
+	u32 nr_filters;
 };
 
 struct btf_kfunc_set_tab {
 	struct btf_id_set8 *sets[BTF_KFUNC_HOOK_MAX];
+	struct btf_kfunc_hook_filter hook_filters[BTF_KFUNC_HOOK_MAX];
 };
 
 struct btf_id_dtor_kfunc_tab {
@@ -7720,9 +7727,12 @@ static int btf_check_kfunc_protos(struct btf *btf, u32 func_id, u32 func_flags)
 /* Kernel Function (kfunc) BTF ID set registration API */
 
 static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
-				  struct btf_id_set8 *add_set)
+				  const struct btf_kfunc_id_set *kset)
 {
+	struct btf_kfunc_hook_filter *hook_filter;
+	struct btf_id_set8 *add_set = kset->set;
 	bool vmlinux_set = !btf_is_module(btf);
+	bool add_filter = !!kset->filter;
 	struct btf_kfunc_set_tab *tab;
 	struct btf_id_set8 *set;
 	u32 set_cnt;
@@ -7737,6 +7747,22 @@ static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
 		return 0;
 
 	tab = btf->kfunc_set_tab;
+
+	if (tab && add_filter) {
+		int i;
+
+		hook_filter = &tab->hook_filters[hook];
+		for (i = 0; i < hook_filter->nr_filters; i++) {
+			if (hook_filter->filters[i] == kset->filter) {
+				add_filter = false;
+				break;
+			}
+		}
+
+		if (add_filter && hook_filter->nr_filters == BTF_KFUNC_FILTER_MAX_CNT)
+			return -E2BIG;
+	}
+
 	if (!tab) {
 		tab = kzalloc(sizeof(*tab), GFP_KERNEL | __GFP_NOWARN);
 		if (!tab)
@@ -7759,7 +7785,7 @@ static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
 	 */
 	if (!vmlinux_set) {
 		tab->sets[hook] = add_set;
-		return 0;
+		goto do_add_filter;
 	}
 
 	/* In case of vmlinux sets, there may be more than one set being
@@ -7801,6 +7827,11 @@ static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
 
 	sort(set->pairs, set->cnt, sizeof(set->pairs[0]), btf_id_cmp_func, NULL);
 
+do_add_filter:
+	if (add_filter) {
+		hook_filter = &tab->hook_filters[hook];
+		hook_filter->filters[hook_filter->nr_filters++] = kset->filter;
+	}
 	return 0;
 end:
 	btf_free_kfunc_set_tab(btf);
@@ -7809,15 +7840,22 @@ static int btf_populate_kfunc_set(struct btf *btf, enum btf_kfunc_hook hook,
 
 static u32 *__btf_kfunc_id_set_contains(const struct btf *btf,
 					enum btf_kfunc_hook hook,
+					const struct bpf_prog *prog,
 					u32 kfunc_btf_id)
 {
+	struct btf_kfunc_hook_filter *hook_filter;
 	struct btf_id_set8 *set;
-	u32 *id;
+	u32 *id, i;
 
 	if (hook >= BTF_KFUNC_HOOK_MAX)
 		return NULL;
 	if (!btf->kfunc_set_tab)
 		return NULL;
+	hook_filter = &btf->kfunc_set_tab->hook_filters[hook];
+	for (i = 0; i < hook_filter->nr_filters; i++) {
+		if (hook_filter->filters[i](prog, kfunc_btf_id))
+			return NULL;
+	}
 	set = btf->kfunc_set_tab->sets[hook];
 	if (!set)
 		return NULL;
@@ -7870,23 +7908,25 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
  * protection for looking up a well-formed btf->kfunc_set_tab.
  */
 u32 *btf_kfunc_id_set_contains(const struct btf *btf,
-			       enum bpf_prog_type prog_type,
-			       u32 kfunc_btf_id)
+			       u32 kfunc_btf_id,
+			       const struct bpf_prog *prog)
 {
+	enum bpf_prog_type prog_type = resolve_prog_type(prog);
 	enum btf_kfunc_hook hook;
 	u32 *kfunc_flags;
 
-	kfunc_flags = __btf_kfunc_id_set_contains(btf, BTF_KFUNC_HOOK_COMMON, kfunc_btf_id);
+	kfunc_flags = __btf_kfunc_id_set_contains(btf, BTF_KFUNC_HOOK_COMMON, prog, kfunc_btf_id);
 	if (kfunc_flags)
 		return kfunc_flags;
 
 	hook = bpf_prog_type_to_kfunc_hook(prog_type);
-	return __btf_kfunc_id_set_contains(btf, hook, kfunc_btf_id);
+	return __btf_kfunc_id_set_contains(btf, hook, prog, kfunc_btf_id);
 }
 
-u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id)
+u32 *btf_kfunc_is_modify_return(const struct btf *btf, u32 kfunc_btf_id,
+				const struct bpf_prog *prog)
 {
-	return __btf_kfunc_id_set_contains(btf, BTF_KFUNC_HOOK_FMODRET, kfunc_btf_id);
+	return __btf_kfunc_id_set_contains(btf, BTF_KFUNC_HOOK_FMODRET, prog, kfunc_btf_id);
 }
 
 static int __register_btf_kfunc_id_set(enum btf_kfunc_hook hook,
@@ -7917,7 +7957,8 @@ static int __register_btf_kfunc_id_set(enum btf_kfunc_hook hook,
 			goto err_out;
 	}
 
-	ret = btf_populate_kfunc_set(btf, hook, kset->set);
+	ret = btf_populate_kfunc_set(btf, hook, kset);
+
 err_out:
 	btf_put(btf);
 	return ret;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d6db6de3e9ea..8d9519210935 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10534,7 +10534,7 @@ static int fetch_kfunc_meta(struct bpf_verifier_env *env,
 		*kfunc_name = func_name;
 	func_proto = btf_type_by_id(desc_btf, func->type);
 
-	kfunc_flags = btf_kfunc_id_set_contains(desc_btf, resolve_prog_type(env->prog), func_id);
+	kfunc_flags = btf_kfunc_id_set_contains(desc_btf, func_id, env->prog);
 	if (!kfunc_flags) {
 		return -EACCES;
 	}
@@ -18526,7 +18526,8 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
 				 * in the fmodret id set with the KF_SLEEPABLE flag.
 				 */
 				else {
-					u32 *flags = btf_kfunc_is_modify_return(btf, btf_id);
+					u32 *flags = btf_kfunc_is_modify_return(btf, btf_id,
+										prog);
 
 					if (flags && (*flags & KF_SLEEPABLE))
 						ret = 0;
@@ -18554,7 +18555,7 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
 				return -EINVAL;
 			}
 			ret = -EINVAL;
-			if (btf_kfunc_is_modify_return(btf, btf_id) ||
+			if (btf_kfunc_is_modify_return(btf, btf_id, prog) ||
 			    !check_attach_modify_return(addr, tname))
 				ret = 0;
 			if (ret) {
-- 
2.34.1



* [PATCH v9 bpf-next 7/9] bpf: Add bpf_sock_destroy kfunc
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
                   ` (5 preceding siblings ...)
  2023-05-19 22:51 ` [PATCH v9 bpf-next 6/9] bpf: Add kfunc filter function to 'struct btf_kfunc_id_set' Aditi Ghag
@ 2023-05-19 22:51 ` Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 8/9] selftests/bpf: Add helper to get port using getsockname Aditi Ghag
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag

The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium
load-balancing to terminate client sockets that continue to connect to
deleted backends.  The other use case is on-the-fly policy enforcement
where existing socket connections prevented by policies need to be
forcefully terminated.  The kfunc also allows terminating sockets that may
or may not be actively sending traffic.

The kfunc can currently be called only from BPF TCP and UDP iterators,
where users can filter and terminate selected sockets. More
specifically, it can only be called from BPF contexts that ensure
socket locking in order to allow synchronous execution of
protocol-specific `diag_destroy` handlers. The previous commit that
batches UDP sockets during iteration facilitated a synchronous
invocation of the UDP destroy callback from BPF context by skipping
socket locks in `udp_abort`. The TCP iterator already supported batching
of sockets being iterated. To that end, the `tracing_iter_filter`
callback filter is added so that the verifier can restrict the kfunc to
programs with the `BPF_TRACE_ITER` attach type, and reject other
programs.

The kfunc takes a `sock_common` type argument, even though it expects a
`sock` pointer and casts the argument to one. This enables the verifier
to allow the sock_destroy kfunc to be called for TCP with `sock_common`
and for UDP with `sock` structs. Furthermore, as `sock_common` only has
a subset of the fields of `sock`, casting a pointer to the latter type
might not always be safe for certain sockets, such as request sockets,
but these have special handling in the diag_destroy handlers.

Additionally, the kfunc is defined with the `KF_TRUSTED_ARGS` flag to
avoid cases where a `PTR_TO_BTF_ID` sk is obtained by following another
pointer, e.g. getting a sk pointer (possibly even NULL) by following
another sk pointer. The socket pointer argument passed in the TCP and
UDP iterators is tagged as `PTR_TRUSTED` in {tcp,udp}_reg_info. The
TRUSTED arg changes are contributed by Martin KaFai Lau
<martin.lau@kernel.org>.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/core/filter.c   | 63 +++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c      |  9 ++++---
 net/ipv4/tcp_ipv4.c |  2 +-
 net/ipv4/udp.c      |  8 +++---
 4 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 727c5269867d..efeac5d8f19c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11715,3 +11715,66 @@ static int __init bpf_kfunc_init(void)
 	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
 }
 late_initcall(bpf_kfunc_init);
+
+/* Disables missing prototype warnings */
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
+ *
+ * The function expects a non-NULL pointer to a socket, and invokes the
+ * protocol specific socket destroy handlers.
+ *
+ * The helper can only be called from BPF contexts that have acquired the socket
+ * locks.
+ *
+ * Parameters:
+ * @sock: Pointer to socket to be destroyed
+ *
+ * Return:
+ * On error, may return EPROTONOSUPPORT, EINVAL.
+ * EPROTONOSUPPORT if protocol specific destroy handler is not supported.
+ * 0 otherwise
+ */
+__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
+{
+	struct sock *sk = (struct sock *)sock;
+
+	/* The locking semantics that allow for synchronous execution of the
+	 * destroy handlers are only supported for TCP and UDP.
+	 * Supporting protocols will need to acquire sock lock in the BPF context
+	 * prior to invoking this kfunc.
+	 */
+	if (!sk->sk_prot->diag_destroy || (sk->sk_protocol != IPPROTO_TCP &&
+					   sk->sk_protocol != IPPROTO_UDP))
+		return -EOPNOTSUPP;
+
+	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
+}
+
+__diag_pop()
+
+BTF_SET8_START(bpf_sk_iter_check_kfunc_set)
+BTF_ID_FLAGS(func, bpf_sock_destroy, KF_TRUSTED_ARGS)
+BTF_SET8_END(bpf_sk_iter_check_kfunc_set)
+
+static int tracing_iter_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+	if (btf_id_set8_contains(&bpf_sk_iter_check_kfunc_set, kfunc_id) &&
+	    prog->expected_attach_type != BPF_TRACE_ITER)
+		return -EACCES;
+	return 0;
+}
+
+static const struct btf_kfunc_id_set bpf_sk_iter_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &bpf_sk_iter_check_kfunc_set,
+	.filter = tracing_iter_filter,
+};
+
+static int init_subsystem(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sk_iter_kfunc_set);
+}
+late_initcall(init_subsystem);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..fd41fdc09211 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4679,8 +4679,10 @@ int tcp_abort(struct sock *sk, int err)
 		return 0;
 	}
 
-	/* Don't race with userspace socket closes such as tcp_close. */
-	lock_sock(sk);
+	/* BPF context ensures sock locking. */
+	if (!has_current_bpf_ctx())
+		/* Don't race with userspace socket closes such as tcp_close. */
+		lock_sock(sk);
 
 	if (sk->sk_state == TCP_LISTEN) {
 		tcp_set_state(sk, TCP_CLOSE);
@@ -4704,7 +4706,8 @@ int tcp_abort(struct sock *sk, int err)
 	bh_unlock_sock(sk);
 	local_bh_enable();
 	tcp_write_queue_purge(sk);
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(tcp_abort);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f2d370a9450f..af75ddcbee62 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3354,7 +3354,7 @@ static struct bpf_iter_reg tcp_reg_info = {
 	.ctx_arg_info_size	= 1,
 	.ctx_arg_info		= {
 		{ offsetof(struct bpf_iter__tcp, sk_common),
-		  PTR_TO_BTF_ID_OR_NULL },
+		  PTR_TO_BTF_ID_OR_NULL | PTR_TRUSTED},
 	},
 	.get_func_proto		= bpf_iter_tcp_get_func_proto,
 	.seq_info		= &tcp_seq_info,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8fe2fd6255cc..289fbbec633e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2925,7 +2925,8 @@ EXPORT_SYMBOL(udp_poll);
 
 int udp_abort(struct sock *sk, int err)
 {
-	lock_sock(sk);
+	if (!has_current_bpf_ctx())
+		lock_sock(sk);
 
 	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
 	 * with close()
@@ -2938,7 +2939,8 @@ int udp_abort(struct sock *sk, int err)
 	__udp_disconnect(sk, 0);
 
 out:
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 
 	return 0;
 }
@@ -3641,7 +3643,7 @@ static struct bpf_iter_reg udp_reg_info = {
 	.ctx_arg_info_size	= 1,
 	.ctx_arg_info		= {
 		{ offsetof(struct bpf_iter__udp, udp_sk),
-		  PTR_TO_BTF_ID_OR_NULL },
+		  PTR_TO_BTF_ID_OR_NULL | PTR_TRUSTED },
 	},
 	.seq_info		= &udp_seq_info,
 };
-- 
2.34.1



* [PATCH v9 bpf-next 8/9] selftests/bpf: Add helper to get port using getsockname
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
                   ` (6 preceding siblings ...)
  2023-05-19 22:51 ` [PATCH v9 bpf-next 7/9] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
@ 2023-05-19 22:51 ` Aditi Ghag
  2023-05-19 22:51 ` [PATCH v9 bpf-next 9/9] selftests/bpf: Test bpf_sock_destroy Aditi Ghag
  2023-05-20  6:00 ` [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability patchwork-bot+netdevbpf
  9 siblings, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag

The helper will be used to programmatically retrieve
and pass ports in userspace and kernel selftest programs.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 tools/testing/selftests/bpf/network_helpers.c | 23 +++++++++++++++++++
 tools/testing/selftests/bpf/network_helpers.h |  1 +
 2 files changed, 24 insertions(+)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 596caa176582..a105c0cd008a 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -427,3 +427,26 @@ void close_netns(struct nstoken *token)
 	close(token->orig_netns_fd);
 	free(token);
 }
+
+int get_socket_local_port(int sock_fd)
+{
+	struct sockaddr_storage addr;
+	socklen_t addrlen = sizeof(addr);
+	int err;
+
+	err = getsockname(sock_fd, (struct sockaddr *)&addr, &addrlen);
+	if (err < 0)
+		return err;
+
+	if (addr.ss_family == AF_INET) {
+		struct sockaddr_in *sin = (struct sockaddr_in *)&addr;
+
+		return sin->sin_port;
+	} else if (addr.ss_family == AF_INET6) {
+		struct sockaddr_in6 *sin = (struct sockaddr_in6 *)&addr;
+
+		return sin->sin6_port;
+	}
+
+	return -1;
+}
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index f882c691b790..694185644da6 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -56,6 +56,7 @@ int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
 int make_sockaddr(int family, const char *addr_str, __u16 port,
 		  struct sockaddr_storage *addr, socklen_t *len);
 char *ping_command(int family);
+int get_socket_local_port(int sock_fd);
 
 struct nstoken;
 /**
-- 
2.34.1



* [PATCH v9 bpf-next 9/9] selftests/bpf: Test bpf_sock_destroy
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
                   ` (7 preceding siblings ...)
  2023-05-19 22:51 ` [PATCH v9 bpf-next 8/9] selftests/bpf: Add helper to get port using getsockname Aditi Ghag
@ 2023-05-19 22:51 ` Aditi Ghag
  2023-05-20  6:00 ` [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability patchwork-bot+netdevbpf
  9 siblings, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-05-19 22:51 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, aditi.ghag

The test cases for destroying sockets mirror the intended usages of the
bpf_sock_destroy kfunc using iterators.

The destroy helpers set the `ECONNABORTED` error code, which we validate
in the test code with client sockets. But UDP sockets have an overriding
error code from `disconnect()` called during abort, so the error code
validation is only done for TCP sockets.

The failure test cases validate that the `bpf_sock_destroy` kfunc is not
allowed from program attach types other than the BPF trace iterator, and
that such programs fail to load.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 .../selftests/bpf/prog_tests/sock_destroy.c   | 221 ++++++++++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 145 ++++++++++++
 .../bpf/progs/sock_destroy_prog_fail.c        |  22 ++
 3 files changed, 388 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
new file mode 100644
index 000000000000..b0583309a94e
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <bpf/bpf_endian.h>
+
+#include "sock_destroy_prog.skel.h"
+#include "sock_destroy_prog_fail.skel.h"
+#include "network_helpers.h"
+
+#define TEST_NS "sock_destroy_netns"
+
+static void start_iter_sockets(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+	char buf[50] = {};
+	int iter_fd, len;
+
+	link = bpf_program__attach_iter(prog, NULL);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
+		goto free_link;
+
+	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
+		;
+	ASSERT_GE(len, 0, "read");
+
+	close(iter_fd);
+
+free_link:
+	bpf_link__destroy(link);
+}
+
+static void test_tcp_client(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, accept_serv = -1, n;
+
+	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup;
+
+	accept_serv = accept(serv, NULL, NULL);
+	if (!ASSERT_GE(accept_serv, 0, "serv accept"))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_EQ(n, 1, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys connected client sockets. */
+	start_iter_sockets(skel->progs.iter_tcp6_client);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
+
+cleanup:
+	if (clien != -1)
+		close(clien);
+	if (accept_serv != -1)
+		close(accept_serv);
+	if (serv != -1)
+		close(serv);
+}
+
+static void test_tcp_server(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, accept_serv = -1, n, serv_port;
+
+	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup;
+	serv_port = get_socket_local_port(serv);
+	if (!ASSERT_GE(serv_port, 0, "get_sock_local_port"))
+		goto cleanup;
+	skel->bss->serv_port = (__be16) serv_port;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup;
+
+	accept_serv = accept(serv, NULL, NULL);
+	if (!ASSERT_GE(accept_serv, 0, "serv accept"))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_EQ(n, 1, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys server sockets. */
+	start_iter_sockets(skel->progs.iter_tcp6_server);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
+
+cleanup:
+	if (clien != -1)
+		close(clien);
+	if (accept_serv != -1)
+		close(accept_serv);
+	if (serv != -1)
+		close(serv);
+}
+
+static void test_udp_client(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 0, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_EQ(n, 1, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys sockets. */
+	start_iter_sockets(skel->progs.iter_udp6_client);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	/* UDP sockets have an overriding error code after they are disconnected,
+	 * so we don't check for ECONNABORTED error code.
+	 */
+
+cleanup:
+	if (clien != -1)
+		close(clien);
+	if (serv != -1)
+		close(serv);
+}
+
+static void test_udp_server(struct sock_destroy_prog *skel)
+{
+	int *listen_fds = NULL, n, i, serv_port;
+	unsigned int num_listens = 5;
+	char buf[1];
+
+	/* Start reuseport servers. */
+	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
+					    "::1", 0, 0, num_listens);
+	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
+		goto cleanup;
+	serv_port = get_socket_local_port(listen_fds[0]);
+	if (!ASSERT_GE(serv_port, 0, "get_sock_local_port"))
+		goto cleanup;
+	skel->bss->serv_port = (__be16) serv_port;
+
+	/* Run iterator program that destroys server sockets. */
+	start_iter_sockets(skel->progs.iter_udp6_server);
+
+	for (i = 0; i < num_listens; ++i) {
+		n = read(listen_fds[i], buf, sizeof(buf));
+		if (!ASSERT_EQ(n, -1, "read") ||
+		    !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
+			break;
+	}
+	ASSERT_EQ(i, num_listens, "server socket");
+
+cleanup:
+	free_fds(listen_fds, num_listens);
+}
+
+void test_sock_destroy(void)
+{
+	struct sock_destroy_prog *skel;
+	struct nstoken *nstoken = NULL;
+	int cgroup_fd;
+
+	skel = sock_destroy_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	cgroup_fd = test__join_cgroup("/sock_destroy");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		goto cleanup;
+
+	skel->links.sock_connect = bpf_program__attach_cgroup(
+		skel->progs.sock_connect, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
+		goto cleanup;
+
+	SYS(cleanup, "ip netns add %s", TEST_NS);
+	SYS(cleanup, "ip -net %s link set dev lo up", TEST_NS);
+
+	nstoken = open_netns(TEST_NS);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto cleanup;
+
+	if (test__start_subtest("tcp_client"))
+		test_tcp_client(skel);
+	if (test__start_subtest("tcp_server"))
+		test_tcp_server(skel);
+	if (test__start_subtest("udp_client"))
+		test_udp_client(skel);
+	if (test__start_subtest("udp_server"))
+		test_udp_server(skel);
+
+	RUN_TESTS(sock_destroy_prog_fail);
+
+cleanup:
+	if (nstoken)
+		close_netns(nstoken);
+	SYS_NOFAIL("ip netns del " TEST_NS " &> /dev/null");
+	if (cgroup_fd >= 0)
+		close(cgroup_fd);
+	sock_destroy_prog__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
new file mode 100644
index 000000000000..9e0bf7a54cec
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#include "bpf_tracing_net.h"
+
+__be16 serv_port = 0;
+
+int bpf_sock_destroy(struct sock_common *sk) __ksym;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} tcp_conn_sockets SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} udp_conn_sockets SEC(".maps");
+
+SEC("cgroup/connect6")
+int sock_connect(struct bpf_sock_addr *ctx)
+{
+	__u64 sock_cookie = 0;
+	int key = 0;
+	__u32 keyc = 0;
+
+	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
+		return 1;
+
+	sock_cookie = bpf_get_socket_cookie(ctx);
+	if (ctx->protocol == IPPROTO_TCP)
+		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
+	else if (ctx->protocol == IPPROTO_UDP)
+		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
+	else
+		return 1;
+
+	return 1;
+}
+
+SEC("iter/tcp")
+int iter_tcp6_client(struct bpf_iter__tcp *ctx)
+{
+	struct sock_common *sk_common = ctx->sk_common;
+	__u64 sock_cookie = 0;
+	__u64 *val;
+	int key = 0;
+
+	if (!sk_common)
+		return 0;
+
+	if (sk_common->skc_family != AF_INET6)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk_common);
+	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
+	if (!val)
+		return 0;
+	/* Destroy connected client sockets. */
+	if (sock_cookie == *val)
+		bpf_sock_destroy(sk_common);
+
+	return 0;
+}
+
+SEC("iter/tcp")
+int iter_tcp6_server(struct bpf_iter__tcp *ctx)
+{
+	struct sock_common *sk_common = ctx->sk_common;
+	const struct inet_connection_sock *icsk;
+	const struct inet_sock *inet;
+	struct tcp6_sock *tcp_sk;
+	__be16 srcp;
+
+	if (!sk_common)
+		return 0;
+
+	if (sk_common->skc_family != AF_INET6)
+		return 0;
+
+	tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
+	if (!tcp_sk)
+		return 0;
+
+	icsk = &tcp_sk->tcp.inet_conn;
+	inet = &icsk->icsk_inet;
+	srcp = inet->inet_sport;
+
+	/* Destroy server sockets. */
+	if (srcp == serv_port)
+		bpf_sock_destroy(sk_common);
+
+	return 0;
+}
+
+
+SEC("iter/udp")
+int iter_udp6_client(struct bpf_iter__udp *ctx)
+{
+	struct udp_sock *udp_sk = ctx->udp_sk;
+	struct sock *sk = (struct sock *) udp_sk;
+	__u64 sock_cookie = 0, *val;
+	int key = 0;
+
+	if (!sk)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk);
+	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
+	if (!val)
+		return 0;
+	/* Destroy connected client sockets. */
+	if (sock_cookie == *val)
+		bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
+SEC("iter/udp")
+int iter_udp6_server(struct bpf_iter__udp *ctx)
+{
+	struct udp_sock *udp_sk = ctx->udp_sk;
+	struct sock *sk = (struct sock *) udp_sk;
+	struct inet_sock *inet;
+	__be16 srcp;
+
+	if (!sk)
+		return 0;
+
+	inet = &udp_sk->inet;
+	srcp = inet->inet_sport;
+	if (srcp == serv_port)
+		bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c
new file mode 100644
index 000000000000..dd6850b58e25
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+int bpf_sock_destroy(struct sock_common *sk) __ksym;
+
+SEC("tp_btf/tcp_destroy_sock")
+__failure __msg("calling kernel function bpf_sock_destroy is not allowed")
+int BPF_PROG(trace_tcp_destroy_sock, struct sock *sk)
+{
+	/* should not load */
+	bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability
  2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
                   ` (8 preceding siblings ...)
  2023-05-19 22:51 ` [PATCH v9 bpf-next 9/9] selftests/bpf: Test bpf_sock_destroy Aditi Ghag
@ 2023-05-20  6:00 ` patchwork-bot+netdevbpf
  9 siblings, 0 replies; 18+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-05-20  6:00 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, sdf

Hello:

This series was applied to bpf/bpf-next.git (master)
by Martin KaFai Lau <martin.lau@kernel.org>:

On Fri, 19 May 2023 22:51:48 +0000 you wrote:
> This patch set adds the capability to destroy sockets in BPF. We plan to
> use the capability in Cilium to force client sockets to reconnect when
> their remote load-balancing backends are deleted. The other use case is
> on-the-fly policy enforcement where existing socket connections
> prevented by policies need to be terminated.
> 
> The use cases, and more details around
> the selected approach were presented at LPC 2022 -
> https://lpc.events/event/16/contributions/1358/.
> RFC discussion -
> https://lore.kernel.org/netdev/CABG=zsBEh-P4NXk23eBJw7eajB5YJeRS7oPXnTAzs=yob4EMoQ@mail.gmail.com/T/#u.
> v8 patch series -
> https://lore.kernel.org/bpf/20230517175359.527917-1-aditi.ghag@isovalent.com/
> 
> [...]

Here is the summary with links:
  - [v9,bpf-next,1/9] bpf: tcp: Avoid taking fast sock lock in iterator
    https://git.kernel.org/bpf/bpf-next/c/9378096e8a65
  - [v9,bpf-next,2/9] udp: seq_file: Helper function to match socket attributes
    https://git.kernel.org/bpf/bpf-next/c/f44b1c515833
  - [v9,bpf-next,3/9] bpf: udp: Encapsulate logic to get udp table
    https://git.kernel.org/bpf/bpf-next/c/7625d2e9741c
  - [v9,bpf-next,4/9] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
    https://git.kernel.org/bpf/bpf-next/c/e4fe1bf13e09
  - [v9,bpf-next,5/9] bpf: udp: Implement batching for sockets iterator
    https://git.kernel.org/bpf/bpf-next/c/c96dac8d369f
  - [v9,bpf-next,6/9] bpf: Add kfunc filter function to 'struct btf_kfunc_id_set'
    https://git.kernel.org/bpf/bpf-next/c/e924e80ee6a3
  - [v9,bpf-next,7/9] bpf: Add bpf_sock_destroy kfunc
    https://git.kernel.org/bpf/bpf-next/c/4ddbcb886268
  - [v9,bpf-next,8/9] selftests/bpf: Add helper to get port using getsockname
    https://git.kernel.org/bpf/bpf-next/c/176ba657e6aa
  - [v9,bpf-next,9/9] selftests/bpf: Test bpf_sock_destroy
    https://git.kernel.org/bpf/bpf-next/c/1a8bc2299f40

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator
  2023-05-19 22:51 ` [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator Aditi Ghag
@ 2023-09-20  0:38   ` Martin KaFai Lau
  2023-09-20 17:16     ` Aditi Ghag
  2023-09-25 23:34     ` Aditi Ghag
  0 siblings, 2 replies; 18+ messages in thread
From: Martin KaFai Lau @ 2023-09-20  0:38 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: sdf, Martin KaFai Lau, bpf, Network Development

On 5/19/23 3:51 PM, Aditi Ghag wrote:
> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct net *net = seq_file_net(seq);
> +	struct udp_table *udptable;
> +	unsigned int batch_sks = 0;
> +	bool resized = false;
> +	struct sock *sk;
> +
> +	/* The current batch is done, so advance the bucket. */
> +	if (iter->st_bucket_done) {
> +		state->bucket++;
> +		iter->offset = 0;
> +	}
> +
> +	udptable = udp_get_table_seq(seq, net);
> +
> +again:
> +	/* New batch for the next bucket.
> +	 * Iterate over the hash table to find a bucket with sockets matching
> +	 * the iterator attributes, and return the first matching socket from
> +	 * the bucket. The remaining matched sockets from the bucket are batched
> +	 * before releasing the bucket lock. This allows BPF programs that are
> +	 * called in seq_show to acquire the bucket lock if needed.
> +	 */
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	batch_sks = 0;
> +
> +	for (; state->bucket <= udptable->mask; state->bucket++) {
> +		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
> +
> +		if (hlist_empty(&hslot2->head)) {
> +			iter->offset = 0;
> +			continue;
> +		}
> +
> +		spin_lock_bh(&hslot2->lock);
> +		udp_portaddr_for_each_entry(sk, &hslot2->head) {
> +			if (seq_sk_match(seq, sk)) {
> +				/* Resume from the last iterated socket at the
> +				 * offset in the bucket before iterator was stopped.
> +				 */
> +				if (iter->offset) {
> +					--iter->offset;

Hi Aditi, I think this part has a bug.

When I run './test_progs -t bpf_iter/udp6' on a machine with some udp 
so_reuseport sockets, this test never finishes.

A broken case I am seeing is when a bucket has >1 socket and bpf_seq_read() 
can only get one sk at a time before it calls bpf_iter_udp_seq_stop().

I have not tried a change yet. However, from looking at the code where 
iter->offset is modified, the --iter->offset here is the most likely culprit: 
it can make backward progress within the same bucket (state->bucket). The 
other places touching iter->offset look fine.

It needs a local "int offset" variable for the zero test. Could you help to take 
a look, add (or modify) a test and fix it?
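
Something along these lines, maybe (completely untested, just to sketch the 
local-variable idea; the exact place to take the snapshot may need more 
thought):

	struct bpf_udp_iter_state *iter = seq->private;
	int offset = iter->offset;	/* local copy for the zero test */

	udp_portaddr_for_each_entry(sk, &hslot2->head) {
		if (seq_sk_match(seq, sk)) {
			/* Skip sockets already shown before the iterator
			 * was stopped, without decrementing iter->offset
			 * itself (which would move the resume point
			 * backward on the next pass over this bucket).
			 */
			if (offset) {
				--offset;
				continue;
			}
			/* batch the sk as before */
		}
	}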

The progs/bpf_iter_udp[46].c test can be used to reproduce it. The test_udp[46] 
in prog_tests/bpf_iter.c needs to be changed, though, to ensure there are 
multiple sk in the same bucket. A few so_reuseport sk should probably do.

Thanks.

> +					continue;
> +				}
> +				if (iter->end_


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator
  2023-09-20  0:38   ` Martin KaFai Lau
@ 2023-09-20 17:16     ` Aditi Ghag
  2023-09-25 23:34     ` Aditi Ghag
  1 sibling, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-09-20 17:16 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Stanislav Fomichev, Martin KaFai Lau, bpf, Network Development



> On Sep 19, 2023, at 5:38 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 5/19/23 3:51 PM, Aditi Ghag wrote:
>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>> +{
>> +	struct bpf_udp_iter_state *iter = seq->private;
>> +	struct udp_iter_state *state = &iter->state;
>> +	struct net *net = seq_file_net(seq);
>> +	struct udp_table *udptable;
>> +	unsigned int batch_sks = 0;
>> +	bool resized = false;
>> +	struct sock *sk;
>> +
>> +	/* The current batch is done, so advance the bucket. */
>> +	if (iter->st_bucket_done) {
>> +		state->bucket++;
>> +		iter->offset = 0;
>> +	}
>> +
>> +	udptable = udp_get_table_seq(seq, net);
>> +
>> +again:
>> +	/* New batch for the next bucket.
>> +	 * Iterate over the hash table to find a bucket with sockets matching
>> +	 * the iterator attributes, and return the first matching socket from
>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>> +	 * before releasing the bucket lock. This allows BPF programs that are
>> +	 * called in seq_show to acquire the bucket lock if needed.
>> +	 */
>> +	iter->cur_sk = 0;
>> +	iter->end_sk = 0;
>> +	iter->st_bucket_done = false;
>> +	batch_sks = 0;
>> +
>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>> +		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>> +
>> +		if (hlist_empty(&hslot2->head)) {
>> +			iter->offset = 0;
>> +			continue;
>> +		}
>> +
>> +		spin_lock_bh(&hslot2->lock);
>> +		udp_portaddr_for_each_entry(sk, &hslot2->head) {
>> +			if (seq_sk_match(seq, sk)) {
>> +				/* Resume from the last iterated socket at the
>> +				 * offset in the bucket before iterator was stopped.
>> +				 */
>> +				if (iter->offset) {
>> +					--iter->offset;
> 
> Hi Aditi, I think this part has a bug.
> 
> When I run './test_progs -t bpf_iter/udp6' in a machine with some udp so_reuseport sockets, this test is never finished.
> 
> A broken case I am seeing is when the bucket has >1 sockets and bpf_seq_read() can only get one sk at a time before it calls bpf_iter_udp_seq_stop().
> 
> I did not try the change yet. However, from looking at the code where iter->offset is changed, --iter->offset here is the most likely culprit and it will make backward progress for the same bucket (state->bucket). Other places touching iter->offset look fine.
> 
> It needs a local "int offset" variable for the zero test. Could you help to take a look, add (or modify) a test and fix it?
> 
> The progs/bpf_iter_udp[46].c test can be used to reproduce. The test_udp[46] in prog_tests/bpf_iter.c needs to be changed though to ensure there is multiple sk in the same bucket. Probably a few so_reuseport sk should do.

Hi Martin,

Thanks for the report. I'll take a look.


> 
> Thanks.
> 
>> +					continue;
>> +				}
>> +				if (iter->end_
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator
  2023-09-20  0:38   ` Martin KaFai Lau
  2023-09-20 17:16     ` Aditi Ghag
@ 2023-09-25 23:34     ` Aditi Ghag
  2023-09-26  5:02       ` Martin KaFai Lau
  2023-09-26  5:07       ` Martin KaFai Lau
  1 sibling, 2 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-09-25 23:34 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: sdf, Martin KaFai Lau, bpf, Network Development



> On Sep 19, 2023, at 5:38 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 5/19/23 3:51 PM, Aditi Ghag wrote:
>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>> +{
>> +	struct bpf_udp_iter_state *iter = seq->private;
>> +	struct udp_iter_state *state = &iter->state;
>> +	struct net *net = seq_file_net(seq);
>> +	struct udp_table *udptable;
>> +	unsigned int batch_sks = 0;
>> +	bool resized = false;
>> +	struct sock *sk;
>> +
>> +	/* The current batch is done, so advance the bucket. */
>> +	if (iter->st_bucket_done) {
>> +		state->bucket++;
>> +		iter->offset = 0;
>> +	}
>> +
>> +	udptable = udp_get_table_seq(seq, net);
>> +
>> +again:
>> +	/* New batch for the next bucket.
>> +	 * Iterate over the hash table to find a bucket with sockets matching
>> +	 * the iterator attributes, and return the first matching socket from
>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>> +	 * before releasing the bucket lock. This allows BPF programs that are
>> +	 * called in seq_show to acquire the bucket lock if needed.
>> +	 */
>> +	iter->cur_sk = 0;
>> +	iter->end_sk = 0;
>> +	iter->st_bucket_done = false;
>> +	batch_sks = 0;
>> +
>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>> +		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>> +
>> +		if (hlist_empty(&hslot2->head)) {
>> +			iter->offset = 0;
>> +			continue;
>> +		}
>> +
>> +		spin_lock_bh(&hslot2->lock);
>> +		udp_portaddr_for_each_entry(sk, &hslot2->head) {
>> +			if (seq_sk_match(seq, sk)) {
>> +				/* Resume from the last iterated socket at the
>> +				 * offset in the bucket before iterator was stopped.
>> +				 */
>> +				if (iter->offset) {
>> +					--iter->offset;
> 
> Hi Aditi, I think this part has a bug.
> 
> When I run './test_progs -t bpf_iter/udp6' in a machine with some udp so_reuseport sockets, this test is never finished.
> 
> A broken case I am seeing is when the bucket has >1 sockets and bpf_seq_read() can only get one sk at a time before it calls bpf_iter_udp_seq_stop().

Just so that I understand the broken case better, are you doing something in your BPF iterator program so that "bpf_seq_read() can only get one sk at a time"? 

> 
> I did not try the change yet. However, from looking at the code where iter->offset is changed, --iter->offset here is the most likely culprit and it will make backward progress for the same bucket (state->bucket). Other places touching iter->offset look fine.
> 
> It needs a local "int offset" variable for the zero test. Could you help to take a look, add (or modify) a test and fix it?
> 
> The progs/bpf_iter_udp[46].c test can be used to reproduce. The test_udp[46] in prog_tests/bpf_iter.c needs to be changed though to ensure there is multiple sk in the same bucket. Probably a few so_reuseport sk should do.


The sock_destroy patch set added a test with multiple so_reuseport sks in a bucket in order to exercise batching [1]. I was wondering whether extending the test with an additional bucket would do it, or whether more cases (hence the clarification asked for above) are required to reproduce the issue.


[1] https://elixir.bootlin.com/linux/v6.5/source/tools/testing/selftests/bpf/prog_tests/sock_destroy.c#L146

> 
> Thanks.
> 
>> +					continue;
>> +				}
>> +				if (iter->end_
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator
  2023-09-25 23:34     ` Aditi Ghag
@ 2023-09-26  5:02       ` Martin KaFai Lau
  2023-09-26 16:07         ` Aditi Ghag
  2023-09-26  5:07       ` Martin KaFai Lau
  1 sibling, 1 reply; 18+ messages in thread
From: Martin KaFai Lau @ 2023-09-26  5:02 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: sdf, Martin KaFai Lau, bpf, Network Development

On 9/25/23 4:34 PM, Aditi Ghag wrote:
> 
> 
>> On Sep 19, 2023, at 5:38 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 5/19/23 3:51 PM, Aditi Ghag wrote:
>>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>> +{
>>> +	struct bpf_udp_iter_state *iter = seq->private;
>>> +	struct udp_iter_state *state = &iter->state;
>>> +	struct net *net = seq_file_net(seq);
>>> +	struct udp_table *udptable;
>>> +	unsigned int batch_sks = 0;
>>> +	bool resized = false;
>>> +	struct sock *sk;
>>> +
>>> +	/* The current batch is done, so advance the bucket. */
>>> +	if (iter->st_bucket_done) {
>>> +		state->bucket++;
>>> +		iter->offset = 0;
>>> +	}
>>> +
>>> +	udptable = udp_get_table_seq(seq, net);
>>> +
>>> +again:
>>> +	/* New batch for the next bucket.
>>> +	 * Iterate over the hash table to find a bucket with sockets matching
>>> +	 * the iterator attributes, and return the first matching socket from
>>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>>> +	 * before releasing the bucket lock. This allows BPF programs that are
>>> +	 * called in seq_show to acquire the bucket lock if needed.
>>> +	 */
>>> +	iter->cur_sk = 0;
>>> +	iter->end_sk = 0;
>>> +	iter->st_bucket_done = false;
>>> +	batch_sks = 0;
>>> +
>>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>>> +		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>> +
>>> +		if (hlist_empty(&hslot2->head)) {
>>> +			iter->offset = 0;
>>> +			continue;
>>> +		}
>>> +
>>> +		spin_lock_bh(&hslot2->lock);
>>> +		udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>> +			if (seq_sk_match(seq, sk)) {
>>> +				/* Resume from the last iterated socket at the
>>> +				 * offset in the bucket before iterator was stopped.
>>> +				 */
>>> +				if (iter->offset) {
>>> +					--iter->offset;
>>
>> Hi Aditi, I think this part has a bug.
>>
>> When I run './test_progs -t bpf_iter/udp6' in a machine with some udp so_reuseport sockets, this test is never finished.
>>
>> A broken case I am seeing is when the bucket has >1 sockets and bpf_seq_read() can only get one sk at a time before it calls bpf_iter_udp_seq_stop().
> 
> Just so that I understand the broken case better, are you doing something in your BPF iterator program so that "bpf_seq_read() can only get one sk at a time"?
> 
>>
>> I did not try the change yet. However, from looking at the code where iter->offset is changed, --iter->offset here is the most likely culprit and it will make backward progress for the same bucket (state->bucket). Other places touching iter->offset look fine.
>>
>> It needs a local "int offset" variable for the zero test. Could you help to take a look, add (or modify) a test and fix it?
>>
>> The progs/bpf_iter_udp[46].c test can be used to reproduce. The test_udp[46] in prog_tests/bpf_iter.c needs to be changed though to ensure there is multiple sk in the same bucket. Probably a few so_reuseport sk should do.
> 
> 
> The sock_destroy patch set had added a test with multiple so_reuseport sks in a bucket in order to exercise batching [1]. I was wondering if extending the test with an additional bucket should do it, or some more cases are required (asked for clarification above) to reproduce the issue.

The number of buckets should not matter. It only needs one bucket with 
multiple sockets.

While trying to understand how this issue was missed, I did notice that 
test_udp_server() has 5 so_reuseport udp sk in the same bucket, which is 
enough on the hashtable side. This is the easier part, and one 
start_reuseport_server() call will do. Having multiple sk in a bucket is not 
enough to reproduce it, though.

The bpf prog 'iter_udp6_server' in the sock_destroy test is not doing 
bpf_seq_printf(). bpf_seq_printf() is necessary to reproduce the issue. The 
read() buf from the userspace program side also needs to be small. It needs to 
hit the "if (seq->count >= size) break;" condition in the "while (1)" loop in 
the kernel/bpf/bpf_iter.c.

You can try to add both to the sock_destroy test. I was suggesting the 
bpf_iter/udp[46] tests instead (i.e. the test_udp[46] functions) because the 
bpf_seq_printf and the buf[] size there are already aligned to reproduce the 
problem. Try adding a start_reuseport_server(..., 5) call to the beginning of 
test_udp6() in prog_tests/bpf_iter.c to ensure there are multiple udp sk in a 
bucket. That should be enough to reproduce it.
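
Roughly like this at the top of test_udp6() (untested sketch; the address, 
port, and error handling are only illustrative):

	int *reuse_fds;

	/* A handful of so_reuseport udp sockets hash into the same
	 * bucket, so the iterator has to batch more than one sk there.
	 */
	reuse_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM, "::1",
					   0, 0, 5);
	if (!ASSERT_OK_PTR(reuse_fds, "start_reuseport_server"))
		return;

	/* ... run the existing bpf_iter udp6 dump as before ... */

	free_fds(reuse_fds, 5);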

In the final fix, I don't have strong preference on where the test should be.
Modifying one of the two existing tests (i.e. sock_destroy or bpf_iter) or a 
completely new test.

Let me know if you have problem reproducing it. Thanks.

> 
> 
> [1] https://elixir.bootlin.com/linux/v6.5/source/tools/testing/selftests/bpf/prog_tests/sock_destroy.c#L146
> 
>>
>> Thanks.
>>
>>> +					continue;
>>> +				}
>>> +				if (iter->end_
>>
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator
  2023-09-25 23:34     ` Aditi Ghag
  2023-09-26  5:02       ` Martin KaFai Lau
@ 2023-09-26  5:07       ` Martin KaFai Lau
  1 sibling, 0 replies; 18+ messages in thread
From: Martin KaFai Lau @ 2023-09-26  5:07 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: sdf, Martin KaFai Lau, bpf, Network Development

On 9/25/23 4:34 PM, Aditi Ghag wrote:
> Just so that I understand the broken case better, are you doing something in your BPF iterator program so that "bpf_seq_read() can only get one sk at a time"?

ah, hit send too early.

Yes, bpf_seq_printf(). That is why I was suggesting using the bpf_iter/udp[46] 
tests to reproduce it by adding start_reuseport_server(). Please see my 
earlier reply.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator
  2023-09-26  5:02       ` Martin KaFai Lau
@ 2023-09-26 16:07         ` Aditi Ghag
  2023-10-24 22:50           ` Aditi Ghag
  0 siblings, 1 reply; 18+ messages in thread
From: Aditi Ghag @ 2023-09-26 16:07 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: sdf, Martin KaFai Lau, bpf, Network Development



> On Sep 25, 2023, at 10:02 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 9/25/23 4:34 PM, Aditi Ghag wrote:
>>> On Sep 19, 2023, at 5:38 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>> 
>>> On 5/19/23 3:51 PM, Aditi Ghag wrote:
>>>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>> +{
>>>> +	struct bpf_udp_iter_state *iter = seq->private;
>>>> +	struct udp_iter_state *state = &iter->state;
>>>> +	struct net *net = seq_file_net(seq);
>>>> +	struct udp_table *udptable;
>>>> +	unsigned int batch_sks = 0;
>>>> +	bool resized = false;
>>>> +	struct sock *sk;
>>>> +
>>>> +	/* The current batch is done, so advance the bucket. */
>>>> +	if (iter->st_bucket_done) {
>>>> +		state->bucket++;
>>>> +		iter->offset = 0;
>>>> +	}
>>>> +
>>>> +	udptable = udp_get_table_seq(seq, net);
>>>> +
>>>> +again:
>>>> +	/* New batch for the next bucket.
>>>> +	 * Iterate over the hash table to find a bucket with sockets matching
>>>> +	 * the iterator attributes, and return the first matching socket from
>>>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>>>> +	 * before releasing the bucket lock. This allows BPF programs that are
>>>> +	 * called in seq_show to acquire the bucket lock if needed.
>>>> +	 */
>>>> +	iter->cur_sk = 0;
>>>> +	iter->end_sk = 0;
>>>> +	iter->st_bucket_done = false;
>>>> +	batch_sks = 0;
>>>> +
>>>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>>>> +		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>> +
>>>> +		if (hlist_empty(&hslot2->head)) {
>>>> +			iter->offset = 0;
>>>> +			continue;
>>>> +		}
>>>> +
>>>> +		spin_lock_bh(&hslot2->lock);
>>>> +		udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>>> +			if (seq_sk_match(seq, sk)) {
>>>> +				/* Resume from the last iterated socket at the
>>>> +				 * offset in the bucket before iterator was stopped.
>>>> +				 */
>>>> +				if (iter->offset) {
>>>> +					--iter->offset;
>>> 
>>> Hi Aditi, I think this part has a bug.
>>> 
>>> When I run './test_progs -t bpf_iter/udp6' in a machine with some udp so_reuseport sockets, this test is never finished.
>>> 
>>> A broken case I am seeing is when the bucket has >1 sockets and bpf_seq_read() can only get one sk at a time before it calls bpf_iter_udp_seq_stop().
>> Just so that I understand the broken case better, are you doing something in your BPF iterator program so that "bpf_seq_read() can only get one sk at a time"?
>>> 
>>> I did not try the change yet. However, from looking at the code where iter->offset is changed, --iter->offset here is the most likely culprit and it will make backward progress for the same bucket (state->bucket). Other places touching iter->offset look fine.
>>> 
>>> It needs a local "int offset" variable for the zero test. Could you help to take a look, add (or modify) a test and fix it?
>>> 
>>> The progs/bpf_iter_udp[46].c test can be used to reproduce. The test_udp[46] in prog_tests/bpf_iter.c needs to be changed though to ensure there is multiple sk in the same bucket. Probably a few so_reuseport sk should do.
>> The sock_destroy patch set had added a test with multiple so_reuseport sks in a bucket in order to exercise batching [1]. I was wondering if extending the test with an additional bucket should do it, or some more cases are required (asked for clarification above) to reproduce the issue.
> 
> Number of bucket should not matter. It should only need a bucket to have multiple sockets.
> 
> I did notice test_udp_server() has 5 so_reuseport udp sk in the same bucket when trying to understand how this issue was missed. It is enough on the hashtable side. This is the easier part and one start_reuseport_server() call will do. Having multiple sk in a bucket is not enough to reprod though.
> 
> The bpf prog 'iter_udp6_server' in the sock_destroy test is not doing bpf_seq_printf(). bpf_seq_printf() is necessary to reproduce the issue. The read() buf from the userspace program side also needs to be small. It needs to hit the "if (seq->count >= size) break;" condition in the "while (1)" loop in the kernel/bpf/bpf_iter.c.
> 
> You can try to add both to the sock_destroy test. I was suggesting bpf_iter/udp[46] test instead (i.e. the test_udp[46] function) because the bpf_seq_printf and the buf[] size are all aligned to reprod the problem already.  Try to add a start_reuseport_server(..., 5) to the beginning of test_udp6() in prog_tests/bpf_iter.c to ensure there is multiple udp sk in a bucket. It should be enough to reprod.


Gotcha! I think I understand the repro steps. The offset field in question was added for this scenario, where an iterator is stopped and resumed, which the sock_destroy test cases don't entirely exercise.
Thanks! 

> 
> In the final fix, I don't have strong preference on where the test should be.
> Modifying one of the two existing tests (i.e. sock_destroy or bpf_iter) or a completely new test.
> 
> Let me know if you have problem reproducing it. Thanks.
> 
>> [1] https://elixir.bootlin.com/linux/v6.5/source/tools/testing/selftests/bpf/prog_tests/sock_destroy.c#L146
>>> 
>>> Thanks.
>>> 
>>>> +					continue;
>>>> +				}
>>>> +				if (iter->end_


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator
  2023-09-26 16:07         ` Aditi Ghag
@ 2023-10-24 22:50           ` Aditi Ghag
  0 siblings, 0 replies; 18+ messages in thread
From: Aditi Ghag @ 2023-10-24 22:50 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Stanislav Fomichev, Martin KaFai Lau, bpf, Network Development



> On Sep 26, 2023, at 9:07 AM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> 
> 
> 
>> On Sep 25, 2023, at 10:02 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>> 
>> On 9/25/23 4:34 PM, Aditi Ghag wrote:
>>>> On Sep 19, 2023, at 5:38 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>> 
>>>> On 5/19/23 3:51 PM, Aditi Ghag wrote:
>>>>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>> +{
>>>>> +	struct bpf_udp_iter_state *iter = seq->private;
>>>>> +	struct udp_iter_state *state = &iter->state;
>>>>> +	struct net *net = seq_file_net(seq);
>>>>> +	struct udp_table *udptable;
>>>>> +	unsigned int batch_sks = 0;
>>>>> +	bool resized = false;
>>>>> +	struct sock *sk;
>>>>> +
>>>>> +	/* The current batch is done, so advance the bucket. */
>>>>> +	if (iter->st_bucket_done) {
>>>>> +		state->bucket++;
>>>>> +		iter->offset = 0;
>>>>> +	}
>>>>> +
>>>>> +	udptable = udp_get_table_seq(seq, net);
>>>>> +
>>>>> +again:
>>>>> +	/* New batch for the next bucket.
>>>>> +	 * Iterate over the hash table to find a bucket with sockets matching
>>>>> +	 * the iterator attributes, and return the first matching socket from
>>>>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>>>>> +	 * before releasing the bucket lock. This allows BPF programs that are
>>>>> +	 * called in seq_show to acquire the bucket lock if needed.
>>>>> +	 */
>>>>> +	iter->cur_sk = 0;
>>>>> +	iter->end_sk = 0;
>>>>> +	iter->st_bucket_done = false;
>>>>> +	batch_sks = 0;
>>>>> +
>>>>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>>>>> +		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>>> +
>>>>> +		if (hlist_empty(&hslot2->head)) {
>>>>> +			iter->offset = 0;
>>>>> +			continue;
>>>>> +		}
>>>>> +
>>>>> +		spin_lock_bh(&hslot2->lock);
>>>>> +		udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>>>> +			if (seq_sk_match(seq, sk)) {
>>>>> +				/* Resume from the last iterated socket at the
>>>>> +				 * offset in the bucket before iterator was stopped.
>>>>> +				 */
>>>>> +				if (iter->offset) {
>>>>> +					--iter->offset;
>>>> 
>>>> Hi Aditi, I think this part has a bug.
>>>> 
>>>> When I run './test_progs -t bpf_iter/udp6' in a machine with some udp so_reuseport sockets, this test is never finished.
>>>> 
>>>> A broken case I am seeing is when the bucket has >1 sockets and bpf_seq_read() can only get one sk at a time before it calls bpf_iter_udp_seq_stop().
>>> Just so that I understand the broken case better, are you doing something in your BPF iterator program so that "bpf_seq_read() can only get one sk at a time"?
>>>> 
>>>> I did not try the change yet. However, from looking at the code where iter->offset is changed, --iter->offset here is the most likely culprit and it will make backward progress for the same bucket (state->bucket). Other places touching iter->offset look fine.
>>>> 
>>>> It needs a local "int offset" variable for the zero test. Could you help to take a look, add (or modify) a test and fix it?
>>>> 
>>>> The progs/bpf_iter_udp[46].c test can be used to reproduce. The test_udp[46] in prog_tests/bpf_iter.c needs to be changed though to ensure there is multiple sk in the same bucket. Probably a few so_reuseport sk should do.
>>> The sock_destroy patch set had added a test with multiple so_reuseport sks in a bucket in order to exercise batching [1]. I was wondering if extending the test with an additional bucket should do it, or some more cases are required (asked for clarification above) to reproduce the issue.
>> 
>> Number of bucket should not matter. It should only need a bucket to have multiple sockets.
>> 
>> I did notice test_udp_server() has 5 so_reuseport udp sk in the same bucket when trying to understand how this issue was missed. It is enough on the hashtable side. This is the easier part and one start_reuseport_server() call will do. Having multiple sk in a bucket is not enough to reprod though.
>> 
>> The bpf prog 'iter_udp6_server' in the sock_destroy test is not doing bpf_seq_printf(). bpf_seq_printf() is necessary to reproduce the issue. The read() buf from the userspace program side also needs to be small. It needs to hit the "if (seq->count >= size) break;" condition in the "while (1)" loop in the kernel/bpf/bpf_iter.c.
>> 
>> You can try to add both to the sock_destroy test. I was suggesting bpf_iter/udp[46] test instead (i.e. the test_udp[46] function) because the bpf_seq_printf and the buf[] size are all aligned to reprod the problem already.  Try to add a start_reuseport_server(..., 5) to the beginning of test_udp6() in prog_tests/bpf_iter.c to ensure there is multiple udp sk in a bucket. It should be enough to reprod.
> 
> 
> Gotcha! I think I understand the repro steps. The offset field in question was added for this scenario where an iterator is stopped and resumed that the sock_destroy test cases don't entirely exercise. 
> Thanks! 


Just a small update: I was able to reproduce the issue (the so_reuseport test hanging) by modifying the read buffer in the existing sock_destroy test (test_udp_server). I have a fix and have verified that the test no longer hangs with it applied. Will send a patch soon.


> 
>> 
>> In the final fix, I don't have strong preference on where the test should be.
>> Modifying one of the two existing tests (i.e. sock_destroy or bpf_iter) or a completely new test.
>> 
>> Let me know if you have problem reproducing it. Thanks.
>> 
>>> [1] https://elixir.bootlin.com/linux/v6.5/source/tools/testing/selftests/bpf/prog_tests/sock_destroy.c#L146
>>>> 
>>>> Thanks.
>>>> 
>>>>> +					continue;
>>>>> +				}
>>>>> +				if (iter->end_


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-10-24 22:50 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-19 22:51 [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability Aditi Ghag
2023-05-19 22:51 ` [PATCH v9 bpf-next 1/9] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
2023-05-19 22:51 ` [PATCH v9 bpf-next 2/9] udp: seq_file: Helper function to match socket attributes Aditi Ghag
2023-05-19 22:51 ` [PATCH v9 bpf-next 3/9] bpf: udp: Encapsulate logic to get udp table Aditi Ghag
2023-05-19 22:51 ` [PATCH v9 bpf-next 4/9] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state Aditi Ghag
2023-05-19 22:51 ` [PATCH v9 bpf-next 5/9] bpf: udp: Implement batching for sockets iterator Aditi Ghag
2023-09-20  0:38   ` Martin KaFai Lau
2023-09-20 17:16     ` Aditi Ghag
2023-09-25 23:34     ` Aditi Ghag
2023-09-26  5:02       ` Martin KaFai Lau
2023-09-26 16:07         ` Aditi Ghag
2023-10-24 22:50           ` Aditi Ghag
2023-09-26  5:07       ` Martin KaFai Lau
2023-05-19 22:51 ` [PATCH v9 bpf-next 6/9] bpf: Add kfunc filter function to 'struct btf_kfunc_id_set' Aditi Ghag
2023-05-19 22:51 ` [PATCH v9 bpf-next 7/9] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
2023-05-19 22:51 ` [PATCH v9 bpf-next 8/9] selftests/bpf: Add helper to get port using getsockname Aditi Ghag
2023-05-19 22:51 ` [PATCH v9 bpf-next 9/9] selftests/bpf: Test bpf_sock_destroy Aditi Ghag
2023-05-20  6:00 ` [PATCH v9 bpf-next 0/9] bpf: Add socket destroy capability patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).