bpf.vger.kernel.org archive mirror
* [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability
@ 2023-04-18 15:31 Aditi Ghag
  2023-04-18 15:31 ` [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-04-18 15:31 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

This patch series adds the capability to destroy sockets in BPF. We plan
to use the capability in Cilium to force client sockets to reconnect when
their remote load-balancing backends are deleted. The other use case is
on-the-fly policy enforcement, where existing socket connections that are
disallowed by policies need to be terminated.

The use cases and more details about the selected approach were presented
at LPC 2022:
https://lpc.events/event/16/contributions/1358/
RFC discussion:
https://lore.kernel.org/netdev/CABG=zsBEh-P4NXk23eBJw7eajB5YJeRS7oPXnTAzs=yob4EMoQ@mail.gmail.com/T/#u.
v5 patch series:
https://lore.kernel.org/bpf/20230330151758.531170-1-aditi.ghag@isovalent.com/

v6 highlights:
Address review comments:
Martin:
- Simplified UDP batching iterator logic.
- Run tests in a separate netns.
Stan:
- Make destroy handler checks opt-in.
- Extended network helper to get socket port for v4.
kernel test robot:
- Updated seq_sk_match and seq_file_family with the necessary ifdefs.

(The notes below are carried over from the v5 patch series and are still
relevant. Refer to the earlier patch series for other notes.)
- I hit a snag while writing the kfunc: the verifier complained about the
  `sock_common` type passed from the TCP iterator. With kfuncs, there don't
  seem to be any options available to pass BTF type hints to the verifier
  (equivalent to `ARG_PTR_TO_BTF_ID_SOCK_COMMON`, as was the case with the
  helper). As a result, I changed the argument type of the sock_destroy
  kfunc to `sock_common`.
- Martin's patch restricts the helper to only BPF iterators -
  https://lore.kernel.org/bpf/20230404060959.2259448-1-martin.lau@linux.dev/.
  Applied and tested the patch locally.


Aditi Ghag (7):
  bpf: tcp: Avoid taking fast sock lock in iterator
  udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
  udp: seq_file: Helper function to match socket attributes
  bpf: udp: Implement batching for sockets iterator
  bpf: Add bpf_sock_destroy kfunc
  selftests/bpf: Add helper to get port using getsockname
  selftests/bpf: Test bpf_sock_destroy

 include/net/udp.h                             |   1 -
 net/core/filter.c                             |  57 ++++
 net/ipv4/tcp.c                                |  10 +-
 net/ipv4/tcp_ipv4.c                           |   5 +-
 net/ipv4/udp.c                                | 269 +++++++++++++++---
 tools/testing/selftests/bpf/network_helpers.c |  28 ++
 tools/testing/selftests/bpf/network_helpers.h |   1 +
 .../selftests/bpf/prog_tests/sock_destroy.c   | 217 ++++++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 147 ++++++++++
 9 files changed, 690 insertions(+), 45 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator
  2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
@ 2023-04-18 15:31 ` Aditi Ghag
  2023-04-20  8:55   ` Paolo Abeni
  2023-04-25  5:45   ` Yonghong Song
  2023-04-18 15:31 ` [PATCH 2/7] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state Aditi Ghag
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-04-18 15:31 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

Previously, the BPF TCP iterator acquired the fast version of the sock
lock, which disables BHs. This introduced a circular dependency with code
paths that later acquire the sockets hash table bucket lock.
Replace the fast version of the sock lock with the slow one, which allows
BPF programs executed from the iterator to destroy TCP listening sockets
using the bpf_sock_destroy kfunc (implemented in follow-up commits).

Here is a stack trace that motivated this change:

```
1) sock_lock with BH disabled + bucket lock

lock_acquire+0xcd/0x330
_raw_spin_lock_bh+0x38/0x50
inet_unhash+0x96/0xd0
tcp_set_state+0x6a/0x210
tcp_abort+0x12b/0x230
bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
bpf_iter_run_prog+0x1ff/0x340
bpf_iter_tcp_seq_show+0xca/0x190
bpf_seq_read+0x177/0x450
vfs_read+0xc6/0x300
ksys_read+0x69/0xf0
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

2) sock lock with BH enabled

[    1.499968]   lock_acquire+0xcd/0x330
[    1.500316]   _raw_spin_lock+0x33/0x40
[    1.500670]   sk_clone_lock+0x146/0x520
[    1.501030]   inet_csk_clone_lock+0x1b/0x110
[    1.501433]   tcp_create_openreq_child+0x22/0x3f0
[    1.501873]   tcp_v6_syn_recv_sock+0x96/0x940
[    1.502284]   tcp_check_req+0x137/0x660
[    1.502646]   tcp_v6_rcv+0xa63/0xe80
[    1.502994]   ip6_protocol_deliver_rcu+0x78/0x590
[    1.503434]   ip6_input_finish+0x72/0x140
[    1.503818]   __netif_receive_skb_one_core+0x63/0xa0
[    1.504281]   process_backlog+0x79/0x260
[    1.504668]   __napi_poll.constprop.0+0x27/0x170
[    1.505104]   net_rx_action+0x14a/0x2a0
[    1.505469]   __do_softirq+0x165/0x510
[    1.505842]   do_softirq+0xcd/0x100
[    1.506172]   __local_bh_enable_ip+0xcc/0xf0
[    1.506588]   ip6_finish_output2+0x2a8/0xb00
[    1.506988]   ip6_finish_output+0x274/0x510
[    1.507377]   ip6_xmit+0x319/0x9b0
[    1.507726]   inet6_csk_xmit+0x12b/0x2b0
[    1.508096]   __tcp_transmit_skb+0x549/0xc40
[    1.508498]   tcp_rcv_state_process+0x362/0x1180

```

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/tcp_ipv4.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea370afa70ed..f2d370a9450f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2962,7 +2962,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 	struct sock *sk = v;
-	bool slow;
 	uid_t uid;
 	int ret;
 
@@ -2970,7 +2969,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 		return 0;
 
 	if (sk_fullsock(sk))
-		slow = lock_sock_fast(sk);
+		lock_sock(sk);
 
 	if (unlikely(sk_unhashed(sk))) {
 		ret = SEQ_SKIP;
@@ -2994,7 +2993,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 
 unlock:
 	if (sk_fullsock(sk))
-		unlock_sock_fast(sk, slow);
+		release_sock(sk);
 	return ret;
 
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/7] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
  2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
  2023-04-18 15:31 ` [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
@ 2023-04-18 15:31 ` Aditi Ghag
  2023-04-24  0:18   ` Martin KaFai Lau
  2023-04-18 15:31 ` [PATCH 3/7] udp: seq_file: Helper function to match socket attributes Aditi Ghag
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Aditi Ghag @ 2023-04-18 15:31 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag, Martin KaFai Lau

This is a preparatory commit that removes the bpf_seq_afinfo field. The
field was previously shared between the proc fs and BPF UDP socket
iterators. As the follow-up commits will decouple the implementations of
the two iterators, the field is no longer needed. For the BPF socket
iterator, filtering of sockets is expected to be done in BPF programs.

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 include/net/udp.h |  1 -
 net/ipv4/udp.c    | 34 ++++------------------------------
 2 files changed, 4 insertions(+), 31 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index de4b528522bb..5cad44318d71 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -437,7 +437,6 @@ struct udp_seq_afinfo {
 struct udp_iter_state {
 	struct seq_net_private  p;
 	int			bucket;
-	struct udp_seq_afinfo	*bpf_seq_afinfo;
 };
 
 void *udp_seq_start(struct seq_file *seq, loff_t *pos);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..3c9eeee28678 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2997,10 +2997,7 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 	struct udp_table *udptable;
 	struct sock *sk;
 
-	if (state->bpf_seq_afinfo)
-		afinfo = state->bpf_seq_afinfo;
-	else
-		afinfo = pde_data(file_inode(seq->file));
+	afinfo = pde_data(file_inode(seq->file));
 
 	udptable = udp_get_table_afinfo(afinfo, net);
 
@@ -3033,10 +3030,7 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
 	struct udp_seq_afinfo *afinfo;
 	struct udp_table *udptable;
 
-	if (state->bpf_seq_afinfo)
-		afinfo = state->bpf_seq_afinfo;
-	else
-		afinfo = pde_data(file_inode(seq->file));
+	afinfo = pde_data(file_inode(seq->file));
 
 	do {
 		sk = sk_next(sk);
@@ -3094,10 +3088,7 @@ void udp_seq_stop(struct seq_file *seq, void *v)
 	struct udp_seq_afinfo *afinfo;
 	struct udp_table *udptable;
 
-	if (state->bpf_seq_afinfo)
-		afinfo = state->bpf_seq_afinfo;
-	else
-		afinfo = pde_data(file_inode(seq->file));
+	afinfo = pde_data(file_inode(seq->file));
 
 	udptable = udp_get_table_afinfo(afinfo, seq_file_net(seq));
 
@@ -3415,28 +3406,11 @@ DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 
 static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	struct udp_iter_state *st = priv_data;
-	struct udp_seq_afinfo *afinfo;
-	int ret;
-
-	afinfo = kmalloc(sizeof(*afinfo), GFP_USER | __GFP_NOWARN);
-	if (!afinfo)
-		return -ENOMEM;
-
-	afinfo->family = AF_UNSPEC;
-	afinfo->udp_table = NULL;
-	st->bpf_seq_afinfo = afinfo;
-	ret = bpf_iter_init_seq_net(priv_data, aux);
-	if (ret)
-		kfree(afinfo);
-	return ret;
+	return bpf_iter_init_seq_net(priv_data, aux);
 }
 
 static void bpf_iter_fini_udp(void *priv_data)
 {
-	struct udp_iter_state *st = priv_data;
-
-	kfree(st->bpf_seq_afinfo);
 	bpf_iter_fini_seq_net(priv_data);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/7] udp: seq_file: Helper function to match socket attributes
  2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
  2023-04-18 15:31 ` [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
  2023-04-18 15:31 ` [PATCH 2/7] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state Aditi Ghag
@ 2023-04-18 15:31 ` Aditi Ghag
  2023-04-18 15:31 ` [PATCH 4/7] bpf: udp: Implement batching for sockets iterator Aditi Ghag
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-04-18 15:31 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

This is a preparatory commit that refactors the code matching socket
attributes in the iterators into a helper function, and uses the helper
in the proc fs iterator.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/udp.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3c9eeee28678..8689ed171776 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2983,6 +2983,16 @@ EXPORT_SYMBOL(udp_prot);
 /* ------------------------------------------------------------------------ */
 #ifdef CONFIG_PROC_FS
 
+static unsigned short seq_file_family(const struct seq_file *seq);
+static bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
+{
+	unsigned short family = seq_file_family(seq);
+
+	/* AF_UNSPEC is used as a match all */
+	return ((family == AF_UNSPEC || family == sk->sk_family) &&
+		net_eq(sock_net(sk), seq_file_net(seq)));
+}
+
 static struct udp_table *udp_get_table_afinfo(struct udp_seq_afinfo *afinfo,
 					      struct net *net)
 {
@@ -3010,10 +3020,7 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
 
 		spin_lock_bh(&hslot->lock);
 		sk_for_each(sk, &hslot->head) {
-			if (!net_eq(sock_net(sk), net))
-				continue;
-			if (afinfo->family == AF_UNSPEC ||
-			    sk->sk_family == afinfo->family)
+			if (seq_sk_match(seq, sk))
 				goto found;
 		}
 		spin_unlock_bh(&hslot->lock);
@@ -3034,9 +3041,7 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
 
 	do {
 		sk = sk_next(sk);
-	} while (sk && (!net_eq(sock_net(sk), net) ||
-			(afinfo->family != AF_UNSPEC &&
-			 sk->sk_family != afinfo->family)));
+	} while (sk && !seq_sk_match(seq, sk));
 
 	if (!sk) {
 		udptable = udp_get_table_afinfo(afinfo, net);
@@ -3196,6 +3201,21 @@ static const struct seq_operations bpf_iter_udp_seq_ops = {
 };
 #endif
 
+static unsigned short seq_file_family(const struct seq_file *seq)
+{
+	const struct udp_seq_afinfo *afinfo;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* BPF iterator: bpf programs to filter sockets. */
+	if (seq->op == &bpf_iter_udp_seq_ops)
+		return AF_UNSPEC;
+#endif
+
+	/* Proc fs iterator */
+	afinfo = pde_data(file_inode(seq->file));
+	return afinfo->family;
+}
+
 const struct seq_operations udp_seq_ops = {
 	.start		= udp_seq_start,
 	.next		= udp_seq_next,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/7] bpf: udp: Implement batching for sockets iterator
  2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
                   ` (2 preceding siblings ...)
  2023-04-18 15:31 ` [PATCH 3/7] udp: seq_file: Helper function to match socket attributes Aditi Ghag
@ 2023-04-18 15:31 ` Aditi Ghag
  2023-04-20 14:07   ` Paolo Abeni
  2023-04-24  5:46   ` Martin KaFai Lau
  2023-04-18 15:31 ` [PATCH 5/7] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-04-18 15:31 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag, Martin KaFai Lau

Batch UDP sockets from the BPF iterator so that BPF/kernel helpers
executed in BPF programs can use overlapping locking semantics. This
allows the BPF socket destroy kfunc (introduced by follow-up patches) to
be executed from BPF iterator programs.

Previously, BPF iterators acquired the sock lock and the sockets hash
table bucket lock while executing BPF programs. This prevented BPF helpers
that also acquire these locks from being executed from BPF iterators. With
the batching approach, we acquire the bucket lock, batch all the sockets
in the bucket, and then release the bucket lock. This enables BPF or
kernel helpers to skip sock locking when invoked in the supported BPF
contexts.

The batching logic is similar to the logic implemented in TCP iterator:
https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/udp.c | 209 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 203 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8689ed171776..f1c001641e53 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3148,6 +3148,145 @@ struct bpf_iter__udp {
 	int bucket __aligned(8);
 };
 
+struct bpf_udp_iter_state {
+	struct udp_iter_state state;
+	unsigned int cur_sk;
+	unsigned int end_sk;
+	unsigned int max_sk;
+	int offset;
+	struct sock **batch;
+	bool st_bucket_done;
+};
+
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz);
+static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct udp_iter_state *state = &iter->state;
+	struct net *net = seq_file_net(seq);
+	struct udp_seq_afinfo afinfo;
+	struct udp_table *udptable;
+	unsigned int batch_sks = 0;
+	bool resized = false;
+	struct sock *sk;
+
+	/* The current batch is done, so advance the bucket. */
+	if (iter->st_bucket_done) {
+		state->bucket++;
+		iter->offset = 0;
+	}
+
+	afinfo.family = AF_UNSPEC;
+	afinfo.udp_table = NULL;
+	udptable = udp_get_table_afinfo(&afinfo, net);
+
+again:
+	/* New batch for the next bucket.
+	 * Iterate over the hash table to find a bucket with sockets matching
+	 * the iterator attributes, and return the first matching socket from
+	 * the bucket. The remaining matched sockets from the bucket are batched
+	 * before releasing the bucket lock. This allows BPF programs that are
+	 * called in seq_show to acquire the bucket lock if needed.
+	 */
+	iter->cur_sk = 0;
+	iter->end_sk = 0;
+	iter->st_bucket_done = false;
+	batch_sks = 0;
+
+	for (; state->bucket <= udptable->mask; state->bucket++) {
+		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
+
+		if (hlist_empty(&hslot2->head)) {
+			iter->offset = 0;
+			continue;
+		}
+
+		spin_lock_bh(&hslot2->lock);
+		udp_portaddr_for_each_entry(sk, &hslot2->head) {
+			if (seq_sk_match(seq, sk)) {
+				/* Resume from the last iterated socket at the
+				 * offset in the bucket before iterator was stopped.
+				 */
+				if (iter->offset) {
+					--iter->offset;
+					continue;
+				}
+				if (iter->end_sk < iter->max_sk) {
+					sock_hold(sk);
+					iter->batch[iter->end_sk++] = sk;
+				}
+				batch_sks++;
+			}
+		}
+		spin_unlock_bh(&hslot2->lock);
+
+		if (iter->end_sk)
+			break;
+
+		/* Reset the current bucket's offset before moving to the next bucket. */
+		iter->offset = 0;
+	}
+
+	/* All done: no batch made. */
+	if (!iter->end_sk)
+		return NULL;
+
+	if (iter->end_sk == batch_sks) {
+		/* Batching is done for the current bucket; return the first
+		 * socket to be iterated from the batch.
+		 */
+		iter->st_bucket_done = true;
+		goto ret;
+	}
+	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
+		resized = true;
+		/* Go back to the previous bucket to resize its batch. */
+		state->bucket--;
+		goto again;
+	}
+ret:
+	return iter->batch[0];
+}
+
+static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct sock *sk;
+
+	/* Whenever seq_next() is called, the iter->cur_sk is
+	 * done with seq_show(), so unref the iter->cur_sk.
+	 */
+	if (iter->cur_sk < iter->end_sk) {
+		sock_put(iter->batch[iter->cur_sk++]);
+		++iter->offset;
+	}
+
+	/* After updating iter->cur_sk, check if there are more sockets
+	 * available in the current bucket batch.
+	 */
+	if (iter->cur_sk < iter->end_sk) {
+		sk = iter->batch[iter->cur_sk];
+	} else {
+		// Prepare a new batch.
+		sk = bpf_iter_udp_batch(seq);
+	}
+
+	++*pos;
+	return sk;
+}
+
+static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* bpf iter does not support lseek, so it always
+	 * continue from where it was stop()-ped.
+	 */
+	if (*pos)
+		return bpf_iter_udp_batch(seq);
+
+	return SEQ_START_TOKEN;
+}
+
 static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
 			     struct udp_sock *udp_sk, uid_t uid, int bucket)
 {
@@ -3168,18 +3307,37 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 	struct bpf_prog *prog;
 	struct sock *sk = v;
 	uid_t uid;
+	int rc;
 
 	if (v == SEQ_START_TOKEN)
 		return 0;
 
+	lock_sock(sk);
+
+	if (unlikely(sk_unhashed(sk))) {
+		rc = SEQ_SKIP;
+		goto unlock;
+	}
+
 	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
 	meta.seq = seq;
 	prog = bpf_iter_get_info(&meta, false);
-	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+
+unlock:
+	release_sock(sk);
+	return rc;
+}
+
+static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
+{
+	while (iter->cur_sk < iter->end_sk)
+		sock_put(iter->batch[iter->cur_sk++]);
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 {
+	struct bpf_udp_iter_state *iter = seq->private;
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 
@@ -3190,12 +3348,15 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
 	}
 
-	udp_seq_stop(seq, v);
+	if (iter->cur_sk < iter->end_sk) {
+		bpf_iter_udp_put_batch(iter);
+		iter->st_bucket_done = false;
+	}
 }
 
 static const struct seq_operations bpf_iter_udp_seq_ops = {
-	.start		= udp_seq_start,
-	.next		= udp_seq_next,
+	.start		= bpf_iter_udp_seq_start,
+	.next		= bpf_iter_udp_seq_next,
 	.stop		= bpf_iter_udp_seq_stop,
 	.show		= bpf_iter_udp_seq_show,
 };
@@ -3424,21 +3585,57 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
 DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 		     struct udp_sock *udp_sk, uid_t uid, int bucket)
 
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz)
+{
+	struct sock **new_batch;
+
+	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
+				   GFP_USER | __GFP_NOWARN);
+	if (!new_batch)
+		return -ENOMEM;
+
+	bpf_iter_udp_put_batch(iter);
+	kvfree(iter->batch);
+	iter->batch = new_batch;
+	iter->max_sk = new_batch_sz;
+
+	return 0;
+}
+
+#define INIT_BATCH_SZ 16
+
 static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	return bpf_iter_init_seq_net(priv_data, aux);
+	struct bpf_udp_iter_state *iter = priv_data;
+	int ret;
+
+	ret = bpf_iter_init_seq_net(priv_data, aux);
+	if (ret)
+		return ret;
+
+	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
+	if (ret) {
+		bpf_iter_fini_seq_net(priv_data);
+		return ret;
+	}
+
+	return ret;
 }
 
 static void bpf_iter_fini_udp(void *priv_data)
 {
+	struct bpf_udp_iter_state *iter = priv_data;
+
 	bpf_iter_fini_seq_net(priv_data);
+	kvfree(iter->batch);
 }
 
 static const struct bpf_iter_seq_info udp_seq_info = {
 	.seq_ops		= &bpf_iter_udp_seq_ops,
 	.init_seq_private	= bpf_iter_init_udp,
 	.fini_seq_private	= bpf_iter_fini_udp,
-	.seq_priv_size		= sizeof(struct udp_iter_state),
+	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
 };
 
 static struct bpf_iter_reg udp_reg_info = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/7] bpf: Add bpf_sock_destroy kfunc
  2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
                   ` (3 preceding siblings ...)
  2023-04-18 15:31 ` [PATCH 4/7] bpf: udp: Implement batching for sockets iterator Aditi Ghag
@ 2023-04-18 15:31 ` Aditi Ghag
  2023-04-18 15:31 ` [PATCH 6/7] selftests/bpf: Add helper to get port using getsockname Aditi Ghag
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-04-18 15:31 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium to force
client sockets to reconnect when their remote load-balancing backends are
deleted. The other use case is on-the-fly policy enforcement, where
existing socket connections that are disallowed by policies need to be
forcefully terminated. The kfunc allows terminating sockets that may or
may not be actively sending traffic.

The kfunc is currently exposed to certain BPF iterators where users can
filter and terminate selected sockets. Additionally, it can only be called
from BPF contexts that ensure socket locking, so that the destroy
handlers, which also acquire socket locks, can be executed synchronously.
The previous commit that batches UDP sockets during iteration enables such
a synchronous invocation from the BPF context by skipping socket locking
in the destroy handler. TCP iterators already supported batching.

The kfunc takes a `sock_common` argument, even though it expects, and
casts it to, a `sock` pointer. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and for UDP
with `sock` structs. As a comparison, BPF helpers enable this behavior
with the `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is
no such option available in the verifier logic that handles kfuncs, where
BTF types are inferred. Furthermore, as `sock_common` only has a subset of
the fields of `sock`, casting the pointer to the latter type might not
always be safe for certain sockets, such as request sockets, but those
have special handling in the diag_destroy handlers.
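
For reference, a BPF iterator program declares the kfunc with __ksym and
invokes it on the iterator's socket pointer. Below is a minimal sketch
along the lines of the selftests added later in this series (the program
name and the unconditional destroy of all IPv6 sockets are illustrative
only):

```
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define AF_INET6 10

int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/tcp")
int iter_tcp6_destroy(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk_common = ctx->sk_common;

	if (!sk_common || sk_common->skc_family != AF_INET6)
		return 0;

	/* Filter on cookie, port, etc. here, then destroy. */
	bpf_sock_destroy(sk_common);

	return 0;
}

char _license[] SEC("license") = "GPL";
```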

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/core/filter.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c    | 10 ++++++---
 net/ipv4/udp.c    |  6 +++--
 3 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 727c5269867d..7d1c1da77aa4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11715,3 +11715,60 @@ static int __init bpf_kfunc_init(void)
 	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &bpf_kfunc_set_xdp);
 }
 late_initcall(bpf_kfunc_init);
+
+/* Disables missing prototype warnings */
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
+ *
+ * The helper expects a non-NULL pointer to a socket. It invokes the
+ * protocol specific socket destroy handlers.
+ *
+ * The helper can only be called from BPF contexts that have acquired the socket
+ * locks.
+ *
+ * Parameters:
+ * @sock: Pointer to socket to be destroyed
+ *
+ * Return:
+ * On error, may return EPROTONOSUPPORT, EINVAL.
+ * EPROTONOSUPPORT if protocol specific destroy handler is not implemented.
+ * 0 otherwise
+ */
+__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
+{
+	struct sock *sk = (struct sock *)sock;
+
+	if (!sk)
+		return -EINVAL;
+
+	/* The locking semantics that allow for synchronous execution of the
+	 * destroy handlers are only supported for TCP and UDP.
+	 * Supporting protocols will need to acquire lock_sock in the BPF context
+	 * prior to invoking this kfunc.
+	 */
+	if (!sk->sk_prot->diag_destroy || (sk->sk_protocol != IPPROTO_TCP &&
+					   sk->sk_protocol != IPPROTO_UDP))
+		return -EOPNOTSUPP;
+
+	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
+}
+
+__diag_pop()
+
+BTF_SET8_START(sock_destroy_kfunc_set)
+BTF_ID_FLAGS(func, bpf_sock_destroy)
+BTF_SET8_END(sock_destroy_kfunc_set)
+
+static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &sock_destroy_kfunc_set,
+};
+
+static int init_subsystem(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
+}
+late_initcall(init_subsystem);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..2259b4facc2f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4679,8 +4679,10 @@ int tcp_abort(struct sock *sk, int err)
 		return 0;
 	}
 
-	/* Don't race with userspace socket closes such as tcp_close. */
-	lock_sock(sk);
+	/* BPF context ensures sock locking. */
+	if (!has_current_bpf_ctx())
+		/* Don't race with userspace socket closes such as tcp_close. */
+		lock_sock(sk);
 
 	if (sk->sk_state == TCP_LISTEN) {
 		tcp_set_state(sk, TCP_CLOSE);
@@ -4702,9 +4704,11 @@ int tcp_abort(struct sock *sk, int err)
 	}
 
 	bh_unlock_sock(sk);
+
 	local_bh_enable();
 	tcp_write_queue_purge(sk);
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(tcp_abort);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f1c001641e53..a358a71839ef 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2925,7 +2925,8 @@ EXPORT_SYMBOL(udp_poll);
 
 int udp_abort(struct sock *sk, int err)
 {
-	lock_sock(sk);
+	if (!has_current_bpf_ctx())
+		lock_sock(sk);
 
 	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
 	 * with close()
@@ -2938,7 +2939,8 @@ int udp_abort(struct sock *sk, int err)
 	__udp_disconnect(sk, 0);
 
 out:
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/7] selftests/bpf: Add helper to get port using getsockname
  2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
                   ` (4 preceding siblings ...)
  2023-04-18 15:31 ` [PATCH 5/7] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
@ 2023-04-18 15:31 ` Aditi Ghag
  2023-04-18 18:45   ` Stanislav Fomichev
  2023-04-18 15:31 ` [PATCH 7/7] selftests/bpf: Test bpf_sock_destroy Aditi Ghag
  2023-04-24 22:15 ` [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Martin KaFai Lau
  7 siblings, 1 reply; 26+ messages in thread
From: Aditi Ghag @ 2023-04-18 15:31 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

The helper will be used to programmatically retrieve and pass ports in
userspace and kernel selftest programs.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 tools/testing/selftests/bpf/network_helpers.c | 28 +++++++++++++++++++
 tools/testing/selftests/bpf/network_helpers.h |  1 +
 2 files changed, 29 insertions(+)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 596caa176582..7217cac762f0 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -427,3 +427,31 @@ void close_netns(struct nstoken *token)
 	close(token->orig_netns_fd);
 	free(token);
 }
+
+int get_socket_local_port(int family, int sock_fd, __u16 *out_port)
+{
+	socklen_t addr_len;
+	int err;
+
+	if (family == AF_INET) {
+		struct sockaddr_in addr = {};
+
+		addr_len = sizeof(addr);
+		err = getsockname(sock_fd, (struct sockaddr *)&addr, &addr_len);
+		if (err < 0)
+			return err;
+		*out_port = addr.sin_port;
+		return 0;
+	} else if (family == AF_INET6) {
+		struct sockaddr_in6 addr = {};
+
+		addr_len = sizeof(addr);
+		err = getsockname(sock_fd, (struct sockaddr *)&addr, &addr_len);
+		if (err < 0)
+			return err;
+		*out_port = addr.sin6_port;
+		return 0;
+	}
+
+	return -1;
+}
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index f882c691b790..ca4a147b58b8 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -56,6 +56,7 @@ int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
 int make_sockaddr(int family, const char *addr_str, __u16 port,
 		  struct sockaddr_storage *addr, socklen_t *len);
 char *ping_command(int family);
+int get_socket_local_port(int family, int sock_fd, __u16 *out_port);
 
 struct nstoken;
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 7/7] selftests/bpf: Test bpf_sock_destroy
  2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
                   ` (5 preceding siblings ...)
  2023-04-18 15:31 ` [PATCH 6/7] selftests/bpf: Add helper to get port using getsockname Aditi Ghag
@ 2023-04-18 15:31 ` Aditi Ghag
  2023-04-24 19:20   ` Martin KaFai Lau
  2023-04-24 22:15 ` [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Martin KaFai Lau
  7 siblings, 1 reply; 26+ messages in thread
From: Aditi Ghag @ 2023-04-18 15:31 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

The test cases for destroying sockets mirror the intended usages of the
bpf_sock_destroy kfunc using iterators.

The destroy helpers set the `ECONNABORTED` error code, which we can
validate in the test code with client sockets. But UDP sockets have an
overriding error code from the disconnect called during abort, so the
error code validation is only done for TCP sockets.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 .../selftests/bpf/prog_tests/sock_destroy.c   | 217 ++++++++++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 147 ++++++++++++
 2 files changed, 364 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
new file mode 100644
index 000000000000..51f2454b7b4b
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <bpf/bpf_endian.h>
+
+#include "sock_destroy_prog.skel.h"
+#include "network_helpers.h"
+
+#define TEST_NS "sock_destroy_netns"
+
+static void start_iter_sockets(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+	char buf[50] = {};
+	int iter_fd, len;
+
+	link = bpf_program__attach_iter(prog, NULL);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
+		goto free_link;
+
+	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
+		;
+	ASSERT_GE(len, 0, "read");
+
+	close(iter_fd);
+
+free_link:
+	bpf_link__destroy(link);
+}
+
+static void test_tcp_client(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup_serv;
+
+	serv = accept(serv, NULL, NULL);
+	if (!ASSERT_GE(serv, 0, "serv accept"))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_GE(n, 0, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys connected client sockets. */
+	start_iter_sockets(skel->progs.iter_tcp6_client);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
+
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+static void test_tcp_server(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0, err;
+	__u16 serv_port = 0;
+
+	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup_serv;
+	err = get_socket_local_port(AF_INET6, serv, &serv_port);
+	if (!ASSERT_EQ(err, 0, "get_local_port"))
+		goto cleanup;
+	skel->bss->serv_port = serv_port;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup_serv;
+
+	serv = accept(serv, NULL, NULL);
+	if (!ASSERT_GE(serv, 0, "serv accept"))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_GE(n, 0, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys server sockets. */
+	start_iter_sockets(skel->progs.iter_tcp6_server);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
+
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+
+static void test_udp_client(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 0, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup_serv;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_GE(n, 0, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys sockets. */
+	start_iter_sockets(skel->progs.iter_udp6_client);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	/* UDP sockets have an overriding error code after they are disconnected,
+	 * so we don't check for ECONNABORTED error code.
+	 */
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+static void test_udp_server(struct sock_destroy_prog *skel)
+{
+	int *listen_fds = NULL, n, i, err;
+	unsigned int num_listens = 5;
+	char buf[1];
+	__u16 serv_port;
+
+	/* Start reuseport servers. */
+	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
+					    "::1", 0, 0, num_listens);
+	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
+		goto cleanup;
+	err = get_socket_local_port(AF_INET6, listen_fds[0], &serv_port);
+	if (!ASSERT_EQ(err, 0, "get_local_port"))
+		goto cleanup;
+	skel->bss->serv_port = ntohs(serv_port);
+
+	/* Run iterator program that destroys server sockets. */
+	start_iter_sockets(skel->progs.iter_udp6_server);
+
+	for (i = 0; i < num_listens; ++i) {
+		n = read(listen_fds[i], buf, sizeof(buf));
+		if (!ASSERT_EQ(n, -1, "read") ||
+		    !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
+			break;
+	}
+	ASSERT_EQ(i, num_listens, "server socket");
+
+cleanup:
+	free_fds(listen_fds, num_listens);
+}
+
+void test_sock_destroy(void)
+{
+	struct sock_destroy_prog *skel;
+	struct nstoken *nstoken = NULL;
+	int cgroup_fd = 0;
+
+	skel = sock_destroy_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	cgroup_fd = test__join_cgroup("/sock_destroy");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		goto close_cgroup_fd;
+
+	skel->links.sock_connect = bpf_program__attach_cgroup(
+		skel->progs.sock_connect, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
+		goto close_cgroup_fd;
+
+	SYS(fail, "ip netns add %s", TEST_NS);
+	SYS(fail, "ip -net %s link set dev lo up", TEST_NS);
+
+	nstoken = open_netns(TEST_NS);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto fail;
+
+	if (test__start_subtest("tcp_client"))
+		test_tcp_client(skel);
+	if (test__start_subtest("tcp_server"))
+		test_tcp_server(skel);
+	if (test__start_subtest("udp_client"))
+		test_udp_client(skel);
+	if (test__start_subtest("udp_server"))
+		test_udp_server(skel);
+
+
+fail:
+	if (nstoken)
+		close_netns(nstoken);
+	SYS_NOFAIL("ip netns del " TEST_NS " &> /dev/null");
+close_cgroup_fd:
+	close(cgroup_fd);
+	sock_destroy_prog__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
new file mode 100644
index 000000000000..1f265e0d9dea
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#include "bpf_tracing_net.h"
+
+#define AF_INET6 10
+
+__u16 serv_port = 0;
+
+int bpf_sock_destroy(struct sock_common *sk) __ksym;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} tcp_conn_sockets SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} udp_conn_sockets SEC(".maps");
+
+SEC("cgroup/connect6")
+int sock_connect(struct bpf_sock_addr *ctx)
+{
+	int key = 0;
+	__u64 sock_cookie = 0;
+	__u32 keyc = 0;
+
+	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
+		return 1;
+
+	sock_cookie = bpf_get_socket_cookie(ctx);
+	if (ctx->protocol == IPPROTO_TCP)
+		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
+	else if (ctx->protocol == IPPROTO_UDP)
+		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
+	else
+		return 1;
+
+	return 1;
+}
+
+SEC("iter/tcp")
+int iter_tcp6_client(struct bpf_iter__tcp *ctx)
+{
+	struct sock_common *sk_common = ctx->sk_common;
+	__u64 sock_cookie = 0;
+	__u64 *val;
+	int key = 0;
+
+	if (!sk_common)
+		return 0;
+
+	if (sk_common->skc_family != AF_INET6)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk_common);
+	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
+	if (!val)
+		return 0;
+	/* Destroy connected client sockets. */
+	if (sock_cookie == *val)
+		bpf_sock_destroy(sk_common);
+
+	return 0;
+}
+
+SEC("iter/tcp")
+int iter_tcp6_server(struct bpf_iter__tcp *ctx)
+{
+	struct sock_common *sk_common = ctx->sk_common;
+	struct tcp6_sock *tcp_sk;
+	const struct inet_connection_sock *icsk;
+	const struct inet_sock *inet;
+	__u16 srcp;
+
+	if (!sk_common)
+		return 0;
+
+	if (sk_common->skc_family != AF_INET6)
+		return 0;
+
+	tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
+	if (!tcp_sk)
+		return 0;
+
+	icsk = &tcp_sk->tcp.inet_conn;
+	inet = &icsk->icsk_inet;
+	srcp = inet->inet_sport;
+
+	/* Destroy server sockets. */
+	if (srcp == serv_port)
+		bpf_sock_destroy(sk_common);
+
+	return 0;
+}
+
+
+SEC("iter/udp")
+int iter_udp6_client(struct bpf_iter__udp *ctx)
+{
+	struct udp_sock *udp_sk = ctx->udp_sk;
+	struct sock *sk = (struct sock *) udp_sk;
+	__u64 sock_cookie = 0, *val;
+	int key = 0;
+
+	if (!sk)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk);
+	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
+	if (!val)
+		return 0;
+	/* Destroy connected client sockets. */
+	if (sock_cookie == *val)
+		bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
+SEC("iter/udp")
+int iter_udp6_server(struct bpf_iter__udp *ctx)
+{
+	struct udp_sock *udp_sk = ctx->udp_sk;
+	struct sock *sk = (struct sock *) udp_sk;
+	__u16 srcp;
+	struct inet_sock *inet;
+
+	if (!sk)
+		return 0;
+
+	inet = &udp_sk->inet;
+	srcp = bpf_ntohs(inet->inet_sport);
+	if (srcp == serv_port)
+		bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/7] selftests/bpf: Add helper to get port using getsockname
  2023-04-18 15:31 ` [PATCH 6/7] selftests/bpf: Add helper to get port using getsockname Aditi Ghag
@ 2023-04-18 18:45   ` Stanislav Fomichev
  2023-04-24 17:58     ` Martin KaFai Lau
  0 siblings, 1 reply; 26+ messages in thread
From: Stanislav Fomichev @ 2023-04-18 18:45 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On 04/18, Aditi Ghag wrote:
> The helper will be used to programmatically retrieve,
> and pass ports in userspace and kernel selftest programs.
> 
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>  tools/testing/selftests/bpf/network_helpers.c | 28 +++++++++++++++++++
>  tools/testing/selftests/bpf/network_helpers.h |  1 +
>  2 files changed, 29 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
> index 596caa176582..7217cac762f0 100644
> --- a/tools/testing/selftests/bpf/network_helpers.c
> +++ b/tools/testing/selftests/bpf/network_helpers.c
> @@ -427,3 +427,31 @@ void close_netns(struct nstoken *token)
>  	close(token->orig_netns_fd);
>  	free(token);
>  }
> +
> +int get_socket_local_port(int family, int sock_fd, __u16 *out_port)
> +{
> +	socklen_t addr_len;
> +	int err;

Sorry to keep bikeshedding this part, but if you're going to do
another respin, we can also drop the family argument:

int get_socket_local_port(int sock_fd, __be16 *out_port)
/*                                       ^^ maybe also do be16? */
{
	struct sockaddr_storage addr; 
	socklen_t addrlen;

	addrlen = sizeof(addr);
	getsockname(sock_fd, (struct sockaddr *)&addr, &addrlen);

	if (addr.ss_family == AF_INET) {
	} else if () {
	}
}
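
Filled in, it could look something like this (untested, assuming the
usual network_helpers.c includes; returning -errno on getsockname()
failure and -EAFNOSUPPORT for other families is just one option):

int get_socket_local_port(int sock_fd, __be16 *out_port)
{
	struct sockaddr_storage addr;
	socklen_t addrlen = sizeof(addr);

	/* sockaddr_storage is large enough for both v4 and v6 */
	if (getsockname(sock_fd, (struct sockaddr *)&addr, &addrlen) < 0)
		return -errno;

	if (addr.ss_family == AF_INET) {
		*out_port = ((struct sockaddr_in *)&addr)->sin_port;
		return 0;
	} else if (addr.ss_family == AF_INET6) {
		*out_port = ((struct sockaddr_in6 *)&addr)->sin6_port;
		return 0;
	}

	return -EAFNOSUPPORT;
}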

> +
> +	if (family == AF_INET) {
> +		struct sockaddr_in addr = {};
> +
> +		addr_len = sizeof(addr);
> +		err = getsockname(sock_fd, (struct sockaddr *)&addr, &addr_len);
> +		if (err < 0)
> +			return err;
> +		*out_port = addr.sin_port;
> +		return 0;
> +	} else if (family == AF_INET6) {
> +		struct sockaddr_in6 addr = {};
> +
> +		addr_len = sizeof(addr);
> +		err = getsockname(sock_fd, (struct sockaddr *)&addr, &addr_len);
> +		if (err < 0)
> +			return err;
> +		*out_port = addr.sin6_port;
> +		return 0;
> +	}
> +
> +	return -1;
> +}
> diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
> index f882c691b790..ca4a147b58b8 100644
> --- a/tools/testing/selftests/bpf/network_helpers.h
> +++ b/tools/testing/selftests/bpf/network_helpers.h
> @@ -56,6 +56,7 @@ int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
>  int make_sockaddr(int family, const char *addr_str, __u16 port,
>  		  struct sockaddr_storage *addr, socklen_t *len);
>  char *ping_command(int family);
> +int get_socket_local_port(int family, int sock_fd, __u16 *out_port);
>  
>  struct nstoken;
>  /**
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator
  2023-04-18 15:31 ` [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
@ 2023-04-20  8:55   ` Paolo Abeni
  2023-05-03 20:25     ` Aditi Ghag
  2023-04-25  5:45   ` Yonghong Song
  1 sibling, 1 reply; 26+ messages in thread
From: Paolo Abeni @ 2023-04-20  8:55 UTC (permalink / raw)
  To: Aditi Ghag, bpf; +Cc: kafai, sdf, edumazet

Hi,

On Tue, 2023-04-18 at 15:31 +0000, Aditi Ghag wrote:
> Previously, BPF TCP iterator was acquiring fast version of sock lock that
> disables the BH. This introduced a circular dependency with code paths that
> later acquire sockets hash table bucket lock.
> Replace the fast version of sock lock with slow that faciliates BPF
> programs executed from the iterator to destroy TCP listening sockets
> using the bpf_sock_destroy kfunc (implemened in follow-up commits).
> 
> Here is a stack trace that motivated this change:
> 
> ```
> 1) sock_lock with BH disabled + bucket lock
> 
> lock_acquire+0xcd/0x330
> _raw_spin_lock_bh+0x38/0x50
> inet_unhash+0x96/0xd0
> tcp_set_state+0x6a/0x210
> tcp_abort+0x12b/0x230
> bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
> bpf_iter_run_prog+0x1ff/0x340
> bpf_iter_tcp_seq_show+0xca/0x190
> bpf_seq_read+0x177/0x450
> vfs_read+0xc6/0x300
> ksys_read+0x69/0xf0
> do_syscall_64+0x3c/0x90
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> 
> 2) sock lock with BH enable
> 
> [    1.499968]   lock_acquire+0xcd/0x330
> [    1.500316]   _raw_spin_lock+0x33/0x40

The above is quite confusing to me; here BHs are disabled as well
(otherwise the whole softirq processing would be really broken).

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] bpf: udp: Implement batching for sockets iterator
  2023-04-18 15:31 ` [PATCH 4/7] bpf: udp: Implement batching for sockets iterator Aditi Ghag
@ 2023-04-20 14:07   ` Paolo Abeni
  2023-04-24  5:46   ` Martin KaFai Lau
  1 sibling, 0 replies; 26+ messages in thread
From: Paolo Abeni @ 2023-04-20 14:07 UTC (permalink / raw)
  To: Aditi Ghag, bpf; +Cc: kafai, sdf, edumazet, Martin KaFai Lau

On Tue, 2023-04-18 at 15:31 +0000, Aditi Ghag wrote:
> Batch UDP sockets from BPF iterator that allows for overlapping locking
> semantics in BPF/kernel helpers executed in BPF programs.  This facilitates
> BPF socket destroy kfunc (introduced by follow-up patches) to execute from
> BPF iterator programs.
> 
> Previously, BPF iterators acquired the sock lock and sockets hash table
> bucket lock while executing BPF programs. This prevented BPF helpers that
> again acquire these locks to be executed from BPF iterators.  With the
> batching approach, we acquire a bucket lock, batch all the bucket sockets,
> and then release the bucket lock. This enables BPF or kernel helpers to
> skip sock locking when invoked in the supported BPF contexts.
> 
> The batching logic is similar to the logic implemented in TCP iterator:
> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
> 
> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>  net/ipv4/udp.c | 209 +++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 203 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 8689ed171776..f1c001641e53 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3148,6 +3148,145 @@ struct bpf_iter__udp {
>  	int bucket __aligned(8);
>  };
>  
> +struct bpf_udp_iter_state {
> +	struct udp_iter_state state;
> +	unsigned int cur_sk;
> +	unsigned int end_sk;
> +	unsigned int max_sk;
> +	int offset;
> +	struct sock **batch;
> +	bool st_bucket_done;
> +};
> +
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz);
> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct net *net = seq_file_net(seq);
> +	struct udp_seq_afinfo afinfo;
> +	struct udp_table *udptable;
> +	unsigned int batch_sks = 0;
> +	bool resized = false;
> +	struct sock *sk;
> +
> +	/* The current batch is done, so advance the bucket. */
> +	if (iter->st_bucket_done) {
> +		state->bucket++;
> +		iter->offset = 0;
> +	}
> +
> +	afinfo.family = AF_UNSPEC;
> +	afinfo.udp_table = NULL;
> +	udptable = udp_get_table_afinfo(&afinfo, net);
> +
> +again:
> +	/* New batch for the next bucket.
> +	 * Iterate over the hash table to find a bucket with sockets matching
> +	 * the iterator attributes, and return the first matching socket from
> +	 * the bucket. The remaining matched sockets from the bucket are batched
> +	 * before releasing the bucket lock. This allows BPF programs that are
> +	 * called in seq_show to acquire the bucket lock if needed.
> +	 */
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	batch_sks = 0;
> +
> +	for (; state->bucket <= udptable->mask; state->bucket++) {
> +		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
> +
> +		if (hlist_empty(&hslot2->head)) {
> +			iter->offset = 0;
> +			continue;
> +		}
> +
> +		spin_lock_bh(&hslot2->lock);
> +		udp_portaddr_for_each_entry(sk, &hslot2->head) {
> +			if (seq_sk_match(seq, sk)) {
> +				/* Resume from the last iterated socket at the
> +				 * offset in the bucket before iterator was stopped.
> +				 */
> +				if (iter->offset) {
> +					--iter->offset;
> +					continue;
> +				}
> +				if (iter->end_sk < iter->max_sk) {
> +					sock_hold(sk);
> +					iter->batch[iter->end_sk++] = sk;
> +				}
> +				batch_sks++;
> +			}
> +		}
> +		spin_unlock_bh(&hslot2->lock);
> +
> +		if (iter->end_sk)
> +			break;
> +
> +		/* Reset the current bucket's offset before moving to the next bucket. */
> +		iter->offset = 0;
> +	}
> +
> +	/* All done: no batch made. */
> +	if (!iter->end_sk)
> +		return NULL;
> +
> +	if (iter->end_sk == batch_sks) {
> +		/* Batching is done for the current bucket; return the first
> +		 * socket to be iterated from the batch.
> +		 */
> +		iter->st_bucket_done = true;
> +		goto ret;
> +	}
> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
> +		resized = true;
> +		/* Go back to the previous bucket to resize its batch. */
> +		state->bucket--;
> +		goto again;
> +	}
> +ret:
> +	return iter->batch[0];
> +}
> +
> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct sock *sk;
> +
> +	/* Whenever seq_next() is called, the iter->cur_sk is
> +	 * done with seq_show(), so unref the iter->cur_sk.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sock_put(iter->batch[iter->cur_sk++]);
> +		++iter->offset;
> +	}
> +
> +	/* After updating iter->cur_sk, check if there are more sockets
> +	 * available in the current bucket batch.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sk = iter->batch[iter->cur_sk];
> +	} else {
> +		// Prepare a new batch.

Minor nit: please use /* */ even for single line comments.

Thanks

Paolo


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/7] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
  2023-04-18 15:31 ` [PATCH 2/7] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state Aditi Ghag
@ 2023-04-24  0:18   ` Martin KaFai Lau
  2023-05-01 22:39     ` Aditi Ghag
  0 siblings, 1 reply; 26+ messages in thread
From: Martin KaFai Lau @ 2023-04-24  0:18 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: sdf, edumazet, Martin KaFai Lau, bpf

On 4/18/23 8:31 AM, Aditi Ghag wrote:
> This is a preparatory commit to remove the field. The field was
> previously shared between proc fs and BPF UDP socket iterators. As the
> follow-up commits will decouple the implementation for the iterators,
> remove the field. As for BPF socket iterator, filtering of sockets is
> exepected to be done in BPF programs.
> 
> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   include/net/udp.h |  1 -
>   net/ipv4/udp.c    | 34 ++++------------------------------
>   2 files changed, 4 insertions(+), 31 deletions(-)
> 
> diff --git a/include/net/udp.h b/include/net/udp.h
> index de4b528522bb..5cad44318d71 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -437,7 +437,6 @@ struct udp_seq_afinfo {
>   struct udp_iter_state {
>   	struct seq_net_private  p;
>   	int			bucket;
> -	struct udp_seq_afinfo	*bpf_seq_afinfo;
>   };
>   
>   void *udp_seq_start(struct seq_file *seq, loff_t *pos);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index c605d171eb2d..3c9eeee28678 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2997,10 +2997,7 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
>   	struct udp_table *udptable;
>   	struct sock *sk;
>   
> -	if (state->bpf_seq_afinfo)
> -		afinfo = state->bpf_seq_afinfo;
> -	else
> -		afinfo = pde_data(file_inode(seq->file));
> +	afinfo = pde_data(file_inode(seq->file));

I can see how this change will work after patch 4. However, this patch
alone cannot work independently as is. The udp bpf iter still uses
udp_get_{first,next} and udp_seq_stop() up to this patch.

First, the patch 3 refactoring should be done before patch 2 here. The
removal of 'struct udp_seq_afinfo *bpf_seq_afinfo' in patch 2 should be
done when all the necessary refactoring is in place first.

Also, this afinfo is passed to udp_get_table_afinfo(). How about renaming
udp_get_table_afinfo() to udp_get_table_seq() and having it take the "seq"
as the arg instead? This will probably deserve another refactoring patch
before finally removing bpf_seq_afinfo. Something like this (un-compiled
code):

static struct udp_table *udp_get_table_seq(struct seq_file *seq,
                                           struct net *net)
{
	struct udp_iter_state *st = seq->private;
	const struct udp_seq_afinfo *afinfo;

	if (st->bpf_seq_afinfo)
		return net->ipv4.udp_table;

	afinfo = pde_data(file_inode(seq->file));
	return afinfo->udp_table ? : net->ipv4.udp_table;
}

Of course, when the later patch finally removes the bpf_seq_afinfo, the 'if 
(st->bpf_seq_afinfo)' test should be replaced with the 'if (seq->op == 
&bpf_iter_udp_seq_ops)' test.

That will also make the afinfo dance in bpf_iter_udp_batch() in patch 4 go away.
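
Also un-compiled, but the end state (once bpf_seq_afinfo is gone and the
seq->op check from the later patch is in place, mirroring the
seq_file_family() check in patch 3) would look roughly like:

static struct udp_table *udp_get_table_seq(struct seq_file *seq,
                                           struct net *net)
{
	const struct udp_seq_afinfo *afinfo;

#ifdef CONFIG_BPF_SYSCALL
	/* The BPF iterator always walks the whole table. */
	if (seq->op == &bpf_iter_udp_seq_ops)
		return net->ipv4.udp_table;
#endif

	afinfo = pde_data(file_inode(seq->file));
	return afinfo->udp_table ? : net->ipv4.udp_table;
}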

>   
>   	udptable = udp_get_table_afinfo(afinfo, net);
>   
> @@ -3033,10 +3030,7 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
>   	struct udp_seq_afinfo *afinfo;
>   	struct udp_table *udptable;
>   
> -	if (state->bpf_seq_afinfo)
> -		afinfo = state->bpf_seq_afinfo;
> -	else
> -		afinfo = pde_data(file_inode(seq->file));
> +	afinfo = pde_data(file_inode(seq->file));
>   
>   	do {
>   		sk = sk_next(sk);
> @@ -3094,10 +3088,7 @@ void udp_seq_stop(struct seq_file *seq, void *v)
>   	struct udp_seq_afinfo *afinfo;
>   	struct udp_table *udptable;
>   
> -	if (state->bpf_seq_afinfo)
> -		afinfo = state->bpf_seq_afinfo;
> -	else
> -		afinfo = pde_data(file_inode(seq->file));
> +	afinfo = pde_data(file_inode(seq->file));
>   
>   	udptable = udp_get_table_afinfo(afinfo, seq_file_net(seq));
>   
> @@ -3415,28 +3406,11 @@ DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
>   
>   static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   {
> -	struct udp_iter_state *st = priv_data;
> -	struct udp_seq_afinfo *afinfo;
> -	int ret;
> -
> -	afinfo = kmalloc(sizeof(*afinfo), GFP_USER | __GFP_NOWARN);
> -	if (!afinfo)
> -		return -ENOMEM;
> -
> -	afinfo->family = AF_UNSPEC;
> -	afinfo->udp_table = NULL;
> -	st->bpf_seq_afinfo = afinfo;
> -	ret = bpf_iter_init_seq_net(priv_data, aux);
> -	if (ret)
> -		kfree(afinfo);
> -	return ret;
> +	return bpf_iter_init_seq_net(priv_data, aux);

Nice simplification with the bpf_seq_afinfo cleanup.

>   }
>   
>   static void bpf_iter_fini_udp(void *priv_data)
>   {
> -	struct udp_iter_state *st = priv_data;
> -
> -	kfree(st->bpf_seq_afinfo);
>   	bpf_iter_fini_seq_net(priv_data);
>   }
>   


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/7] bpf: udp: Implement batching for sockets iterator
  2023-04-18 15:31 ` [PATCH 4/7] bpf: udp: Implement batching for sockets iterator Aditi Ghag
  2023-04-20 14:07   ` Paolo Abeni
@ 2023-04-24  5:46   ` Martin KaFai Lau
  1 sibling, 0 replies; 26+ messages in thread
From: Martin KaFai Lau @ 2023-04-24  5:46 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, sdf, edumazet, Martin KaFai Lau, bpf

On 4/18/23 8:31 AM, Aditi Ghag wrote:
> Batch UDP sockets from the BPF iterator so that overlapping locking
> semantics are allowed in BPF/kernel helpers executed in BPF programs.  This
> facilitates the BPF socket destroy kfunc (introduced by follow-up patches)
> to execute from BPF iterator programs.
> 
> Previously, BPF iterators acquired the sock lock and sockets hash table
> bucket lock while executing BPF programs. This prevented BPF helpers that
> again acquire these locks from being executed from BPF iterators.  With the
> batching approach, we acquire a bucket lock, batch all the bucket sockets,
> and then release the bucket lock. This enables BPF or kernel helpers to
> skip sock locking when invoked in the supported BPF contexts.
> 
> The batching logic is similar to the logic implemented in TCP iterator:
> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
> 
> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/ipv4/udp.c | 209 +++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 203 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 8689ed171776..f1c001641e53 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3148,6 +3148,145 @@ struct bpf_iter__udp {
>   	int bucket __aligned(8);
>   };
>   
> +struct bpf_udp_iter_state {
> +	struct udp_iter_state state;
> +	unsigned int cur_sk;
> +	unsigned int end_sk;
> +	unsigned int max_sk;
> +	int offset;
> +	struct sock **batch;
> +	bool st_bucket_done;
> +};
> +
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz);
> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct net *net = seq_file_net(seq);
> +	struct udp_seq_afinfo afinfo;
> +	struct udp_table *udptable;
> +	unsigned int batch_sks = 0;
> +	bool resized = false;
> +	struct sock *sk;
> +
> +	/* The current batch is done, so advance the bucket. */
> +	if (iter->st_bucket_done) {
> +		state->bucket++;
> +		iter->offset = 0;
> +	}
> +
> +	afinfo.family = AF_UNSPEC;
> +	afinfo.udp_table = NULL;
> +	udptable = udp_get_table_afinfo(&afinfo, net);
> +
> +again:
> +	/* New batch for the next bucket.
> +	 * Iterate over the hash table to find a bucket with sockets matching
> +	 * the iterator attributes, and return the first matching socket from
> +	 * the bucket. The remaining matched sockets from the bucket are batched
> +	 * before releasing the bucket lock. This allows BPF programs that are
> +	 * called in seq_show to acquire the bucket lock if needed.
> +	 */
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	batch_sks = 0;
> +
> +	for (; state->bucket <= udptable->mask; state->bucket++) {
> +		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
> +
> +		if (hlist_empty(&hslot2->head)) {
> +			iter->offset = 0;
> +			continue;
> +		}
> +
> +		spin_lock_bh(&hslot2->lock);
> +		udp_portaddr_for_each_entry(sk, &hslot2->head) {
> +			if (seq_sk_match(seq, sk)) {
> +				/* Resume from the last iterated socket at the
> +				 * offset in the bucket before iterator was stopped.
> +				 */
> +				if (iter->offset) {
> +					--iter->offset;
> +					continue;
> +				}
> +				if (iter->end_sk < iter->max_sk) {
> +					sock_hold(sk);
> +					iter->batch[iter->end_sk++] = sk;
> +				}
> +				batch_sks++;
> +			}
> +		}
> +		spin_unlock_bh(&hslot2->lock);
> +
> +		if (iter->end_sk)
> +			break;
> +
> +		/* Reset the current bucket's offset before moving to the next bucket. */
> +		iter->offset = 0;
> +	}
> +
> +	/* All done: no batch made. */
> +	if (!iter->end_sk)
> +		return NULL;
> +
> +	if (iter->end_sk == batch_sks) {
> +		/* Batching is done for the current bucket; return the first
> +		 * socket to be iterated from the batch.
> +		 */
> +		iter->st_bucket_done = true;
> +		goto ret;

nit. "ret" is a variable name in some other changes in this patch. How about 
directly "return iter->batch[0];" here or change the label name to something 
else like "done"?

> +	}
> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
> +		resized = true;
> +		/* Go back to the previous bucket to resize its batch. */

nit. I found the "to resize its batch" part confusing. The resize has already 
been done in the above bpf_iter_udp_realloc_batch(). How about something like, 
"After getting a larger max_sk, retry one more time to grab the whole bucket" ?

> +		state->bucket--;
> +		goto again;
> +	}
> +ret:
> +	return iter->batch[0];
> +}
> +
> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct sock *sk;
> +
> +	/* Whenever seq_next() is called, the iter->cur_sk is
> +	 * done with seq_show(), so unref the iter->cur_sk.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sock_put(iter->batch[iter->cur_sk++]);
> +		++iter->offset;
> +	}
> +
> +	/* After updating iter->cur_sk, check if there are more sockets
> +	 * available in the current bucket batch.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sk = iter->batch[iter->cur_sk];
> +	} else {
> +		// Prepare a new batch.
> +		sk = bpf_iter_udp_batch(seq);
> +	}

nit. remove "{ }" for one liner if-else statement.

> +
> +	++*pos;
> +	return sk;
> +}
> +
> +static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	/* bpf iter does not support lseek, so it always
> +	 * continue from where it was stop()-ped.
> +	 */
> +	if (*pos)
> +		return bpf_iter_udp_batch(seq);
> +
> +	return SEQ_START_TOKEN;
> +}
> +
>   static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
>   			     struct udp_sock *udp_sk, uid_t uid, int bucket)
>   {
> @@ -3168,18 +3307,37 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
>   	struct bpf_prog *prog;
>   	struct sock *sk = v;
>   	uid_t uid;
> +	int rc;

nit. Be consistent with the variable name: "ret" is used in other changes in this patch.

>   
>   	if (v == SEQ_START_TOKEN)
>   		return 0;
>   
> +	lock_sock(sk);
> +
> +	if (unlikely(sk_unhashed(sk))) {
> +		rc = SEQ_SKIP;
> +		goto unlock;
> +	}
> +
>   	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
>   	meta.seq = seq;
>   	prog = bpf_iter_get_info(&meta, false);
> -	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +
> +unlock:
> +	release_sock(sk);
> +	return rc;
> +}
> +
> +static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
> +{
> +	while (iter->cur_sk < iter->end_sk)
> +		sock_put(iter->batch[iter->cur_sk++]);
>   }
>   
>   static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   {
> +	struct bpf_udp_iter_state *iter = seq->private;
>   	struct bpf_iter_meta meta;
>   	struct bpf_prog *prog;
>   
> @@ -3190,12 +3348,15 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
>   	}
>   
> -	udp_seq_stop(seq, v);
> +	if (iter->cur_sk < iter->end_sk) {
> +		bpf_iter_udp_put_batch(iter);
> +		iter->st_bucket_done = false;
> +	}
>   }
>   
>   static const struct seq_operations bpf_iter_udp_seq_ops = {
> -	.start		= udp_seq_start,
> -	.next		= udp_seq_next,
> +	.start		= bpf_iter_udp_seq_start,
> +	.next		= bpf_iter_udp_seq_next,
>   	.stop		= bpf_iter_udp_seq_stop,
>   	.show		= bpf_iter_udp_seq_show,
>   };
> @@ -3424,21 +3585,57 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
>   DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
>   		     struct udp_sock *udp_sk, uid_t uid, int bucket)
>   
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz)
> +{
> +	struct sock **new_batch;
> +
> +	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
> +				   GFP_USER | __GFP_NOWARN);
> +	if (!new_batch)
> +		return -ENOMEM;
> +
> +	bpf_iter_udp_put_batch(iter);
> +	kvfree(iter->batch);
> +	iter->batch = new_batch;
> +	iter->max_sk = new_batch_sz;
> +
> +	return 0;
> +}
> +
> +#define INIT_BATCH_SZ 16
> +
>   static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   {
> -	return bpf_iter_init_seq_net(priv_data, aux);
> +	struct bpf_udp_iter_state *iter = priv_data;
> +	int ret;
> +
> +	ret = bpf_iter_init_seq_net(priv_data, aux);
> +	if (ret)
> +		return ret;
> +
> +	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
> +	if (ret) {
> +		bpf_iter_fini_seq_net(priv_data);
> +		return ret;

nit. remove this "return ret;" statement.
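
i.e. the tail of bpf_iter_init_udp() can collapse to:

	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
	if (ret)
		bpf_iter_fini_seq_net(priv_data);

	return ret;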

Others lgtm. I will continue with the rest of the patchset tomorrow.

> +	}
> +
> +	return ret;
>   }
>   
>   static void bpf_iter_fini_udp(void *priv_data)
>   {
> +	struct bpf_udp_iter_state *iter = priv_data;
> +
>   	bpf_iter_fini_seq_net(priv_data);
> +	kvfree(iter->batch);
>   }
>   
>   static const struct bpf_iter_seq_info udp_seq_info = {
>   	.seq_ops		= &bpf_iter_udp_seq_ops,
>   	.init_seq_private	= bpf_iter_init_udp,
>   	.fini_seq_private	= bpf_iter_fini_udp,
> -	.seq_priv_size		= sizeof(struct udp_iter_state),
> +	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
>   };
>   
>   static struct bpf_iter_reg udp_reg_info = {


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/7] selftests/bpf: Add helper to get port using getsockname
  2023-04-18 18:45   ` Stanislav Fomichev
@ 2023-04-24 17:58     ` Martin KaFai Lau
  0 siblings, 0 replies; 26+ messages in thread
From: Martin KaFai Lau @ 2023-04-24 17:58 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet, Stanislav Fomichev

On 4/18/23 11:45 AM, Stanislav Fomichev wrote:
>> +
>> +int get_socket_local_port(int family, int sock_fd, __u16 *out_port)
>> +{
>> +	socklen_t addr_len;
>> +	int err;
> 
> Sorry to keep bikeshedding this part, but if you're going to do
> another respin, we can also drop the family argument:
> 
> int get_socket_local_port(int sock_fd, __be16 *out_port)
> /*                                       ^^ maybe also do be16? */

I would also just return the port as the return value instead of having another 
arg for this. The int is more than enough.

> {
> 	struct sockaddr_storage addr;
> 	socklen_t addrlen;
> 
> 	addrlen = sizeof(addr);
> 	getsockname(sock_fd, (struct sockaddr *)&addr, &addrlen);
> 
> 	if (addr.ss_family == AF_INET) {
> 	} else if () {
> 	}
> }
> 
>> +
>> +	if (family == AF_INET) {
>> +		struct sockaddr_in addr = {};
>> +
>> +		addr_len = sizeof(addr);
>> +		err = getsockname(sock_fd, (struct sockaddr *)&addr, &addr_len);
>> +		if (err < 0)
>> +			return err;
>> +		*out_port = addr.sin_port;
>> +		return 0;
>> +	} else if (family == AF_INET6) {
>> +		struct sockaddr_in6 addr = {};
>> +
>> +		addr_len = sizeof(addr);
>> +		err = getsockname(sock_fd, (struct sockaddr *)&addr, &addr_len);
>> +		if (err < 0)
>> +			return err;
>> +		*out_port = addr.sin6_port;
>> +		return 0;
>> +	}
>> +
>> +	return -1;
>> +}
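
Putting both suggestions together, the helper could end up looking roughly like 
this (un-compiled sketch relying on the usual network_helpers.c includes; it 
returns the port in network byte order, or -1 on failure):

int get_socket_local_port(int sock_fd)
{
	struct sockaddr_storage addr;
	socklen_t addrlen = sizeof(addr);

	if (getsockname(sock_fd, (struct sockaddr *)&addr, &addrlen) < 0)
		return -1;

	if (addr.ss_family == AF_INET)
		return ((struct sockaddr_in *)&addr)->sin_port;
	if (addr.ss_family == AF_INET6)
		return ((struct sockaddr_in6 *)&addr)->sin6_port;

	return -1;
}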


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] selftests/bpf: Test bpf_sock_destroy
  2023-04-18 15:31 ` [PATCH 7/7] selftests/bpf: Test bpf_sock_destroy Aditi Ghag
@ 2023-04-24 19:20   ` Martin KaFai Lau
  0 siblings, 0 replies; 26+ messages in thread
From: Martin KaFai Lau @ 2023-04-24 19:20 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: sdf, edumazet, bpf

On 4/18/23 8:31 AM, Aditi Ghag wrote:
> The test cases for destroying sockets mirror the intended usages of the
> bpf_sock_destroy kfunc using iterators.
> 
> The destroy helpers set the `ECONNABORTED` error code that we can validate in
> the test code with client sockets. But UDP sockets have an overriding error
> code from the disconnect called during abort, so the error code validation
> is only done for TCP sockets.
> 
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   .../selftests/bpf/prog_tests/sock_destroy.c   | 217 ++++++++++++++++++
>   .../selftests/bpf/progs/sock_destroy_prog.c   | 147 ++++++++++++
>   2 files changed, 364 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>   create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> new file mode 100644
> index 000000000000..51f2454b7b4b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> @@ -0,0 +1,217 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <bpf/bpf_endian.h>
> +
> +#include "sock_destroy_prog.skel.h"
> +#include "network_helpers.h"
> +
> +#define TEST_NS "sock_destroy_netns"
> +
> +static void start_iter_sockets(struct bpf_program *prog)
> +{
> +	struct bpf_link *link;
> +	char buf[50] = {};
> +	int iter_fd, len;
> +
> +	link = bpf_program__attach_iter(prog, NULL);
> +	if (!ASSERT_OK_PTR(link, "attach_iter"))
> +		return;
> +
> +	iter_fd = bpf_iter_create(bpf_link__fd(link));
> +	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
> +		goto free_link;
> +
> +	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
> +		;
> +	ASSERT_GE(len, 0, "read");
> +
> +	close(iter_fd);
> +
> +free_link:
> +	bpf_link__destroy(link);
> +}
> +
> +static void test_tcp_client(struct sock_destroy_prog *skel)
> +{
> +	int serv = -1, clien = -1, n = 0;

Together with the multiple goto labels, all the ' = -1' init looks weird. May as 
well keep the ' = -1' init, always test for -1 before close(), and remove the 
need for multiple goto labels in each test_*() function.

The 'n = 0' init is not needed for sure.
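
Something along these lines would keep a single cleanup label per test_*() (sketch):

	int serv = -1, clien = -1;

	/* ... test body: every failure path can simply "goto cleanup" ... */

cleanup:
	if (clien != -1)
		close(clien);
	if (serv != -1)
		close(serv);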

> +
> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> +	if (!ASSERT_GE(serv, 0, "start_server"))
> +		goto cleanup_serv;
> +
> +	clien = connect_to_fd(serv, 0);
> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> +		goto cleanup_serv;
> +
> +	serv = accept(serv, NULL, NULL);
> +	if (!ASSERT_GE(serv, 0, "serv accept"))
> +		goto cleanup;
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_GE(n, 0, "client send"))

nit. Could be more strict. ASSERT_EQ(n, 1, ...)?

> +		goto cleanup;
> +
> +	/* Run iterator program that destroys connected client sockets. */
> +	start_iter_sockets(skel->progs.iter_tcp6_client);
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> +		goto cleanup;
> +	ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
> +
> +

Patchwork 
(https://patchwork.kernel.org/project/netdevbpf/patch/20230418153148.2231644-8-aditi.ghag@isovalent.com/) 
reports:

CHECK: Please don't use multiple blank lines
#90: FILE: tools/testing/selftests/bpf/prog_tests/sock_destroy.c:62:
+
+

CHECK: Please don't use multiple blank lines
#130: FILE: tools/testing/selftests/bpf/prog_tests/sock_destroy.c:102:
+
+

CHECK: Please don't use multiple blank lines
#137: FILE: tools/testing/selftests/bpf/prog_tests/sock_destroy.c:109:
+
+

> +cleanup:
> +	close(clien);
> +cleanup_serv:
> +	close(serv);
> +}
> +
> +static void test_tcp_server(struct sock_destroy_prog *skel)
> +{
> +	int serv = -1, clien = -1, n = 0, err;
> +	__u16 serv_port = 0;

serv_port init is also not needed.

> +
> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> +	if (!ASSERT_GE(serv, 0, "start_server"))
> +		goto cleanup_serv;
> +	err = get_socket_local_port(AF_INET6, serv, &serv_port);
> +	if (!ASSERT_EQ(err, 0, "get_local_port"))
> +		goto cleanup;

Like here, it is obvious that it should be 'goto cleanup_serv;'
Test -1 before close() and avoid this multiple goto label confusion.

> +	skel->bss->serv_port = serv_port;
> +
> +	clien = connect_to_fd(serv, 0);
> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> +		goto cleanup_serv;
> +
> +	serv = accept(serv, NULL, NULL);
> +	if (!ASSERT_GE(serv, 0, "serv accept"))
> +		goto cleanup;
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_GE(n, 0, "client send"))
> +		goto cleanup;
> +
> +	/* Run iterator program that destroys server sockets. */
> +	start_iter_sockets(skel->progs.iter_tcp6_server);
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> +		goto cleanup;
> +	ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
> +
> +
> +cleanup:
> +	close(clien);
> +cleanup_serv:
> +	close(serv);
> +}
> +
> +
> +static void test_udp_client(struct sock_destroy_prog *skel)
> +{
> +	int serv = -1, clien = -1, n = 0;
> +
> +	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 0, 0);
> +	if (!ASSERT_GE(serv, 0, "start_server"))
> +		goto cleanup_serv;
> +
> +	clien = connect_to_fd(serv, 0);
> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> +		goto cleanup_serv;
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_GE(n, 0, "client send"))
> +		goto cleanup;
> +
> +	/* Run iterator program that destroys sockets. */
> +	start_iter_sockets(skel->progs.iter_udp6_client);
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> +		goto cleanup;
> +	/* UDP sockets have an overriding error code after they are disconnected,
> +	 * so we don't check for ECONNABORTED error code.
> +	 */
> +
> +cleanup:
> +	close(clien);
> +cleanup_serv:
> +	close(serv);
> +}
> +
> +static void test_udp_server(struct sock_destroy_prog *skel)
> +{
> +	int *listen_fds = NULL, n, i, err;
> +	unsigned int num_listens = 5;
> +	char buf[1];
> +	__u16 serv_port;
> +
> +	/* Start reuseport servers. */
> +	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
> +					    "::1", 0, 0, num_listens);
> +	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
> +		goto cleanup;
> +	err = get_socket_local_port(AF_INET6, listen_fds[0], &serv_port);
> +	if (!ASSERT_EQ(err, 0, "get_local_port"))
> +		goto cleanup;
> +	skel->bss->serv_port = ntohs(serv_port);

This is different from test_tcp_server() which has serv_port in network order. 
Either order is fine but be consistent across tests.

> +
> +	/* Run iterator program that destroys server sockets. */
> +	start_iter_sockets(skel->progs.iter_udp6_server);
> +
> +	for (i = 0; i < num_listens; ++i) {
> +		n = read(listen_fds[i], buf, sizeof(buf));
> +		if (!ASSERT_EQ(n, -1, "read") ||
> +		    !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
> +			break;
> +	}
> +	ASSERT_EQ(i, num_listens, "server socket");
> +
> +cleanup:
> +	free_fds(listen_fds, num_listens);
> +}
> +
> +void test_sock_destroy(void)
> +{
> +	struct sock_destroy_prog *skel;
> +	struct nstoken *nstoken = NULL;

This init looks unnecessary.

> +	int cgroup_fd = 0;

Looks wrong to init a fd to 0. I don't think an init is needed either.

> +
> +	skel = sock_destroy_prog__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		return;
> +
> +	cgroup_fd = test__join_cgroup("/sock_destroy");
> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> +		goto close_cgroup_fd;
> +
> +	skel->links.sock_connect = bpf_program__attach_cgroup(
> +		skel->progs.sock_connect, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
> +		goto close_cgroup_fd;
> +
> +	SYS(fail, "ip netns add %s", TEST_NS);
> +	SYS(fail, "ip -net %s link set dev lo up", TEST_NS);
> +
> +	nstoken = open_netns(TEST_NS);
> +	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
> +		goto fail;
> +
> +	if (test__start_subtest("tcp_client"))
> +		test_tcp_client(skel);
> +	if (test__start_subtest("tcp_server"))
> +		test_tcp_server(skel);
> +	if (test__start_subtest("udp_client"))
> +		test_udp_client(skel);
> +	if (test__start_subtest("udp_server"))
> +		test_udp_server(skel);
> +
> +
> +fail:
> +	if (nstoken)
> +		close_netns(nstoken);
> +	SYS_NOFAIL("ip netns del " TEST_NS " &> /dev/null");
> +close_cgroup_fd:
> +	close(cgroup_fd);
> +	sock_destroy_prog__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> new file mode 100644
> index 000000000000..1f265e0d9dea
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> @@ -0,0 +1,147 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +#include "bpf_tracing_net.h"
> +
> +#define AF_INET6 10

Not needed. This has already been defined in bpf_tracing_net.h.

> +
> +__u16 serv_port = 0;
> +
> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, __u64);
> +} tcp_conn_sockets SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, __u64);
> +} udp_conn_sockets SEC(".maps");
> +
> +SEC("cgroup/connect6")
> +int sock_connect(struct bpf_sock_addr *ctx)
> +{
> +	int key = 0;
> +	__u64 sock_cookie = 0;
> +	__u32 keyc = 0;
> +
> +	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
> +		return 1;
> +
> +	sock_cookie = bpf_get_socket_cookie(ctx);
> +	if (ctx->protocol == IPPROTO_TCP)
> +		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
> +	else if (ctx->protocol == IPPROTO_UDP)
> +		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
> +	else
> +		return 1;
> +
> +	return 1;
> +}
> +
> +SEC("iter/tcp")
> +int iter_tcp6_client(struct bpf_iter__tcp *ctx)
> +{
> +	struct sock_common *sk_common = ctx->sk_common;
> +	__u64 sock_cookie = 0;
> +	__u64 *val;
> +	int key = 0;
> +
> +	if (!sk_common)
> +		return 0;
> +
> +	if (sk_common->skc_family != AF_INET6)
> +		return 0;
> +
> +	sock_cookie  = bpf_get_socket_cookie(sk_common);
> +	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
> +	if (!val)
> +		return 0;
> +	/* Destroy connected client sockets. */
> +	if (sock_cookie == *val)
> +		bpf_sock_destroy(sk_common);
> +
> +	return 0;
> +}
> +
> +SEC("iter/tcp")
> +int iter_tcp6_server(struct bpf_iter__tcp *ctx)
> +{
> +	struct sock_common *sk_common = ctx->sk_common;
> +	struct tcp6_sock *tcp_sk;
> +	const struct inet_connection_sock *icsk;
> +	const struct inet_sock *inet;
> +	__u16 srcp;
> +
> +	if (!sk_common)
> +		return 0;
> +
> +	if (sk_common->skc_family != AF_INET6)
> +		return 0;
> +
> +	tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
> +	if (!tcp_sk)
> +		return 0;
> +
> +	icsk = &tcp_sk->tcp.inet_conn;
> +	inet = &icsk->icsk_inet;
> +	srcp = inet->inet_sport;
> +
> +	/* Destroy server sockets. */
> +	if (srcp == serv_port)
> +		bpf_sock_destroy(sk_common);
> +
> +	return 0;
> +}
> +
> +
> +SEC("iter/udp")
> +int iter_udp6_client(struct bpf_iter__udp *ctx)
> +{
> +	struct udp_sock *udp_sk = ctx->udp_sk;
> +	struct sock *sk = (struct sock *) udp_sk;
> +	__u64 sock_cookie = 0, *val;
> +	int key = 0;
> +
> +	if (!sk)
> +		return 0;
> +
> +	sock_cookie  = bpf_get_socket_cookie(sk);
> +	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
> +	if (!val)
> +		return 0;
> +	/* Destroy connected client sockets. */
> +	if (sock_cookie == *val)
> +		bpf_sock_destroy((struct sock_common *)sk);
> +
> +	return 0;
> +}
> +
> +SEC("iter/udp")
> +int iter_udp6_server(struct bpf_iter__udp *ctx)
> +{
> +	struct udp_sock *udp_sk = ctx->udp_sk;
> +	struct sock *sk = (struct sock *) udp_sk;
> +	__u16 srcp;
> +	struct inet_sock *inet;
> +
> +	if (!sk)
> +		return 0;
> +
> +	inet = &udp_sk->inet;
> +	srcp = bpf_ntohs(inet->inet_sport);

Try sk->sk_num if host order is preferred.
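sk_num resolves to __sk_common.skc_num in the kernel headers, so with vmlinux.h 
the access would look something like this (sketch, assuming serv_port is then 
kept in host byte order):

	/* skc_num holds the local port in host byte order, no bpf_ntohs() needed */
	if (sk->__sk_common.skc_num == serv_port)
		bpf_sock_destroy((struct sock_common *)sk);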

> +	if (srcp == serv_port)
> +		bpf_sock_destroy((struct sock_common *)sk);
> +
> +	return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability
  2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
                   ` (6 preceding siblings ...)
  2023-04-18 15:31 ` [PATCH 7/7] selftests/bpf: Test bpf_sock_destroy Aditi Ghag
@ 2023-04-24 22:15 ` Martin KaFai Lau
  2023-05-01 23:32   ` Aditi Ghag
  7 siblings, 1 reply; 26+ messages in thread
From: Martin KaFai Lau @ 2023-04-24 22:15 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: sdf, edumazet, bpf

On 4/18/23 8:31 AM, Aditi Ghag wrote:
> This patch adds the capability to destroy sockets in BPF. We plan to use
> the capability in Cilium to force client sockets to reconnect when their
> remote load-balancing backends are deleted. The other use case is
> on-the-fly policy enforcement where existing socket connections prevented
> by policies need to be terminated.

If the earlier kfunc filter patch 
(https://lore.kernel.org/bpf/1ECC8AAA-C2E6-4F8A-B7D3-5E90BDEE7C48@isovalent.com/) 
looks fine to you, please include it into the next revision. This patchset needs 
it. Usual thing to do is to keep my sob (and author if not much has changed) and 
add your sob. The test needs to be broken out into a separate patch though. It 
needs to use the '__failure __msg("calling kernel function bpf_sock_destroy is 
not allowed")'. There are many examples in selftests, eg. the dynptr_fail.c.

Please also fix the subject in the patches. They are all missing the bpf-next 
and revision tag.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator
  2023-04-18 15:31 ` [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
  2023-04-20  8:55   ` Paolo Abeni
@ 2023-04-25  5:45   ` Yonghong Song
  2023-05-03 20:26     ` Aditi Ghag
  1 sibling, 1 reply; 26+ messages in thread
From: Yonghong Song @ 2023-04-25  5:45 UTC (permalink / raw)
  To: Aditi Ghag, bpf; +Cc: kafai, sdf, edumazet



On 4/18/23 8:31 AM, Aditi Ghag wrote:
> Previously, BPF TCP iterator was acquiring fast version of sock lock that
> disables the BH. This introduced a circular dependency with code paths that
> later acquire sockets hash table bucket lock.
> Replace the fast version of the sock lock with the slow one, which facilitates
> BPF programs executed from the iterator to destroy TCP listening sockets
> using the bpf_sock_destroy kfunc (implemented in follow-up commits).
> 
> Here is a stack trace that motivated this change:
> 
> ```
> 1) sock_lock with BH disabled + bucket lock
> 
> lock_acquire+0xcd/0x330
> _raw_spin_lock_bh+0x38/0x50
> inet_unhash+0x96/0xd0
> tcp_set_state+0x6a/0x210
> tcp_abort+0x12b/0x230
> bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
> bpf_iter_run_prog+0x1ff/0x340
> bpf_iter_tcp_seq_show+0xca/0x190
> bpf_seq_read+0x177/0x450
> vfs_read+0xc6/0x300
> ksys_read+0x69/0xf0
> do_syscall_64+0x3c/0x90
> entry_SYSCALL_64_after_hwframe+0x72/0xdc

IIUC, the above deadlock is due to

 > lock_acquire+0xcd/0x330
 > _raw_spin_lock_bh+0x38/0x50
 > inet_unhash+0x96/0xd0
 > tcp_set_state+0x6a/0x210
 > tcp_abort+0x12b/0x230
 > bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
 > bpf_iter_run_prog+0x1ff/0x340
 > ... lock_acquire for sock lock ...
 > bpf_iter_tcp_seq_show+0xca/0x190
 > bpf_seq_read+0x177/0x450
 > vfs_read+0xc6/0x300
 > ksys_read+0x69/0xf0
 > do_syscall_64+0x3c/0x90
 > entry_SYSCALL_64_after_hwframe+0x72/0xdc

It would be great to make it explicit in the stack trace so
it is clear where the circular dependency is.

> 
> 2) sock lock with BH enable
> 
> [    1.499968]   lock_acquire+0xcd/0x330
> [    1.500316]   _raw_spin_lock+0x33/0x40
> [    1.500670]   sk_clone_lock+0x146/0x520
> [    1.501030]   inet_csk_clone_lock+0x1b/0x110
> [    1.501433]   tcp_create_openreq_child+0x22/0x3f0
> [    1.501873]   tcp_v6_syn_recv_sock+0x96/0x940
> [    1.502284]   tcp_check_req+0x137/0x660
> [    1.502646]   tcp_v6_rcv+0xa63/0xe80
> [    1.502994]   ip6_protocol_deliver_rcu+0x78/0x590
> [    1.503434]   ip6_input_finish+0x72/0x140
> [    1.503818]   __netif_receive_skb_one_core+0x63/0xa0
> [    1.504281]   process_backlog+0x79/0x260
> [    1.504668]   __napi_poll.constprop.0+0x27/0x170
> [    1.505104]   net_rx_action+0x14a/0x2a0
> [    1.505469]   __do_softirq+0x165/0x510
> [    1.505842]   do_softirq+0xcd/0x100
> [    1.506172]   __local_bh_enable_ip+0xcc/0xf0
> [    1.506588]   ip6_finish_output2+0x2a8/0xb00
> [    1.506988]   ip6_finish_output+0x274/0x510
> [    1.507377]   ip6_xmit+0x319/0x9b0
> [    1.507726]   inet6_csk_xmit+0x12b/0x2b0
> [    1.508096]   __tcp_transmit_skb+0x549/0xc40
> [    1.508498]   tcp_rcv_state_process+0x362/0x1180

Similarly, it would be good to illustrate where the
deadlock is in this case.

> 
> ```
> 
> Acked-by: Stanislav Fomichev <sdf@google.com>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/ipv4/tcp_ipv4.c | 5 ++---
>   1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index ea370afa70ed..f2d370a9450f 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2962,7 +2962,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>   	struct bpf_iter_meta meta;
>   	struct bpf_prog *prog;
>   	struct sock *sk = v;
> -	bool slow;
>   	uid_t uid;
>   	int ret;
>   
> @@ -2970,7 +2969,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>   		return 0;
>   
>   	if (sk_fullsock(sk))
> -		slow = lock_sock_fast(sk);
> +		lock_sock(sk);
>   
>   	if (unlikely(sk_unhashed(sk))) {
>   		ret = SEQ_SKIP;
> @@ -2994,7 +2993,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>   
>   unlock:
>   	if (sk_fullsock(sk))
> -		unlock_sock_fast(sk, slow);
> +		release_sock(sk);
>   	return ret;
>   
>   }

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/7] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state
  2023-04-24  0:18   ` Martin KaFai Lau
@ 2023-05-01 22:39     ` Aditi Ghag
  0 siblings, 0 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-05-01 22:39 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Stanislav Fomichev, Eric Dumazet, Martin KaFai Lau, bpf



> On Apr 23, 2023, at 5:18 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 4/18/23 8:31 AM, Aditi Ghag wrote:
>> This is a preparatory commit to remove the field. The field was
>> previously shared between proc fs and BPF UDP socket iterators. As the
>> follow-up commits will decouple the implementation for the iterators,
>> remove the field. As for BPF socket iterator, filtering of sockets is
>> exepected to be done in BPF programs.
>> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  include/net/udp.h |  1 -
>>  net/ipv4/udp.c    | 34 ++++------------------------------
>>  2 files changed, 4 insertions(+), 31 deletions(-)
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index de4b528522bb..5cad44318d71 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -437,7 +437,6 @@ struct udp_seq_afinfo {
>>  struct udp_iter_state {
>>  	struct seq_net_private  p;
>>  	int			bucket;
>> -	struct udp_seq_afinfo	*bpf_seq_afinfo;
>>  };
>>    void *udp_seq_start(struct seq_file *seq, loff_t *pos);
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index c605d171eb2d..3c9eeee28678 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -2997,10 +2997,7 @@ static struct sock *udp_get_first(struct seq_file *seq, int start)
>>  	struct udp_table *udptable;
>>  	struct sock *sk;
>>  -	if (state->bpf_seq_afinfo)
>> -		afinfo = state->bpf_seq_afinfo;
>> -	else
>> -		afinfo = pde_data(file_inode(seq->file));
>> +	afinfo = pde_data(file_inode(seq->file));
> 
> I can see how this change will work after patch 4. However, this patch alone cannot work independently as is. The udp bpf iter still uses the udp_get_{first,next} and udp_seq_stop() up-to this patch.
> 
> First, patch 3 refactoring should be done before patch 2 here. The removal of 'struct udp_seq_afinfo *bpf_seq_afinfo' in patch 2 should be done when all the necessary refactoring is in-place first.
> 
> Also, this afinfo is passed to udp_get_table_afinfo(). How about renaming udp_get_table_afinfo() to udp_get_table_seq() and having it take the "seq" as the arg instead. This probably will deserve another refactoring patch before finally removing bpf_seq_afinfo. Something like this (un-compiled code):
> 
> static struct udp_table *udp_get_table_seq(struct seq_file *seq,
>                                           struct net *net)
> {
> 	const struct udp_seq_afinfo *afinfo;
> 
>        if (st->bpf_seq_afinfo)
>                return net->ipv4.udp_table;
> 
> 	afinfo = pde_data(file_inode(seq->file));
>        return afinfo->udp_table ? : net->ipv4.udp_table;
> }
> 
> Of course, when the later patch finally removes the bpf_seq_afinfo, the 'if (st->bpf_seq_afinfo)' test should be replaced with the 'if (seq->op == &bpf_iter_udp_seq_ops)' test.
> 
> That will also make the afinfo dance in bpf_iter_udp_batch() in patch 4 goes away.

Sweet! I suppose it was worth resolving a few conflicts while creating the new preparatory patch, especially since the refactoring removed the unnecessary setting of afinfo in bpf_iter_udp_batch(). The additional minor change that was needed was to forward declare bpf_iter_udp_seq_ops. And of course, the if (seq->op == &bpf_iter_udp_seq_ops) check needed to be wrapped in a CONFIG_BPF_SYSCALL ifdef, roughly as sketched below.
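
Roughly, the refactored helper then ends up looking like this (sketch of the 
changes described above; exact placement may differ in the respin):

#ifdef CONFIG_BPF_SYSCALL
static const struct seq_operations bpf_iter_udp_seq_ops;
#endif

static struct udp_table *udp_get_table_seq(struct seq_file *seq,
					   struct net *net)
{
	const struct udp_seq_afinfo *afinfo;

#ifdef CONFIG_BPF_SYSCALL
	if (seq->op == &bpf_iter_udp_seq_ops)
		return net->ipv4.udp_table;
#endif

	afinfo = pde_data(file_inode(seq->file));
	return afinfo->udp_table ? : net->ipv4.udp_table;
}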


> 
>>    	udptable = udp_get_table_afinfo(afinfo, net);
>>  @@ -3033,10 +3030,7 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk)
>>  	struct udp_seq_afinfo *afinfo;
>>  	struct udp_table *udptable;
>>  -	if (state->bpf_seq_afinfo)
>> -		afinfo = state->bpf_seq_afinfo;
>> -	else
>> -		afinfo = pde_data(file_inode(seq->file));
>> +	afinfo = pde_data(file_inode(seq->file));
>>    	do {
>>  		sk = sk_next(sk);
>> @@ -3094,10 +3088,7 @@ void udp_seq_stop(struct seq_file *seq, void *v)
>>  	struct udp_seq_afinfo *afinfo;
>>  	struct udp_table *udptable;
>>  -	if (state->bpf_seq_afinfo)
>> -		afinfo = state->bpf_seq_afinfo;
>> -	else
>> -		afinfo = pde_data(file_inode(seq->file));
>> +	afinfo = pde_data(file_inode(seq->file));
>>    	udptable = udp_get_table_afinfo(afinfo, seq_file_net(seq));
>>  @@ -3415,28 +3406,11 @@ DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
>>    static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>>  {
>> -	struct udp_iter_state *st = priv_data;
>> -	struct udp_seq_afinfo *afinfo;
>> -	int ret;
>> -
>> -	afinfo = kmalloc(sizeof(*afinfo), GFP_USER | __GFP_NOWARN);
>> -	if (!afinfo)
>> -		return -ENOMEM;
>> -
>> -	afinfo->family = AF_UNSPEC;
>> -	afinfo->udp_table = NULL;
>> -	st->bpf_seq_afinfo = afinfo;
>> -	ret = bpf_iter_init_seq_net(priv_data, aux);
>> -	if (ret)
>> -		kfree(afinfo);
>> -	return ret;
>> +	return bpf_iter_init_seq_net(priv_data, aux);
> 
> Nice simplification with the bpf_seq_afinfo cleanup.
> 
>>  }
>>    static void bpf_iter_fini_udp(void *priv_data)
>>  {
>> -	struct udp_iter_state *st = priv_data;
>> -
>> -	kfree(st->bpf_seq_afinfo);
>>  	bpf_iter_fini_seq_net(priv_data);
>>  }


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability
  2023-04-24 22:15 ` [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Martin KaFai Lau
@ 2023-05-01 23:32   ` Aditi Ghag
  2023-05-01 23:37     ` Aditi Ghag
  2023-05-02 22:52     ` Aditi Ghag
  0 siblings, 2 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-05-01 23:32 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Stanislav Fomichev, edumazet, bpf



> On Apr 24, 2023, at 3:15 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 4/18/23 8:31 AM, Aditi Ghag wrote:
>> This patch adds the capability to destroy sockets in BPF. We plan to use
>> the capability in Cilium to force client sockets to reconnect when their
>> remote load-balancing backends are deleted. The other use case is
>> on-the-fly policy enforcement where existing socket connections prevented
>> by policies need to be terminated.
> 
> If the earlier kfunc filter patch (https://lore.kernel.org/bpf/1ECC8AAA-C2E6-4F8A-B7D3-5E90BDEE7C48@isovalent.com/) looks fine to you, please include it into the next revision. This patchset needs it. Usual thing to do is to keep my sob (and author if not much has changed) and add your sob. The test needs to be broken out into a separate patch though. It needs to use the '__failure __msg("calling kernel function bpf_sock_destroy is not allowed")'. There are many examples in selftests, eg. the dynptr_fail.c.
> 

Yeah, ok. I was waiting for your confirmation. The patch doesn't need my sob though (maybe tested-by).
I've created a separate patch for the test. 


> Please also fix the subject in the patches. They are all missing the bpf-next and revision tag.
> 

Took me a few moments to realize that as I was looking at earlier series. Looks like I forgot to add the tags to subsequent patches in this series. I'll fix it up in the next push.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability
  2023-05-01 23:32   ` Aditi Ghag
@ 2023-05-01 23:37     ` Aditi Ghag
  2023-05-02 23:24       ` Martin KaFai Lau
  2023-05-02 22:52     ` Aditi Ghag
  1 sibling, 1 reply; 26+ messages in thread
From: Aditi Ghag @ 2023-05-01 23:37 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Stanislav Fomichev, Eric Dumazet, bpf



> On May 1, 2023, at 4:32 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> 
> 
> 
>> On Apr 24, 2023, at 3:15 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>> 
>> On 4/18/23 8:31 AM, Aditi Ghag wrote:
>>> This patch adds the capability to destroy sockets in BPF. We plan to use
>>> the capability in Cilium to force client sockets to reconnect when their
>>> remote load-balancing backends are deleted. The other use case is
>>> on-the-fly policy enforcement where existing socket connections prevented
>>> by policies need to be terminated.
>> 
>> If the earlier kfunc filter patch (https://lore.kernel.org/bpf/1ECC8AAA-C2E6-4F8A-B7D3-5E90BDEE7C48@isovalent.com/) looks fine to you, please include it into the next revision. This patchset needs it. Usual thing to do is to keep my sob (and author if not much has changed) and add your sob. The test needs to be broken out into a separate patch though. It needs to use the '__failure __msg("calling kernel function bpf_sock_destroy is not allowed")'. There are many examples in selftests, eg. the dynptr_fail.c.
>> 
> 
> Yeah, ok. I was waiting for your confirmation. The patch doesn't need my sob though (maybe tested-by).
> I've created a separate patch for the test. 
> 

Ah, looks like the patch is missing a proper description. While I can add something wrt the sock_destroy use case, if you have a blurb, feel free to post it here.

> 
>> Please also fix the subject in the patches. They are all missing the bpf-next and revision tag.
>> 
> 
> Took me a few moments to realize that as I was looking at earlier series. Looks like I forgot to add the tags to subsequent patches in this series. I'll fix it up in the next push.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability
  2023-05-01 23:32   ` Aditi Ghag
  2023-05-01 23:37     ` Aditi Ghag
@ 2023-05-02 22:52     ` Aditi Ghag
  2023-05-02 23:40       ` Martin KaFai Lau
  1 sibling, 1 reply; 26+ messages in thread
From: Aditi Ghag @ 2023-05-02 22:52 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Stanislav Fomichev, edumazet, bpf



> On May 1, 2023, at 4:32 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> 
> 
> 
>> On Apr 24, 2023, at 3:15 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>> 
>> On 4/18/23 8:31 AM, Aditi Ghag wrote:
>>> This patch adds the capability to destroy sockets in BPF. We plan to use
>>> the capability in Cilium to force client sockets to reconnect when their
>>> remote load-balancing backends are deleted. The other use case is
>>> on-the-fly policy enforcement where existing socket connections prevented
>>> by policies need to be terminated.
>> 
>> If the earlier kfunc filter patch (https://lore.kernel.org/bpf/1ECC8AAA-C2E6-4F8A-B7D3-5E90BDEE7C48@isovalent.com/) looks fine to you, please include it into the next revision. This patchset needs it. Usual thing to do is to keep my sob (and author if not much has changed) and add your sob. The test needs to be broken out into a separate patch though. It needs to use the '__failure __msg("calling kernel function bpf_sock_destroy is not allowed")'. There are many examples in selftests, eg. the dynptr_fail.c.
>> 
> 
> Yeah, ok. I was waiting for your confirmation. The patch doesn't need my sob though (maybe tested-by).
> I've created a separate patch for the test. 


Here is the patch diff for the extended test case for your reference. I'm ready to push a new version once I get an ack from you. 

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
index a889c53e93c7..afed8cad94ee 100644
--- a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
+++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
@@ -3,6 +3,7 @@
 #include <bpf/bpf_endian.h>

 #include "sock_destroy_prog.skel.h"
+#include "sock_destroy_prog_fail.skel.h"
 #include "network_helpers.h"

 #define TEST_NS "sock_destroy_netns"
@@ -207,6 +208,8 @@ void test_sock_destroy(void)
                test_udp_server(skel);


+       RUN_TESTS(sock_destroy_prog_fail);
+
 cleanup:
        if (nstoken)
                close_netns(nstoken);
diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c
new file mode 100644
index 000000000000..dd6850b58e25
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+int bpf_sock_destroy(struct sock_common *sk) __ksym;
+
+SEC("tp_btf/tcp_destroy_sock")
+__failure __msg("calling kernel function bpf_sock_destroy is not allowed")
+int BPF_PROG(trace_tcp_destroy_sock, struct sock *sk)
+{
+       /* should not load */
+       bpf_sock_destroy((struct sock_common *)sk);
+
+       return 0;
+}

> 
> 
>> Please also fix the subject in the patches. They are all missing the bpf-next and revision tag.
>> 
> 
> Took me a few moments to realize that as I was looking at earlier series. Looks like I forgot to add the tags to subsequent patches in this series. I'll fix it up in the next push.


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability
  2023-05-01 23:37     ` Aditi Ghag
@ 2023-05-02 23:24       ` Martin KaFai Lau
  0 siblings, 0 replies; 26+ messages in thread
From: Martin KaFai Lau @ 2023-05-02 23:24 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: Stanislav Fomichev, Eric Dumazet, bpf

On 5/1/23 4:37 PM, Aditi Ghag wrote:
> 
> 
>> On May 1, 2023, at 4:32 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>>
>>
>>
>>> On Apr 24, 2023, at 3:15 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 4/18/23 8:31 AM, Aditi Ghag wrote:
>>>> This patch adds the capability to destroy sockets in BPF. We plan to use
>>>> the capability in Cilium to force client sockets to reconnect when their
>>>> remote load-balancing backends are deleted. The other use case is
>>>> on-the-fly policy enforcement where existing socket connections prevented
>>>> by policies need to be terminated.
>>>
>>> If the earlier kfunc filter patch (https://lore.kernel.org/bpf/1ECC8AAA-C2E6-4F8A-B7D3-5E90BDEE7C48@isovalent.com/) looks fine to you, please include it into the next revision. This patchset needs it. Usual thing to do is to keep my sob (and author if not much has changed) and add your sob. The test needs to be broken out into a separate patch though. It needs to use the '__failure __msg("calling kernel function bpf_sock_destroy is not allowed")'. There are many examples in selftests, eg. the dynptr_fail.c.
>>>
>>
>> Yeah, ok. I was waiting for your confirmation. The patch doesn't need my sob though (maybe tested-by).
>> I've created a separate patch for the test.

I believe your sob is still needed since you will post the patch.

>>
> 
> Ah, looks like the patch is missing a proper description. While I can add something wrt sock_destroy use case, if you have a blurb, feel free to post it here.

Right, some of the RFC commit message is irrelevant. You can develop the 
description based on the useful part of the RFC commit message, like "... added 
a callback filter to 'struct btf_kfunc_id_set'. The filter has
access to the prog such that it can filter by other properties of a prog.
The prog->expected_attached_type is used in the tracing_iter_filter() ...". This 
is the how part. You need to explain why the patch is needed in the commit 
message also.

> 
>>
>>> Please also fix the subject in the patches. They are all missing the bpf-next and revision tag.
>>>
>>
>> Took me a few moments to realize that as I was looking at earlier series. Looks like I forgot to add the tags to subsequent patches in this series. I'll fix it up in the next push.
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability
  2023-05-02 22:52     ` Aditi Ghag
@ 2023-05-02 23:40       ` Martin KaFai Lau
  2023-05-04 17:32         ` Aditi Ghag
  0 siblings, 1 reply; 26+ messages in thread
From: Martin KaFai Lau @ 2023-05-02 23:40 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: Stanislav Fomichev, edumazet, bpf

On 5/2/23 3:52 PM, Aditi Ghag wrote:
> 
> 
>> On May 1, 2023, at 4:32 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>>
>>
>>
>>> On Apr 24, 2023, at 3:15 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 4/18/23 8:31 AM, Aditi Ghag wrote:
>>>> This patch adds the capability to destroy sockets in BPF. We plan to use
>>>> the capability in Cilium to force client sockets to reconnect when their
>>>> remote load-balancing backends are deleted. The other use case is
>>>> on-the-fly policy enforcement where existing socket connections prevented
>>>> by policies need to be terminated.
>>>
>>> If the earlier kfunc filter patch (https://lore.kernel.org/bpf/1ECC8AAA-C2E6-4F8A-B7D3-5E90BDEE7C48@isovalent.com/) looks fine to you, please include it into the next revision. This patchset needs it. Usual thing to do is to keep my sob (and author if not much has changed) and add your sob. The test needs to be broken out into a separate patch though. It needs to use the '__failure __msg("calling kernel function bpf_sock_destroy is not allowed")'. There are many examples in selftests, eg. the dynptr_fail.c.
>>>
>>
>> Yeah, ok. I was waiting for your confirmation. The patch doesn't need my sob though (maybe tested-by).
>> I've created a separate patch for the test.
> 
> 
> Here is the patch diff for the extended test case for your reference. I'm ready to push a new version once I get an ack from you.
Looks reasonable to me.

One thing I have been thinking is that the bpf_sock_destroy kfunc should need 
KF_TRUSTED_ARGS, but I suspect that may need a change in the tcp_reg_info in 
tcp_ipv4.c. Not sure yet. Regardless, I don't think this will have a major 
effect on other patches in this set. Please go ahead and respin, considering 
there are already a few comments that need to be addressed. At worst it can take 
one final revision to address KF_TRUSTED_ARGS.
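
fwiw, on the registration side that would just mean adding the flag to whatever 
kfunc id set the bpf_sock_destroy patch registers, something like (sketch; the 
set name here is only illustrative):

BTF_SET8_START(bpf_sk_iter_kfunc_ids)
BTF_ID_FLAGS(func, bpf_sock_destroy, KF_TRUSTED_ARGS)
BTF_SET8_END(bpf_sk_iter_kfunc_ids)

The harder part is whether the sock pointer coming from the iterator ctx is 
already treated as trusted, which is presumably where tcp_reg_info comes in.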
[ btw, I don't see your reply/confirmation on the Patch 1 discussion also. 
Please ensure those will also be clarified/addressed in the next respin. ]


> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> index a889c53e93c7..afed8cad94ee 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> @@ -3,6 +3,7 @@
>   #include <bpf/bpf_endian.h>
> 
>   #include "sock_destroy_prog.skel.h"
> +#include "sock_destroy_prog_fail.skel.h"
>   #include "network_helpers.h"
> 
>   #define TEST_NS "sock_destroy_netns"
> @@ -207,6 +208,8 @@ void test_sock_destroy(void)
>                  test_udp_server(skel);
> 
> 
> +       RUN_TESTS(sock_destroy_prog_fail);
> +
>   cleanup:
>          if (nstoken)
>                  close_netns(nstoken);
> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c
> new file mode 100644
> index 000000000000..dd6850b58e25
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_helpers.h>
> +
> +#include "bpf_misc.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
> +
> +SEC("tp_btf/tcp_destroy_sock")
> +__failure __msg("calling kernel function bpf_sock_destroy is not allowed")
> +int BPF_PROG(trace_tcp_destroy_sock, struct sock *sk)
> +{
> +       /* should not load */
> +       bpf_sock_destroy((struct sock_common *)sk);
> +
> +       return 0;
> +}
> 
>>
>>
>>> Please also fix the subject in the patches. They are all missing the bpf-next and revision tag.
>>>
>>
>> Took me a few moments to realize that as I was looking at earlier series. Looks like I forgot to add the tags to subsequent patches in this series. I'll fix it up in the next push.
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator
  2023-04-20  8:55   ` Paolo Abeni
@ 2023-05-03 20:25     ` Aditi Ghag
  0 siblings, 0 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-05-03 20:25 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: bpf, kafai, sdf, edumazet



> On Apr 20, 2023, at 1:55 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> 
> Hi,
> 
> On Tue, 2023-04-18 at 15:31 +0000, Aditi Ghag wrote:
>> Previously, BPF TCP iterator was acquiring fast version of sock lock that
>> disables the BH. This introduced a circular dependency with code paths that
>> later acquire sockets hash table bucket lock.
>> Replace the fast version of the sock lock with the slow one, which facilitates
>> BPF programs executed from the iterator to destroy TCP listening sockets
>> using the bpf_sock_destroy kfunc (implemented in follow-up commits).
>> 
>> Here is a stack trace that motivated this change:
>> 
>> ```
>> 1) sock_lock with BH disabled + bucket lock
>> 
>> lock_acquire+0xcd/0x330
>> _raw_spin_lock_bh+0x38/0x50
>> inet_unhash+0x96/0xd0
>> tcp_set_state+0x6a/0x210
>> tcp_abort+0x12b/0x230
>> bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
>> bpf_iter_run_prog+0x1ff/0x340
>> bpf_iter_tcp_seq_show+0xca/0x190
>> bpf_seq_read+0x177/0x450
>> vfs_read+0xc6/0x300
>> ksys_read+0x69/0xf0
>> do_syscall_64+0x3c/0x90
>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> 
>> 2) sock lock with BH enable
>> 
>> [    1.499968]   lock_acquire+0xcd/0x330
>> [    1.500316]   _raw_spin_lock+0x33/0x40
> 
> The above is quite confusing to me: here BH is disabled as well
> (otherwise the whole softirq processing would be really broken)

Hi! Looks like the commit warrants a more detailed explanation.
I've annotated the relevant stack traces with the locks in question (also updated the patch description locally):

lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> sock lock acquired with BH enabled
sk_clone_lock+0x146/0x520
inet_csk_clone_lock+0x1b/0x110
tcp_create_openreq_child+0x22/0x3f0
tcp_v6_syn_recv_sock+0x96/0x940


lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock (may cause deadlock if interrupted)
__inet_hash+0x4b/0x210
inet_csk_listen_start+0xe6/0x100
inet_listen+0x95/0x1d0
__sys_listen+0x69/0xb0
__x64_sys_listen+0x14/0x20
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc


lock_acquire+0xcd/0x330
_raw_spin_lock+0x33/0x40
------> Acquire (bucket) lhash2.lock
inet_unhash+0x9a/0x110
tcp_set_state+0x6a/0x210
tcp_abort+0x10d/0x200
bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
bpf_iter_run_prog+0x1ff/0x340
------> Release (bucket) lhash2.lock
bpf_iter_tcp_seq_show+0xca/0x190
------> Acquire (bucket) lhash2.lock
------> sock lock acquired with BH disabled
bpf_seq_read+0x177/0x450

---------------------------------------------------------------------------------
Here is the full stack trace:

[    1.543344] and this task is already holding:
[    1.543786] ffffa37903b400b0 (slock-AF_INET6){+.-.}-{2:2}, at: __lock_sock_fast+0x33/0x70
[    1.544410] which would create a new lock dependency:
[    1.544797]  (slock-AF_INET6){+.-.}-{2:2} -> (&h->lhash2[i].lock){+.+.}-{2:2}
[    1.545361]
[    1.545361] but this new dependency connects a SOFTIRQ-irq-safe lock:
[    1.545961]  (slock-AF_INET6){+.-.}-{2:2}
[    1.545963]
[    1.545963] ... which became SOFTIRQ-irq-safe at:
[    1.546745]   lock_acquire+0xcd/0x330
[    1.547033]   _raw_spin_lock+0x33/0x40
[    1.547325]   sk_clone_lock+0x146/0x520
[    1.547623]   inet_csk_clone_lock+0x1b/0x110
[    1.547960]   tcp_create_openreq_child+0x22/0x3f0
[    1.548327]   tcp_v6_syn_recv_sock+0x96/0x940
[    1.548672]   tcp_check_req+0x13f/0x640
[    1.548977]   tcp_v6_rcv+0xa62/0xe80
[    1.549258]   ip6_protocol_deliver_rcu+0x78/0x590
[    1.549621]   ip6_input_finish+0x72/0x140
[    1.549931]   __netif_receive_skb_one_core+0x63/0xa0
[    1.550313]   process_backlog+0x79/0x260
[    1.550619]   __napi_poll.constprop.0+0x27/0x170
[    1.550976]   net_rx_action+0x14a/0x2a0
[    1.551272]   __do_softirq+0x165/0x510
[    1.551563]   do_softirq+0xcd/0x100
[    1.551836]   __local_bh_enable_ip+0xcc/0xf0
[    1.552168]   ip6_finish_output2+0x27c/0xb10
[    1.552500]   ip6_finish_output+0x274/0x510
[    1.552823]   ip6_xmit+0x319/0x9b0
[    1.553095]   inet6_csk_xmit+0x12b/0x2b0
[    1.553398]   __tcp_transmit_skb+0x543/0xc30
[    1.553731]   tcp_rcv_state_process+0x362/0x1180
[    1.554088]   tcp_v6_do_rcv+0x10f/0x540
[    1.554387]   __release_sock+0x6a/0xe0
[    1.554679]   release_sock+0x2f/0xb0
[    1.554957]   __inet_stream_connect+0x1ac/0x3a0
[    1.555308]   inet_stream_connect+0x3b/0x60
[    1.555632]   __sys_connect+0xa3/0xd0
[    1.555915]   __x64_sys_connect+0x18/0x20
[    1.556222]   do_syscall_64+0x3c/0x90
[    1.556510]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.556909]
[    1.556909] to a SOFTIRQ-irq-unsafe lock:
[    1.557326]  (&h->lhash2[i].lock){+.+.}-{2:2}
[    1.557329]
[    1.557329] ... which became SOFTIRQ-irq-unsafe at:
[    1.558148] ...
[    1.558149]   lock_acquire+0xcd/0x330
[    1.558579]   _raw_spin_lock+0x33/0x40
[    1.558874]   __inet_hash+0x4b/0x210
[    1.559154]   inet_csk_listen_start+0xe6/0x100
[    1.559503]   inet_listen+0x95/0x1d0
[    1.559782]   __sys_listen+0x69/0xb0
[    1.560063]   __x64_sys_listen+0x14/0x20
[    1.560365]   do_syscall_64+0x3c/0x90
[    1.560652]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.561052]
[    1.561052] other info that might help us debug this:
[    1.561052]
[    1.561658]  Possible interrupt unsafe locking scenario:
[    1.561658]
[    1.562171]        CPU0                             CPU1
[    1.562521]        ----                                  ----
[    1.562870]   lock(&h->lhash2[i].lock);
[    1.563167]                                               local_irq_disable();
[    1.563618]                                               lock(slock-AF_INET6);
[    1.564076]                                               lock(&h->lhash2[i].lock);
[    1.564558]   <Interrupt>
[    1.564763]     lock(slock-AF_INET6);
[    1.565053]
[    1.565053]  *** DEADLOCK ***
[    1.565053]
[    1.565505] 4 locks held by test_progs/117:
[    1.565829]  #0: ffffa37903af60a0 (&p->lock){+.+.}-{3:3}, at: bpf_seq_read+0x3f/0x450
[    1.566428]  #1: ffffa37903b40130 (sk_lock-AF_INET6){+.+.}-{0:0}, at: bpf_seq_read+0x177/0x450
[    1.567091]  #2: ffffa37903b400b0 (slock-AF_INET6){+.-.}-{2:2}, at: __lock_sock_fast+0x33/0x70
[    1.567780]  #3: ffffffffb4f966c0 (rcu_read_lock){....}-{1:2}, at: bpf_iter_run_prog+0x18b/0x340
[    1.568453]
[    1.568453] the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[    1.569132] -> (slock-AF_INET6){+.-.}-{2:2} {
[    1.569469]    HARDIRQ-ON-W at:
[    1.569717]                     lock_acquire+0xcd/0x330
[    1.570117]                     _raw_spin_lock_bh+0x38/0x50
[    1.570543]                     lock_sock_nested+0x50/0x80
[    1.570958]                     sk_setsockopt+0x776/0x16b0
[    1.571376]                     __sys_setsockopt+0x193/0x1b0
[    1.571809]                     __x64_sys_setsockopt+0x1f/0x30
[    1.572259]                     do_syscall_64+0x3c/0x90
[    1.572671]                     entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.573185]    IN-SOFTIRQ-W at:
[    1.573433]                     lock_acquire+0xcd/0x330
[    1.573835]                     _raw_spin_lock+0x33/0x40
[    1.574242]                     sk_clone_lock+0x146/0x520
[    1.574657]                     inet_csk_clone_lock+0x1b/0x110
[    1.575106]                     tcp_create_openreq_child+0x22/0x3f0
[    1.575585]                     tcp_v6_syn_recv_sock+0x96/0x940
[    1.576042]                     tcp_check_req+0x13f/0x640
[    1.576456]                     tcp_v6_rcv+0xa62/0xe80
[    1.576853]                     ip6_protocol_deliver_rcu+0x78/0x590
[    1.577337]                     ip6_input_finish+0x72/0x140
[    1.577772]                     __netif_receive_skb_one_core+0x63/0xa0
[    1.578272]                     process_backlog+0x79/0x260
[    1.578696]                     __napi_poll.constprop.0+0x27/0x170
[    1.579169]                     net_rx_action+0x14a/0x2a0
[    1.579586]                     __do_softirq+0x165/0x510
[    1.579995]                     do_softirq+0xcd/0x100
[    1.580381]                     __local_bh_enable_ip+0xcc/0xf0
[    1.580831]                     ip6_finish_output2+0x27c/0xb10
[    1.581289]                     ip6_finish_output+0x274/0x510
[    1.581733]                     ip6_xmit+0x319/0x9b0
[    1.582118]                     inet6_csk_xmit+0x12b/0x2b0
[    1.582540]                     __tcp_transmit_skb+0x543/0xc30
[    1.582988]                     tcp_rcv_state_process+0x362/0x1180
[    1.583464]                     tcp_v6_do_rcv+0x10f/0x540
[    1.583882]                     __release_sock+0x6a/0xe0
[    1.584291]                     release_sock+0x2f/0xb0
[    1.584686]                     __inet_stream_connect+0x1ac/0x3a0
[    1.585157]                     inet_stream_connect+0x3b/0x60
[    1.585598]                     __sys_connect+0xa3/0xd0
[    1.586001]                     __x64_sys_connect+0x18/0x20
[    1.586429]                     do_syscall_64+0x3c/0x90
[    1.586832]                     entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.587346]    INITIAL USE at:
[    1.587590]                    lock_acquire+0xcd/0x330
[    1.587985]                    _raw_spin_lock_bh+0x38/0x50
[    1.588407]                    lock_sock_nested+0x50/0x80
[    1.588821]                    sk_setsockopt+0x776/0x16b0
[    1.589241]                    __sys_setsockopt+0x193/0x1b0
[    1.589666]                    __x64_sys_setsockopt+0x1f/0x30
[    1.590108]                    do_syscall_64+0x3c/0x90
[    1.590505]                    entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.591007]  }
[    1.591144]  ... key      at: [<ffffffffb6441bc0>] af_family_slock_keys+0xa0/0x2e0
[    1.591721]
[    1.591721] the dependencies between the lock to be acquired
[    1.591722]  and SOFTIRQ-irq-unsafe lock:
[    1.592596] -> (&h->lhash2[i].lock){+.+.}-{2:2} {
[    1.592968]    HARDIRQ-ON-W at:
[    1.593209]                     lock_acquire+0xcd/0x330
[    1.593604]                     _raw_spin_lock+0x33/0x40
[    1.594002]                     __inet_hash+0x4b/0x210
[    1.594387]                     inet_csk_listen_start+0xe6/0x100
[    1.594839]                     inet_listen+0x95/0x1d0
[    1.595217]                     __sys_listen+0x69/0xb0
[    1.595594]                     __x64_sys_listen+0x14/0x20
[    1.595997]                     do_syscall_64+0x3c/0x90
[    1.596386]                     entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.596891]    SOFTIRQ-ON-W at:
[    1.597133]                     lock_acquire+0xcd/0x330
[    1.597537]                     _raw_spin_lock+0x33/0x40
[    1.597930]                     __inet_hash+0x4b/0x210
[    1.598308]                     inet_csk_listen_start+0xe6/0x100
[    1.598750]                     inet_listen+0x95/0x1d0
[    1.599129]                     __sys_listen+0x69/0xb0
[    1.599509]                     __x64_sys_listen+0x14/0x20
[    1.599912]                     do_syscall_64+0x3c/0x90
[    1.600303]                     entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.600799]    INITIAL USE at:
[    1.601037]                    lock_acquire+0xcd/0x330
[    1.601416]                    _raw_spin_lock+0x33/0x40
[    1.601801]                    __inet_hash+0x4b/0x210
[    1.602178]                    inet_csk_listen_start+0xe6/0x100
[    1.602618]                    inet_listen+0x95/0x1d0
[    1.602995]                    __sys_listen+0x69/0xb0
[    1.603368]                    __x64_sys_listen+0x14/0x20
[    1.603766]                    do_syscall_64+0x3c/0x90
[    1.604150]                    entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.604647]  }
[    1.604780]  ... key      at: [<ffffffffb64567a0>] __key.1+0x0/0x10
[    1.605235]  ... acquired at:
[    1.605470]    lock_acquire+0xcd/0x330
[    1.605748]    _raw_spin_lock+0x33/0x40
[    1.606017]    inet_unhash+0x9a/0x110
[    1.606275]    tcp_set_state+0x6a/0x210
[    1.606552]    tcp_abort+0x10d/0x200
[    1.606807]    bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
[    1.607243]    bpf_iter_run_prog+0x1ff/0x340
[    1.607553]    bpf_iter_tcp_seq_show+0xca/0x190
[    1.607874]    bpf_seq_read+0x177/0x450
[    1.608145]    vfs_read+0xc6/0x300
[    1.608389]    ksys_read+0x69/0xf0
[    1.608631]    do_syscall_64+0x3c/0x90
[    1.608903]    entry_SYSCALL_64_after_hwframe+0x72/0xdc

> 
> Thanks,
> 
> Paolo


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator
  2023-04-25  5:45   ` Yonghong Song
@ 2023-05-03 20:26     ` Aditi Ghag
  0 siblings, 0 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-05-03 20:26 UTC (permalink / raw)
  To: Yonghong Song; +Cc: bpf, kafai, sdf, edumazet



> On Apr 24, 2023, at 10:45 PM, Yonghong Song <yhs@meta.com> wrote:
> 
> 
> 
> On 4/18/23 8:31 AM, Aditi Ghag wrote:
>> Previously, the BPF TCP iterator was acquiring the fast version of the sock
>> lock, which disables BH. This introduced a circular dependency with code paths
>> that later acquire the sockets hash table bucket lock.
>> Replace the fast version of the sock lock with the slow one so that BPF
>> programs executed from the iterator can destroy TCP listening sockets
>> using the bpf_sock_destroy kfunc (implemented in follow-up commits).
>> Here is a stack trace that motivated this change:
>> ```
>> 1) sock_lock with BH disabled + bucket lock
>> lock_acquire+0xcd/0x330
>> _raw_spin_lock_bh+0x38/0x50
>> inet_unhash+0x96/0xd0
>> tcp_set_state+0x6a/0x210
>> tcp_abort+0x12b/0x230
>> bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
>> bpf_iter_run_prog+0x1ff/0x340
>> bpf_iter_tcp_seq_show+0xca/0x190
>> bpf_seq_read+0x177/0x450
>> vfs_read+0xc6/0x300
>> ksys_read+0x69/0xf0
>> do_syscall_64+0x3c/0x90
>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> 
> IIUC, the above deadlock is due to
> 
> > lock_acquire+0xcd/0x330
> > _raw_spin_lock_bh+0x38/0x50
> > inet_unhash+0x96/0xd0
> > tcp_set_state+0x6a/0x210
> > tcp_abort+0x12b/0x230
> > bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
> > bpf_iter_run_prog+0x1ff/0x340
> > ... lock_acquire for sock lock ...
> > bpf_iter_tcp_seq_show+0xca/0x190
> > bpf_seq_read+0x177/0x450
> > vfs_read+0xc6/0x300
> > ksys_read+0x69/0xf0
> > do_syscall_64+0x3c/0x90
> > entry_SYSCALL_64_after_hwframe+0x72/0xdc
> 
> It would be great to make it explicit with the stack trace so
> it is clear where the circular dependency is.
> 
>> 2) sock lock with BH enable
>> [    1.499968]   lock_acquire+0xcd/0x330
>> [    1.500316]   _raw_spin_lock+0x33/0x40
>> [    1.500670]   sk_clone_lock+0x146/0x520
>> [    1.501030]   inet_csk_clone_lock+0x1b/0x110
>> [    1.501433]   tcp_create_openreq_child+0x22/0x3f0
>> [    1.501873]   tcp_v6_syn_recv_sock+0x96/0x940
>> [    1.502284]   tcp_check_req+0x137/0x660
>> [    1.502646]   tcp_v6_rcv+0xa63/0xe80
>> [    1.502994]   ip6_protocol_deliver_rcu+0x78/0x590
>> [    1.503434]   ip6_input_finish+0x72/0x140
>> [    1.503818]   __netif_receive_skb_one_core+0x63/0xa0
>> [    1.504281]   process_backlog+0x79/0x260
>> [    1.504668]   __napi_poll.constprop.0+0x27/0x170
>> [    1.505104]   net_rx_action+0x14a/0x2a0
>> [    1.505469]   __do_softirq+0x165/0x510
>> [    1.505842]   do_softirq+0xcd/0x100
>> [    1.506172]   __local_bh_enable_ip+0xcc/0xf0
>> [    1.506588]   ip6_finish_output2+0x2a8/0xb00
>> [    1.506988]   ip6_finish_output+0x274/0x510
>> [    1.507377]   ip6_xmit+0x319/0x9b0
>> [    1.507726]   inet6_csk_xmit+0x12b/0x2b0
>> [    1.508096]   __tcp_transmit_skb+0x549/0xc40
>> [    1.508498]   tcp_rcv_state_process+0x362/0x1180
> 
> Similarly, it would be good to illustrate where the
> deadlock is in this case.


Ack: responded to Paolo's message directly with more details. Thanks!

> 
>> ```
>> Acked-by: Stanislav Fomichev <sdf@google.com>
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  net/ipv4/tcp_ipv4.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index ea370afa70ed..f2d370a9450f 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -2962,7 +2962,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>  	struct bpf_iter_meta meta;
>>  	struct bpf_prog *prog;
>>  	struct sock *sk = v;
>> -	bool slow;
>>  	uid_t uid;
>>  	int ret;
>>  @@ -2970,7 +2969,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>  		return 0;
>>    	if (sk_fullsock(sk))
>> -		slow = lock_sock_fast(sk);
>> +		lock_sock(sk);
>>    	if (unlikely(sk_unhashed(sk))) {
>>  		ret = SEQ_SKIP;
>> @@ -2994,7 +2993,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>    unlock:
>>  	if (sk_fullsock(sk))
>> -		unlock_sock_fast(sk, slow);
>> +		release_sock(sk);
>>  	return ret;
>>    }


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability
  2023-05-02 23:40       ` Martin KaFai Lau
@ 2023-05-04 17:32         ` Aditi Ghag
  0 siblings, 0 replies; 26+ messages in thread
From: Aditi Ghag @ 2023-05-04 17:32 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Stanislav Fomichev, bpf



> On May 2, 2023, at 4:40 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 5/2/23 3:52 PM, Aditi Ghag wrote:
>>> On May 1, 2023, at 4:32 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>>> 
>>> 
>>> 
>>>> On Apr 24, 2023, at 3:15 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>> 
>>>> On 4/18/23 8:31 AM, Aditi Ghag wrote:
>>>>> This patch adds the capability to destroy sockets in BPF. We plan to use
>>>>> the capability in Cilium to force client sockets to reconnect when their
>>>>> remote load-balancing backends are deleted. The other use case is
>>>>> on-the-fly policy enforcement where existing socket connections prevented
>>>>> by policies need to be terminated.
>>>> 
>>>> If the earlier kfunc filter patch (https://lore.kernel.org/bpf/1ECC8AAA-C2E6-4F8A-B7D3-5E90BDEE7C48@isovalent.com/) looks fine to you, please include it in the next revision. This patchset needs it. The usual thing to do is to keep my sob (and author if not much has changed) and add your sob. The test needs to be broken out into a separate patch though. It needs to use the '__failure __msg("calling kernel function bpf_sock_destroy is not allowed")' annotation. There are many examples in selftests, e.g. dynptr_fail.c.
>>>> 
>>> 
>>> Yeah, ok. I was waiting for your confirmation. The patch doesn't need my sob though (maybe tested-by).
>>> I've created a separate patch for the test.
>> Here is the patch diff for the extended test case for your reference. I'm ready to push a new version once I get an ack from you.
> Looks reasonable to me.
> 
> One thing I have been thinking is that the bpf_sock_destroy kfunc should need KF_TRUSTED_ARGS, but I suspect that may need a change in the tcp_reg_info in tcp_ipv4.c. Not sure yet. Regardless, I don't think this will have a major effect on other patches in this set. Please go ahead and respin, considering there are a few comments that need to be addressed already. At worst it can use one final revision to address KF_TRUSTED_ARGS.

Pushed the next revision.
Re KF_TRUSTED_ARGS: Looking at the description in the README, it's not entirely clear to me why we need it here now that we are restricting the kfunc to only the iterator programs. Is it somehow related to being able to ensure that the socket argument is locked?
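If it is just a matter of tagging the kfunc at registration time, I assume it would look roughly like the sketch below (the id-set name, the init hook, and the filter callback are placeholders on my side, not necessarily what the patch will end up doing):

```
/* Hypothetical sketch: attach KF_TRUSTED_ARGS to the bpf_sock_destroy
 * kfunc at registration. Names below are illustrative only.
 */
BTF_SET8_START(bpf_sk_iter_kfunc_ids)
BTF_ID_FLAGS(func, bpf_sock_destroy, KF_TRUSTED_ARGS)
BTF_SET8_END(bpf_sk_iter_kfunc_ids)

static const struct btf_kfunc_id_set bpf_sk_iter_kfunc_set = {
	.owner  = THIS_MODULE,
	.set    = &bpf_sk_iter_kfunc_ids,
	.filter = tracing_iter_filter,	/* restrict to iterator programs */
};

static int __init bpf_sock_destroy_kfunc_init(void)
{
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
					 &bpf_sk_iter_kfunc_set);
}
late_initcall(bpf_sock_destroy_kfunc_init);
```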

> [ btw, I don't see your reply/confirmation on the Patch 1 discussion also. Please ensure those will also be clarified/addressed in the next respin. ]
> 
> 
>> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>> index a889c53e93c7..afed8cad94ee 100644
>> --- a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>> @@ -3,6 +3,7 @@
>>  #include <bpf/bpf_endian.h>
>>  #include "sock_destroy_prog.skel.h"
>> +#include "sock_destroy_prog_fail.skel.h"
>>  #include "network_helpers.h"
>>  #define TEST_NS "sock_destroy_netns"
>> @@ -207,6 +208,8 @@ void test_sock_destroy(void)
>>                 test_udp_server(skel);
>> +       RUN_TESTS(sock_destroy_prog_fail);
>> +
>>  cleanup:
>>         if (nstoken)
>>                 close_netns(nstoken);
>> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c
>> new file mode 100644
>> index 000000000000..dd6850b58e25
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog_fail.c
>> @@ -0,0 +1,22 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include "vmlinux.h"
>> +#include <bpf/bpf_tracing.h>
>> +#include <bpf/bpf_helpers.h>
>> +
>> +#include "bpf_misc.h"
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
>> +
>> +SEC("tp_btf/tcp_destroy_sock")
>> +__failure __msg("calling kernel function bpf_sock_destroy is not allowed")
>> +int BPF_PROG(trace_tcp_destroy_sock, struct sock *sk)
>> +{
>> +       /* should not load */
>> +       bpf_sock_destroy((struct sock_common *)sk);
>> +
>> +       return 0;
>> +}
>>> 
>>> 
>>>> Please also fix the subject in the patches. They are all missing the bpf-next and revision tag.
>>>> 
>>> 
>>> It took me a few moments to realize that, since I was looking at the earlier series. Looks like I forgot to add the tags to the subsequent patches in this series. I'll fix it up in the next push.


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-05-04 17:32 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-18 15:31 [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Aditi Ghag
2023-04-18 15:31 ` [PATCH 1/7] bpf: tcp: Avoid taking fast sock lock in iterator Aditi Ghag
2023-04-20  8:55   ` Paolo Abeni
2023-05-03 20:25     ` Aditi Ghag
2023-04-25  5:45   ` Yonghong Song
2023-05-03 20:26     ` Aditi Ghag
2023-04-18 15:31 ` [PATCH 2/7] udp: seq_file: Remove bpf_seq_afinfo from udp_iter_state Aditi Ghag
2023-04-24  0:18   ` Martin KaFai Lau
2023-05-01 22:39     ` Aditi Ghag
2023-04-18 15:31 ` [PATCH 3/7] udp: seq_file: Helper function to match socket attributes Aditi Ghag
2023-04-18 15:31 ` [PATCH 4/7] bpf: udp: Implement batching for sockets iterator Aditi Ghag
2023-04-20 14:07   ` Paolo Abeni
2023-04-24  5:46   ` Martin KaFai Lau
2023-04-18 15:31 ` [PATCH 5/7] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
2023-04-18 15:31 ` [PATCH 6/7] selftests/bpf: Add helper to get port using getsockname Aditi Ghag
2023-04-18 18:45   ` Stanislav Fomichev
2023-04-24 17:58     ` Martin KaFai Lau
2023-04-18 15:31 ` [PATCH 7/7] selftests/bpf: Test bpf_sock_destroy Aditi Ghag
2023-04-24 19:20   ` Martin KaFai Lau
2023-04-24 22:15 ` [PATCH v6 bpf-next 0/7] bpf: Add socket destroy capability Martin KaFai Lau
2023-05-01 23:32   ` Aditi Ghag
2023-05-01 23:37     ` Aditi Ghag
2023-05-02 23:24       ` Martin KaFai Lau
2023-05-02 22:52     ` Aditi Ghag
2023-05-02 23:40       ` Martin KaFai Lau
2023-05-04 17:32         ` Aditi Ghag
