* [PATCH v3 bpf-next 0/5] bpf-next: Add socket destroy capability
@ 2023-03-21 18:45 Aditi Ghag
  2023-03-21 18:45 ` [PATCH v3 bpf-next 1/5] bpf: Implement batching in UDP iterator Aditi Ghag
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Aditi Ghag @ 2023-03-21 18:45 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

This patch series adds the capability to destroy sockets in BPF. We plan to
use the capability in Cilium to force client sockets to reconnect when their
remote load-balancing backends are deleted. The other use case is
on-the-fly policy enforcement where existing socket connections disallowed
by policies need to be terminated.

The use cases and more details about the selected approach were presented
at LPC 2022 -
https://lpc.events/event/16/contributions/1358/.
RFC discussion -
https://lore.kernel.org/netdev/CABG=zsBEh-P4NXk23eBJw7eajB5YJeRS7oPXnTAzs=yob4EMoQ@mail.gmail.com/T/#u.
v2 patch series -
https://lore.kernel.org/bpf/20230223215311.926899-1-aditi.ghag@isovalent.com/T/#t

v3 highlights:
- Martin's review comments:
  - UDP iterator batching patch supports resume operation.
  - Removed "full_sock" check from the destroy kfunc.
  - Reset of metadata in case of rebatching.
- Extended selftests to cover cases for destroying listening sockets.
- Fixes for destroying listening TCP and UDP sockets.
- Stan's review:
  - Refactored selftests to use ASSERT_* in lieu of CHECK.
  - Free leaking afinfo in fini_udp.
- Restructured test cases per Andrii's comment.

Notes to the reviewers:

- There are two RFC commits for being able to destroy listening TCP and
  UDP sockets. The TCP commit isn't quite correct, as inet_unhash could
  be invoked from BPF context for cases other than the iterator.
  The UDP commit seems reasonable based on my understanding of the code,
  but it may lead to unintended behavior when there are sockets
  listening on a wildcard and a specific address with a common port.
  I would appreciate insights into both commits, as I'm not
  intimately familiar with some of the overall code paths.

(Below notes are same as v2 patch series.)
- I hit a snag while writing the kfunc where the verifier complained about
  the `sock_common` type passed from the TCP iterator. With kfuncs, there
  don't seem to be any options available to pass BTF type hints to the
  verifier (equivalent of `ARG_PTR_TO_BTF_ID_SOCK_COMMON`, as was the case
  with the helper). As a result, I changed the argument type of the
  sock_destroy kfunc to `sock_common`.

- The `vmlinux.h` import in the selftest prog unexpectedly led to libbpf
  failing to load the program. As it turns out, the libbpf kfunc-related
  code doesn't seem to handle the BTF `FWD` type for structs. I've attached
  debug information about the issue in case the loader logic can accommodate
  such gotchas, although the error in this case was specific to the test
  imports.
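
For reviewers' convenience, here is a trimmed sketch of how the selftests
in patch 5 drive the iterator programs from userspace (the function name
below is illustrative; see prog_tests/sock_destroy.c for the real helper):

```c
/* Trimmed from the selftest runner in patch 5: attach an iterator
 * program and read it to completion, which runs the program over all
 * sockets; the ones it selects are destroyed via the kfunc.
 */
static void run_destroy_iter(struct bpf_program *prog)
{
	struct bpf_link *link;
	char buf[64] = {};
	int iter_fd, len;

	link = bpf_program__attach_iter(prog, NULL);
	if (!link)
		return;

	iter_fd = bpf_iter_create(bpf_link__fd(link));
	if (iter_fd >= 0) {
		while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
			;
		close(iter_fd);
	}

	bpf_link__destroy(link);
}
```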

Aditi Ghag (5):
  bpf: Implement batching in UDP iterator
  bpf: Add bpf_sock_destroy kfunc
  [RFC] net: Skip taking lock in BPF context
  [RFC] udp: Fix destroying UDP listening sockets
  selftests/bpf: Add tests for bpf_sock_destroy

 include/net/udp.h                             |   1 +
 net/core/filter.c                             |  54 ++++
 net/ipv4/inet_hashtables.c                    |   9 +-
 net/ipv4/tcp.c                                |  16 +-
 net/ipv4/udp.c                                | 283 +++++++++++++++++-
 .../selftests/bpf/prog_tests/sock_destroy.c   | 190 ++++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++
 7 files changed, 684 insertions(+), 20 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3 bpf-next 1/5] bpf: Implement batching in UDP iterator
  2023-03-21 18:45 [PATCH v3 bpf-next 0/5] bpf-next: Add socket destroy capability Aditi Ghag
@ 2023-03-21 18:45 ` Aditi Ghag
  2023-03-21 18:45 ` [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Aditi Ghag @ 2023-03-21 18:45 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag, Martin KaFai Lau

Batch UDP sockets from the BPF iterator to allow for overlapping locking
semantics in BPF/kernel helpers executed from BPF programs. This allows the
BPF socket destroy kfunc (introduced by follow-up patches) to execute from
BPF iterator programs.

Previously, BPF iterators acquired the sock lock and the sockets hash table
bucket lock while executing BPF programs. This prevented BPF helpers that
again acquire these locks from being executed from BPF iterators.  With the
batching approach, we acquire a bucket lock, batch all the bucket sockets,
and then release the bucket lock. This enables BPF or kernel helpers to
skip sock locking when invoked in the supported BPF contexts.

The batching logic is similar to the logic implemented in TCP iterator:
https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
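
In outline, the batching flow added here looks like the following condensed
sketch; the complete implementation, including the batch resizing and
resume handling, is in the diff below:

```c
/* Condensed sketch of bpf_iter_udp_batch() below: grab one bucket
 * lock, take references on all matching sockets in that bucket, then
 * drop the bucket lock before any BPF program runs.
 */
for (; state->bucket <= udptable->mask; state->bucket++) {
	struct udp_hslot *hslot = &udptable->hash[state->bucket];

	if (hlist_empty(&hslot->head))
		continue;

	spin_lock_bh(&hslot->lock);
	sk_for_each(sk, &hslot->head) {
		if (seq_sk_match(seq, sk) && iter->end_sk < iter->max_sk) {
			sock_hold(sk);	/* keeps sk valid after unlock */
			iter->batch[iter->end_sk++] = sk;
		}
	}
	spin_unlock_bh(&hslot->lock);	/* bucket lock released here */

	if (iter->end_sk)
		break;	/* seq_show() then iterates the batch lock-free */
}
```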

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/udp.c | 222 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 213 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..545e56329355 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3152,6 +3152,141 @@ struct bpf_iter__udp {
 	int bucket __aligned(8);
 };
 
+struct bpf_udp_iter_state {
+	struct udp_iter_state state;
+	unsigned int cur_sk;
+	unsigned int end_sk;
+	unsigned int max_sk;
+	struct sock **batch;
+	bool st_bucket_done;
+};
+
+static unsigned short seq_file_family(const struct seq_file *seq);
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz);
+
+static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
+{
+	unsigned short family = seq_file_family(seq);
+
+	/* AF_UNSPEC is used as a match all */
+	return ((family == AF_UNSPEC || family == sk->sk_family) &&
+		net_eq(sock_net(sk), seq_file_net(seq)));
+}
+
+static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct udp_iter_state *state = &iter->state;
+	struct net *net = seq_file_net(seq);
+	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
+	struct udp_table *udptable;
+	struct sock *first_sk = NULL;
+	struct sock *sk;
+	unsigned int bucket_sks = 0;
+	bool first;
+	bool resized = false;
+
+	/* The current batch is done, so advance the bucket. */
+	if (iter->st_bucket_done)
+		state->bucket++;
+
+	udptable = udp_get_table_afinfo(afinfo, net);
+
+again:
+	/* New batch for the next bucket.
+	 * Iterate over the hash table to find a bucket with sockets matching
+	 * the iterator attributes, and return the first matching socket from
+	 * the bucket. The remaining matched sockets from the bucket are batched
+	 * before releasing the bucket lock. This allows BPF programs that are
+	 * called in seq_show to acquire the bucket lock if needed.
+	 */
+	iter->cur_sk = 0;
+	iter->end_sk = 0;
+	iter->st_bucket_done = false;
+	first = true;
+
+	for (; state->bucket <= udptable->mask; state->bucket++) {
+		struct udp_hslot *hslot = &udptable->hash[state->bucket];
+
+		if (hlist_empty(&hslot->head))
+			continue;
+
+		spin_lock_bh(&hslot->lock);
+		sk_for_each(sk, &hslot->head) {
+			if (seq_sk_match(seq, sk)) {
+				if (first) {
+					first_sk = sk;
+					first = false;
+				}
+				if (iter->end_sk < iter->max_sk) {
+					sock_hold(sk);
+					iter->batch[iter->end_sk++] = sk;
+				}
+				bucket_sks++;
+			}
+		}
+		spin_unlock_bh(&hslot->lock);
+		if (first_sk)
+			break;
+	}
+
+	/* All done: no batch made. */
+	if (!first_sk)
+		return NULL;
+
+	if (iter->end_sk == bucket_sks) {
+		/* Batching is done for the current bucket; return the first
+		 * socket to be iterated from the batch.
+		 */
+		iter->st_bucket_done = true;
+		return first_sk;
+	}
+	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
+		resized = true;
+		/* Go back to the previous bucket to resize its batch. */
+		state->bucket--;
+		goto again;
+	}
+	return first_sk;
+}
+
+static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct sock *sk;
+
+	/* Whenever seq_next() is called, the iter->cur_sk is
+	 * done with seq_show(), so unref the iter->cur_sk.
+	 */
+	if (iter->cur_sk < iter->end_sk)
+		sock_put(iter->batch[iter->cur_sk++]);
+
+	/* After updating iter->cur_sk, check if there are more sockets
+	 * available in the current bucket batch.
+	 */
+	if (iter->cur_sk < iter->end_sk) {
+		sk = iter->batch[iter->cur_sk];
+	} else {
+		/* Prepare a new batch. */
+		sk = bpf_iter_udp_batch(seq);
+	}
+
+	++*pos;
+	return sk;
+}
+
+static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* bpf iter does not support lseek, so it always
+	 * continue from where it was stop()-ped.
+	 */
+	if (*pos)
+		return bpf_iter_udp_batch(seq);
+
+	return SEQ_START_TOKEN;
+}
+
 static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
 			     struct udp_sock *udp_sk, uid_t uid, int bucket)
 {
@@ -3172,18 +3307,38 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 	struct bpf_prog *prog;
 	struct sock *sk = v;
 	uid_t uid;
+	bool slow;
+	int rc;
 
 	if (v == SEQ_START_TOKEN)
 		return 0;
 
+	slow = lock_sock_fast(sk);
+
+	if (unlikely(sk_unhashed(sk))) {
+		rc = SEQ_SKIP;
+		goto unlock;
+	}
+
 	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
 	meta.seq = seq;
 	prog = bpf_iter_get_info(&meta, false);
-	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+
+unlock:
+	unlock_sock_fast(sk, slow);
+	return rc;
+}
+
+static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
+{
+	while (iter->cur_sk < iter->end_sk)
+		sock_put(iter->batch[iter->cur_sk++]);
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 {
+	struct bpf_udp_iter_state *iter = seq->private;
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 
@@ -3194,15 +3349,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
 	}
 
-	udp_seq_stop(seq, v);
+	if (iter->cur_sk < iter->end_sk) {
+		bpf_iter_udp_unref_batch(iter);
+		iter->st_bucket_done = false;
+	}
 }
 
 static const struct seq_operations bpf_iter_udp_seq_ops = {
-	.start		= udp_seq_start,
-	.next		= udp_seq_next,
+	.start		= bpf_iter_udp_seq_start,
+	.next		= bpf_iter_udp_seq_next,
 	.stop		= bpf_iter_udp_seq_stop,
 	.show		= bpf_iter_udp_seq_show,
 };
+
+static unsigned short seq_file_family(const struct seq_file *seq)
+{
+	const struct udp_seq_afinfo *afinfo;
+
+	/* BPF iterator: bpf programs to filter sockets. */
+	if (seq->op == &bpf_iter_udp_seq_ops)
+		return AF_UNSPEC;
+
+	/* Proc fs iterator */
+	afinfo = pde_data(file_inode(seq->file));
+	return afinfo->family;
+}
 #endif
 
 const struct seq_operations udp_seq_ops = {
@@ -3413,9 +3584,30 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
 DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 		     struct udp_sock *udp_sk, uid_t uid, int bucket)
 
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz)
+{
+	struct sock **new_batch;
+
+	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
+				   GFP_USER | __GFP_NOWARN);
+	if (!new_batch)
+		return -ENOMEM;
+
+	bpf_iter_udp_unref_batch(iter);
+	kvfree(iter->batch);
+	iter->batch = new_batch;
+	iter->max_sk = new_batch_sz;
+
+	return 0;
+}
+
+#define INIT_BATCH_SZ 16
+
 static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	struct udp_iter_state *st = priv_data;
+	struct bpf_udp_iter_state *iter = priv_data;
+	struct udp_iter_state *st = &iter->state;
 	struct udp_seq_afinfo *afinfo;
 	int ret;
 
@@ -3427,24 +3619,36 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 	afinfo->udp_table = NULL;
 	st->bpf_seq_afinfo = afinfo;
 	ret = bpf_iter_init_seq_net(priv_data, aux);
-	if (ret)
+	if (ret) {
 		kfree(afinfo);
+		return ret;
+	}
+	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
+	if (ret) {
+		bpf_iter_fini_seq_net(priv_data);
+		return ret;
+	}
+	iter->cur_sk = 0;
+	iter->end_sk = 0;
+
 	return ret;
 }
 
 static void bpf_iter_fini_udp(void *priv_data)
 {
-	struct udp_iter_state *st = priv_data;
+	struct bpf_udp_iter_state *iter = priv_data;
+	struct udp_iter_state *st = &iter->state;
 
-	kfree(st->bpf_seq_afinfo);
 	bpf_iter_fini_seq_net(priv_data);
+	kfree(st->bpf_seq_afinfo);
+	kvfree(iter->batch);
 }
 
 static const struct bpf_iter_seq_info udp_seq_info = {
 	.seq_ops		= &bpf_iter_udp_seq_ops,
 	.init_seq_private	= bpf_iter_init_udp,
 	.fini_seq_private	= bpf_iter_fini_udp,
-	.seq_priv_size		= sizeof(struct udp_iter_state),
+	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
 };
 
 static struct bpf_iter_reg udp_reg_info = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc
  2023-03-21 18:45 [PATCH v3 bpf-next 0/5] bpf-next: Add socket destroy capability Aditi Ghag
  2023-03-21 18:45 ` [PATCH v3 bpf-next 1/5] bpf: Implement batching in UDP iterator Aditi Ghag
@ 2023-03-21 18:45 ` Aditi Ghag
  2023-03-21 21:02   ` Stanislav Fomichev
  2023-03-21 23:43   ` Martin KaFai Lau
  2023-03-21 18:45 ` [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context Aditi Ghag
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 25+ messages in thread
From: Aditi Ghag @ 2023-03-21 18:45 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium to force
client sockets to reconnect when their remote load-balancing backends are
deleted. The other use case is on-the-fly policy enforcement where existing
socket connections disallowed by policies need to be forcefully terminated.
The helper allows terminating sockets that may or may not be actively
sending traffic.

The helper is currently exposed to certain BPF iterators where users can
filter and terminate selected sockets. Additionally, the helper can only
be called from BPF contexts that ensure socket locking in order to allow
synchronous execution of destroy helpers that also acquire socket locks.
The previous commit that batches UDP sockets during iteration facilitated
a synchronous invocation of the destroy helper from BPF context by skipping
the socket locks in the destroy handler. TCP iterators already supported
batching.

The helper takes a `sock_common` type argument, even though it expects, and
casts the argument to, a `sock` pointer. This enables the verifier to allow
the sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
`sock` structs. As a comparison, BPF helpers enable this behavior with the
`ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
option available with the verifier logic that handles kfuncs where BTF
types are inferred. Furthermore, as `sock_common` only has a subset of the
fields of `sock`, casting a pointer to the latter type might not always be
safe. Hence, the BPF kfunc converts the argument to a full sock before
casting.
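
For illustration, BPF iterator programs are expected to declare and call
the kfunc roughly as follows (a minimal sketch mirroring the selftests
added later in this series; program names are illustrative):

```c
/* Minimal sketch: includes and the vmlinux.h struct sock workaround
 * used by the selftests are omitted here for brevity.
 */
int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/tcp")
int destroy_tcp(struct bpf_iter__tcp *ctx)
{
	/* The TCP iterator context exposes a sock_common pointer directly. */
	if (ctx->sk_common)
		bpf_sock_destroy(ctx->sk_common);
	return 0;
}

SEC("iter/udp")
int destroy_udp(struct bpf_iter__udp *ctx)
{
	/* The UDP iterator context exposes udp_sock; cast to sock_common. */
	struct sock *sk = (struct sock *)ctx->udp_sk;

	if (sk)
		bpf_sock_destroy((struct sock_common *)sk);
	return 0;
}

char _license[] SEC("license") = "GPL";
```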

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 include/net/udp.h |  1 +
 net/core/filter.c | 54 ++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c    | 16 +++++++++----
 net/ipv4/udp.c    | 60 +++++++++++++++++++++++++++++++++++++----------
 4 files changed, 114 insertions(+), 17 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index de4b528522bb..d2999447d3f2 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -437,6 +437,7 @@ struct udp_seq_afinfo {
 struct udp_iter_state {
 	struct seq_net_private  p;
 	int			bucket;
+	int			offset;
 	struct udp_seq_afinfo	*bpf_seq_afinfo;
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 1d6f165923bf..ba3e0dac119c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
 
 	return func;
 }
+
+/* Disables missing prototype warnings */
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
+ *
+ * The helper expects a non-NULL pointer to a socket. It invokes the
+ * protocol specific socket destroy handlers.
+ *
+ * The helper can only be called from BPF contexts that have acquired the socket
+ * locks.
+ *
+ * Parameters:
+ * @sock: Pointer to socket to be destroyed
+ *
+ * Return:
+ * On error, may return EOPNOTSUPP, EINVAL.
+ * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
+ * 0 otherwise
+ */
+__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
+{
+	struct sock *sk = (struct sock *)sock;
+
+	if (!sk)
+		return -EINVAL;
+
+	/* The locking semantics that allow for synchronous execution of the
+	 * destroy handlers are only supported for TCP and UDP.
+	 */
+	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
+		return -EOPNOTSUPP;
+
+	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
+}
+
+__diag_pop()
+
+BTF_SET8_START(sock_destroy_kfunc_set)
+BTF_ID_FLAGS(func, bpf_sock_destroy)
+BTF_SET8_END(sock_destroy_kfunc_set)
+
+static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &sock_destroy_kfunc_set,
+};
+
+static int init_subsystem(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
+}
+late_initcall(init_subsystem);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 33f559f491c8..59a833c0c872 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
 		return 0;
 	}
 
-	/* Don't race with userspace socket closes such as tcp_close. */
-	lock_sock(sk);
+	/* BPF context ensures sock locking. */
+	if (!has_current_bpf_ctx())
+		/* Don't race with userspace socket closes such as tcp_close. */
+		lock_sock(sk);
 
 	if (sk->sk_state == TCP_LISTEN) {
 		tcp_set_state(sk, TCP_CLOSE);
@@ -4688,7 +4690,8 @@ int tcp_abort(struct sock *sk, int err)
 
 	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
 	local_bh_disable();
-	bh_lock_sock(sk);
+	if (!has_current_bpf_ctx())
+		bh_lock_sock(sk);
 
 	if (!sock_flag(sk, SOCK_DEAD)) {
 		sk->sk_err = err;
@@ -4700,10 +4703,13 @@ int tcp_abort(struct sock *sk, int err)
 		tcp_done(sk);
 	}
 
-	bh_unlock_sock(sk);
+	if (!has_current_bpf_ctx())
+		bh_unlock_sock(sk);
+
 	local_bh_enable();
 	tcp_write_queue_purge(sk);
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(tcp_abort);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 545e56329355..02d357713838 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
 
 int udp_abort(struct sock *sk, int err)
 {
-	lock_sock(sk);
+	/* BPF context ensures sock locking. */
+	if (!has_current_bpf_ctx())
+		lock_sock(sk);
 
 	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
 	 * with close()
@@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
 	__udp_disconnect(sk, 0);
 
 out:
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 
 	return 0;
 }
@@ -3184,15 +3187,23 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	struct sock *first_sk = NULL;
 	struct sock *sk;
 	unsigned int bucket_sks = 0;
-	bool first;
 	bool resized = false;
+	int offset = 0;
+	int new_offset;
 
 	/* The current batch is done, so advance the bucket. */
-	if (iter->st_bucket_done)
+	if (iter->st_bucket_done) {
 		state->bucket++;
+		state->offset = 0;
+	}
 
 	udptable = udp_get_table_afinfo(afinfo, net);
 
+	if (state->bucket > udptable->mask) {
+		state->bucket = 0;
+		state->offset = 0;
+	}
+
 again:
 	/* New batch for the next bucket.
 	 * Iterate over the hash table to find a bucket with sockets matching
@@ -3204,43 +3215,60 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	iter->cur_sk = 0;
 	iter->end_sk = 0;
 	iter->st_bucket_done = false;
-	first = true;
+	first_sk = NULL;
+	bucket_sks = 0;
+	offset = state->offset;
+	new_offset = offset;
 
 	for (; state->bucket <= udptable->mask; state->bucket++) {
 		struct udp_hslot *hslot = &udptable->hash[state->bucket];
 
-		if (hlist_empty(&hslot->head))
+		if (hlist_empty(&hslot->head)) {
+			offset = 0;
 			continue;
+		}
 
 		spin_lock_bh(&hslot->lock);
+		/* Resume from the last saved position in a bucket before
+		 * iterator was stopped.
+		 */
+		while (offset-- > 0) {
+			sk_for_each(sk, &hslot->head)
+				continue;
+		}
 		sk_for_each(sk, &hslot->head) {
 			if (seq_sk_match(seq, sk)) {
-				if (first) {
+				if (!first_sk)
 					first_sk = sk;
-					first = false;
-				}
 				if (iter->end_sk < iter->max_sk) {
 					sock_hold(sk);
 					iter->batch[iter->end_sk++] = sk;
 				}
 				bucket_sks++;
 			}
+			new_offset++;
 		}
 		spin_unlock_bh(&hslot->lock);
+
 		if (first_sk)
 			break;
+
+		/* Reset the current bucket's offset before moving to the next bucket. */
+		offset = 0;
+		new_offset = 0;
+
 	}
 
 	/* All done: no batch made. */
 	if (!first_sk)
-		return NULL;
+		goto ret;
 
 	if (iter->end_sk == bucket_sks) {
 		/* Batching is done for the current bucket; return the first
 		 * socket to be iterated from the batch.
 		 */
 		iter->st_bucket_done = true;
-		return first_sk;
+		goto ret;
 	}
 	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
 		resized = true;
@@ -3248,19 +3276,24 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 		state->bucket--;
 		goto again;
 	}
+ret:
+	state->offset = new_offset;
 	return first_sk;
 }
 
 static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	struct bpf_udp_iter_state *iter = seq->private;
+	struct udp_iter_state *state = &iter->state;
 	struct sock *sk;
 
 	/* Whenever seq_next() is called, the iter->cur_sk is
 	 * done with seq_show(), so unref the iter->cur_sk.
 	 */
-	if (iter->cur_sk < iter->end_sk)
+	if (iter->cur_sk < iter->end_sk) {
 		sock_put(iter->batch[iter->cur_sk++]);
+		++state->offset;
+	}
 
 	/* After updating iter->cur_sk, check if there are more sockets
 	 * available in the current bucket batch.
@@ -3630,6 +3663,9 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 	}
 	iter->cur_sk = 0;
 	iter->end_sk = 0;
+	iter->st_bucket_done = false;
+	st->bucket = 0;
+	st->offset = 0;
 
 	return ret;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context
  2023-03-21 18:45 [PATCH v3 bpf-next 0/5] bpf-next: Add socket destroy capability Aditi Ghag
  2023-03-21 18:45 ` [PATCH v3 bpf-next 1/5] bpf: Implement batching in UDP iterator Aditi Ghag
  2023-03-21 18:45 ` [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
@ 2023-03-21 18:45 ` Aditi Ghag
  2023-03-21 21:31   ` Stanislav Fomichev
  2023-03-21 18:45 ` [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets Aditi Ghag
  2023-03-21 18:45 ` [PATCH 5/5] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
  4 siblings, 1 reply; 25+ messages in thread
From: Aditi Ghag @ 2023-03-21 18:45 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

When sockets are destroyed in the BPF iterator context, sock
lock is already acquired, so skip taking the lock. This allows
TCP listening sockets to be destroyed from BPF programs.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/inet_hashtables.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index e41fdc38ce19..5543a3e0d1b4 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -777,9 +777,11 @@ void inet_unhash(struct sock *sk)
 		/* Don't disable bottom halves while acquiring the lock to
 		 * avoid circular locking dependency on PREEMPT_RT.
 		 */
-		spin_lock(&ilb2->lock);
+		if (!has_current_bpf_ctx())
+			spin_lock(&ilb2->lock);
 		if (sk_unhashed(sk)) {
-			spin_unlock(&ilb2->lock);
+			if (!has_current_bpf_ctx())
+				spin_unlock(&ilb2->lock);
 			return;
 		}
 
@@ -788,7 +790,8 @@ void inet_unhash(struct sock *sk)
 
 		__sk_nulls_del_node_init_rcu(sk);
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-		spin_unlock(&ilb2->lock);
+		if (!has_current_bpf_ctx())
+			spin_unlock(&ilb2->lock);
 	} else {
 		spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets
  2023-03-21 18:45 [PATCH v3 bpf-next 0/5] bpf-next: Add socket destroy capability Aditi Ghag
                   ` (2 preceding siblings ...)
  2023-03-21 18:45 ` [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context Aditi Ghag
@ 2023-03-21 18:45 ` Aditi Ghag
  2023-03-22  0:29   ` Martin KaFai Lau
  2023-03-21 18:45 ` [PATCH 5/5] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
  4 siblings, 1 reply; 25+ messages in thread
From: Aditi Ghag @ 2023-03-21 18:45 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

Previously, UDP listening sockets that bound to a port
weren't getting properly destroyed via udp_abort.
Specifically, the sockets were left in the UDP hash table with
an unset source port.
Fix the issue by unconditionally unhashing and resetting the
source port for sockets that are being destroyed. This means
that for sockets listening on a wildcard address and on a
specific address with a common port, users would have to
explicitly select the socket(s) they intend to destroy.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/udp.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 02d357713838..a495ac88fcee 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 }
 EXPORT_SYMBOL(udp_pre_connect);
 
+int __udp_disconnect_with_abort(struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+
+	sk->sk_state = TCP_CLOSE;
+	inet->inet_daddr = 0;
+	inet->inet_dport = 0;
+	sock_rps_reset_rxhash(sk);
+	sk->sk_bound_dev_if = 0;
+	inet_reset_saddr(sk);
+	inet->inet_sport = 0;
+	sk_dst_reset(sk);
+	/* (TBD) In case of sockets listening on wildcard and specific address
+	 * with a common port, socket will be removed from {hash, hash2} table.
+	 */
+	sk->sk_prot->unhash(sk);
+	return 0;
+}
+
 int __udp_disconnect(struct sock *sk, int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -2937,7 +2956,7 @@ int udp_abort(struct sock *sk, int err)
 
 	sk->sk_err = err;
 	sk_error_report(sk);
-	__udp_disconnect(sk, 0);
+	__udp_disconnect_with_abort(sk);
 
 out:
 	if (!has_current_bpf_ctx())
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/5] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-21 18:45 [PATCH v3 bpf-next 0/5] bpf-next: Add socket destroy capability Aditi Ghag
                   ` (3 preceding siblings ...)
  2023-03-21 18:45 ` [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets Aditi Ghag
@ 2023-03-21 18:45 ` Aditi Ghag
  4 siblings, 0 replies; 25+ messages in thread
From: Aditi Ghag @ 2023-03-21 18:45 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

The test cases for TCP and UDP iterators mirror the intended usages of the
helper using BPF iterators.

The destroy helpers set the `ECONNABORTED` error code, which we can validate
in the test code with client sockets. But UDP sockets have an overriding
error code from the disconnect called during abort, so the error code
validation is only done for TCP sockets.

The `struct sock` is redefined as vmlinux.h forward declares the struct,
and the loader fails to load the program as it finds the BTF FWD type for
the struct incompatible with the BTF STRUCT type.

Here are the snippets of the verifier error, and corresponding BTF output:

```
verifier error: extern (func ksym) ...: func_proto ... incompatible with kernel

BTF for selftest prog binary:

[104] FWD 'sock' fwd_kind=struct
[70] PTR '(anon)' type_id=104
[84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
	'(anon)' type_id=70
[85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
--
[96] DATASEC '.ksyms' size=0 vlen=1
	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')

BTF for selftest vmlinux:

[74923] FUNC 'bpf_sock_destroy' type_id=48965 linkage=static
[48965] FUNC_PROTO '(anon)' ret_type_id=9 vlen=1
	'sk' type_id=1340
[1340] PTR '(anon)' type_id=2363
[2363] STRUCT 'sock' size=1280 vlen=93
```

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 .../selftests/bpf/prog_tests/sock_destroy.c   | 190 ++++++++++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
 2 files changed, 341 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
new file mode 100644
index 000000000000..86c29f2c9d50
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+
+#include "sock_destroy_prog.skel.h"
+#include "network_helpers.h"
+
+static void start_iter_sockets(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+	char buf[50] = {};
+	int iter_fd, len;
+
+	link = bpf_program__attach_iter(prog, NULL);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
+		goto free_link;
+
+	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
+		;
+	ASSERT_GE(len, 0, "read");
+
+	close(iter_fd);
+
+free_link:
+	bpf_link__destroy(link);
+}
+
+static void test_tcp_client(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup_serv;
+
+	serv = accept(serv, NULL, NULL);
+	if (!ASSERT_GE(serv, 0, "serv accept"))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_GE(n, 0, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys connected client sockets. */
+	start_iter_sockets(skel->progs.iter_tcp6_client);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
+
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+static void test_tcp_server(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup_serv;
+
+	serv = accept(serv, NULL, NULL);
+	if (!ASSERT_GE(serv, 0, "serv accept"))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_GE(n, 0, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys server sockets. */
+	start_iter_sockets(skel->progs.iter_tcp6_server);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
+
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+
+static void test_udp_client(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup_serv;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_GE(n, 0, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys sockets. */
+	start_iter_sockets(skel->progs.iter_udp6_client);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	/* UDP sockets have an overriding error code after they are disconnected,
+	 * so we don't check for ECONNABORTED error code.
+	 */
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+static void test_udp_server(struct sock_destroy_prog *skel)
+{
+	int *listen_fds = NULL, serv = 0;
+	unsigned int num_listens = 5;
+
+	/* Start reuseport servers */
+	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
+					    "::1", 6062, 0,
+					    num_listens);
+	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
+		goto cleanup;
+
+	/* Run iterator program that destroys sockets. */
+	start_iter_sockets(skel->progs.iter_udp6_server);
+
+	/* Start a regular server that binds on the same port as the destroyed
+	 * sockets.
+	 */
+	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6062, 0);
+	ASSERT_GE(serv, 0, "start_server failed");
+
+cleanup:
+	free_fds(listen_fds, num_listens);
+}
+
+void test_sock_destroy(void)
+{
+	int cgroup_fd = 0;
+	struct sock_destroy_prog *skel;
+
+	skel = sock_destroy_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	cgroup_fd = test__join_cgroup("/sock_destroy");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		goto close_cgroup_fd;
+
+	skel->links.sock_connect = bpf_program__attach_cgroup(
+		skel->progs.sock_connect, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
+		goto close_cgroup_fd;
+
+	if (test__start_subtest("tcp_client"))
+		test_tcp_client(skel);
+	if (test__start_subtest("tcp_server"))
+		test_tcp_server(skel);
+	if (test__start_subtest("udp_client"))
+		test_udp_client(skel);
+	if (test__start_subtest("udp_server"))
+		test_udp_server(skel);
+
+
+close_cgroup_fd:
+	close(cgroup_fd);
+	sock_destroy_prog__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
new file mode 100644
index 000000000000..3ccce63f0245
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define sock sock___not_used
+#include "vmlinux.h"
+#undef sock
+
+#include <bpf/bpf_helpers.h>
+
+#define AF_INET6 10
+
+/* Redefine the struct: vmlinux.h forward declares it, and the loader fails
+ * to load the program as it finds the BTF FWD type for the struct incompatible
+ * with the BTF STRUCT type.
+ */
+struct sock {
+	struct sock_common	__sk_common;
+#define sk_family		__sk_common.skc_family
+#define sk_cookie		__sk_common.skc_cookie
+};
+
+int bpf_sock_destroy(struct sock_common *sk) __ksym;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} tcp_conn_sockets SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} udp_conn_sockets SEC(".maps");
+
+SEC("cgroup/connect6")
+int sock_connect(struct bpf_sock_addr *ctx)
+{
+	int key = 0;
+	__u64 sock_cookie = 0;
+	__u32 keyc = 0;
+
+	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
+		return 1;
+
+	sock_cookie = bpf_get_socket_cookie(ctx);
+	if (ctx->protocol == IPPROTO_TCP)
+		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
+	else if (ctx->protocol == IPPROTO_UDP)
+		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
+	else
+		return 1;
+
+	return 1;
+}
+
+SEC("iter/tcp")
+int iter_tcp6_client(struct bpf_iter__tcp *ctx)
+{
+	struct sock_common *sk_common = ctx->sk_common;
+	struct seq_file *seq = ctx->meta->seq;
+	__u64 sock_cookie = 0;
+	__u64 *val;
+	int key = 0;
+
+	if (!sk_common)
+		return 0;
+
+	if (sk_common->skc_family != AF_INET6)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk_common);
+	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
+	if (!val)
+		return 0;
+	/* Destroy connected client sockets. */
+	if (sock_cookie == *val)
+		bpf_sock_destroy(sk_common);
+
+	return 0;
+}
+
+SEC("iter/tcp")
+int iter_tcp6_server(struct bpf_iter__tcp *ctx)
+{
+	struct sock_common *sk_common = ctx->sk_common;
+	struct seq_file *seq = ctx->meta->seq;
+	__u64 sock_cookie = 0;
+	__u64 *val;
+	int key = 0;
+
+	if (!sk_common)
+		return 0;
+
+	if (sk_common->skc_family != AF_INET6)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk_common);
+	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
+
+	if (!val)
+		return 0;
+
+	/* Destroy server sockets. */
+	if (sock_cookie != *val)
+		bpf_sock_destroy(sk_common);
+
+	return 0;
+}
+
+
+SEC("iter/udp")
+int iter_udp6_client(struct bpf_iter__udp *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct udp_sock *udp_sk = ctx->udp_sk;
+	struct sock *sk = (struct sock *) udp_sk;
+	__u64 sock_cookie = 0, *val;
+	int key = 0;
+
+	if (!sk)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk);
+	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
+	if (!val)
+		return 0;
+	/* Destroy connected client sockets. */
+	if (sock_cookie == *val)
+		bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
+SEC("iter/udp")
+int iter_udp6_server(struct bpf_iter__udp *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct udp_sock *udp_sk = ctx->udp_sk;
+	struct sock *sk = (struct sock *) udp_sk;
+
+	if (!sk)
+		return 0;
+
+	bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc
  2023-03-21 18:45 ` [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
@ 2023-03-21 21:02   ` Stanislav Fomichev
  2023-03-23 20:02     ` Aditi Ghag
  2023-03-21 23:43   ` Martin KaFai Lau
  1 sibling, 1 reply; 25+ messages in thread
From: Stanislav Fomichev @ 2023-03-21 21:02 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On 03/21, Aditi Ghag wrote:
> The socket destroy kfunc is used to forcefully terminate sockets from
> certain BPF contexts. We plan to use the capability in Cilium to force
> client sockets to reconnect when their remote load-balancing backends are
> deleted. The other use case is on-the-fly policy enforcement where  
> existing
> socket connections prevented by policies need to be forcefully terminated.
> The helper allows terminating sockets that may or may not be actively
> sending traffic.

> The helper is currently exposed to certain BPF iterators where users can
> filter, and terminate selected sockets.  Additionally, the helper can only
> be called from these BPF contexts that ensure socket locking in order to
> allow synchronous execution of destroy helpers that also acquire socket
> locks. The previous commit that batches UDP sockets during iteration
> facilitated a synchronous invocation of the destroy helper from BPF  
> context
> by skipping taking socket locks in the destroy handler. TCP iterators
> already supported batching.

> The helper takes `sock_common` type argument, even though it expects, and
> casts them to a `sock` pointer. This enables the verifier to allow the
> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
> `sock` structs. As a comparison, BPF helpers enable this behavior with the
> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
> option available with the verifier logic that handles kfuncs where BTF
> types are inferred. Furthermore, as `sock_common` only has a subset of
> certain fields of `sock`, casting pointer to the latter type might not
> always be safe. Hence, the BPF kfunc converts the argument to a full sock
> before casting.

> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   include/net/udp.h |  1 +
>   net/core/filter.c | 54 ++++++++++++++++++++++++++++++++++++++++++
>   net/ipv4/tcp.c    | 16 +++++++++----
>   net/ipv4/udp.c    | 60 +++++++++++++++++++++++++++++++++++++----------
>   4 files changed, 114 insertions(+), 17 deletions(-)

> diff --git a/include/net/udp.h b/include/net/udp.h
> index de4b528522bb..d2999447d3f2 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
>   struct udp_iter_state {
>   	struct seq_net_private  p;
>   	int			bucket;
> +	int			offset;
>   	struct udp_seq_afinfo	*bpf_seq_afinfo;
>   };

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1d6f165923bf..ba3e0dac119c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)

>   	return func;
>   }
> +
> +/* Disables missing prototype warnings */
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in vmlinux BTF");
> +
> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error  
> code.
> + *
> + * The helper expects a non-NULL pointer to a socket. It invokes the
> + * protocol specific socket destroy handlers.
> + *
> + * The helper can only be called from BPF contexts that have acquired  
> the socket
> + * locks.
> + *
> + * Parameters:
> + * @sock: Pointer to socket to be destroyed
> + *
> + * Return:
> + * On error, may return EPROTONOSUPPORT, EINVAL.
> + * EPROTONOSUPPORT if protocol specific destroy handler is not  
> implemented.
> + * 0 otherwise
> + */
> +__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
> +{
> +	struct sock *sk = (struct sock *)sock;
> +
> +	if (!sk)
> +		return -EINVAL;
> +
> +	/* The locking semantics that allow for synchronous execution of the
> +	 * destroy handlers are only supported for TCP and UDP.
> +	 */
> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> +		return -EOPNOTSUPP;

What makes IPPROTO_RAW special? Looks like it locks the socket as well?

> +
> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> +}
> +
> +__diag_pop()
> +
> +BTF_SET8_START(sock_destroy_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_sock_destroy)
> +BTF_SET8_END(sock_destroy_kfunc_set)
> +
> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &sock_destroy_kfunc_set,
> +};
> +
> +static int init_subsystem(void)
> +{
> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,  
> &bpf_sock_destroy_kfunc_set);
> +}
> +late_initcall(init_subsystem);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 33f559f491c8..59a833c0c872 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>   		return 0;
>   	}

> -	/* Don't race with userspace socket closes such as tcp_close. */
> -	lock_sock(sk);
> +	/* BPF context ensures sock locking. */
> +	if (!has_current_bpf_ctx())
> +		/* Don't race with userspace socket closes such as tcp_close. */
> +		lock_sock(sk);

>   	if (sk->sk_state == TCP_LISTEN) {
>   		tcp_set_state(sk, TCP_CLOSE);
> @@ -4688,7 +4690,8 @@ int tcp_abort(struct sock *sk, int err)

>   	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
>   	local_bh_disable();

[..]

> -	bh_lock_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		bh_lock_sock(sk);

These are spinlocks and should probably be grabbed in the bpf context as
well?


>   	if (!sock_flag(sk, SOCK_DEAD)) {
>   		sk->sk_err = err;
> @@ -4700,10 +4703,13 @@ int tcp_abort(struct sock *sk, int err)
>   		tcp_done(sk);
>   	}

> -	bh_unlock_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		bh_unlock_sock(sk);
> +
>   	local_bh_enable();
>   	tcp_write_queue_purge(sk);
> -	release_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		release_sock(sk);
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(tcp_abort);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 545e56329355..02d357713838 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);

>   int udp_abort(struct sock *sk, int err)
>   {
> -	lock_sock(sk);
> +	/* BPF context ensures sock locking. */
> +	if (!has_current_bpf_ctx())
> +		lock_sock(sk);

>   	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
>   	 * with close()
> @@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
>   	__udp_disconnect(sk, 0);

>   out:
> -	release_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		release_sock(sk);

>   	return 0;
>   }
> @@ -3184,15 +3187,23 @@ static struct sock *bpf_iter_udp_batch(struct  
> seq_file *seq)
>   	struct sock *first_sk = NULL;
>   	struct sock *sk;
>   	unsigned int bucket_sks = 0;
> -	bool first;
>   	bool resized = false;
> +	int offset = 0;
> +	int new_offset;

>   	/* The current batch is done, so advance the bucket. */
> -	if (iter->st_bucket_done)
> +	if (iter->st_bucket_done) {
>   		state->bucket++;
> +		state->offset = 0;
> +	}

>   	udptable = udp_get_table_afinfo(afinfo, net);

> +	if (state->bucket > udptable->mask) {
> +		state->bucket = 0;
> +		state->offset = 0;
> +	}
> +
>   again:
>   	/* New batch for the next bucket.
>   	 * Iterate over the hash table to find a bucket with sockets matching
> @@ -3204,43 +3215,60 @@ static struct sock *bpf_iter_udp_batch(struct  
> seq_file *seq)
>   	iter->cur_sk = 0;
>   	iter->end_sk = 0;
>   	iter->st_bucket_done = false;
> -	first = true;
> +	first_sk = NULL;
> +	bucket_sks = 0;
> +	offset = state->offset;
> +	new_offset = offset;

>   	for (; state->bucket <= udptable->mask; state->bucket++) {
>   		struct udp_hslot *hslot = &udptable->hash[state->bucket];

> -		if (hlist_empty(&hslot->head))
> +		if (hlist_empty(&hslot->head)) {
> +			offset = 0;
>   			continue;
> +		}

>   		spin_lock_bh(&hslot->lock);
> +		/* Resume from the last saved position in a bucket before
> +		 * iterator was stopped.
> +		 */
> +		while (offset-- > 0) {
> +			sk_for_each(sk, &hslot->head)
> +				continue;
> +		}
>   		sk_for_each(sk, &hslot->head) {
>   			if (seq_sk_match(seq, sk)) {
> -				if (first) {
> +				if (!first_sk)
>   					first_sk = sk;
> -					first = false;
> -				}
>   				if (iter->end_sk < iter->max_sk) {
>   					sock_hold(sk);
>   					iter->batch[iter->end_sk++] = sk;
>   				}
>   				bucket_sks++;
>   			}
> +			new_offset++;
>   		}
>   		spin_unlock_bh(&hslot->lock);
> +
>   		if (first_sk)
>   			break;
> +
> +		/* Reset the current bucket's offset before moving to the next bucket.  
> */
> +		offset = 0;
> +		new_offset = 0;
> +
>   	}

>   	/* All done: no batch made. */
>   	if (!first_sk)
> -		return NULL;
> +		goto ret;

>   	if (iter->end_sk == bucket_sks) {
>   		/* Batching is done for the current bucket; return the first
>   		 * socket to be iterated from the batch.
>   		 */
>   		iter->st_bucket_done = true;
> -		return first_sk;
> +		goto ret;
>   	}
>   	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
>   		resized = true;
> @@ -3248,19 +3276,24 @@ static struct sock *bpf_iter_udp_batch(struct  
> seq_file *seq)
>   		state->bucket--;
>   		goto again;
>   	}
> +ret:
> +	state->offset = new_offset;
>   	return first_sk;
>   }

>   static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t  
> *pos)
>   {
>   	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
>   	struct sock *sk;

>   	/* Whenever seq_next() is called, the iter->cur_sk is
>   	 * done with seq_show(), so unref the iter->cur_sk.
>   	 */
> -	if (iter->cur_sk < iter->end_sk)
> +	if (iter->cur_sk < iter->end_sk) {
>   		sock_put(iter->batch[iter->cur_sk++]);
> +		++state->offset;
> +	}

>   	/* After updating iter->cur_sk, check if there are more sockets
>   	 * available in the current bucket batch.
> @@ -3630,6 +3663,9 @@ static int bpf_iter_init_udp(void *priv_data,  
> struct bpf_iter_aux_info *aux)
>   	}
>   	iter->cur_sk = 0;
>   	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	st->bucket = 0;
> +	st->offset = 0;

>   	return ret;
>   }
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context
  2023-03-21 18:45 ` [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context Aditi Ghag
@ 2023-03-21 21:31   ` Stanislav Fomichev
  2023-03-21 21:37     ` Aditi Ghag
  2023-03-21 23:39     ` Aditi Ghag
  0 siblings, 2 replies; 25+ messages in thread
From: Stanislav Fomichev @ 2023-03-21 21:31 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On 03/21, Aditi Ghag wrote:
> When sockets are destroyed in the BPF iterator context, sock
> lock is already acquired, so skip taking the lock. This allows
> TCP listening sockets to be destroyed from BPF programs.

> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/ipv4/inet_hashtables.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)

> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index e41fdc38ce19..5543a3e0d1b4 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -777,9 +777,11 @@ void inet_unhash(struct sock *sk)
>   		/* Don't disable bottom halves while acquiring the lock to
>   		 * avoid circular locking dependency on PREEMPT_RT.
>   		 */
> -		spin_lock(&ilb2->lock);
> +		if (!has_current_bpf_ctx())
> +			spin_lock(&ilb2->lock);
>   		if (sk_unhashed(sk)) {
> -			spin_unlock(&ilb2->lock);
> +			if (!has_current_bpf_ctx())
> +				spin_unlock(&ilb2->lock);

That's bucket lock, why do we have to skip it?

>   			return;
>   		}

> @@ -788,7 +790,8 @@ void inet_unhash(struct sock *sk)

>   		__sk_nulls_del_node_init_rcu(sk);
>   		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> -		spin_unlock(&ilb2->lock);
> +		if (!has_current_bpf_ctx())
> +			spin_unlock(&ilb2->lock);
>   	} else {
>   		spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);

> --
> 2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context
  2023-03-21 21:31   ` Stanislav Fomichev
@ 2023-03-21 21:37     ` Aditi Ghag
  2023-03-21 21:41       ` Eric Dumazet
  2023-03-21 21:43       ` Aditi Ghag
  2023-03-21 23:39     ` Aditi Ghag
  1 sibling, 2 replies; 25+ messages in thread
From: Aditi Ghag @ 2023-03-21 21:37 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Mar 21, 2023, at 2:31 PM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On 03/21, Aditi Ghag wrote:
>> When sockets are destroyed in the BPF iterator context, sock
>> lock is already acquired, so skip taking the lock. This allows
>> TCP listening sockets to be destroyed from BPF programs.
> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  net/ipv4/inet_hashtables.c | 9 ++++++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index e41fdc38ce19..5543a3e0d1b4 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -777,9 +777,11 @@ void inet_unhash(struct sock *sk)
>>  		/* Don't disable bottom halves while acquiring the lock to
>>  		 * avoid circular locking dependency on PREEMPT_RT.
>>  		 */
>> -		spin_lock(&ilb2->lock);
>> +		if (!has_current_bpf_ctx())
>> +			spin_lock(&ilb2->lock);
>>  		if (sk_unhashed(sk)) {
>> -			spin_unlock(&ilb2->lock);
>> +			if (!has_current_bpf_ctx())
>> +				spin_unlock(&ilb2->lock);
> 
> That's bucket lock, why do we have to skip it?

Because we take the bucket lock while iterating UDP sockets. See the first commit in this series around batching. But not all BPF contexts that could invoke this function acquire the lock, so we can't always skip it.

> 
>>  			return;
>>  		}
> 
>> @@ -788,7 +790,8 @@ void inet_unhash(struct sock *sk)
> 
>>  		__sk_nulls_del_node_init_rcu(sk);
>>  		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>> -		spin_unlock(&ilb2->lock);
>> +		if (!has_current_bpf_ctx())
>> +			spin_unlock(&ilb2->lock);
>>  	} else {
>>  		spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
> 
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context
  2023-03-21 21:37     ` Aditi Ghag
@ 2023-03-21 21:41       ` Eric Dumazet
  2023-03-21 21:43       ` Aditi Ghag
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2023-03-21 21:41 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: Stanislav Fomichev, bpf, kafai

On Tue, Mar 21, 2023 at 2:37 PM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
>
>
> > On Mar 21, 2023, at 2:31 PM, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/21, Aditi Ghag wrote:
> >> When sockets are destroyed in the BPF iterator context, sock
> >> lock is already acquired, so skip taking the lock. This allows
> >> TCP listening sockets to be destroyed from BPF programs.
> >
> >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> >> ---
> >>  net/ipv4/inet_hashtables.c | 9 ++++++---
> >>  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> >> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> >> index e41fdc38ce19..5543a3e0d1b4 100644
> >> --- a/net/ipv4/inet_hashtables.c
> >> +++ b/net/ipv4/inet_hashtables.c
> >> @@ -777,9 +777,11 @@ void inet_unhash(struct sock *sk)
> >>              /* Don't disable bottom halves while acquiring the lock to
> >>               * avoid circular locking dependency on PREEMPT_RT.
> >>               */
> >> -            spin_lock(&ilb2->lock);
> >> +            if (!has_current_bpf_ctx())
> >> +                    spin_lock(&ilb2->lock);
> >>              if (sk_unhashed(sk)) {
> >> -                    spin_unlock(&ilb2->lock);
> >> +                    if (!has_current_bpf_ctx())
> >> +                            spin_unlock(&ilb2->lock);
> >
> > That's bucket lock, why do we have to skip it?
>
> Because we take the bucket lock while iterating UDP sockets. See the first commit in this series around batching. But not all BPF contexts that could invoke this function acquire the lock, so we can't always skip it.

This means the kernel will deadlock right before this patch is applied ?

Please squash this patch into the patch adding this bug.

Also, ensuring locks are owned would be nice (full LOCKDEP support,
instead of assumptions)
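
For example, something along these lines (untested sketch) instead of
silently skipping the lock; sock_owned_by_me() is the existing
lockdep-only assertion:

```c
	/* Untested sketch: assert lock ownership under lockdep instead of
	 * assuming the BPF context already holds the socket lock.
	 * sock_owned_by_me() compiles away without CONFIG_LOCKDEP.
	 */
	if (has_current_bpf_ctx())
		sock_owned_by_me(sk);
	else
		lock_sock(sk);
```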

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context
  2023-03-21 21:37     ` Aditi Ghag
  2023-03-21 21:41       ` Eric Dumazet
@ 2023-03-21 21:43       ` Aditi Ghag
  1 sibling, 0 replies; 25+ messages in thread
From: Aditi Ghag @ 2023-03-21 21:43 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Mar 21, 2023, at 2:37 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> 
> 
> 
>> On Mar 21, 2023, at 2:31 PM, Stanislav Fomichev <sdf@google.com> wrote:
>> 
>> On 03/21, Aditi Ghag wrote:
>>> When sockets are destroyed in the BPF iterator context, sock
>>> lock is already acquired, so skip taking the lock. This allows
>>> TCP listening sockets to be destroyed from BPF programs.
>> 
>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>> ---
>>> net/ipv4/inet_hashtables.c | 9 ++++++---
>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>> 
>>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>>> index e41fdc38ce19..5543a3e0d1b4 100644
>>> --- a/net/ipv4/inet_hashtables.c
>>> +++ b/net/ipv4/inet_hashtables.c
>>> @@ -777,9 +777,11 @@ void inet_unhash(struct sock *sk)
>>> 		/* Don't disable bottom halves while acquiring the lock to
>>> 		 * avoid circular locking dependency on PREEMPT_RT.
>>> 		 */
>>> -		spin_lock(&ilb2->lock);
>>> +		if (!has_current_bpf_ctx())
>>> +			spin_lock(&ilb2->lock);
>>> 		if (sk_unhashed(sk)) {
>>> -			spin_unlock(&ilb2->lock);
>>> +			if (!has_current_bpf_ctx())
>>> +				spin_unlock(&ilb2->lock);
>> 
>> That's bucket lock, why do we have to skip it?
> 
> Because we take the bucket lock while iterating UDP sockets. See the first commit in this series around batching. But not all BPF contexts that could invoke this function acquire that lock, so we can't always skip taking it.

Sorry, my buttery fingers hit the send button too soon. You are right, it's the bucket lock and not the *sock* lock. The commit that adds batching releases the bucket lock. I'll take a look at the stack trace again.


> 
>> 
>>> 			return;
>>> 		}
>> 
>>> @@ -788,7 +790,8 @@ void inet_unhash(struct sock *sk)
>> 
>>> 		__sk_nulls_del_node_init_rcu(sk);
>>> 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>>> -		spin_unlock(&ilb2->lock);
>>> +		if (!has_current_bpf_ctx())
>>> +			spin_unlock(&ilb2->lock);
>>> 	} else {
>>> 		spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
>> 
>>> --
>>> 2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context
  2023-03-21 21:31   ` Stanislav Fomichev
  2023-03-21 21:37     ` Aditi Ghag
@ 2023-03-21 23:39     ` Aditi Ghag
  2023-03-22  0:02       ` Martin KaFai Lau
  1 sibling, 1 reply; 25+ messages in thread
From: Aditi Ghag @ 2023-03-21 23:39 UTC (permalink / raw)
  To: Stanislav Fomichev, kafai; +Cc: bpf, Eric Dumazet



> On Mar 21, 2023, at 2:31 PM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On 03/21, Aditi Ghag wrote:
>> When sockets are destroyed in the BPF iterator context, sock
>> lock is already acquired, so skip taking the lock. This allows
>> TCP listening sockets to be destroyed from BPF programs.
> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  net/ipv4/inet_hashtables.c | 9 ++++++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index e41fdc38ce19..5543a3e0d1b4 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -777,9 +777,11 @@ void inet_unhash(struct sock *sk)
>>  		/* Don't disable bottom halves while acquiring the lock to
>>  		 * avoid circular locking dependency on PREEMPT_RT.
>>  		 */
>> -		spin_lock(&ilb2->lock);
>> +		if (!has_current_bpf_ctx())
>> +			spin_lock(&ilb2->lock);
>>  		if (sk_unhashed(sk)) {
>> -			spin_unlock(&ilb2->lock);
>> +			if (!has_current_bpf_ctx())
>> +				spin_unlock(&ilb2->lock);
> 
> That's bucket lock, why do we have to skip it?

Good catch! We shouldn't skip it. So the issue is that the BPF TCP iterator acquires the sock fast lock, which disables BH, and then inet_unhash acquires the bucket lock; that bucket lock is elsewhere taken with BH enabled, so lockdep flags the resulting lock ordering as unsafe.

@Martin: I wonder if you ran into similar issues while working on the BPF TCP iterator batching changes for setsockopt? 

Here is the truncated stack trace:

    1.494510] test_progs/118 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire:
[    1.495123] ffff9ec9c184da10 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x96/0xd0
[    1.495867]
[    1.495867] and this task is already holding:
[    1.496401] ffff9ec9c23400b0 (slock-AF_INET6){+.-.}-{2:2}, at: __lock_sock_fast+0x33/0x70
[    1.497148] which would create a new lock dependency:
[    1.497619]  (slock-AF_INET6){+.-.}-{2:2} -> (&h->lhash2[i].lock){+.+.}-{2:2}
[    1.498278]
[    1.498278] but this new dependency connects a SOFTIRQ-irq-safe lock:
[    1.499011]  (slock-AF_INET6){+.-.}-{2:2}
[    1.499014]
[    1.499014] ... which became SOFTIRQ-irq-safe at:
[    1.499968]   lock_acquire+0xcd/0x330
[    1.500316]   _raw_spin_lock+0x33/0x40
[    1.500670]   sk_clone_lock+0x146/0x520
[    1.501030]   inet_csk_clone_lock+0x1b/0x110
[    1.501433]   tcp_create_openreq_child+0x22/0x3f0
[    1.501873]   tcp_v6_syn_recv_sock+0x96/0x940
[    1.502284]   tcp_check_req+0x137/0x660
[    1.502646]   tcp_v6_rcv+0xa63/0xe80
[    1.502994]   ip6_protocol_deliver_rcu+0x78/0x590
[    1.503434]   ip6_input_finish+0x72/0x140
[    1.503818]   __netif_receive_skb_one_core+0x63/0xa0
:
:

[    1.512311] to a SOFTIRQ-irq-unsafe lock:
[    1.512773]  (&h->lhash2[i].lock){+.+.}-{2:2}
[    1.512776]
[    1.512776] ... which became SOFTIRQ-irq-unsafe at:
[    1.513691] ...
[    1.513692]   lock_acquire+0xcd/0x330
[    1.514168]   _raw_spin_lock+0x33/0x40
[    1.514492]   __inet_hash+0x4b/0x210
[    1.514802]   inet_csk_listen_start+0xe6/0x100
[    1.515188]   inet_listen+0x95/0x1d0
[    1.515510]   __sys_listen+0x69/0xb0
[    1.515825]   __x64_sys_listen+0x14/0x20
[    1.516163]   do_syscall_64+0x3c/0x90
[    1.516481]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
:
:

[    1.560494]  ... acquired at:
[    1.560634]    lock_acquire+0xcd/0x330
[    1.560804]    _raw_spin_lock_bh+0x38/0x50
[    1.561002]    inet_unhash+0x96/0xd0
[    1.561168]    tcp_set_state+0x6a/0x210
[    1.561352]    tcp_abort+0x12b/0x230
[    1.561518]    bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
[    1.561799]    bpf_iter_run_prog+0x1ff/0x340
[    1.561990]    bpf_iter_tcp_seq_show+0xca/0x190
[    1.562193]    bpf_seq_read+0x177/0x450
[    1.562363]    vfs_read+0xc6/0x300
[    1.562516]    ksys_read+0x69/0xf0
[    1.562665]    do_syscall_64+0x3c/0x90
[    1.562838]    entry_SYSCALL_64_after_hwframe+0x72/0xdc

> 
>>  			return;
>>  		}
> 
>> @@ -788,7 +790,8 @@ void inet_unhash(struct sock *sk)
> 
>>  		__sk_nulls_del_node_init_rcu(sk);
>>  		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>> -		spin_unlock(&ilb2->lock);
>> +		if (!has_current_bpf_ctx())
>> +			spin_unlock(&ilb2->lock);
>>  	} else {
>>  		spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
> 
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc
  2023-03-21 18:45 ` [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
  2023-03-21 21:02   ` Stanislav Fomichev
@ 2023-03-21 23:43   ` Martin KaFai Lau
  2023-03-22  0:00     ` Aditi Ghag
  1 sibling, 1 reply; 25+ messages in thread
From: Martin KaFai Lau @ 2023-03-21 23:43 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, sdf, edumazet, bpf

On 3/21/23 11:45 AM, Aditi Ghag wrote:
> diff --git a/include/net/udp.h b/include/net/udp.h
> index de4b528522bb..d2999447d3f2 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
>   struct udp_iter_state {
>   	struct seq_net_private  p;
>   	int			bucket;
> +	int			offset;

All the offset changes are easier to review together with patch 1, so please move them
to patch 1.

>   	struct udp_seq_afinfo	*bpf_seq_afinfo;
>   };
>   
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1d6f165923bf..ba3e0dac119c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>   
>   	return func;
>   }
> +
> +/* Disables missing prototype warnings */
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in vmlinux BTF");
> +
> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
> + *
> + * The helper expects a non-NULL pointer to a socket. It invokes the
> + * protocol specific socket destroy handlers.
> + *
> + * The helper can only be called from BPF contexts that have acquired the socket
> + * locks.
> + *
> + * Parameters:
> + * @sock: Pointer to socket to be destroyed
> + *
> + * Return:
> + * On error, may return EPROTONOSUPPORT, EINVAL.
> + * EPROTONOSUPPORT if protocol specific destroy handler is not implemented.
> + * 0 otherwise
> + */
> +__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
> +{
> +	struct sock *sk = (struct sock *)sock;
> +
> +	if (!sk)
> +		return -EINVAL;
> +
> +	/* The locking semantics that allow for synchronous execution of the
> +	 * destroy handlers are only supported for TCP and UDP.
> +	 */
> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> +		return -EOPNOTSUPP;
> +
> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> +}
> +
> +__diag_pop()
> +
> +BTF_SET8_START(sock_destroy_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_sock_destroy)
> +BTF_SET8_END(sock_destroy_kfunc_set)
> +
> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &sock_destroy_kfunc_set,
> +};
> +
> +static int init_subsystem(void)
> +{
> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);

It still needs a way to guarantee the running context is safe to use this kfunc.
BPF_PROG_TYPE_TRACING is too broad. Trying to brainstorm ways here instead of
needing to filter by expected_attach_type. I wonder whether using KF_TRUSTED_ARGS like this:

BTF_ID_FLAGS(func, bpf_sock_destroy, KF_TRUSTED_ARGS)

is enough, or whether it needs some care to filter out BPF_TRACE_RAW_TP after looking at
prog_args_trusted(), or whether it needs another tag to specify that this kfunc also
requires the arg to be locked.
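
For reference, a sketch of how the registration from the patch would look with
that flag (same kfunc set as in the diff above; only the flag differs):

BTF_SET8_START(sock_destroy_kfunc_set)
/* KF_TRUSTED_ARGS restricts the kfunc to trusted, non-NULL pointer arguments */
BTF_ID_FLAGS(func, bpf_sock_destroy, KF_TRUSTED_ARGS)
BTF_SET8_END(sock_destroy_kfunc_set)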

> +}
> +late_initcall(init_subsystem);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 33f559f491c8..59a833c0c872 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>   		return 0;
>   	}
>   
> -	/* Don't race with userspace socket closes such as tcp_close. */
> -	lock_sock(sk);
> +	/* BPF context ensures sock locking. */
> +	if (!has_current_bpf_ctx())
> +		/* Don't race with userspace socket closes such as tcp_close. */
> +		lock_sock(sk);

This is ok.

>   
>   	if (sk->sk_state == TCP_LISTEN) {
>   		tcp_set_state(sk, TCP_CLOSE);
> @@ -4688,7 +4690,8 @@ int tcp_abort(struct sock *sk, int err)
>   
>   	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
>   	local_bh_disable();
> -	bh_lock_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		bh_lock_sock(sk);

This may or may not work, depending on whether the earlier lock_sock_fast() done in
bpf_iter_tcp_seq_show() returned true or false. That is probably a good reason to
replace the lock_sock_fast() with lock_sock() in bpf_iter_tcp_seq_show().
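
A rough sketch of that suggestion for bpf_iter_tcp_seq_show() (the surrounding
code is paraphrased here, not the exact upstream source):

	/* Take the owner lock unconditionally so that a destroy kfunc
	 * invoked from the iterator program runs as the sock lock owner,
	 * with BHs enabled.
	 */
	if (sk_fullsock(sk))
		lock_sock(sk);		/* instead of: slow = lock_sock_fast(sk); */

	/* ... run the BPF iterator program on sk ... */

	if (sk_fullsock(sk))
		release_sock(sk);	/* instead of: unlock_sock_fast(sk, slow); */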

>   
>   	if (!sock_flag(sk, SOCK_DEAD)) {
>   		sk->sk_err = err;
> @@ -4700,10 +4703,13 @@ int tcp_abort(struct sock *sk, int err)
>   		tcp_done(sk);
>   	}
>   
> -	bh_unlock_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		bh_unlock_sock(sk);
> +
>   	local_bh_enable();
>   	tcp_write_queue_purge(sk);
> -	release_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		release_sock(sk);
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(tcp_abort);


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc
  2023-03-21 23:43   ` Martin KaFai Lau
@ 2023-03-22  0:00     ` Aditi Ghag
  2023-03-22  0:36       ` Martin KaFai Lau
  0 siblings, 1 reply; 25+ messages in thread
From: Aditi Ghag @ 2023-03-22  0:00 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kafai, Stanislav Fomichev, edumazet, bpf


> On Mar 21, 2023, at 4:43 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index de4b528522bb..d2999447d3f2 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
>>  struct udp_iter_state {
>>  	struct seq_net_private  p;
>>  	int			bucket;
>> +	int			offset;
> 
> All offset change is easier to review with patch 1 together, so please move them to patch 1.

Thanks for the quick review! 

Oh boy... Absolutely! Looks like I misplaced the changes during interactive rebase. Can I fix this up in this patch itself instead of creating a new patch series? That way, I can batch things up in the next revision. 

> 
>>  	struct udp_seq_afinfo	*bpf_seq_afinfo;
>>  };
>>  diff --git a/net/core/filter.c b/net/core/filter.c
>> index 1d6f165923bf..ba3e0dac119c 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>>    	return func;
>>  }
>> +
>> +/* Disables missing prototype warnings */
>> +__diag_push();
>> +__diag_ignore_all("-Wmissing-prototypes",
>> +		  "Global functions as their definitions will be in vmlinux BTF");
>> +
>> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
>> + *
>> + * The helper expects a non-NULL pointer to a socket. It invokes the
>> + * protocol specific socket destroy handlers.
>> + *
>> + * The helper can only be called from BPF contexts that have acquired the socket
>> + * locks.
>> + *
>> + * Parameters:
>> + * @sock: Pointer to socket to be destroyed
>> + *
>> + * Return:
>> + * On error, may return EPROTONOSUPPORT, EINVAL.
>> + * EPROTONOSUPPORT if protocol specific destroy handler is not implemented.
>> + * 0 otherwise
>> + */
>> +__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
>> +{
>> +	struct sock *sk = (struct sock *)sock;
>> +
>> +	if (!sk)
>> +		return -EINVAL;
>> +
>> +	/* The locking semantics that allow for synchronous execution of the
>> +	 * destroy handlers are only supported for TCP and UDP.
>> +	 */
>> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
>> +		return -EOPNOTSUPP;
>> +
>> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>> +}
>> +
>> +__diag_pop()
>> +
>> +BTF_SET8_START(sock_destroy_kfunc_set)
>> +BTF_ID_FLAGS(func, bpf_sock_destroy)
>> +BTF_SET8_END(sock_destroy_kfunc_set)
>> +
>> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
>> +	.owner = THIS_MODULE,
>> +	.set   = &sock_destroy_kfunc_set,
>> +};
>> +
>> +static int init_subsystem(void)
>> +{
>> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
> 
> It still needs a way to guarantee the running context is safe to use this kfunc.  BPF_PROG_TYPE_TRACING is too broad. Trying to brainstorm ways here instead of needing to filter by expected_attach_type. I wonder using KF_TRUSTED_ARGS like this:
> 
> BTF_ID_FLAGS(func, bpf_sock_destroy, KF_TRUSTED_ARGS)
> 
> is enough or it needs some care to filter out BPF_TRACE_RAW_TP after looking at prog_args_trusted().
> or it needs another tag to specify this kfunc requires the arg to be locked also.
> 


Agreed. We had some open-ended discussion in the earlier patch.


>> +}
>> +late_initcall(init_subsystem);
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 33f559f491c8..59a833c0c872 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>>  		return 0;
>>  	}
>>  -	/* Don't race with userspace socket closes such as tcp_close. */
>> -	lock_sock(sk);
>> +	/* BPF context ensures sock locking. */
>> +	if (!has_current_bpf_ctx())
>> +		/* Don't race with userspace socket closes such as tcp_close. */
>> +		lock_sock(sk);
> 
> This is ok.
> 
>>    	if (sk->sk_state == TCP_LISTEN) {
>>  		tcp_set_state(sk, TCP_CLOSE);
>> @@ -4688,7 +4690,8 @@ int tcp_abort(struct sock *sk, int err)
>>    	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
>>  	local_bh_disable();
>> -	bh_lock_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		bh_lock_sock(sk);
> 
> This may or may not work, depending on the earlier lock_sock_fast() done in bpf_iter_tcp_seq_show() returned true or false. It is probably a good reason to replace the lock_sock_fast() with lock_sock() in bpf_iter_tcp_seq_show().


:) We would have to replace lock_sock_fast with lock_sock anyway, or else this causes problems in inet_unhash while destroying TCP listening sockets, see - https://lore.kernel.org/bpf/20230321184541.1857363-1-aditi.ghag@isovalent.com/T/#m0bb85df6482e7eae296b00394c482254db544748.


> 
>>    	if (!sock_flag(sk, SOCK_DEAD)) {
>>  		sk->sk_err = err;
>> @@ -4700,10 +4703,13 @@ int tcp_abort(struct sock *sk, int err)
>>  		tcp_done(sk);
>>  	}
>>  -	bh_unlock_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		bh_unlock_sock(sk);
>> +
>>  	local_bh_enable();
>>  	tcp_write_queue_purge(sk);
>> -	release_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		release_sock(sk);
>>  	return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(tcp_abort);
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context
  2023-03-21 23:39     ` Aditi Ghag
@ 2023-03-22  0:02       ` Martin KaFai Lau
  0 siblings, 0 replies; 25+ messages in thread
From: Martin KaFai Lau @ 2023-03-22  0:02 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, Eric Dumazet, Stanislav Fomichev

On 3/21/23 4:39 PM, Aditi Ghag wrote:
> 
> 
>> On Mar 21, 2023, at 2:31 PM, Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On 03/21, Aditi Ghag wrote:
>>> When sockets are destroyed in the BPF iterator context, sock
>>> lock is already acquired, so skip taking the lock. This allows
>>> TCP listening sockets to be destroyed from BPF programs.
>>
>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>> ---
>>>   net/ipv4/inet_hashtables.c | 9 ++++++---
>>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>>> index e41fdc38ce19..5543a3e0d1b4 100644
>>> --- a/net/ipv4/inet_hashtables.c
>>> +++ b/net/ipv4/inet_hashtables.c
>>> @@ -777,9 +777,11 @@ void inet_unhash(struct sock *sk)
>>>   		/* Don't disable bottom halves while acquiring the lock to
>>>   		 * avoid circular locking dependency on PREEMPT_RT.
>>>   		 */
>>> -		spin_lock(&ilb2->lock);
>>> +		if (!has_current_bpf_ctx())
>>> +			spin_lock(&ilb2->lock);
>>>   		if (sk_unhashed(sk)) {
>>> -			spin_unlock(&ilb2->lock);
>>> +			if (!has_current_bpf_ctx())
>>> +				spin_unlock(&ilb2->lock);
>>
>> That's bucket lock, why do we have to skip it?
> 
> Good catch! We shouldn't skip it. So the issue is that BPF TCP iterator acquires the sock fast lock that disables BH, and then inet_unhash acquires the bucket lock, but this is in reverse order to when the sock lock is acquired with BH enabled.
> 
> @Martin: I wonder if you ran into similar issues while working on the BPF TCP iterator batching changes for setsockopt?

The existing helpers don't need to acquire the bucket lock. afaik, the destroy kfunc
is the first. [ Unrelated note: the bh disable was recently removed for lhash
in commit 4f9bf2a2f5aa. ]

If I read this splat correctly, then similar to the earlier reply on patch 2 for a
different issue, an easy solution is to use lock_sock() instead of
lock_sock_fast() in bpf_iter_tcp_seq_show().

> 
> Here is the truncated stack trace:
> 
>      1.494510] test_progs/118 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire:
> [    1.495123] ffff9ec9c184da10 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x96/0xd0
> [    1.495867]
> [    1.495867] and this task is already holding:
> [    1.496401] ffff9ec9c23400b0 (slock-AF_INET6){+.-.}-{2:2}, at: __lock_sock_fast+0x33/0x70
> [    1.497148] which would create a new lock dependency:
> [    1.497619]  (slock-AF_INET6){+.-.}-{2:2} -> (&h->lhash2[i].lock){+.+.}-{2:2}
> [    1.498278]
> [    1.498278] but this new dependency connects a SOFTIRQ-irq-safe lock:
> [    1.499011]  (slock-AF_INET6){+.-.}-{2:2}
> [    1.499014]
> [    1.499014] ... which became SOFTIRQ-irq-safe at:
> [    1.499968]   lock_acquire+0xcd/0x330
> [    1.500316]   _raw_spin_lock+0x33/0x40
> [    1.500670]   sk_clone_lock+0x146/0x520
> [    1.501030]   inet_csk_clone_lock+0x1b/0x110
> [    1.501433]   tcp_create_openreq_child+0x22/0x3f0
> [    1.501873]   tcp_v6_syn_recv_sock+0x96/0x940
> [    1.502284]   tcp_check_req+0x137/0x660
> [    1.502646]   tcp_v6_rcv+0xa63/0xe80
> [    1.502994]   ip6_protocol_deliver_rcu+0x78/0x590
> [    1.503434]   ip6_input_finish+0x72/0x140
> [    1.503818]   __netif_receive_skb_one_core+0x63/0xa0
> :
> :
> 
> [    1.512311] to a SOFTIRQ-irq-unsafe lock:
> [    1.512773]  (&h->lhash2[i].lock){+.+.}-{2:2}
> [    1.512776]
> [    1.512776] ... which became SOFTIRQ-irq-unsafe at:
> [    1.513691] ...
> [    1.513692]   lock_acquire+0xcd/0x330
> [    1.514168]   _raw_spin_lock+0x33/0x40
> [    1.514492]   __inet_hash+0x4b/0x210
> [    1.514802]   inet_csk_listen_start+0xe6/0x100
> [    1.515188]   inet_listen+0x95/0x1d0
> [    1.515510]   __sys_listen+0x69/0xb0
> [    1.515825]   __x64_sys_listen+0x14/0x20
> [    1.516163]   do_syscall_64+0x3c/0x90
> [    1.516481]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
> :
> :
> 
> [    1.560494]  ... acquired at:
> [    1.560634]    lock_acquire+0xcd/0x330
> [    1.560804]    _raw_spin_lock_bh+0x38/0x50
> [    1.561002]    inet_unhash+0x96/0xd0
> [    1.561168]    tcp_set_state+0x6a/0x210
> [    1.561352]    tcp_abort+0x12b/0x230
> [    1.561518]    bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
> [    1.561799]    bpf_iter_run_prog+0x1ff/0x340
> [    1.561990]    bpf_iter_tcp_seq_show+0xca/0x190
> [    1.562193]    bpf_seq_read+0x177/0x450
> [    1.562363]    vfs_read+0xc6/0x300
> [    1.562516]    ksys_read+0x69/0xf0
> [    1.562665]    do_syscall_64+0x3c/0x90
> [    1.562838]    entry_SYSCALL_64_after_hwframe+0x72/0xdc
> 
>>
>>>   			return;
>>>   		}
>>
>>> @@ -788,7 +790,8 @@ void inet_unhash(struct sock *sk)
>>
>>>   		__sk_nulls_del_node_init_rcu(sk);
>>>   		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>>> -		spin_unlock(&ilb2->lock);
>>> +		if (!has_current_bpf_ctx())
>>> +			spin_unlock(&ilb2->lock);
>>>   	} else {
>>>   		spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
>>
>>> --
>>> 2.34.1
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets
  2023-03-21 18:45 ` [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets Aditi Ghag
@ 2023-03-22  0:29   ` Martin KaFai Lau
  2023-03-22  0:59     ` Aditi Ghag
  0 siblings, 1 reply; 25+ messages in thread
From: Martin KaFai Lau @ 2023-03-22  0:29 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, sdf, edumazet, bpf

On 3/21/23 11:45 AM, Aditi Ghag wrote:
> Previously, UDP listening sockets that bind'ed to a port
> weren't getting properly destroyed via udp_abort.
> Specifically, the sockets were left in the UDP hash table with
> unset source port.
> Fix the issue by unconditionally unhashing and resetting source
> port for sockets that are getting destroyed. This would mean
> that in case of sockets listening on wildcarded address and
> on a specific address with a common port, users would have to
> explicitly select the socket(s) they intend to destroy.
> 
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/ipv4/udp.c | 21 ++++++++++++++++++++-
>   1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 02d357713838..a495ac88fcee 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>   }
>   EXPORT_SYMBOL(udp_pre_connect);
>   
> +int __udp_disconnect_with_abort(struct sock *sk)
> +{
> +	struct inet_sock *inet = inet_sk(sk);
> +
> +	sk->sk_state = TCP_CLOSE;
> +	inet->inet_daddr = 0;
> +	inet->inet_dport = 0;
> +	sock_rps_reset_rxhash(sk);
> +	sk->sk_bound_dev_if = 0;
> +	inet_reset_saddr(sk);
> +	inet->inet_sport = 0;
> +	sk_dst_reset(sk);
> +	/* (TBD) In case of sockets listening on wildcard and specific address
> +	 * with a common port, socket will be removed from {hash, hash2} table.
> +	 */
> +	sk->sk_prot->unhash(sk);

hmm... not sure if I understand the use case. The idea is to force the user
space to bind() again when it gets an error from read(fd), because the source
ip/port needs to change when sending to another dst IP/port? Does it have a
usage example in the selftests?

> +	return 0;
> +}
> +
>   int __udp_disconnect(struct sock *sk, int flags)
>   {
>   	struct inet_sock *inet = inet_sk(sk);
> @@ -2937,7 +2956,7 @@ int udp_abort(struct sock *sk, int err)
>   
>   	sk->sk_err = err;
>   	sk_error_report(sk);
> -	__udp_disconnect(sk, 0);
> +	__udp_disconnect_with_abort(sk);
>   
>   out:
>   	if (!has_current_bpf_ctx())


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc
  2023-03-22  0:00     ` Aditi Ghag
@ 2023-03-22  0:36       ` Martin KaFai Lau
  0 siblings, 0 replies; 25+ messages in thread
From: Martin KaFai Lau @ 2023-03-22  0:36 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, Stanislav Fomichev, edumazet, bpf

On 3/21/23 5:00 PM, Aditi Ghag wrote:
>> n Mar 21, 2023, at 4:43 PM, Martin KaFai Lau<martin.lau@linux.dev>  wrote:
>>
>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>> diff --git a/include/net/udp.h b/include/net/udp.h
>>> index de4b528522bb..d2999447d3f2 100644
>>> --- a/include/net/udp.h
>>> +++ b/include/net/udp.h
>>> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
>>>   struct udp_iter_state {
>>>   	struct seq_net_private  p;
>>>   	int			bucket;
>>> +	int			offset;
>> All offset change is easier to review with patch 1 together, so please move them to patch 1.
> Thanks for the quick review!
> 
> Oh boy... Absolutely! Looks like I misplaced the changes during interactive rebase. Can I fix this up in this patch itself instead of creating a new patch series? That way, I can batch things up in the next revision.
> 

Instead of re-posting a single patch, it will be easier to batch up all the 
changes in one set in the next revision after the review comments in v3.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets
  2023-03-22  0:29   ` Martin KaFai Lau
@ 2023-03-22  0:59     ` Aditi Ghag
  2023-03-22 22:55       ` Martin KaFai Lau
  0 siblings, 1 reply; 25+ messages in thread
From: Aditi Ghag @ 2023-03-22  0:59 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kafai, Stanislav Fomichev, edumazet, bpf



> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>> Previously, UDP listening sockets that bind'ed to a port
>> weren't getting properly destroyed via udp_abort.
>> Specifically, the sockets were left in the UDP hash table with
>> unset source port.
>> Fix the issue by unconditionally unhashing and resetting source
>> port for sockets that are getting destroyed. This would mean
>> that in case of sockets listening on wildcarded address and
>> on a specific address with a common port, users would have to
>> explicitly select the socket(s) they intend to destroy.
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>  1 file changed, 20 insertions(+), 1 deletion(-)
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 02d357713838..a495ac88fcee 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>  }
>>  EXPORT_SYMBOL(udp_pre_connect);
>>  +int __udp_disconnect_with_abort(struct sock *sk)
>> +{
>> +	struct inet_sock *inet = inet_sk(sk);
>> +
>> +	sk->sk_state = TCP_CLOSE;
>> +	inet->inet_daddr = 0;
>> +	inet->inet_dport = 0;
>> +	sock_rps_reset_rxhash(sk);
>> +	sk->sk_bound_dev_if = 0;
>> +	inet_reset_saddr(sk);
>> +	inet->inet_sport = 0;
>> +	sk_dst_reset(sk);
>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>> +	 */
>> +	sk->sk_prot->unhash(sk);
> 
> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?


> Does it have a usage example in the selftests?

Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able to destroy listening sockets too, since we don't have any code preventing that?

I expected that when UDP listening server sockets are destroyed, they would be removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).

[1] https://lore.kernel.org/bpf/FB695169-4640-4E50-901D-84CF145765F2@isovalent.com/T/#u

> 
>> +	return 0;
>> +}
>> +
>>  int __udp_disconnect(struct sock *sk, int flags)
>>  {
>>  	struct inet_sock *inet = inet_sk(sk);
>> @@ -2937,7 +2956,7 @@ int udp_abort(struct sock *sk, int err)
>>    	sk->sk_err = err;
>>  	sk_error_report(sk);
>> -	__udp_disconnect(sk, 0);
>> +	__udp_disconnect_with_abort(sk);
>>    out:
>>  	if (!has_current_bpf_ctx())


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets
  2023-03-22  0:59     ` Aditi Ghag
@ 2023-03-22 22:55       ` Martin KaFai Lau
  2023-03-23  0:32         ` Aditi Ghag
  0 siblings, 1 reply; 25+ messages in thread
From: Martin KaFai Lau @ 2023-03-22 22:55 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, Stanislav Fomichev, edumazet, bpf

On 3/21/23 5:59 PM, Aditi Ghag wrote:
> 
> 
>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>> Previously, UDP listening sockets that bind'ed to a port
>>> weren't getting properly destroyed via udp_abort.
>>> Specifically, the sockets were left in the UDP hash table with
>>> unset source port.
>>> Fix the issue by unconditionally unhashing and resetting source
>>> port for sockets that are getting destroyed. This would mean
>>> that in case of sockets listening on wildcarded address and
>>> on a specific address with a common port, users would have to
>>> explicitly select the socket(s) they intend to destroy.
>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>> ---
>>>   net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>   1 file changed, 20 insertions(+), 1 deletion(-)
>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>> index 02d357713838..a495ac88fcee 100644
>>> --- a/net/ipv4/udp.c
>>> +++ b/net/ipv4/udp.c
>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>   }
>>>   EXPORT_SYMBOL(udp_pre_connect);
>>>   +int __udp_disconnect_with_abort(struct sock *sk)
>>> +{
>>> +	struct inet_sock *inet = inet_sk(sk);
>>> +
>>> +	sk->sk_state = TCP_CLOSE;
>>> +	inet->inet_daddr = 0;
>>> +	inet->inet_dport = 0;
>>> +	sock_rps_reset_rxhash(sk);
>>> +	sk->sk_bound_dev_if = 0;
>>> +	inet_reset_saddr(sk);
>>> +	inet->inet_sport = 0;
>>> +	sk_dst_reset(sk);
>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>> +	 */
>>> +	sk->sk_prot->unhash(sk);
>>
>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
> 
> 
>> Does it have a usage example in the selftests?
> 
> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
> 
> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).

The tcp 'client', from 'connect_to_fd()', was not bind()'ed to a particular local ip
and port. afaik, a tcp server which binds to a particular ip and port will
observe similar behavior as the udp server.

When the user space notices a read() error from a UDP server socket (because of
udp_abort), should the user space close() this udp server socket first before
opening a new one and then bind to a different src ip and src port?
Or am I still missing some pieces in the use case where there are other cgroup
bpf programs doing bind?


> 
> [1] https://lore.kernel.org/bpf/FB695169-4640-4E50-901D-84CF145765F2@isovalent.com/T/#u


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets
  2023-03-22 22:55       ` Martin KaFai Lau
@ 2023-03-23  0:32         ` Aditi Ghag
  2023-03-23  1:08           ` Martin KaFai Lau
  0 siblings, 1 reply; 25+ messages in thread
From: Aditi Ghag @ 2023-03-23  0:32 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kafai, Stanislav Fomichev, Eric Dumazet, bpf



> On Mar 22, 2023, at 3:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/21/23 5:59 PM, Aditi Ghag wrote:
>>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>> 
>>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>>> Previously, UDP listening sockets that bind'ed to a port
>>>> weren't getting properly destroyed via udp_abort.
>>>> Specifically, the sockets were left in the UDP hash table with
>>>> unset source port.
>>>> Fix the issue by unconditionally unhashing and resetting source
>>>> port for sockets that are getting destroyed. This would mean
>>>> that in case of sockets listening on wildcarded address and
>>>> on a specific address with a common port, users would have to
>>>> explicitly select the socket(s) they intend to destroy.
>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>> ---
>>>>  net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>>  1 file changed, 20 insertions(+), 1 deletion(-)
>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>> index 02d357713838..a495ac88fcee 100644
>>>> --- a/net/ipv4/udp.c
>>>> +++ b/net/ipv4/udp.c
>>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>>  }
>>>>  EXPORT_SYMBOL(udp_pre_connect);
>>>>  +int __udp_disconnect_with_abort(struct sock *sk)
>>>> +{
>>>> +	struct inet_sock *inet = inet_sk(sk);
>>>> +
>>>> +	sk->sk_state = TCP_CLOSE;
>>>> +	inet->inet_daddr = 0;
>>>> +	inet->inet_dport = 0;
>>>> +	sock_rps_reset_rxhash(sk);
>>>> +	sk->sk_bound_dev_if = 0;
>>>> +	inet_reset_saddr(sk);
>>>> +	inet->inet_sport = 0;
>>>> +	sk_dst_reset(sk);
>>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>>> +	 */
>>>> +	sk->sk_prot->unhash(sk);
>>> 
>>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
>>> Does it have a usage example in the selftests?
>> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
>> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).
> 
> The tcp 'clien', from 'connect_to_fd()', was not bind() to a particular local ip and port. afaik, the tcp server which binds to a particular ip and port will observe similar behavior as the udp server.
> 
> When the user space notices a read() error from a UDP server socket (because of udp_abort), should the user space close() this udp server socket first before opening a new one and then bind to a different src ip and src port?
> or I am still missing some pieces in the use case where there is other cgroup bpf programs doing bind?

I'm not sure if we are talking about the same selftests. It's possible that the new server related selftests are not validating the behavior in the right way.

Let's take an example of the test_udp_server test. It does the following - 

1) Start SO_REUSEPORT servers. (I hit the same issue for regular UDP listening sockets as well.)
2) Run BPF iterators that destroys the server sockets.
3) Start a regular server that binds on the same port as the ones from (1), with the expectation that it succeeds after the sockets from (1) were destroyed. The server fails to bind. Moreover, without the fix in one of the commits, the UDP sockets were lingering in the kernel hash table.

Are you suggesting that (3) is expected? Are the destroyed sockets also expected to still be present in the hash table? Are you saying that userspace needs to first close the sockets that were destroyed in the kernel when it encounters a read() error?

> 
> 
>> [1] https://lore.kernel.org/bpf/FB695169-4640-4E50-901D-84CF145765F2@isovalent.com/T/#u


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets
  2023-03-23  0:32         ` Aditi Ghag
@ 2023-03-23  1:08           ` Martin KaFai Lau
  2023-03-23  1:35             ` Aditi Ghag
  0 siblings, 1 reply; 25+ messages in thread
From: Martin KaFai Lau @ 2023-03-23  1:08 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, Stanislav Fomichev, Eric Dumazet, bpf

On 3/22/23 5:32 PM, Aditi Ghag wrote:
> 
> 
>> On Mar 22, 2023, at 3:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 3/21/23 5:59 PM, Aditi Ghag wrote:
>>>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>>>> Previously, UDP listening sockets that bind'ed to a port
>>>>> weren't getting properly destroyed via udp_abort.
>>>>> Specifically, the sockets were left in the UDP hash table with
>>>>> unset source port.
>>>>> Fix the issue by unconditionally unhashing and resetting source
>>>>> port for sockets that are getting destroyed. This would mean
>>>>> that in case of sockets listening on wildcarded address and
>>>>> on a specific address with a common port, users would have to
>>>>> explicitly select the socket(s) they intend to destroy.
>>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>>> ---
>>>>>   net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>>>   1 file changed, 20 insertions(+), 1 deletion(-)
>>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>>> index 02d357713838..a495ac88fcee 100644
>>>>> --- a/net/ipv4/udp.c
>>>>> +++ b/net/ipv4/udp.c
>>>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>>>   }
>>>>>   EXPORT_SYMBOL(udp_pre_connect);
>>>>>   +int __udp_disconnect_with_abort(struct sock *sk)
>>>>> +{
>>>>> +	struct inet_sock *inet = inet_sk(sk);
>>>>> +
>>>>> +	sk->sk_state = TCP_CLOSE;
>>>>> +	inet->inet_daddr = 0;
>>>>> +	inet->inet_dport = 0;
>>>>> +	sock_rps_reset_rxhash(sk);
>>>>> +	sk->sk_bound_dev_if = 0;
>>>>> +	inet_reset_saddr(sk);
>>>>> +	inet->inet_sport = 0;
>>>>> +	sk_dst_reset(sk);
>>>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>>>> +	 */
>>>>> +	sk->sk_prot->unhash(sk);
>>>>
>>>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
>>>> Does it have a usage example in the selftests?
>>> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
>>> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).
>>
>> The tcp 'clien', from 'connect_to_fd()', was not bind() to a particular local ip and port. afaik, the tcp server which binds to a particular ip and port will observe similar behavior as the udp server.
>>
>> When the user space notices a read() error from a UDP server socket (because of udp_abort), should the user space close() this udp server socket first before opening a new one and then bind to a different src ip and src port?
>> or I am still missing some pieces in the use case where there is other cgroup bpf programs doing bind?
> 
> I'm not sure if we are talking about the same selftests. It's possible that the new server related selftests are not validating the behavior in the right way.
> 
> Let's take an example of the test_udp_server test. It does the following -

Yep, I was looking at the test_udp_server test.

> 
> 1) Start SO_REUSEPORT servers. (I hit same issue for regular UDP listening sockets as well.)

Note that UDP has no listen(). It is just a bind().

> 2) Run BPF iterators that destroys the server sockets.

destroy does not mean it is gone from the kernel space or user space. User space 
still has a fd.

> 3) Start a regular server that binds on the same port as the ones from (1) with the expectation that it succeeds after (1) sockets were destroyed. The server fails to bind. Moreover, the UDP sockets were lingering in the kernel hash table without the fix in one of the commits.
> 
> Are you suggesting that (3) is expected?  The destroyed sockets are expected to be also present in the hash table? Are you saying that userspace needs to first close the sockets that were destroyed in the kernel when they encounter read() error?

Yes, (3) is expected (at least with the current abort behavior). The user space still
holds the fd of (1). The user space did bind(fd1) to request a specific
local addr/port and has not closed it yet. The current udp_abort
does not undo this ip/port binding, which is a sensible thing, imo.

I have been answering a lot of questions. Maybe it is time to go back to my earlier
question: why can't the user space server do a close on the fd after the
previous read() returns ECONNABORTED? Closing and then opening another socket
to do bind() again seems to be the most sensible server retry logic when
getting ECONNABORTED. test_udp_server() is also creating a new socket to do
the bind() after the earlier kfunc destroy is done, so the old fd1 is supposed
to be useless anyway.
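
In user space that retry logic would look roughly like the sketch below
(illustrative only; rebind_after_abort() is a made-up helper, not part of the
selftests):

	#include <errno.h>
	#include <unistd.h>
	#include <sys/socket.h>
	#include <netinet/in.h>

	/* On ECONNABORTED from the destroyed socket, drop the old fd and
	 * bind a fresh one to the desired local address.
	 */
	static int rebind_after_abort(int fd, const struct sockaddr_in *addr)
	{
		char buf[64];

		if (read(fd, buf, sizeof(buf)) < 0 && errno == ECONNABORTED) {
			close(fd);	/* old fd is useless after the destroy */
			fd = socket(AF_INET, SOCK_DGRAM, 0);
			if (fd < 0)
				return -1;
			if (bind(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
				close(fd);
				return -1;
			}
		}
		return fd;
	}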

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets
  2023-03-23  1:08           ` Martin KaFai Lau
@ 2023-03-23  1:35             ` Aditi Ghag
  2023-03-23  6:14               ` Aditi Ghag
  0 siblings, 1 reply; 25+ messages in thread
From: Aditi Ghag @ 2023-03-23  1:35 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kafai, Stanislav Fomichev, Eric Dumazet, bpf



> On Mar 22, 2023, at 6:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/22/23 5:32 PM, Aditi Ghag wrote:
>>> On Mar 22, 2023, at 3:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>> 
>>> On 3/21/23 5:59 PM, Aditi Ghag wrote:
>>>>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>> 
>>>>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>>>>> Previously, UDP listening sockets that bind'ed to a port
>>>>>> weren't getting properly destroyed via udp_abort.
>>>>>> Specifically, the sockets were left in the UDP hash table with
>>>>>> unset source port.
>>>>>> Fix the issue by unconditionally unhashing and resetting source
>>>>>> port for sockets that are getting destroyed. This would mean
>>>>>> that in case of sockets listening on wildcarded address and
>>>>>> on a specific address with a common port, users would have to
>>>>>> explicitly select the socket(s) they intend to destroy.
>>>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>>>> ---
>>>>>>  net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>>>>  1 file changed, 20 insertions(+), 1 deletion(-)
>>>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>>>> index 02d357713838..a495ac88fcee 100644
>>>>>> --- a/net/ipv4/udp.c
>>>>>> +++ b/net/ipv4/udp.c
>>>>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>>>>  }
>>>>>>  EXPORT_SYMBOL(udp_pre_connect);
>>>>>>  +int __udp_disconnect_with_abort(struct sock *sk)
>>>>>> +{
>>>>>> +	struct inet_sock *inet = inet_sk(sk);
>>>>>> +
>>>>>> +	sk->sk_state = TCP_CLOSE;
>>>>>> +	inet->inet_daddr = 0;
>>>>>> +	inet->inet_dport = 0;
>>>>>> +	sock_rps_reset_rxhash(sk);
>>>>>> +	sk->sk_bound_dev_if = 0;
>>>>>> +	inet_reset_saddr(sk);
>>>>>> +	inet->inet_sport = 0;
>>>>>> +	sk_dst_reset(sk);
>>>>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>>>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>>>>> +	 */
>>>>>> +	sk->sk_prot->unhash(sk);
>>>>> 
>>>>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
>>>>> Does it have a usage example in the selftests?
>>>> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
>>>> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).
>>> 
>>> The tcp 'clien', from 'connect_to_fd()', was not bind() to a particular local ip and port. afaik, the tcp server which binds to a particular ip and port will observe similar behavior as the udp server.
>>> 
>>> When the user space notices a read() error from a UDP server socket (because of udp_abort), should the user space close() this udp server socket first before opening a new one and then bind to a different src ip and src port?
>>> or I am still missing some pieces in the use case where there is other cgroup bpf programs doing bind?
>> I'm not sure if we are talking about the same selftests. It's possible that the new server related selftests are not validating the behavior in the right way.
>> Let's take an example of the test_udp_server test. It does the following -
> 
> Yep, I was looking at the test_udp_server test.
> 
>> 1) Start SO_REUSEPORT servers. (I hit same issue for regular UDP listening sockets as well.)
> 
> Note that UDP has no listen(). It is just a bind().
> 
>> 2) Run BPF iterators that destroys the server sockets.
> 
> destroy does not mean it is gone from the kernel space or user space. User space still has a fd.
> 
>> 3) Start a regular server that binds on the same port as the ones from (1) with the expectation that it succeeds after (1) sockets were destroyed. The server fails to bind. Moreover, the UDP sockets were lingering in the kernel hash table without the fix in one of the commits.
>> Are you suggesting that (3) is expected?  The destroyed sockets are expected to be also present in the hash table? Are you saying that userspace needs to first close the sockets that were destroyed in the kernel when they encounter read() error?
> 
> Yes, (3) is expected (at least the current abort behavior). The user space still holds the fd of (1). The user space did bind(fd1) to request for a specific local addr/port and the user space has not closed it yet. The current udp_abort does not undo this ip/port binding which is a sensible thing, imo.

Got it. 

> 
> I have been answering a lot of questions. May be time to go back to my earlier question,
> why the user space server cannot do a close on the fd after the previous read() return ECONNABORTED? It seems to be the most sensible server retry logic when getting ECONNABORTED and then open another socket to do bind() again. test_udp_server() is also creating a new socket to do the bind() after the earlier the kfunc destroy is done, so the old fd1 is supposed to be useless anyway.


Thanks for the clarifications. The client selftests are doing the same validations: check if read() fails, and check for the ECONNABORTED error code. So yes, we can do the same for servers as well.
I've been on the same page with userspace checking for error codes, or failures in the socket read call. It's just the bind case that tripped me up: I expected the kernel to free up source ports that the destroyed sockets had previously called bind on.

I'll update the server tests. 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets
  2023-03-23  1:35             ` Aditi Ghag
@ 2023-03-23  6:14               ` Aditi Ghag
  0 siblings, 0 replies; 25+ messages in thread
From: Aditi Ghag @ 2023-03-23  6:14 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kafai, Stanislav Fomichev, Eric Dumazet, bpf



> On Mar 22, 2023, at 6:35 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> 
> 
> 
>> On Mar 22, 2023, at 6:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>> 
>> On 3/22/23 5:32 PM, Aditi Ghag wrote:
>>>> On Mar 22, 2023, at 3:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>> 
>>>> On 3/21/23 5:59 PM, Aditi Ghag wrote:
>>>>>> On Mar 21, 2023, at 5:29 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>>> 
>>>>>> On 3/21/23 11:45 AM, Aditi Ghag wrote:
>>>>>>> Previously, UDP listening sockets that bind'ed to a port
>>>>>>> weren't getting properly destroyed via udp_abort.
>>>>>>> Specifically, the sockets were left in the UDP hash table with
>>>>>>> unset source port.
>>>>>>> Fix the issue by unconditionally unhashing and resetting source
>>>>>>> port for sockets that are getting destroyed. This would mean
>>>>>>> that in case of sockets listening on wildcarded address and
>>>>>>> on a specific address with a common port, users would have to
>>>>>>> explicitly select the socket(s) they intend to destroy.
>>>>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>>>>> ---
>>>>>>> net/ipv4/udp.c | 21 ++++++++++++++++++++-
>>>>>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>>>>> index 02d357713838..a495ac88fcee 100644
>>>>>>> --- a/net/ipv4/udp.c
>>>>>>> +++ b/net/ipv4/udp.c
>>>>>>> @@ -1965,6 +1965,25 @@ int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>>>>>>> }
>>>>>>> EXPORT_SYMBOL(udp_pre_connect);
>>>>>>> +int __udp_disconnect_with_abort(struct sock *sk)
>>>>>>> +{
>>>>>>> +	struct inet_sock *inet = inet_sk(sk);
>>>>>>> +
>>>>>>> +	sk->sk_state = TCP_CLOSE;
>>>>>>> +	inet->inet_daddr = 0;
>>>>>>> +	inet->inet_dport = 0;
>>>>>>> +	sock_rps_reset_rxhash(sk);
>>>>>>> +	sk->sk_bound_dev_if = 0;
>>>>>>> +	inet_reset_saddr(sk);
>>>>>>> +	inet->inet_sport = 0;
>>>>>>> +	sk_dst_reset(sk);
>>>>>>> +	/* (TBD) In case of sockets listening on wildcard and specific address
>>>>>>> +	 * with a common port, socket will be removed from {hash, hash2} table.
>>>>>>> +	 */
>>>>>>> +	sk->sk_prot->unhash(sk);
>>>>>> 
>>>>>> hmm... not sure if I understand the use case. The idea is to enforce the user space to bind() again when it gets error from read(fd) because the source ip/port needs to change when sending to another dst IP/port?
>>>>>> Does it have a usage example in the selftests?
>>>>> Yes, there is a new selftest case where I intend to exercise the UDP sockets batching changes (check the udp_server test case). Well, the Cilium use case is to destroy client sockets (the selftests from v1/v2 patch mirror the use case), but we would want to be able destroy listening sockets too since we don't have any code preventing that?
>>>>> I expected when UDP listening server sockets are destroyed, they are removed from the hash table, and a subsequent bind on the overlapping port would succeed? At least, I observed similar behavior for TCP sockets (minus the bind part, of course) in the test, and the connected client sockets were reset when the server sockets were destroyed. That's not what I observed for UDP listening sockets though (shared the debugging notes in the v2 patch [1]).
>>>> 
>>>> The tcp 'clien', from 'connect_to_fd()', was not bind() to a particular local ip and port. afaik, the tcp server which binds to a particular ip and port will observe similar behavior as the udp server.
>>>> 
>>>> When the user space notices a read() error from a UDP server socket (because of udp_abort), should the user space close() this udp server socket first before opening a new one and then bind to a different src ip and src port?
>>>> or I am still missing some pieces in the use case where there is other cgroup bpf programs doing bind?
>>> I'm not sure if we are talking about the same selftests. It's possible that the new server related selftests are not validating the behavior in the right way.
>>> Let's take an example of the test_udp_server test. It does the following -
>> 
>> Yep, I was looking at the test_udp_server test.
>> 
>>> 1) Start SO_REUSEPORT servers. (I hit same issue for regular UDP listening sockets as well.)
>> 
>> Note that UDP has no listen(). It is just a bind().
>> 
>>> 2) Run BPF iterators that destroys the server sockets.
>> 
>> destroy does not mean it is gone from the kernel space or user space. User space still has a fd.
>> 
>>> 3) Start a regular server that binds on the same port as the ones from (1) with the expectation that it succeeds after (1) sockets were destroyed. The server fails to bind. Moreover, the UDP sockets were lingering in the kernel hash table without the fix in one of the commits.
>>> Are you suggesting that (3) is expected?  The destroyed sockets are expected to be also present in the hash table? Are you saying that userspace needs to first close the sockets that were destroyed in the kernel when they encounter read() error?
>> 
>> Yes, (3) is expected (at least the current abort behavior). The user space still holds the fd of (1). The user space did bind(fd1) to request for a specific local addr/port and the user space has not closed it yet. The current udp_abort does not undo this ip/port binding which is a sensible thing, imo.
> 
> Got it. 
> 
>> 
>> I have been answering a lot of questions. May be time to go back to my earlier question,
>> why the user space server cannot do a close on the fd after the previous read() return ECONNABORTED? It seems to be the most sensible server retry logic when getting ECONNABORTED and then open another socket to do bind() again. test_udp_server() is also creating a new socket to do the bind() after the earlier the kfunc destroy is done, so the old fd1 is supposed to be useless anyway.
> 
> 
> Thanks for the clarifications. The client selftests are doing the same validations: check if read() fails, and check for ECONNABORTED error code. So yes, we can do the same for servers as well.
> I've been on the same page with userspace checking for error codes, or failures in the socket read call. It's just the bind case that tripped me: I expected the kernel to free up source ports that the destroyed sockets had previously called bind on.
> 
> I'll update the server tests. 


Just to close the loop on this: I updated the server tests to check for errno == ECONNABORTED, and they pass (no surprises there!). (Just to clarify, read() doesn't return ECONNABORTED, but I presume you meant errno.) The more interesting scenarios to consider are asynchronous, non-blocking I/O (e.g., epoll). In case of aborted sockets, EPOLLERR should bubble up the error condition on an fd to userspace applications. I'm not seeing related code in udp_poll that checks for sk_err though. Am I missing something, or not looking in the right place?
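
For the non-blocking case, the usual way user space consumes such an error is
to wait for EPOLLERR and then fetch the pending error with SO_ERROR. A sketch
of that pattern (standard epoll/getsockopt usage, assuming epfd and fd were set
up earlier; whether udp_poll reports sk_err here is exactly the open question):

	struct epoll_event ev;
	int err = 0;
	socklen_t len = sizeof(err);

	/* EPOLLERR is reported by epoll even if it was not requested */
	if (epoll_wait(epfd, &ev, 1, -1) == 1 && (ev.events & EPOLLERR)) {
		if (!getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) &&
		    err == ECONNABORTED) {
			/* socket was destroyed; close and re-bind as above */
		}
	}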



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc
  2023-03-21 21:02   ` Stanislav Fomichev
@ 2023-03-23 20:02     ` Aditi Ghag
  2023-03-23 21:45       ` Stanislav Fomichev
  0 siblings, 1 reply; 25+ messages in thread
From: Aditi Ghag @ 2023-03-23 20:02 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Mar 21, 2023, at 2:02 PM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On 03/21, Aditi Ghag wrote:
>> The socket destroy kfunc is used to forcefully terminate sockets from
>> certain BPF contexts. We plan to use the capability in Cilium to force
>> client sockets to reconnect when their remote load-balancing backends are
>> deleted. The other use case is on-the-fly policy enforcement where existing
>> socket connections prevented by policies need to be forcefully terminated.
>> The helper allows terminating sockets that may or may not be actively
>> sending traffic.
> 
>> The helper is currently exposed to certain BPF iterators where users can
>> filter, and terminate selected sockets.  Additionally, the helper can only
>> be called from these BPF contexts that ensure socket locking in order to
>> allow synchronous execution of destroy helpers that also acquire socket
>> locks. The previous commit that batches UDP sockets during iteration
>> facilitated a synchronous invocation of the destroy helper from BPF context
>> by skipping taking socket locks in the destroy handler. TCP iterators
>> already supported batching.
> 
>> The helper takes a `sock_common` type argument, even though it expects, and
>> casts it to, a `sock` pointer. This enables the verifier to allow the
>> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
>> `sock` structs. As a comparison, BPF helpers enable this behavior with the
>> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
>> option available with the verifier logic that handles kfuncs where BTF
>> types are inferred. Furthermore, as `sock_common` only has a subset of
>> the fields of `sock`, casting a pointer to the latter type might not
>> always be safe. Hence, the BPF kfunc converts the argument to a full sock
>> before casting.
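
As an aside, to make the calling convention above concrete: a minimal sketch of what a caller could look like from a TCP iterator. The program name, the port filter, and the serv_port variable are illustrative only and not taken from the actual selftests.

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Destination port to match; set from user space before attaching. */
const volatile __u16 serv_port = 0;

/* Declaration of the kfunc; resolved against vmlinux BTF at load time. */
int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/tcp")
int destroy_clients_by_port(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk_common = ctx->sk_common;

	if (!sk_common)
		return 0;

	/* skc_dport is stored in network byte order. */
	if (sk_common->skc_dport == bpf_htons(serv_port))
		bpf_sock_destroy(sk_common);

	return 0;
}

char _license[] SEC("license") = "GPL";

An iter/udp program would be analogous, except that it passes the iterator's UDP socket pointer, which the sock_common argument type is meant to accommodate as explained above.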
> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  include/net/udp.h |  1 +
>>  net/core/filter.c | 54 ++++++++++++++++++++++++++++++++++++++++++
>>  net/ipv4/tcp.c    | 16 +++++++++----
>>  net/ipv4/udp.c    | 60 +++++++++++++++++++++++++++++++++++++----------
>>  4 files changed, 114 insertions(+), 17 deletions(-)
> 
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index de4b528522bb..d2999447d3f2 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
>>  struct udp_iter_state {
>>  	struct seq_net_private  p;
>>  	int			bucket;
>> +	int			offset;
>>  	struct udp_seq_afinfo	*bpf_seq_afinfo;
>>  };
> 
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 1d6f165923bf..ba3e0dac119c 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
> 
>>  	return func;
>>  }
>> +
>> +/* Disables missing prototype warnings */
>> +__diag_push();
>> +__diag_ignore_all("-Wmissing-prototypes",
>> +		  "Global functions as their definitions will be in vmlinux BTF");
>> +
>> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
>> + *
>> + * The helper expects a non-NULL pointer to a socket. It invokes the
>> + * protocol specific socket destroy handlers.
>> + *
>> + * The helper can only be called from BPF contexts that have acquired the socket
>> + * locks.
>> + *
>> + * Parameters:
>> + * @sock: Pointer to socket to be destroyed
>> + *
>> + * Return:
>> + * On error, may return EOPNOTSUPP, EINVAL.
>> + * EOPNOTSUPP if protocol specific destroy handler is not implemented.
>> + * 0 otherwise
>> + */
>> +__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
>> +{
>> +	struct sock *sk = (struct sock *)sock;
>> +
>> +	if (!sk)
>> +		return -EINVAL;
>> +
>> +	/* The locking semantics that allow for synchronous execution of the
>> +	 * destroy handlers are only supported for TCP and UDP.
>> +	 */
>> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
>> +		return -EOPNOTSUPP;
> 
> What makes IPPROTO_RAW special? Looks like it locks the socket as well?

I haven't looked at the locking semantics for IPPROTO_RAW. These can be handled in a follow-up patch. Wdyt?

> 
>> +
>> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>> +}
>> +
>> +__diag_pop()
>> +
>> +BTF_SET8_START(sock_destroy_kfunc_set)
>> +BTF_ID_FLAGS(func, bpf_sock_destroy)
>> +BTF_SET8_END(sock_destroy_kfunc_set)
>> +
>> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
>> +	.owner = THIS_MODULE,
>> +	.set   = &sock_destroy_kfunc_set,
>> +};
>> +
>> +static int init_subsystem(void)
>> +{
>> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
>> +}
>> +late_initcall(init_subsystem);
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 33f559f491c8..59a833c0c872 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>>  		return 0;
>>  	}
> 
>> -	/* Don't race with userspace socket closes such as tcp_close. */
>> -	lock_sock(sk);
>> +	/* BPF context ensures sock locking. */
>> +	if (!has_current_bpf_ctx())
>> +		/* Don't race with userspace socket closes such as tcp_close. */
>> +		lock_sock(sk);
> 
>>  	if (sk->sk_state == TCP_LISTEN) {
>>  		tcp_set_state(sk, TCP_CLOSE);
>> @@ -4688,7 +4690,8 @@ int tcp_abort(struct sock *sk, int err)
> 
>>  	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
>>  	local_bh_disable();
> 
> [..]
> 
>> -	bh_lock_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		bh_lock_sock(sk);
> 
> These are spinlocks and should probably be grabbed in the bpf context as
> well?

Fixed in the next revision. Thanks!

> 
> 
>>  	if (!sock_flag(sk, SOCK_DEAD)) {
>>  		sk->sk_err = err;
>> @@ -4700,10 +4703,13 @@ int tcp_abort(struct sock *sk, int err)
>>  		tcp_done(sk);
>>  	}
> 
>> -	bh_unlock_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		bh_unlock_sock(sk);
>> +
>>  	local_bh_enable();
>>  	tcp_write_queue_purge(sk);
>> -	release_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		release_sock(sk);
>>  	return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(tcp_abort);
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 545e56329355..02d357713838 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
> 
>>  int udp_abort(struct sock *sk, int err)
>>  {
>> -	lock_sock(sk);
>> +	/* BPF context ensures sock locking. */
>> +	if (!has_current_bpf_ctx())
>> +		lock_sock(sk);
> 
>>  	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
>>  	 * with close()
>> @@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
>>  	__udp_disconnect(sk, 0);
> 
>>  out:
>> -	release_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		release_sock(sk);
> 
>>  	return 0;
>>  }
>> @@ -3184,15 +3187,23 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>  	struct sock *first_sk = NULL;
>>  	struct sock *sk;
>>  	unsigned int bucket_sks = 0;
>> -	bool first;
>>  	bool resized = false;
>> +	int offset = 0;
>> +	int new_offset;
> 
>>  	/* The current batch is done, so advance the bucket. */
>> -	if (iter->st_bucket_done)
>> +	if (iter->st_bucket_done) {
>>  		state->bucket++;
>> +		state->offset = 0;
>> +	}
> 
>>  	udptable = udp_get_table_afinfo(afinfo, net);
> 
>> +	if (state->bucket > udptable->mask) {
>> +		state->bucket = 0;
>> +		state->offset = 0;
>> +	}
>> +
>>  again:
>>  	/* New batch for the next bucket.
>>  	 * Iterate over the hash table to find a bucket with sockets matching
>> @@ -3204,43 +3215,60 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>  	iter->cur_sk = 0;
>>  	iter->end_sk = 0;
>>  	iter->st_bucket_done = false;
>> -	first = true;
>> +	first_sk = NULL;
>> +	bucket_sks = 0;
>> +	offset = state->offset;
>> +	new_offset = offset;
> 
>>  	for (; state->bucket <= udptable->mask; state->bucket++) {
>>  		struct udp_hslot *hslot = &udptable->hash[state->bucket];
> 
>> -		if (hlist_empty(&hslot->head))
>> +		if (hlist_empty(&hslot->head)) {
>> +			offset = 0;
>>  			continue;
>> +		}
> 
>>  		spin_lock_bh(&hslot->lock);
>> +		/* Resume from the last saved position in a bucket before
>> +		 * iterator was stopped.
>> +		 */
>> +		while (offset-- > 0) {
>> +			sk_for_each(sk, &hslot->head)
>> +				continue;
>> +		}
>>  		sk_for_each(sk, &hslot->head) {
>>  			if (seq_sk_match(seq, sk)) {
>> -				if (first) {
>> +				if (!first_sk)
>>  					first_sk = sk;
>> -					first = false;
>> -				}
>>  				if (iter->end_sk < iter->max_sk) {
>>  					sock_hold(sk);
>>  					iter->batch[iter->end_sk++] = sk;
>>  				}
>>  				bucket_sks++;
>>  			}
>> +			new_offset++;
>>  		}
>>  		spin_unlock_bh(&hslot->lock);
>> +
>>  		if (first_sk)
>>  			break;
>> +
>> +		/* Reset the current bucket's offset before moving to the next bucket. */
>> +		offset = 0;
>> +		new_offset = 0;
>> +
>>  	}
> 
>>  	/* All done: no batch made. */
>>  	if (!first_sk)
>> -		return NULL;
>> +		goto ret;
> 
>>  	if (iter->end_sk == bucket_sks) {
>>  		/* Batching is done for the current bucket; return the first
>>  		 * socket to be iterated from the batch.
>>  		 */
>>  		iter->st_bucket_done = true;
>> -		return first_sk;
>> +		goto ret;
>>  	}
>>  	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
>>  		resized = true;
>> @@ -3248,19 +3276,24 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>  		state->bucket--;
>>  		goto again;
>>  	}
>> +ret:
>> +	state->offset = new_offset;
>>  	return first_sk;
>>  }
> 
>>  static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>>  {
>>  	struct bpf_udp_iter_state *iter = seq->private;
>> +	struct udp_iter_state *state = &iter->state;
>>  	struct sock *sk;
> 
>>  	/* Whenever seq_next() is called, the iter->cur_sk is
>>  	 * done with seq_show(), so unref the iter->cur_sk.
>>  	 */
>> -	if (iter->cur_sk < iter->end_sk)
>> +	if (iter->cur_sk < iter->end_sk) {
>>  		sock_put(iter->batch[iter->cur_sk++]);
>> +		++state->offset;
>> +	}
> 
>>  	/* After updating iter->cur_sk, check if there are more sockets
>>  	 * available in the current bucket batch.
>> @@ -3630,6 +3663,9 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>>  	}
>>  	iter->cur_sk = 0;
>>  	iter->end_sk = 0;
>> +	iter->st_bucket_done = false;
>> +	st->bucket = 0;
>> +	st->offset = 0;
> 
>>  	return ret;
>>  }
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc
  2023-03-23 20:02     ` Aditi Ghag
@ 2023-03-23 21:45       ` Stanislav Fomichev
  0 siblings, 0 replies; 25+ messages in thread
From: Stanislav Fomichev @ 2023-03-23 21:45 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On Thu, Mar 23, 2023 at 1:02 PM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
>
>
> > On Mar 21, 2023, at 2:02 PM, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/21, Aditi Ghag wrote:
> >> The socket destroy kfunc is used to forcefully terminate sockets from
> >> certain BPF contexts. We plan to use the capability in Cilium to force
> >> client sockets to reconnect when their remote load-balancing backends are
> >> deleted. The other use case is on-the-fly policy enforcement where existing
> >> socket connections prevented by policies need to be forcefully terminated.
> >> The helper allows terminating sockets that may or may not be actively
> >> sending traffic.
> >
> >> The helper is currently exposed to certain BPF iterators where users can
> >> filter, and terminate selected sockets.  Additionally, the helper can only
> >> be called from these BPF contexts that ensure socket locking in order to
> >> allow synchronous execution of destroy helpers that also acquire socket
> >> locks. The previous commit that batches UDP sockets during iteration
> >> facilitated a synchronous invocation of the destroy helper from BPF context
> >> by skipping taking socket locks in the destroy handler. TCP iterators
> >> already supported batching.
> >
> >> The helper takes a `sock_common` type argument, even though it expects, and
> >> casts it to, a `sock` pointer. This enables the verifier to allow the
> >> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
> >> `sock` structs. As a comparison, BPF helpers enable this behavior with the
> >> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
> >> option available with the verifier logic that handles kfuncs where BTF
> >> types are inferred. Furthermore, as `sock_common` only has a subset of
> >> the fields of `sock`, casting a pointer to the latter type might not
> >> always be safe. Hence, the BPF kfunc converts the argument to a full sock
> >> before casting.
> >
> >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> >> ---
> >>  include/net/udp.h |  1 +
> >>  net/core/filter.c | 54 ++++++++++++++++++++++++++++++++++++++++++
> >>  net/ipv4/tcp.c    | 16 +++++++++----
> >>  net/ipv4/udp.c    | 60 +++++++++++++++++++++++++++++++++++++----------
> >>  4 files changed, 114 insertions(+), 17 deletions(-)
> >
> >> diff --git a/include/net/udp.h b/include/net/udp.h
> >> index de4b528522bb..d2999447d3f2 100644
> >> --- a/include/net/udp.h
> >> +++ b/include/net/udp.h
> >> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
> >>  struct udp_iter_state {
> >>      struct seq_net_private  p;
> >>      int                     bucket;
> >> +    int                     offset;
> >>      struct udp_seq_afinfo   *bpf_seq_afinfo;
> >>  };
> >
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index 1d6f165923bf..ba3e0dac119c 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> >> @@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
> >
> >>      return func;
> >>  }
> >> +
> >> +/* Disables missing prototype warnings */
> >> +__diag_push();
> >> +__diag_ignore_all("-Wmissing-prototypes",
> >> +              "Global functions as their definitions will be in vmlinux BTF");
> >> +
> >> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
> >> + *
> >> + * The helper expects a non-NULL pointer to a socket. It invokes the
> >> + * protocol specific socket destroy handlers.
> >> + *
> >> + * The helper can only be called from BPF contexts that have acquired the socket
> >> + * locks.
> >> + *
> >> + * Parameters:
> >> + * @sock: Pointer to socket to be destroyed
> >> + *
> >> + * Return:
> >> + * On error, may return EOPNOTSUPP, EINVAL.
> >> + * EOPNOTSUPP if protocol specific destroy handler is not implemented.
> >> + * 0 otherwise
> >> + */
> >> +__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
> >> +{
> >> +    struct sock *sk = (struct sock *)sock;
> >> +
> >> +    if (!sk)
> >> +            return -EINVAL;
> >> +
> >> +    /* The locking semantics that allow for synchronous execution of the
> >> +     * destroy handlers are only supported for TCP and UDP.
> >> +     */
> >> +    if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> >> +            return -EOPNOTSUPP;
> >
> > What makes IPPROTO_RAW special? Looks like it locks the socket as well?
>
> I haven't looked at the locking semantics for IPPROTO_RAW. These can be handled in a follow-up patch. Wdyt?

Fine with me. Maybe make it more opt-in? (vs current "opt ipproto_raw out")

if (sk->sk_prot->diag_destroy != udp_abort &&
    sk->sk_prot->diag_destroy != tcp_abort)
        return -EOPNOTSUPP;

Is it more robust? Or does it look uglier? )
But maybe fine as is, I'm just thinking out loud..
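
Spelled out in the kfunc, the opt-in variant would look roughly like below (same EINVAL/ECONNABORTED semantics as the posted version, only the allow-list check changes; this assumes udp_abort/tcp_abort are visible where the kfunc is defined):

__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
{
	struct sock *sk = (struct sock *)sock;

	if (!sk)
		return -EINVAL;

	/* Allow-list only the abort handlers whose locking has been
	 * audited for being called from BPF iterator context.
	 */
	if (sk->sk_prot->diag_destroy != udp_abort &&
	    sk->sk_prot->diag_destroy != tcp_abort)
		return -EOPNOTSUPP;

	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
}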

> >
> >> +
> >> +    return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> >> +}
> >> +
> >> +__diag_pop()
> >> +
> >> +BTF_SET8_START(sock_destroy_kfunc_set)
> >> +BTF_ID_FLAGS(func, bpf_sock_destroy)
> >> +BTF_SET8_END(sock_destroy_kfunc_set)
> >> +
> >> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
> >> +    .owner = THIS_MODULE,
> >> +    .set   = &sock_destroy_kfunc_set,
> >> +};
> >> +
> >> +static int init_subsystem(void)
> >> +{
> >> +    return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
> >> +}
> >> +late_initcall(init_subsystem);
> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >> index 33f559f491c8..59a833c0c872 100644
> >> --- a/net/ipv4/tcp.c
> >> +++ b/net/ipv4/tcp.c
> >> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
> >>              return 0;
> >>      }
> >
> >> -    /* Don't race with userspace socket closes such as tcp_close. */
> >> -    lock_sock(sk);
> >> +    /* BPF context ensures sock locking. */
> >> +    if (!has_current_bpf_ctx())
> >> +            /* Don't race with userspace socket closes such as tcp_close. */
> >> +            lock_sock(sk);
> >
> >>      if (sk->sk_state == TCP_LISTEN) {
> >>              tcp_set_state(sk, TCP_CLOSE);
> >> @@ -4688,7 +4690,8 @@ int tcp_abort(struct sock *sk, int err)
> >
> >>      /* Don't race with BH socket closes such as inet_csk_listen_stop. */
> >>      local_bh_disable();
> >
> > [..]
> >
> >> -    bh_lock_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            bh_lock_sock(sk);
> >
> > These are spinlocks and should probably be grabbed in the bpf context as
> > well?
>
> Fixed in the next revision. Thanks!
>
> >
> >
> >>      if (!sock_flag(sk, SOCK_DEAD)) {
> >>              sk->sk_err = err;
> >> @@ -4700,10 +4703,13 @@ int tcp_abort(struct sock *sk, int err)
> >>              tcp_done(sk);
> >>      }
> >
> >> -    bh_unlock_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            bh_unlock_sock(sk);
> >> +
> >>      local_bh_enable();
> >>      tcp_write_queue_purge(sk);
> >> -    release_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            release_sock(sk);
> >>      return 0;
> >>  }
> >>  EXPORT_SYMBOL_GPL(tcp_abort);
> >> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> >> index 545e56329355..02d357713838 100644
> >> --- a/net/ipv4/udp.c
> >> +++ b/net/ipv4/udp.c
> >> @@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
> >
> >>  int udp_abort(struct sock *sk, int err)
> >>  {
> >> -    lock_sock(sk);
> >> +    /* BPF context ensures sock locking. */
> >> +    if (!has_current_bpf_ctx())
> >> +            lock_sock(sk);
> >
> >>      /* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
> >>       * with close()
> >> @@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
> >>      __udp_disconnect(sk, 0);
> >
> >>  out:
> >> -    release_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            release_sock(sk);
> >
> >>      return 0;
> >>  }
> >> @@ -3184,15 +3187,23 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> >>      struct sock *first_sk = NULL;
> >>      struct sock *sk;
> >>      unsigned int bucket_sks = 0;
> >> -    bool first;
> >>      bool resized = false;
> >> +    int offset = 0;
> >> +    int new_offset;
> >
> >>      /* The current batch is done, so advance the bucket. */
> >> -    if (iter->st_bucket_done)
> >> +    if (iter->st_bucket_done) {
> >>              state->bucket++;
> >> +            state->offset = 0;
> >> +    }
> >
> >>      udptable = udp_get_table_afinfo(afinfo, net);
> >
> >> +    if (state->bucket > udptable->mask) {
> >> +            state->bucket = 0;
> >> +            state->offset = 0;
> >> +    }
> >> +
> >>  again:
> >>      /* New batch for the next bucket.
> >>       * Iterate over the hash table to find a bucket with sockets matching
> >> @@ -3204,43 +3215,60 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> >>      iter->cur_sk = 0;
> >>      iter->end_sk = 0;
> >>      iter->st_bucket_done = false;
> >> -    first = true;
> >> +    first_sk = NULL;
> >> +    bucket_sks = 0;
> >> +    offset = state->offset;
> >> +    new_offset = offset;
> >
> >>      for (; state->bucket <= udptable->mask; state->bucket++) {
> >>              struct udp_hslot *hslot = &udptable->hash[state->bucket];
> >
> >> -            if (hlist_empty(&hslot->head))
> >> +            if (hlist_empty(&hslot->head)) {
> >> +                    offset = 0;
> >>                      continue;
> >> +            }
> >
> >>              spin_lock_bh(&hslot->lock);
> >> +            /* Resume from the last saved position in a bucket before
> >> +             * iterator was stopped.
> >> +             */
> >> +            while (offset-- > 0) {
> >> +                    sk_for_each(sk, &hslot->head)
> >> +                            continue;
> >> +            }
> >>              sk_for_each(sk, &hslot->head) {
> >>                      if (seq_sk_match(seq, sk)) {
> >> -                            if (first) {
> >> +                            if (!first_sk)
> >>                                      first_sk = sk;
> >> -                                    first = false;
> >> -                            }
> >>                              if (iter->end_sk < iter->max_sk) {
> >>                                      sock_hold(sk);
> >>                                      iter->batch[iter->end_sk++] = sk;
> >>                              }
> >>                              bucket_sks++;
> >>                      }
> >> +                    new_offset++;
> >>              }
> >>              spin_unlock_bh(&hslot->lock);
> >> +
> >>              if (first_sk)
> >>                      break;
> >> +
> >> +            /* Reset the current bucket's offset before moving to the next bucket. */
> >> +            offset = 0;
> >> +            new_offset = 0;
> >> +
> >>      }
> >
> >>      /* All done: no batch made. */
> >>      if (!first_sk)
> >> -            return NULL;
> >> +            goto ret;
> >
> >>      if (iter->end_sk == bucket_sks) {
> >>              /* Batching is done for the current bucket; return the first
> >>               * socket to be iterated from the batch.
> >>               */
> >>              iter->st_bucket_done = true;
> >> -            return first_sk;
> >> +            goto ret;
> >>      }
> >>      if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
> >>              resized = true;
> >> @@ -3248,19 +3276,24 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> >>              state->bucket--;
> >>              goto again;
> >>      }
> >> +ret:
> >> +    state->offset = new_offset;
> >>      return first_sk;
> >>  }
> >
> >>  static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> >>  {
> >>      struct bpf_udp_iter_state *iter = seq->private;
> >> +    struct udp_iter_state *state = &iter->state;
> >>      struct sock *sk;
> >
> >>      /* Whenever seq_next() is called, the iter->cur_sk is
> >>       * done with seq_show(), so unref the iter->cur_sk.
> >>       */
> >> -    if (iter->cur_sk < iter->end_sk)
> >> +    if (iter->cur_sk < iter->end_sk) {
> >>              sock_put(iter->batch[iter->cur_sk++]);
> >> +            ++state->offset;
> >> +    }
> >
> >>      /* After updating iter->cur_sk, check if there are more sockets
> >>       * available in the current bucket batch.
> >> @@ -3630,6 +3663,9 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
> >>      }
> >>      iter->cur_sk = 0;
> >>      iter->end_sk = 0;
> >> +    iter->st_bucket_done = false;
> >> +    st->bucket = 0;
> >> +    st->offset = 0;
> >
> >>      return ret;
> >>  }
> >> --
> >> 2.34.1
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-03-23 21:46 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-21 18:45 [PATCH v3 bpf-next 0/5] bpf-next: Add socket destroy capability Aditi Ghag
2023-03-21 18:45 ` [PATCH v3 bpf-next 1/5] bpf: Implement batching in UDP iterator Aditi Ghag
2023-03-21 18:45 ` [PATCH v3 bpf-next 2/5] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
2023-03-21 21:02   ` Stanislav Fomichev
2023-03-23 20:02     ` Aditi Ghag
2023-03-23 21:45       ` Stanislav Fomichev
2023-03-21 23:43   ` Martin KaFai Lau
2023-03-22  0:00     ` Aditi Ghag
2023-03-22  0:36       ` Martin KaFai Lau
2023-03-21 18:45 ` [PATCH v3 bpf-next 3/5] [RFC] net: Skip taking lock in BPF context Aditi Ghag
2023-03-21 21:31   ` Stanislav Fomichev
2023-03-21 21:37     ` Aditi Ghag
2023-03-21 21:41       ` Eric Dumazet
2023-03-21 21:43       ` Aditi Ghag
2023-03-21 23:39     ` Aditi Ghag
2023-03-22  0:02       ` Martin KaFai Lau
2023-03-21 18:45 ` [PATCH v3 bpf-next 4/5] [RFC] udp: Fix destroying UDP listening sockets Aditi Ghag
2023-03-22  0:29   ` Martin KaFai Lau
2023-03-22  0:59     ` Aditi Ghag
2023-03-22 22:55       ` Martin KaFai Lau
2023-03-23  0:32         ` Aditi Ghag
2023-03-23  1:08           ` Martin KaFai Lau
2023-03-23  1:35             ` Aditi Ghag
2023-03-23  6:14               ` Aditi Ghag
2023-03-21 18:45 ` [PATCH 5/5] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
