* [PATCH v2 bpf-next 0/3]: Add socket destroy capability
@ 2023-02-23 21:53 Aditi Ghag
  2023-02-23 21:53 ` [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator Aditi Ghag
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Aditi Ghag @ 2023-02-23 21:53 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

This patch series adds the capability to destroy sockets in BPF. We plan to
use the capability in Cilium to force client sockets to reconnect when their
remote load-balancing backends are deleted. The other use case is on-the-fly
policy enforcement where existing socket connections disallowed by new
policies need to be terminated.

The use cases and more details about the selected approach were presented
at LPC 2022 -
https://lpc.events/event/16/contributions/1358/.
RFC discussion -
https://lore.kernel.org/netdev/CABG=zsBEh-P4NXk23eBJw7eajB5YJeRS7oPXnTAzs=yob4EMoQ@mail.gmail.com/T/#u.
v1 patch series -
https://lore.kernel.org/bpf/cover.1671242108.git.aditi.ghag@isovalent.com/

v2 highlights:
- Implemented batching support for UDP iterator.
- Converted bpf_sock_destroy helper to kfunc.
- Synchronous execution of destroy handlers to
  replace the previous workqueue implementation.
- Updated selftests to use the kfunc.

Notes to the reviewers (further details in the commit descriptions):

- I hit a snag while writing the kfunc: the verifier complained about the
`sock_common` type passed from the TCP iterator. With kfuncs, there don't
seem to be any options available to pass BTF type hints to the verifier
(equivalent of `ARG_PTR_TO_BTF_ID_SOCK_COMMON`, as was the case with the
helper). As a result, I changed the argument type of the sock_destroy
kfunc to `sock_common`. I discussed this from the point of view of the
verifier with my colleague (Dylan Reimerink): the verifier has a
`sock_common` BTF ID for a subset of socket types. However, it may not
always be safe to cast from `sock_common *` to `sock *`, so I added a check
for full sock availability in the kfunc.
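
As a quick illustration of the resulting usage (trimmed down from the
selftests in patch 3; map lookups and cookie filtering omitted), the kfunc
can be called from both iterator contexts with a `sock_common` pointer:

```
int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/tcp")
int destroy_tcp(struct bpf_iter__tcp *ctx)
{
	/* The TCP iterator context hands the program a sock_common. */
	struct sock_common *sk_common = ctx->sk_common;

	if (sk_common)
		bpf_sock_destroy(sk_common);
	return 0;
}

SEC("iter/udp")
int destroy_udp(struct bpf_iter__udp *ctx)
{
	/* The UDP iterator context hands the program a udp_sock. */
	struct sock *sk = (struct sock *)ctx->udp_sk;

	if (sk)
		bpf_sock_destroy((struct sock_common *)sk);
	return 0;
}
```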

- The `vmlinux.h` import in the selftest prog unexpectedly caused libbpf to
fail to load the program. As it turns out, the libbpf kfunc-related code
doesn't handle the BTF `FWD` type for structs. I've attached debug
information about the issue in case the loader logic can be extended to
accommodate such cases, although the error here was specific to the test
imports.
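
For reference, the workaround used in the selftest renames the
forward-declared struct while importing `vmlinux.h`, and then redefines
only the fields the program needs (see patch 3 for the full program):

```
#define sock sock___not_used
#include "vmlinux.h"
#undef sock

struct sock {
	struct sock_common	__sk_common;
#define sk_family		__sk_common.skc_family
#define sk_cookie		__sk_common.skc_cookie
};
```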

- We previously discussed the possibility of using sockmap to store sockets
to be destroyed as an optimization, so that users wouldn't need to iterate
over all the host-wide sockets. This approach needs more discussion on the
TCP side, as we may need to extend the logic that checks for certain TCP
states while inserting sockets into a sockmap. So I've skipped the selftest
cases involving sockmap from this series (same as the v1 series).


Aditi Ghag (3):
  bpf: Implement batching in UDP iterator
  bpf: Add bpf_sock_destroy kfunc
  selftests/bpf: Add tests for bpf_sock_destroy

 net/core/filter.c                             |  55 +++++
 net/ipv4/tcp.c                                |  17 +-
 net/ipv4/udp.c                                | 231 +++++++++++++++++-
 .../selftests/bpf/prog_tests/sock_destroy.c   | 125 ++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 110 +++++++++
 5 files changed, 522 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator
  2023-02-23 21:53 [PATCH v2 bpf-next 0/3]: Add socket destroy capability Aditi Ghag
@ 2023-02-23 21:53 ` Aditi Ghag
  2023-02-24 22:32   ` Stanislav Fomichev
  2023-02-28 19:58   ` Martin KaFai Lau
  2023-02-23 21:53 ` [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
  2023-02-23 21:53 ` [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
  2 siblings, 2 replies; 24+ messages in thread
From: Aditi Ghag @ 2023-02-23 21:53 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag, Martin KaFai Lau

Batch UDP sockets from the BPF iterator to allow for overlapping locking
semantics in BPF/kernel helpers executed from BPF programs. This allows the
BPF socket destroy kfunc (introduced in a follow-up patch) to be executed
from BPF iterator programs.

Previously, BPF iterators acquired the sock lock and the sockets hash table
bucket lock while executing BPF programs. This prevented BPF helpers that
also acquire these locks from being executed from BPF iterators. With the
batching approach, we acquire a bucket lock, batch all the bucket sockets,
and then release the bucket lock. This enables BPF or kernel helpers to
skip sock locking when invoked in the supported BPF contexts.

The batching logic is similar to the logic implemented in the TCP iterator:
https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
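
In outline, a batch is built as follows (simplified from the
bpf_iter_udp_batch() hunk below; batch resizing and bucket advancing
omitted):

```
spin_lock_bh(&hslot->lock);
sk_for_each(sk, &hslot->head) {
	if (seq_sk_match(seq, sk)) {
		/* Hold a reference so the socket stays valid after
		 * the bucket lock is released.
		 */
		sock_hold(sk);
		iter->batch[iter->end_sk++] = sk;
	}
}
spin_unlock_bh(&hslot->lock);
/* seq_show() then runs the BPF program on each batched socket with
 * only the sock lock held, so helpers are free to take bucket locks.
 */
```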

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/udp.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 215 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..2f3978de45f2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3152,6 +3152,141 @@ struct bpf_iter__udp {
 	int bucket __aligned(8);
 };
 
+struct bpf_udp_iter_state {
+	struct udp_iter_state state;
+	unsigned int cur_sk;
+	unsigned int end_sk;
+	unsigned int max_sk;
+	struct sock **batch;
+	bool st_bucket_done;
+};
+
+static unsigned short seq_file_family(const struct seq_file *seq);
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz);
+
+static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
+{
+	unsigned short family = seq_file_family(seq);
+
+	/* AF_UNSPEC is used as a match all */
+	return ((family == AF_UNSPEC || family == sk->sk_family) &&
+		net_eq(sock_net(sk), seq_file_net(seq)));
+}
+
+static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct udp_iter_state *state = &iter->state;
+	struct net *net = seq_file_net(seq);
+	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
+	struct udp_table *udptable;
+	struct sock *first_sk = NULL;
+	struct sock *sk;
+	unsigned int bucket_sks = 0;
+	bool first;
+	bool resized = false;
+
+	/* The current batch is done, so advance the bucket. */
+	if (iter->st_bucket_done)
+		state->bucket++;
+
+	udptable = udp_get_table_afinfo(afinfo, net);
+
+again:
+	/* New batch for the next bucket.
+	 * Iterate over the hash table to find a bucket with sockets matching
+	 * the iterator attributes, and return the first matching socket from
+	 * the bucket. The remaining matched sockets from the bucket are batched
+	 * before releasing the bucket lock. This allows BPF programs that are
+	 * called in seq_show to acquire the bucket lock if needed.
+	 */
+	iter->cur_sk = 0;
+	iter->end_sk = 0;
+	iter->st_bucket_done = false;
+	first = true;
+
+	for (; state->bucket <= udptable->mask; state->bucket++) {
+		struct udp_hslot *hslot = &udptable->hash[state->bucket];
+
+		if (hlist_empty(&hslot->head))
+			continue;
+
+		spin_lock_bh(&hslot->lock);
+		sk_for_each(sk, &hslot->head) {
+			if (seq_sk_match(seq, sk)) {
+				if (first) {
+					first_sk = sk;
+					first = false;
+				}
+				if (iter->end_sk < iter->max_sk) {
+					sock_hold(sk);
+					iter->batch[iter->end_sk++] = sk;
+				}
+				bucket_sks++;
+			}
+		}
+		spin_unlock_bh(&hslot->lock);
+		if (first_sk)
+			break;
+	}
+
+	/* All done: no batch made. */
+	if (!first_sk)
+		return NULL;
+
+	if (iter->end_sk == bucket_sks) {
+		/* Batching is done for the current bucket; return the first
+		 * socket to be iterated from the batch.
+		 */
+		iter->st_bucket_done = true;
+		return first_sk;
+	}
+	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
+		resized = true;
+		/* Go back to the previous bucket to resize its batch. */
+		state->bucket--;
+		goto again;
+	}
+	return first_sk;
+}
+
+static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct sock *sk;
+
+	/* Whenever seq_next() is called, the iter->cur_sk is
+	 * done with seq_show(), so unref the iter->cur_sk.
+	 */
+	if (iter->cur_sk < iter->end_sk)
+		sock_put(iter->batch[iter->cur_sk++]);
+
+	/* After updating iter->cur_sk, check if there are more sockets
+	 * available in the current bucket batch.
+	 */
+	if (iter->cur_sk < iter->end_sk) {
+		sk = iter->batch[iter->cur_sk];
+	} else {
+		/* Prepare a new batch. */
+		sk = bpf_iter_udp_batch(seq);
+	}
+
+	++*pos;
+	return sk;
+}
+
+static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* bpf iter does not support lseek, so it always
+	 * continues from where it was stop()-ped.
+	 */
+	if (*pos)
+		return bpf_iter_udp_batch(seq);
+
+	return SEQ_START_TOKEN;
+}
+
 static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
 			     struct udp_sock *udp_sk, uid_t uid, int bucket)
 {
@@ -3172,18 +3307,34 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 	struct bpf_prog *prog;
 	struct sock *sk = v;
 	uid_t uid;
+	bool slow;
+	int rc;
 
 	if (v == SEQ_START_TOKEN)
 		return 0;
 
+	slow = lock_sock_fast(sk);
+
+	if (unlikely(sk_unhashed(sk))) {
+		rc = SEQ_SKIP;
+		goto unlock;
+	}
+
 	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
 	meta.seq = seq;
 	prog = bpf_iter_get_info(&meta, false);
-	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+
+unlock:
+	unlock_sock_fast(sk, slow);
+	return rc;
 }
 
+static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter);
+
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 {
+	struct bpf_udp_iter_state *iter = seq->private;
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 
@@ -3194,15 +3345,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
 	}
 
-	udp_seq_stop(seq, v);
+	if (iter->cur_sk < iter->end_sk) {
+		bpf_iter_udp_unref_batch(iter);
+		iter->st_bucket_done = false;
+	}
 }
 
 static const struct seq_operations bpf_iter_udp_seq_ops = {
-	.start		= udp_seq_start,
-	.next		= udp_seq_next,
+	.start		= bpf_iter_udp_seq_start,
+	.next		= bpf_iter_udp_seq_next,
 	.stop		= bpf_iter_udp_seq_stop,
 	.show		= bpf_iter_udp_seq_show,
 };
+
+static unsigned short seq_file_family(const struct seq_file *seq)
+{
+	const struct udp_seq_afinfo *afinfo;
+
+	/* BPF iterator: bpf programs to filter sockets. */
+	if (seq->op == &bpf_iter_udp_seq_ops)
+		return AF_UNSPEC;
+
+	/* Proc fs iterator */
+	afinfo = pde_data(file_inode(seq->file));
+	return afinfo->family;
+}
 #endif
 
 const struct seq_operations udp_seq_ops = {
@@ -3413,9 +3580,38 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
 DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 		     struct udp_sock *udp_sk, uid_t uid, int bucket)
 
+static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
+{
+	while (iter->cur_sk < iter->end_sk)
+		sock_put(iter->batch[iter->cur_sk++]);
+}
+
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz)
+{
+	struct sock **new_batch;
+
+	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
+				   GFP_USER | __GFP_NOWARN);
+	if (!new_batch)
+		return -ENOMEM;
+
+	bpf_iter_udp_unref_batch(iter);
+	kvfree(iter->batch);
+	iter->batch = new_batch;
+	iter->max_sk = new_batch_sz;
+
+	return 0;
+}
+
+#define INIT_BATCH_SZ 16
+
+static void bpf_iter_fini_udp(void *priv_data);
+
 static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	struct udp_iter_state *st = priv_data;
+	struct bpf_udp_iter_state *iter = priv_data;
+	struct udp_iter_state *st = &iter->state;
 	struct udp_seq_afinfo *afinfo;
 	int ret;
 
@@ -3427,24 +3623,34 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 	afinfo->udp_table = NULL;
 	st->bpf_seq_afinfo = afinfo;
 	ret = bpf_iter_init_seq_net(priv_data, aux);
-	if (ret)
+	if (ret) {
 		kfree(afinfo);
+		return ret;
+	}
+	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
+	if (ret) {
+		bpf_iter_fini_seq_net(priv_data);
+		return ret;
+	}
+	iter->cur_sk = 0;
+	iter->end_sk = 0;
+
 	return ret;
 }
 
 static void bpf_iter_fini_udp(void *priv_data)
 {
-	struct udp_iter_state *st = priv_data;
+	struct bpf_udp_iter_state *iter = priv_data;
 
-	kfree(st->bpf_seq_afinfo);
 	bpf_iter_fini_seq_net(priv_data);
+	kfree(iter->batch);
 }
 
 static const struct bpf_iter_seq_info udp_seq_info = {
 	.seq_ops		= &bpf_iter_udp_seq_ops,
 	.init_seq_private	= bpf_iter_init_udp,
 	.fini_seq_private	= bpf_iter_fini_udp,
-	.seq_priv_size		= sizeof(struct udp_iter_state),
+	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
 };
 
 static struct bpf_iter_reg udp_reg_info = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc
  2023-02-23 21:53 [PATCH v2 bpf-next 0/3]: Add socket destroy capability Aditi Ghag
  2023-02-23 21:53 ` [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator Aditi Ghag
@ 2023-02-23 21:53 ` Aditi Ghag
  2023-02-24 22:35   ` Stanislav Fomichev
  2023-02-28 22:55   ` Martin KaFai Lau
  2023-02-23 21:53 ` [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
  2 siblings, 2 replies; 24+ messages in thread
From: Aditi Ghag @ 2023-02-23 21:53 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium to force
client sockets to reconnect when their remote load-balancing backends are
deleted. The other use case is on-the-fly policy enforcement where existing
socket connections disallowed by new policies need to be forcefully
terminated. The helper allows terminating sockets that may or may not be
actively sending traffic.

The helper is currently exposed to certain BPF iterators where users can
filter, and terminate selected sockets.  Additionally, the helper can only
be called from BPF contexts that ensure socket locking in order to allow
synchronous execution of destroy helpers that also acquire socket locks.
The previous commit that batches UDP sockets during iteration facilitated
a synchronous invocation of the destroy helper from BPF context by skipping
socket locks in the destroy handler. TCP iterators already supported
batching.

The helper takes a `sock_common` type argument, even though it expects, and
casts it to, a `sock` pointer. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and for UDP with
`sock` structs. As a comparison, BPF helpers enable this behavior with the
`ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
option available with the verifier logic that handles kfuncs, where BTF
types are inferred. Furthermore, as `sock_common` only has a subset of the
fields of `sock`, casting a pointer to the latter type might not always be
safe. Hence, the BPF kfunc converts the argument to a full sock before
acting on it.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/core/filter.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c    | 17 ++++++++++-----
 net/ipv4/udp.c    |  7 ++++--
 3 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 1d6f165923bf..79cd91ba13d0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11621,3 +11621,58 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
 
 	return func;
 }
+
+/* Disables missing prototype warnings */
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
+ *
+ * The helper expects a non-NULL pointer to a full socket. It invokes
+ * the protocol specific socket destroy handlers.
+ *
+ * The helper can only be called from BPF contexts that have acquired the socket
+ * locks.
+ *
+ * Parameters:
+ * @sock: Pointer to socket to be destroyed
+ *
+ * Return:
+ * On error, may return -EOPNOTSUPP or -EINVAL.
+ * -EOPNOTSUPP if the protocol specific destroy handler is not implemented.
+ * 0 otherwise
+ */
+int bpf_sock_destroy(struct sock_common *sock)
+{
+	/* Validates the socket can be type casted to a full socket. */
+	struct sock *sk = sk_to_full_sk((struct sock *)sock);
+
+	if (!sk)
+		return -EINVAL;
+
+	/* The locking semantics that allow for synchronous execution of the
+	 * destroy handlers are only supported for TCP and UDP.
+	 */
+	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
+		return -EOPNOTSUPP;
+
+	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
+}
+
+__diag_pop()
+
+BTF_SET8_START(sock_destroy_kfunc_set)
+BTF_ID_FLAGS(func, bpf_sock_destroy)
+BTF_SET8_END(sock_destroy_kfunc_set)
+
+static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &sock_destroy_kfunc_set,
+};
+
+static int init_subsystem(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
+}
+late_initcall(init_subsystem);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 33f559f491c8..8123c264d8ea 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
 		return 0;
 	}
 
-	/* Don't race with userspace socket closes such as tcp_close. */
-	lock_sock(sk);
+	/* BPF context ensures sock locking. */
+	if (!has_current_bpf_ctx())
+		/* Don't race with userspace socket closes such as tcp_close. */
+		lock_sock(sk);
 
 	if (sk->sk_state == TCP_LISTEN) {
 		tcp_set_state(sk, TCP_CLOSE);
@@ -4688,7 +4690,9 @@ int tcp_abort(struct sock *sk, int err)
 
 	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
 	local_bh_disable();
-	bh_lock_sock(sk);
+	if (!has_current_bpf_ctx())
+		bh_lock_sock(sk);
+
 
 	if (!sock_flag(sk, SOCK_DEAD)) {
 		sk->sk_err = err;
@@ -4700,10 +4704,13 @@ int tcp_abort(struct sock *sk, int err)
 		tcp_done(sk);
 	}
 
-	bh_unlock_sock(sk);
+	if (!has_current_bpf_ctx())
+		bh_unlock_sock(sk);
+
 	local_bh_enable();
 	tcp_write_queue_purge(sk);
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(tcp_abort);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2f3978de45f2..1bc9ad92c3d4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
 
 int udp_abort(struct sock *sk, int err)
 {
-	lock_sock(sk);
+	/* BPF context ensures sock locking. */
+	if (!has_current_bpf_ctx())
+		lock_sock(sk);
 
 	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
 	 * with close()
@@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
 	__udp_disconnect(sk, 0);
 
 out:
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy
  2023-02-23 21:53 [PATCH v2 bpf-next 0/3]: Add socket destroy capability Aditi Ghag
  2023-02-23 21:53 ` [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator Aditi Ghag
  2023-02-23 21:53 ` [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
@ 2023-02-23 21:53 ` Aditi Ghag
  2023-02-24 22:40   ` Stanislav Fomichev
                     ` (2 more replies)
  2 siblings, 3 replies; 24+ messages in thread
From: Aditi Ghag @ 2023-02-23 21:53 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

The test cases for TCP and UDP iterators mirror the intended usages of the
helper.

The destroy helpers set the `ECONNABORTED` error code that we can validate
in the test code with client sockets. But UDP sockets have an overriding
error code from the disconnect called during abort, so the error code
validation is only done for TCP sockets.

`struct sock` is redefined in the test program because vmlinux.h forward
declares the struct, and the loader fails to load the program as it finds
the BTF FWD type for the struct incompatible with the BTF STRUCT type.

Here are snippets of the verifier error and the corresponding BTF output:

```
verifier error: extern (func ksym) ...: func_proto ... incompatible with kernel

BTF for selftest prog binary:

[104] FWD 'sock' fwd_kind=struct
[70] PTR '(anon)' type_id=104
[84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
	'(anon)' type_id=70
[85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
--
[96] DATASEC '.ksyms' size=0 vlen=1
	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')

BTF for selftest vmlinux:

[74923] FUNC 'bpf_sock_destroy' type_id=48965 linkage=static
[48965] FUNC_PROTO '(anon)' ret_type_id=9 vlen=1
	'sk' type_id=1340
[1340] PTR '(anon)' type_id=2363
[2363] STRUCT 'sock' size=1280 vlen=93
```

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 .../selftests/bpf/prog_tests/sock_destroy.c   | 125 ++++++++++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 110 +++++++++++++++
 2 files changed, 235 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
new file mode 100644
index 000000000000..d9da9d3578e2
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+
+#include "sock_destroy_prog.skel.h"
+#include "network_helpers.h"
+
+#define ECONNABORTED 103
+
+static int duration;
+
+static void start_iter_sockets(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+	char buf[16] = {};
+	int iter_fd, len;
+
+	link = bpf_program__attach_iter(prog, NULL);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
+		goto free_link;
+
+	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
+		;
+	CHECK(len < 0, "read", "read failed: %s\n", strerror(errno));
+
+	close(iter_fd);
+
+free_link:
+	bpf_link__destroy(link);
+}
+
+void test_tcp(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (CHECK(serv < 0, "start_server", "failed to start server\n"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (CHECK(clien < 0, "connect_to_fd", "errno %d\n", errno))
+		goto cleanup_serv;
+
+	serv = accept(serv, NULL, NULL);
+	if (CHECK(serv < 0, "accept", "errno %d\n", errno))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (CHECK(n < 0, "client_send", "client failed to send on socket\n"))
+		goto cleanup;
+
+	start_iter_sockets(skel->progs.iter_tcp6);
+
+	n = send(clien, "t", 1, 0);
+	if (CHECK(n > 0, "client_send after destroy", "succeeded on destroyed socket\n"))
+		goto cleanup;
+	CHECK(errno != ECONNABORTED, "client_send", "unexpected error code on destroyed socket\n");
+
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+
+void test_udp(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
+	if (CHECK(serv < 0, "start_server", "failed to start server\n"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (CHECK(clien < 0, "connect_to_fd", "errno %d\n", errno))
+		goto cleanup_serv;
+
+	n = send(clien, "t", 1, 0);
+	if (CHECK(n < 0, "client_send", "client failed to send on socket\n"))
+		goto cleanup;
+
+	start_iter_sockets(skel->progs.iter_udp6);
+
+	n = send(clien, "t", 1, 0);
+	if (CHECK(n > 0, "client_send after destroy", "succeeded on destroyed socket\n"))
+		goto cleanup;
+	// UDP sockets have an overriding error code after they are disconnected.
+
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+void test_sock_destroy(void)
+{
+	int cgroup_fd = 0;
+	struct sock_destroy_prog *skel;
+
+	skel = sock_destroy_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	cgroup_fd = test__join_cgroup("/sock_destroy");
+	if (CHECK(cgroup_fd < 0, "join_cgroup", "cgroup creation failed\n"))
+		goto close_cgroup_fd;
+
+	skel->links.sock_connect = bpf_program__attach_cgroup(
+		skel->progs.sock_connect, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
+		goto close_cgroup_fd;
+
+	test_tcp(skel);
+	test_udp(skel);
+
+
+close_cgroup_fd:
+	close(cgroup_fd);
+	sock_destroy_prog__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
new file mode 100644
index 000000000000..c6805a9b7594
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define sock sock___not_used
+#include "vmlinux.h"
+#undef sock
+
+#include <bpf/bpf_helpers.h>
+
+#define AF_INET6 10
+
+/* Redefine the struct: vmlinux.h forward declares it, and the loader fails
+ * to load the program as it finds the BTF FWD type for the struct incompatible
+ * with the BTF STRUCT type.
+ */
+struct sock {
+	struct sock_common	__sk_common;
+#define sk_family		__sk_common.skc_family
+#define sk_cookie		__sk_common.skc_cookie
+};
+
+int bpf_sock_destroy(struct sock_common *sk) __ksym;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} tcp_conn_sockets SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} udp_conn_sockets SEC(".maps");
+
+SEC("cgroup/connect6")
+int sock_connect(struct bpf_sock_addr *ctx)
+{
+	int key = 0;
+	__u64 sock_cookie = 0;
+	__u32 keyc = 0;
+
+	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
+		return 1;
+
+	sock_cookie = bpf_get_socket_cookie(ctx);
+	if (ctx->protocol == IPPROTO_TCP)
+		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
+	else if (ctx->protocol == IPPROTO_UDP)
+		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
+	else
+		return 1;
+
+	return 1;
+}
+
+SEC("iter/tcp")
+int iter_tcp6(struct bpf_iter__tcp *ctx)
+{
+	struct sock_common *sk_common = ctx->sk_common;
+	struct seq_file *seq = ctx->meta->seq;
+	__u64 sock_cookie = 0;
+	__u64 *val;
+	int key = 0;
+
+	if (!sk_common)
+		return 0;
+
+	if (sk_common->skc_family != AF_INET6)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk_common);
+	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
+
+	if (!val)
+		return 0;
+
+	if (sock_cookie == *val)
+		bpf_sock_destroy(sk_common);
+
+	return 0;
+}
+
+SEC("iter/udp")
+int iter_udp6(struct bpf_iter__udp *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct udp_sock *udp_sk = ctx->udp_sk;
+	struct sock *sk = (struct sock *) udp_sk;
+	__u64 sock_cookie = 0;
+	int key = 0;
+	__u64 *val;
+
+	if (!sk)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk);
+	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
+
+	if (!val)
+		return 0;
+
+	if (sock_cookie == *val)
+		bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator
  2023-02-23 21:53 ` [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator Aditi Ghag
@ 2023-02-24 22:32   ` Stanislav Fomichev
  2023-02-28 20:32     ` Martin KaFai Lau
  2023-02-28 19:58   ` Martin KaFai Lau
  1 sibling, 1 reply; 24+ messages in thread
From: Stanislav Fomichev @ 2023-02-24 22:32 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet, Martin KaFai Lau

On 02/23, Aditi Ghag wrote:
> Batch UDP sockets from the BPF iterator to allow for overlapping locking
> semantics in BPF/kernel helpers executed from BPF programs. This allows
> the BPF socket destroy kfunc (introduced in a follow-up patch) to be
> executed from BPF iterator programs.

> Previously, BPF iterators acquired the sock lock and the sockets hash table
> bucket lock while executing BPF programs. This prevented BPF helpers that
> also acquire these locks from being executed from BPF iterators. With the
> batching approach, we acquire a bucket lock, batch all the bucket sockets,
> and then release the bucket lock. This enables BPF or kernel helpers to
> skip sock locking when invoked in the supported BPF contexts.

> The batching logic is similar to the logic implemented in the TCP iterator:
> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.

> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/ipv4/udp.c | 224 +++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 215 insertions(+), 9 deletions(-)

> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index c605d171eb2d..2f3978de45f2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3152,6 +3152,141 @@ struct bpf_iter__udp {
>   	int bucket __aligned(8);
>   };

> +struct bpf_udp_iter_state {
> +	struct udp_iter_state state;

[..]

> +	unsigned int cur_sk;
> +	unsigned int end_sk;
> +	unsigned int max_sk;
> +	struct sock **batch;
> +	bool st_bucket_done;

Any chance we can generalize some of those across tcp & udp? I haven't
looked too deep, but a lot of things look like a plain copy-paste
from tcp batching. Or not worth it?
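
(Hypothetical sketch, just to illustrate the question: the copy-pasted
iterator state could presumably live in something shared by tcp_ipv4.c
and udp.c, with protocol-specific batch-fill callbacks:

	struct bpf_sock_iter_batch {
		unsigned int cur_sk;
		unsigned int end_sk;
		unsigned int max_sk;
		struct sock **batch;
		bool st_bucket_done;
	};
)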

> +};
> +
> +static unsigned short seq_file_family(const struct seq_file *seq);
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz);
> +
> +static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
> +{
> +	unsigned short family = seq_file_family(seq);
> +
> +	/* AF_UNSPEC is used as a match all */
> +	return ((family == AF_UNSPEC || family == sk->sk_family) &&
> +		net_eq(sock_net(sk), seq_file_net(seq)));
> +}
> +
> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct net *net = seq_file_net(seq);
> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
> +	struct udp_table *udptable;
> +	struct sock *first_sk = NULL;
> +	struct sock *sk;
> +	unsigned int bucket_sks = 0;
> +	bool first;
> +	bool resized = false;
> +
> +	/* The current batch is done, so advance the bucket. */
> +	if (iter->st_bucket_done)
> +		state->bucket++;
> +
> +	udptable = udp_get_table_afinfo(afinfo, net);
> +
> +again:
> +	/* New batch for the next bucket.
> +	 * Iterate over the hash table to find a bucket with sockets matching
> +	 * the iterator attributes, and return the first matching socket from
> +	 * the bucket. The remaining matched sockets from the bucket are batched
> +	 * before releasing the bucket lock. This allows BPF programs that are
> +	 * called in seq_show to acquire the bucket lock if needed.
> +	 */
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	first = true;
> +
> +	for (; state->bucket <= udptable->mask; state->bucket++) {
> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];
> +
> +		if (hlist_empty(&hslot->head))
> +			continue;
> +
> +		spin_lock_bh(&hslot->lock);
> +		sk_for_each(sk, &hslot->head) {
> +			if (seq_sk_match(seq, sk)) {
> +				if (first) {
> +					first_sk = sk;
> +					first = false;
> +				}
> +				if (iter->end_sk < iter->max_sk) {
> +					sock_hold(sk);
> +					iter->batch[iter->end_sk++] = sk;
> +				}
> +				bucket_sks++;
> +			}
> +		}
> +		spin_unlock_bh(&hslot->lock);
> +		if (first_sk)
> +			break;
> +	}
> +
> +	/* All done: no batch made. */
> +	if (!first_sk)
> +		return NULL;
> +
> +	if (iter->end_sk == bucket_sks) {
> +		/* Batching is done for the current bucket; return the first
> +		 * socket to be iterated from the batch.
> +		 */
> +		iter->st_bucket_done = true;
> +		return first_sk;
> +	}
> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
> +		resized = true;
> +		/* Go back to the previous bucket to resize its batch. */
> +		state->bucket--;
> +		goto again;
> +	}
> +	return first_sk;
> +}
> +
> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct sock *sk;
> +
> +	/* Whenever seq_next() is called, the iter->cur_sk is
> +	 * done with seq_show(), so unref the iter->cur_sk.
> +	 */
> +	if (iter->cur_sk < iter->end_sk)
> +		sock_put(iter->batch[iter->cur_sk++]);
> +
> +	/* After updating iter->cur_sk, check if there are more sockets
> +	 * available in the current bucket batch.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sk = iter->batch[iter->cur_sk];
> +	} else {
> +		/* Prepare a new batch. */
> +		sk = bpf_iter_udp_batch(seq);
> +	}
> +
> +	++*pos;
> +	return sk;
> +}
> +
> +static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	/* bpf iter does not support lseek, so it always
> +	 * continues from where it was stop()-ped.
> +	 */
> +	if (*pos)
> +		return bpf_iter_udp_batch(seq);
> +
> +	return SEQ_START_TOKEN;
> +}
> +
>   static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
>   			     struct udp_sock *udp_sk, uid_t uid, int bucket)
>   {
> @@ -3172,18 +3307,34 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
>   	struct bpf_prog *prog;
>   	struct sock *sk = v;
>   	uid_t uid;
> +	bool slow;
> +	int rc;

>   	if (v == SEQ_START_TOKEN)
>   		return 0;

> +	slow = lock_sock_fast(sk);

Hm, I missed the fact that we're already using fast lock in the tcp batching
as well. Should we not use fast locks here? On a loaded system it's
probably fair to pay some backlog processing in the path that goes
over every socket (here)? Martin, WDYT?

> +
> +	if (unlikely(sk_unhashed(sk))) {
> +		rc = SEQ_SKIP;
> +		goto unlock;
> +	}
> +
>   	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
>   	meta.seq = seq;
>   	prog = bpf_iter_get_info(&meta, false);
> -	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +
> +unlock:
> +	unlock_sock_fast(sk, slow);
> +	return rc;
>   }

> +static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter);

Why forward declaration? Why not define the function here?

> +
>   static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   {
> +	struct bpf_udp_iter_state *iter = seq->private;
>   	struct bpf_iter_meta meta;
>   	struct bpf_prog *prog;

> @@ -3194,15 +3345,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
>   	}

> -	udp_seq_stop(seq, v);
> +	if (iter->cur_sk < iter->end_sk) {
> +		bpf_iter_udp_unref_batch(iter);
> +		iter->st_bucket_done = false;
> +	}
>   }

>   static const struct seq_operations bpf_iter_udp_seq_ops = {
> -	.start		= udp_seq_start,
> -	.next		= udp_seq_next,
> +	.start		= bpf_iter_udp_seq_start,
> +	.next		= bpf_iter_udp_seq_next,
>   	.stop		= bpf_iter_udp_seq_stop,
>   	.show		= bpf_iter_udp_seq_show,
>   };
> +
> +static unsigned short seq_file_family(const struct seq_file *seq)
> +{
> +	const struct udp_seq_afinfo *afinfo;
> +
> +	/* BPF iterator: bpf programs to filter sockets. */
> +	if (seq->op == &bpf_iter_udp_seq_ops)
> +		return AF_UNSPEC;
> +
> +	/* Proc fs iterator */
> +	afinfo = pde_data(file_inode(seq->file));
> +	return afinfo->family;
> +}
>   #endif

>   const struct seq_operations udp_seq_ops = {
> @@ -3413,9 +3580,38 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
>   DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
>   		     struct udp_sock *udp_sk, uid_t uid, int bucket)

> +static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
> +{
> +	while (iter->cur_sk < iter->end_sk)
> +		sock_put(iter->batch[iter->cur_sk++]);
> +}
> +
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz)
> +{
> +	struct sock **new_batch;
> +
> +	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
> +				   GFP_USER | __GFP_NOWARN);
> +	if (!new_batch)
> +		return -ENOMEM;
> +
> +	bpf_iter_udp_unref_batch(iter);
> +	kvfree(iter->batch);
> +	iter->batch = new_batch;
> +	iter->max_sk = new_batch_sz;
> +
> +	return 0;
> +}
> +
> +#define INIT_BATCH_SZ 16
> +
> +static void bpf_iter_fini_udp(void *priv_data);
> +
>   static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   {
> -	struct udp_iter_state *st = priv_data;
> +	struct bpf_udp_iter_state *iter = priv_data;
> +	struct udp_iter_state *st = &iter->state;
>   	struct udp_seq_afinfo *afinfo;
>   	int ret;

> @@ -3427,24 +3623,34 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   	afinfo->udp_table = NULL;
>   	st->bpf_seq_afinfo = afinfo;
>   	ret = bpf_iter_init_seq_net(priv_data, aux);
> -	if (ret)
> +	if (ret) {
>   		kfree(afinfo);
> +		return ret;
> +	}
> +	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
> +	if (ret) {
> +		bpf_iter_fini_seq_net(priv_data);

Leaking afinfo here? Since we are not freeing it from bpf_iter_fini_udp
any more? (why?)

> +		return ret;
> +	}
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +
>   	return ret;
>   }

>   static void bpf_iter_fini_udp(void *priv_data)
>   {
> -	struct udp_iter_state *st = priv_data;
> +	struct bpf_udp_iter_state *iter = priv_data;

> -	kfree(st->bpf_seq_afinfo);


>   	bpf_iter_fini_seq_net(priv_data);
> +	kfree(iter->batch);
>   }

>   static const struct bpf_iter_seq_info udp_seq_info = {
>   	.seq_ops		= &bpf_iter_udp_seq_ops,
>   	.init_seq_private	= bpf_iter_init_udp,
>   	.fini_seq_private	= bpf_iter_fini_udp,
> -	.seq_priv_size		= sizeof(struct udp_iter_state),
> +	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
>   };

>   static struct bpf_iter_reg udp_reg_info = {
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc
  2023-02-23 21:53 ` [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
@ 2023-02-24 22:35   ` Stanislav Fomichev
  2023-02-27 14:56     ` Aditi Ghag
  2023-02-28 22:55   ` Martin KaFai Lau
  1 sibling, 1 reply; 24+ messages in thread
From: Stanislav Fomichev @ 2023-02-24 22:35 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On 02/23, Aditi Ghag wrote:
> The socket destroy kfunc is used to forcefully terminate sockets from
> certain BPF contexts. We plan to use the capability in Cilium to force
> client sockets to reconnect when their remote load-balancing backends are
> deleted. The other use case is on-the-fly policy enforcement where
> existing socket connections disallowed by new policies need to be
> forcefully terminated. The helper allows terminating sockets that may or
> may not be actively sending traffic.

> The helper is currently exposed to certain BPF iterators where users can
> filter, and terminate selected sockets. Additionally, the helper can only
> be called from BPF contexts that ensure socket locking in order to allow
> synchronous execution of destroy helpers that also acquire socket locks.
> The previous commit that batches UDP sockets during iteration facilitated
> a synchronous invocation of the destroy helper from BPF context by
> skipping socket locks in the destroy handler. TCP iterators already
> supported batching.

> The helper takes a `sock_common` type argument, even though it expects,
> and casts it to, a `sock` pointer. This enables the verifier to allow the
> sock_destroy kfunc to be called for TCP with `sock_common` and for UDP
> with `sock` structs. As a comparison, BPF helpers enable this behavior
> with the `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is
> no such option available with the verifier logic that handles kfuncs,
> where BTF types are inferred. Furthermore, as `sock_common` only has a
> subset of the fields of `sock`, casting a pointer to the latter type
> might not always be safe. Hence, the BPF kfunc converts the argument to a
> full sock before acting on it.

> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/core/filter.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
>   net/ipv4/tcp.c    | 17 ++++++++++-----
>   net/ipv4/udp.c    |  7 ++++--
>   3 files changed, 72 insertions(+), 7 deletions(-)

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1d6f165923bf..79cd91ba13d0 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -11621,3 +11621,58 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)

>   	return func;
>   }
> +
> +/* Disables missing prototype warnings */
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in vmlinux BTF");
> +
> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
> + *
> + * The helper expects a non-NULL pointer to a full socket. It invokes
> + * the protocol specific socket destroy handlers.
> + *
> + * The helper can only be called from BPF contexts that have acquired the
> + * socket locks.
> + *
> + * Parameters:
> + * @sock: Pointer to socket to be destroyed
> + *
> + * Return:
> + * On error, may return -EOPNOTSUPP or -EINVAL.
> + * -EOPNOTSUPP if the protocol specific destroy handler is not implemented.
> + * 0 otherwise
> + */
> +int bpf_sock_destroy(struct sock_common *sock)

Prefix with __bpf_kfunc (see other kfuncs).

> +{
> +	/* Validates the socket can be type casted to a full socket. */
> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);
> +
> +	if (!sk)
> +		return -EINVAL;
> +
> +	/* The locking semantics that allow for synchronous execution of the
> +	 * destroy handlers are only supported for TCP and UDP.
> +	 */
> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> +		return -EOPNOTSUPP;
> +
> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> +}
> +
> +__diag_pop()
> +
> +BTF_SET8_START(sock_destroy_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_sock_destroy)
> +BTF_SET8_END(sock_destroy_kfunc_set)
> +
> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &sock_destroy_kfunc_set,
> +};
> +
> +static int init_subsystem(void)
> +{
> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);

Is it safe? Does it mean I can call bpf_sock_destroy from any tracing
program from anywhere? What if the socket is not locked?
IOW, do we have to constrain it to the iterator programs (at least for
now)?
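
(For example, presumably nothing would reject a hypothetical program like
the sketch below, where no socket lock is held at the attach point:

	SEC("fentry/tcp_retransmit_skb")
	int BPF_PROG(unsafe_destroy, struct sock *sk)
	{
		/* Hypothetical: tracing context, sk is not locked here. */
		bpf_sock_destroy((struct sock_common *)sk);
		return 0;
	}
)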

> +}
> +late_initcall(init_subsystem);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 33f559f491c8..8123c264d8ea 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>   		return 0;
>   	}

> -	/* Don't race with userspace socket closes such as tcp_close. */
> -	lock_sock(sk);
> +	/* BPF context ensures sock locking. */
> +	if (!has_current_bpf_ctx())
> +		/* Don't race with userspace socket closes such as tcp_close. */
> +		lock_sock(sk);

>   	if (sk->sk_state == TCP_LISTEN) {
>   		tcp_set_state(sk, TCP_CLOSE);
> @@ -4688,7 +4690,9 @@ int tcp_abort(struct sock *sk, int err)

>   	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
>   	local_bh_disable();
> -	bh_lock_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		bh_lock_sock(sk);
> +

>   	if (!sock_flag(sk, SOCK_DEAD)) {
>   		sk->sk_err = err;
> @@ -4700,10 +4704,13 @@ int tcp_abort(struct sock *sk, int err)
>   		tcp_done(sk);
>   	}

> -	bh_unlock_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		bh_unlock_sock(sk);
> +
>   	local_bh_enable();
>   	tcp_write_queue_purge(sk);
> -	release_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		release_sock(sk);
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(tcp_abort);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 2f3978de45f2..1bc9ad92c3d4 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);

>   int udp_abort(struct sock *sk, int err)
>   {
> -	lock_sock(sk);
> +	/* BPF context ensures sock locking. */
> +	if (!has_current_bpf_ctx())
> +		lock_sock(sk);

>   	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
>   	 * with close()
> @@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
>   	__udp_disconnect(sk, 0);

>   out:
> -	release_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		release_sock(sk);

>   	return 0;
>   }
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy
  2023-02-23 21:53 ` [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
@ 2023-02-24 22:40   ` Stanislav Fomichev
  2023-02-27 19:37   ` Andrii Nakryiko
  2023-02-28 23:08   ` Martin KaFai Lau
  2 siblings, 0 replies; 24+ messages in thread
From: Stanislav Fomichev @ 2023-02-24 22:40 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On 02/23, Aditi Ghag wrote:
> The test cases for TCP and UDP iterators mirror the intended usages of the
> helper.

> The destroy helpers set the `ECONNABORTED` error code that we can
> validate in the test code with client sockets. But UDP sockets have an
> overriding error code from the disconnect called during abort, so the
> error code validation is only done for TCP sockets.

> `struct sock` is redefined in the test program because vmlinux.h forward
> declares the struct, and the loader fails to load the program as it finds
> the BTF FWD type for the struct incompatible with the BTF STRUCT type.

> Here are snippets of the verifier error and the corresponding BTF output:

> ```
> verifier error: extern (func ksym) ...: func_proto ... incompatible with kernel

> BTF for selftest prog binary:

> [104] FWD 'sock' fwd_kind=struct
> [70] PTR '(anon)' type_id=104
> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
> 	'(anon)' type_id=70
> [85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
> --
> [96] DATASEC '.ksyms' size=0 vlen=1
> 	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')

> BTF for selftest vmlinux:

> [74923] FUNC 'bpf_sock_destroy' type_id=48965 linkage=static
> [48965] FUNC_PROTO '(anon)' ret_type_id=9 vlen=1
> 	'sk' type_id=1340
> [1340] PTR '(anon)' type_id=2363
> [2363] STRUCT 'sock' size=1280 vlen=93
> ```

> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   .../selftests/bpf/prog_tests/sock_destroy.c   | 125 ++++++++++++++++++
>   .../selftests/bpf/progs/sock_destroy_prog.c   | 110 +++++++++++++++
>   2 files changed, 235 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>   create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> new file mode 100644
> index 000000000000..d9da9d3578e2
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> @@ -0,0 +1,125 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +
> +#include "sock_destroy_prog.skel.h"
> +#include "network_helpers.h"
> +
> +#define ECONNABORTED 103
> +
> +static int duration;
> +
> +static void start_iter_sockets(struct bpf_program *prog)
> +{
> +	struct bpf_link *link;
> +	char buf[16] = {};
> +	int iter_fd, len;
> +
> +	link = bpf_program__attach_iter(prog, NULL);
> +	if (!ASSERT_OK_PTR(link, "attach_iter"))
> +		return;
> +
> +	iter_fd = bpf_iter_create(bpf_link__fd(link));
> +	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
> +		goto free_link;
> +
> +	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
> +		;
> +	CHECK(len < 0, "read", "read failed: %s\n", strerror(errno));
> +
> +	close(iter_fd);
> +
> +free_link:
> +	bpf_link__destroy(link);
> +}
> +
> +void test_tcp(struct sock_destroy_prog *skel)
> +{
> +	int serv = -1, clien = -1, n = 0;
> +
> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> +	if (CHECK(serv < 0, "start_server", "failed to start server\n"))
> +		goto cleanup_serv;
> +
> +	clien = connect_to_fd(serv, 0);
> +	if (CHECK(clien < 0, "connect_to_fd", "errno %d\n", errno))
> +		goto cleanup_serv;
> +
> +	serv = accept(serv, NULL, NULL);
> +	if (CHECK(serv < 0, "accept", "errno %d\n", errno))
> +		goto cleanup;
> +
> +	n = send(clien, "t", 1, 0);
> +	if (CHECK(n < 0, "client_send", "client failed to send on socket\n"))
> +		goto cleanup;
> +
> +	start_iter_sockets(skel->progs.iter_tcp6);
> +
> +	n = send(clien, "t", 1, 0);
> +	if (CHECK(n > 0, "client_send after destroy", "succeeded on destroyed socket\n"))
> +		goto cleanup;
> +	CHECK(errno != ECONNABORTED, "client_send", "unexpected error code on  
> destroyed socket\n");
> +
> +
> +cleanup:
> +	close(clien);
> +cleanup_serv:
> +	close(serv);
> +}
> +
> +
> +void test_udp(struct sock_destroy_prog *skel)
> +{
> +	int serv = -1, clien = -1, n = 0;
> +
> +	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
> +	if (CHECK(serv < 0, "start_server", "failed to start server\n"))
> +		goto cleanup_serv;
> +
> +	clien = connect_to_fd(serv, 0);
> +	if (CHECK(clien < 0, "connect_to_fd", "errno %d\n", errno))
> +		goto cleanup_serv;
> +
> +	n = send(clien, "t", 1, 0);
> +	if (CHECK(n < 0, "client_send", "client failed to send on socket\n"))
> +		goto cleanup;
> +
> +	start_iter_sockets(skel->progs.iter_udp6);
> +
> +	n = send(clien, "t", 1, 0);
> +	if (CHECK(n > 0, "client_send after destroy", "succeeded on destroyed socket\n"))
> +		goto cleanup;
> +	// UDP sockets have an overriding error code after they are disconnected.

C++-style comments.

> +
> +
> +cleanup:
> +	close(clien);
> +cleanup_serv:
> +	close(serv);
> +}
> +
> +void test_sock_destroy(void)
> +{
> +	int cgroup_fd = 0;
> +	struct sock_destroy_prog *skel;
> +
> +	skel = sock_destroy_prog__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		return;
> +
> +	cgroup_fd = test__join_cgroup("/sock_destroy");
> +	if (CHECK(cgroup_fd < 0, "join_cgroup", "cgroup creation failed\n"))
> +		goto close_cgroup_fd;
> +
> +	skel->links.sock_connect = bpf_program__attach_cgroup(
> +		skel->progs.sock_connect, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
> +		goto close_cgroup_fd;
> +
> +	test_tcp(skel);
> +	test_udp(skel);
> +
> +
> +close_cgroup_fd:
> +	close(cgroup_fd);
> +	sock_destroy_prog__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> new file mode 100644
> index 000000000000..c6805a9b7594
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> @@ -0,0 +1,110 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define sock sock___not_used
> +#include "vmlinux.h"
> +#undef sock
> +
> +#include <bpf/bpf_helpers.h>
> +
> +#define AF_INET6 10
> +
> +/* Redefine the struct: vmlinux.h forward declares it, and the loader fails
> + * to load the program as it finds the BTF FWD type for the struct incompatible
> + * with the BTF STRUCT type.
> + */
> +struct sock {
> +	struct sock_common	__sk_common;
> +#define sk_family		__sk_common.skc_family
> +#define sk_cookie		__sk_common.skc_cookie
> +};
> +
> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, __u64);
> +} tcp_conn_sockets SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, __u64);
> +} udp_conn_sockets SEC(".maps");
> +
> +SEC("cgroup/connect6")
> +int sock_connect(struct bpf_sock_addr *ctx)
> +{
> +	int key = 0;
> +	__u64 sock_cookie = 0;
> +	__u32 keyc = 0;
> +
> +	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
> +		return 1;
> +
> +	sock_cookie = bpf_get_socket_cookie(ctx);
> +	if (ctx->protocol == IPPROTO_TCP)
> +		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
> +	else if (ctx->protocol == IPPROTO_UDP)
> +		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
> +	else
> +		return 1;
> +
> +	return 1;
> +}
> +
> +SEC("iter/tcp")
> +int iter_tcp6(struct bpf_iter__tcp *ctx)
> +{
> +	struct sock_common *sk_common = ctx->sk_common;
> +	struct seq_file *seq = ctx->meta->seq;
> +	__u64 sock_cookie = 0;
> +	__u64 *val;
> +	int key = 0;
> +
> +	if (!sk_common)
> +		return 0;
> +
> +	if (sk_common->skc_family != AF_INET6)
> +		return 0;
> +
> +	sock_cookie  = bpf_get_socket_cookie(sk_common);
> +	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
> +
> +	if (!val)
> +		return 0;
> +
> +	if (sock_cookie == *val)
> +		bpf_sock_destroy(sk_common);
> +
> +	return 0;
> +}
> +
> +SEC("iter/udp")
> +int iter_udp6(struct bpf_iter__udp *ctx)
> +{
> +	struct seq_file *seq = ctx->meta->seq;
> +	struct udp_sock *udp_sk = ctx->udp_sk;
> +	struct sock *sk = (struct sock *) udp_sk;
> +	__u64 sock_cookie = 0;
> +	int key = 0;
> +	__u64 *val;
> +
> +	if (!sk)
> +		return 0;
> +
> +	sock_cookie  = bpf_get_socket_cookie(sk);
> +	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
> +
> +	if (!val)
> +		return 0;
> +
> +	if (sock_cookie == *val)
> +		bpf_sock_destroy((struct sock_common *)sk);
> +
> +	return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc
  2023-02-24 22:35   ` Stanislav Fomichev
@ 2023-02-27 14:56     ` Aditi Ghag
  2023-02-27 15:32       ` Aditi Ghag
  2023-02-27 17:30       ` Stanislav Fomichev
  0 siblings, 2 replies; 24+ messages in thread
From: Aditi Ghag @ 2023-02-27 14:56 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Feb 24, 2023, at 2:35 PM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On 02/23, Aditi Ghag wrote:
>> The socket destroy kfunc is used to forcefully terminate sockets from
>> certain BPF contexts. We plan to use the capability in Cilium to force
>> client sockets to reconnect when their remote load-balancing backends are
>> deleted. The other use case is on-the-fly policy enforcement where existing
>> socket connections disallowed by new policies need to be forcefully
>> terminated. The helper allows terminating sockets that may or may not be
>> actively sending traffic.
> 
>> The helper is currently exposed to certain BPF iterators where users can
>> filter, and terminate selected sockets. Additionally, the helper can only
>> be called from BPF contexts that ensure socket locking in order to allow
>> synchronous execution of destroy helpers that also acquire socket locks.
>> The previous commit that batches UDP sockets during iteration facilitated
>> a synchronous invocation of the destroy helper from BPF context by skipping
>> socket locks in the destroy handler. TCP iterators already supported
>> batching.
> 
>> The helper takes a `sock_common` type argument, even though it expects, and
>> casts it to, a `sock` pointer. This enables the verifier to allow the
>> sock_destroy kfunc to be called for TCP with `sock_common` and for UDP with
>> `sock` structs. As a comparison, BPF helpers enable this behavior with the
>> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
>> option available with the verifier logic that handles kfuncs, where BTF
>> types are inferred. Furthermore, as `sock_common` only has a subset of the
>> fields of `sock`, casting a pointer to the latter type might not always be
>> safe. Hence, the BPF kfunc converts the argument to a full sock before
>> acting on it.
> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  net/core/filter.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
>>  net/ipv4/tcp.c    | 17 ++++++++++-----
>>  net/ipv4/udp.c    |  7 ++++--
>>  3 files changed, 72 insertions(+), 7 deletions(-)
> 
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 1d6f165923bf..79cd91ba13d0 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -11621,3 +11621,58 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
> 
>>  	return func;
>>  }
>> +
>> +/* Disables missing prototype warnings */
>> +__diag_push();
>> +__diag_ignore_all("-Wmissing-prototypes",
>> +		  "Global functions as their definitions will be in vmlinux BTF");
>> +
>> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
>> + *
>> + * The helper expects a non-NULL pointer to a full socket. It invokes
>> + * the protocol specific socket destroy handlers.
>> + *
>> + * The helper can only be called from BPF contexts that have acquired the socket
>> + * locks.
>> + *
>> + * Parameters:
>> + * @sock: Pointer to socket to be destroyed
>> + *
>> + * Return:
>> + * On error, may return EOPNOTSUPP, EINVAL.
>> + * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
>> + * 0 otherwise
>> + */
>> +int bpf_sock_destroy(struct sock_common *sock)
> 
> Prefix with __bpf_kfunc (see other kfuncs).

Will do!

> 
>> +{
>> +	/* Validates the socket can be type casted to a full socket. */
>> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);
>> +
>> +	if (!sk)
>> +		return -EINVAL;
>> +
>> +	/* The locking semantics that allow for synchronous execution of the
>> +	 * destroy handlers are only supported for TCP and UDP.
>> +	 */
>> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
>> +		return -EOPNOTSUPP;
>> +
>> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>> +}
>> +
>> +__diag_pop()
>> +
>> +BTF_SET8_START(sock_destroy_kfunc_set)
>> +BTF_ID_FLAGS(func, bpf_sock_destroy)
>> +BTF_SET8_END(sock_destroy_kfunc_set)
>> +
>> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
>> +	.owner = THIS_MODULE,
>> +	.set   = &sock_destroy_kfunc_set,
>> +};
>> +
>> +static int init_subsystem(void)
>> +{
>> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
> 
> Is it safe? Does it mean I can call bpf_sock_destroy from any tracing
> program from anywhere? What if the socket is not locked?
> IOW, do we have to constrain it to the iterator programs (at least for
> now)?

Given kprobes are not considered as part of BPF_PROG_TYPE_TRACING, I'm not sure if there are other tracing programs with sock/sock_common arguments. Regardless, this is a valid point. I had brought up a similar topic earlier during the v1 discussion -  https://lore.kernel.org/bpf/78E434B0-06A9-466F-A061-B9A05DC6DE6D@isovalent.com/. I suppose you would have a similar problem in the case of setsockopt* helpers. 
Is the general topic of limiting access for kfunc to a subset of BPF_PROG_* programs being discussed?

> 
>> +}
>> +late_initcall(init_subsystem);
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 33f559f491c8..8123c264d8ea 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>>  		return 0;
>>  	}
> 
>> -	/* Don't race with userspace socket closes such as tcp_close. */
>> -	lock_sock(sk);
>> +	/* BPF context ensures sock locking. */
>> +	if (!has_current_bpf_ctx())
>> +		/* Don't race with userspace socket closes such as tcp_close. */
>> +		lock_sock(sk);
> 
>>  	if (sk->sk_state == TCP_LISTEN) {
>>  		tcp_set_state(sk, TCP_CLOSE);
>> @@ -4688,7 +4690,9 @@ int tcp_abort(struct sock *sk, int err)
> 
>>  	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
>>  	local_bh_disable();
>> -	bh_lock_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		bh_lock_sock(sk);
>> +
> 
>>  	if (!sock_flag(sk, SOCK_DEAD)) {
>>  		sk->sk_err = err;
>> @@ -4700,10 +4704,13 @@ int tcp_abort(struct sock *sk, int err)
>>  		tcp_done(sk);
>>  	}
> 
>> -	bh_unlock_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		bh_unlock_sock(sk);
>> +
>>  	local_bh_enable();
>>  	tcp_write_queue_purge(sk);
>> -	release_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		release_sock(sk);
>>  	return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(tcp_abort);
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 2f3978de45f2..1bc9ad92c3d4 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
> 
>>  int udp_abort(struct sock *sk, int err)
>>  {
>> -	lock_sock(sk);
>> +	/* BPF context ensures sock locking. */
>> +	if (!has_current_bpf_ctx())
>> +		lock_sock(sk);
> 
>>  	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
>>  	 * with close()
>> @@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
>>  	__udp_disconnect(sk, 0);
> 
>>  out:
>> -	release_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		release_sock(sk);
> 
>>  	return 0;
>>  }
>> --
>> 2.34.1



* Re: [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc
  2023-02-27 14:56     ` Aditi Ghag
@ 2023-02-27 15:32       ` Aditi Ghag
  2023-02-27 17:30       ` Stanislav Fomichev
  1 sibling, 0 replies; 24+ messages in thread
From: Aditi Ghag @ 2023-02-27 15:32 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Feb 27, 2023, at 6:56 AM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> 
> 
> 
>> On Feb 24, 2023, at 2:35 PM, Stanislav Fomichev <sdf@google.com> wrote:
>> 
>> On 02/23, Aditi Ghag wrote:
>>> The socket destroy kfunc is used to forcefully terminate sockets from
>>> certain BPF contexts. We plan to use the capability in Cilium to force
>>> client sockets to reconnect when their remote load-balancing backends are
>>> deleted. The other use case is on-the-fly policy enforcement where existing
>>> socket connections prevented by policies need to be forcefully terminated.
>>> The helper allows terminating sockets that may or may not be actively
>>> sending traffic.
>> 
>>> The helper is currently exposed to certain BPF iterators where users can
>>> filter, and terminate selected sockets.  Additionally, the helper can only
>>> be called from these BPF contexts that ensure socket locking in order to
>>> allow synchronous execution of destroy helpers that also acquire socket
>>> locks. The previous commit that batches UDP sockets during iteration
>>> facilitated a synchronous invocation of the destroy helper from BPF context
>>> by skipping taking socket locks in the destroy handler. TCP iterators
>>> already supported batching.
>> 
>>> The helper takes a `sock_common` type argument, even though it expects, and
>>> casts it to, a `sock` pointer. This enables the verifier to allow the
>>> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
>>> `sock` structs. As a comparison, BPF helpers enable this behavior with the
>>> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
>>> option available with the verifier logic that handles kfuncs where BTF
>>> types are inferred. Furthermore, as `sock_common` only has a subset of the
>>> fields of `sock`, casting a pointer to the latter type might not always be
>>> safe. Hence, the BPF kfunc converts the argument to a full sock before
>>> casting.
>> 
>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>> ---
>>> net/core/filter.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
>>> net/ipv4/tcp.c    | 17 ++++++++++-----
>>> net/ipv4/udp.c    |  7 ++++--
>>> 3 files changed, 72 insertions(+), 7 deletions(-)
>> 
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 1d6f165923bf..79cd91ba13d0 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -11621,3 +11621,58 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>> 
>>> 	return func;
>>> }
>>> +
>>> +/* Disables missing prototype warnings */
>>> +__diag_push();
>>> +__diag_ignore_all("-Wmissing-prototypes",
>>> +		  "Global functions as their definitions will be in vmlinux BTF");
>>> +
>>> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
>>> + *
>>> + * The helper expects a non-NULL pointer to a full socket. It invokes
>>> + * the protocol specific socket destroy handlers.
>>> + *
>>> + * The helper can only be called from BPF contexts that have acquired the socket
>>> + * locks.
>>> + *
>>> + * Parameters:
>>> + * @sock: Pointer to socket to be destroyed
>>> + *
>>> + * Return:
>>> + * On error, may return EOPNOTSUPP, EINVAL.
>>> + * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
>>> + * 0 otherwise
>>> + */
>>> +int bpf_sock_destroy(struct sock_common *sock)
>> 
>> Prefix with __bpf_kfunc (see other kfuncs).
> 
> Will do!
> 
>> 
>>> +{
>>> +	/* Validates the socket can be type casted to a full socket. */
>>> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);
>>> +
>>> +	if (!sk)
>>> +		return -EINVAL;
>>> +
>>> +	/* The locking semantics that allow for synchronous execution of the
>>> +	 * destroy handlers are only supported for TCP and UDP.
>>> +	 */
>>> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>>> +}
>>> +
>>> +__diag_pop()
>>> +
>>> +BTF_SET8_START(sock_destroy_kfunc_set)
>>> +BTF_ID_FLAGS(func, bpf_sock_destroy)
>>> +BTF_SET8_END(sock_destroy_kfunc_set)
>>> +
>>> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
>>> +	.owner = THIS_MODULE,
>>> +	.set   = &sock_destroy_kfunc_set,
>>> +};
>>> +
>>> +static int init_subsystem(void)
>>> +{
>>> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
>> 
>> Is it safe? Does it mean I can call bpf_sock_destroy from any tracing
>> program from anywhere? What if the socket is not locked?
>> IOW, do we have to constrain it to the iterator programs (at least for
>> now)?
> 
> Given kprobes are not considered as part of BPF_PROG_TYPE_TRACING, I'm not sure if there are other tracing programs with sock/sock_common arguments. Regardless, this is a valid point. I had brought up a similar topic earlier during the v1 discussion - https://lore.kernel.org/bpf/78E434B0-06A9-466F-A061-B9A05DC6DE6D@isovalent.com/. I suppose you would have a similar problem in the case of setsockopt* helpers. 
> Is the general topic of limiting access for kfunc to a subset of BPF_PROG_* programs being discussed?

I've submitted this as a potential topic for the lsm/mm/bpf agenda. Also, related to this discussion: could we have something similar to lockdep_sock_is_held with less overhead? 
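
(For illustration, one rough shape such a cheaper runtime check could take.
This is a sketch, not an existing kernel check, and sock_owned_by_user()
covers less than lockdep_sock_is_held() does:)

```
/* Hypothetical guard inside bpf_sock_destroy(): bail out unless the
 * caller appears to hold the socket lock in some form.
 */
if (!sock_owned_by_user(sk) && !spin_is_locked(&sk->sk_lock.slock))
	return -EINVAL;
```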

> 
>> 
>>> +}
>>> +late_initcall(init_subsystem);
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 33f559f491c8..8123c264d8ea 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>>> 		return 0;
>>> 	}
>> 
>>> -	/* Don't race with userspace socket closes such as tcp_close. */
>>> -	lock_sock(sk);
>>> +	/* BPF context ensures sock locking. */
>>> +	if (!has_current_bpf_ctx())
>>> +		/* Don't race with userspace socket closes such as tcp_close. */
>>> +		lock_sock(sk);
>> 
>>> 	if (sk->sk_state == TCP_LISTEN) {
>>> 		tcp_set_state(sk, TCP_CLOSE);
>>> @@ -4688,7 +4690,9 @@ int tcp_abort(struct sock *sk, int err)
>> 
>>> 	/* Don't race with BH socket closes such as inet_csk_listen_stop. */
>>> 	local_bh_disable();
>>> -	bh_lock_sock(sk);
>>> +	if (!has_current_bpf_ctx())
>>> +		bh_lock_sock(sk);
>>> +
>> 
>>> 	if (!sock_flag(sk, SOCK_DEAD)) {
>>> 		sk->sk_err = err;
>>> @@ -4700,10 +4704,13 @@ int tcp_abort(struct sock *sk, int err)
>>> 		tcp_done(sk);
>>> 	}
>> 
>>> -	bh_unlock_sock(sk);
>>> +	if (!has_current_bpf_ctx())
>>> +		bh_unlock_sock(sk);
>>> +
>>> 	local_bh_enable();
>>> 	tcp_write_queue_purge(sk);
>>> -	release_sock(sk);
>>> +	if (!has_current_bpf_ctx())
>>> +		release_sock(sk);
>>> 	return 0;
>>> }
>>> EXPORT_SYMBOL_GPL(tcp_abort);
>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>> index 2f3978de45f2..1bc9ad92c3d4 100644
>>> --- a/net/ipv4/udp.c
>>> +++ b/net/ipv4/udp.c
>>> @@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
>> 
>>> int udp_abort(struct sock *sk, int err)
>>> {
>>> -	lock_sock(sk);
>>> +	/* BPF context ensures sock locking. */
>>> +	if (!has_current_bpf_ctx())
>>> +		lock_sock(sk);
>> 
>>> 	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
>>> 	 * with close()
>>> @@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
>>> 	__udp_disconnect(sk, 0);
>> 
>>> out:
>>> -	release_sock(sk);
>>> +	if (!has_current_bpf_ctx())
>>> +		release_sock(sk);
>> 
>>> 	return 0;
>>> }
>>> --
>>> 2.34.1



* Re: [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc
  2023-02-27 14:56     ` Aditi Ghag
  2023-02-27 15:32       ` Aditi Ghag
@ 2023-02-27 17:30       ` Stanislav Fomichev
  1 sibling, 0 replies; 24+ messages in thread
From: Stanislav Fomichev @ 2023-02-27 17:30 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On Mon, Feb 27, 2023 at 6:56 AM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
>
>
> > On Feb 24, 2023, at 2:35 PM, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 02/23, Aditi Ghag wrote:
> >> The socket destroy kfunc is used to forcefully terminate sockets from
> >> certain BPF contexts. We plan to use the capability in Cilium to force
> >> client sockets to reconnect when their remote load-balancing backends are
> >> deleted. The other use case is on-the-fly policy enforcement where existing
> >> socket connections prevented by policies need to be forcefully terminated.
> >> The helper allows terminating sockets that may or may not be actively
> >> sending traffic.
> >
> >> The helper is currently exposed to certain BPF iterators where users can
> >> filter, and terminate selected sockets.  Additionally, the helper can only
> >> be called from these BPF contexts that ensure socket locking in order to
> >> allow synchronous execution of destroy helpers that also acquire socket
> >> locks. The previous commit that batches UDP sockets during iteration
> >> facilitated a synchronous invocation of the destroy helper from BPF context
> >> by skipping taking socket locks in the destroy handler. TCP iterators
> >> already supported batching.
> >
> >> The helper takes a `sock_common` type argument, even though it expects, and
> >> casts it to, a `sock` pointer. This enables the verifier to allow the
> >> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
> >> `sock` structs. As a comparison, BPF helpers enable this behavior with the
> >> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
> >> option available with the verifier logic that handles kfuncs where BTF
> >> types are inferred. Furthermore, as `sock_common` only has a subset of the
> >> fields of `sock`, casting a pointer to the latter type might not always be
> >> safe. Hence, the BPF kfunc converts the argument to a full sock before
> >> casting.
> >
> >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> >> ---
> >>  net/core/filter.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  net/ipv4/tcp.c    | 17 ++++++++++-----
> >>  net/ipv4/udp.c    |  7 ++++--
> >>  3 files changed, 72 insertions(+), 7 deletions(-)
> >
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index 1d6f165923bf..79cd91ba13d0 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> >> @@ -11621,3 +11621,58 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
> >
> >>      return func;
> >>  }
> >> +
> >> +/* Disables missing prototype warnings */
> >> +__diag_push();
> >> +__diag_ignore_all("-Wmissing-prototypes",
> >> +              "Global functions as their definitions will be in vmlinux BTF");
> >> +
> >> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
> >> + *
> >> + * The helper expects a non-NULL pointer to a full socket. It invokes
> >> + * the protocol specific socket destroy handlers.
> >> + *
> >> + * The helper can only be called from BPF contexts that have acquired the socket
> >> + * locks.
> >> + *
> >> + * Parameters:
> >> + * @sock: Pointer to socket to be destroyed
> >> + *
> >> + * Return:
> >> + * On error, may return EOPNOTSUPP, EINVAL.
> >> + * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
> >> + * 0 otherwise
> >> + */
> >> +int bpf_sock_destroy(struct sock_common *sock)
> >
> > Prefix with __bpf_kfunc (see other kfuncs).
>
> Will do!
>
> >
> >> +{
> >> +    /* Validates the socket can be type casted to a full socket. */
> >> +    struct sock *sk = sk_to_full_sk((struct sock *)sock);
> >> +
> >> +    if (!sk)
> >> +            return -EINVAL;
> >> +
> >> +    /* The locking semantics that allow for synchronous execution of the
> >> +     * destroy handlers are only supported for TCP and UDP.
> >> +     */
> >> +    if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> >> +            return -EOPNOTSUPP;
> >> +
> >> +    return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> >> +}
> >> +
> >> +__diag_pop()
> >> +
> >> +BTF_SET8_START(sock_destroy_kfunc_set)
> >> +BTF_ID_FLAGS(func, bpf_sock_destroy)
> >> +BTF_SET8_END(sock_destroy_kfunc_set)
> >> +
> >> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
> >> +    .owner = THIS_MODULE,
> >> +    .set   = &sock_destroy_kfunc_set,
> >> +};
> >> +
> >> +static int init_subsystem(void)
> >> +{
> >> +    return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
> >
> > Is it safe? Does it mean I can call bpf_sock_destroy from any tracing
> > program from anywhere? What if the socket is not locked?
> > IOW, do we have to constrain it to the iterator programs (at least for
> > now)?
>
> Given kprobes are not considered as part of BPF_PROG_TYPE_TRACING, I'm not sure if there are other tracing programs with sock/sock_common arguments. Regardless, this is a valid point. I had brought up a similar topic earlier during the v1 discussion -  https://lore.kernel.org/bpf/78E434B0-06A9-466F-A061-B9A05DC6DE6D@isovalent.com/. I suppose you would have a similar problem in the case of setsockopt* helpers.

Sure, the same problem exists for bpf_setsockopt helper, but these
should be exposed only in a handful of hooks (where we know that the
socket is either locked or not). See, for example, [0] as one of the
recent fixes.

> Is the general topic of limiting access for kfunc to a subset of BPF_PROG_* programs being discussed?

Some of that discussion might have happened in [1] (or one of the
earlier respins). I think at that time I was thinking maybe we can use
btf_tags to annotate __locked/__unlocked socket arguments. Then, in
those annotated contexts, the verifier might allow you to call
bpf_setsockopt.. But I haven't really explored this too much.

0: https://lore.kernel.org/bpf/20230127001732.4162630-1-kuifeng@meta.com/
1: https://lore.kernel.org/bpf/20220622160346.967594-1-sdf@google.com/
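
For illustration, a rough sketch of what such an annotation could look like.
Only clang's btf_type_tag attribute is real here; the "locked" tag, the
simplified context layout, and any verifier enforcement are hypothetical:

```
/* Hypothetical: tag the socket pointer in an iterator context as
 * known-locked, so the verifier could permit lock-requiring kfuncs
 * (bpf_setsockopt, bpf_sock_destroy) only where the tag is present.
 */
#define __locked __attribute__((btf_type_tag("locked")))

struct bpf_iter__tcp {
	struct bpf_iter_meta *meta;
	struct sock_common __locked *sk_common;
	uid_t uid;
};
```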

> >
> >> +}
> >> +late_initcall(init_subsystem);
> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >> index 33f559f491c8..8123c264d8ea 100644
> >> --- a/net/ipv4/tcp.c
> >> +++ b/net/ipv4/tcp.c
> >> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
> >>              return 0;
> >>      }
> >
> >> -    /* Don't race with userspace socket closes such as tcp_close. */
> >> -    lock_sock(sk);
> >> +    /* BPF context ensures sock locking. */
> >> +    if (!has_current_bpf_ctx())
> >> +            /* Don't race with userspace socket closes such as tcp_close. */
> >> +            lock_sock(sk);
> >
> >>      if (sk->sk_state == TCP_LISTEN) {
> >>              tcp_set_state(sk, TCP_CLOSE);
> >> @@ -4688,7 +4690,9 @@ int tcp_abort(struct sock *sk, int err)
> >
> >>      /* Don't race with BH socket closes such as inet_csk_listen_stop. */
> >>      local_bh_disable();
> >> -    bh_lock_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            bh_lock_sock(sk);
> >> +
> >
> >>      if (!sock_flag(sk, SOCK_DEAD)) {
> >>              sk->sk_err = err;
> >> @@ -4700,10 +4704,13 @@ int tcp_abort(struct sock *sk, int err)
> >>              tcp_done(sk);
> >>      }
> >
> >> -    bh_unlock_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            bh_unlock_sock(sk);
> >> +
> >>      local_bh_enable();
> >>      tcp_write_queue_purge(sk);
> >> -    release_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            release_sock(sk);
> >>      return 0;
> >>  }
> >>  EXPORT_SYMBOL_GPL(tcp_abort);
> >> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> >> index 2f3978de45f2..1bc9ad92c3d4 100644
> >> --- a/net/ipv4/udp.c
> >> +++ b/net/ipv4/udp.c
> >> @@ -2925,7 +2925,9 @@ EXPORT_SYMBOL(udp_poll);
> >
> >>  int udp_abort(struct sock *sk, int err)
> >>  {
> >> -    lock_sock(sk);
> >> +    /* BPF context ensures sock locking. */
> >> +    if (!has_current_bpf_ctx())
> >> +            lock_sock(sk);
> >
> >>      /* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
> >>       * with close()
> >> @@ -2938,7 +2940,8 @@ int udp_abort(struct sock *sk, int err)
> >>      __udp_disconnect(sk, 0);
> >
> >>  out:
> >> -    release_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            release_sock(sk);
> >
> >>      return 0;
> >>  }
> >> --
> >> 2.34.1
>


* Re: [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy
  2023-02-23 21:53 ` [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
  2023-02-24 22:40   ` Stanislav Fomichev
@ 2023-02-27 19:37   ` Andrii Nakryiko
  2023-03-03 16:00     ` Aditi Ghag
  2023-02-28 23:08   ` Martin KaFai Lau
  2 siblings, 1 reply; 24+ messages in thread
From: Andrii Nakryiko @ 2023-02-27 19:37 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, sdf, edumazet

On Thu, Feb 23, 2023 at 2:05 PM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
> The test cases for TCP and UDP iterators mirror the intended usages of the
> helper.
>
> The destroy helpers set `ECONNABORTED` error code that we can validate in the
> test code with client sockets. But UDP sockets have an overriding error code
> from the disconnect called during abort, so the error code validation is
> only done for TCP sockets.
>
> The `struct sock` is redefined as vmlinux.h forward declares the struct, and the
> loader fails to load the program as it finds the BTF FWD type for the struct
> incompatible with the BTF STRUCT type.
>
> Here are the snippets of the verifier error, and corresponding BTF output:
>
> ```
> verifier error: extern (func ksym) ...: func_proto ... incompatible with kernel
>
> BTF for selftest prog binary:
>
> [104] FWD 'sock' fwd_kind=struct
> [70] PTR '(anon)' type_id=104
> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
>         '(anon)' type_id=70
> [85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
> --
> [96] DATASEC '.ksyms' size=0 vlen=1
>         type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')
>
> BTF for selftest vmlinux:
>
> [74923] FUNC 'bpf_sock_destroy' type_id=48965 linkage=static
> [48965] FUNC_PROTO '(anon)' ret_type_id=9 vlen=1
>         'sk' type_id=1340
> [1340] PTR '(anon)' type_id=2363
> [2363] STRUCT 'sock' size=1280 vlen=93
> ```
>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>  .../selftests/bpf/prog_tests/sock_destroy.c   | 125 ++++++++++++++++++
>  .../selftests/bpf/progs/sock_destroy_prog.c   | 110 +++++++++++++++
>  2 files changed, 235 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>  create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>

[...]

> +       n = send(clien, "t", 1, 0);
> +       if (CHECK(n < 0, "client_send", "client failed to send on socket\n"))
> +               goto cleanup;
> +
> +       start_iter_sockets(skel->progs.iter_tcp6);
> +
> +       n = send(clien, "t", 1, 0);
> +       if (CHECK(n > 0, "client_send after destroy", "succeeded on destroyed socket\n"))
> +               goto cleanup;
> +       CHECK(errno != ECONNABORTED, "client_send", "unexpected error code on destroyed socket\n");
> +

please don't use CHECK() macros, prefer ASSERT_xxx() ones
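
(For reference, a sketch of the suggested conversion, assuming the ASSERT_*
macros from tools/testing/selftests/bpf/test_progs.h:)

```
n = send(clien, "t", 1, 0);
if (!ASSERT_GE(n, 0, "client_send"))
	goto cleanup;

start_iter_sockets(skel->progs.iter_tcp6);

/* send() should now fail with ECONNABORTED on the destroyed socket. */
n = send(clien, "t", 1, 0);
if (!ASSERT_LT(n, 0, "client_send after destroy"))
	goto cleanup;
ASSERT_EQ(errno, ECONNABORTED, "client_send errno");
```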

> +
> +cleanup:
> +       close(clien);
> +cleanup_serv:
> +       close(serv);
> +}
> +
> +
> +void test_udp(struct sock_destroy_prog *skel)

are these meant to be subtests? If yes, model them as such?

and in either case, make these funcs static

> +{
> +       int serv = -1, clien = -1, n = 0;
> +
> +       serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
> +       if (CHECK(serv < 0, "start_server", "failed to start server\n"))
> +               goto cleanup_serv;
> +

[...]


* Re: [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator
  2023-02-23 21:53 ` [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator Aditi Ghag
  2023-02-24 22:32   ` Stanislav Fomichev
@ 2023-02-28 19:58   ` Martin KaFai Lau
  2023-03-01  2:40     ` Aditi Ghag
  1 sibling, 1 reply; 24+ messages in thread
From: Martin KaFai Lau @ 2023-02-28 19:58 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, sdf, edumazet, Martin KaFai Lau, bpf

On 2/23/23 1:53 PM, Aditi Ghag wrote:
> +struct bpf_udp_iter_state {
> +	struct udp_iter_state state;
> +	unsigned int cur_sk;
> +	unsigned int end_sk;
> +	unsigned int max_sk;
> +	struct sock **batch;
> +	bool st_bucket_done;
> +};
> +
> +static unsigned short seq_file_family(const struct seq_file *seq);
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz);
> +
> +static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
> +{
> +	unsigned short family = seq_file_family(seq);
> +
> +	/* AF_UNSPEC is used as a match all */
> +	return ((family == AF_UNSPEC || family == sk->sk_family) &&
> +		net_eq(sock_net(sk), seq_file_net(seq)));
> +}
> +
> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct net *net = seq_file_net(seq);
> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
> +	struct udp_table *udptable;
> +	struct sock *first_sk = NULL;
> +	struct sock *sk;
> +	unsigned int bucket_sks = 0;
> +	bool first;
> +	bool resized = false;
> +
> +	/* The current batch is done, so advance the bucket. */
> +	if (iter->st_bucket_done)
> +		state->bucket++;
> +
> +	udptable = udp_get_table_afinfo(afinfo, net);
> +
> +again:
> +	/* New batch for the next bucket.
> +	 * Iterate over the hash table to find a bucket with sockets matching
> +	 * the iterator attributes, and return the first matching socket from
> +	 * the bucket. The remaining matched sockets from the bucket are batched
> +	 * before releasing the bucket lock. This allows BPF programs that are
> +	 * called in seq_show to acquire the bucket lock if needed.
> +	 */
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	first = true;
> +
> +	for (; state->bucket <= udptable->mask; state->bucket++) {
> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];

Since it is mostly separated from the proc's udp-seq-file now, may as well 
iterate the udptable->hash2, which is hashed by both addr and port, such that 
each batch should be smaller.
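
A rough sketch of that suggestion, reusing the hash2 conventions from
net/ipv4/udp.c (the batching bookkeeping is elided; treat the field and macro
usage as an assumption to be checked against udp4_lib_lookup2()):

```
for (; state->bucket <= udptable->mask; state->bucket++) {
	struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];

	if (hlist_empty(&hslot2->head))
		continue;

	spin_lock_bh(&hslot2->lock);
	/* Walks skc_portaddr_node links, i.e. the addr+port hash. */
	udp_portaddr_for_each_entry(sk, &hslot2->head) {
		if (seq_sk_match(seq, sk)) {
			/* ... batch sk as in the hunk above ... */
		}
	}
	spin_unlock_bh(&hslot2->lock);
}
```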

> +
> +		if (hlist_empty(&hslot->head))
> +			continue;
> +
> +		spin_lock_bh(&hslot->lock);
> +		sk_for_each(sk, &hslot->head) {
> +			if (seq_sk_match(seq, sk)) {
> +				if (first) {
> +					first_sk = sk;
> +					first = false;
> +				}
> +				if (iter->end_sk < iter->max_sk) {
> +					sock_hold(sk);
> +					iter->batch[iter->end_sk++] = sk;
> +				}
> +				bucket_sks++;
> +			}
> +		}
> +		spin_unlock_bh(&hslot->lock);
> +		if (first_sk)
> +			break;
> +	}
> +
> +	/* All done: no batch made. */
> +	if (!first_sk)
> +		return NULL;

I think first_sk and bucket_sks need to be reset on the "again" case also?

If bpf_iter_udp_seq_stop() is called before a batch has been fully processed by 
the bpf prog in ".show", how does the next bpf_iter_udp_seq_start() continue 
from where it left off? The bpf_tcp_iter remembers the bucket and the 
offset-in-this-bucket. I think bpf_udp_iter can do something similar.
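
A sketch of that first point against the hunk above: also reset first_sk and
bucket_sks under the again: label, so a resized retry of the same bucket
starts from a clean slate:

```
again:
	iter->cur_sk = 0;
	iter->end_sk = 0;
	iter->st_bucket_done = false;
	first = true;
	first_sk = NULL;	/* reset for the resize retry of this bucket */
	bucket_sks = 0;
```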

> +
> +	if (iter->end_sk == bucket_sks) {
> +		/* Batching is done for the current bucket; return the first
> +		 * socket to be iterated from the batch.
> +		 */
> +		iter->st_bucket_done = true;
> +		return first_sk;
> +	}
> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
> +		resized = true;
> +		/* Go back to the previous bucket to resize its batch. */
> +		state->bucket--;
> +		goto again;
> +	}
> +	return first_sk;
> +}
> +

[ ... ]

>   static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   {
> -	struct udp_iter_state *st = priv_data;
> +	struct bpf_udp_iter_state *iter = priv_data;
> +	struct udp_iter_state *st = &iter->state;
>   	struct udp_seq_afinfo *afinfo;
>   	int ret;
>   
> @@ -3427,24 +3623,34 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   	afinfo->udp_table = NULL;
>   	st->bpf_seq_afinfo = afinfo;

Is bpf_seq_afinfo still needed in 'struct udp_iter_state'? Can it be removed?

>   	ret = bpf_iter_init_seq_net(priv_data, aux);
> -	if (ret)
> +	if (ret) {
>   		kfree(afinfo);
> +		return ret;
> +	}
> +	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
> +	if (ret) {
> +		bpf_iter_fini_seq_net(priv_data);
> +		return ret;
> +	}
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +
>   	return ret;
>   }
>   
>   static void bpf_iter_fini_udp(void *priv_data)
>   {
> -	struct udp_iter_state *st = priv_data;
> +	struct bpf_udp_iter_state *iter = priv_data;
>   
> -	kfree(st->bpf_seq_afinfo);
>   	bpf_iter_fini_seq_net(priv_data);
> +	kfree(iter->batch);
kvfree (the batch should be freed with kvfree to match how it is allocated in 
bpf_iter_udp_realloc_batch)



* Re: [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator
  2023-02-24 22:32   ` Stanislav Fomichev
@ 2023-02-28 20:32     ` Martin KaFai Lau
  2023-02-28 20:52       ` Stanislav Fomichev
  0 siblings, 1 reply; 24+ messages in thread
From: Martin KaFai Lau @ 2023-02-28 20:32 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet, Martin KaFai Lau, Aditi Ghag

On 2/24/23 2:32 PM, Stanislav Fomichev wrote:
>> +    unsigned int cur_sk;
>> +    unsigned int end_sk;
>> +    unsigned int max_sk;
>> +    struct sock **batch;
>> +    bool st_bucket_done;
> 
> Any change we can generalize some of those across tcp & udp? I haven't
> looked too deep, but a lot of things look like a plain copy-paste
> from tcp batching. Or not worth it?

The batching has some small but subtle differences between tcp and udp, so I am 
not sure if it can end up sharing enough code.

>>   static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
>>                    struct udp_sock *udp_sk, uid_t uid, int bucket)
>>   {
>> @@ -3172,18 +3307,34 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, 
>> void *v)
>>       struct bpf_prog *prog;
>>       struct sock *sk = v;
>>       uid_t uid;
>> +    bool slow;
>> +    int rc;
> 
>>       if (v == SEQ_START_TOKEN)
>>           return 0;
> 
>> +    slow = lock_sock_fast(sk);
> 
> Hm, I missed the fact that we're already using fast lock in the tcp batching
as well. Should we not use fast locks here? On a loaded system it's
> probably fair to pay some backlog processing in the path that goes
> over every socket (here)? Martin, WDYT?

hmm... not sure if it is needed. The lock_sock_fast was borrowed from 
tcp_get_info(), which is also used in inet_diag iteration. The bpf iter prog 
should be doing something pretty fast also. In the future, we could allow the 
bpf-iter program to acquire the lock by itself, only when necessary, if the 
current always-lock strategy is too expensive.
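
For background on the trade-off being discussed, the pattern in question looks
roughly like the following (paraphrased from the hunk quoted above):
lock_sock_fast() keeps only the bh spinlock when the socket is uncontended and
falls back to the full lock_sock() otherwise, so backlog processing is paid
only on the slow path at unlock time.

```
slow = lock_sock_fast(sk);	/* fast path: bh spinlock only */
rc = udp_prog_seq_show(prog, &meta, udp_sk, uid, bucket);
unlock_sock_fast(sk, slow);	/* backlog is processed only if "slow" */
```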




* Re: [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator
  2023-02-28 20:32     ` Martin KaFai Lau
@ 2023-02-28 20:52       ` Stanislav Fomichev
  0 siblings, 0 replies; 24+ messages in thread
From: Stanislav Fomichev @ 2023-02-28 20:52 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: bpf, kafai, edumazet, Martin KaFai Lau, Aditi Ghag

On Tue, Feb 28, 2023 at 12:32 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/24/23 2:32 PM, Stanislav Fomichev wrote:
> >> +    unsigned int cur_sk;
> >> +    unsigned int end_sk;
> >> +    unsigned int max_sk;
> >> +    struct sock **batch;
> >> +    bool st_bucket_done;
> >
> > Any change we can generalize some of those across tcp & udp? I haven't
> > looked too deep, but a lot of things look like a plain copy-paste
> > from tcp batching. Or not worth it?
>
> The batching has some small but subtle differences between tcp and udp, so not
> sure if it can end up sharing enough codes.
>
> >>   static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
> >>                    struct udp_sock *udp_sk, uid_t uid, int bucket)
> >>   {
> >> @@ -3172,18 +3307,34 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq,
> >> void *v)
> >>       struct bpf_prog *prog;
> >>       struct sock *sk = v;
> >>       uid_t uid;
> >> +    bool slow;
> >> +    int rc;
> >
> >>       if (v == SEQ_START_TOKEN)
> >>           return 0;
> >
> >> +    slow = lock_sock_fast(sk);
> >
> > Hm, I missed the fact that we're already using fast lock in the tcp batching
> > as well. Should we not use fast locks here? On a loaded system it's
> > probably fair to pay some backlog processing in the path that goes
> > over every socket (here)? Martin, WDYT?
>
> hmm... not sure if it is needed. The lock_sock_fast was borrowed from
> tcp_get_info(), which is also used in inet_diag iteration. The bpf iter prog
> should be doing something pretty fast also. In the future, we could allow the
> bpf-iter program to acquire the lock by itself, only when necessary, if the
> current always-lock strategy is too expensive.

Cool, thx!


* Re: [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc
  2023-02-23 21:53 ` [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
  2023-02-24 22:35   ` Stanislav Fomichev
@ 2023-02-28 22:55   ` Martin KaFai Lau
  2023-03-16 22:37     ` Aditi Ghag
  1 sibling, 1 reply; 24+ messages in thread
From: Martin KaFai Lau @ 2023-02-28 22:55 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, sdf, edumazet, bpf

On 2/23/23 1:53 PM, Aditi Ghag wrote:
> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
> + *
> + * The helper expects a non-NULL pointer to a full socket. It invokes
> + * the protocol specific socket destroy handlers.
> + *
> + * The helper can only be called from BPF contexts that have acquired the socket
> + * locks.
> + *
> + * Parameters:
> + * @sock: Pointer to socket to be destroyed
> + *
> + * Return:
> + * On error, may return EOPNOTSUPP, EINVAL.
> + * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
> + * 0 otherwise
> + */
> +int bpf_sock_destroy(struct sock_common *sock)
> +{
> +	/* Validates the socket can be type casted to a full socket. */
> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);

If sk != sock, sk is not locked.

Does it have to do sk_to_full_sk()? From looking at tcp_abort(), it can handle 
TCP_NEW_SYN_RECV and TCP_TIME_WAIT. The bpf-tcp-iter is iterating the hashtables 
that may have sk in different states. Ideally, bpf-tcp-iter should be able to 
use bpf_sock_destroy() to abort the sk in different states also.

Otherwise, the bpf_sock_destroy kfunc is aborting the listener while the bpf 
prog expects to abort a req_sk.
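
Sketching the alternative raised here (not what the posted patch does): skip
the sk_to_full_sk() upgrade and hand the entry to diag_destroy as-is, since
sk_prot lives in sock_common and tcp_abort() already special-cases the
non-full states:

```
__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
{
	struct sock *sk = (struct sock *)sock;

	/* Let diag_destroy handle TCP_NEW_SYN_RECV/TCP_TIME_WAIT itself
	 * instead of aborting the parent listener via sk_to_full_sk().
	 */
	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
		return -EOPNOTSUPP;

	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
}
```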

> +
> +	if (!sk)
> +		return -EINVAL;
> +
> +	/* The locking semantics that allow for synchronous execution of the
> +	 * destroy handlers are only supported for TCP and UDP.
> +	 */
> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> +		return -EOPNOTSUPP;
> +
> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> +}
> +
> +__diag_pop()



* Re: [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy
  2023-02-23 21:53 ` [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
  2023-02-24 22:40   ` Stanislav Fomichev
  2023-02-27 19:37   ` Andrii Nakryiko
@ 2023-02-28 23:08   ` Martin KaFai Lau
  2023-03-01  2:17     ` Aditi Ghag
  2 siblings, 1 reply; 24+ messages in thread
From: Martin KaFai Lau @ 2023-02-28 23:08 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, sdf, edumazet, bpf

On 2/23/23 1:53 PM, Aditi Ghag wrote:
> The test cases for TCP and UDP iterators mirror the intended usages of the
> helper.
> 
> The destroy helpers set `ECONNABORTED` error code that we can validate in the
> test code with client sockets. But UDP sockets have an overriding error code
> from the disconnect called during abort, so the error code validation is
> only done for TCP sockets.
> 
> The `struct sock` is redefined as vmlinux.h forward declares the struct, and the
> loader fails to load the program as it finds the BTF FWD type for the struct
> incompatible with the BTF STRUCT type.
> 
> Here are the snippets of the verifier error, and corresponding BTF output:
> 
> ```
> verifier error: extern (func ksym) ...: func_proto ... incompatible with kernel
> 
> BTF for selftest prog binary:
> 
> [104] FWD 'sock' fwd_kind=struct
> [70] PTR '(anon)' type_id=104
> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
> 	'(anon)' type_id=70
> [85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
> --
> [96] DATASEC '.ksyms' size=0 vlen=1
> 	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')
> 
> BTF for selftest vmlinux:
> 
> [74923] FUNC 'bpf_sock_destroy' type_id=48965 linkage=static
> [48965] FUNC_PROTO '(anon)' ret_type_id=9 vlen=1
> 	'sk' type_id=1340
> [1340] PTR '(anon)' type_id=2363
> [2363] STRUCT 'sock' size=1280 vlen=93
> ```

> +int bpf_sock_destroy(struct sock_common *sk) __ksym;

This does not match the bpf prog's BTF dump above, which has pointer [70] 
pointing to FWD 'sock' [104] as the argument. It should be at least FWD 
'sock_common' if not STRUCT 'sock_common'. I tried to change the func signature 
to 'struct sock *sk' but could not reproduce the issue in my environment either.

Could you confirm the BTF paste and the 'incompatible with kernel' error in the 
commit message do match the bpf_sock_destroy declaration? If not, can you 
re-paste the BTF output and libbpf error message that match the 
bpf_sock_destroy signature?



* Re: [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy
  2023-02-28 23:08   ` Martin KaFai Lau
@ 2023-03-01  2:17     ` Aditi Ghag
  2023-03-02  7:06       ` Martin KaFai Lau
  0 siblings, 1 reply; 24+ messages in thread
From: Aditi Ghag @ 2023-03-01  2:17 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kafai, Stanislav Fomichev, edumazet, bpf



> On Feb 28, 2023, at 3:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 2/23/23 1:53 PM, Aditi Ghag wrote:
>> The test cases for TCP and UDP iterators mirror the intended usages of the
>> helper.
>> The destroy helpers set `ECONNABORTED` error code that we can validate in the
>> test code with client sockets. But UDP sockets have an overriding error code
>> from the disconnect called during abort, so the error code validation is
>> only done for TCP sockets.
>> The `struct sock` is redefined as vmlinux.h forward declares the struct, and the
>> loader fails to load the program as it finds the BTF FWD type for the struct
>> incompatible with the BTF STRUCT type.
>> Here are the snippets of the verifier error, and corresponding BTF output:
>> ```
>> verifier error: extern (func ksym) ...: func_proto ... incompatible with kernel
>> BTF for selftest prog binary:
>> [104] FWD 'sock' fwd_kind=struct
>> [70] PTR '(anon)' type_id=104
>> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
>> 	'(anon)' type_id=70
>> [85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
>> --
>> [96] DATASEC '.ksyms' size=0 vlen=1
>> 	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')
>> BTF for selftest vmlinux:
>> [74923] FUNC 'bpf_sock_destroy' type_id=48965 linkage=static
>> [48965] FUNC_PROTO '(anon)' ret_type_id=9 vlen=1
>> 	'sk' type_id=1340
>> [1340] PTR '(anon)' type_id=2363
>> [2363] STRUCT 'sock' size=1280 vlen=93
>> ```
> 
>> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
> 
> This does not match the bpf prog's BTF dump above which has pointer [70] pointing to FWD 'sock' [104] as the argument. It should be at least FWD 'sock_common' if not STRUCT 'sock_common'. I tried to change the func signature to 'struct sock *sk' but cannot reproduce the issue in my environment also.
> 
> Could you confirm the BTF paste and 'incompatible with kernel" error in the commit message do match the bpf_sock_destroy declaration? If not, can you re-paste the BTF output and libbpf error message that matches the bpf_sock_destroy signature.

I don't think you'll be able to reproduce the issue with `sock_common`, as `struct sock_common` isn't forward declared in vmlinux.h. But I find it odd that you weren't able to reproduce it with `struct sock`. Just to confirm, did you remove the minimal `struct sock` definition from the program? Per the commit description, I added that because libbpf was throwing this error -
`libbpf: extern (func ksym) 'bpf_sock_destroy': func_proto [83] incompatible with kernel [75285]`

Sending the BTF snippet again (full BTF - https://pastebin.com/etkFyuJk)

```
[85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')
[84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
	'(anon)' type_id=70
[70] PTR '(anon)' type_id=104
[104] FWD 'sock' fwd_kind=struct
```

Compare this to the BTF snippet once I undef and define the struct in the test prog:

```
[87] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
	type_id=87 offset=0 size=0 (FUNC 'bpf_sock_destroy')
[84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
	'(anon)' type_id=85
[85] PTR '(anon)' type_id=86
[86] STRUCT 'sock' size=136 vlen=1
	'__sk_common' type_id=34 bits_offset=0
```

(Anyway looks like I needed to define the struct in the test prog only when bpf_sock_destroy had `struct sock` as the argument.)
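
For readers following the FWD issue, the workaround being referred to looks
roughly like this in the selftest prog (a sketch; the exact definition in the
posted test may differ):

```
#include "vmlinux.h"

/* In this setup vmlinux.h only forward-declares struct sock, so complete
 * the type with a minimal definition; the prog BTF then carries the
 * STRUCT (size 136, __sk_common only) shown above instead of a FWD.
 */
struct sock {
	struct sock_common __sk_common;
} __attribute__((preserve_access_index));

/* Only needed for the variant where the kfunc takes struct sock. */
int bpf_sock_destroy(struct sock *sk) __ksym;
```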


* Re: [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator
  2023-02-28 19:58   ` Martin KaFai Lau
@ 2023-03-01  2:40     ` Aditi Ghag
  2023-03-02  6:43       ` Martin KaFai Lau
  0 siblings, 1 reply; 24+ messages in thread
From: Aditi Ghag @ 2023-03-01  2:40 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kafai, sdf, edumazet, Martin KaFai Lau, bpf


>> +
>> +		if (hlist_empty(&hslot->head))
>> +			continue;
>> +
>> +		spin_lock_bh(&hslot->lock);
>> +		sk_for_each(sk, &hslot->head) {
>> +			if (seq_sk_match(seq, sk)) {
>> +				if (first) {
>> +					first_sk = sk;
>> +					first = false;
>> +				}
>> +				if (iter->end_sk < iter->max_sk) {
>> +					sock_hold(sk);
>> +					iter->batch[iter->end_sk++] = sk;
>> +				}
>> +				bucket_sks++;
>> +			}
>> +		}
>> +		spin_unlock_bh(&hslot->lock);
>> +		if (first_sk)
>> +			break;
>> +	}
>> +
>> +	/* All done: no batch made. */
>> +	if (!first_sk)
>> +		return NULL;
> 
> I think first_sk and bucket_sks need to be reset on the "again" case also?
> 
> If bpf_iter_udp_seq_stop() is called before a batch has been fully processed by the bpf prog in ".show", how does the next bpf_iter_udp_seq_start() continue from where it left off? The bpf_tcp_iter remembers the bucket and the offset-in-this-bucket. I think bpf_udp_iter can do something similar.

Hmm, I suppose you are referring to the `tcp_seek_last_pos` logic? This was the [1] commit that added the optimization in v2.6, but only for TCP. Extending the same logic for UDP is out of the scope of this PR? Although reading the [1] commit description, the latency issue seems to be specific to the file based iterators, no? Of course, BPF iterators are quite new, and I'm not sure if we have the same "slowness" reported in that commit. Having said that, do we expect users to start from where they previously left off by stopping BPF iterators? Regardless, I'll do some testing to ensure that we at least don't crash. 


[1] https://github.com/torvalds/linux/commit/a8b690f98baf9fb1902b8eeab801351ea603fa3a




* Re: [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator
  2023-03-01  2:40     ` Aditi Ghag
@ 2023-03-02  6:43       ` Martin KaFai Lau
  0 siblings, 0 replies; 24+ messages in thread
From: Martin KaFai Lau @ 2023-03-02  6:43 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, sdf, edumazet, Martin KaFai Lau, bpf

On 2/28/23 6:40 PM, Aditi Ghag wrote:
> 
>>> +
>>> +		if (hlist_empty(&hslot->head))
>>> +			continue;
>>> +
>>> +		spin_lock_bh(&hslot->lock);
>>> +		sk_for_each(sk, &hslot->head) {
>>> +			if (seq_sk_match(seq, sk)) {
>>> +				if (first) {
>>> +					first_sk = sk;
>>> +					first = false;
>>> +				}
>>> +				if (iter->end_sk < iter->max_sk) {
>>> +					sock_hold(sk);
>>> +					iter->batch[iter->end_sk++] = sk;
>>> +				}
>>> +				bucket_sks++;
>>> +			}
>>> +		}
>>> +		spin_unlock_bh(&hslot->lock);
>>> +		if (first_sk)
>>> +			break;
>>> +	}
>>> +
>>> +	/* All done: no batch made. */
>>> +	if (!first_sk)
>>> +		return NULL;
>>
>> I think first_sk and bucket_sks need to be reset on the "again" case also?
>>
>> If bpf_iter_udp_seq_stop() is called before a batch has been fully processed by the bpf prog in ".show", how does the next bpf_iter_udp_seq_start() continue from where it left off? The bpf_tcp_iter remembers the bucket and the offset-in-this-bucket. I think bpf_udp_iter can do something similar.
> 
> Hmm, I suppose you are referring to the `tcp_seek_last_pos` logic? This was the [1] commit that added the optimization in v2.6, but only for TCP. Extending the same logic for UDP is out of the scope of this PR? Although reading the [1] commit description, the latency issue seems to be specific to the file based iterators, no? Of course, BPF iterators are quite new, and I'm not sure if we have the same "slowness" reported in that commit. Having said that, do we expect users to start from where they previously left off by stopping BPF iterators? Regardless, I'll do some testing to ensure that we at least don't crash.

It is about ensuring the bpf-udp-iter makes forward progress, although speeding 
up is another plus. The iteration may have to be stopped (i.e. 
bpf_iter_udp_seq_stop) for other reasons. The bpf-(udp|tcp|unix)-iter is not 
only for doing bpf_setsockopt or bpf_sock_destroy. A major use case is 
bpf_seq_printf, e.g. when seq_has_overflowed() in bpf_seq_read, 
bpf_iter_udp_seq_stop will be called.

If I read this patch correctly, bpf_iter_udp_seq_start() always starts from the 
beginning of the current bucket. The bpf-iter cannot make forward progress if 
bpf_iter_udp_seq_stop is called in the middle of the bucket.

The "resume" is currently done by udp_get_idx(, *pos..) but is taken away from 
bpf-udp-iter in this patch. udp_get_idx(...) starts counting from the very first 
bucket and could potentially be reused here. However, I believe it is easier for 
this patch to just resume from the current bucket. Everything is there; all it 
needs is to remember its previous position/offset within the bucket.
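
A rough sketch of that resume idea (the offset field is hypothetical;
tcp_seek_last_pos() is the reference): remember how many matched sockets of
the current bucket were already consumed, and skip that many when the batch is
rebuilt after a stop:

```
struct bpf_udp_iter_state {
	struct udp_iter_state state;
	unsigned int cur_sk;
	unsigned int end_sk;
	unsigned int max_sk;
	unsigned int offset;	/* hypothetical: sockets already shown */
	struct sock **batch;
	bool st_bucket_done;
};

/* In bpf_iter_udp_batch(), when re-entering the same bucket: */
	sk_for_each(sk, &hslot->head) {
		if (seq_sk_match(seq, sk)) {
			if (iter->offset) {
				iter->offset--;	/* consumed before the stop */
				continue;
			}
			/* ... batch sk as before ... */
		}
	}
```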

> 
> 
> [1] https://github.com/torvalds/linux/commit/a8b690f98baf9fb1902b8eeab801351ea603fa3a
> 
> 



* Re: [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-01  2:17     ` Aditi Ghag
@ 2023-03-02  7:06       ` Martin KaFai Lau
  2023-03-02 20:52         ` Aditi Ghag
  0 siblings, 1 reply; 24+ messages in thread
From: Martin KaFai Lau @ 2023-03-02  7:06 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, Stanislav Fomichev, edumazet, bpf

On 2/28/23 6:17 PM, Aditi Ghag wrote:
> 
> 
>> On Feb 28, 2023, at 3:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 2/23/23 1:53 PM, Aditi Ghag wrote:
>>> The test cases for TCP and UDP iterators mirror the intended usages of the
>>> helper.
>>> The destroy helpers set `ECONNABORTED` error code that we can validate in the
>>> test code with client sockets. But UDP sockets have an overriding error code
>>> from the disconnect called during abort, so the error code validation is
>>> only done for TCP sockets.
>>> The `struct sock` is redefined as vmlinux.h forward declares the struct, and the
>>> loader fails to load the program as it finds the BTF FWD type for the struct
>>> incompatible with the BTF STRUCT type.
>>> Here are the snippets of the verifier error, and corresponding BTF output:
>>> ```
>>> verifier error: extern (func ksym) ...: func_proto ... incompatible with kernel
>>> BTF for selftest prog binary:
>>> [104] FWD 'sock' fwd_kind=struct
>>> [70] PTR '(anon)' type_id=104
>>> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
>>> 	'(anon)' type_id=70
>>> [85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
>>> --
>>> [96] DATASEC '.ksyms' size=0 vlen=1
>>> 	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')
>>> BTF for selftest vmlinux:
>>> [74923] FUNC 'bpf_sock_destroy' type_id=48965 linkage=static
>>> [48965] FUNC_PROTO '(anon)' ret_type_id=9 vlen=1
>>> 	'sk' type_id=1340
>>> [1340] PTR '(anon)' type_id=2363
>>> [2363] STRUCT 'sock' size=1280 vlen=93
>>> ```
>>
>>> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
>>
>> This does not match the bpf prog's BTF dump above, which has pointer [70] pointing to FWD 'sock' [104] as the argument. It should be at least FWD 'sock_common' if not STRUCT 'sock_common'. I tried to change the func signature to 'struct sock *sk' but could not reproduce the issue in my environment either.
>> 
>> Could you confirm the BTF paste and the 'incompatible with kernel' error in the commit message do match the bpf_sock_destroy declaration? If not, can you re-paste the BTF output and libbpf error message that match the bpf_sock_destroy signature?
> 
> I don't think you'll be able to reproduce the issue with `sock_common`, as `struct sock_common` isn't forward declared in vmlinux.h. But I find it odd that you weren't able to reproduce it with `struct sock`. Just to confirm, did you remove the minimal `struct sock` definition from the program? Per the commit description, I added that because libbpf was throwing this error -
> `libbpf: extern (func ksym) 'bpf_sock_destroy': func_proto [83] incompatible with kernel [75285]`

Yep, I changed the kfunc to 'struct sock *' and removed the define/undef dance.

> 
> Sending the BTF snippet again (full BTF - https://pastebin.com/etkFyuJk)
> 
> ```
> [85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
> 	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')
> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
> 	'(anon)' type_id=70
> [70] PTR '(anon)' type_id=104
> [104] FWD 'sock' fwd_kind=struct
> ```
> 
> Compare this to the BTF snippet once I undef and define the struct in the test prog:
> 
> ```
> [87] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
> 	type_id=87 offset=0 size=0 (FUNC 'bpf_sock_destroy')
> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
> 	'(anon)' type_id=85
> [85] PTR '(anon)' type_id=86
> [86] STRUCT 'sock' size=136 vlen=1
> 	'__sk_common' type_id=34 bits_offset=0
> ```
> 
> (Anyway looks like I needed to define the struct in the test prog only when bpf_sock_destory had `struct sock` as the argument.)

Right, I also think it is orthogonal to your set if the kfunc is taking 'struct 
sock_common *' anyway. [although I do feel a kernel function taking a 'struct 
sock_common *' is rather odd]

I was only asking, and also trying it myself, because it looks pretty wrong if 
it can be reproduced, and it is something that should be fixed regardless. It is 
pretty normal to have a forward declaration within a bpf prog itself (not from 
vmlinux.h). From the paste, it feels like the BTF for the kfunc 
bpf_sock_destroy is generated earlier than the 'struct sock'. Which llvm 
version are you using?


* Re: [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-02  7:06       ` Martin KaFai Lau
@ 2023-03-02 20:52         ` Aditi Ghag
  0 siblings, 0 replies; 24+ messages in thread
From: Aditi Ghag @ 2023-03-02 20:52 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kafai, Stanislav Fomichev, edumazet, bpf


> On Mar 1, 2023, at 11:06 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 2/28/23 6:17 PM, Aditi Ghag wrote:
>>> On Feb 28, 2023, at 3:08 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>> 
>>> On 2/23/23 1:53 PM, Aditi Ghag wrote:
>>>> The test cases for TCP and UDP iterators mirror the intended usages of the
>>>> helper.
>>>> The destroy helpers set `ECONNABORTED` error code that we can validate in the
>>>> test code with client sockets. But UDP sockets have an overriding error code
>>>> from the disconnect called during abort, so the error code validation is
>>>> only done for TCP sockets.
>>>> The `struct sock` is redefined as vmlinux.h forward declares the struct, and the
>>>> loader fails to load the program as it finds the BTF FWD type for the struct
>>>> incompatible with the BTF STRUCT type.
>>>> Here are the snippets of the verifier error, and corresponding BTF output:
>>>> ```
>>>> verifier error: extern (func ksym) ...: func_proto ... incompatible with kernel
>>>> BTF for selftest prog binary:
>>>> [104] FWD 'sock' fwd_kind=struct
>>>> [70] PTR '(anon)' type_id=104
>>>> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
>>>> 	'(anon)' type_id=70
>>>> [85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
>>>> --
>>>> [96] DATASEC '.ksyms' size=0 vlen=1
>>>> 	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')
>>>> BTF for selftest vmlinux:
>>>> [74923] FUNC 'bpf_sock_destroy' type_id=48965 linkage=static
>>>> [48965] FUNC_PROTO '(anon)' ret_type_id=9 vlen=1
>>>> 	'sk' type_id=1340
>>>> [1340] PTR '(anon)' type_id=2363
>>>> [2363] STRUCT 'sock' size=1280 vlen=93
>>>> ```
>>> 
>>>> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
>>> 
>>> This does not match the bpf prog's BTF dump above, which has pointer [70] pointing to FWD 'sock' [104] as the argument. It should be at least FWD 'sock_common' if not STRUCT 'sock_common'. I tried to change the func signature to 'struct sock *sk' but could not reproduce the issue in my environment either.
>>> 
>>> Could you confirm the BTF paste and the 'incompatible with kernel' error in the commit message do match the bpf_sock_destroy declaration? If not, can you re-paste the BTF output and libbpf error message that match the bpf_sock_destroy signature?
>> I don't think you'll be able to reproduce the issue with `sock_common`, as `struct sock_common` isn't forward declared in vmlinux.h. But I find it odd that you weren't able to reproduce it with `struct sock`. Just to confirm, did you remove the minimal `struct sock` definition from the program? Per the commit description, I added that because libbpf was throwing this error -
>> `libbpf: extern (func ksym) 'bpf_sock_destroy': func_proto [83] incompatible with kernel [75285]`
> 
> Yep, I changed the kfunc to 'struct sock *' and removed the define/undef dance.
> 
>> Sending the BTF snippet again (full BTF - https://pastebin.com/etkFyuJk)
>> ```
>> [85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
>> 	type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')
>> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
>> 	'(anon)' type_id=70
>> [70] PTR '(anon)' type_id=104
>> [104] FWD 'sock' fwd_kind=struct
>> ```
>> Compare this to the BTF snippet once I undef and define the struct in the test prog:
>> ```
>> [87] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
>> 	type_id=87 offset=0 size=0 (FUNC 'bpf_sock_destroy')
>> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
>> 	'(anon)' type_id=85
>> [85] PTR '(anon)' type_id=86
>> [86] STRUCT 'sock' size=136 vlen=1
>> 	'__sk_common' type_id=34 bits_offset=0
>> ```
>> (Anyway looks like I needed to define the struct in the test prog only when bpf_sock_destroy had `struct sock` as the argument.)
> 
> Right, I also think it is orthogonal to your set if the kfunc is taking 'struct sock_common *' anyway. [although I do feel a kernel function taking a 'struct sock_common *' is rather odd]

Yes, this wasn't a problem with the helper taking `struct sock` as the argument in the v1 patch. I'm all ears if we can have a similar signature for the kfunc.

> 
> I was only asking and also trying myself because it looks pretty wrong if it can be reproduced and it is something that should be fixed regardless. It is pretty normal to have forward declaration within a bpf prog itself (not from vmlinux.h). From the paste, it feels like the kfunc bpf_sock_destroy btf is generated earlier than the 'struct sock'. Which llvm version are you using?

$ llvm-config --version
14.0.0 
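
In case it helps with reproducing, the two BTF dumps above can be regenerated with something along these lines (the object path is hypothetical):

$ bpftool btf dump file sock_destroy_prog.bpf.o | grep -B2 -A2 bpf_sock_destroy
$ bpftool btf dump file /sys/kernel/btf/vmlinux | grep -A3 bpf_sock_destroy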


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy
  2023-02-27 19:37   ` Andrii Nakryiko
@ 2023-03-03 16:00     ` Aditi Ghag
  0 siblings, 0 replies; 24+ messages in thread
From: Aditi Ghag @ 2023-03-03 16:00 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf



> On Feb 27, 2023, at 11:37 AM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Thu, Feb 23, 2023 at 2:05 PM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>> 
>> The test cases for TCP and UDP iterators mirror the intended usages of the
>> helper.
>> 
>> The destroy helpers set the `ECONNABORTED` error code that we can validate
>> in the test code with client sockets. But UDP sockets have an overriding
>> error code from the disconnect called during abort, so the error code
>> validation is only done for TCP sockets.
>> 
>> The `struct sock` is redefined because vmlinux.h only forward declares the
>> struct; without the redefinition, the loader fails to load the program, as
>> it finds the BTF FWD type for the struct incompatible with the kernel's
>> BTF STRUCT type.
>> 
>> Here are snippets of the libbpf loader error and the corresponding BTF output:
>> 
>> ```
>> libbpf error: extern (func ksym) ...: func_proto ... incompatible with kernel
>> 
>> BTF for selftest prog binary:
>> 
>> [104] FWD 'sock' fwd_kind=struct
>> [70] PTR '(anon)' type_id=104
>> [84] FUNC_PROTO '(anon)' ret_type_id=2 vlen=1
>>        '(anon)' type_id=70
>> [85] FUNC 'bpf_sock_destroy' type_id=84 linkage=extern
>> --
>> [96] DATASEC '.ksyms' size=0 vlen=1
>>        type_id=85 offset=0 size=0 (FUNC 'bpf_sock_destroy')
>> 
>> BTF for selftest vmlinux:
>> 
>> [74923] FUNC 'bpf_sock_destroy' type_id=48965 linkage=static
>> [48965] FUNC_PROTO '(anon)' ret_type_id=9 vlen=1
>>        'sk' type_id=1340
>> [1340] PTR '(anon)' type_id=2363
>> [2363] STRUCT 'sock' size=1280 vlen=93
>> ```
>> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>> .../selftests/bpf/prog_tests/sock_destroy.c   | 125 ++++++++++++++++++
>> .../selftests/bpf/progs/sock_destroy_prog.c   | 110 +++++++++++++++
>> 2 files changed, 235 insertions(+)
>> create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>> create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>> 
> 
> [...]
> 
>> +       n = send(clien, "t", 1, 0);
>> +       if (CHECK(n < 0, "client_send", "client failed to send on socket\n"))
>> +               goto cleanup;
>> +
>> +       start_iter_sockets(skel->progs.iter_tcp6);
>> +
>> +       n = send(clien, "t", 1, 0);
>> +       if (CHECK(n > 0, "client_send after destroy", "succeeded on destroyed socket\n"))
>> +               goto cleanup;
>> +       CHECK(errno != ECONNABORTED, "client_send", "unexpected error code on destroyed socket\n");
>> +
> 
> please don't use CHECK() macros, prefer the ASSERT_xxx() ones

Yes, this'll be handled in the next revision. Thanks!
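
For instance, the checks above might read as follows with the ASSERT macros (a sketch):

```
n = send(clien, "t", 1, 0);
if (!ASSERT_GE(n, 0, "client_send"))
	goto cleanup;

start_iter_sockets(skel->progs.iter_tcp6);

/* expect the send to fail with ECONNABORTED after destroy */
n = send(clien, "t", 1, 0);
if (!ASSERT_LT(n, 0, "client_send after destroy"))
	goto cleanup;
ASSERT_EQ(errno, ECONNABORTED, "client_send errno");
```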

> 
>> +
>> +cleanup:
>> +       close(clien);
>> +cleanup_serv:
>> +       close(serv);
>> +}
>> +
>> +
>> +void test_udp(struct sock_destroy_prog *skel)
> 
> are these meant to be subtests? If yes, model them as such?

I wasn't aware of subtest markers. I'll look at the other tests for reference.

> 
> and in either case, make these funcs static

 Will do!
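
For reference, a rough sketch of the subtest pattern with static helpers (the top-level test name is illustrative):

```
static void test_tcp(struct sock_destroy_prog *skel) { /* ... */ }
static void test_udp(struct sock_destroy_prog *skel) { /* ... */ }

void test_sock_destroy(void)
{
	struct sock_destroy_prog *skel;

	skel = sock_destroy_prog__open_and_load();
	if (!ASSERT_OK_PTR(skel, "skel_open"))
		return;

	if (test__start_subtest("tcp"))
		test_tcp(skel);
	if (test__start_subtest("udp"))
		test_udp(skel);

	sock_destroy_prog__destroy(skel);
}
```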

> 
>> +{
>> +       int serv = -1, clien = -1, n = 0;
>> +
>> +       serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
>> +       if (CHECK(serv < 0, "start_server", "failed to start server\n"))
>> +               goto cleanup_serv;
>> +
> 
> [...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc
  2023-02-28 22:55   ` Martin KaFai Lau
@ 2023-03-16 22:37     ` Aditi Ghag
  2023-03-21 18:49       ` Aditi Ghag
  0 siblings, 1 reply; 24+ messages in thread
From: Aditi Ghag @ 2023-03-16 22:37 UTC (permalink / raw)
  To: Martin KaFai Lau, edumazet; +Cc: kafai, Stanislav Fomichev, bpf



> On Feb 28, 2023, at 2:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 2/23/23 1:53 PM, Aditi Ghag wrote:
>> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
>> + *
>> + * The helper expects a non-NULL pointer to a full socket. It invokes
>> + * the protocol specific socket destroy handlers.
>> + *
>> + * The helper can only be called from BPF contexts that have acquired the socket
>> + * locks.
>> + *
>> + * Parameters:
>> + * @sock: Pointer to socket to be destroyed
>> + *
>> + * Return:
>> + * On error, may return EOPNOTSUPP, EINVAL.
>> + * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
>> + * 0 otherwise
>> + */
>> +int bpf_sock_destroy(struct sock_common *sock)
>> +{
>> +	/* Validates the socket can be type casted to a full socket. */
>> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);
> 
> If sk != sock, sk is not locked.
> 
> Does it have to do sk_to_full_sk()? From looking at tcp_abort(), it can handle TCP_NEW_SYN_RECV and TCP_TIME_WAIT. The bpf-tcp-iter is iterating the hashtables that may have sk in different states. Ideally, bpf-tcp-iter should be able to use bpf_sock_destroy() to abort the sk in different states also.

I initially added the check for request sockets as tcp_abort references some of the fields outside of `sock_common`, but tcp_abort does have special handling for both req and time_wait sockets, as you pointed out. So I'll remove the `sk_to_full_sk()` call.
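
Concretely, the kfunc would then reduce to something like this (a sketch, not the final patch):

```
int bpf_sock_destroy(struct sock_common *sock)
{
	struct sock *sk = (struct sock *)sock;

	if (!sk)
		return -EINVAL;

	/* No sk_to_full_sk() here: tcp_abort() itself handles the
	 * TCP_NEW_SYN_RECV and TCP_TIME_WAIT cases.
	 */
	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
		return -EOPNOTSUPP;

	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
}
```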


Eric/Martin:

>Ideally, bpf-tcp-iter should be able to use bpf_sock_destroy() to abort the sk in different states also.

On a somewhat related note, I ran into an interesting problem while adding more tests to exercise the changes in the first commit ("Implement batching for UDP sockets iterator"). As it turns out, UDP *listening* sockets weren't getting destroyed, unlike client sockets.

So here is what the test does at a high level - 
1) Start SO_REUSEPORT servers. (I hit the same issue for regular UDP listening sockets as well.)
2) Run a BPF iterator that destroys sockets (at this point, there are only server sockets).
3) Start a regular server that binds on the same port as the ones from (1), with the expectation that it succeeds now that the sockets from (1) have been destroyed. The server fails to bind!

When I debugged the issue, I found that the listening UDP sockets were still lurking around in the hash table even after they were supposed to be destroyed. With the help of kprobes and print statements in the BPF test iterator program, I confirmed that udp_abort and its internal function calls (specifically, `__udp_disconnect`) were getting invoked as expected, and that `sk_state` was being set to `TCP_CLOSE`. Upon further investigation, I found that the socket unhash and source port reset weren't happening. This is the relevant code - https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L1988 [1]. When I commented out the `SOCK_BINDPORT_LOCK` check, the new test passed, as sockets were getting destroyed correctly.

I didn't observe similar behavior with TCP: TCP listening sockets were getting destroyed correctly, since `tcp_set_state` unhashes sockets unconditionally for the `TCP_CLOSE` state.
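
For comparison, the relevant TCP path (paraphrased from `tcp_set_state()` in net/ipv4/tcp.c; exact code may differ across kernel versions) unhashes before the bind-lock check, which only guards the port release:

```
case TCP_CLOSE:
	if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
		TCP_INC_STATS(sock_net(sk), TCP_MIB_ESTABRESETS);

	sk->sk_prot->unhash(sk);
	if (inet_csk(sk)->icsk_bind_hash &&
	    !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
		inet_put_port(sk);
	fallthrough;
```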

Can you share insights into why the socket unhashing and source port reset don't happen for bound (SOCK_BINDPORT_LOCK) sockets? If that's expected, I suppose we'll need to unhash and reset source ports for sockets getting destroyed, right?
(I wonder if this was an oversight when `__udp_disconnect` was introduced in commit 286c72deabaa240b7eebbd99496ed3324d69f3c0.)


If it's easier, I can push the next version of the patches, where I've addressed review comments and added new tests; we can then continue this discussion there. In the latest version, I've modified [1] with a `TCP_CLOSE` state check.

[1] if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
        sk->sk_prot->unhash(sk);
        inet->inet_sport = 0;
    }

> 
> Otherwise, the bpf_sock_destroy kfunc is aborting the listener while the bpf prog expects to abort a req_sk.
> 
>> +
>> +	if (!sk)
>> +		return -EINVAL;
>> +
>> +	/* The locking semantics that allow for synchronous execution of the
>> +	 * destroy handlers are only supported for TCP and UDP.
>> +	 */
>> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
>> +		return -EOPNOTSUPP;
>> +
>> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>> +}
>> +
>> +__diag_pop()
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc
  2023-03-16 22:37     ` Aditi Ghag
@ 2023-03-21 18:49       ` Aditi Ghag
  0 siblings, 0 replies; 24+ messages in thread
From: Aditi Ghag @ 2023-03-21 18:49 UTC (permalink / raw)
  To: Martin KaFai Lau, edumazet; +Cc: kafai, Stanislav Fomichev, bpf



> On Mar 16, 2023, at 3:37 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> 
> 
> 
>> On Feb 28, 2023, at 2:55 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>> 
>> On 2/23/23 1:53 PM, Aditi Ghag wrote:
>>> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
>>> + *
>>> + * The helper expects a non-NULL pointer to a full socket. It invokes
>>> + * the protocol specific socket destroy handlers.
>>> + *
>>> + * The helper can only be called from BPF contexts that have acquired the socket
>>> + * locks.
>>> + *
>>> + * Parameters:
>>> + * @sock: Pointer to socket to be destroyed
>>> + *
>>> + * Return:
>>> + * On error, may return EOPNOTSUPP, EINVAL.
>>> + * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
>>> + * 0 otherwise
>>> + */
>>> +int bpf_sock_destroy(struct sock_common *sock)
>>> +{
>>> +	/* Validates the socket can be type casted to a full socket. */
>>> +	struct sock *sk = sk_to_full_sk((struct sock *)sock);
>> 
>> If sk != sock, sk is not locked.
>> 
>> Does it have to do sk_to_full_sk()? From looking at tcp_abort(), it can handle TCP_NEW_SYN_RECV and TCP_TIME_WAIT. The bpf-tcp-iter is iterating the hashtables that may have sk in different states. Ideally, bpf-tcp-iter should be able to use bpf_sock_destroy() to abort the sk in different states also.
> 
> I initially added the check for request sockets as tcp_abort references some of the fields outside of `sock_common`, but tcp_abort does have special handling for both req and time_wait sockets, as you pointed out. So I'll remove the `sk_to_full_sk()` call.
> 
> 
> Eric/Martin:
> 
>> Ideally, bpf-tcp-iter should be able to use bpf_sock_destroy() to abort the sk in different states also.
> 
> On a somewhat related note, I ran into an interesting problem while adding more tests to exercise the changes in the first commit ("Implement batching for UDP sockets iterator"). As it turns out, UDP *listening* sockets weren't getting destroyed, unlike client sockets.
> 
> So here is what the test does at a high level - 
> 1) Start SO_REUSEPORT servers. (I hit the same issue for regular UDP listening sockets as well.)
> 2) Run a BPF iterator that destroys sockets (at this point, there are only server sockets).
> 3) Start a regular server that binds on the same port as the ones from (1), with the expectation that it succeeds now that the sockets from (1) have been destroyed. The server fails to bind!
> 
> When I debugged the issue, I found that the listening UDP sockets were still lurking around in the hash table even after they were supposed to be destroyed. With the help of kprobes and print statements in the BPF test iterator program, I confirmed that udp_abort and its internal function calls (specifically, `__udp_disconnect`) were getting invoked as expected, and that `sk_state` was being set to `TCP_CLOSE`. Upon further investigation, I found that the socket unhash and source port reset weren't happening. This is the relevant code - https://github.com/torvalds/linux/blob/master/net/ipv4/udp.c#L1988 [1]. When I commented out the `SOCK_BINDPORT_LOCK` check, the new test passed, as sockets were getting destroyed correctly.
> 
> I didn't observe similar behavior with TCP: TCP listening sockets were getting destroyed correctly, since `tcp_set_state` unhashes sockets unconditionally for the `TCP_CLOSE` state.
> 
> Can you share insights into why the socket unhashing and source port reset don't happen for bound (SOCK_BINDPORT_LOCK) sockets? If that's expected, I suppose we'll need to unhash and reset source ports for sockets getting destroyed, right?
> (I wonder if this was an oversight when `__udp_disconnect` was introduced in commit 286c72deabaa240b7eebbd99496ed3324d69f3c0.)
> 
> 
> If it's easier, I can push the next version of the patches, where I've addressed review comments and added new tests; we can then continue this discussion there. In the latest version, I've modified [1] with a `TCP_CLOSE` state check.
> 
> [1] if (!(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) {
>         sk->sk_prot->unhash(sk);
>         inet->inet_sport = 0;
>     }


Scratch my comment about adding the state check. We can discuss the fix in the v3 patch series - https://lore.kernel.org/bpf/20230321184541.1857363-5-aditi.ghag@isovalent.com/T/#u


> 
>> 
>> Otherwise, the bpf_sock_destroy kfunc is aborting the listener while the bpf prog expects to abort a req_sk.
>> 
>>> +
>>> +	if (!sk)
>>> +		return -EINVAL;
>>> +
>>> +	/* The locking semantics that allow for synchronous execution of the
>>> +	 * destroy handlers are only supported for TCP and UDP.
>>> +	 */
>>> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>>> +}
>>> +
>>> +__diag_pop()


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-03-21 18:50 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-23 21:53 [PATCH v2 bpf-next 0/3]: Add socket destroy capability Aditi Ghag
2023-02-23 21:53 ` [PATCH v2 bpf-next 1/3] bpf: Implement batching in UDP iterator Aditi Ghag
2023-02-24 22:32   ` Stanislav Fomichev
2023-02-28 20:32     ` Martin KaFai Lau
2023-02-28 20:52       ` Stanislav Fomichev
2023-02-28 19:58   ` Martin KaFai Lau
2023-03-01  2:40     ` Aditi Ghag
2023-03-02  6:43       ` Martin KaFai Lau
2023-02-23 21:53 ` [PATCH v2 bpf-next 2/3] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
2023-02-24 22:35   ` Stanislav Fomichev
2023-02-27 14:56     ` Aditi Ghag
2023-02-27 15:32       ` Aditi Ghag
2023-02-27 17:30       ` Stanislav Fomichev
2023-02-28 22:55   ` Martin KaFai Lau
2023-03-16 22:37     ` Aditi Ghag
2023-03-21 18:49       ` Aditi Ghag
2023-02-23 21:53 ` [PATCH v2 bpf-next 3/3] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
2023-02-24 22:40   ` Stanislav Fomichev
2023-02-27 19:37   ` Andrii Nakryiko
2023-03-03 16:00     ` Aditi Ghag
2023-02-28 23:08   ` Martin KaFai Lau
2023-03-01  2:17     ` Aditi Ghag
2023-03-02  7:06       ` Martin KaFai Lau
2023-03-02 20:52         ` Aditi Ghag
