bpf.vger.kernel.org archive mirror
* [PATCH v4 bpf-next 0/4] bpf: Add socket destroy capability
@ 2023-03-23 20:06 Aditi Ghag
  2023-03-23 20:06 ` [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator Aditi Ghag
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Aditi Ghag @ 2023-03-23 20:06 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

This patch series adds the capability to destroy sockets in BPF. We plan
to use the capability in Cilium to force client sockets to reconnect when
their remote load-balancing backends are deleted. The other use case is
on-the-fly policy enforcement where existing socket connections disallowed
by newly enforced policies need to be terminated.

The use cases and more details about the selected approach were presented
at LPC 2022 -
https://lpc.events/event/16/contributions/1358/.
RFC discussion -
https://lore.kernel.org/netdev/CABG=zsBEh-P4NXk23eBJw7eajB5YJeRS7oPXnTAzs=yob4EMoQ@mail.gmail.com/T/#u.
v2 patch series -
https://lore.kernel.org/bpf/20230223215311.926899-1-aditi.ghag@isovalent.com/T/#t

v4 highlights:
- Updated locking in BPF TCP iterator.
- Adapted the *-server selftests to match the client selftests, per the
  discussions.
- Addressed Stan's comment to revert skipping spin locks in tcp_abort.
- Dropped the fix to unhash UDP sockets during abort.
- Moved the iterator resume related changes to the right commit. 

(The notes below are the same as in the last patch series.)
- I hit a snag while writing the kfunc where the verifier complained about
  the `sock_common` type passed from the TCP iterator. With kfuncs, there
  don't seem to be any options available to pass BTF type hints to the
  verifier (equivalent of `ARG_PTR_TO_BTF_ID_SOCK_COMMON`, as was the case
  with the helper). As a result, I changed the argument type of the
  sock_destroy kfunc to `sock_common` (see the sketch below).
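
A rough sketch of what this results in (illustrative only; patch 2 has the
actual kernel-side definition and patch 4 the BPF-side selftests):

```
/* kernel side (patch 2) */
__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock);

/* BPF program side (patch 4 selftests) */
int bpf_sock_destroy(struct sock_common *sk) __ksym;
```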


Aditi Ghag (4):
  bpf: Implement batching in UDP iterator
  bpf: Add bpf_sock_destroy kfunc
  bpf,tcp: Avoid taking fast sock lock in iterator
  selftests/bpf: Add tests for bpf_sock_destroy

 include/net/udp.h                             |   1 +
 net/core/filter.c                             |  54 ++++
 net/ipv4/tcp.c                                |  10 +-
 net/ipv4/tcp_ipv4.c                           |   5 +-
 net/ipv4/udp.c                                | 261 +++++++++++++++++-
 .../selftests/bpf/prog_tests/sock_destroy.c   | 195 +++++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++
 7 files changed, 660 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator
  2023-03-23 20:06 [PATCH v4 bpf-next 0/4] bpf: Add socket destroy capability Aditi Ghag
@ 2023-03-23 20:06 ` Aditi Ghag
  2023-03-24 21:56   ` Stanislav Fomichev
  2023-03-27 22:28   ` Martin KaFai Lau
  2023-03-23 20:06 ` [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 29+ messages in thread
From: Aditi Ghag @ 2023-03-23 20:06 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag, Martin KaFai Lau

Batch UDP sockets from the BPF iterator so that BPF/kernel helpers with
overlapping locking semantics can be executed from BPF programs. This
allows the BPF socket destroy kfunc (introduced in follow-up patches) to
be invoked from BPF iterator programs.

Previously, BPF iterators acquired the sock lock and the sockets hash
table bucket lock while executing BPF programs. This prevented BPF helpers
that also acquire these locks from being executed from BPF iterators. With
the batching approach, we acquire the bucket lock, batch all the sockets
in the bucket, and then release the bucket lock. This enables BPF or
kernel helpers to skip sock locking when invoked in the supported BPF
contexts.

The batching logic is similar to the logic implemented in the TCP iterator:
https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
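
A simplified sketch of the per-bucket flow added below (it omits batch
resizing and resuming from a saved offset, both handled by the actual
code):

```
spin_lock_bh(&hslot->lock);
sk_for_each(sk, &hslot->head) {
	if (seq_sk_match(seq, sk) && iter->end_sk < iter->max_sk) {
		sock_hold(sk);
		iter->batch[iter->end_sk++] = sk;
	}
}
spin_unlock_bh(&hslot->lock);
/* seq_show() then runs the BPF program on iter->batch[] without the
 * bucket lock held, so the program may call helpers/kfuncs that take
 * the sock or bucket locks themselves.
 */
```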

Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 include/net/udp.h |   1 +
 net/ipv4/udp.c    | 255 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 247 insertions(+), 9 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index de4b528522bb..d2999447d3f2 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -437,6 +437,7 @@ struct udp_seq_afinfo {
 struct udp_iter_state {
 	struct seq_net_private  p;
 	int			bucket;
+	int			offset;
 	struct udp_seq_afinfo	*bpf_seq_afinfo;
 };
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..58c620243e47 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3152,6 +3152,171 @@ struct bpf_iter__udp {
 	int bucket __aligned(8);
 };
 
+struct bpf_udp_iter_state {
+	struct udp_iter_state state;
+	unsigned int cur_sk;
+	unsigned int end_sk;
+	unsigned int max_sk;
+	struct sock **batch;
+	bool st_bucket_done;
+};
+
+static unsigned short seq_file_family(const struct seq_file *seq);
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz);
+
+static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
+{
+	unsigned short family = seq_file_family(seq);
+
+	/* AF_UNSPEC is used as a match all */
+	return ((family == AF_UNSPEC || family == sk->sk_family) &&
+		net_eq(sock_net(sk), seq_file_net(seq)));
+}
+
+static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct udp_iter_state *state = &iter->state;
+	struct net *net = seq_file_net(seq);
+	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
+	struct udp_table *udptable;
+	struct sock *first_sk = NULL;
+	struct sock *sk;
+	unsigned int bucket_sks = 0;
+	bool resized = false;
+	int offset = 0;
+	int new_offset;
+
+	/* The current batch is done, so advance the bucket. */
+	if (iter->st_bucket_done) {
+		state->bucket++;
+		state->offset = 0;
+	}
+
+	udptable = udp_get_table_afinfo(afinfo, net);
+
+	if (state->bucket > udptable->mask) {
+		state->bucket = 0;
+		state->offset = 0;
+		return NULL;
+	}
+
+again:
+	/* New batch for the next bucket.
+	 * Iterate over the hash table to find a bucket with sockets matching
+	 * the iterator attributes, and return the first matching socket from
+	 * the bucket. The remaining matched sockets from the bucket are batched
+	 * before releasing the bucket lock. This allows BPF programs that are
+	 * called in seq_show to acquire the bucket lock if needed.
+	 */
+	iter->cur_sk = 0;
+	iter->end_sk = 0;
+	iter->st_bucket_done = false;
+	first_sk = NULL;
+	bucket_sks = 0;
+	offset = state->offset;
+	new_offset = offset;
+
+	for (; state->bucket <= udptable->mask; state->bucket++) {
+		struct udp_hslot *hslot = &udptable->hash[state->bucket];
+
+		if (hlist_empty(&hslot->head)) {
+			offset = 0;
+			continue;
+		}
+
+		spin_lock_bh(&hslot->lock);
+		/* Resume from the last saved position in a bucket before
+		 * iterator was stopped.
+		 */
+		while (offset-- > 0) {
+			sk_for_each(sk, &hslot->head)
+				continue;
+		}
+		sk_for_each(sk, &hslot->head) {
+			if (seq_sk_match(seq, sk)) {
+				if (!first_sk)
+					first_sk = sk;
+				if (iter->end_sk < iter->max_sk) {
+					sock_hold(sk);
+					iter->batch[iter->end_sk++] = sk;
+				}
+				bucket_sks++;
+			}
+			new_offset++;
+		}
+		spin_unlock_bh(&hslot->lock);
+
+		if (first_sk)
+			break;
+
+		/* Reset the current bucket's offset before moving to the next bucket. */
+		offset = 0;
+		new_offset = 0;
+	}
+
+	/* All done: no batch made. */
+	if (!first_sk)
+		goto ret;
+
+	if (iter->end_sk == bucket_sks) {
+		/* Batching is done for the current bucket; return the first
+		 * socket to be iterated from the batch.
+		 */
+		iter->st_bucket_done = true;
+		goto ret;
+	}
+	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
+		resized = true;
+		/* Go back to the previous bucket to resize its batch. */
+		state->bucket--;
+		goto again;
+	}
+ret:
+	state->offset = new_offset;
+	return first_sk;
+}
+
+static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_udp_iter_state *iter = seq->private;
+	struct udp_iter_state *state = &iter->state;
+	struct sock *sk;
+
+	/* Whenever seq_next() is called, the iter->cur_sk is
+	 * done with seq_show(), so unref the iter->cur_sk.
+	 */
+	if (iter->cur_sk < iter->end_sk) {
+		sock_put(iter->batch[iter->cur_sk++]);
+		++state->offset;
+	}
+
+	/* After updating iter->cur_sk, check if there are more sockets
+	 * available in the current bucket batch.
+	 */
+	if (iter->cur_sk < iter->end_sk) {
+		sk = iter->batch[iter->cur_sk];
+	} else {
+		// Prepare a new batch.
+		sk = bpf_iter_udp_batch(seq);
+	}
+
+	++*pos;
+	return sk;
+}
+
+static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* bpf iter does not support lseek, so it always
+	 * continue from where it was stop()-ped.
+	 */
+	if (*pos)
+		return bpf_iter_udp_batch(seq);
+
+	return SEQ_START_TOKEN;
+}
+
 static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
 			     struct udp_sock *udp_sk, uid_t uid, int bucket)
 {
@@ -3172,18 +3337,38 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
 	struct bpf_prog *prog;
 	struct sock *sk = v;
 	uid_t uid;
+	bool slow;
+	int rc;
 
 	if (v == SEQ_START_TOKEN)
 		return 0;
 
+	slow = lock_sock_fast(sk);
+
+	if (unlikely(sk_unhashed(sk))) {
+		rc = SEQ_SKIP;
+		goto unlock;
+	}
+
 	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
 	meta.seq = seq;
 	prog = bpf_iter_get_info(&meta, false);
-	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
+
+unlock:
+	unlock_sock_fast(sk, slow);
+	return rc;
+}
+
+static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
+{
+	while (iter->cur_sk < iter->end_sk)
+		sock_put(iter->batch[iter->cur_sk++]);
 }
 
 static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 {
+	struct bpf_udp_iter_state *iter = seq->private;
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 
@@ -3194,15 +3379,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
 			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
 	}
 
-	udp_seq_stop(seq, v);
+	if (iter->cur_sk < iter->end_sk) {
+		bpf_iter_udp_unref_batch(iter);
+		iter->st_bucket_done = false;
+	}
 }
 
 static const struct seq_operations bpf_iter_udp_seq_ops = {
-	.start		= udp_seq_start,
-	.next		= udp_seq_next,
+	.start		= bpf_iter_udp_seq_start,
+	.next		= bpf_iter_udp_seq_next,
 	.stop		= bpf_iter_udp_seq_stop,
 	.show		= bpf_iter_udp_seq_show,
 };
+
+static unsigned short seq_file_family(const struct seq_file *seq)
+{
+	const struct udp_seq_afinfo *afinfo;
+
+	/* BPF iterator: bpf programs to filter sockets. */
+	if (seq->op == &bpf_iter_udp_seq_ops)
+		return AF_UNSPEC;
+
+	/* Proc fs iterator */
+	afinfo = pde_data(file_inode(seq->file));
+	return afinfo->family;
+}
 #endif
 
 const struct seq_operations udp_seq_ops = {
@@ -3413,9 +3614,30 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
 DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
 		     struct udp_sock *udp_sk, uid_t uid, int bucket)
 
+static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
+				      unsigned int new_batch_sz)
+{
+	struct sock **new_batch;
+
+	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
+				   GFP_USER | __GFP_NOWARN);
+	if (!new_batch)
+		return -ENOMEM;
+
+	bpf_iter_udp_unref_batch(iter);
+	kvfree(iter->batch);
+	iter->batch = new_batch;
+	iter->max_sk = new_batch_sz;
+
+	return 0;
+}
+
+#define INIT_BATCH_SZ 16
+
 static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 {
-	struct udp_iter_state *st = priv_data;
+	struct bpf_udp_iter_state *iter = priv_data;
+	struct udp_iter_state *st = &iter->state;
 	struct udp_seq_afinfo *afinfo;
 	int ret;
 
@@ -3427,24 +3649,39 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
 	afinfo->udp_table = NULL;
 	st->bpf_seq_afinfo = afinfo;
 	ret = bpf_iter_init_seq_net(priv_data, aux);
-	if (ret)
+	if (ret) {
 		kfree(afinfo);
+		return ret;
+	}
+	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
+	if (ret) {
+		bpf_iter_fini_seq_net(priv_data);
+		return ret;
+	}
+	iter->cur_sk = 0;
+	iter->end_sk = 0;
+	iter->st_bucket_done = false;
+	st->bucket = 0;
+	st->offset = 0;
+
 	return ret;
 }
 
 static void bpf_iter_fini_udp(void *priv_data)
 {
-	struct udp_iter_state *st = priv_data;
+	struct bpf_udp_iter_state *iter = priv_data;
+	struct udp_iter_state *st = &iter->state;
 
-	kfree(st->bpf_seq_afinfo);
 	bpf_iter_fini_seq_net(priv_data);
+	kfree(st->bpf_seq_afinfo);
+	kvfree(iter->batch);
 }
 
 static const struct bpf_iter_seq_info udp_seq_info = {
 	.seq_ops		= &bpf_iter_udp_seq_ops,
 	.init_seq_private	= bpf_iter_init_udp,
 	.fini_seq_private	= bpf_iter_fini_udp,
-	.seq_priv_size		= sizeof(struct udp_iter_state),
+	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
 };
 
 static struct bpf_iter_reg udp_reg_info = {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc
  2023-03-23 20:06 [PATCH v4 bpf-next 0/4] bpf: Add socket destroy capability Aditi Ghag
  2023-03-23 20:06 ` [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator Aditi Ghag
@ 2023-03-23 20:06 ` Aditi Ghag
  2023-03-23 23:58   ` Martin KaFai Lau
  2023-03-24 21:37   ` Stanislav Fomichev
  2023-03-23 20:06 ` [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator Aditi Ghag
  2023-03-23 20:06 ` [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
  3 siblings, 2 replies; 29+ messages in thread
From: Aditi Ghag @ 2023-03-23 20:06 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

The socket destroy kfunc is used to forcefully terminate sockets from
certain BPF contexts. We plan to use the capability in Cilium to force
client sockets to reconnect when their remote load-balancing backends are
deleted. The other use case is on-the-fly policy enforcement where existing
socket connections disallowed by newly enforced policies need to be
forcefully terminated.
The helper allows terminating sockets that may or may not be actively
sending traffic.

The helper is currently exposed to certain BPF iterators where users can
filter and terminate selected sockets. Additionally, the helper can only
be called from BPF contexts that ensure socket locking in order to allow
synchronous execution of destroy helpers that also acquire socket locks.
The previous commit that batches UDP sockets during iteration facilitates
synchronous invocation of the destroy helper from the BPF context by
letting the destroy handler skip taking socket locks. TCP iterators
already support batching.

The helper takes a `sock_common` type argument, even though it expects, and
casts it to, a `sock` pointer. This enables the verifier to allow the
sock_destroy kfunc to be called for TCP with `sock_common` and for UDP with
`sock` structs. As a comparison, BPF helpers enable this behavior with the
`ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
option available with the verifier logic that handles kfuncs, where BTF
types are inferred. Furthermore, as `sock_common` only has a subset of the
fields of `sock`, casting a pointer to the latter type might not always be
safe. Hence, the BPF kfunc converts the argument to a full sock before
casting.
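
For illustration, minimal iterator programs calling the kfunc could look
like the sketch below (program names here are made up; patch 4 contains
the actual selftests, which also filter sockets before destroying them).
The TCP iterator passes its `sock_common` pointer through directly, while
the UDP iterator's `struct sock` is cast down:

```
/* Assumes vmlinux.h and bpf/bpf_helpers.h, as in the selftests. */
int bpf_sock_destroy(struct sock_common *sk) __ksym;

SEC("iter/tcp")
int destroy_tcp(struct bpf_iter__tcp *ctx)
{
	struct sock_common *sk_common = ctx->sk_common;

	if (sk_common)
		bpf_sock_destroy(sk_common);
	return 0;
}

SEC("iter/udp")
int destroy_udp(struct bpf_iter__udp *ctx)
{
	struct sock *sk = (struct sock *)ctx->udp_sk;

	if (sk)
		bpf_sock_destroy((struct sock_common *)sk);
	return 0;
}
```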

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/core/filter.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c    | 10 ++++++---
 net/ipv4/udp.c    |  6 ++++--
 3 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 1d6f165923bf..ba3e0dac119c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
 
 	return func;
 }
+
+/* Disables missing prototype warnings */
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
+ *
+ * The helper expects a non-NULL pointer to a socket. It invokes the
+ * protocol specific socket destroy handlers.
+ *
+ * The helper can only be called from BPF contexts that have acquired the socket
+ * locks.
+ *
+ * Parameters:
+ * @sock: Pointer to socket to be destroyed
+ *
+ * Return:
+ * On error, may return EOPNOTSUPP or EINVAL.
+ * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
+ * 0 otherwise
+ */
+__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
+{
+	struct sock *sk = (struct sock *)sock;
+
+	if (!sk)
+		return -EINVAL;
+
+	/* The locking semantics that allow for synchronous execution of the
+	 * destroy handlers are only supported for TCP and UDP.
+	 */
+	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
+		return -EOPNOTSUPP;
+
+	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
+}
+
+__diag_pop()
+
+BTF_SET8_START(sock_destroy_kfunc_set)
+BTF_ID_FLAGS(func, bpf_sock_destroy)
+BTF_SET8_END(sock_destroy_kfunc_set)
+
+static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &sock_destroy_kfunc_set,
+};
+
+static int init_subsystem(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
+}
+late_initcall(init_subsystem);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 33f559f491c8..5df6231016e3 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
 		return 0;
 	}
 
-	/* Don't race with userspace socket closes such as tcp_close. */
-	lock_sock(sk);
+	/* BPF context ensures sock locking. */
+	if (!has_current_bpf_ctx())
+		/* Don't race with userspace socket closes such as tcp_close. */
+		lock_sock(sk);
 
 	if (sk->sk_state == TCP_LISTEN) {
 		tcp_set_state(sk, TCP_CLOSE);
@@ -4701,9 +4703,11 @@ int tcp_abort(struct sock *sk, int err)
 	}
 
 	bh_unlock_sock(sk);
+
 	local_bh_enable();
 	tcp_write_queue_purge(sk);
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(tcp_abort);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 58c620243e47..408836102e20 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2925,7 +2925,8 @@ EXPORT_SYMBOL(udp_poll);
 
 int udp_abort(struct sock *sk, int err)
 {
-	lock_sock(sk);
+	if (!has_current_bpf_ctx())
+		lock_sock(sk);
 
 	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
 	 * with close()
@@ -2938,7 +2939,8 @@ int udp_abort(struct sock *sk, int err)
 	__udp_disconnect(sk, 0);
 
 out:
-	release_sock(sk);
+	if (!has_current_bpf_ctx())
+		release_sock(sk);
 
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator
  2023-03-23 20:06 [PATCH v4 bpf-next 0/4] bpf: Add socket destroy capability Aditi Ghag
  2023-03-23 20:06 ` [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator Aditi Ghag
  2023-03-23 20:06 ` [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
@ 2023-03-23 20:06 ` Aditi Ghag
  2023-03-24 21:45   ` Stanislav Fomichev
  2023-03-27 22:34   ` Martin KaFai Lau
  2023-03-23 20:06 ` [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
  3 siblings, 2 replies; 29+ messages in thread
From: Aditi Ghag @ 2023-03-23 20:06 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

Previously, the BPF TCP iterator acquired the fast version of the sock
lock, which disables BH. This introduced a circular dependency with code
paths that later acquire the sockets hash table bucket lock.
Replace the fast version of the sock lock with the slow one, which
facilitates BPF programs executed from the iterator destroying TCP
listening sockets using the bpf_sock_destroy kfunc.

Here is a stack trace that motivated this change:

```
1) sock_lock with BH disabled + bucket lock

lock_acquire+0xcd/0x330
_raw_spin_lock_bh+0x38/0x50
inet_unhash+0x96/0xd0
tcp_set_state+0x6a/0x210
tcp_abort+0x12b/0x230
bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
bpf_iter_run_prog+0x1ff/0x340
bpf_iter_tcp_seq_show+0xca/0x190
bpf_seq_read+0x177/0x450
vfs_read+0xc6/0x300
ksys_read+0x69/0xf0
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

2) sock lock with BH enabled

[    1.499968]   lock_acquire+0xcd/0x330
[    1.500316]   _raw_spin_lock+0x33/0x40
[    1.500670]   sk_clone_lock+0x146/0x520
[    1.501030]   inet_csk_clone_lock+0x1b/0x110
[    1.501433]   tcp_create_openreq_child+0x22/0x3f0
[    1.501873]   tcp_v6_syn_recv_sock+0x96/0x940
[    1.502284]   tcp_check_req+0x137/0x660
[    1.502646]   tcp_v6_rcv+0xa63/0xe80
[    1.502994]   ip6_protocol_deliver_rcu+0x78/0x590
[    1.503434]   ip6_input_finish+0x72/0x140
[    1.503818]   __netif_receive_skb_one_core+0x63/0xa0
[    1.504281]   process_backlog+0x79/0x260
[    1.504668]   __napi_poll.constprop.0+0x27/0x170
[    1.505104]   net_rx_action+0x14a/0x2a0
[    1.505469]   __do_softirq+0x165/0x510
[    1.505842]   do_softirq+0xcd/0x100
[    1.506172]   __local_bh_enable_ip+0xcc/0xf0
[    1.506588]   ip6_finish_output2+0x2a8/0xb00
[    1.506988]   ip6_finish_output+0x274/0x510
[    1.507377]   ip6_xmit+0x319/0x9b0
[    1.507726]   inet6_csk_xmit+0x12b/0x2b0
[    1.508096]   __tcp_transmit_skb+0x549/0xc40
[    1.508498]   tcp_rcv_state_process+0x362/0x1180

```

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 net/ipv4/tcp_ipv4.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ea370afa70ed..f2d370a9450f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2962,7 +2962,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 	struct bpf_iter_meta meta;
 	struct bpf_prog *prog;
 	struct sock *sk = v;
-	bool slow;
 	uid_t uid;
 	int ret;
 
@@ -2970,7 +2969,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 		return 0;
 
 	if (sk_fullsock(sk))
-		slow = lock_sock_fast(sk);
+		lock_sock(sk);
 
 	if (unlikely(sk_unhashed(sk))) {
 		ret = SEQ_SKIP;
@@ -2994,7 +2993,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
 
 unlock:
 	if (sk_fullsock(sk))
-		unlock_sock_fast(sk, slow);
+		release_sock(sk);
 	return ret;
 
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-23 20:06 [PATCH v4 bpf-next 0/4] bpf: Add socket destroy capability Aditi Ghag
                   ` (2 preceding siblings ...)
  2023-03-23 20:06 ` [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator Aditi Ghag
@ 2023-03-23 20:06 ` Aditi Ghag
  2023-03-24 21:52   ` Stanislav Fomichev
  3 siblings, 1 reply; 29+ messages in thread
From: Aditi Ghag @ 2023-03-23 20:06 UTC (permalink / raw)
  To: bpf; +Cc: kafai, sdf, edumazet, aditi.ghag

The test cases for destroying sockets mirror the intended usages of the
bpf_sock_destroy kfunc using iterators.

The destroy helpers set the `ECONNABORTED` error code that we can validate
in the test code with client sockets. But UDP sockets have an overriding
error code from the disconnect called during abort, so the error code
validation is only done for TCP sockets.

Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
---
 .../selftests/bpf/prog_tests/sock_destroy.c   | 195 ++++++++++++++++++
 .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
 2 files changed, 346 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
new file mode 100644
index 000000000000..cbce966af568
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
@@ -0,0 +1,195 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+
+#include "sock_destroy_prog.skel.h"
+#include "network_helpers.h"
+
+#define SERVER_PORT 6062
+
+static void start_iter_sockets(struct bpf_program *prog)
+{
+	struct bpf_link *link;
+	char buf[50] = {};
+	int iter_fd, len;
+
+	link = bpf_program__attach_iter(prog, NULL);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
+		goto free_link;
+
+	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
+		;
+	ASSERT_GE(len, 0, "read");
+
+	close(iter_fd);
+
+free_link:
+	bpf_link__destroy(link);
+}
+
+static void test_tcp_client(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup_serv;
+
+	serv = accept(serv, NULL, NULL);
+	if (!ASSERT_GE(serv, 0, "serv accept"))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_GE(n, 0, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys connected client sockets. */
+	start_iter_sockets(skel->progs.iter_tcp6_client);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
+
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+static void test_tcp_server(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup_serv;
+
+	serv = accept(serv, NULL, NULL);
+	if (!ASSERT_GE(serv, 0, "serv accept"))
+		goto cleanup;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_GE(n, 0, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys server sockets. */
+	start_iter_sockets(skel->progs.iter_tcp6_server);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
+
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+
+static void test_udp_client(struct sock_destroy_prog *skel)
+{
+	int serv = -1, clien = -1, n = 0;
+
+	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
+	if (!ASSERT_GE(serv, 0, "start_server"))
+		goto cleanup_serv;
+
+	clien = connect_to_fd(serv, 0);
+	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
+		goto cleanup_serv;
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_GE(n, 0, "client send"))
+		goto cleanup;
+
+	/* Run iterator program that destroys sockets. */
+	start_iter_sockets(skel->progs.iter_udp6_client);
+
+	n = send(clien, "t", 1, 0);
+	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
+		goto cleanup;
+	/* UDP sockets have an overriding error code after they are disconnected,
+	 * so we don't check for ECONNABORTED error code.
+	 */
+
+cleanup:
+	close(clien);
+cleanup_serv:
+	close(serv);
+}
+
+static void test_udp_server(struct sock_destroy_prog *skel)
+{
+	int *listen_fds = NULL, n, i;
+	unsigned int num_listens = 5;
+	char buf[1];
+
+	/* Start reuseport servers. */
+	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
+					    "::1", SERVER_PORT, 0,
+					    num_listens);
+	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
+		goto cleanup;
+
+	/* Run iterator program that destroys server sockets. */
+	start_iter_sockets(skel->progs.iter_udp6_server);
+
+	for (i = 0; i < num_listens; ++i) {
+		n = read(listen_fds[i], buf, sizeof(buf));
+		if (!ASSERT_EQ(n, -1, "read") ||
+		    !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
+			break;
+	}
+	ASSERT_EQ(i, num_listens, "server socket");
+
+cleanup:
+	free_fds(listen_fds, num_listens);
+}
+
+void test_sock_destroy(void)
+{
+	int cgroup_fd = 0;
+	struct sock_destroy_prog *skel;
+
+	skel = sock_destroy_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	cgroup_fd = test__join_cgroup("/sock_destroy");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		goto close_cgroup_fd;
+
+	skel->links.sock_connect = bpf_program__attach_cgroup(
+		skel->progs.sock_connect, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
+		goto close_cgroup_fd;
+
+	if (test__start_subtest("tcp_client"))
+		test_tcp_client(skel);
+	if (test__start_subtest("tcp_server"))
+		test_tcp_server(skel);
+	if (test__start_subtest("udp_client"))
+		test_udp_client(skel);
+	if (test__start_subtest("udp_server"))
+		test_udp_server(skel);
+
+
+close_cgroup_fd:
+	close(cgroup_fd);
+	sock_destroy_prog__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
new file mode 100644
index 000000000000..8e09d82c50f3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define AF_INET6 10
+/* Keep it in sync with prog_test/sock_destroy. */
+#define SERVER_PORT 6062
+
+int bpf_sock_destroy(struct sock_common *sk) __ksym;
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} tcp_conn_sockets SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} udp_conn_sockets SEC(".maps");
+
+SEC("cgroup/connect6")
+int sock_connect(struct bpf_sock_addr *ctx)
+{
+	int key = 0;
+	__u64 sock_cookie = 0;
+	__u32 keyc = 0;
+
+	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
+		return 1;
+
+	sock_cookie = bpf_get_socket_cookie(ctx);
+	if (ctx->protocol == IPPROTO_TCP)
+		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
+	else if (ctx->protocol == IPPROTO_UDP)
+		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
+	else
+		return 1;
+
+	return 1;
+}
+
+SEC("iter/tcp")
+int iter_tcp6_client(struct bpf_iter__tcp *ctx)
+{
+	struct sock_common *sk_common = ctx->sk_common;
+	struct seq_file *seq = ctx->meta->seq;
+	__u64 sock_cookie = 0;
+	__u64 *val;
+	int key = 0;
+
+	if (!sk_common)
+		return 0;
+
+	if (sk_common->skc_family != AF_INET6)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk_common);
+	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
+	if (!val)
+		return 0;
+	/* Destroy connected client sockets. */
+	if (sock_cookie == *val)
+		bpf_sock_destroy(sk_common);
+
+	return 0;
+}
+
+SEC("iter/tcp")
+int iter_tcp6_server(struct bpf_iter__tcp *ctx)
+{
+	struct sock_common *sk_common = ctx->sk_common;
+	struct seq_file *seq = ctx->meta->seq;
+	struct tcp6_sock *tcp_sk;
+	const struct inet_connection_sock *icsk;
+	const struct inet_sock *inet;
+	__u16 srcp;
+
+	if (!sk_common)
+		return 0;
+
+	if (sk_common->skc_family != AF_INET6)
+		return 0;
+
+	tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
+	if (!tcp_sk)
+		return 0;
+
+	icsk = &tcp_sk->tcp.inet_conn;
+	inet = &icsk->icsk_inet;
+	srcp = bpf_ntohs(inet->inet_sport);
+
+	/* Destroy server sockets. */
+	if (srcp == SERVER_PORT)
+		bpf_sock_destroy(sk_common);
+
+	return 0;
+}
+
+
+SEC("iter/udp")
+int iter_udp6_client(struct bpf_iter__udp *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct udp_sock *udp_sk = ctx->udp_sk;
+	struct sock *sk = (struct sock *) udp_sk;
+	__u64 sock_cookie = 0, *val;
+	int key = 0;
+
+	if (!sk)
+		return 0;
+
+	sock_cookie  = bpf_get_socket_cookie(sk);
+	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
+	if (!val)
+		return 0;
+	/* Destroy connected client sockets. */
+	if (sock_cookie == *val)
+		bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
+SEC("iter/udp")
+int iter_udp6_server(struct bpf_iter__udp *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct udp_sock *udp_sk = ctx->udp_sk;
+	struct sock *sk = (struct sock *) udp_sk;
+	__u16 srcp;
+	struct inet_sock *inet;
+
+	if (!sk)
+		return 0;
+
+	inet = &udp_sk->inet;
+	srcp = bpf_ntohs(inet->inet_sport);
+	if (srcp == SERVER_PORT)
+		bpf_sock_destroy((struct sock_common *)sk);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc
  2023-03-23 20:06 ` [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
@ 2023-03-23 23:58   ` Martin KaFai Lau
  2023-03-24 21:37   ` Stanislav Fomichev
  1 sibling, 0 replies; 29+ messages in thread
From: Martin KaFai Lau @ 2023-03-23 23:58 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: sdf, edumazet, bpf, Martin KaFai Lau

On 3/23/23 1:06 PM, Aditi Ghag wrote:
> The socket destroy kfunc is used to forcefully terminate sockets from
> certain BPF contexts. We plan to use the capability in Cilium to force
> client sockets to reconnect when their remote load-balancing backends are
> deleted. The other use case is on-the-fly policy enforcement where existing
> socket connections prevented by policies need to be forcefully terminated.
> The helper allows terminating sockets that may or may not be actively
> sending traffic.
> 
> The helper is currently exposed to certain BPF iterators where users can
> filter, and terminate selected sockets.  Additionally, the helper can only
> be called from these BPF contexts that ensure socket locking in order to
> allow synchronous execution of destroy helpers that also acquire socket
> locks. The previous commit that batches UDP sockets during iteration
> facilitated a synchronous invocation of the destroy helper from BPF context
> by skipping taking socket locks in the destroy handler. TCP iterators
> already supported batching.
> 
> The helper takes `sock_common` type argument, even though it expects, and
> casts them to a `sock` pointer. This enables the verifier to allow the
> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
> `sock` structs. As a comparison, BPF helpers enable this behavior with the
> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
> option available with the verifier logic that handles kfuncs where BTF
> types are inferred. Furthermore, as `sock_common` only has a subset of
> certain fields of `sock`, casting pointer to the latter type might not
> always be safe. Hence, the BPF kfunc converts the argument to a full sock
> before casting.
> 
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/core/filter.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++

This patch has merge conflict: https://github.com/kernel-patches/bpf/pull/4811, 
so please rebase in the next respin.

I took a quick skim but haven't finished. Please hold off the respin and review 
can be continued on this revision.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc
  2023-03-23 20:06 ` [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
  2023-03-23 23:58   ` Martin KaFai Lau
@ 2023-03-24 21:37   ` Stanislav Fomichev
  2023-03-30 14:42     ` Aditi Ghag
  1 sibling, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-03-24 21:37 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On 03/23, Aditi Ghag wrote:
> The socket destroy kfunc is used to forcefully terminate sockets from
> certain BPF contexts. We plan to use the capability in Cilium to force
> client sockets to reconnect when their remote load-balancing backends are
> deleted. The other use case is on-the-fly policy enforcement where  
> existing
> socket connections prevented by policies need to be forcefully terminated.
> The helper allows terminating sockets that may or may not be actively
> sending traffic.

> The helper is currently exposed to certain BPF iterators where users can
> filter, and terminate selected sockets.  Additionally, the helper can only
> be called from these BPF contexts that ensure socket locking in order to
> allow synchronous execution of destroy helpers that also acquire socket
> locks. The previous commit that batches UDP sockets during iteration
> facilitated a synchronous invocation of the destroy helper from BPF  
> context
> by skipping taking socket locks in the destroy handler. TCP iterators
> already supported batching.

> The helper takes `sock_common` type argument, even though it expects, and
> casts them to a `sock` pointer. This enables the verifier to allow the
> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
> `sock` structs. As a comparison, BPF helpers enable this behavior with the
> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
> option available with the verifier logic that handles kfuncs where BTF
> types are inferred. Furthermore, as `sock_common` only has a subset of
> certain fields of `sock`, casting pointer to the latter type might not
> always be safe. Hence, the BPF kfunc converts the argument to a full sock
> before casting.

> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   net/core/filter.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++
>   net/ipv4/tcp.c    | 10 ++++++---
>   net/ipv4/udp.c    |  6 ++++--
>   3 files changed, 65 insertions(+), 5 deletions(-)

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 1d6f165923bf..ba3e0dac119c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)

>   	return func;
>   }
> +
> +/* Disables missing prototype warnings */
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in vmlinux BTF");
> +
> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error  
> code.
> + *
> + * The helper expects a non-NULL pointer to a socket. It invokes the
> + * protocol specific socket destroy handlers.
> + *
> + * The helper can only be called from BPF contexts that have acquired  
> the socket
> + * locks.
> + *
> + * Parameters:
> + * @sock: Pointer to socket to be destroyed
> + *
> + * Return:
> + * On error, may return EPROTONOSUPPORT, EINVAL.
> + * EPROTONOSUPPORT if protocol specific destroy handler is not  
> implemented.
> + * 0 otherwise
> + */
> +__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
> +{
> +	struct sock *sk = (struct sock *)sock;
> +
> +	if (!sk)
> +		return -EINVAL;
> +
> +	/* The locking semantics that allow for synchronous execution of the
> +	 * destroy handlers are only supported for TCP and UDP.
> +	 */
> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> +		return -EOPNOTSUPP;

Copy-pasting from v3, let's discuss here.

Maybe make it more opt-in? (vs current "opt ipproto_raw out")

if (sk->sk_prot->diag_destroy != udp_abort &&
     sk->sk_prot->diag_destroy != tcp_abort)
             return -EOPNOTSUPP;

Is it more robust? Or does it look uglier? )
But maybe fine as is, I'm just thinking out loud..

> +
> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> +}
> +
> +__diag_pop()
> +
> +BTF_SET8_START(sock_destroy_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_sock_destroy)
> +BTF_SET8_END(sock_destroy_kfunc_set)
> +
> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &sock_destroy_kfunc_set,
> +};
> +
> +static int init_subsystem(void)
> +{
> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,  
> &bpf_sock_destroy_kfunc_set);
> +}
> +late_initcall(init_subsystem);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 33f559f491c8..5df6231016e3 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>   		return 0;
>   	}

> -	/* Don't race with userspace socket closes such as tcp_close. */
> -	lock_sock(sk);
> +	/* BPF context ensures sock locking. */
> +	if (!has_current_bpf_ctx())
> +		/* Don't race with userspace socket closes such as tcp_close. */
> +		lock_sock(sk);

>   	if (sk->sk_state == TCP_LISTEN) {
>   		tcp_set_state(sk, TCP_CLOSE);
> @@ -4701,9 +4703,11 @@ int tcp_abort(struct sock *sk, int err)
>   	}

>   	bh_unlock_sock(sk);
> +
>   	local_bh_enable();
>   	tcp_write_queue_purge(sk);
> -	release_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		release_sock(sk);
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(tcp_abort);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 58c620243e47..408836102e20 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2925,7 +2925,8 @@ EXPORT_SYMBOL(udp_poll);

>   int udp_abort(struct sock *sk, int err)
>   {
> -	lock_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		lock_sock(sk);

>   	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
>   	 * with close()
> @@ -2938,7 +2939,8 @@ int udp_abort(struct sock *sk, int err)
>   	__udp_disconnect(sk, 0);

>   out:
> -	release_sock(sk);
> +	if (!has_current_bpf_ctx())
> +		release_sock(sk);

>   	return 0;
>   }
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator
  2023-03-23 20:06 ` [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator Aditi Ghag
@ 2023-03-24 21:45   ` Stanislav Fomichev
  2023-03-28 15:20     ` Aditi Ghag
  2023-03-27 22:34   ` Martin KaFai Lau
  1 sibling, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-03-24 21:45 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On 03/23, Aditi Ghag wrote:
> Previously, BPF TCP iterator was acquiring fast version of sock lock that
> disables the BH. This introduced a circular dependency with code paths  
> that
> later acquire sockets hash table bucket lock.
> Replace the fast version of sock lock with slow that faciliates BPF
> programs executed from the iterator to destroy TCP listening sockets
> using the bpf_sock_destroy kfunc.

> Here is a stack trace that motivated this change:

> ```
> 1) sock_lock with BH disabled + bucket lock

> lock_acquire+0xcd/0x330
> _raw_spin_lock_bh+0x38/0x50
> inet_unhash+0x96/0xd0
> tcp_set_state+0x6a/0x210
> tcp_abort+0x12b/0x230
> bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
> bpf_iter_run_prog+0x1ff/0x340
> bpf_iter_tcp_seq_show+0xca/0x190
> bpf_seq_read+0x177/0x450
> vfs_read+0xc6/0x300
> ksys_read+0x69/0xf0
> do_syscall_64+0x3c/0x90
> entry_SYSCALL_64_after_hwframe+0x72/0xdc

> 2) sock lock with BH enable

> [    1.499968]   lock_acquire+0xcd/0x330
> [    1.500316]   _raw_spin_lock+0x33/0x40
> [    1.500670]   sk_clone_lock+0x146/0x520
> [    1.501030]   inet_csk_clone_lock+0x1b/0x110
> [    1.501433]   tcp_create_openreq_child+0x22/0x3f0
> [    1.501873]   tcp_v6_syn_recv_sock+0x96/0x940
> [    1.502284]   tcp_check_req+0x137/0x660
> [    1.502646]   tcp_v6_rcv+0xa63/0xe80
> [    1.502994]   ip6_protocol_deliver_rcu+0x78/0x590
> [    1.503434]   ip6_input_finish+0x72/0x140
> [    1.503818]   __netif_receive_skb_one_core+0x63/0xa0
> [    1.504281]   process_backlog+0x79/0x260
> [    1.504668]   __napi_poll.constprop.0+0x27/0x170
> [    1.505104]   net_rx_action+0x14a/0x2a0
> [    1.505469]   __do_softirq+0x165/0x510
> [    1.505842]   do_softirq+0xcd/0x100
> [    1.506172]   __local_bh_enable_ip+0xcc/0xf0
> [    1.506588]   ip6_finish_output2+0x2a8/0xb00
> [    1.506988]   ip6_finish_output+0x274/0x510
> [    1.507377]   ip6_xmit+0x319/0x9b0
> [    1.507726]   inet6_csk_xmit+0x12b/0x2b0
> [    1.508096]   __tcp_transmit_skb+0x549/0xc40
> [    1.508498]   tcp_rcv_state_process+0x362/0x1180

> ```

> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>

Acked-by: Stanislav Fomichev <sdf@google.com>

No Fixes tag needed because it doesn't trigger without your new
bpf_sock_destroy?


> ---
>   net/ipv4/tcp_ipv4.c | 5 ++---
>   1 file changed, 2 insertions(+), 3 deletions(-)

> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index ea370afa70ed..f2d370a9450f 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2962,7 +2962,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file  
> *seq, void *v)
>   	struct bpf_iter_meta meta;
>   	struct bpf_prog *prog;
>   	struct sock *sk = v;
> -	bool slow;
>   	uid_t uid;
>   	int ret;

> @@ -2970,7 +2969,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file  
> *seq, void *v)
>   		return 0;

>   	if (sk_fullsock(sk))
> -		slow = lock_sock_fast(sk);
> +		lock_sock(sk);

>   	if (unlikely(sk_unhashed(sk))) {
>   		ret = SEQ_SKIP;
> @@ -2994,7 +2993,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file  
> *seq, void *v)

>   unlock:
>   	if (sk_fullsock(sk))
> -		unlock_sock_fast(sk, slow);
> +		release_sock(sk);
>   	return ret;

>   }
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-23 20:06 ` [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
@ 2023-03-24 21:52   ` Stanislav Fomichev
  2023-03-27 15:57     ` Aditi Ghag
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-03-24 21:52 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On 03/23, Aditi Ghag wrote:
> The test cases for destroying sockets mirror the intended usages of the
> bpf_sock_destroy kfunc using iterators.

> The destroy helpers set `ECONNABORTED` error code that we can validate in
> the test code with client sockets. But UDP sockets have an overriding  
> error
> code from the disconnect called during abort, so the error code the
> validation is only done for TCP sockets.

> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   .../selftests/bpf/prog_tests/sock_destroy.c   | 195 ++++++++++++++++++
>   .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
>   2 files changed, 346 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>   create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c

> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c  
> b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> new file mode 100644
> index 000000000000..cbce966af568
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> @@ -0,0 +1,195 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +
> +#include "sock_destroy_prog.skel.h"
> +#include "network_helpers.h"
> +
> +#define SERVER_PORT 6062
> +
> +static void start_iter_sockets(struct bpf_program *prog)
> +{
> +	struct bpf_link *link;
> +	char buf[50] = {};
> +	int iter_fd, len;
> +
> +	link = bpf_program__attach_iter(prog, NULL);
> +	if (!ASSERT_OK_PTR(link, "attach_iter"))
> +		return;
> +
> +	iter_fd = bpf_iter_create(bpf_link__fd(link));
> +	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
> +		goto free_link;
> +
> +	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
> +		;
> +	ASSERT_GE(len, 0, "read");
> +
> +	close(iter_fd);
> +
> +free_link:
> +	bpf_link__destroy(link);
> +}
> +
> +static void test_tcp_client(struct sock_destroy_prog *skel)
> +{
> +	int serv = -1, clien = -1, n = 0;
> +
> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> +	if (!ASSERT_GE(serv, 0, "start_server"))
> +		goto cleanup_serv;
> +
> +	clien = connect_to_fd(serv, 0);
> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> +		goto cleanup_serv;
> +
> +	serv = accept(serv, NULL, NULL);
> +	if (!ASSERT_GE(serv, 0, "serv accept"))
> +		goto cleanup;
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_GE(n, 0, "client send"))
> +		goto cleanup;
> +
> +	/* Run iterator program that destroys connected client sockets. */
> +	start_iter_sockets(skel->progs.iter_tcp6_client);
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> +		goto cleanup;
> +	ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
> +
> +
> +cleanup:
> +	close(clien);
> +cleanup_serv:
> +	close(serv);
> +}
> +
> +static void test_tcp_server(struct sock_destroy_prog *skel)
> +{
> +	int serv = -1, clien = -1, n = 0;
> +
> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);
> +	if (!ASSERT_GE(serv, 0, "start_server"))
> +		goto cleanup_serv;
> +
> +	clien = connect_to_fd(serv, 0);
> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> +		goto cleanup_serv;
> +
> +	serv = accept(serv, NULL, NULL);
> +	if (!ASSERT_GE(serv, 0, "serv accept"))
> +		goto cleanup;
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_GE(n, 0, "client send"))
> +		goto cleanup;
> +
> +	/* Run iterator program that destroys server sockets. */
> +	start_iter_sockets(skel->progs.iter_tcp6_server);
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> +		goto cleanup;
> +	ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
> +
> +
> +cleanup:
> +	close(clien);
> +cleanup_serv:
> +	close(serv);
> +}
> +
> +
> +static void test_udp_client(struct sock_destroy_prog *skel)
> +{
> +	int serv = -1, clien = -1, n = 0;
> +
> +	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
> +	if (!ASSERT_GE(serv, 0, "start_server"))
> +		goto cleanup_serv;
> +
> +	clien = connect_to_fd(serv, 0);
> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> +		goto cleanup_serv;
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_GE(n, 0, "client send"))
> +		goto cleanup;
> +
> +	/* Run iterator program that destroys sockets. */
> +	start_iter_sockets(skel->progs.iter_udp6_client);
> +
> +	n = send(clien, "t", 1, 0);
> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> +		goto cleanup;
> +	/* UDP sockets have an overriding error code after they are  
> disconnected,
> +	 * so we don't check for ECONNABORTED error code.
> +	 */
> +
> +cleanup:
> +	close(clien);
> +cleanup_serv:
> +	close(serv);
> +}
> +
> +static void test_udp_server(struct sock_destroy_prog *skel)
> +{
> +	int *listen_fds = NULL, n, i;
> +	unsigned int num_listens = 5;
> +	char buf[1];
> +
> +	/* Start reuseport servers. */
> +	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
> +					    "::1", SERVER_PORT, 0,
> +					    num_listens);
> +	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
> +		goto cleanup;
> +
> +	/* Run iterator program that destroys server sockets. */
> +	start_iter_sockets(skel->progs.iter_udp6_server);
> +
> +	for (i = 0; i < num_listens; ++i) {
> +		n = read(listen_fds[i], buf, sizeof(buf));
> +		if (!ASSERT_EQ(n, -1, "read") ||
> +		    !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
> +			break;
> +	}
> +	ASSERT_EQ(i, num_listens, "server socket");
> +
> +cleanup:
> +	free_fds(listen_fds, num_listens);
> +}
> +
> +void test_sock_destroy(void)
> +{
> +	int cgroup_fd = 0;
> +	struct sock_destroy_prog *skel;
> +
> +	skel = sock_destroy_prog__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		return;
> +
> +	cgroup_fd = test__join_cgroup("/sock_destroy");
> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> +		goto close_cgroup_fd;
> +
> +	skel->links.sock_connect = bpf_program__attach_cgroup(
> +		skel->progs.sock_connect, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
> +		goto close_cgroup_fd;
> +
> +	if (test__start_subtest("tcp_client"))
> +		test_tcp_client(skel);
> +	if (test__start_subtest("tcp_server"))
> +		test_tcp_server(skel);
> +	if (test__start_subtest("udp_client"))
> +		test_udp_client(skel);
> +	if (test__start_subtest("udp_server"))
> +		test_udp_server(skel);
> +
> +
> +close_cgroup_fd:
> +	close(cgroup_fd);
> +	sock_destroy_prog__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c  
> b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> new file mode 100644
> index 000000000000..8e09d82c50f3
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> @@ -0,0 +1,151 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +
> +#include "bpf_tracing_net.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +#define AF_INET6 10

[..]

> +/* Keep it in sync with prog_test/sock_destroy. */
> +#define SERVER_PORT 6062

The test looks good, one optional unrelated nit maybe:

I've been guilty of these hard-coded ports in the past, but maybe
we should stop hard-coding them? Getting the address of the listener (bound
to port 0) and passing it to the bpf program via global variable should be
super easy now (with the skeletons and network_helpers).
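
Something along these lines (untested sketch; assumes a writable
`__u16 server_port` global in the BPF prog) should work:

struct sockaddr_in6 addr;
socklen_t len = sizeof(addr);
int serv;

serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
if (!ASSERT_GE(serv, 0, "start_server"))
        return;
if (!ASSERT_OK(getsockname(serv, (struct sockaddr *)&addr, &len),
               "getsockname"))
        return;
/* share the kernel-assigned port with the BPF program */
skel->bss->server_port = ntohs(addr.sin6_port);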

And, unrelated, maybe also fix a bunch of places where the reverse christmas
tree doesn't look reverse anymore?

> +
> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, __u64);
> +} tcp_conn_sockets SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 1);
> +	__type(key, __u32);
> +	__type(value, __u64);
> +} udp_conn_sockets SEC(".maps");
> +
> +SEC("cgroup/connect6")
> +int sock_connect(struct bpf_sock_addr *ctx)
> +{
> +	int key = 0;
> +	__u64 sock_cookie = 0;
> +	__u32 keyc = 0;
> +
> +	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
> +		return 1;
> +
> +	sock_cookie = bpf_get_socket_cookie(ctx);
> +	if (ctx->protocol == IPPROTO_TCP)
> +		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
> +	else if (ctx->protocol == IPPROTO_UDP)
> +		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
> +	else
> +		return 1;
> +
> +	return 1;
> +}
> +
> +SEC("iter/tcp")
> +int iter_tcp6_client(struct bpf_iter__tcp *ctx)
> +{
> +	struct sock_common *sk_common = ctx->sk_common;
> +	struct seq_file *seq = ctx->meta->seq;
> +	__u64 sock_cookie = 0;
> +	__u64 *val;
> +	int key = 0;
> +
> +	if (!sk_common)
> +		return 0;
> +
> +	if (sk_common->skc_family != AF_INET6)
> +		return 0;
> +
> +	sock_cookie  = bpf_get_socket_cookie(sk_common);
> +	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
> +	if (!val)
> +		return 0;
> +	/* Destroy connected client sockets. */
> +	if (sock_cookie == *val)
> +		bpf_sock_destroy(sk_common);
> +
> +	return 0;
> +}
> +
> +SEC("iter/tcp")
> +int iter_tcp6_server(struct bpf_iter__tcp *ctx)
> +{
> +	struct sock_common *sk_common = ctx->sk_common;
> +	struct seq_file *seq = ctx->meta->seq;
> +	struct tcp6_sock *tcp_sk;
> +	const struct inet_connection_sock *icsk;
> +	const struct inet_sock *inet;
> +	__u16 srcp;
> +
> +	if (!sk_common)
> +		return 0;
> +
> +	if (sk_common->skc_family != AF_INET6)
> +		return 0;
> +
> +	tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
> +	if (!tcp_sk)
> +		return 0;
> +
> +	icsk = &tcp_sk->tcp.inet_conn;
> +	inet = &icsk->icsk_inet;
> +	srcp = bpf_ntohs(inet->inet_sport);
> +
> +	/* Destroy server sockets. */
> +	if (srcp == SERVER_PORT)
> +		bpf_sock_destroy(sk_common);
> +
> +	return 0;
> +}
> +
> +
> +SEC("iter/udp")
> +int iter_udp6_client(struct bpf_iter__udp *ctx)
> +{
> +	struct seq_file *seq = ctx->meta->seq;
> +	struct udp_sock *udp_sk = ctx->udp_sk;
> +	struct sock *sk = (struct sock *) udp_sk;
> +	__u64 sock_cookie = 0, *val;
> +	int key = 0;
> +
> +	if (!sk)
> +		return 0;
> +
> +	sock_cookie  = bpf_get_socket_cookie(sk);
> +	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
> +	if (!val)
> +		return 0;
> +	/* Destroy connected client sockets. */
> +	if (sock_cookie == *val)
> +		bpf_sock_destroy((struct sock_common *)sk);
> +
> +	return 0;
> +}
> +
> +SEC("iter/udp")
> +int iter_udp6_server(struct bpf_iter__udp *ctx)
> +{
> +	struct seq_file *seq = ctx->meta->seq;
> +	struct udp_sock *udp_sk = ctx->udp_sk;
> +	struct sock *sk = (struct sock *) udp_sk;
> +	__u16 srcp;
> +	struct inet_sock *inet;
> +
> +	if (!sk)
> +		return 0;
> +
> +	inet = &udp_sk->inet;
> +	srcp = bpf_ntohs(inet->inet_sport);
> +	if (srcp == SERVER_PORT)
> +		bpf_sock_destroy((struct sock_common *)sk);
> +
> +	return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator
  2023-03-23 20:06 ` [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator Aditi Ghag
@ 2023-03-24 21:56   ` Stanislav Fomichev
  2023-03-27 15:52     ` Aditi Ghag
  2023-03-27 22:28   ` Martin KaFai Lau
  1 sibling, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-03-24 21:56 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet, Martin KaFai Lau

On 03/23, Aditi Ghag wrote:
> Batch UDP sockets from BPF iterator that allows for overlapping locking
> semantics in BPF/kernel helpers executed in BPF programs.  This facilitates
> BPF socket destroy kfunc (introduced by follow-up patches) to execute from
> BPF iterator programs.

> Previously, BPF iterators acquired the sock lock and sockets hash table
> bucket lock while executing BPF programs. This prevented BPF helpers that
> again acquire these locks from being executed from BPF iterators.  With the
> batching approach, we acquire a bucket lock, batch all the bucket sockets,
> and then release the bucket lock. This enables BPF or kernel helpers to
> skip sock locking when invoked in the supported BPF contexts.

> The batching logic is similar to the logic implemented in TCP iterator:
> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.

> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   include/net/udp.h |   1 +
>   net/ipv4/udp.c    | 255 ++++++++++++++++++++++++++++++++++++++++++++--
>   2 files changed, 247 insertions(+), 9 deletions(-)

> diff --git a/include/net/udp.h b/include/net/udp.h
> index de4b528522bb..d2999447d3f2 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
>   struct udp_iter_state {
>   	struct seq_net_private  p;
>   	int			bucket;
> +	int			offset;
>   	struct udp_seq_afinfo	*bpf_seq_afinfo;
>   };

> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index c605d171eb2d..58c620243e47 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3152,6 +3152,171 @@ struct bpf_iter__udp {
>   	int bucket __aligned(8);
>   };

> +struct bpf_udp_iter_state {
> +	struct udp_iter_state state;
> +	unsigned int cur_sk;
> +	unsigned int end_sk;
> +	unsigned int max_sk;
> +	struct sock **batch;
> +	bool st_bucket_done;
> +};
> +
> +static unsigned short seq_file_family(const struct seq_file *seq);
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz);
> +
> +static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
> +{
> +	unsigned short family = seq_file_family(seq);
> +
> +	/* AF_UNSPEC is used as a match all */
> +	return ((family == AF_UNSPEC || family == sk->sk_family) &&
> +		net_eq(sock_net(sk), seq_file_net(seq)));
> +}
> +
> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct net *net = seq_file_net(seq);
> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
> +	struct udp_table *udptable;
> +	struct sock *first_sk = NULL;
> +	struct sock *sk;
> +	unsigned int bucket_sks = 0;
> +	bool resized = false;
> +	int offset = 0;
> +	int new_offset;
> +
> +	/* The current batch is done, so advance the bucket. */
> +	if (iter->st_bucket_done) {
> +		state->bucket++;
> +		state->offset = 0;
> +	}
> +
> +	udptable = udp_get_table_afinfo(afinfo, net);
> +
> +	if (state->bucket > udptable->mask) {
> +		state->bucket = 0;
> +		state->offset = 0;
> +		return NULL;
> +	}
> +
> +again:
> +	/* New batch for the next bucket.
> +	 * Iterate over the hash table to find a bucket with sockets matching
> +	 * the iterator attributes, and return the first matching socket from
> +	 * the bucket. The remaining matched sockets from the bucket are batched
> +	 * before releasing the bucket lock. This allows BPF programs that are
> +	 * called in seq_show to acquire the bucket lock if needed.
> +	 */
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	first_sk = NULL;
> +	bucket_sks = 0;
> +	offset = state->offset;
> +	new_offset = offset;
> +
> +	for (; state->bucket <= udptable->mask; state->bucket++) {
> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];
> +
> +		if (hlist_empty(&hslot->head)) {
> +			offset = 0;
> +			continue;
> +		}
> +
> +		spin_lock_bh(&hslot->lock);
> +		/* Resume from the last saved position in a bucket before
> +		 * iterator was stopped.
> +		 */
> +		while (offset-- > 0) {
> +			sk_for_each(sk, &hslot->head)
> +				continue;
> +		}
> +		sk_for_each(sk, &hslot->head) {
> +			if (seq_sk_match(seq, sk)) {
> +				if (!first_sk)
> +					first_sk = sk;
> +				if (iter->end_sk < iter->max_sk) {
> +					sock_hold(sk);
> +					iter->batch[iter->end_sk++] = sk;
> +				}
> +				bucket_sks++;
> +			}
> +			new_offset++;
> +		}
> +		spin_unlock_bh(&hslot->lock);
> +
> +		if (first_sk)
> +			break;
> +
> +		/* Reset the current bucket's offset before moving to the next bucket. */
> +		offset = 0;
> +		new_offset = 0;
> +	}
> +
> +	/* All done: no batch made. */
> +	if (!first_sk)
> +		goto ret;
> +
> +	if (iter->end_sk == bucket_sks) {
> +		/* Batching is done for the current bucket; return the first
> +		 * socket to be iterated from the batch.
> +		 */
> +		iter->st_bucket_done = true;
> +		goto ret;
> +	}
> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
> +		resized = true;
> +		/* Go back to the previous bucket to resize its batch. */
> +		state->bucket--;
> +		goto again;
> +	}
> +ret:
> +	state->offset = new_offset;
> +	return first_sk;
> +}
> +
> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct sock *sk;
> +
> +	/* Whenever seq_next() is called, the iter->cur_sk is
> +	 * done with seq_show(), so unref the iter->cur_sk.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sock_put(iter->batch[iter->cur_sk++]);
> +		++state->offset;
> +	}
> +
> +	/* After updating iter->cur_sk, check if there are more sockets
> +	 * available in the current bucket batch.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sk = iter->batch[iter->cur_sk];
> +	} else {
> +		// Prepare a new batch.
> +		sk = bpf_iter_udp_batch(seq);
> +	}
> +
> +	++*pos;
> +	return sk;
> +}
> +
> +static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	/* bpf iter does not support lseek, so it always
> +	 * continue from where it was stop()-ped.
> +	 */
> +	if (*pos)
> +		return bpf_iter_udp_batch(seq);
> +
> +	return SEQ_START_TOKEN;
> +}
> +
>   static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
>   			     struct udp_sock *udp_sk, uid_t uid, int bucket)
>   {
> @@ -3172,18 +3337,38 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
>   	struct bpf_prog *prog;
>   	struct sock *sk = v;
>   	uid_t uid;
> +	bool slow;
> +	int rc;

>   	if (v == SEQ_START_TOKEN)
>   		return 0;


[..]

> +	slow = lock_sock_fast(sk);
> +
> +	if (unlikely(sk_unhashed(sk))) {
> +		rc = SEQ_SKIP;
> +		goto unlock;
> +	}
> +

Should we use non-fast version here for consistency with tcp?


>   	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
>   	meta.seq = seq;
>   	prog = bpf_iter_get_info(&meta, false);
> -	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +
> +unlock:
> +	unlock_sock_fast(sk, slow);
> +	return rc;
> +}
> +
> +static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
> +{
> +	while (iter->cur_sk < iter->end_sk)
> +		sock_put(iter->batch[iter->cur_sk++]);
>   }

>   static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   {
> +	struct bpf_udp_iter_state *iter = seq->private;
>   	struct bpf_iter_meta meta;
>   	struct bpf_prog *prog;

> @@ -3194,15 +3379,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
>   	}

> -	udp_seq_stop(seq, v);
> +	if (iter->cur_sk < iter->end_sk) {
> +		bpf_iter_udp_unref_batch(iter);
> +		iter->st_bucket_done = false;
> +	}
>   }

>   static const struct seq_operations bpf_iter_udp_seq_ops = {
> -	.start		= udp_seq_start,
> -	.next		= udp_seq_next,
> +	.start		= bpf_iter_udp_seq_start,
> +	.next		= bpf_iter_udp_seq_next,
>   	.stop		= bpf_iter_udp_seq_stop,
>   	.show		= bpf_iter_udp_seq_show,
>   };
> +
> +static unsigned short seq_file_family(const struct seq_file *seq)
> +{
> +	const struct udp_seq_afinfo *afinfo;
> +
> +	/* BPF iterator: bpf programs to filter sockets. */
> +	if (seq->op == &bpf_iter_udp_seq_ops)
> +		return AF_UNSPEC;
> +
> +	/* Proc fs iterator */
> +	afinfo = pde_data(file_inode(seq->file));
> +	return afinfo->family;
> +}
>   #endif

>   const struct seq_operations udp_seq_ops = {
> @@ -3413,9 +3614,30 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
>   DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
>   		     struct udp_sock *udp_sk, uid_t uid, int bucket)

> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz)
> +{
> +	struct sock **new_batch;
> +
> +	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
> +				   GFP_USER | __GFP_NOWARN);
> +	if (!new_batch)
> +		return -ENOMEM;
> +
> +	bpf_iter_udp_unref_batch(iter);
> +	kvfree(iter->batch);
> +	iter->batch = new_batch;
> +	iter->max_sk = new_batch_sz;
> +
> +	return 0;
> +}
> +
> +#define INIT_BATCH_SZ 16
> +
>   static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   {
> -	struct udp_iter_state *st = priv_data;
> +	struct bpf_udp_iter_state *iter = priv_data;
> +	struct udp_iter_state *st = &iter->state;
>   	struct udp_seq_afinfo *afinfo;
>   	int ret;

> @@ -3427,24 +3649,39 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   	afinfo->udp_table = NULL;
>   	st->bpf_seq_afinfo = afinfo;
>   	ret = bpf_iter_init_seq_net(priv_data, aux);
> -	if (ret)
> +	if (ret) {
>   		kfree(afinfo);
> +		return ret;
> +	}
> +	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
> +	if (ret) {
> +		bpf_iter_fini_seq_net(priv_data);
> +		return ret;
> +	}
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	st->bucket = 0;
> +	st->offset = 0;
> +
>   	return ret;
>   }

>   static void bpf_iter_fini_udp(void *priv_data)
>   {
> -	struct udp_iter_state *st = priv_data;
> +	struct bpf_udp_iter_state *iter = priv_data;
> +	struct udp_iter_state *st = &iter->state;

> -	kfree(st->bpf_seq_afinfo);
>   	bpf_iter_fini_seq_net(priv_data);
> +	kfree(st->bpf_seq_afinfo);
> +	kvfree(iter->batch);
>   }

>   static const struct bpf_iter_seq_info udp_seq_info = {
>   	.seq_ops		= &bpf_iter_udp_seq_ops,
>   	.init_seq_private	= bpf_iter_init_udp,
>   	.fini_seq_private	= bpf_iter_fini_udp,
> -	.seq_priv_size		= sizeof(struct udp_iter_state),
> +	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
>   };

>   static struct bpf_iter_reg udp_reg_info = {
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator
  2023-03-24 21:56   ` Stanislav Fomichev
@ 2023-03-27 15:52     ` Aditi Ghag
  2023-03-27 16:52       ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Aditi Ghag @ 2023-03-27 15:52 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet, Martin KaFai Lau



> On Mar 24, 2023, at 2:56 PM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On 03/23, Aditi Ghag wrote:
>> Batch UDP sockets from BPF iterator that allows for overlapping locking
>> semantics in BPF/kernel helpers executed in BPF programs.  This facilitates
>> BPF socket destroy kfunc (introduced by follow-up patches) to execute from
>> BPF iterator programs.
> 
>> Previously, BPF iterators acquired the sock lock and sockets hash table
>> bucket lock while executing BPF programs. This prevented BPF helpers that
>> again acquire these locks from being executed from BPF iterators.  With the
>> batching approach, we acquire a bucket lock, batch all the bucket sockets,
>> and then release the bucket lock. This enables BPF or kernel helpers to
>> skip sock locking when invoked in the supported BPF contexts.
> 
>> The batching logic is similar to the logic implemented in TCP iterator:
>> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
> 
>> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  include/net/udp.h |   1 +
>>  net/ipv4/udp.c    | 255 ++++++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 247 insertions(+), 9 deletions(-)
> 
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index de4b528522bb..d2999447d3f2 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
>>  struct udp_iter_state {
>>  	struct seq_net_private  p;
>>  	int			bucket;
>> +	int			offset;
>>  	struct udp_seq_afinfo	*bpf_seq_afinfo;
>>  };
> 
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index c605d171eb2d..58c620243e47 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -3152,6 +3152,171 @@ struct bpf_iter__udp {
>>  	int bucket __aligned(8);
>>  };
> 
>> +struct bpf_udp_iter_state {
>> +	struct udp_iter_state state;
>> +	unsigned int cur_sk;
>> +	unsigned int end_sk;
>> +	unsigned int max_sk;
>> +	struct sock **batch;
>> +	bool st_bucket_done;
>> +};
>> +
>> +static unsigned short seq_file_family(const struct seq_file *seq);
>> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
>> +				      unsigned int new_batch_sz);
>> +
>> +static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
>> +{
>> +	unsigned short family = seq_file_family(seq);
>> +
>> +	/* AF_UNSPEC is used as a match all */
>> +	return ((family == AF_UNSPEC || family == sk->sk_family) &&
>> +		net_eq(sock_net(sk), seq_file_net(seq)));
>> +}
>> +
>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>> +{
>> +	struct bpf_udp_iter_state *iter = seq->private;
>> +	struct udp_iter_state *state = &iter->state;
>> +	struct net *net = seq_file_net(seq);
>> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
>> +	struct udp_table *udptable;
>> +	struct sock *first_sk = NULL;
>> +	struct sock *sk;
>> +	unsigned int bucket_sks = 0;
>> +	bool resized = false;
>> +	int offset = 0;
>> +	int new_offset;
>> +
>> +	/* The current batch is done, so advance the bucket. */
>> +	if (iter->st_bucket_done) {
>> +		state->bucket++;
>> +		state->offset = 0;
>> +	}
>> +
>> +	udptable = udp_get_table_afinfo(afinfo, net);
>> +
>> +	if (state->bucket > udptable->mask) {
>> +		state->bucket = 0;
>> +		state->offset = 0;
>> +		return NULL;
>> +	}
>> +
>> +again:
>> +	/* New batch for the next bucket.
>> +	 * Iterate over the hash table to find a bucket with sockets matching
>> +	 * the iterator attributes, and return the first matching socket from
>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>> +	 * before releasing the bucket lock. This allows BPF programs that are
>> +	 * called in seq_show to acquire the bucket lock if needed.
>> +	 */
>> +	iter->cur_sk = 0;
>> +	iter->end_sk = 0;
>> +	iter->st_bucket_done = false;
>> +	first_sk = NULL;
>> +	bucket_sks = 0;
>> +	offset = state->offset;
>> +	new_offset = offset;
>> +
>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];
>> +
>> +		if (hlist_empty(&hslot->head)) {
>> +			offset = 0;
>> +			continue;
>> +		}
>> +
>> +		spin_lock_bh(&hslot->lock);
>> +		/* Resume from the last saved position in a bucket before
>> +		 * iterator was stopped.
>> +		 */
>> +		while (offset-- > 0) {
>> +			sk_for_each(sk, &hslot->head)
>> +				continue;
>> +		}
>> +		sk_for_each(sk, &hslot->head) {
>> +			if (seq_sk_match(seq, sk)) {
>> +				if (!first_sk)
>> +					first_sk = sk;
>> +				if (iter->end_sk < iter->max_sk) {
>> +					sock_hold(sk);
>> +					iter->batch[iter->end_sk++] = sk;
>> +				}
>> +				bucket_sks++;
>> +			}
>> +			new_offset++;
>> +		}
>> +		spin_unlock_bh(&hslot->lock);
>> +
>> +		if (first_sk)
>> +			break;
>> +
>> +		/* Reset the current bucket's offset before moving to the next bucket. */
>> +		offset = 0;
>> +		new_offset = 0;
>> +	}
>> +
>> +	/* All done: no batch made. */
>> +	if (!first_sk)
>> +		goto ret;
>> +
>> +	if (iter->end_sk == bucket_sks) {
>> +		/* Batching is done for the current bucket; return the first
>> +		 * socket to be iterated from the batch.
>> +		 */
>> +		iter->st_bucket_done = true;
>> +		goto ret;
>> +	}
>> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
>> +		resized = true;
>> +		/* Go back to the previous bucket to resize its batch. */
>> +		state->bucket--;
>> +		goto again;
>> +	}
>> +ret:
>> +	state->offset = new_offset;
>> +	return first_sk;
>> +}
>> +
>> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>> +{
>> +	struct bpf_udp_iter_state *iter = seq->private;
>> +	struct udp_iter_state *state = &iter->state;
>> +	struct sock *sk;
>> +
>> +	/* Whenever seq_next() is called, the iter->cur_sk is
>> +	 * done with seq_show(), so unref the iter->cur_sk.
>> +	 */
>> +	if (iter->cur_sk < iter->end_sk) {
>> +		sock_put(iter->batch[iter->cur_sk++]);
>> +		++state->offset;
>> +	}
>> +
>> +	/* After updating iter->cur_sk, check if there are more sockets
>> +	 * available in the current bucket batch.
>> +	 */
>> +	if (iter->cur_sk < iter->end_sk) {
>> +		sk = iter->batch[iter->cur_sk];
>> +	} else {
>> +		// Prepare a new batch.
>> +		sk = bpf_iter_udp_batch(seq);
>> +	}
>> +
>> +	++*pos;
>> +	return sk;
>> +}
>> +
>> +static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
>> +{
>> +	/* bpf iter does not support lseek, so it always
>> +	 * continue from where it was stop()-ped.
>> +	 */
>> +	if (*pos)
>> +		return bpf_iter_udp_batch(seq);
>> +
>> +	return SEQ_START_TOKEN;
>> +}
>> +
>>  static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
>>  			     struct udp_sock *udp_sk, uid_t uid, int bucket)
>>  {
>> @@ -3172,18 +3337,38 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
>>  	struct bpf_prog *prog;
>>  	struct sock *sk = v;
>>  	uid_t uid;
>> +	bool slow;
>> +	int rc;
> 
>>  	if (v == SEQ_START_TOKEN)
>>  		return 0;
> 
> 
> [..]
> 
>> +	slow = lock_sock_fast(sk);
>> +
>> +	if (unlikely(sk_unhashed(sk))) {
>> +		rc = SEQ_SKIP;
>> +		goto unlock;
>> +	}
>> +
> 
> Should we use non-fast version here for consistency with tcp?

We could, but I don't see a problem with acquiring the fast version for UDP, so we could just stick with it. The TCP change warrants a code comment though; I'll add it in the next revision. 

> 
> 
>>  	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
>>  	meta.seq = seq;
>>  	prog = bpf_iter_get_info(&meta, false);
>> -	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
>> +	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
>> +
>> +unlock:
>> +	unlock_sock_fast(sk, slow);
>> +	return rc;
>> +}
>> +
>> +static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
>> +{
>> +	while (iter->cur_sk < iter->end_sk)
>> +		sock_put(iter->batch[iter->cur_sk++]);
>>  }
> 
>>  static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>>  {
>> +	struct bpf_udp_iter_state *iter = seq->private;
>>  	struct bpf_iter_meta meta;
>>  	struct bpf_prog *prog;
> 
>> @@ -3194,15 +3379,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>>  			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
>>  	}
> 
>> -	udp_seq_stop(seq, v);
>> +	if (iter->cur_sk < iter->end_sk) {
>> +		bpf_iter_udp_unref_batch(iter);
>> +		iter->st_bucket_done = false;
>> +	}
>>  }
> 
>>  static const struct seq_operations bpf_iter_udp_seq_ops = {
>> -	.start		= udp_seq_start,
>> -	.next		= udp_seq_next,
>> +	.start		= bpf_iter_udp_seq_start,
>> +	.next		= bpf_iter_udp_seq_next,
>>  	.stop		= bpf_iter_udp_seq_stop,
>>  	.show		= bpf_iter_udp_seq_show,
>>  };
>> +
>> +static unsigned short seq_file_family(const struct seq_file *seq)
>> +{
>> +	const struct udp_seq_afinfo *afinfo;
>> +
>> +	/* BPF iterator: bpf programs to filter sockets. */
>> +	if (seq->op == &bpf_iter_udp_seq_ops)
>> +		return AF_UNSPEC;
>> +
>> +	/* Proc fs iterator */
>> +	afinfo = pde_data(file_inode(seq->file));
>> +	return afinfo->family;
>> +}
>>  #endif
> 
>>  const struct seq_operations udp_seq_ops = {
>> @@ -3413,9 +3614,30 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
>>  DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
>>  		     struct udp_sock *udp_sk, uid_t uid, int bucket)
> 
>> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
>> +				      unsigned int new_batch_sz)
>> +{
>> +	struct sock **new_batch;
>> +
>> +	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
>> +				   GFP_USER | __GFP_NOWARN);
>> +	if (!new_batch)
>> +		return -ENOMEM;
>> +
>> +	bpf_iter_udp_unref_batch(iter);
>> +	kvfree(iter->batch);
>> +	iter->batch = new_batch;
>> +	iter->max_sk = new_batch_sz;
>> +
>> +	return 0;
>> +}
>> +
>> +#define INIT_BATCH_SZ 16
>> +
>>  static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>>  {
>> -	struct udp_iter_state *st = priv_data;
>> +	struct bpf_udp_iter_state *iter = priv_data;
>> +	struct udp_iter_state *st = &iter->state;
>>  	struct udp_seq_afinfo *afinfo;
>>  	int ret;
> 
>> @@ -3427,24 +3649,39 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>>  	afinfo->udp_table = NULL;
>>  	st->bpf_seq_afinfo = afinfo;
>>  	ret = bpf_iter_init_seq_net(priv_data, aux);
>> -	if (ret)
>> +	if (ret) {
>>  		kfree(afinfo);
>> +		return ret;
>> +	}
>> +	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
>> +	if (ret) {
>> +		bpf_iter_fini_seq_net(priv_data);
>> +		return ret;
>> +	}
>> +	iter->cur_sk = 0;
>> +	iter->end_sk = 0;
>> +	iter->st_bucket_done = false;
>> +	st->bucket = 0;
>> +	st->offset = 0;
>> +
>>  	return ret;
>>  }
> 
>>  static void bpf_iter_fini_udp(void *priv_data)
>>  {
>> -	struct udp_iter_state *st = priv_data;
>> +	struct bpf_udp_iter_state *iter = priv_data;
>> +	struct udp_iter_state *st = &iter->state;
> 
>> -	kfree(st->bpf_seq_afinfo);
>>  	bpf_iter_fini_seq_net(priv_data);
>> +	kfree(st->bpf_seq_afinfo);
>> +	kvfree(iter->batch);
>>  }
> 
>>  static const struct bpf_iter_seq_info udp_seq_info = {
>>  	.seq_ops		= &bpf_iter_udp_seq_ops,
>>  	.init_seq_private	= bpf_iter_init_udp,
>>  	.fini_seq_private	= bpf_iter_fini_udp,
>> -	.seq_priv_size		= sizeof(struct udp_iter_state),
>> +	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
>>  };
> 
>>  static struct bpf_iter_reg udp_reg_info = {
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-24 21:52   ` Stanislav Fomichev
@ 2023-03-27 15:57     ` Aditi Ghag
  2023-03-27 16:54       ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Aditi Ghag @ 2023-03-27 15:57 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Mar 24, 2023, at 2:52 PM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On 03/23, Aditi Ghag wrote:
>> The test cases for destroying sockets mirror the intended usages of the
>> bpf_sock_destroy kfunc using iterators.
> 
>> The destroy helpers set `ECONNABORTED` error code that we can validate in
>> the test code with client sockets. But UDP sockets have an overriding error
>> code from the disconnect called during abort, so the error code
>> validation is only done for TCP sockets.
> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  .../selftests/bpf/prog_tests/sock_destroy.c   | 195 ++++++++++++++++++
>>  .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
>>  2 files changed, 346 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> 
>> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>> new file mode 100644
>> index 000000000000..cbce966af568
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>> @@ -0,0 +1,195 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <test_progs.h>
>> +
>> +#include "sock_destroy_prog.skel.h"
>> +#include "network_helpers.h"
>> +
>> +#define SERVER_PORT 6062
>> +
>> +static void start_iter_sockets(struct bpf_program *prog)
>> +{
>> +	struct bpf_link *link;
>> +	char buf[50] = {};
>> +	int iter_fd, len;
>> +
>> +	link = bpf_program__attach_iter(prog, NULL);
>> +	if (!ASSERT_OK_PTR(link, "attach_iter"))
>> +		return;
>> +
>> +	iter_fd = bpf_iter_create(bpf_link__fd(link));
>> +	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
>> +		goto free_link;
>> +
>> +	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
>> +		;
>> +	ASSERT_GE(len, 0, "read");
>> +
>> +	close(iter_fd);
>> +
>> +free_link:
>> +	bpf_link__destroy(link);
>> +}
>> +
>> +static void test_tcp_client(struct sock_destroy_prog *skel)
>> +{
>> +	int serv = -1, clien = -1, n = 0;
>> +
>> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
>> +	if (!ASSERT_GE(serv, 0, "start_server"))
>> +		goto cleanup_serv;
>> +
>> +	clien = connect_to_fd(serv, 0);
>> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>> +		goto cleanup_serv;
>> +
>> +	serv = accept(serv, NULL, NULL);
>> +	if (!ASSERT_GE(serv, 0, "serv accept"))
>> +		goto cleanup;
>> +
>> +	n = send(clien, "t", 1, 0);
>> +	if (!ASSERT_GE(n, 0, "client send"))
>> +		goto cleanup;
>> +
>> +	/* Run iterator program that destroys connected client sockets. */
>> +	start_iter_sockets(skel->progs.iter_tcp6_client);
>> +
>> +	n = send(clien, "t", 1, 0);
>> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>> +		goto cleanup;
>> +	ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
>> +
>> +
>> +cleanup:
>> +	close(clien);
>> +cleanup_serv:
>> +	close(serv);
>> +}
>> +
>> +static void test_tcp_server(struct sock_destroy_prog *skel)
>> +{
>> +	int serv = -1, clien = -1, n = 0;
>> +
>> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);
>> +	if (!ASSERT_GE(serv, 0, "start_server"))
>> +		goto cleanup_serv;
>> +
>> +	clien = connect_to_fd(serv, 0);
>> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>> +		goto cleanup_serv;
>> +
>> +	serv = accept(serv, NULL, NULL);
>> +	if (!ASSERT_GE(serv, 0, "serv accept"))
>> +		goto cleanup;
>> +
>> +	n = send(clien, "t", 1, 0);
>> +	if (!ASSERT_GE(n, 0, "client send"))
>> +		goto cleanup;
>> +
>> +	/* Run iterator program that destroys server sockets. */
>> +	start_iter_sockets(skel->progs.iter_tcp6_server);
>> +
>> +	n = send(clien, "t", 1, 0);
>> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>> +		goto cleanup;
>> +	ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
>> +
>> +
>> +cleanup:
>> +	close(clien);
>> +cleanup_serv:
>> +	close(serv);
>> +}
>> +
>> +
>> +static void test_udp_client(struct sock_destroy_prog *skel)
>> +{
>> +	int serv = -1, clien = -1, n = 0;
>> +
>> +	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
>> +	if (!ASSERT_GE(serv, 0, "start_server"))
>> +		goto cleanup_serv;
>> +
>> +	clien = connect_to_fd(serv, 0);
>> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>> +		goto cleanup_serv;
>> +
>> +	n = send(clien, "t", 1, 0);
>> +	if (!ASSERT_GE(n, 0, "client send"))
>> +		goto cleanup;
>> +
>> +	/* Run iterator program that destroys sockets. */
>> +	start_iter_sockets(skel->progs.iter_udp6_client);
>> +
>> +	n = send(clien, "t", 1, 0);
>> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>> +		goto cleanup;
>> +	/* UDP sockets have an overriding error code after they are disconnected,
>> +	 * so we don't check for ECONNABORTED error code.
>> +	 */
>> +
>> +cleanup:
>> +	close(clien);
>> +cleanup_serv:
>> +	close(serv);
>> +}
>> +
>> +static void test_udp_server(struct sock_destroy_prog *skel)
>> +{
>> +	int *listen_fds = NULL, n, i;
>> +	unsigned int num_listens = 5;
>> +	char buf[1];
>> +
>> +	/* Start reuseport servers. */
>> +	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
>> +					    "::1", SERVER_PORT, 0,
>> +					    num_listens);
>> +	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
>> +		goto cleanup;
>> +
>> +	/* Run iterator program that destroys server sockets. */
>> +	start_iter_sockets(skel->progs.iter_udp6_server);
>> +
>> +	for (i = 0; i < num_listens; ++i) {
>> +		n = read(listen_fds[i], buf, sizeof(buf));
>> +		if (!ASSERT_EQ(n, -1, "read") ||
>> +		    !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
>> +			break;
>> +	}
>> +	ASSERT_EQ(i, num_listens, "server socket");
>> +
>> +cleanup:
>> +	free_fds(listen_fds, num_listens);
>> +}
>> +
>> +void test_sock_destroy(void)
>> +{
>> +	int cgroup_fd = 0;
>> +	struct sock_destroy_prog *skel;
>> +
>> +	skel = sock_destroy_prog__open_and_load();
>> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
>> +		return;
>> +
>> +	cgroup_fd = test__join_cgroup("/sock_destroy");
>> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
>> +		goto close_cgroup_fd;
>> +
>> +	skel->links.sock_connect = bpf_program__attach_cgroup(
>> +		skel->progs.sock_connect, cgroup_fd);
>> +	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
>> +		goto close_cgroup_fd;
>> +
>> +	if (test__start_subtest("tcp_client"))
>> +		test_tcp_client(skel);
>> +	if (test__start_subtest("tcp_server"))
>> +		test_tcp_server(skel);
>> +	if (test__start_subtest("udp_client"))
>> +		test_udp_client(skel);
>> +	if (test__start_subtest("udp_server"))
>> +		test_udp_server(skel);
>> +
>> +
>> +close_cgroup_fd:
>> +	close(cgroup_fd);
>> +	sock_destroy_prog__destroy(skel);
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>> new file mode 100644
>> index 000000000000..8e09d82c50f3
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>> @@ -0,0 +1,151 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include "vmlinux.h"
>> +
>> +#include "bpf_tracing_net.h"
>> +#include <bpf/bpf_helpers.h>
>> +#include <bpf/bpf_endian.h>
>> +
>> +#define AF_INET6 10
> 
> [..]
> 
>> +/* Keep it in sync with prog_test/sock_destroy. */
>> +#define SERVER_PORT 6062
> 
> The test looks good, one optional unrelated nit maybe:
> 
> I've been guilty of these hard-coded ports in the past, but maybe
> we should stop hard-coding them? Getting the address of the listener (bound to
> port 0) and passing it to the bpf program via global variable should be super
> easy now (with the skeletons and network_helpers).


I briefly considered adding the ports in a map, and retrieving them in the test. But it didn't seem worthwhile as the tests should fail clearly when there is a mismatch. 

> 
> And, unrelated, maybe also fix a bunch of places where the reverse christmas
> tree doesn't look reverse anymore?

Ok. The checks should be part of tooling (e.g., checkpatch) though if they are meant to be enforced consistently, no?

> 
>> +
>> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_ARRAY);
>> +	__uint(max_entries, 1);
>> +	__type(key, __u32);
>> +	__type(value, __u64);
>> +} tcp_conn_sockets SEC(".maps");
>> +
>> +struct {
>> +	__uint(type, BPF_MAP_TYPE_ARRAY);
>> +	__uint(max_entries, 1);
>> +	__type(key, __u32);
>> +	__type(value, __u64);
>> +} udp_conn_sockets SEC(".maps");
>> +
>> +SEC("cgroup/connect6")
>> +int sock_connect(struct bpf_sock_addr *ctx)
>> +{
>> +	int key = 0;
>> +	__u64 sock_cookie = 0;
>> +	__u32 keyc = 0;
>> +
>> +	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
>> +		return 1;
>> +
>> +	sock_cookie = bpf_get_socket_cookie(ctx);
>> +	if (ctx->protocol == IPPROTO_TCP)
>> +		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
>> +	else if (ctx->protocol == IPPROTO_UDP)
>> +		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
>> +	else
>> +		return 1;
>> +
>> +	return 1;
>> +}
>> +
>> +SEC("iter/tcp")
>> +int iter_tcp6_client(struct bpf_iter__tcp *ctx)
>> +{
>> +	struct sock_common *sk_common = ctx->sk_common;
>> +	struct seq_file *seq = ctx->meta->seq;
>> +	__u64 sock_cookie = 0;
>> +	__u64 *val;
>> +	int key = 0;
>> +
>> +	if (!sk_common)
>> +		return 0;
>> +
>> +	if (sk_common->skc_family != AF_INET6)
>> +		return 0;
>> +
>> +	sock_cookie  = bpf_get_socket_cookie(sk_common);
>> +	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
>> +	if (!val)
>> +		return 0;
>> +	/* Destroy connected client sockets. */
>> +	if (sock_cookie == *val)
>> +		bpf_sock_destroy(sk_common);
>> +
>> +	return 0;
>> +}
>> +
>> +SEC("iter/tcp")
>> +int iter_tcp6_server(struct bpf_iter__tcp *ctx)
>> +{
>> +	struct sock_common *sk_common = ctx->sk_common;
>> +	struct seq_file *seq = ctx->meta->seq;
>> +	struct tcp6_sock *tcp_sk;
>> +	const struct inet_connection_sock *icsk;
>> +	const struct inet_sock *inet;
>> +	__u16 srcp;
>> +
>> +	if (!sk_common)
>> +		return 0;
>> +
>> +	if (sk_common->skc_family != AF_INET6)
>> +		return 0;
>> +
>> +	tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
>> +	if (!tcp_sk)
>> +		return 0;
>> +
>> +	icsk = &tcp_sk->tcp.inet_conn;
>> +	inet = &icsk->icsk_inet;
>> +	srcp = bpf_ntohs(inet->inet_sport);
>> +
>> +	/* Destroy server sockets. */
>> +	if (srcp == SERVER_PORT)
>> +		bpf_sock_destroy(sk_common);
>> +
>> +	return 0;
>> +}
>> +
>> +
>> +SEC("iter/udp")
>> +int iter_udp6_client(struct bpf_iter__udp *ctx)
>> +{
>> +	struct seq_file *seq = ctx->meta->seq;
>> +	struct udp_sock *udp_sk = ctx->udp_sk;
>> +	struct sock *sk = (struct sock *) udp_sk;
>> +	__u64 sock_cookie = 0, *val;
>> +	int key = 0;
>> +
>> +	if (!sk)
>> +		return 0;
>> +
>> +	sock_cookie  = bpf_get_socket_cookie(sk);
>> +	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
>> +	if (!val)
>> +		return 0;
>> +	/* Destroy connected client sockets. */
>> +	if (sock_cookie == *val)
>> +		bpf_sock_destroy((struct sock_common *)sk);
>> +
>> +	return 0;
>> +}
>> +
>> +SEC("iter/udp")
>> +int iter_udp6_server(struct bpf_iter__udp *ctx)
>> +{
>> +	struct seq_file *seq = ctx->meta->seq;
>> +	struct udp_sock *udp_sk = ctx->udp_sk;
>> +	struct sock *sk = (struct sock *) udp_sk;
>> +	__u16 srcp;
>> +	struct inet_sock *inet;
>> +
>> +	if (!sk)
>> +		return 0;
>> +
>> +	inet = &udp_sk->inet;
>> +	srcp = bpf_ntohs(inet->inet_sport);
>> +	if (srcp == SERVER_PORT)
>> +		bpf_sock_destroy((struct sock_common *)sk);
>> +
>> +	return 0;
>> +}
>> +
>> +char _license[] SEC("license") = "GPL";
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator
  2023-03-27 15:52     ` Aditi Ghag
@ 2023-03-27 16:52       ` Stanislav Fomichev
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-03-27 16:52 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet, Martin KaFai Lau

On 03/27, Aditi Ghag wrote:


> > On Mar 24, 2023, at 2:56 PM, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/23, Aditi Ghag wrote:
> >> Batch UDP sockets from BPF iterator that allows for overlapping locking
> >> semantics in BPF/kernel helpers executed in BPF programs.  This facilitates
> >> BPF socket destroy kfunc (introduced by follow-up patches) to execute from
> >> BPF iterator programs.
> >
> >> Previously, BPF iterators acquired the sock lock and sockets hash table
> >> bucket lock while executing BPF programs. This prevented BPF helpers that
> >> again acquire these locks from being executed from BPF iterators.  With the
> >> batching approach, we acquire a bucket lock, batch all the bucket sockets,
> >> and then release the bucket lock. This enables BPF or kernel helpers to
> >> skip sock locking when invoked in the supported BPF contexts.
> >
> >> The batching logic is similar to the logic implemented in TCP iterator:
> >> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
> >
> >> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
> >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> >> ---
> >>  include/net/udp.h |   1 +
> >>  net/ipv4/udp.c    | 255 ++++++++++++++++++++++++++++++++++++++++++++--
> >>  2 files changed, 247 insertions(+), 9 deletions(-)
> >
> >> diff --git a/include/net/udp.h b/include/net/udp.h
> >> index de4b528522bb..d2999447d3f2 100644
> >> --- a/include/net/udp.h
> >> +++ b/include/net/udp.h
> >> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
> >>  struct udp_iter_state {
> >>  	struct seq_net_private  p;
> >>  	int			bucket;
> >> +	int			offset;
> >>  	struct udp_seq_afinfo	*bpf_seq_afinfo;
> >>  };
> >
> >> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> >> index c605d171eb2d..58c620243e47 100644
> >> --- a/net/ipv4/udp.c
> >> +++ b/net/ipv4/udp.c
> >> @@ -3152,6 +3152,171 @@ struct bpf_iter__udp {
> >>  	int bucket __aligned(8);
> >>  };
> >
> >> +struct bpf_udp_iter_state {
> >> +	struct udp_iter_state state;
> >> +	unsigned int cur_sk;
> >> +	unsigned int end_sk;
> >> +	unsigned int max_sk;
> >> +	struct sock **batch;
> >> +	bool st_bucket_done;
> >> +};
> >> +
> >> +static unsigned short seq_file_family(const struct seq_file *seq);
> >> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> >> +				      unsigned int new_batch_sz);
> >> +
> >> +static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
> >> +{
> >> +	unsigned short family = seq_file_family(seq);
> >> +
> >> +	/* AF_UNSPEC is used as a match all */
> >> +	return ((family == AF_UNSPEC || family == sk->sk_family) &&
> >> +		net_eq(sock_net(sk), seq_file_net(seq)));
> >> +}
> >> +
> >> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> >> +{
> >> +	struct bpf_udp_iter_state *iter = seq->private;
> >> +	struct udp_iter_state *state = &iter->state;
> >> +	struct net *net = seq_file_net(seq);
> >> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
> >> +	struct udp_table *udptable;
> >> +	struct sock *first_sk = NULL;
> >> +	struct sock *sk;
> >> +	unsigned int bucket_sks = 0;
> >> +	bool resized = false;
> >> +	int offset = 0;
> >> +	int new_offset;
> >> +
> >> +	/* The current batch is done, so advance the bucket. */
> >> +	if (iter->st_bucket_done) {
> >> +		state->bucket++;
> >> +		state->offset = 0;
> >> +	}
> >> +
> >> +	udptable = udp_get_table_afinfo(afinfo, net);
> >> +
> >> +	if (state->bucket > udptable->mask) {
> >> +		state->bucket = 0;
> >> +		state->offset = 0;
> >> +		return NULL;
> >> +	}
> >> +
> >> +again:
> >> +	/* New batch for the next bucket.
> >> +	 * Iterate over the hash table to find a bucket with sockets matching
> >> +	 * the iterator attributes, and return the first matching socket from
> >> +	 * the bucket. The remaining matched sockets from the bucket are batched
> >> +	 * before releasing the bucket lock. This allows BPF programs that are
> >> +	 * called in seq_show to acquire the bucket lock if needed.
> >> +	 */
> >> +	iter->cur_sk = 0;
> >> +	iter->end_sk = 0;
> >> +	iter->st_bucket_done = false;
> >> +	first_sk = NULL;
> >> +	bucket_sks = 0;
> >> +	offset = state->offset;
> >> +	new_offset = offset;
> >> +
> >> +	for (; state->bucket <= udptable->mask; state->bucket++) {
> >> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];
> >> +
> >> +		if (hlist_empty(&hslot->head)) {
> >> +			offset = 0;
> >> +			continue;
> >> +		}
> >> +
> >> +		spin_lock_bh(&hslot->lock);
> >> +		/* Resume from the last saved position in a bucket before
> >> +		 * iterator was stopped.
> >> +		 */
> >> +		while (offset-- > 0) {
> >> +			sk_for_each(sk, &hslot->head)
> >> +				continue;
> >> +		}
> >> +		sk_for_each(sk, &hslot->head) {
> >> +			if (seq_sk_match(seq, sk)) {
> >> +				if (!first_sk)
> >> +					first_sk = sk;
> >> +				if (iter->end_sk < iter->max_sk) {
> >> +					sock_hold(sk);
> >> +					iter->batch[iter->end_sk++] = sk;
> >> +				}
> >> +				bucket_sks++;
> >> +			}
> >> +			new_offset++;
> >> +		}
> >> +		spin_unlock_bh(&hslot->lock);
> >> +
> >> +		if (first_sk)
> >> +			break;
> >> +
> >> +		/* Reset the current bucket's offset before moving to the next bucket. */
> >> +		offset = 0;
> >> +		new_offset = 0;
> >> +	}
> >> +
> >> +	/* All done: no batch made. */
> >> +	if (!first_sk)
> >> +		goto ret;
> >> +
> >> +	if (iter->end_sk == bucket_sks) {
> >> +		/* Batching is done for the current bucket; return the first
> >> +		 * socket to be iterated from the batch.
> >> +		 */
> >> +		iter->st_bucket_done = true;
> >> +		goto ret;
> >> +	}
> >> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
> >> +		resized = true;
> >> +		/* Go back to the previous bucket to resize its batch. */
> >> +		state->bucket--;
> >> +		goto again;
> >> +	}
> >> +ret:
> >> +	state->offset = new_offset;
> >> +	return first_sk;
> >> +}
> >> +
> >> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> >> +{
> >> +	struct bpf_udp_iter_state *iter = seq->private;
> >> +	struct udp_iter_state *state = &iter->state;
> >> +	struct sock *sk;
> >> +
> >> +	/* Whenever seq_next() is called, the iter->cur_sk is
> >> +	 * done with seq_show(), so unref the iter->cur_sk.
> >> +	 */
> >> +	if (iter->cur_sk < iter->end_sk) {
> >> +		sock_put(iter->batch[iter->cur_sk++]);
> >> +		++state->offset;
> >> +	}
> >> +
> >> +	/* After updating iter->cur_sk, check if there are more sockets
> >> +	 * available in the current bucket batch.
> >> +	 */
> >> +	if (iter->cur_sk < iter->end_sk) {
> >> +		sk = iter->batch[iter->cur_sk];
> >> +	} else {
> >> +		// Prepare a new batch.
> >> +		sk = bpf_iter_udp_batch(seq);
> >> +	}
> >> +
> >> +	++*pos;
> >> +	return sk;
> >> +}
> >> +
> >> +static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
> >> +{
> >> +	/* bpf iter does not support lseek, so it always
> >> +	 * continue from where it was stop()-ped.
> >> +	 */
> >> +	if (*pos)
> >> +		return bpf_iter_udp_batch(seq);
> >> +
> >> +	return SEQ_START_TOKEN;
> >> +}
> >> +
> >>  static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
> >>  			     struct udp_sock *udp_sk, uid_t uid, int bucket)
> >>  {
> >> @@ -3172,18 +3337,38 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
> >>  	struct bpf_prog *prog;
> >>  	struct sock *sk = v;
> >>  	uid_t uid;
> >> +	bool slow;
> >> +	int rc;
> >
> >>  	if (v == SEQ_START_TOKEN)
> >>  		return 0;
> >
> >
> > [..]
> >
> >> +	slow = lock_sock_fast(sk);
> >> +
> >> +	if (unlikely(sk_unhashed(sk))) {
> >> +		rc = SEQ_SKIP;
> >> +		goto unlock;
> >> +	}
> >> +
> >
> > Should we use non-fast version here for consistency with tcp?

> We could, but I don't see a problem with acquiring the fast version for UDP,
> so we could just stick with it. The TCP change warrants a code comment
> though; I'll add it in the next revision.

lock_sock_fast is an exception and we should have a good reason to use
it in a particular place. It blocks bh (rx softirq) and doesn't
consume the backlog on unlock.

$ grep -ri lock_sock_fast . | wc -l
60

$ grep -ri lock_sock . | wc -l
1075 # this includes 60 from the above, but it doesn't matter

So unless you have a good reason to use it (and not a mere "why not"),
let's use regular lock_sock here?
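
For illustration only (not an actual patch), a minimal sketch of the show
handler with the regular lock, assuming the rest of the v4 code stays as-is:

static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
{
	struct udp_iter_state *state = seq->private;
	struct bpf_iter_meta meta;
	struct bpf_prog *prog;
	struct sock *sk = v;
	uid_t uid;
	int rc;

	if (v == SEQ_START_TOKEN)
		return 0;

	/* Regular lock_sock(): may sleep, and release_sock() processes the
	 * socket backlog, unlike the fast variant discussed above.
	 */
	lock_sock(sk);

	if (unlikely(sk_unhashed(sk))) {
		rc = SEQ_SKIP;
		goto unlock;
	}

	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
	meta.seq = seq;
	prog = bpf_iter_get_info(&meta, false);
	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);

unlock:
	release_sock(sk);
	return rc;
}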

> >
> >
> >>  	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
> >>  	meta.seq = seq;
> >>  	prog = bpf_iter_get_info(&meta, false);
> >> -	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> >> +	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> >> +
> >> +unlock:
> >> +	unlock_sock_fast(sk, slow);
> >> +	return rc;
> >> +}
> >> +
> >> +static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
> >> +{
> >> +	while (iter->cur_sk < iter->end_sk)
> >> +		sock_put(iter->batch[iter->cur_sk++]);
> >>  }
> >
> >>  static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
> >>  {
> >> +	struct bpf_udp_iter_state *iter = seq->private;
> >>  	struct bpf_iter_meta meta;
> >>  	struct bpf_prog *prog;
> >
> >> @@ -3194,15 +3379,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
> >>  			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
> >>  	}
> >
> >> -	udp_seq_stop(seq, v);
> >> +	if (iter->cur_sk < iter->end_sk) {
> >> +		bpf_iter_udp_unref_batch(iter);
> >> +		iter->st_bucket_done = false;
> >> +	}
> >>  }
> >
> >>  static const struct seq_operations bpf_iter_udp_seq_ops = {
> >> -	.start		= udp_seq_start,
> >> -	.next		= udp_seq_next,
> >> +	.start		= bpf_iter_udp_seq_start,
> >> +	.next		= bpf_iter_udp_seq_next,
> >>  	.stop		= bpf_iter_udp_seq_stop,
> >>  	.show		= bpf_iter_udp_seq_show,
> >>  };
> >> +
> >> +static unsigned short seq_file_family(const struct seq_file *seq)
> >> +{
> >> +	const struct udp_seq_afinfo *afinfo;
> >> +
> >> +	/* BPF iterator: bpf programs to filter sockets. */
> >> +	if (seq->op == &bpf_iter_udp_seq_ops)
> >> +		return AF_UNSPEC;
> >> +
> >> +	/* Proc fs iterator */
> >> +	afinfo = pde_data(file_inode(seq->file));
> >> +	return afinfo->family;
> >> +}
> >>  #endif
> >
> >>  const struct seq_operations udp_seq_ops = {
> >> @@ -3413,9 +3614,30 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
> >>  DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
> >>  		     struct udp_sock *udp_sk, uid_t uid, int bucket)
> >
> >> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> >> +				      unsigned int new_batch_sz)
> >> +{
> >> +	struct sock **new_batch;
> >> +
> >> +	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
> >> +				   GFP_USER | __GFP_NOWARN);
> >> +	if (!new_batch)
> >> +		return -ENOMEM;
> >> +
> >> +	bpf_iter_udp_unref_batch(iter);
> >> +	kvfree(iter->batch);
> >> +	iter->batch = new_batch;
> >> +	iter->max_sk = new_batch_sz;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +#define INIT_BATCH_SZ 16
> >> +
> >>  static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
> >>  {
> >> -	struct udp_iter_state *st = priv_data;
> >> +	struct bpf_udp_iter_state *iter = priv_data;
> >> +	struct udp_iter_state *st = &iter->state;
> >>  	struct udp_seq_afinfo *afinfo;
> >>  	int ret;
> >
> >> @@ -3427,24 +3649,39 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
> >>  	afinfo->udp_table = NULL;
> >>  	st->bpf_seq_afinfo = afinfo;
> >>  	ret = bpf_iter_init_seq_net(priv_data, aux);
> >> -	if (ret)
> >> +	if (ret) {
> >>  		kfree(afinfo);
> >> +		return ret;
> >> +	}
> >> +	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
> >> +	if (ret) {
> >> +		bpf_iter_fini_seq_net(priv_data);
> >> +		return ret;
> >> +	}
> >> +	iter->cur_sk = 0;
> >> +	iter->end_sk = 0;
> >> +	iter->st_bucket_done = false;
> >> +	st->bucket = 0;
> >> +	st->offset = 0;
> >> +
> >>  	return ret;
> >>  }
> >
> >>  static void bpf_iter_fini_udp(void *priv_data)
> >>  {
> >> -	struct udp_iter_state *st = priv_data;
> >> +	struct bpf_udp_iter_state *iter = priv_data;
> >> +	struct udp_iter_state *st = &iter->state;
> >
> >> -	kfree(st->bpf_seq_afinfo);
> >>  	bpf_iter_fini_seq_net(priv_data);
> >> +	kfree(st->bpf_seq_afinfo);
> >> +	kvfree(iter->batch);
> >>  }
> >
> >>  static const struct bpf_iter_seq_info udp_seq_info = {
> >>  	.seq_ops		= &bpf_iter_udp_seq_ops,
> >>  	.init_seq_private	= bpf_iter_init_udp,
> >>  	.fini_seq_private	= bpf_iter_fini_udp,
> >> -	.seq_priv_size		= sizeof(struct udp_iter_state),
> >> +	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
> >>  };
> >
> >>  static struct bpf_iter_reg udp_reg_info = {
> >> --
> >> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-27 15:57     ` Aditi Ghag
@ 2023-03-27 16:54       ` Stanislav Fomichev
  2023-03-28 17:50         ` Aditi Ghag
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-03-27 16:54 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On 03/27, Aditi Ghag wrote:


> > On Mar 24, 2023, at 2:52 PM, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/23, Aditi Ghag wrote:
> >> The test cases for destroying sockets mirror the intended usages of the
> >> bpf_sock_destroy kfunc using iterators.
> >
> >> The destroy helpers set `ECONNABORTED` error code that we can validate in
> >> the test code with client sockets. But UDP sockets have an overriding error
> >> code from the disconnect called during abort, so the error code
> >> validation is only done for TCP sockets.
> >
> >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> >> ---
> >>  .../selftests/bpf/prog_tests/sock_destroy.c   | 195 ++++++++++++++++++
> >>  .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
> >>  2 files changed, 346 insertions(+)
> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> >
> >> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> >> new file mode 100644
> >> index 000000000000..cbce966af568
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> >> @@ -0,0 +1,195 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +#include <test_progs.h>
> >> +
> >> +#include "sock_destroy_prog.skel.h"
> >> +#include "network_helpers.h"
> >> +
> >> +#define SERVER_PORT 6062
> >> +
> >> +static void start_iter_sockets(struct bpf_program *prog)
> >> +{
> >> +	struct bpf_link *link;
> >> +	char buf[50] = {};
> >> +	int iter_fd, len;
> >> +
> >> +	link = bpf_program__attach_iter(prog, NULL);
> >> +	if (!ASSERT_OK_PTR(link, "attach_iter"))
> >> +		return;
> >> +
> >> +	iter_fd = bpf_iter_create(bpf_link__fd(link));
> >> +	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
> >> +		goto free_link;
> >> +
> >> +	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
> >> +		;
> >> +	ASSERT_GE(len, 0, "read");
> >> +
> >> +	close(iter_fd);
> >> +
> >> +free_link:
> >> +	bpf_link__destroy(link);
> >> +}
> >> +
> >> +static void test_tcp_client(struct sock_destroy_prog *skel)
> >> +{
> >> +	int serv = -1, clien = -1, n = 0;
> >> +
> >> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> >> +	if (!ASSERT_GE(serv, 0, "start_server"))
> >> +		goto cleanup_serv;
> >> +
> >> +	clien = connect_to_fd(serv, 0);
> >> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> >> +		goto cleanup_serv;
> >> +
> >> +	serv = accept(serv, NULL, NULL);
> >> +	if (!ASSERT_GE(serv, 0, "serv accept"))
> >> +		goto cleanup;
> >> +
> >> +	n = send(clien, "t", 1, 0);
> >> +	if (!ASSERT_GE(n, 0, "client send"))
> >> +		goto cleanup;
> >> +
> >> +	/* Run iterator program that destroys connected client sockets. */
> >> +	start_iter_sockets(skel->progs.iter_tcp6_client);
> >> +
> >> +	n = send(clien, "t", 1, 0);
> >> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> >> +		goto cleanup;
> >> +	ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
> >> +
> >> +
> >> +cleanup:
> >> +	close(clien);
> >> +cleanup_serv:
> >> +	close(serv);
> >> +}
> >> +
> >> +static void test_tcp_server(struct sock_destroy_prog *skel)
> >> +{
> >> +	int serv = -1, clien = -1, n = 0;
> >> +
> >> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);
> >> +	if (!ASSERT_GE(serv, 0, "start_server"))
> >> +		goto cleanup_serv;
> >> +
> >> +	clien = connect_to_fd(serv, 0);
> >> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> >> +		goto cleanup_serv;
> >> +
> >> +	serv = accept(serv, NULL, NULL);
> >> +	if (!ASSERT_GE(serv, 0, "serv accept"))
> >> +		goto cleanup;
> >> +
> >> +	n = send(clien, "t", 1, 0);
> >> +	if (!ASSERT_GE(n, 0, "client send"))
> >> +		goto cleanup;
> >> +
> >> +	/* Run iterator program that destroys server sockets. */
> >> +	start_iter_sockets(skel->progs.iter_tcp6_server);
> >> +
> >> +	n = send(clien, "t", 1, 0);
> >> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> >> +		goto cleanup;
> >> +	ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
> >> +
> >> +
> >> +cleanup:
> >> +	close(clien);
> >> +cleanup_serv:
> >> +	close(serv);
> >> +}
> >> +
> >> +
> >> +static void test_udp_client(struct sock_destroy_prog *skel)
> >> +{
> >> +	int serv = -1, clien = -1, n = 0;
> >> +
> >> +	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
> >> +	if (!ASSERT_GE(serv, 0, "start_server"))
> >> +		goto cleanup_serv;
> >> +
> >> +	clien = connect_to_fd(serv, 0);
> >> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> >> +		goto cleanup_serv;
> >> +
> >> +	n = send(clien, "t", 1, 0);
> >> +	if (!ASSERT_GE(n, 0, "client send"))
> >> +		goto cleanup;
> >> +
> >> +	/* Run iterator program that destroys sockets. */
> >> +	start_iter_sockets(skel->progs.iter_udp6_client);
> >> +
> >> +	n = send(clien, "t", 1, 0);
> >> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> >> +		goto cleanup;
> >> +	/* UDP sockets have an overriding error code after they are disconnected,
> >> +	 * so we don't check for ECONNABORTED error code.
> >> +	 */
> >> +
> >> +cleanup:
> >> +	close(clien);
> >> +cleanup_serv:
> >> +	close(serv);
> >> +}
> >> +
> >> +static void test_udp_server(struct sock_destroy_prog *skel)
> >> +{
> >> +	int *listen_fds = NULL, n, i;
> >> +	unsigned int num_listens = 5;
> >> +	char buf[1];
> >> +
> >> +	/* Start reuseport servers. */
> >> +	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
> >> +					    "::1", SERVER_PORT, 0,
> >> +					    num_listens);
> >> +	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
> >> +		goto cleanup;
> >> +
> >> +	/* Run iterator program that destroys server sockets. */
> >> +	start_iter_sockets(skel->progs.iter_udp6_server);
> >> +
> >> +	for (i = 0; i < num_listens; ++i) {
> >> +		n = read(listen_fds[i], buf, sizeof(buf));
> >> +		if (!ASSERT_EQ(n, -1, "read") ||
> >> +		    !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
> >> +			break;
> >> +	}
> >> +	ASSERT_EQ(i, num_listens, "server socket");
> >> +
> >> +cleanup:
> >> +	free_fds(listen_fds, num_listens);
> >> +}
> >> +
> >> +void test_sock_destroy(void)
> >> +{
> >> +	int cgroup_fd = 0;
> >> +	struct sock_destroy_prog *skel;
> >> +
> >> +	skel = sock_destroy_prog__open_and_load();
> >> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> >> +		return;
> >> +
> >> +	cgroup_fd = test__join_cgroup("/sock_destroy");
> >> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> >> +		goto close_cgroup_fd;
> >> +
> >> +	skel->links.sock_connect = bpf_program__attach_cgroup(
> >> +		skel->progs.sock_connect, cgroup_fd);
> >> +	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
> >> +		goto close_cgroup_fd;
> >> +
> >> +	if (test__start_subtest("tcp_client"))
> >> +		test_tcp_client(skel);
> >> +	if (test__start_subtest("tcp_server"))
> >> +		test_tcp_server(skel);
> >> +	if (test__start_subtest("udp_client"))
> >> +		test_udp_client(skel);
> >> +	if (test__start_subtest("udp_server"))
> >> +		test_udp_server(skel);
> >> +
> >> +
> >> +close_cgroup_fd:
> >> +	close(cgroup_fd);
> >> +	sock_destroy_prog__destroy(skel);
> >> +}
> >> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> >> new file mode 100644
> >> index 000000000000..8e09d82c50f3
> >> --- /dev/null
> >> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> >> @@ -0,0 +1,151 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +
> >> +#include "vmlinux.h"
> >> +
> >> +#include "bpf_tracing_net.h"
> >> +#include <bpf/bpf_helpers.h>
> >> +#include <bpf/bpf_endian.h>
> >> +
> >> +#define AF_INET6 10
> >
> > [..]
> >
> >> +/* Keep it in sync with prog_test/sock_destroy. */
> >> +#define SERVER_PORT 6062
> >
> > The test looks good, one optional unrelated nit maybe:
> >
> > I've been guilty of these hard-coded ports in the past, but maybe
> > we should stop hard-coding them? Getting the address of the listener  
> (bound to
> > port 0) and passing it to the bpf program via global variable should be  
> super
> > easy now (with the skeletons and network_helpers).


> I briefly considered adding the ports in a map, and retrieving them in  
> the test. But it didn't seem worthwhile as the tests should fail clearly  
> when there is a mismatch.

My worry is that the number of tests that have a hard-coded port
grows and at some point somebody will clash with somebody else.
And it might not be 100% apparent because test_progs is now multi-threaded
and racy.

> >
> > And, unrelated, maybe also fix a bunch of places where the reverse christmas
> > tree doesn't look reverse anymore?

> Ok. The checks should be part of tooling (e.g., checkpatch) though if  
> they are meant to be enforced consistently, no?

They are networking specific, so they are not part of checkpatch :-(
I won't say they are consistently enforced, but we try to keep them
whenever possible.
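
(For context, "reverse christmas tree" is the netdev convention of ordering local
variable declarations from the longest line down to the shortest. A minimal
illustration, with made-up variables:)

	struct bpf_udp_iter_state *iter = seq->private;
	struct udp_iter_state *state = &iter->state;
	unsigned int bucket_sks = 0;
	struct sock *sk;
	int offset;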

> >
> >> +
> >> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
> >> +
> >> +struct {
> >> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> >> +	__uint(max_entries, 1);
> >> +	__type(key, __u32);
> >> +	__type(value, __u64);
> >> +} tcp_conn_sockets SEC(".maps");
> >> +
> >> +struct {
> >> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> >> +	__uint(max_entries, 1);
> >> +	__type(key, __u32);
> >> +	__type(value, __u64);
> >> +} udp_conn_sockets SEC(".maps");
> >> +
> >> +SEC("cgroup/connect6")
> >> +int sock_connect(struct bpf_sock_addr *ctx)
> >> +{
> >> +	int key = 0;
> >> +	__u64 sock_cookie = 0;
> >> +	__u32 keyc = 0;
> >> +
> >> +	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
> >> +		return 1;
> >> +
> >> +	sock_cookie = bpf_get_socket_cookie(ctx);
> >> +	if (ctx->protocol == IPPROTO_TCP)
> >> +		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
> >> +	else if (ctx->protocol == IPPROTO_UDP)
> >> +		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
> >> +	else
> >> +		return 1;
> >> +
> >> +	return 1;
> >> +}
> >> +
> >> +SEC("iter/tcp")
> >> +int iter_tcp6_client(struct bpf_iter__tcp *ctx)
> >> +{
> >> +	struct sock_common *sk_common = ctx->sk_common;
> >> +	struct seq_file *seq = ctx->meta->seq;
> >> +	__u64 sock_cookie = 0;
> >> +	__u64 *val;
> >> +	int key = 0;
> >> +
> >> +	if (!sk_common)
> >> +		return 0;
> >> +
> >> +	if (sk_common->skc_family != AF_INET6)
> >> +		return 0;
> >> +
> >> +	sock_cookie  = bpf_get_socket_cookie(sk_common);
> >> +	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
> >> +	if (!val)
> >> +		return 0;
> >> +	/* Destroy connected client sockets. */
> >> +	if (sock_cookie == *val)
> >> +		bpf_sock_destroy(sk_common);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +SEC("iter/tcp")
> >> +int iter_tcp6_server(struct bpf_iter__tcp *ctx)
> >> +{
> >> +	struct sock_common *sk_common = ctx->sk_common;
> >> +	struct seq_file *seq = ctx->meta->seq;
> >> +	struct tcp6_sock *tcp_sk;
> >> +	const struct inet_connection_sock *icsk;
> >> +	const struct inet_sock *inet;
> >> +	__u16 srcp;
> >> +
> >> +	if (!sk_common)
> >> +		return 0;
> >> +
> >> +	if (sk_common->skc_family != AF_INET6)
> >> +		return 0;
> >> +
> >> +	tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
> >> +	if (!tcp_sk)
> >> +		return 0;
> >> +
> >> +	icsk = &tcp_sk->tcp.inet_conn;
> >> +	inet = &icsk->icsk_inet;
> >> +	srcp = bpf_ntohs(inet->inet_sport);
> >> +
> >> +	/* Destroy server sockets. */
> >> +	if (srcp == SERVER_PORT)
> >> +		bpf_sock_destroy(sk_common);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +
> >> +SEC("iter/udp")
> >> +int iter_udp6_client(struct bpf_iter__udp *ctx)
> >> +{
> >> +	struct seq_file *seq = ctx->meta->seq;
> >> +	struct udp_sock *udp_sk = ctx->udp_sk;
> >> +	struct sock *sk = (struct sock *) udp_sk;
> >> +	__u64 sock_cookie = 0, *val;
> >> +	int key = 0;
> >> +
> >> +	if (!sk)
> >> +		return 0;
> >> +
> >> +	sock_cookie  = bpf_get_socket_cookie(sk);
> >> +	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
> >> +	if (!val)
> >> +		return 0;
> >> +	/* Destroy connected client sockets. */
> >> +	if (sock_cookie == *val)
> >> +		bpf_sock_destroy((struct sock_common *)sk);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +SEC("iter/udp")
> >> +int iter_udp6_server(struct bpf_iter__udp *ctx)
> >> +{
> >> +	struct seq_file *seq = ctx->meta->seq;
> >> +	struct udp_sock *udp_sk = ctx->udp_sk;
> >> +	struct sock *sk = (struct sock *) udp_sk;
> >> +	__u16 srcp;
> >> +	struct inet_sock *inet;
> >> +
> >> +	if (!sk)
> >> +		return 0;
> >> +
> >> +	inet = &udp_sk->inet;
> >> +	srcp = bpf_ntohs(inet->inet_sport);
> >> +	if (srcp == SERVER_PORT)
> >> +		bpf_sock_destroy((struct sock_common *)sk);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +char _license[] SEC("license") = "GPL";
> >> --
> >> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator
  2023-03-23 20:06 ` [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator Aditi Ghag
  2023-03-24 21:56   ` Stanislav Fomichev
@ 2023-03-27 22:28   ` Martin KaFai Lau
  2023-03-28 17:06     ` Aditi Ghag
  1 sibling, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2023-03-27 22:28 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, sdf, edumazet, Martin KaFai Lau, bpf

On 3/23/23 1:06 PM, Aditi Ghag wrote:
> Batch UDP sockets from BPF iterator that allows for overlapping locking
> semantics in BPF/kernel helpers executed in BPF programs.  This facilitates
> BPF socket destroy kfunc (introduced by follow-up patches) to execute from
> BPF iterator programs.
> 
> Previously, BPF iterators acquired the sock lock and sockets hash table
> bucket lock while executing BPF programs. This prevented BPF helpers that
> again acquire these locks to be executed from BPF iterators.  With the
> batching approach, we acquire a bucket lock, batch all the bucket sockets,
> and then release the bucket lock. This enables BPF or kernel helpers to
> skip sock locking when invoked in the supported BPF contexts.
> 
> The batching logic is similar to the logic implemented in TCP iterator:
> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
> 
> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> ---
>   include/net/udp.h |   1 +
>   net/ipv4/udp.c    | 255 ++++++++++++++++++++++++++++++++++++++++++++--
>   2 files changed, 247 insertions(+), 9 deletions(-)
> 
> diff --git a/include/net/udp.h b/include/net/udp.h
> index de4b528522bb..d2999447d3f2 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
>   struct udp_iter_state {
>   	struct seq_net_private  p;
>   	int			bucket;
> +	int			offset;

offset should be moved to 'struct bpf_udp_iter_state' instead. It is specific to 
bpf_iter only.
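
A rough sketch of the move (the field placement is illustrative only):

	struct bpf_udp_iter_state {
		struct udp_iter_state state;
		unsigned int cur_sk;
		unsigned int end_sk;
		unsigned int max_sk;
		int offset;	/* resume position within the current bucket */
		struct sock **batch;
		bool st_bucket_done;
	};

That would leave 'struct udp_iter_state' unchanged for the procfs path.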

>   	struct udp_seq_afinfo	*bpf_seq_afinfo;
>   };
>   
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index c605d171eb2d..58c620243e47 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3152,6 +3152,171 @@ struct bpf_iter__udp {
>   	int bucket __aligned(8);
>   };
>   
> +struct bpf_udp_iter_state {
> +	struct udp_iter_state state;
> +	unsigned int cur_sk;
> +	unsigned int end_sk;
> +	unsigned int max_sk;
> +	struct sock **batch;
> +	bool st_bucket_done;
> +};
> +
> +static unsigned short seq_file_family(const struct seq_file *seq);
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz);
> +
> +static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
> +{
> +	unsigned short family = seq_file_family(seq);
> +
> +	/* AF_UNSPEC is used as a match all */
> +	return ((family == AF_UNSPEC || family == sk->sk_family) &&
> +		net_eq(sock_net(sk), seq_file_net(seq)));
> +}
> +
> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct net *net = seq_file_net(seq);
> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
> +	struct udp_table *udptable;
> +	struct sock *first_sk = NULL;
> +	struct sock *sk;
> +	unsigned int bucket_sks = 0;
> +	bool resized = false;
> +	int offset = 0;
> +	int new_offset;
> +
> +	/* The current batch is done, so advance the bucket. */
> +	if (iter->st_bucket_done) {
> +		state->bucket++;
> +		state->offset = 0;
> +	}
> +
> +	udptable = udp_get_table_afinfo(afinfo, net);
> +
> +	if (state->bucket > udptable->mask) {
> +		state->bucket = 0;
> +		state->offset = 0;
> +		return NULL;
> +	}
> +
> +again:
> +	/* New batch for the next bucket.
> +	 * Iterate over the hash table to find a bucket with sockets matching
> +	 * the iterator attributes, and return the first matching socket from
> +	 * the bucket. The remaining matched sockets from the bucket are batched
> +	 * before releasing the bucket lock. This allows BPF programs that are
> +	 * called in seq_show to acquire the bucket lock if needed.
> +	 */
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	first_sk = NULL;
> +	bucket_sks = 0;
> +	offset = state->offset;
> +	new_offset = offset;
> +
> +	for (; state->bucket <= udptable->mask; state->bucket++) {
> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];

Use udptable->hash2, which is hashed by addr and port. It will help to get a
smaller batch. This was the comment given in v2.
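
For reference, a rough sketch of what walking hash2 could look like. hash2 entries
are linked via the portaddr node, so the plain sk_for_each() walk used for hash
would not apply there; this is only a sketch of the idea, not the final code:

	struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];

	spin_lock_bh(&hslot2->lock);
	udp_portaddr_for_each_entry(sk, &hslot2->head) {
		if (seq_sk_match(seq, sk)) {
			/* batch sk as in the hash-based loop below */
		}
	}
	spin_unlock_bh(&hslot2->lock);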

> +
> +		if (hlist_empty(&hslot->head)) {
> +			offset = 0;
> +			continue;
> +		}
> +
> +		spin_lock_bh(&hslot->lock);
> +		/* Resume from the last saved position in a bucket before
> +		 * iterator was stopped.
> +		 */
> +		while (offset-- > 0) {
> +			sk_for_each(sk, &hslot->head)
> +				continue;
> +		}

hmm... how do the above while loop and sk_for_each loop actually work?

> +		sk_for_each(sk, &hslot->head) {

Here it starts from the beginning of hslot->head again, which doesn't look right either.

Am I missing something here?
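
As written, the while loop walks the whole list 'offset' times (the inner
sk_for_each() just runs to completion each time), and the second sk_for_each()
then restarts from the head, so nothing is actually skipped. One way the intended
resume could be expressed, as a rough sketch only (it also leaves open how the
skipped entries should be counted):

		sk_for_each(sk, &hslot->head) {
			if (!seq_sk_match(seq, sk))
				continue;
			if (offset) {
				/* Already shown before the iterator stopped. */
				offset--;
				continue;
			}
			if (!first_sk)
				first_sk = sk;
			if (iter->end_sk < iter->max_sk) {
				sock_hold(sk);
				iter->batch[iter->end_sk++] = sk;
			}
			bucket_sks++;
		}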

> +			if (seq_sk_match(seq, sk)) {
> +				if (!first_sk)
> +					first_sk = sk;
> +				if (iter->end_sk < iter->max_sk) {
> +					sock_hold(sk);
> +					iter->batch[iter->end_sk++] = sk;
> +				}
> +				bucket_sks++;
> +			}
> +			new_offset++;

And this new_offset is incremented outside of the seq_sk_match() check, so it is
not counting sockets for the seq_file_net(seq) netns alone.

> +		}
> +		spin_unlock_bh(&hslot->lock);
> +
> +		if (first_sk)
> +			break;
> +
> +		/* Reset the current bucket's offset before moving to the next bucket. */
> +		offset = 0;
> +		new_offset = 0;
> +	}
> +
> +	/* All done: no batch made. */
> +	if (!first_sk)
> +		goto ret;
> +
> +	if (iter->end_sk == bucket_sks) {
> +		/* Batching is done for the current bucket; return the first
> +		 * socket to be iterated from the batch.
> +		 */
> +		iter->st_bucket_done = true;
> +		goto ret;
> +	}
> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
> +		resized = true;
> +		/* Go back to the previous bucket to resize its batch. */
> +		state->bucket--;
> +		goto again;
> +	}
> +ret:
> +	state->offset = new_offset;
> +	return first_sk;
> +}
> +
> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_udp_iter_state *iter = seq->private;
> +	struct udp_iter_state *state = &iter->state;
> +	struct sock *sk;
> +
> +	/* Whenever seq_next() is called, the iter->cur_sk is
> +	 * done with seq_show(), so unref the iter->cur_sk.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sock_put(iter->batch[iter->cur_sk++]);
> +		++state->offset;

but then,
if I read it correctly, this offset only counts sockets in the seq_file_net(seq)
netns, because the batch is specific to seq_file_net(seq). Is it going to
work?

> +	}
> +
> +	/* After updating iter->cur_sk, check if there are more sockets
> +	 * available in the current bucket batch.
> +	 */
> +	if (iter->cur_sk < iter->end_sk) {
> +		sk = iter->batch[iter->cur_sk];
> +	} else {
> +		// Prepare a new batch.
> +		sk = bpf_iter_udp_batch(seq);
> +	}
> +
> +	++*pos;
> +	return sk;
> +}
> +
> +static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	/* bpf iter does not support lseek, so it always
> +	 * continue from where it was stop()-ped.
> +	 */
> +	if (*pos)
> +		return bpf_iter_udp_batch(seq);
> +
> +	return SEQ_START_TOKEN;
> +}
> +
>   static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
>   			     struct udp_sock *udp_sk, uid_t uid, int bucket)
>   {
> @@ -3172,18 +3337,38 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
>   	struct bpf_prog *prog;
>   	struct sock *sk = v;
>   	uid_t uid;
> +	bool slow;
> +	int rc;
>   
>   	if (v == SEQ_START_TOKEN)
>   		return 0;
>   
> +	slow = lock_sock_fast(sk);
> +
> +	if (unlikely(sk_unhashed(sk))) {
> +		rc = SEQ_SKIP;
> +		goto unlock;
> +	}
> +
>   	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
>   	meta.seq = seq;
>   	prog = bpf_iter_get_info(&meta, false);
> -	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
> +
> +unlock:
> +	unlock_sock_fast(sk, slow);
> +	return rc;
> +}
> +
> +static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)

nit. Please use the same naming as in tcp-iter and unix-iter, so 
bpf_iter_udp_put_batch().

> +{
> +	while (iter->cur_sk < iter->end_sk)
> +		sock_put(iter->batch[iter->cur_sk++]);
>   }
>   
>   static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   {
> +	struct bpf_udp_iter_state *iter = seq->private;
>   	struct bpf_iter_meta meta;
>   	struct bpf_prog *prog;
>   
> @@ -3194,15 +3379,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>   			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
>   	}
>   
> -	udp_seq_stop(seq, v);
> +	if (iter->cur_sk < iter->end_sk) {
> +		bpf_iter_udp_unref_batch(iter);
> +		iter->st_bucket_done = false;
> +	}
>   }
>   
>   static const struct seq_operations bpf_iter_udp_seq_ops = {
> -	.start		= udp_seq_start,
> -	.next		= udp_seq_next,
> +	.start		= bpf_iter_udp_seq_start,
> +	.next		= bpf_iter_udp_seq_next,
>   	.stop		= bpf_iter_udp_seq_stop,
>   	.show		= bpf_iter_udp_seq_show,
>   };
> +
> +static unsigned short seq_file_family(const struct seq_file *seq)
> +{
> +	const struct udp_seq_afinfo *afinfo;
> +
> +	/* BPF iterator: bpf programs to filter sockets. */
> +	if (seq->op == &bpf_iter_udp_seq_ops)
> +		return AF_UNSPEC;
> +
> +	/* Proc fs iterator */
> +	afinfo = pde_data(file_inode(seq->file));
> +	return afinfo->family;
> +}
>   #endif
>   
>   const struct seq_operations udp_seq_ops = {
> @@ -3413,9 +3614,30 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
>   DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
>   		     struct udp_sock *udp_sk, uid_t uid, int bucket)
>   
> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
> +				      unsigned int new_batch_sz)
> +{
> +	struct sock **new_batch;
> +
> +	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
> +				   GFP_USER | __GFP_NOWARN);
> +	if (!new_batch)
> +		return -ENOMEM;
> +
> +	bpf_iter_udp_unref_batch(iter);
> +	kvfree(iter->batch);
> +	iter->batch = new_batch;
> +	iter->max_sk = new_batch_sz;
> +
> +	return 0;
> +}
> +
> +#define INIT_BATCH_SZ 16
> +
>   static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   {
> -	struct udp_iter_state *st = priv_data;
> +	struct bpf_udp_iter_state *iter = priv_data;
> +	struct udp_iter_state *st = &iter->state;
>   	struct udp_seq_afinfo *afinfo;
>   	int ret;
>   
> @@ -3427,24 +3649,39 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>   	afinfo->udp_table = NULL;
>   	st->bpf_seq_afinfo = afinfo;
>   	ret = bpf_iter_init_seq_net(priv_data, aux);
> -	if (ret)
> +	if (ret) {
>   		kfree(afinfo);
> +		return ret;
> +	}
> +	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
> +	if (ret) {
> +		bpf_iter_fini_seq_net(priv_data);
> +		return ret;
> +	}
> +	iter->cur_sk = 0;
> +	iter->end_sk = 0;
> +	iter->st_bucket_done = false;
> +	st->bucket = 0;
> +	st->offset = 0;

 From looking at the tcp and unix counterparts, I don't think this zeroing is
necessary.

> +
>   	return ret;
>   }
>   
>   static void bpf_iter_fini_udp(void *priv_data)
>   {
> -	struct udp_iter_state *st = priv_data;
> +	struct bpf_udp_iter_state *iter = priv_data;
> +	struct udp_iter_state *st = &iter->state;
>   
> -	kfree(st->bpf_seq_afinfo);

The st->bpf_seq_afinfo should no longer be needed. Please remove it from 'struct 
udp_iter_state'.

The other AF_UNSPEC test in the existing udp_get_{first,next,...} should be 
cleaned up to use the refactored seq_sk_match() also.

These two changes should be done as the first one (or two?) cleanup patches 
before the actual udp batching patch. The tcp-iter-batching patch set could be a 
reference point on how the patch set could be structured.
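
For the udp_get_{first,next} cleanup, a rough sketch of what the inner loop of the
existing udp_get_first() could collapse to once seq_sk_match() is available (only
the matching logic is shown; the bucket walk and locking stay as they are today):

		spin_lock_bh(&hslot->lock);
		sk_for_each(sk, &hslot->head) {
			if (seq_sk_match(seq, sk))
				goto found;
		}
		spin_unlock_bh(&hslot->lock);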

>   	bpf_iter_fini_seq_net(priv_data);
> +	kfree(st->bpf_seq_afinfo);
> +	kvfree(iter->batch);
>   }
>   
>   static const struct bpf_iter_seq_info udp_seq_info = {
>   	.seq_ops		= &bpf_iter_udp_seq_ops,
>   	.init_seq_private	= bpf_iter_init_udp,
>   	.fini_seq_private	= bpf_iter_fini_udp,
> -	.seq_priv_size		= sizeof(struct udp_iter_state),
> +	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
>   };
>   
>   static struct bpf_iter_reg udp_reg_info = {


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator
  2023-03-23 20:06 ` [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator Aditi Ghag
  2023-03-24 21:45   ` Stanislav Fomichev
@ 2023-03-27 22:34   ` Martin KaFai Lau
  1 sibling, 0 replies; 29+ messages in thread
From: Martin KaFai Lau @ 2023-03-27 22:34 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: kafai, sdf, edumazet, bpf

On 3/23/23 1:06 PM, Aditi Ghag wrote:
> Previously, BPF TCP iterator was acquiring fast version of sock lock that
> disables the BH. This introduced a circular dependency with code paths that
> later acquire sockets hash table bucket lock.
> Replace the fast version of the sock lock with the slow one, which lets BPF
> programs executed from the iterator destroy TCP listening sockets
> using the bpf_sock_destroy kfunc.

This patch should be moved before the bpf_sock_destroy patch.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator
  2023-03-24 21:45   ` Stanislav Fomichev
@ 2023-03-28 15:20     ` Aditi Ghag
  0 siblings, 0 replies; 29+ messages in thread
From: Aditi Ghag @ 2023-03-28 15:20 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, Eric Dumazet



> On Mar 24, 2023, at 2:45 PM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On 03/23, Aditi Ghag wrote:
>> Previously, BPF TCP iterator was acquiring fast version of sock lock that
>> disables the BH. This introduced a circular dependency with code paths that
>> later acquire sockets hash table bucket lock.
>> Replace the fast version of the sock lock with the slow one, which lets BPF
>> programs executed from the iterator destroy TCP listening sockets
>> using the bpf_sock_destroy kfunc.
> 
>> Here is a stack trace that motivated this change:
> 
>> ```
>> 1) sock_lock with BH disabled + bucket lock
> 
>> lock_acquire+0xcd/0x330
>> _raw_spin_lock_bh+0x38/0x50
>> inet_unhash+0x96/0xd0
>> tcp_set_state+0x6a/0x210
>> tcp_abort+0x12b/0x230
>> bpf_prog_f4110fb1100e26b5_iter_tcp6_server+0xa3/0xaa
>> bpf_iter_run_prog+0x1ff/0x340
>> bpf_iter_tcp_seq_show+0xca/0x190
>> bpf_seq_read+0x177/0x450
>> vfs_read+0xc6/0x300
>> ksys_read+0x69/0xf0
>> do_syscall_64+0x3c/0x90
>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> 
>> 2) sock lock with BH enable
> 
>> [    1.499968]   lock_acquire+0xcd/0x330
>> [    1.500316]   _raw_spin_lock+0x33/0x40
>> [    1.500670]   sk_clone_lock+0x146/0x520
>> [    1.501030]   inet_csk_clone_lock+0x1b/0x110
>> [    1.501433]   tcp_create_openreq_child+0x22/0x3f0
>> [    1.501873]   tcp_v6_syn_recv_sock+0x96/0x940
>> [    1.502284]   tcp_check_req+0x137/0x660
>> [    1.502646]   tcp_v6_rcv+0xa63/0xe80
>> [    1.502994]   ip6_protocol_deliver_rcu+0x78/0x590
>> [    1.503434]   ip6_input_finish+0x72/0x140
>> [    1.503818]   __netif_receive_skb_one_core+0x63/0xa0
>> [    1.504281]   process_backlog+0x79/0x260
>> [    1.504668]   __napi_poll.constprop.0+0x27/0x170
>> [    1.505104]   net_rx_action+0x14a/0x2a0
>> [    1.505469]   __do_softirq+0x165/0x510
>> [    1.505842]   do_softirq+0xcd/0x100
>> [    1.506172]   __local_bh_enable_ip+0xcc/0xf0
>> [    1.506588]   ip6_finish_output2+0x2a8/0xb00
>> [    1.506988]   ip6_finish_output+0x274/0x510
>> [    1.507377]   ip6_xmit+0x319/0x9b0
>> [    1.507726]   inet6_csk_xmit+0x12b/0x2b0
>> [    1.508096]   __tcp_transmit_skb+0x549/0xc40
>> [    1.508498]   tcp_rcv_state_process+0x362/0x1180
> 
>> ```
> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> 
> Acked-by: Stanislav Fomichev <sdf@google.com>
> 
> No Fixes: tag needed, because it doesn't trigger without your new
> bpf_sock_destroy?

That's right.

> 
> 
>> ---
>>  net/ipv4/tcp_ipv4.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index ea370afa70ed..f2d370a9450f 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -2962,7 +2962,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>  	struct bpf_iter_meta meta;
>>  	struct bpf_prog *prog;
>>  	struct sock *sk = v;
>> -	bool slow;
>>  	uid_t uid;
>>  	int ret;
> 
>> @@ -2970,7 +2969,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>  		return 0;
> 
>>  	if (sk_fullsock(sk))
>> -		slow = lock_sock_fast(sk);
>> +		lock_sock(sk);
> 
>>  	if (unlikely(sk_unhashed(sk))) {
>>  		ret = SEQ_SKIP;
>> @@ -2994,7 +2993,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
> 
>>  unlock:
>>  	if (sk_fullsock(sk))
>> -		unlock_sock_fast(sk, slow);
>> +		release_sock(sk);
>>  	return ret;
> 
>>  }
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator
  2023-03-27 22:28   ` Martin KaFai Lau
@ 2023-03-28 17:06     ` Aditi Ghag
  2023-03-28 21:33       ` Martin KaFai Lau
  0 siblings, 1 reply; 29+ messages in thread
From: Aditi Ghag @ 2023-03-28 17:06 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: kafai, sdf, edumazet, Martin KaFai Lau, bpf


> On Mar 27, 2023, at 3:28 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/23/23 1:06 PM, Aditi Ghag wrote:
>> Batch UDP sockets from BPF iterator that allows for overlapping locking
>> semantics in BPF/kernel helpers executed in BPF programs.  This facilitates
>> BPF socket destroy kfunc (introduced by follow-up patches) to execute from
>> BPF iterator programs.
>> Previously, BPF iterators acquired the sock lock and sockets hash table
>> bucket lock while executing BPF programs. This prevented BPF helpers that
>> again acquire these locks from being executed in BPF iterators.  With the
>> batching approach, we acquire a bucket lock, batch all the bucket sockets,
>> and then release the bucket lock. This enables BPF or kernel helpers to
>> skip sock locking when invoked in the supported BPF contexts.
>> The batching logic is similar to the logic implemented in TCP iterator:
>> https://lore.kernel.org/bpf/20210701200613.1036157-1-kafai@fb.com/.
>> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  include/net/udp.h |   1 +
>>  net/ipv4/udp.c    | 255 ++++++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 247 insertions(+), 9 deletions(-)
>> diff --git a/include/net/udp.h b/include/net/udp.h
>> index de4b528522bb..d2999447d3f2 100644
>> --- a/include/net/udp.h
>> +++ b/include/net/udp.h
>> @@ -437,6 +437,7 @@ struct udp_seq_afinfo {
>>  struct udp_iter_state {
>>  	struct seq_net_private  p;
>>  	int			bucket;
>> +	int			offset;
> 
> offset should be moved to 'struct bpf_udp_iter_state' instead. It is specific to bpf_iter only.

Sure, I'll move it.

> 
>>  	struct udp_seq_afinfo	*bpf_seq_afinfo;
>>  };
>>  diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index c605d171eb2d..58c620243e47 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -3152,6 +3152,171 @@ struct bpf_iter__udp {
>>  	int bucket __aligned(8);
>>  };
>>  +struct bpf_udp_iter_state {
>> +	struct udp_iter_state state;
>> +	unsigned int cur_sk;
>> +	unsigned int end_sk;
>> +	unsigned int max_sk;
>> +	struct sock **batch;
>> +	bool st_bucket_done;
>> +};
>> +
>> +static unsigned short seq_file_family(const struct seq_file *seq);
>> +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
>> +				      unsigned int new_batch_sz);
>> +
>> +static inline bool seq_sk_match(struct seq_file *seq, const struct sock *sk)
>> +{
>> +	unsigned short family = seq_file_family(seq);
>> +
>> +	/* AF_UNSPEC is used as a match all */
>> +	return ((family == AF_UNSPEC || family == sk->sk_family) &&
>> +		net_eq(sock_net(sk), seq_file_net(seq)));
>> +}
>> +
>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>> +{
>> +	struct bpf_udp_iter_state *iter = seq->private;
>> +	struct udp_iter_state *state = &iter->state;
>> +	struct net *net = seq_file_net(seq);
>> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
>> +	struct udp_table *udptable;
>> +	struct sock *first_sk = NULL;
>> +	struct sock *sk;
>> +	unsigned int bucket_sks = 0;
>> +	bool resized = false;
>> +	int offset = 0;
>> +	int new_offset;
>> +
>> +	/* The current batch is done, so advance the bucket. */
>> +	if (iter->st_bucket_done) {
>> +		state->bucket++;
>> +		state->offset = 0;
>> +	}
>> +
>> +	udptable = udp_get_table_afinfo(afinfo, net);
>> +
>> +	if (state->bucket > udptable->mask) {
>> +		state->bucket = 0;
>> +		state->offset = 0;
>> +		return NULL;
>> +	}
>> +
>> +again:
>> +	/* New batch for the next bucket.
>> +	 * Iterate over the hash table to find a bucket with sockets matching
>> +	 * the iterator attributes, and return the first matching socket from
>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>> +	 * before releasing the bucket lock. This allows BPF programs that are
>> +	 * called in seq_show to acquire the bucket lock if needed.
>> +	 */
>> +	iter->cur_sk = 0;
>> +	iter->end_sk = 0;
>> +	iter->st_bucket_done = false;
>> +	first_sk = NULL;
>> +	bucket_sks = 0;
>> +	offset = state->offset;
>> +	new_offset = offset;
>> +
>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];
> 
> Use udptable->hash2, which is hashed by addr and port. It will help to get a smaller batch. This was the comment given in v2.

I thought I replied to your review comment, but looks like I didn't. My bad!
I already gave it a shot, and I'll need to understand better how udptable->hash2 is populated. When I swapped hash with hash2, there were no sockets to iterate. Am I missing something obvious? 

> 
>> +
>> +		if (hlist_empty(&hslot->head)) {
>> +			offset = 0;
>> +			continue;
>> +		}
>> +
>> +		spin_lock_bh(&hslot->lock);
>> +		/* Resume from the last saved position in a bucket before
>> +		 * iterator was stopped.
>> +		 */
>> +		while (offset-- > 0) {
>> +			sk_for_each(sk, &hslot->head)
>> +				continue;
>> +		}
> 
> hmm... how do the above while loop and sk_for_each loop actually work?
> 
>> +		sk_for_each(sk, &hslot->head) {
> 
> Here it starts from the beginning of hslot->head again, which doesn't look right either.
> 
> Am I missing something here?
> 
>> +			if (seq_sk_match(seq, sk)) {
>> +				if (!first_sk)
>> +					first_sk = sk;
>> +				if (iter->end_sk < iter->max_sk) {
>> +					sock_hold(sk);
>> +					iter->batch[iter->end_sk++] = sk;
>> +				}
>> +				bucket_sks++;
>> +			}
>> +			new_offset++;
> 
> And this new_offset is incremented outside of the seq_sk_match() check, so it is not counting sockets for the seq_file_net(seq) netns alone.

This logic to resume the iterator is indeed buggy! I was trying to account for the cases where the current bucket could've been updated since we released the bucket lock.
This is what I intended to do -

+loop:
                sk_for_each(sk, &hslot->head) {
                        if (seq_sk_match(seq, sk)) {
+                               /* Resume from the last saved position in the
+                                * bucket before iterator was stopped.
+                                */
+                               while (offset && offset-- > 0)
+                                       goto loop;
                                if (!first_sk)
                                        first_sk = sk;
                                if (iter->end_sk < iter->max_sk) {
@@ -3245,8 +3244,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
                                        iter->batch[iter->end_sk++] = sk;
                                }
                                bucket_sks++;
+                              new_offset++;
                        }

This handles the case where sockets that weren't iterated in the previous round got deleted by the time the iterator was resumed. But it's also possible that previously iterated sockets got deleted before the iterator was resumed, leaving the saved offset outdated. Ideally, the iterator should be invalidated in this case, but there is no way to track this, is there? Any thoughts?


> 
>> +		}
>> +		spin_unlock_bh(&hslot->lock);
>> +
>> +		if (first_sk)
>> +			break;
>> +
>> +		/* Reset the current bucket's offset before moving to the next bucket. */
>> +		offset = 0;
>> +		new_offset = 0;
>> +	}
>> +
>> +	/* All done: no batch made. */
>> +	if (!first_sk)
>> +		goto ret;
>> +
>> +	if (iter->end_sk == bucket_sks) {
>> +		/* Batching is done for the current bucket; return the first
>> +		 * socket to be iterated from the batch.
>> +		 */
>> +		iter->st_bucket_done = true;
>> +		goto ret;
>> +	}
>> +	if (!resized && !bpf_iter_udp_realloc_batch(iter, bucket_sks * 3 / 2)) {
>> +		resized = true;
>> +		/* Go back to the previous bucket to resize its batch. */
>> +		state->bucket--;
>> +		goto again;
>> +	}
>> +ret:
>> +	state->offset = new_offset;
>> +	return first_sk;
>> +}
>> +
>> +static void *bpf_iter_udp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>> +{
>> +	struct bpf_udp_iter_state *iter = seq->private;
>> +	struct udp_iter_state *state = &iter->state;
>> +	struct sock *sk;
>> +
>> +	/* Whenever seq_next() is called, the iter->cur_sk is
>> +	 * done with seq_show(), so unref the iter->cur_sk.
>> +	 */
>> +	if (iter->cur_sk < iter->end_sk) {
>> +		sock_put(iter->batch[iter->cur_sk++]);
>> +		++state->offset;
> 
> but then,
> if I read it correctly, this offset only counts sockets in the seq_file_net(seq) netns, because the batch is specific to seq_file_net(seq). Is it going to work?
> 
>> +	}
>> +
>> +	/* After updating iter->cur_sk, check if there are more sockets
>> +	 * available in the current bucket batch.
>> +	 */
>> +	if (iter->cur_sk < iter->end_sk) {
>> +		sk = iter->batch[iter->cur_sk];
>> +	} else {
>> +		// Prepare a new batch.
>> +		sk = bpf_iter_udp_batch(seq);
>> +	}
>> +
>> +	++*pos;
>> +	return sk;
>> +}
>> +
>> +static void *bpf_iter_udp_seq_start(struct seq_file *seq, loff_t *pos)
>> +{
>> +	/* bpf iter does not support lseek, so it always
>> +	 * continue from where it was stop()-ped.
>> +	 */
>> +	if (*pos)
>> +		return bpf_iter_udp_batch(seq);
>> +
>> +	return SEQ_START_TOKEN;
>> +}
>> +
>>  static int udp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
>>  			     struct udp_sock *udp_sk, uid_t uid, int bucket)
>>  {
>> @@ -3172,18 +3337,38 @@ static int bpf_iter_udp_seq_show(struct seq_file *seq, void *v)
>>  	struct bpf_prog *prog;
>>  	struct sock *sk = v;
>>  	uid_t uid;
>> +	bool slow;
>> +	int rc;
>>    	if (v == SEQ_START_TOKEN)
>>  		return 0;
>>  +	slow = lock_sock_fast(sk);
>> +
>> +	if (unlikely(sk_unhashed(sk))) {
>> +		rc = SEQ_SKIP;
>> +		goto unlock;
>> +	}
>> +
>>  	uid = from_kuid_munged(seq_user_ns(seq), sock_i_uid(sk));
>>  	meta.seq = seq;
>>  	prog = bpf_iter_get_info(&meta, false);
>> -	return udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
>> +	rc = udp_prog_seq_show(prog, &meta, v, uid, state->bucket);
>> +
>> +unlock:
>> +	unlock_sock_fast(sk, slow);
>> +	return rc;
>> +}
>> +
>> +static void bpf_iter_udp_unref_batch(struct bpf_udp_iter_state *iter)
> 
> nit. Please use the same naming as in tcp-iter and unix-iter, so bpf_iter_udp_put_batch().

Ack
> 
>> +{
>> +	while (iter->cur_sk < iter->end_sk)
>> +		sock_put(iter->batch[iter->cur_sk++]);
>>  }
>>    static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>>  {
>> +	struct bpf_udp_iter_state *iter = seq->private;
>>  	struct bpf_iter_meta meta;
>>  	struct bpf_prog *prog;
>>  @@ -3194,15 +3379,31 @@ static void bpf_iter_udp_seq_stop(struct seq_file *seq, void *v)
>>  			(void)udp_prog_seq_show(prog, &meta, v, 0, 0);
>>  	}
>>  -	udp_seq_stop(seq, v);
>> +	if (iter->cur_sk < iter->end_sk) {
>> +		bpf_iter_udp_unref_batch(iter);
>> +		iter->st_bucket_done = false;
>> +	}
>>  }
>>    static const struct seq_operations bpf_iter_udp_seq_ops = {
>> -	.start		= udp_seq_start,
>> -	.next		= udp_seq_next,
>> +	.start		= bpf_iter_udp_seq_start,
>> +	.next		= bpf_iter_udp_seq_next,
>>  	.stop		= bpf_iter_udp_seq_stop,
>>  	.show		= bpf_iter_udp_seq_show,
>>  };
>> +
>> +static unsigned short seq_file_family(const struct seq_file *seq)
>> +{
>> +	const struct udp_seq_afinfo *afinfo;
>> +
>> +	/* BPF iterator: bpf programs to filter sockets. */
>> +	if (seq->op == &bpf_iter_udp_seq_ops)
>> +		return AF_UNSPEC;
>> +
>> +	/* Proc fs iterator */
>> +	afinfo = pde_data(file_inode(seq->file));
>> +	return afinfo->family;
>> +}
>>  #endif
>>    const struct seq_operations udp_seq_ops = {
>> @@ -3413,9 +3614,30 @@ static struct pernet_operations __net_initdata udp_sysctl_ops = {
>>  DEFINE_BPF_ITER_FUNC(udp, struct bpf_iter_meta *meta,
>>  		     struct udp_sock *udp_sk, uid_t uid, int bucket)
>>  +static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
>> +				      unsigned int new_batch_sz)
>> +{
>> +	struct sock **new_batch;
>> +
>> +	new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
>> +				   GFP_USER | __GFP_NOWARN);
>> +	if (!new_batch)
>> +		return -ENOMEM;
>> +
>> +	bpf_iter_udp_unref_batch(iter);
>> +	kvfree(iter->batch);
>> +	iter->batch = new_batch;
>> +	iter->max_sk = new_batch_sz;
>> +
>> +	return 0;
>> +}
>> +
>> +#define INIT_BATCH_SZ 16
>> +
>>  static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>>  {
>> -	struct udp_iter_state *st = priv_data;
>> +	struct bpf_udp_iter_state *iter = priv_data;
>> +	struct udp_iter_state *st = &iter->state;
>>  	struct udp_seq_afinfo *afinfo;
>>  	int ret;
>>  @@ -3427,24 +3649,39 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux)
>>  	afinfo->udp_table = NULL;
>>  	st->bpf_seq_afinfo = afinfo;
>>  	ret = bpf_iter_init_seq_net(priv_data, aux);
>> -	if (ret)
>> +	if (ret) {
>>  		kfree(afinfo);
>> +		return ret;
>> +	}
>> +	ret = bpf_iter_udp_realloc_batch(iter, INIT_BATCH_SZ);
>> +	if (ret) {
>> +		bpf_iter_fini_seq_net(priv_data);
>> +		return ret;
>> +	}
>> +	iter->cur_sk = 0;
>> +	iter->end_sk = 0;
>> +	iter->st_bucket_done = false;
>> +	st->bucket = 0;
>> +	st->offset = 0;
> 
> From looking at the tcp and unix counterparts, I don't think this zeroing is necessary.

Ack

> 
>> +
>>  	return ret;
>>  }
>>    static void bpf_iter_fini_udp(void *priv_data)
>>  {
>> -	struct udp_iter_state *st = priv_data;
>> +	struct bpf_udp_iter_state *iter = priv_data;
>> +	struct udp_iter_state *st = &iter->state;
>>  -	kfree(st->bpf_seq_afinfo);
> 
> The st->bpf_seq_afinfo should no longer be needed. Please remove it from 'struct udp_iter_state'.
> 
> The other AF_UNSPEC test in the existing udp_get_{first,next,...} should be cleaned up to use the refactored seq_sk_match() also.
> 
> These two changes should be done as the first one (or two?) cleanup patches before the actual udp batching patch. The tcp-iter-batching patch set could be a reference point on how the patch set could be structured.

Ack for both the clean-up and reshuffling. 

> 
>>  	bpf_iter_fini_seq_net(priv_data);
>> +	kfree(st->bpf_seq_afinfo);
>> +	kvfree(iter->batch);
>>  }
>>    static const struct bpf_iter_seq_info udp_seq_info = {
>>  	.seq_ops		= &bpf_iter_udp_seq_ops,
>>  	.init_seq_private	= bpf_iter_init_udp,
>>  	.fini_seq_private	= bpf_iter_fini_udp,
>> -	.seq_priv_size		= sizeof(struct udp_iter_state),
>> +	.seq_priv_size		= sizeof(struct bpf_udp_iter_state),
>>  };
>>    static struct bpf_iter_reg udp_reg_info = {


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-27 16:54       ` Stanislav Fomichev
@ 2023-03-28 17:50         ` Aditi Ghag
  2023-03-28 18:35           ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Aditi Ghag @ 2023-03-28 17:50 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Mar 27, 2023, at 9:54 AM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On 03/27, Aditi Ghag wrote:
> 
> 
>> > On Mar 24, 2023, at 2:52 PM, Stanislav Fomichev <sdf@google.com> wrote:
>> >
>> > On 03/23, Aditi Ghag wrote:
>> >> The test cases for destroying sockets mirror the intended usages of the
>> >> bpf_sock_destroy kfunc using iterators.
>> >
>> >> The destroy helpers set `ECONNABORTED` error code that we can validate in
>> >> the test code with client sockets. But UDP sockets have an overriding error
>> >> code from the disconnect called during abort, so the error code
>> >> validation is only done for TCP sockets.
>> >
>> >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> >> ---
>> >>  .../selftests/bpf/prog_tests/sock_destroy.c   | 195 ++++++++++++++++++
>> >>  .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
>> >>  2 files changed, 346 insertions(+)
>> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>> >>  create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>> >
>> >> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>> >> new file mode 100644
>> >> index 000000000000..cbce966af568
>> >> --- /dev/null
>> >> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>> >> @@ -0,0 +1,195 @@
>> >> +// SPDX-License-Identifier: GPL-2.0
>> >> +#include <test_progs.h>
>> >> +
>> >> +#include "sock_destroy_prog.skel.h"
>> >> +#include "network_helpers.h"
>> >> +
>> >> +#define SERVER_PORT 6062
>> >> +
>> >> +static void start_iter_sockets(struct bpf_program *prog)
>> >> +{
>> >> +	struct bpf_link *link;
>> >> +	char buf[50] = {};
>> >> +	int iter_fd, len;
>> >> +
>> >> +	link = bpf_program__attach_iter(prog, NULL);
>> >> +	if (!ASSERT_OK_PTR(link, "attach_iter"))
>> >> +		return;
>> >> +
>> >> +	iter_fd = bpf_iter_create(bpf_link__fd(link));
>> >> +	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
>> >> +		goto free_link;
>> >> +
>> >> +	while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
>> >> +		;
>> >> +	ASSERT_GE(len, 0, "read");
>> >> +
>> >> +	close(iter_fd);
>> >> +
>> >> +free_link:
>> >> +	bpf_link__destroy(link);
>> >> +}
>> >> +
>> >> +static void test_tcp_client(struct sock_destroy_prog *skel)
>> >> +{
>> >> +	int serv = -1, clien = -1, n = 0;
>> >> +
>> >> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
>> >> +	if (!ASSERT_GE(serv, 0, "start_server"))
>> >> +		goto cleanup_serv;
>> >> +
>> >> +	clien = connect_to_fd(serv, 0);
>> >> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>> >> +		goto cleanup_serv;
>> >> +
>> >> +	serv = accept(serv, NULL, NULL);
>> >> +	if (!ASSERT_GE(serv, 0, "serv accept"))
>> >> +		goto cleanup;
>> >> +
>> >> +	n = send(clien, "t", 1, 0);
>> >> +	if (!ASSERT_GE(n, 0, "client send"))
>> >> +		goto cleanup;
>> >> +
>> >> +	/* Run iterator program that destroys connected client sockets. */
>> >> +	start_iter_sockets(skel->progs.iter_tcp6_client);
>> >> +
>> >> +	n = send(clien, "t", 1, 0);
>> >> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>> >> +		goto cleanup;
>> >> +	ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
>> >> +
>> >> +
>> >> +cleanup:
>> >> +	close(clien);
>> >> +cleanup_serv:
>> >> +	close(serv);
>> >> +}
>> >> +
>> >> +static void test_tcp_server(struct sock_destroy_prog *skel)
>> >> +{
>> >> +	int serv = -1, clien = -1, n = 0;
>> >> +
>> >> +	serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);
>> >> +	if (!ASSERT_GE(serv, 0, "start_server"))
>> >> +		goto cleanup_serv;
>> >> +
>> >> +	clien = connect_to_fd(serv, 0);
>> >> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>> >> +		goto cleanup_serv;
>> >> +
>> >> +	serv = accept(serv, NULL, NULL);
>> >> +	if (!ASSERT_GE(serv, 0, "serv accept"))
>> >> +		goto cleanup;
>> >> +
>> >> +	n = send(clien, "t", 1, 0);
>> >> +	if (!ASSERT_GE(n, 0, "client send"))
>> >> +		goto cleanup;
>> >> +
>> >> +	/* Run iterator program that destroys server sockets. */
>> >> +	start_iter_sockets(skel->progs.iter_tcp6_server);
>> >> +
>> >> +	n = send(clien, "t", 1, 0);
>> >> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>> >> +		goto cleanup;
>> >> +	ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
>> >> +
>> >> +
>> >> +cleanup:
>> >> +	close(clien);
>> >> +cleanup_serv:
>> >> +	close(serv);
>> >> +}
>> >> +
>> >> +
>> >> +static void test_udp_client(struct sock_destroy_prog *skel)
>> >> +{
>> >> +	int serv = -1, clien = -1, n = 0;
>> >> +
>> >> +	serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
>> >> +	if (!ASSERT_GE(serv, 0, "start_server"))
>> >> +		goto cleanup_serv;
>> >> +
>> >> +	clien = connect_to_fd(serv, 0);
>> >> +	if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>> >> +		goto cleanup_serv;
>> >> +
>> >> +	n = send(clien, "t", 1, 0);
>> >> +	if (!ASSERT_GE(n, 0, "client send"))
>> >> +		goto cleanup;
>> >> +
>> >> +	/* Run iterator program that destroys sockets. */
>> >> +	start_iter_sockets(skel->progs.iter_udp6_client);
>> >> +
>> >> +	n = send(clien, "t", 1, 0);
>> >> +	if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>> >> +		goto cleanup;
>> >> +	/* UDP sockets have an overriding error code after they are disconnected,
>> >> +	 * so we don't check for ECONNABORTED error code.
>> >> +	 */
>> >> +
>> >> +cleanup:
>> >> +	close(clien);
>> >> +cleanup_serv:
>> >> +	close(serv);
>> >> +}
>> >> +
>> >> +static void test_udp_server(struct sock_destroy_prog *skel)
>> >> +{
>> >> +	int *listen_fds = NULL, n, i;
>> >> +	unsigned int num_listens = 5;
>> >> +	char buf[1];
>> >> +
>> >> +	/* Start reuseport servers. */
>> >> +	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
>> >> +					    "::1", SERVER_PORT, 0,
>> >> +					    num_listens);
>> >> +	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
>> >> +		goto cleanup;
>> >> +
>> >> +	/* Run iterator program that destroys server sockets. */
>> >> +	start_iter_sockets(skel->progs.iter_udp6_server);
>> >> +
>> >> +	for (i = 0; i < num_listens; ++i) {
>> >> +		n = read(listen_fds[i], buf, sizeof(buf));
>> >> +		if (!ASSERT_EQ(n, -1, "read") ||
>> >> +		    !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
>> >> +			break;
>> >> +	}
>> >> +	ASSERT_EQ(i, num_listens, "server socket");
>> >> +
>> >> +cleanup:
>> >> +	free_fds(listen_fds, num_listens);
>> >> +}
>> >> +
>> >> +void test_sock_destroy(void)
>> >> +{
>> >> +	int cgroup_fd = 0;
>> >> +	struct sock_destroy_prog *skel;
>> >> +
>> >> +	skel = sock_destroy_prog__open_and_load();
>> >> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
>> >> +		return;
>> >> +
>> >> +	cgroup_fd = test__join_cgroup("/sock_destroy");
>> >> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
>> >> +		goto close_cgroup_fd;
>> >> +
>> >> +	skel->links.sock_connect = bpf_program__attach_cgroup(
>> >> +		skel->progs.sock_connect, cgroup_fd);
>> >> +	if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
>> >> +		goto close_cgroup_fd;
>> >> +
>> >> +	if (test__start_subtest("tcp_client"))
>> >> +		test_tcp_client(skel);
>> >> +	if (test__start_subtest("tcp_server"))
>> >> +		test_tcp_server(skel);
>> >> +	if (test__start_subtest("udp_client"))
>> >> +		test_udp_client(skel);
>> >> +	if (test__start_subtest("udp_server"))
>> >> +		test_udp_server(skel);
>> >> +
>> >> +
>> >> +close_cgroup_fd:
>> >> +	close(cgroup_fd);
>> >> +	sock_destroy_prog__destroy(skel);
>> >> +}
>> >> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>> >> new file mode 100644
>> >> index 000000000000..8e09d82c50f3
>> >> --- /dev/null
>> >> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>> >> @@ -0,0 +1,151 @@
>> >> +// SPDX-License-Identifier: GPL-2.0
>> >> +
>> >> +#include "vmlinux.h"
>> >> +
>> >> +#include "bpf_tracing_net.h"
>> >> +#include <bpf/bpf_helpers.h>
>> >> +#include <bpf/bpf_endian.h>
>> >> +
>> >> +#define AF_INET6 10
>> >
>> > [..]
>> >
>> >> +/* Keep it in sync with prog_test/sock_destroy. */
>> >> +#define SERVER_PORT 6062
>> >
>> > The test looks good, one optional unrelated nit maybe:
>> >
>> > I've been guilty of these hard-coded ports in the past, but maybe
>> > we should stop hard-coding them? Getting the address of the listener (bound to
>> > port 0) and passing it to the bpf program via global variable should be super
>> > easy now (with the skeletons and network_helpers).
> 
> 
>> I briefly considered adding the ports in a map, and retrieving them in the test. But it didn't seem worthwhile as the tests should fail clearly when there is a mismatch.
> 
> My worry is that the number of those tests that have a hard-coded port
> grows, and at some point somebody will clash with somebody else.
> And it might not be 100% apparent because test_progs is now multi-threaded
> and racy..
> 

So you would like the ports to be unique across all the tests. 

>Getting the address of the listener (bound to
> port 0) and passing it to the bpf program via global variable should be super
> easy now (with the skeletons and network_helpers).

Just so that we are on the same page, could you point to which network helpers you are referring to here for passing global variables?
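
For reference, a rough sketch of the bind-to-port-0 approach; the names are made
up, and the only assumptions are the libbpf skeleton's mmapped globals plus a
plain getsockname() call. In the BPF program, the hard-coded SERVER_PORT becomes
a global:

	__u16 serv_port;	/* set from user space via the skeleton */

	if (srcp == serv_port)
		bpf_sock_destroy(sk_common);

and in the test, after start_server(..., /* port */ 0, 0):

	struct sockaddr_in6 addr;
	socklen_t len = sizeof(addr);

	if (!ASSERT_OK(getsockname(serv, (struct sockaddr *)&addr, &len), "getsockname"))
		goto cleanup;
	skel->bss->serv_port = ntohs(addr.sin6_port);

so the iterator program filters on whatever port the kernel actually picked.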


>> >
>> > And, unrelated, maybe also fix a bunch of places where the reverse christmas
>> > tree doesn't look reverse anymore?
> 
>> Ok. The checks should be part of tooling (e.g., checkpatch) though if they are meant to be enforced consistently, no?
> 
> They are networking specific, so they are not part of checkpatch :-(
> I won't say they are consistently enforced, but we try to keep them
> whenever possible.
> 
>> >
>> >> +
>> >> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
>> >> +
>> >> +struct {
>> >> +	__uint(type, BPF_MAP_TYPE_ARRAY);
>> >> +	__uint(max_entries, 1);
>> >> +	__type(key, __u32);
>> >> +	__type(value, __u64);
>> >> +} tcp_conn_sockets SEC(".maps");
>> >> +
>> >> +struct {
>> >> +	__uint(type, BPF_MAP_TYPE_ARRAY);
>> >> +	__uint(max_entries, 1);
>> >> +	__type(key, __u32);
>> >> +	__type(value, __u64);
>> >> +} udp_conn_sockets SEC(".maps");
>> >> +
>> >> +SEC("cgroup/connect6")
>> >> +int sock_connect(struct bpf_sock_addr *ctx)
>> >> +{
>> >> +	int key = 0;
>> >> +	__u64 sock_cookie = 0;
>> >> +	__u32 keyc = 0;
>> >> +
>> >> +	if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
>> >> +		return 1;
>> >> +
>> >> +	sock_cookie = bpf_get_socket_cookie(ctx);
>> >> +	if (ctx->protocol == IPPROTO_TCP)
>> >> +		bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
>> >> +	else if (ctx->protocol == IPPROTO_UDP)
>> >> +		bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
>> >> +	else
>> >> +		return 1;
>> >> +
>> >> +	return 1;
>> >> +}
>> >> +
>> >> +SEC("iter/tcp")
>> >> +int iter_tcp6_client(struct bpf_iter__tcp *ctx)
>> >> +{
>> >> +	struct sock_common *sk_common = ctx->sk_common;
>> >> +	struct seq_file *seq = ctx->meta->seq;
>> >> +	__u64 sock_cookie = 0;
>> >> +	__u64 *val;
>> >> +	int key = 0;
>> >> +
>> >> +	if (!sk_common)
>> >> +		return 0;
>> >> +
>> >> +	if (sk_common->skc_family != AF_INET6)
>> >> +		return 0;
>> >> +
>> >> +	sock_cookie  = bpf_get_socket_cookie(sk_common);
>> >> +	val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
>> >> +	if (!val)
>> >> +		return 0;
>> >> +	/* Destroy connected client sockets. */
>> >> +	if (sock_cookie == *val)
>> >> +		bpf_sock_destroy(sk_common);
>> >> +
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +SEC("iter/tcp")
>> >> +int iter_tcp6_server(struct bpf_iter__tcp *ctx)
>> >> +{
>> >> +	struct sock_common *sk_common = ctx->sk_common;
>> >> +	struct seq_file *seq = ctx->meta->seq;
>> >> +	struct tcp6_sock *tcp_sk;
>> >> +	const struct inet_connection_sock *icsk;
>> >> +	const struct inet_sock *inet;
>> >> +	__u16 srcp;
>> >> +
>> >> +	if (!sk_common)
>> >> +		return 0;
>> >> +
>> >> +	if (sk_common->skc_family != AF_INET6)
>> >> +		return 0;
>> >> +
>> >> +	tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
>> >> +	if (!tcp_sk)
>> >> +		return 0;
>> >> +
>> >> +	icsk = &tcp_sk->tcp.inet_conn;
>> >> +	inet = &icsk->icsk_inet;
>> >> +	srcp = bpf_ntohs(inet->inet_sport);
>> >> +
>> >> +	/* Destroy server sockets. */
>> >> +	if (srcp == SERVER_PORT)
>> >> +		bpf_sock_destroy(sk_common);
>> >> +
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +
>> >> +SEC("iter/udp")
>> >> +int iter_udp6_client(struct bpf_iter__udp *ctx)
>> >> +{
>> >> +	struct seq_file *seq = ctx->meta->seq;
>> >> +	struct udp_sock *udp_sk = ctx->udp_sk;
>> >> +	struct sock *sk = (struct sock *) udp_sk;
>> >> +	__u64 sock_cookie = 0, *val;
>> >> +	int key = 0;
>> >> +
>> >> +	if (!sk)
>> >> +		return 0;
>> >> +
>> >> +	sock_cookie  = bpf_get_socket_cookie(sk);
>> >> +	val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
>> >> +	if (!val)
>> >> +		return 0;
>> >> +	/* Destroy connected client sockets. */
>> >> +	if (sock_cookie == *val)
>> >> +		bpf_sock_destroy((struct sock_common *)sk);
>> >> +
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +SEC("iter/udp")
>> >> +int iter_udp6_server(struct bpf_iter__udp *ctx)
>> >> +{
>> >> +	struct seq_file *seq = ctx->meta->seq;
>> >> +	struct udp_sock *udp_sk = ctx->udp_sk;
>> >> +	struct sock *sk = (struct sock *) udp_sk;
>> >> +	__u16 srcp;
>> >> +	struct inet_sock *inet;
>> >> +
>> >> +	if (!sk)
>> >> +		return 0;
>> >> +
>> >> +	inet = &udp_sk->inet;
>> >> +	srcp = bpf_ntohs(inet->inet_sport);
>> >> +	if (srcp == SERVER_PORT)
>> >> +		bpf_sock_destroy((struct sock_common *)sk);
>> >> +
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +char _license[] SEC("license") = "GPL";
>> >> --
>> >> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-28 17:50         ` Aditi Ghag
@ 2023-03-28 18:35           ` Stanislav Fomichev
  2023-03-29 23:13             ` Aditi Ghag
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-03-28 18:35 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On Tue, Mar 28, 2023 at 10:51 AM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
>
>
> > On Mar 27, 2023, at 9:54 AM, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/27, Aditi Ghag wrote:
> >
> >
> >> > On Mar 24, 2023, at 2:52 PM, Stanislav Fomichev <sdf@google.com> wrote:
> >> >
> >> > On 03/23, Aditi Ghag wrote:
> >> >> The test cases for destroying sockets mirror the intended usages of the
> >> >> bpf_sock_destroy kfunc using iterators.
> >> >
> >> >> The destroy helpers set `ECONNABORTED` error code that we can validate in
> >> >> the test code with client sockets. But UDP sockets have an overriding error
> >> >> code from the disconnect called during abort, so the error code
> >> >> validation is only done for TCP sockets.
> >> >
> >> >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> >> >> ---
> >> >>  .../selftests/bpf/prog_tests/sock_destroy.c   | 195 ++++++++++++++++++
> >> >>  .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
> >> >>  2 files changed, 346 insertions(+)
> >> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> >> >>  create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> >> >
> >> >> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> >> >> new file mode 100644
> >> >> index 000000000000..cbce966af568
> >> >> --- /dev/null
> >> >> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> >> >> @@ -0,0 +1,195 @@
> >> >> +// SPDX-License-Identifier: GPL-2.0
> >> >> +#include <test_progs.h>
> >> >> +
> >> >> +#include "sock_destroy_prog.skel.h"
> >> >> +#include "network_helpers.h"
> >> >> +
> >> >> +#define SERVER_PORT 6062
> >> >> +
> >> >> +static void start_iter_sockets(struct bpf_program *prog)
> >> >> +{
> >> >> + struct bpf_link *link;
> >> >> + char buf[50] = {};
> >> >> + int iter_fd, len;
> >> >> +
> >> >> + link = bpf_program__attach_iter(prog, NULL);
> >> >> + if (!ASSERT_OK_PTR(link, "attach_iter"))
> >> >> +         return;
> >> >> +
> >> >> + iter_fd = bpf_iter_create(bpf_link__fd(link));
> >> >> + if (!ASSERT_GE(iter_fd, 0, "create_iter"))
> >> >> +         goto free_link;
> >> >> +
> >> >> + while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
> >> >> +         ;
> >> >> + ASSERT_GE(len, 0, "read");
> >> >> +
> >> >> + close(iter_fd);
> >> >> +
> >> >> +free_link:
> >> >> + bpf_link__destroy(link);
> >> >> +}
> >> >> +
> >> >> +static void test_tcp_client(struct sock_destroy_prog *skel)
> >> >> +{
> >> >> + int serv = -1, clien = -1, n = 0;
> >> >> +
> >> >> + serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> >> >> + if (!ASSERT_GE(serv, 0, "start_server"))
> >> >> +         goto cleanup_serv;
> >> >> +
> >> >> + clien = connect_to_fd(serv, 0);
> >> >> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> >> >> +         goto cleanup_serv;
> >> >> +
> >> >> + serv = accept(serv, NULL, NULL);
> >> >> + if (!ASSERT_GE(serv, 0, "serv accept"))
> >> >> +         goto cleanup;
> >> >> +
> >> >> + n = send(clien, "t", 1, 0);
> >> >> + if (!ASSERT_GE(n, 0, "client send"))
> >> >> +         goto cleanup;
> >> >> +
> >> >> + /* Run iterator program that destroys connected client sockets. */
> >> >> + start_iter_sockets(skel->progs.iter_tcp6_client);
> >> >> +
> >> >> + n = send(clien, "t", 1, 0);
> >> >> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> >> >> +         goto cleanup;
> >> >> + ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
> >> >> +
> >> >> +
> >> >> +cleanup:
> >> >> + close(clien);
> >> >> +cleanup_serv:
> >> >> + close(serv);
> >> >> +}
> >> >> +
> >> >> +static void test_tcp_server(struct sock_destroy_prog *skel)
> >> >> +{
> >> >> + int serv = -1, clien = -1, n = 0;
> >> >> +
> >> >> + serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);
> >> >> + if (!ASSERT_GE(serv, 0, "start_server"))
> >> >> +         goto cleanup_serv;
> >> >> +
> >> >> + clien = connect_to_fd(serv, 0);
> >> >> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> >> >> +         goto cleanup_serv;
> >> >> +
> >> >> + serv = accept(serv, NULL, NULL);
> >> >> + if (!ASSERT_GE(serv, 0, "serv accept"))
> >> >> +         goto cleanup;
> >> >> +
> >> >> + n = send(clien, "t", 1, 0);
> >> >> + if (!ASSERT_GE(n, 0, "client send"))
> >> >> +         goto cleanup;
> >> >> +
> >> >> + /* Run iterator program that destroys server sockets. */
> >> >> + start_iter_sockets(skel->progs.iter_tcp6_server);
> >> >> +
> >> >> + n = send(clien, "t", 1, 0);
> >> >> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> >> >> +         goto cleanup;
> >> >> + ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
> >> >> +
> >> >> +
> >> >> +cleanup:
> >> >> + close(clien);
> >> >> +cleanup_serv:
> >> >> + close(serv);
> >> >> +}
> >> >> +
> >> >> +
> >> >> +static void test_udp_client(struct sock_destroy_prog *skel)
> >> >> +{
> >> >> + int serv = -1, clien = -1, n = 0;
> >> >> +
> >> >> + serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
> >> >> + if (!ASSERT_GE(serv, 0, "start_server"))
> >> >> +         goto cleanup_serv;
> >> >> +
> >> >> + clien = connect_to_fd(serv, 0);
> >> >> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> >> >> +         goto cleanup_serv;
> >> >> +
> >> >> + n = send(clien, "t", 1, 0);
> >> >> + if (!ASSERT_GE(n, 0, "client send"))
> >> >> +         goto cleanup;
> >> >> +
> >> >> + /* Run iterator program that destroys sockets. */
> >> >> + start_iter_sockets(skel->progs.iter_udp6_client);
> >> >> +
> >> >> + n = send(clien, "t", 1, 0);
> >> >> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> >> >> +         goto cleanup;
> >> >> + /* UDP sockets have an overriding error code after they are disconnected,
> >> >> +  * so we don't check for ECONNABORTED error code.
> >> >> +  */
> >> >> +
> >> >> +cleanup:
> >> >> + close(clien);
> >> >> +cleanup_serv:
> >> >> + close(serv);
> >> >> +}
> >> >> +
> >> >> +static void test_udp_server(struct sock_destroy_prog *skel)
> >> >> +{
> >> >> + int *listen_fds = NULL, n, i;
> >> >> + unsigned int num_listens = 5;
> >> >> + char buf[1];
> >> >> +
> >> >> + /* Start reuseport servers. */
> >> >> + listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
> >> >> +                                     "::1", SERVER_PORT, 0,
> >> >> +                                     num_listens);
> >> >> + if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
> >> >> +         goto cleanup;
> >> >> +
> >> >> + /* Run iterator program that destroys server sockets. */
> >> >> + start_iter_sockets(skel->progs.iter_udp6_server);
> >> >> +
> >> >> + for (i = 0; i < num_listens; ++i) {
> >> >> +         n = read(listen_fds[i], buf, sizeof(buf));
> >> >> +         if (!ASSERT_EQ(n, -1, "read") ||
> >> >> +             !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
> >> >> +                 break;
> >> >> + }
> >> >> + ASSERT_EQ(i, num_listens, "server socket");
> >> >> +
> >> >> +cleanup:
> >> >> + free_fds(listen_fds, num_listens);
> >> >> +}
> >> >> +
> >> >> +void test_sock_destroy(void)
> >> >> +{
> >> >> + int cgroup_fd = 0;
> >> >> + struct sock_destroy_prog *skel;
> >> >> +
> >> >> + skel = sock_destroy_prog__open_and_load();
> >> >> + if (!ASSERT_OK_PTR(skel, "skel_open"))
> >> >> +         return;
> >> >> +
> >> >> + cgroup_fd = test__join_cgroup("/sock_destroy");
> >> >> + if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> >> >> +         goto close_cgroup_fd;
> >> >> +
> >> >> + skel->links.sock_connect = bpf_program__attach_cgroup(
> >> >> +         skel->progs.sock_connect, cgroup_fd);
> >> >> + if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
> >> >> +         goto close_cgroup_fd;
> >> >> +
> >> >> + if (test__start_subtest("tcp_client"))
> >> >> +         test_tcp_client(skel);
> >> >> + if (test__start_subtest("tcp_server"))
> >> >> +         test_tcp_server(skel);
> >> >> + if (test__start_subtest("udp_client"))
> >> >> +         test_udp_client(skel);
> >> >> + if (test__start_subtest("udp_server"))
> >> >> +         test_udp_server(skel);
> >> >> +
> >> >> +
> >> >> +close_cgroup_fd:
> >> >> + close(cgroup_fd);
> >> >> + sock_destroy_prog__destroy(skel);
> >> >> +}
> >> >> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> >> >> new file mode 100644
> >> >> index 000000000000..8e09d82c50f3
> >> >> --- /dev/null
> >> >> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> >> >> @@ -0,0 +1,151 @@
> >> >> +// SPDX-License-Identifier: GPL-2.0
> >> >> +
> >> >> +#include "vmlinux.h"
> >> >> +
> >> >> +#include "bpf_tracing_net.h"
> >> >> +#include <bpf/bpf_helpers.h>
> >> >> +#include <bpf/bpf_endian.h>
> >> >> +
> >> >> +#define AF_INET6 10
> >> >
> >> > [..]
> >> >
> >> >> +/* Keep it in sync with prog_test/sock_destroy. */
> >> >> +#define SERVER_PORT 6062
> >> >
> >> > The test looks good, one optional unrelated nit maybe:
> >> >
> >> > I've been guilty of these hard-coded ports in the past, but maybe
> >> > we should stop hard-coding them? Getting the address of the listener (bound to
> >> > port 0) and passing it to the bpf program via global variable should be super
> >> > easy now (with the skeletons and network_helpers).
> >
> >
> >> I briefly considered adding the ports in a map, and retrieving them in the test. But it didn't seem worthwhile as the tests should fail clearly when there is a mismatch.
> >
> > My worry is that the amount of those tests that have a hard-coded port
> > grows and at some point somebody will clash with somebody else.
> > And it might not be 100% apparent because test_progs is now multi-threaded
> > and racy..
> >
>
> So you would like the ports to be unique across all the tests.

Yeah, but it's hard without having some kind of global registry. Take
a look at the following:

$ grep -Iri _port tools/testing/selftests/bpf/ | grep -P '\d{4}'

tools/testing/selftests/bpf/progs/connect_force_port4.c:        sa.sin_port = bpf_htons(22222);
tools/testing/selftests/bpf/progs/connect_force_port4.c:        if (ctx->user_port == bpf_htons(60000)) {
tools/testing/selftests/bpf/progs/connect_force_port4.c:                ctx->user_port = bpf_htons(60123);
tools/testing/selftests/bpf/progs/connect_force_port4.c:        if (ctx->user_port == bpf_htons(60123)) {
tools/testing/selftests/bpf/progs/connect_force_port4.c:                ctx->user_port = bpf_htons(60000);
tools/testing/selftests/bpf/progs/connect_force_port4.c:        if (ctx->user_port == bpf_htons(60123)) {
tools/testing/selftests/bpf/progs/connect6_prog.c:#define DST_REWRITE_PORT6     6666
tools/testing/selftests/bpf/progs/test_sk_lookup.c:static const __u16 SRC_PORT = bpf_htons(8008);
tools/testing/selftests/bpf/progs/test_sk_lookup.c:static const __u16 DST_PORT = 7007; /* Host byte order */
tools/testing/selftests/bpf/progs/test_tc_dtime.c:      __u16 dst_ns_port = __bpf_htons(50000 + test);
tools/testing/selftests/bpf/progs/connect4_dropper.c:   if (ctx->user_port == bpf_htons(60120))
tools/testing/selftests/bpf/progs/connect_force_port6.c:        sa.sin6_port = bpf_htons(22223);
tools/testing/selftests/bpf/progs/connect_force_port6.c:        if (ctx->user_port == bpf_htons(60000)) {
tools/testing/selftests/bpf/progs/connect_force_port6.c:                ctx->user_port = bpf_htons(60124);
tools/testing/selftests/bpf/progs/connect_force_port6.c:        if (ctx->user_port == bpf_htons(60124)) {
tools/testing/selftests/bpf/progs/connect_force_port6.c:                ctx->user_port = bpf_htons(60000);
tools/testing/selftests/bpf/progs/connect_force_port6.c:        if (ctx->user_port == bpf_htons(60124)) {
tools/testing/selftests/bpf/progs/test_tunnel_kern.c:#define VXLAN_UDP_PORT 4789
tools/testing/selftests/bpf/progs/sendmsg4_prog.c:#define DST_PORT          4040
tools/testing/selftests/bpf/progs/sendmsg4_prog.c:#define DST_REWRITE_PORT4     4444
tools/testing/selftests/bpf/progs/connect4_prog.c:#define DST_REWRITE_PORT4     4444
tools/testing/selftests/bpf/progs/bind6_prog.c:#define SERV6_PORT          6060
tools/testing/selftests/bpf/progs/bind6_prog.c:#define SERV6_REWRITE_PORT       6666
tools/testing/selftests/bpf/progs/sendmsg6_prog.c:#define DST_REWRITE_PORT6     6666
tools/testing/selftests/bpf/progs/recvmsg4_prog.c:#define SERV4_PORT          4040
<cut>

.... there is much more ...

> >Getting the address of the listener (bound to
> > port 0) and passing it to the bpf program via global variable should be super
> > easy now (with the skeletons and network_helpers).
>
> Just so that we are on the same page, could you point to which network helpers you are referring to here for passing global variables?

Take a look at the following existing tests:
* prog_tests/cgroup_skb_sk_lookup.c
  * run_lookup_test(&skel->bss->g_serv_port, out_sk);
* progs/cgroup_skb_sk_lookup_kern.c
  * g_serv_port

Fundamentally, here is what's preferable to have:

fd = start_server(..., port=0, ...);
listener_port = get_port(fd); /* new network_helpers.h helper that
calls getsockname */
obj->bss->port = listener_port; /* populate the port in the BPF program */

Does it make sense?
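
For illustration, here is a minimal sketch of that flow. get_port() is a
hypothetical helper (it does not exist in network_helpers.h yet), and the
'serv_port' global in the BPF program is likewise made up:

	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <linux/types.h>

	/* Return the locally bound port of 'fd' in host byte order. */
	static int get_port(int fd, __u16 *port)
	{
		struct sockaddr_storage addr;
		socklen_t len = sizeof(addr);

		if (getsockname(fd, (struct sockaddr *)&addr, &len))
			return -1;

		switch (addr.ss_family) {
		case AF_INET:
			*port = ntohs(((struct sockaddr_in *)&addr)->sin_port);
			break;
		case AF_INET6:
			*port = ntohs(((struct sockaddr_in6 *)&addr)->sin6_port);
			break;
		default:
			return -1;
		}
		return 0;
	}

	/* In the test: bind to port 0 and publish the chosen port to the
	 * BPF program instead of relying on a hard-coded SERVER_PORT.
	 */
	__u16 port;

	serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
	if (!ASSERT_GE(serv, 0, "start_server"))
		goto cleanup_serv;
	if (!ASSERT_OK(get_port(serv, &port), "get_port"))
		goto cleanup_serv;
	skel->bss->serv_port = port;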

> >> >
> >> > And, unrelated, maybe also fix a bunch of places where the reverse christmas
> >> > tree doesn't look reverse anymore?
> >
> >> Ok. The checks should be part of tooling (e.g., checkpatch) though if they are meant to be enforced consistently, no?
> >
> > They are networking specific, so they are not part of checkpatch :-(
> > I won't say they are consistently enforced, but we try to keep them
> > whenever possible.
> >
> >> >
> >> >> +
> >> >> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
> >> >> +
> >> >> +struct {
> >> >> + __uint(type, BPF_MAP_TYPE_ARRAY);
> >> >> + __uint(max_entries, 1);
> >> >> + __type(key, __u32);
> >> >> + __type(value, __u64);
> >> >> +} tcp_conn_sockets SEC(".maps");
> >> >> +
> >> >> +struct {
> >> >> + __uint(type, BPF_MAP_TYPE_ARRAY);
> >> >> + __uint(max_entries, 1);
> >> >> + __type(key, __u32);
> >> >> + __type(value, __u64);
> >> >> +} udp_conn_sockets SEC(".maps");
> >> >> +
> >> >> +SEC("cgroup/connect6")
> >> >> +int sock_connect(struct bpf_sock_addr *ctx)
> >> >> +{
> >> >> + int key = 0;
> >> >> + __u64 sock_cookie = 0;
> >> >> + __u32 keyc = 0;
> >> >> +
> >> >> + if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
> >> >> +         return 1;
> >> >> +
> >> >> + sock_cookie = bpf_get_socket_cookie(ctx);
> >> >> + if (ctx->protocol == IPPROTO_TCP)
> >> >> +         bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
> >> >> + else if (ctx->protocol == IPPROTO_UDP)
> >> >> +         bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
> >> >> + else
> >> >> +         return 1;
> >> >> +
> >> >> + return 1;
> >> >> +}
> >> >> +
> >> >> +SEC("iter/tcp")
> >> >> +int iter_tcp6_client(struct bpf_iter__tcp *ctx)
> >> >> +{
> >> >> + struct sock_common *sk_common = ctx->sk_common;
> >> >> + struct seq_file *seq = ctx->meta->seq;
> >> >> + __u64 sock_cookie = 0;
> >> >> + __u64 *val;
> >> >> + int key = 0;
> >> >> +
> >> >> + if (!sk_common)
> >> >> +         return 0;
> >> >> +
> >> >> + if (sk_common->skc_family != AF_INET6)
> >> >> +         return 0;
> >> >> +
> >> >> + sock_cookie  = bpf_get_socket_cookie(sk_common);
> >> >> + val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
> >> >> + if (!val)
> >> >> +         return 0;
> >> >> + /* Destroy connected client sockets. */
> >> >> + if (sock_cookie == *val)
> >> >> +         bpf_sock_destroy(sk_common);
> >> >> +
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +SEC("iter/tcp")
> >> >> +int iter_tcp6_server(struct bpf_iter__tcp *ctx)
> >> >> +{
> >> >> + struct sock_common *sk_common = ctx->sk_common;
> >> >> + struct seq_file *seq = ctx->meta->seq;
> >> >> + struct tcp6_sock *tcp_sk;
> >> >> + const struct inet_connection_sock *icsk;
> >> >> + const struct inet_sock *inet;
> >> >> + __u16 srcp;
> >> >> +
> >> >> + if (!sk_common)
> >> >> +         return 0;
> >> >> +
> >> >> + if (sk_common->skc_family != AF_INET6)
> >> >> +         return 0;
> >> >> +
> >> >> + tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
> >> >> + if (!tcp_sk)
> >> >> +         return 0;
> >> >> +
> >> >> + icsk = &tcp_sk->tcp.inet_conn;
> >> >> + inet = &icsk->icsk_inet;
> >> >> + srcp = bpf_ntohs(inet->inet_sport);
> >> >> +
> >> >> + /* Destroy server sockets. */
> >> >> + if (srcp == SERVER_PORT)
> >> >> +         bpf_sock_destroy(sk_common);
> >> >> +
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +
> >> >> +SEC("iter/udp")
> >> >> +int iter_udp6_client(struct bpf_iter__udp *ctx)
> >> >> +{
> >> >> + struct seq_file *seq = ctx->meta->seq;
> >> >> + struct udp_sock *udp_sk = ctx->udp_sk;
> >> >> + struct sock *sk = (struct sock *) udp_sk;
> >> >> + __u64 sock_cookie = 0, *val;
> >> >> + int key = 0;
> >> >> +
> >> >> + if (!sk)
> >> >> +         return 0;
> >> >> +
> >> >> + sock_cookie  = bpf_get_socket_cookie(sk);
> >> >> + val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
> >> >> + if (!val)
> >> >> +         return 0;
> >> >> + /* Destroy connected client sockets. */
> >> >> + if (sock_cookie == *val)
> >> >> +         bpf_sock_destroy((struct sock_common *)sk);
> >> >> +
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +SEC("iter/udp")
> >> >> +int iter_udp6_server(struct bpf_iter__udp *ctx)
> >> >> +{
> >> >> + struct seq_file *seq = ctx->meta->seq;
> >> >> + struct udp_sock *udp_sk = ctx->udp_sk;
> >> >> + struct sock *sk = (struct sock *) udp_sk;
> >> >> + __u16 srcp;
> >> >> + struct inet_sock *inet;
> >> >> +
> >> >> + if (!sk)
> >> >> +         return 0;
> >> >> +
> >> >> + inet = &udp_sk->inet;
> >> >> + srcp = bpf_ntohs(inet->inet_sport);
> >> >> + if (srcp == SERVER_PORT)
> >> >> +         bpf_sock_destroy((struct sock_common *)sk);
> >> >> +
> >> >> + return 0;
> >> >> +}
> >> >> +
> >> >> +char _license[] SEC("license") = "GPL";
> >> >> --
> >> >> 2.34.1
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator
  2023-03-28 17:06     ` Aditi Ghag
@ 2023-03-28 21:33       ` Martin KaFai Lau
  2023-03-29 16:20         ` Aditi Ghag
  0 siblings, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2023-03-28 21:33 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: sdf, edumazet, Martin KaFai Lau, bpf

On 3/28/23 10:06 AM, Aditi Ghag wrote:
>>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>> +{
>>> +	struct bpf_udp_iter_state *iter = seq->private;
>>> +	struct udp_iter_state *state = &iter->state;
>>> +	struct net *net = seq_file_net(seq);
>>> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
>>> +	struct udp_table *udptable;
>>> +	struct sock *first_sk = NULL;
>>> +	struct sock *sk;
>>> +	unsigned int bucket_sks = 0;
>>> +	bool resized = false;
>>> +	int offset = 0;
>>> +	int new_offset;
>>> +
>>> +	/* The current batch is done, so advance the bucket. */
>>> +	if (iter->st_bucket_done) {
>>> +		state->bucket++;
>>> +		state->offset = 0;
>>> +	}
>>> +
>>> +	udptable = udp_get_table_afinfo(afinfo, net);
>>> +
>>> +	if (state->bucket > udptable->mask) {
>>> +		state->bucket = 0;
>>> +		state->offset = 0;
>>> +		return NULL;
>>> +	}
>>> +
>>> +again:
>>> +	/* New batch for the next bucket.
>>> +	 * Iterate over the hash table to find a bucket with sockets matching
>>> +	 * the iterator attributes, and return the first matching socket from
>>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>>> +	 * before releasing the bucket lock. This allows BPF programs that are
>>> +	 * called in seq_show to acquire the bucket lock if needed.
>>> +	 */
>>> +	iter->cur_sk = 0;
>>> +	iter->end_sk = 0;
>>> +	iter->st_bucket_done = false;
>>> +	first_sk = NULL;
>>> +	bucket_sks = 0;
>>> +	offset = state->offset;
>>> +	new_offset = offset;
>>> +
>>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>>> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];
>>
>> Use udptable->hash"2" which is hashed by addr and port. It will help to get a smaller batch. It was the comment given in v2.
> 
> I thought I replied to your review comment, but looks like I didn't. My bad!
> I already gave it a shot, and I'll need to understand better how udptable->hash2 is populated. When I swapped hash with hash2, there were no sockets to iterate. Am I missing something obvious?

Take a look at udp_lib_lport_inuse2() on how it iterates.
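
For context, the hash2 buckets chain sockets through skc_portaddr_node
rather than skc_node, so they have to be walked with
udp_portaddr_for_each_entry() instead of sk_for_each(); that may be why
swapping the tables alone turned up no sockets. A rough, untested sketch of
the pattern (paraphrased, not copied verbatim from the kernel):

	struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
	struct sock *sk;

	spin_lock_bh(&hslot2->lock);
	udp_portaddr_for_each_entry(sk, &hslot2->head) {
		if (seq_sk_match(seq, sk)) {
			/* batch sk here, the same way bpf_iter_udp_batch()
			 * does for the main hash today
			 */
		}
	}
	spin_unlock_bh(&hslot2->lock);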

> 
>>
>>> +
>>> +		if (hlist_empty(&hslot->head)) {
>>> +			offset = 0;
>>> +			continue;
>>> +		}
>>> +
>>> +		spin_lock_bh(&hslot->lock);
>>> +		/* Resume from the last saved position in a bucket before
>>> +		 * iterator was stopped.
>>> +		 */
>>> +		while (offset-- > 0) {
>>> +			sk_for_each(sk, &hslot->head)
>>> +				continue;
>>> +		}
>>
>> hmm... how does the above while loop and sk_for_each loop actually work?
>>
>>> +		sk_for_each(sk, &hslot->head) {
>>
>> Here it starts from the beginning of hslot->head again, which doesn't look right either.
>>
>> Am I missing something here?
>>
>>> +			if (seq_sk_match(seq, sk)) {
>>> +				if (!first_sk)
>>> +					first_sk = sk;
>>> +				if (iter->end_sk < iter->max_sk) {
>>> +					sock_hold(sk);
>>> +					iter->batch[iter->end_sk++] = sk;
>>> +				}
>>> +				bucket_sks++;
>>> +			}
>>> +			new_offset++;
>>
>> And this new_offset is outside of seq_sk_match, so it is not counting for the seq_file_net(seq) netns alone.
> 
> This logic to resume iterator is buggy, indeed! So I was trying to account for the cases where the current bucket could've been updated since we release the bucket lock.
> This is what I intended to do -
> 
> +loop:
>                  sk_for_each(sk, &hslot->head) {
>                          if (seq_sk_match(seq, sk)) {
> +                               /* Resume from the last saved position in the
> +                                * bucket before iterator was stopped.
> +                                */
> +                               while (offset && offset-- > 0)
> +                                       goto loop;

still does not look right. merely a loop decrementing offset one at a time and 
then go back to the beginning of hslot->head?

A quick (untested and uncompiled) thought :

				/* Skip the first 'offset' number of sk
				 * and not putting them in the iter->batch[].
				 */
				if (offset) {
					offset--;
					continue;
				}

>                                  if (!first_sk)
>                                          first_sk = sk;
>                                  if (iter->end_sk < iter->max_sk) {
> @@ -3245,8 +3244,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>                                          iter->batch[iter->end_sk++] = sk;
>                                  }
>                                  bucket_sks++;
> +                              new_offset++;
>                          }
> 
> This handles the case when sockets that weren't iterated in the previous round got deleted by the time iterator was resumed. But it's possible that previously iterated sockets got deleted before the iterator was later resumed, and the offset is now outdated. Ideally, iterator should be invalidated in this case, but there is no way to track this, is there? Any thoughts?

I would not worry about this update in-between case. race will happen anyway 
when the bucket lock is released. This should be very unlikely when hash"2" is used.
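
Putting the two suggestions together (batching from the hash2 bucket plus
skipping the first 'offset' matches), the inner loop would look roughly
like this; an untested sketch assembled only from the snippets quoted
above, not the final patch:

	spin_lock_bh(&hslot2->lock);
	udp_portaddr_for_each_entry(sk, &hslot2->head) {
		if (seq_sk_match(seq, sk)) {
			/* Resume: skip the sockets from this bucket that
			 * were already shown before the iterator stopped.
			 */
			if (offset) {
				offset--;
				continue;
			}
			if (!first_sk)
				first_sk = sk;
			if (iter->end_sk < iter->max_sk) {
				sock_hold(sk);
				iter->batch[iter->end_sk++] = sk;
			}
			bucket_sks++;
		}
	}
	spin_unlock_bh(&hslot2->lock);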



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator
  2023-03-28 21:33       ` Martin KaFai Lau
@ 2023-03-29 16:20         ` Aditi Ghag
  0 siblings, 0 replies; 29+ messages in thread
From: Aditi Ghag @ 2023-03-29 16:20 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: Stanislav Fomichev, Eric Dumazet, Martin KaFai Lau, bpf



> On Mar 28, 2023, at 2:33 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/28/23 10:06 AM, Aditi Ghag wrote:
>>>> +static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>> +{
>>>> +	struct bpf_udp_iter_state *iter = seq->private;
>>>> +	struct udp_iter_state *state = &iter->state;
>>>> +	struct net *net = seq_file_net(seq);
>>>> +	struct udp_seq_afinfo *afinfo = state->bpf_seq_afinfo;
>>>> +	struct udp_table *udptable;
>>>> +	struct sock *first_sk = NULL;
>>>> +	struct sock *sk;
>>>> +	unsigned int bucket_sks = 0;
>>>> +	bool resized = false;
>>>> +	int offset = 0;
>>>> +	int new_offset;
>>>> +
>>>> +	/* The current batch is done, so advance the bucket. */
>>>> +	if (iter->st_bucket_done) {
>>>> +		state->bucket++;
>>>> +		state->offset = 0;
>>>> +	}
>>>> +
>>>> +	udptable = udp_get_table_afinfo(afinfo, net);
>>>> +
>>>> +	if (state->bucket > udptable->mask) {
>>>> +		state->bucket = 0;
>>>> +		state->offset = 0;
>>>> +		return NULL;
>>>> +	}
>>>> +
>>>> +again:
>>>> +	/* New batch for the next bucket.
>>>> +	 * Iterate over the hash table to find a bucket with sockets matching
>>>> +	 * the iterator attributes, and return the first matching socket from
>>>> +	 * the bucket. The remaining matched sockets from the bucket are batched
>>>> +	 * before releasing the bucket lock. This allows BPF programs that are
>>>> +	 * called in seq_show to acquire the bucket lock if needed.
>>>> +	 */
>>>> +	iter->cur_sk = 0;
>>>> +	iter->end_sk = 0;
>>>> +	iter->st_bucket_done = false;
>>>> +	first_sk = NULL;
>>>> +	bucket_sks = 0;
>>>> +	offset = state->offset;
>>>> +	new_offset = offset;
>>>> +
>>>> +	for (; state->bucket <= udptable->mask; state->bucket++) {
>>>> +		struct udp_hslot *hslot = &udptable->hash[state->bucket];
>>> 
>>> Use udptable->hash"2" which is hashed by addr and port. It will help to get a smaller batch. It was the comment given in v2.
>> I thought I replied to your review comment, but looks like I didn't. My bad!
>> I already gave it a shot, and I'll need to understand better how udptable->hash2 is populated. When I swapped hash with hash2, there were no sockets to iterate. Am I missing something obvious?
> 
> Take a look at udp_lib_lport_inuse2() on how it iterates.

Thanks! I've updated the code to use hash2 instead of hash.

> 
>>> 
>>>> +
>>>> +		if (hlist_empty(&hslot->head)) {
>>>> +			offset = 0;
>>>> +			continue;
>>>> +		}
>>>> +
>>>> +		spin_lock_bh(&hslot->lock);
>>>> +		/* Resume from the last saved position in a bucket before
>>>> +		 * iterator was stopped.
>>>> +		 */
>>>> +		while (offset-- > 0) {
>>>> +			sk_for_each(sk, &hslot->head)
>>>> +				continue;
>>>> +		}
>>> 
>>> hmm... how does the above while loop and sk_for_each loop actually work?
>>> 
>>>> +		sk_for_each(sk, &hslot->head) {
>>> 
>>> Here it starts from the beginning of hslot->head again, which doesn't look right either.
>>> 
>>> Am I missing something here?
>>> 
>>>> +			if (seq_sk_match(seq, sk)) {
>>>> +				if (!first_sk)
>>>> +					first_sk = sk;
>>>> +				if (iter->end_sk < iter->max_sk) {
>>>> +					sock_hold(sk);
>>>> +					iter->batch[iter->end_sk++] = sk;
>>>> +				}
>>>> +				bucket_sks++;
>>>> +			}
>>>> +			new_offset++;
>>> 
>>> And this new_offset is outside of seq_sk_match, so it is not counting for the seq_file_net(seq) netns alone.
>> This logic to resume iterator is buggy, indeed! So I was trying to account for the cases where the current bucket could've been updated since we release the bucket lock.
>> This is what I intended to do -
>> +loop:
>>                 sk_for_each(sk, &hslot->head) {
>>                         if (seq_sk_match(seq, sk)) {
>> +                               /* Resume from the last saved position in the
>> +                                * bucket before iterator was stopped.
>> +                                */
>> +                               while (offset && offset-- > 0)
>> +                                       goto loop;
> 
> still does not look right. merely a loop decrementing offset one at a time and then go back to the beginning of hslot->head?

Yes, I realized that the macro doesn't continue as I thought it would. I've fixed it.

> 
> A quick (untested and uncompiled) thought :
> 
> 				/* Skip the first 'offset' number of sk
> 				 * and not putting them in the iter->batch[].
> 				 */
> 				if (offset) {
> 					offset--;
> 					continue;
> 				}
> 
>>                                 if (!first_sk)
>>                                         first_sk = sk;
>>                                 if (iter->end_sk < iter->max_sk) {
>> @@ -3245,8 +3244,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>                                         iter->batch[iter->end_sk++] = sk;
>>                                 }
>>                                 bucket_sks++;
>> +                              new_offset++;
>>                         }
>> This handles the case when sockets that weren't iterated in the previous round got deleted by the time iterator was resumed. But it's possible that previously iterated sockets got deleted before the iterator was later resumed, and the offset is now outdated. Ideally, iterator should be invalidated in this case, but there is no way to track this, is there? Any thoughts?
> 
> I would not worry about this update in-between case. race will happen anyway when the bucket lock is released. This should be very unlikely when hash"2" is used.
> 
> 

That makes sense. 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-28 18:35           ` Stanislav Fomichev
@ 2023-03-29 23:13             ` Aditi Ghag
  2023-03-29 23:25               ` Aditi Ghag
  2023-03-29 23:25               ` Stanislav Fomichev
  0 siblings, 2 replies; 29+ messages in thread
From: Aditi Ghag @ 2023-03-29 23:13 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Mar 28, 2023, at 11:35 AM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On Tue, Mar 28, 2023 at 10:51 AM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>> 
>> 
>> 
>>> On Mar 27, 2023, at 9:54 AM, Stanislav Fomichev <sdf@google.com> wrote:
>>> 
>>> On 03/27, Aditi Ghag wrote:
>>> 
>>> 
>>>>> On Mar 24, 2023, at 2:52 PM, Stanislav Fomichev <sdf@google.com> wrote:
>>>>> 
>>>>> On 03/23, Aditi Ghag wrote:
>>>>>> The test cases for destroying sockets mirror the intended usages of the
>>>>>> bpf_sock_destroy kfunc using iterators.
>>>>> 
>>>>>> The destroy helpers set `ECONNABORTED` error code that we can validate in
>>>>>> the test code with client sockets. But UDP sockets have an overriding error
>>>>>> code from the disconnect called during abort, so the error code
>>>>>> validation is only done for TCP sockets.
>>>>> 
>>>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>>>> ---
>>>>>> .../selftests/bpf/prog_tests/sock_destroy.c   | 195 ++++++++++++++++++
>>>>>> .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
>>>>>> 2 files changed, 346 insertions(+)
>>>>>> create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>>>>>> create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>>>>> 
>>>>>> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>>>>>> new file mode 100644
>>>>>> index 000000000000..cbce966af568
>>>>>> --- /dev/null
>>>>>> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>>>>>> @@ -0,0 +1,195 @@
>>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>>> +#include <test_progs.h>
>>>>>> +
>>>>>> +#include "sock_destroy_prog.skel.h"
>>>>>> +#include "network_helpers.h"
>>>>>> +
>>>>>> +#define SERVER_PORT 6062
>>>>>> +
>>>>>> +static void start_iter_sockets(struct bpf_program *prog)
>>>>>> +{
>>>>>> + struct bpf_link *link;
>>>>>> + char buf[50] = {};
>>>>>> + int iter_fd, len;
>>>>>> +
>>>>>> + link = bpf_program__attach_iter(prog, NULL);
>>>>>> + if (!ASSERT_OK_PTR(link, "attach_iter"))
>>>>>> +         return;
>>>>>> +
>>>>>> + iter_fd = bpf_iter_create(bpf_link__fd(link));
>>>>>> + if (!ASSERT_GE(iter_fd, 0, "create_iter"))
>>>>>> +         goto free_link;
>>>>>> +
>>>>>> + while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
>>>>>> +         ;
>>>>>> + ASSERT_GE(len, 0, "read");
>>>>>> +
>>>>>> + close(iter_fd);
>>>>>> +
>>>>>> +free_link:
>>>>>> + bpf_link__destroy(link);
>>>>>> +}
>>>>>> +
>>>>>> +static void test_tcp_client(struct sock_destroy_prog *skel)
>>>>>> +{
>>>>>> + int serv = -1, clien = -1, n = 0;
>>>>>> +
>>>>>> + serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
>>>>>> + if (!ASSERT_GE(serv, 0, "start_server"))
>>>>>> +         goto cleanup_serv;
>>>>>> +
>>>>>> + clien = connect_to_fd(serv, 0);
>>>>>> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>>>>>> +         goto cleanup_serv;
>>>>>> +
>>>>>> + serv = accept(serv, NULL, NULL);
>>>>>> + if (!ASSERT_GE(serv, 0, "serv accept"))
>>>>>> +         goto cleanup;
>>>>>> +
>>>>>> + n = send(clien, "t", 1, 0);
>>>>>> + if (!ASSERT_GE(n, 0, "client send"))
>>>>>> +         goto cleanup;
>>>>>> +
>>>>>> + /* Run iterator program that destroys connected client sockets. */
>>>>>> + start_iter_sockets(skel->progs.iter_tcp6_client);
>>>>>> +
>>>>>> + n = send(clien, "t", 1, 0);
>>>>>> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>>>>>> +         goto cleanup;
>>>>>> + ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
>>>>>> +
>>>>>> +
>>>>>> +cleanup:
>>>>>> + close(clien);
>>>>>> +cleanup_serv:
>>>>>> + close(serv);
>>>>>> +}
>>>>>> +
>>>>>> +static void test_tcp_server(struct sock_destroy_prog *skel)
>>>>>> +{
>>>>>> + int serv = -1, clien = -1, n = 0;
>>>>>> +
>>>>>> + serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);
>>>>>> + if (!ASSERT_GE(serv, 0, "start_server"))
>>>>>> +         goto cleanup_serv;
>>>>>> +
>>>>>> + clien = connect_to_fd(serv, 0);
>>>>>> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>>>>>> +         goto cleanup_serv;
>>>>>> +
>>>>>> + serv = accept(serv, NULL, NULL);
>>>>>> + if (!ASSERT_GE(serv, 0, "serv accept"))
>>>>>> +         goto cleanup;
>>>>>> +
>>>>>> + n = send(clien, "t", 1, 0);
>>>>>> + if (!ASSERT_GE(n, 0, "client send"))
>>>>>> +         goto cleanup;
>>>>>> +
>>>>>> + /* Run iterator program that destroys server sockets. */
>>>>>> + start_iter_sockets(skel->progs.iter_tcp6_server);
>>>>>> +
>>>>>> + n = send(clien, "t", 1, 0);
>>>>>> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>>>>>> +         goto cleanup;
>>>>>> + ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
>>>>>> +
>>>>>> +
>>>>>> +cleanup:
>>>>>> + close(clien);
>>>>>> +cleanup_serv:
>>>>>> + close(serv);
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>> +static void test_udp_client(struct sock_destroy_prog *skel)
>>>>>> +{
>>>>>> + int serv = -1, clien = -1, n = 0;
>>>>>> +
>>>>>> + serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
>>>>>> + if (!ASSERT_GE(serv, 0, "start_server"))
>>>>>> +         goto cleanup_serv;
>>>>>> +
>>>>>> + clien = connect_to_fd(serv, 0);
>>>>>> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>>>>>> +         goto cleanup_serv;
>>>>>> +
>>>>>> + n = send(clien, "t", 1, 0);
>>>>>> + if (!ASSERT_GE(n, 0, "client send"))
>>>>>> +         goto cleanup;
>>>>>> +
>>>>>> + /* Run iterator program that destroys sockets. */
>>>>>> + start_iter_sockets(skel->progs.iter_udp6_client);
>>>>>> +
>>>>>> + n = send(clien, "t", 1, 0);
>>>>>> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>>>>>> +         goto cleanup;
>>>>>> + /* UDP sockets have an overriding error code after they are disconnected,
>>>>>> +  * so we don't check for ECONNABORTED error code.
>>>>>> +  */
>>>>>> +
>>>>>> +cleanup:
>>>>>> + close(clien);
>>>>>> +cleanup_serv:
>>>>>> + close(serv);
>>>>>> +}
>>>>>> +
>>>>>> +static void test_udp_server(struct sock_destroy_prog *skel)
>>>>>> +{
>>>>>> + int *listen_fds = NULL, n, i;
>>>>>> + unsigned int num_listens = 5;
>>>>>> + char buf[1];
>>>>>> +
>>>>>> + /* Start reuseport servers. */
>>>>>> + listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
>>>>>> +                                     "::1", SERVER_PORT, 0,
>>>>>> +                                     num_listens);
>>>>>> + if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
>>>>>> +         goto cleanup;
>>>>>> +
>>>>>> + /* Run iterator program that destroys server sockets. */
>>>>>> + start_iter_sockets(skel->progs.iter_udp6_server);
>>>>>> +
>>>>>> + for (i = 0; i < num_listens; ++i) {
>>>>>> +         n = read(listen_fds[i], buf, sizeof(buf));
>>>>>> +         if (!ASSERT_EQ(n, -1, "read") ||
>>>>>> +             !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
>>>>>> +                 break;
>>>>>> + }
>>>>>> + ASSERT_EQ(i, num_listens, "server socket");
>>>>>> +
>>>>>> +cleanup:
>>>>>> + free_fds(listen_fds, num_listens);
>>>>>> +}
>>>>>> +
>>>>>> +void test_sock_destroy(void)
>>>>>> +{
>>>>>> + int cgroup_fd = 0;
>>>>>> + struct sock_destroy_prog *skel;
>>>>>> +
>>>>>> + skel = sock_destroy_prog__open_and_load();
>>>>>> + if (!ASSERT_OK_PTR(skel, "skel_open"))
>>>>>> +         return;
>>>>>> +
>>>>>> + cgroup_fd = test__join_cgroup("/sock_destroy");
>>>>>> + if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
>>>>>> +         goto close_cgroup_fd;
>>>>>> +
>>>>>> + skel->links.sock_connect = bpf_program__attach_cgroup(
>>>>>> +         skel->progs.sock_connect, cgroup_fd);
>>>>>> + if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
>>>>>> +         goto close_cgroup_fd;
>>>>>> +
>>>>>> + if (test__start_subtest("tcp_client"))
>>>>>> +         test_tcp_client(skel);
>>>>>> + if (test__start_subtest("tcp_server"))
>>>>>> +         test_tcp_server(skel);
>>>>>> + if (test__start_subtest("udp_client"))
>>>>>> +         test_udp_client(skel);
>>>>>> + if (test__start_subtest("udp_server"))
>>>>>> +         test_udp_server(skel);
>>>>>> +
>>>>>> +
>>>>>> +close_cgroup_fd:
>>>>>> + close(cgroup_fd);
>>>>>> + sock_destroy_prog__destroy(skel);
>>>>>> +}
>>>>>> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>>>>>> new file mode 100644
>>>>>> index 000000000000..8e09d82c50f3
>>>>>> --- /dev/null
>>>>>> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>>>>>> @@ -0,0 +1,151 @@
>>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>>> +
>>>>>> +#include "vmlinux.h"
>>>>>> +
>>>>>> +#include "bpf_tracing_net.h"
>>>>>> +#include <bpf/bpf_helpers.h>
>>>>>> +#include <bpf/bpf_endian.h>
>>>>>> +
>>>>>> +#define AF_INET6 10
>>>>> 
>>>>> [..]
>>>>> 
>>>>>> +/* Keep it in sync with prog_test/sock_destroy. */
>>>>>> +#define SERVER_PORT 6062
>>>>> 
>>>>> The test looks good, one optional unrelated nit maybe:
>>>>> 
>>>>> I've been guilty of these hard-coded ports in the past, but maybe
>>>>> we should stop hard-coding them? Getting the address of the listener (bound to
>>>>> port 0) and passing it to the bpf program via global variable should be super
>>>>> easy now (with the skeletons and network_helpers).
>>> 
>>> 
>>>> I briefly considered adding the ports in a map, and retrieving them in the test. But it didn't seem worthwhile as the tests should fail clearly when there is a mismatch.
>>> 
>>> My worry is that the amount of those tests that have a hard-coded port
>>> grows and at some point somebody will clash with somebody else.
>>> And it might not be 100% apparent because test_progs is now multi-threaded
>>> and racy..
>>> 
>> 
>> So you would like the ports to be unique across all the tests.
> 
> Yeah, but it's hard without having some kind of global registry. Take
> a look at the following:
> 
> $ grep -Iri _port tools/testing/selftests/bpf/ | grep -P '\d{4}'
> 
> tools/testing/selftests/bpf/progs/connect_force_port4.c:
> sa.sin_port = bpf_htons(22222);
> tools/testing/selftests/bpf/progs/connect_force_port4.c:        if
> (ctx->user_port == bpf_htons(60000)) {
> tools/testing/selftests/bpf/progs/connect_force_port4.c:
> ctx->user_port = bpf_htons(60123);
> tools/testing/selftests/bpf/progs/connect_force_port4.c:        if
> (ctx->user_port == bpf_htons(60123)) {
> tools/testing/selftests/bpf/progs/connect_force_port4.c:
> ctx->user_port = bpf_htons(60000);
> tools/testing/selftests/bpf/progs/connect_force_port4.c:        if
> (ctx->user_port == bpf_htons(60123)) {
> tools/testing/selftests/bpf/progs/connect6_prog.c:#define
> DST_REWRITE_PORT6     6666
> tools/testing/selftests/bpf/progs/test_sk_lookup.c:static const __u16
> SRC_PORT = bpf_htons(8008);
> tools/testing/selftests/bpf/progs/test_sk_lookup.c:static const __u16
> DST_PORT = 7007; /* Host byte order */
> tools/testing/selftests/bpf/progs/test_tc_dtime.c:      __u16
> dst_ns_port = __bpf_htons(50000 + test);
> tools/testing/selftests/bpf/progs/connect4_dropper.c:   if
> (ctx->user_port == bpf_htons(60120))
> tools/testing/selftests/bpf/progs/connect_force_port6.c:
> sa.sin6_port = bpf_htons(22223);
> tools/testing/selftests/bpf/progs/connect_force_port6.c:        if
> (ctx->user_port == bpf_htons(60000)) {
> tools/testing/selftests/bpf/progs/connect_force_port6.c:
> ctx->user_port = bpf_htons(60124);
> tools/testing/selftests/bpf/progs/connect_force_port6.c:        if
> (ctx->user_port == bpf_htons(60124)) {
> tools/testing/selftests/bpf/progs/connect_force_port6.c:
> ctx->user_port = bpf_htons(60000);
> tools/testing/selftests/bpf/progs/connect_force_port6.c:        if
> (ctx->user_port == bpf_htons(60124)) {
> tools/testing/selftests/bpf/progs/test_tunnel_kern.c:#define
> VXLAN_UDP_PORT 4789
> tools/testing/selftests/bpf/progs/sendmsg4_prog.c:#define DST_PORT
>         4040
> tools/testing/selftests/bpf/progs/sendmsg4_prog.c:#define
> DST_REWRITE_PORT4     4444
> tools/testing/selftests/bpf/progs/connect4_prog.c:#define
> DST_REWRITE_PORT4     4444
> tools/testing/selftests/bpf/progs/bind6_prog.c:#define SERV6_PORT
>         6060
> tools/testing/selftests/bpf/progs/bind6_prog.c:#define
> SERV6_REWRITE_PORT       6666
> tools/testing/selftests/bpf/progs/sendmsg6_prog.c:#define
> DST_REWRITE_PORT6     6666
> tools/testing/selftests/bpf/progs/recvmsg4_prog.c:#define SERV4_PORT
>         4040
> <cut>
> 
> .... there is much more ...
> 
>>> Getting the address of the listener (bound to
>>> port 0) and passing it to the bpf program via global variable should be super
>>> easy now (with the skeletons and network_helpers).
>> 
>> Just so that we are on the same page, could you point to which network helpers you are referring to here for passing global variables?
> 
> Take a look at the following existing tests:
> * prog_tests/cgroup_skb_sk_lookup.c
>  * run_lookup_test(&skel->bss->g_serv_port, out_sk);
> * progs/cgroup_skb_sk_lookup_kern.c
>  * g_serv_port
> 
> Fundamentally, here is what's preferable to have:
> 
> fd = start_server(..., port=0, ...);
> listener_port = get_port(fd); /* new network_helpers.h helper that
> calls getsockname */
> obj->bss->port = listener_port; /* populate the port in the BPF program */
> 
> Does it make sense?

That makes sense. Good to know for future reference. The client tests don't have hard-coded ports anyway; only the server tests do, as they are using the SO_REUSEPORT option. You did mention that this was an optional nit, so I'll leave the hard-coded ports for the server tests for now. Hope that's reasonable.

> 
>>>>> 
>>>>> And, unrelated, maybe also fix a bunch of places where the reverse christmas
>>>>> tree doesn't look reverse anymore?
>>> 
>>>> Ok. The checks should be part of tooling (e.g., checkpatch) though if they are meant to be enforced consistently, no?
>>> 
>>>> They are networking specific, so they are not part of checkpatch :-(
>>>> I won't say they are consistently enforced, but we try to keep them
>>>> whenever possible.
>>> 
>>>>> 
>>>>>> +
>>>>>> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
>>>>>> +
>>>>>> +struct {
>>>>>> + __uint(type, BPF_MAP_TYPE_ARRAY);
>>>>>> + __uint(max_entries, 1);
>>>>>> + __type(key, __u32);
>>>>>> + __type(value, __u64);
>>>>>> +} tcp_conn_sockets SEC(".maps");
>>>>>> +
>>>>>> +struct {
>>>>>> + __uint(type, BPF_MAP_TYPE_ARRAY);
>>>>>> + __uint(max_entries, 1);
>>>>>> + __type(key, __u32);
>>>>>> + __type(value, __u64);
>>>>>> +} udp_conn_sockets SEC(".maps");
>>>>>> +
>>>>>> +SEC("cgroup/connect6")
>>>>>> +int sock_connect(struct bpf_sock_addr *ctx)
>>>>>> +{
>>>>>> + int key = 0;
>>>>>> + __u64 sock_cookie = 0;
>>>>>> + __u32 keyc = 0;
>>>>>> +
>>>>>> + if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
>>>>>> +         return 1;
>>>>>> +
>>>>>> + sock_cookie = bpf_get_socket_cookie(ctx);
>>>>>> + if (ctx->protocol == IPPROTO_TCP)
>>>>>> +         bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
>>>>>> + else if (ctx->protocol == IPPROTO_UDP)
>>>>>> +         bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
>>>>>> + else
>>>>>> +         return 1;
>>>>>> +
>>>>>> + return 1;
>>>>>> +}
>>>>>> +
>>>>>> +SEC("iter/tcp")
>>>>>> +int iter_tcp6_client(struct bpf_iter__tcp *ctx)
>>>>>> +{
>>>>>> + struct sock_common *sk_common = ctx->sk_common;
>>>>>> + struct seq_file *seq = ctx->meta->seq;
>>>>>> + __u64 sock_cookie = 0;
>>>>>> + __u64 *val;
>>>>>> + int key = 0;
>>>>>> +
>>>>>> + if (!sk_common)
>>>>>> +         return 0;
>>>>>> +
>>>>>> + if (sk_common->skc_family != AF_INET6)
>>>>>> +         return 0;
>>>>>> +
>>>>>> + sock_cookie  = bpf_get_socket_cookie(sk_common);
>>>>>> + val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
>>>>>> + if (!val)
>>>>>> +         return 0;
>>>>>> + /* Destroy connected client sockets. */
>>>>>> + if (sock_cookie == *val)
>>>>>> +         bpf_sock_destroy(sk_common);
>>>>>> +
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +SEC("iter/tcp")
>>>>>> +int iter_tcp6_server(struct bpf_iter__tcp *ctx)
>>>>>> +{
>>>>>> + struct sock_common *sk_common = ctx->sk_common;
>>>>>> + struct seq_file *seq = ctx->meta->seq;
>>>>>> + struct tcp6_sock *tcp_sk;
>>>>>> + const struct inet_connection_sock *icsk;
>>>>>> + const struct inet_sock *inet;
>>>>>> + __u16 srcp;
>>>>>> +
>>>>>> + if (!sk_common)
>>>>>> +         return 0;
>>>>>> +
>>>>>> + if (sk_common->skc_family != AF_INET6)
>>>>>> +         return 0;
>>>>>> +
>>>>>> + tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
>>>>>> + if (!tcp_sk)
>>>>>> +         return 0;
>>>>>> +
>>>>>> + icsk = &tcp_sk->tcp.inet_conn;
>>>>>> + inet = &icsk->icsk_inet;
>>>>>> + srcp = bpf_ntohs(inet->inet_sport);
>>>>>> +
>>>>>> + /* Destroy server sockets. */
>>>>>> + if (srcp == SERVER_PORT)
>>>>>> +         bpf_sock_destroy(sk_common);
>>>>>> +
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>> +SEC("iter/udp")
>>>>>> +int iter_udp6_client(struct bpf_iter__udp *ctx)
>>>>>> +{
>>>>>> + struct seq_file *seq = ctx->meta->seq;
>>>>>> + struct udp_sock *udp_sk = ctx->udp_sk;
>>>>>> + struct sock *sk = (struct sock *) udp_sk;
>>>>>> + __u64 sock_cookie = 0, *val;
>>>>>> + int key = 0;
>>>>>> +
>>>>>> + if (!sk)
>>>>>> +         return 0;
>>>>>> +
>>>>>> + sock_cookie  = bpf_get_socket_cookie(sk);
>>>>>> + val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
>>>>>> + if (!val)
>>>>>> +         return 0;
>>>>>> + /* Destroy connected client sockets. */
>>>>>> + if (sock_cookie == *val)
>>>>>> +         bpf_sock_destroy((struct sock_common *)sk);
>>>>>> +
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +SEC("iter/udp")
>>>>>> +int iter_udp6_server(struct bpf_iter__udp *ctx)
>>>>>> +{
>>>>>> + struct seq_file *seq = ctx->meta->seq;
>>>>>> + struct udp_sock *udp_sk = ctx->udp_sk;
>>>>>> + struct sock *sk = (struct sock *) udp_sk;
>>>>>> + __u16 srcp;
>>>>>> + struct inet_sock *inet;
>>>>>> +
>>>>>> + if (!sk)
>>>>>> +         return 0;
>>>>>> +
>>>>>> + inet = &udp_sk->inet;
>>>>>> + srcp = bpf_ntohs(inet->inet_sport);
>>>>>> + if (srcp == SERVER_PORT)
>>>>>> +         bpf_sock_destroy((struct sock_common *)sk);
>>>>>> +
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +char _license[] SEC("license") = "GPL";
>>>>>> --
>>>>>> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-29 23:13             ` Aditi Ghag
@ 2023-03-29 23:25               ` Aditi Ghag
  2023-03-29 23:25               ` Stanislav Fomichev
  1 sibling, 0 replies; 29+ messages in thread
From: Aditi Ghag @ 2023-03-29 23:25 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Mar 29, 2023, at 4:13 PM, Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> 
> 
> 
>> On Mar 28, 2023, at 11:35 AM, Stanislav Fomichev <sdf@google.com> wrote:
>> 
>> On Tue, Mar 28, 2023 at 10:51 AM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>>> 
>>> 
>>> 
>>>> On Mar 27, 2023, at 9:54 AM, Stanislav Fomichev <sdf@google.com> wrote:
>>>> 
>>>> On 03/27, Aditi Ghag wrote:
>>>> 
>>>> 
>>>>>> On Mar 24, 2023, at 2:52 PM, Stanislav Fomichev <sdf@google.com> wrote:
>>>>>> 
>>>>>> On 03/23, Aditi Ghag wrote:
>>>>>>> The test cases for destroying sockets mirror the intended usages of the
>>>>>>> bpf_sock_destroy kfunc using iterators.
>>>>>> 
>>>>>>> The destroy helpers set `ECONNABORTED` error code that we can validate in
>>>>>>> the test code with client sockets. But UDP sockets have an overriding error
>>>>>>> code from the disconnect called during abort, so the error code
>>>>>>> validation is only done for TCP sockets.
>>>>>> 
>>>>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>>>>>>> ---
>>>>>>> .../selftests/bpf/prog_tests/sock_destroy.c   | 195 ++++++++++++++++++
>>>>>>> .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
>>>>>>> 2 files changed, 346 insertions(+)
>>>>>>> create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>>>>>>> create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>>>>>> 
>>>>>>> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>>>>>>> new file mode 100644
>>>>>>> index 000000000000..cbce966af568
>>>>>>> --- /dev/null
>>>>>>> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
>>>>>>> @@ -0,0 +1,195 @@
>>>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>>>> +#include <test_progs.h>
>>>>>>> +
>>>>>>> +#include "sock_destroy_prog.skel.h"
>>>>>>> +#include "network_helpers.h"
>>>>>>> +
>>>>>>> +#define SERVER_PORT 6062
>>>>>>> +
>>>>>>> +static void start_iter_sockets(struct bpf_program *prog)
>>>>>>> +{
>>>>>>> + struct bpf_link *link;
>>>>>>> + char buf[50] = {};
>>>>>>> + int iter_fd, len;
>>>>>>> +
>>>>>>> + link = bpf_program__attach_iter(prog, NULL);
>>>>>>> + if (!ASSERT_OK_PTR(link, "attach_iter"))
>>>>>>> +         return;
>>>>>>> +
>>>>>>> + iter_fd = bpf_iter_create(bpf_link__fd(link));
>>>>>>> + if (!ASSERT_GE(iter_fd, 0, "create_iter"))
>>>>>>> +         goto free_link;
>>>>>>> +
>>>>>>> + while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
>>>>>>> +         ;
>>>>>>> + ASSERT_GE(len, 0, "read");
>>>>>>> +
>>>>>>> + close(iter_fd);
>>>>>>> +
>>>>>>> +free_link:
>>>>>>> + bpf_link__destroy(link);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void test_tcp_client(struct sock_destroy_prog *skel)
>>>>>>> +{
>>>>>>> + int serv = -1, clien = -1, n = 0;
>>>>>>> +
>>>>>>> + serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
>>>>>>> + if (!ASSERT_GE(serv, 0, "start_server"))
>>>>>>> +         goto cleanup_serv;
>>>>>>> +
>>>>>>> + clien = connect_to_fd(serv, 0);
>>>>>>> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>>>>>>> +         goto cleanup_serv;
>>>>>>> +
>>>>>>> + serv = accept(serv, NULL, NULL);
>>>>>>> + if (!ASSERT_GE(serv, 0, "serv accept"))
>>>>>>> +         goto cleanup;
>>>>>>> +
>>>>>>> + n = send(clien, "t", 1, 0);
>>>>>>> + if (!ASSERT_GE(n, 0, "client send"))
>>>>>>> +         goto cleanup;
>>>>>>> +
>>>>>>> + /* Run iterator program that destroys connected client sockets. */
>>>>>>> + start_iter_sockets(skel->progs.iter_tcp6_client);
>>>>>>> +
>>>>>>> + n = send(clien, "t", 1, 0);
>>>>>>> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>>>>>>> +         goto cleanup;
>>>>>>> + ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
>>>>>>> +
>>>>>>> +
>>>>>>> +cleanup:
>>>>>>> + close(clien);
>>>>>>> +cleanup_serv:
>>>>>>> + close(serv);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void test_tcp_server(struct sock_destroy_prog *skel)
>>>>>>> +{
>>>>>>> + int serv = -1, clien = -1, n = 0;
>>>>>>> +
>>>>>>> + serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);
>>>>>>> + if (!ASSERT_GE(serv, 0, "start_server"))
>>>>>>> +         goto cleanup_serv;
>>>>>>> +
>>>>>>> + clien = connect_to_fd(serv, 0);
>>>>>>> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>>>>>>> +         goto cleanup_serv;
>>>>>>> +
>>>>>>> + serv = accept(serv, NULL, NULL);
>>>>>>> + if (!ASSERT_GE(serv, 0, "serv accept"))
>>>>>>> +         goto cleanup;
>>>>>>> +
>>>>>>> + n = send(clien, "t", 1, 0);
>>>>>>> + if (!ASSERT_GE(n, 0, "client send"))
>>>>>>> +         goto cleanup;
>>>>>>> +
>>>>>>> + /* Run iterator program that destroys server sockets. */
>>>>>>> + start_iter_sockets(skel->progs.iter_tcp6_server);
>>>>>>> +
>>>>>>> + n = send(clien, "t", 1, 0);
>>>>>>> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>>>>>>> +         goto cleanup;
>>>>>>> + ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
>>>>>>> +
>>>>>>> +
>>>>>>> +cleanup:
>>>>>>> + close(clien);
>>>>>>> +cleanup_serv:
>>>>>>> + close(serv);
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +static void test_udp_client(struct sock_destroy_prog *skel)
>>>>>>> +{
>>>>>>> + int serv = -1, clien = -1, n = 0;
>>>>>>> +
>>>>>>> + serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
>>>>>>> + if (!ASSERT_GE(serv, 0, "start_server"))
>>>>>>> +         goto cleanup_serv;
>>>>>>> +
>>>>>>> + clien = connect_to_fd(serv, 0);
>>>>>>> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
>>>>>>> +         goto cleanup_serv;
>>>>>>> +
>>>>>>> + n = send(clien, "t", 1, 0);
>>>>>>> + if (!ASSERT_GE(n, 0, "client send"))
>>>>>>> +         goto cleanup;
>>>>>>> +
>>>>>>> + /* Run iterator program that destroys sockets. */
>>>>>>> + start_iter_sockets(skel->progs.iter_udp6_client);
>>>>>>> +
>>>>>>> + n = send(clien, "t", 1, 0);
>>>>>>> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
>>>>>>> +         goto cleanup;
>>>>>>> + /* UDP sockets have an overriding error code after they are disconnected,
>>>>>>> +  * so we don't check for ECONNABORTED error code.
>>>>>>> +  */
>>>>>>> +
>>>>>>> +cleanup:
>>>>>>> + close(clien);
>>>>>>> +cleanup_serv:
>>>>>>> + close(serv);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void test_udp_server(struct sock_destroy_prog *skel)
>>>>>>> +{
>>>>>>> + int *listen_fds = NULL, n, i;
>>>>>>> + unsigned int num_listens = 5;
>>>>>>> + char buf[1];
>>>>>>> +
>>>>>>> + /* Start reuseport servers. */
>>>>>>> + listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
>>>>>>> +                                     "::1", SERVER_PORT, 0,
>>>>>>> +                                     num_listens);
>>>>>>> + if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
>>>>>>> +         goto cleanup;
>>>>>>> +
>>>>>>> + /* Run iterator program that destroys server sockets. */
>>>>>>> + start_iter_sockets(skel->progs.iter_udp6_server);
>>>>>>> +
>>>>>>> + for (i = 0; i < num_listens; ++i) {
>>>>>>> +         n = read(listen_fds[i], buf, sizeof(buf));
>>>>>>> +         if (!ASSERT_EQ(n, -1, "read") ||
>>>>>>> +             !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
>>>>>>> +                 break;
>>>>>>> + }
>>>>>>> + ASSERT_EQ(i, num_listens, "server socket");
>>>>>>> +
>>>>>>> +cleanup:
>>>>>>> + free_fds(listen_fds, num_listens);
>>>>>>> +}
>>>>>>> +
>>>>>>> +void test_sock_destroy(void)
>>>>>>> +{
>>>>>>> + int cgroup_fd = 0;
>>>>>>> + struct sock_destroy_prog *skel;
>>>>>>> +
>>>>>>> + skel = sock_destroy_prog__open_and_load();
>>>>>>> + if (!ASSERT_OK_PTR(skel, "skel_open"))
>>>>>>> +         return;
>>>>>>> +
>>>>>>> + cgroup_fd = test__join_cgroup("/sock_destroy");
>>>>>>> + if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
>>>>>>> +         goto close_cgroup_fd;
>>>>>>> +
>>>>>>> + skel->links.sock_connect = bpf_program__attach_cgroup(
>>>>>>> +         skel->progs.sock_connect, cgroup_fd);
>>>>>>> + if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
>>>>>>> +         goto close_cgroup_fd;
>>>>>>> +
>>>>>>> + if (test__start_subtest("tcp_client"))
>>>>>>> +         test_tcp_client(skel);
>>>>>>> + if (test__start_subtest("tcp_server"))
>>>>>>> +         test_tcp_server(skel);
>>>>>>> + if (test__start_subtest("udp_client"))
>>>>>>> +         test_udp_client(skel);
>>>>>>> + if (test__start_subtest("udp_server"))
>>>>>>> +         test_udp_server(skel);
>>>>>>> +
>>>>>>> +
>>>>>>> +close_cgroup_fd:
>>>>>>> + close(cgroup_fd);
>>>>>>> + sock_destroy_prog__destroy(skel);
>>>>>>> +}
>>>>>>> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>>>>>>> new file mode 100644
>>>>>>> index 000000000000..8e09d82c50f3
>>>>>>> --- /dev/null
>>>>>>> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
>>>>>>> @@ -0,0 +1,151 @@
>>>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>>>> +
>>>>>>> +#include "vmlinux.h"
>>>>>>> +
>>>>>>> +#include "bpf_tracing_net.h"
>>>>>>> +#include <bpf/bpf_helpers.h>
>>>>>>> +#include <bpf/bpf_endian.h>
>>>>>>> +
>>>>>>> +#define AF_INET6 10
>>>>>> 
>>>>>> [..]
>>>>>> 
>>>>>>> +/* Keep it in sync with prog_test/sock_destroy. */
>>>>>>> +#define SERVER_PORT 6062
>>>>>> 
>>>>>> The test looks good, one optional unrelated nit maybe:
>>>>>> 
>>>>>> I've been guilty of these hard-coded ports in the past, but maybe
>>>>>> we should stop hard-coding them? Getting the address of the listener (bound to
>>>>>> port 0) and passing it to the bpf program via global variable should be super
>>>>>> easy now (with the skeletons and network_helpers).
>>>> 
>>>> 
>>>>> I briefly considered adding the ports in a map, and retrieving them in the test. But it didn't seem worthwhile as the tests should fail clearly when there is a mismatch.
>>>> 
>>>> My worry is that the amount of those tests that have a hard-coded port
>>>> grows and at some point somebody will clash with somebody else.
>>>> And it might not be 100% apparent because test_progs is now multi-threaded
>>>> and racy..
>>>> 
>>> 
>>> So you would like the ports to be unique across all the tests.
>> 
>> Yeah, but it's hard without having some kind of global registry. Take
>> a look at the following:
>> 
>> $ grep -Iri _port tools/testing/selftests/bpf/ | grep -P '\d{4}'
>> 
>> tools/testing/selftests/bpf/progs/connect_force_port4.c:
>> sa.sin_port = bpf_htons(22222);
>> tools/testing/selftests/bpf/progs/connect_force_port4.c:        if
>> (ctx->user_port == bpf_htons(60000)) {
>> tools/testing/selftests/bpf/progs/connect_force_port4.c:
>> ctx->user_port = bpf_htons(60123);
>> tools/testing/selftests/bpf/progs/connect_force_port4.c:        if
>> (ctx->user_port == bpf_htons(60123)) {
>> tools/testing/selftests/bpf/progs/connect_force_port4.c:
>> ctx->user_port = bpf_htons(60000);
>> tools/testing/selftests/bpf/progs/connect_force_port4.c:        if
>> (ctx->user_port == bpf_htons(60123)) {
>> tools/testing/selftests/bpf/progs/connect6_prog.c:#define
>> DST_REWRITE_PORT6     6666
>> tools/testing/selftests/bpf/progs/test_sk_lookup.c:static const __u16
>> SRC_PORT = bpf_htons(8008);
>> tools/testing/selftests/bpf/progs/test_sk_lookup.c:static const __u16
>> DST_PORT = 7007; /* Host byte order */
>> tools/testing/selftests/bpf/progs/test_tc_dtime.c:      __u16
>> dst_ns_port = __bpf_htons(50000 + test);
>> tools/testing/selftests/bpf/progs/connect4_dropper.c:   if
>> (ctx->user_port == bpf_htons(60120))
>> tools/testing/selftests/bpf/progs/connect_force_port6.c:
>> sa.sin6_port = bpf_htons(22223);
>> tools/testing/selftests/bpf/progs/connect_force_port6.c:        if
>> (ctx->user_port == bpf_htons(60000)) {
>> tools/testing/selftests/bpf/progs/connect_force_port6.c:
>> ctx->user_port = bpf_htons(60124);
>> tools/testing/selftests/bpf/progs/connect_force_port6.c:        if
>> (ctx->user_port == bpf_htons(60124)) {
>> tools/testing/selftests/bpf/progs/connect_force_port6.c:
>> ctx->user_port = bpf_htons(60000);
>> tools/testing/selftests/bpf/progs/connect_force_port6.c:        if
>> (ctx->user_port == bpf_htons(60124)) {
>> tools/testing/selftests/bpf/progs/test_tunnel_kern.c:#define
>> VXLAN_UDP_PORT 4789
>> tools/testing/selftests/bpf/progs/sendmsg4_prog.c:#define DST_PORT
>>        4040
>> tools/testing/selftests/bpf/progs/sendmsg4_prog.c:#define
>> DST_REWRITE_PORT4     4444
>> tools/testing/selftests/bpf/progs/connect4_prog.c:#define
>> DST_REWRITE_PORT4     4444
>> tools/testing/selftests/bpf/progs/bind6_prog.c:#define SERV6_PORT
>>        6060
>> tools/testing/selftests/bpf/progs/bind6_prog.c:#define
>> SERV6_REWRITE_PORT       6666
>> tools/testing/selftests/bpf/progs/sendmsg6_prog.c:#define
>> DST_REWRITE_PORT6     6666
>> tools/testing/selftests/bpf/progs/recvmsg4_prog.c:#define SERV4_PORT
>>        4040
>> <cut>
>> 
>> .... there is much more ...
>> 
>>>> Getting the address of the listener (bound to
>>>> port 0) and passing it to the bpf program via global variable should be super
>>>> easy now (with the skeletons and network_helpers).
>>> 
>>> Just so that we are on the same page, could you point to which network helpers you are referring to here for passing global variables?
>> 
>> Take a look at the following existing tests:
>> * prog_tests/cgroup_skb_sk_lookup.c
>> * run_lookup_test(&skel->bss->g_serv_port, out_sk);
>> * progs/cgroup_skb_sk_lookup_kern.c
>> * g_serv_port
>> 
>> Fundamentally, here is what's preferable to have:
>> 
>> fd = start_server(..., port=0, ...);
>> listener_port = get_port(fd); /* new network_helpers.h helper that
>> calls getsockname */
>> obj->bss->port = listener_port; /* populate the port in the BPF program */
>> 
>> Does it make sense?
> 
> That makes sense. Good to know for future reference. The client tests don't have hard-coded ports anyway; only the server tests do, as they are using the SO_REUSEPORT option. You did mention that this was an optional nit, so I'll leave the hard-coded ports for the server tests for now. Hope that's reasonable.
> 

Looks like that shouldn't be a problem as you can pass port 0 to start_reuseport_server. I'll give this a shot, as it looks like a better option than contributing to the long list of hard-coded ports that you pointed out above. 
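
A rough sketch of that direction, reusing the hypothetical get_port()
helper sketched earlier in the thread and a made-up 'serv_port' global in
the BPF program (illustrative only, not final test code):

	__u16 port;

	listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM, "::1", 0, 0,
					    num_listens);
	if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
		goto cleanup;
	/* start_reuseport_server() binds the remaining sockets to the port
	 * picked for the first one, so reading one port is enough.
	 */
	if (!ASSERT_OK(get_port(listen_fds[0], &port), "get_port"))
		goto cleanup;
	skel->bss->serv_port = port;

	/* Run iterator program that destroys server sockets. */
	start_iter_sockets(skel->progs.iter_udp6_server);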

>> 
>>>>>> 
>>>>>> And, unrelated, maybe also fix a bunch of places where the reverse christmas
>>>>>> tree doesn't look reverse anymore?
>>>> 
>>>>> Ok. The checks should be part of tooling (e.g., checkpatch) though if they are meant to be enforced consistently, no?
>>>> 
>>>> They are networking specific, so they are not part of checkpatch :-(
>>>> I won't say they are consistently enforced, but we try to keep them
>>>> whenever possible.
>>>> 
>>>>>> 
>>>>>>> +
>>>>>>> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
>>>>>>> +
>>>>>>> +struct {
>>>>>>> + __uint(type, BPF_MAP_TYPE_ARRAY);
>>>>>>> + __uint(max_entries, 1);
>>>>>>> + __type(key, __u32);
>>>>>>> + __type(value, __u64);
>>>>>>> +} tcp_conn_sockets SEC(".maps");
>>>>>>> +
>>>>>>> +struct {
>>>>>>> + __uint(type, BPF_MAP_TYPE_ARRAY);
>>>>>>> + __uint(max_entries, 1);
>>>>>>> + __type(key, __u32);
>>>>>>> + __type(value, __u64);
>>>>>>> +} udp_conn_sockets SEC(".maps");
>>>>>>> +
>>>>>>> +SEC("cgroup/connect6")
>>>>>>> +int sock_connect(struct bpf_sock_addr *ctx)
>>>>>>> +{
>>>>>>> + int key = 0;
>>>>>>> + __u64 sock_cookie = 0;
>>>>>>> + __u32 keyc = 0;
>>>>>>> +
>>>>>>> + if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
>>>>>>> +         return 1;
>>>>>>> +
>>>>>>> + sock_cookie = bpf_get_socket_cookie(ctx);
>>>>>>> + if (ctx->protocol == IPPROTO_TCP)
>>>>>>> +         bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
>>>>>>> + else if (ctx->protocol == IPPROTO_UDP)
>>>>>>> +         bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
>>>>>>> + else
>>>>>>> +         return 1;
>>>>>>> +
>>>>>>> + return 1;
>>>>>>> +}
>>>>>>> +
>>>>>>> +SEC("iter/tcp")
>>>>>>> +int iter_tcp6_client(struct bpf_iter__tcp *ctx)
>>>>>>> +{
>>>>>>> + struct sock_common *sk_common = ctx->sk_common;
>>>>>>> + struct seq_file *seq = ctx->meta->seq;
>>>>>>> + __u64 sock_cookie = 0;
>>>>>>> + __u64 *val;
>>>>>>> + int key = 0;
>>>>>>> +
>>>>>>> + if (!sk_common)
>>>>>>> +         return 0;
>>>>>>> +
>>>>>>> + if (sk_common->skc_family != AF_INET6)
>>>>>>> +         return 0;
>>>>>>> +
>>>>>>> + sock_cookie  = bpf_get_socket_cookie(sk_common);
>>>>>>> + val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
>>>>>>> + if (!val)
>>>>>>> +         return 0;
>>>>>>> + /* Destroy connected client sockets. */
>>>>>>> + if (sock_cookie == *val)
>>>>>>> +         bpf_sock_destroy(sk_common);
>>>>>>> +
>>>>>>> + return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +SEC("iter/tcp")
>>>>>>> +int iter_tcp6_server(struct bpf_iter__tcp *ctx)
>>>>>>> +{
>>>>>>> + struct sock_common *sk_common = ctx->sk_common;
>>>>>>> + struct seq_file *seq = ctx->meta->seq;
>>>>>>> + struct tcp6_sock *tcp_sk;
>>>>>>> + const struct inet_connection_sock *icsk;
>>>>>>> + const struct inet_sock *inet;
>>>>>>> + __u16 srcp;
>>>>>>> +
>>>>>>> + if (!sk_common)
>>>>>>> +         return 0;
>>>>>>> +
>>>>>>> + if (sk_common->skc_family != AF_INET6)
>>>>>>> +         return 0;
>>>>>>> +
>>>>>>> + tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
>>>>>>> + if (!tcp_sk)
>>>>>>> +         return 0;
>>>>>>> +
>>>>>>> + icsk = &tcp_sk->tcp.inet_conn;
>>>>>>> + inet = &icsk->icsk_inet;
>>>>>>> + srcp = bpf_ntohs(inet->inet_sport);
>>>>>>> +
>>>>>>> + /* Destroy server sockets. */
>>>>>>> + if (srcp == SERVER_PORT)
>>>>>>> +         bpf_sock_destroy(sk_common);
>>>>>>> +
>>>>>>> + return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +SEC("iter/udp")
>>>>>>> +int iter_udp6_client(struct bpf_iter__udp *ctx)
>>>>>>> +{
>>>>>>> + struct seq_file *seq = ctx->meta->seq;
>>>>>>> + struct udp_sock *udp_sk = ctx->udp_sk;
>>>>>>> + struct sock *sk = (struct sock *) udp_sk;
>>>>>>> + __u64 sock_cookie = 0, *val;
>>>>>>> + int key = 0;
>>>>>>> +
>>>>>>> + if (!sk)
>>>>>>> +         return 0;
>>>>>>> +
>>>>>>> + sock_cookie  = bpf_get_socket_cookie(sk);
>>>>>>> + val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
>>>>>>> + if (!val)
>>>>>>> +         return 0;
>>>>>>> + /* Destroy connected client sockets. */
>>>>>>> + if (sock_cookie == *val)
>>>>>>> +         bpf_sock_destroy((struct sock_common *)sk);
>>>>>>> +
>>>>>>> + return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +SEC("iter/udp")
>>>>>>> +int iter_udp6_server(struct bpf_iter__udp *ctx)
>>>>>>> +{
>>>>>>> + struct seq_file *seq = ctx->meta->seq;
>>>>>>> + struct udp_sock *udp_sk = ctx->udp_sk;
>>>>>>> + struct sock *sk = (struct sock *) udp_sk;
>>>>>>> + __u16 srcp;
>>>>>>> + struct inet_sock *inet;
>>>>>>> +
>>>>>>> + if (!sk)
>>>>>>> +         return 0;
>>>>>>> +
>>>>>>> + inet = &udp_sk->inet;
>>>>>>> + srcp = bpf_ntohs(inet->inet_sport);
>>>>>>> + if (srcp == SERVER_PORT)
>>>>>>> +         bpf_sock_destroy((struct sock_common *)sk);
>>>>>>> +
>>>>>>> + return 0;
>>>>>>> +}
>>>>>>> +
>>>>>>> +char _license[] SEC("license") = "GPL";
>>>>>>> --
>>>>>>> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy
  2023-03-29 23:13             ` Aditi Ghag
  2023-03-29 23:25               ` Aditi Ghag
@ 2023-03-29 23:25               ` Stanislav Fomichev
  1 sibling, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-03-29 23:25 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On Wed, Mar 29, 2023 at 4:13 PM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
>
>
> > On Mar 28, 2023, at 11:35 AM, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Mar 28, 2023 at 10:51 AM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
> >>
> >>
> >>
> >>> On Mar 27, 2023, at 9:54 AM, Stanislav Fomichev <sdf@google.com> wrote:
> >>>
> >>> On 03/27, Aditi Ghag wrote:
> >>>
> >>>
> >>>>> On Mar 24, 2023, at 2:52 PM, Stanislav Fomichev <sdf@google.com> wrote:
> >>>>>
> >>>>> On 03/23, Aditi Ghag wrote:
> >>>>>> The test cases for destroying sockets mirror the intended usages of the
> >>>>>> bpf_sock_destroy kfunc using iterators.
> >>>>>
> >>>>>> The destroy helpers set the `ECONNABORTED` error code that we can validate in
> >>>>>> the test code with client sockets. But UDP sockets have an overriding error
> >>>>>> code from the disconnect called during abort, so the error code
> >>>>>> validation is only done for TCP sockets.
> >>>>>
> >>>>>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> >>>>>> ---
> >>>>>> .../selftests/bpf/prog_tests/sock_destroy.c   | 195 ++++++++++++++++++
> >>>>>> .../selftests/bpf/progs/sock_destroy_prog.c   | 151 ++++++++++++++
> >>>>>> 2 files changed, 346 insertions(+)
> >>>>>> create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> >>>>>> create mode 100644 tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> >>>>>
> >>>>>> diff --git a/tools/testing/selftests/bpf/prog_tests/sock_destroy.c b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> >>>>>> new file mode 100644
> >>>>>> index 000000000000..cbce966af568
> >>>>>> --- /dev/null
> >>>>>> +++ b/tools/testing/selftests/bpf/prog_tests/sock_destroy.c
> >>>>>> @@ -0,0 +1,195 @@
> >>>>>> +// SPDX-License-Identifier: GPL-2.0
> >>>>>> +#include <test_progs.h>
> >>>>>> +
> >>>>>> +#include "sock_destroy_prog.skel.h"
> >>>>>> +#include "network_helpers.h"
> >>>>>> +
> >>>>>> +#define SERVER_PORT 6062
> >>>>>> +
> >>>>>> +static void start_iter_sockets(struct bpf_program *prog)
> >>>>>> +{
> >>>>>> + struct bpf_link *link;
> >>>>>> + char buf[50] = {};
> >>>>>> + int iter_fd, len;
> >>>>>> +
> >>>>>> + link = bpf_program__attach_iter(prog, NULL);
> >>>>>> + if (!ASSERT_OK_PTR(link, "attach_iter"))
> >>>>>> +         return;
> >>>>>> +
> >>>>>> + iter_fd = bpf_iter_create(bpf_link__fd(link));
> >>>>>> + if (!ASSERT_GE(iter_fd, 0, "create_iter"))
> >>>>>> +         goto free_link;
> >>>>>> +
> >>>>>> + while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
> >>>>>> +         ;
> >>>>>> + ASSERT_GE(len, 0, "read");
> >>>>>> +
> >>>>>> + close(iter_fd);
> >>>>>> +
> >>>>>> +free_link:
> >>>>>> + bpf_link__destroy(link);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void test_tcp_client(struct sock_destroy_prog *skel)
> >>>>>> +{
> >>>>>> + int serv = -1, clien = -1, n = 0;
> >>>>>> +
> >>>>>> + serv = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
> >>>>>> + if (!ASSERT_GE(serv, 0, "start_server"))
> >>>>>> +         goto cleanup_serv;
> >>>>>> +
> >>>>>> + clien = connect_to_fd(serv, 0);
> >>>>>> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> >>>>>> +         goto cleanup_serv;
> >>>>>> +
> >>>>>> + serv = accept(serv, NULL, NULL);
> >>>>>> + if (!ASSERT_GE(serv, 0, "serv accept"))
> >>>>>> +         goto cleanup;
> >>>>>> +
> >>>>>> + n = send(clien, "t", 1, 0);
> >>>>>> + if (!ASSERT_GE(n, 0, "client send"))
> >>>>>> +         goto cleanup;
> >>>>>> +
> >>>>>> + /* Run iterator program that destroys connected client sockets. */
> >>>>>> + start_iter_sockets(skel->progs.iter_tcp6_client);
> >>>>>> +
> >>>>>> + n = send(clien, "t", 1, 0);
> >>>>>> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> >>>>>> +         goto cleanup;
> >>>>>> + ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket");
> >>>>>> +
> >>>>>> +
> >>>>>> +cleanup:
> >>>>>> + close(clien);
> >>>>>> +cleanup_serv:
> >>>>>> + close(serv);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void test_tcp_server(struct sock_destroy_prog *skel)
> >>>>>> +{
> >>>>>> + int serv = -1, clien = -1, n = 0;
> >>>>>> +
> >>>>>> + serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);
> >>>>>> + if (!ASSERT_GE(serv, 0, "start_server"))
> >>>>>> +         goto cleanup_serv;
> >>>>>> +
> >>>>>> + clien = connect_to_fd(serv, 0);
> >>>>>> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> >>>>>> +         goto cleanup_serv;
> >>>>>> +
> >>>>>> + serv = accept(serv, NULL, NULL);
> >>>>>> + if (!ASSERT_GE(serv, 0, "serv accept"))
> >>>>>> +         goto cleanup;
> >>>>>> +
> >>>>>> + n = send(clien, "t", 1, 0);
> >>>>>> + if (!ASSERT_GE(n, 0, "client send"))
> >>>>>> +         goto cleanup;
> >>>>>> +
> >>>>>> + /* Run iterator program that destroys server sockets. */
> >>>>>> + start_iter_sockets(skel->progs.iter_tcp6_server);
> >>>>>> +
> >>>>>> + n = send(clien, "t", 1, 0);
> >>>>>> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> >>>>>> +         goto cleanup;
> >>>>>> + ASSERT_EQ(errno, ECONNRESET, "error code on destroyed socket");
> >>>>>> +
> >>>>>> +
> >>>>>> +cleanup:
> >>>>>> + close(clien);
> >>>>>> +cleanup_serv:
> >>>>>> + close(serv);
> >>>>>> +}
> >>>>>> +
> >>>>>> +
> >>>>>> +static void test_udp_client(struct sock_destroy_prog *skel)
> >>>>>> +{
> >>>>>> + int serv = -1, clien = -1, n = 0;
> >>>>>> +
> >>>>>> + serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);
> >>>>>> + if (!ASSERT_GE(serv, 0, "start_server"))
> >>>>>> +         goto cleanup_serv;
> >>>>>> +
> >>>>>> + clien = connect_to_fd(serv, 0);
> >>>>>> + if (!ASSERT_GE(clien, 0, "connect_to_fd"))
> >>>>>> +         goto cleanup_serv;
> >>>>>> +
> >>>>>> + n = send(clien, "t", 1, 0);
> >>>>>> + if (!ASSERT_GE(n, 0, "client send"))
> >>>>>> +         goto cleanup;
> >>>>>> +
> >>>>>> + /* Run iterator program that destroys sockets. */
> >>>>>> + start_iter_sockets(skel->progs.iter_udp6_client);
> >>>>>> +
> >>>>>> + n = send(clien, "t", 1, 0);
> >>>>>> + if (!ASSERT_LT(n, 0, "client_send on destroyed socket"))
> >>>>>> +         goto cleanup;
> >>>>>> + /* UDP sockets have an overriding error code after they are disconnected,
> >>>>>> +  * so we don't check for ECONNABORTED error code.
> >>>>>> +  */
> >>>>>> +
> >>>>>> +cleanup:
> >>>>>> + close(clien);
> >>>>>> +cleanup_serv:
> >>>>>> + close(serv);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void test_udp_server(struct sock_destroy_prog *skel)
> >>>>>> +{
> >>>>>> + int *listen_fds = NULL, n, i;
> >>>>>> + unsigned int num_listens = 5;
> >>>>>> + char buf[1];
> >>>>>> +
> >>>>>> + /* Start reuseport servers. */
> >>>>>> + listen_fds = start_reuseport_server(AF_INET6, SOCK_DGRAM,
> >>>>>> +                                     "::1", SERVER_PORT, 0,
> >>>>>> +                                     num_listens);
> >>>>>> + if (!ASSERT_OK_PTR(listen_fds, "start_reuseport_server"))
> >>>>>> +         goto cleanup;
> >>>>>> +
> >>>>>> + /* Run iterator program that destroys server sockets. */
> >>>>>> + start_iter_sockets(skel->progs.iter_udp6_server);
> >>>>>> +
> >>>>>> + for (i = 0; i < num_listens; ++i) {
> >>>>>> +         n = read(listen_fds[i], buf, sizeof(buf));
> >>>>>> +         if (!ASSERT_EQ(n, -1, "read") ||
> >>>>>> +             !ASSERT_EQ(errno, ECONNABORTED, "error code on destroyed socket"))
> >>>>>> +                 break;
> >>>>>> + }
> >>>>>> + ASSERT_EQ(i, num_listens, "server socket");
> >>>>>> +
> >>>>>> +cleanup:
> >>>>>> + free_fds(listen_fds, num_listens);
> >>>>>> +}
> >>>>>> +
> >>>>>> +void test_sock_destroy(void)
> >>>>>> +{
> >>>>>> + int cgroup_fd = 0;
> >>>>>> + struct sock_destroy_prog *skel;
> >>>>>> +
> >>>>>> + skel = sock_destroy_prog__open_and_load();
> >>>>>> + if (!ASSERT_OK_PTR(skel, "skel_open"))
> >>>>>> +         return;
> >>>>>> +
> >>>>>> + cgroup_fd = test__join_cgroup("/sock_destroy");
> >>>>>> + if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> >>>>>> +         goto close_cgroup_fd;
> >>>>>> +
> >>>>>> + skel->links.sock_connect = bpf_program__attach_cgroup(
> >>>>>> +         skel->progs.sock_connect, cgroup_fd);
> >>>>>> + if (!ASSERT_OK_PTR(skel->links.sock_connect, "prog_attach"))
> >>>>>> +         goto close_cgroup_fd;
> >>>>>> +
> >>>>>> + if (test__start_subtest("tcp_client"))
> >>>>>> +         test_tcp_client(skel);
> >>>>>> + if (test__start_subtest("tcp_server"))
> >>>>>> +         test_tcp_server(skel);
> >>>>>> + if (test__start_subtest("udp_client"))
> >>>>>> +         test_udp_client(skel);
> >>>>>> + if (test__start_subtest("udp_server"))
> >>>>>> +         test_udp_server(skel);
> >>>>>> +
> >>>>>> +
> >>>>>> +close_cgroup_fd:
> >>>>>> + close(cgroup_fd);
> >>>>>> + sock_destroy_prog__destroy(skel);
> >>>>>> +}
> >>>>>> diff --git a/tools/testing/selftests/bpf/progs/sock_destroy_prog.c b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> >>>>>> new file mode 100644
> >>>>>> index 000000000000..8e09d82c50f3
> >>>>>> --- /dev/null
> >>>>>> +++ b/tools/testing/selftests/bpf/progs/sock_destroy_prog.c
> >>>>>> @@ -0,0 +1,151 @@
> >>>>>> +// SPDX-License-Identifier: GPL-2.0
> >>>>>> +
> >>>>>> +#include "vmlinux.h"
> >>>>>> +
> >>>>>> +#include "bpf_tracing_net.h"
> >>>>>> +#include <bpf/bpf_helpers.h>
> >>>>>> +#include <bpf/bpf_endian.h>
> >>>>>> +
> >>>>>> +#define AF_INET6 10
> >>>>>
> >>>>> [..]
> >>>>>
> >>>>>> +/* Keep it in sync with prog_test/sock_destroy. */
> >>>>>> +#define SERVER_PORT 6062
> >>>>>
> >>>>> The test looks good, one optional unrelated nit maybe:
> >>>>>
> >>>>> I've been guilty of these hard-coded ports in the past, but maybe
> >>>>> we should stop hard-coding them? Getting the address of the listener (bound to
> >>>>> port 0) and passing it to the bpf program via global variable should be super
> >>>>> easy now (with the skeletons and network_helpers).
> >>>
> >>>
> >>>> I briefly considered adding the ports in a map, and retrieving them in the test. But it didn't seem worthwhile as the tests should fail clearly when there is a mismatch.
> >>>
> >>> My worry is that the number of tests that have a hard-coded port
> >>> grows and at some point somebody will clash with somebody else.
> >>> And it might not be 100% apparent because test_progs is now multi-threaded
> >>> and racy...
> >>>
> >>
> >> So you would like the ports to be unique across all the tests.
> >
> > Yeah, but it's hard without having some kind of global registry. Take
> > a look at the following:
> >
> > $ grep -Iri _port tools/testing/selftests/bpf/ | grep -P '\d{4}'
> >
> > tools/testing/selftests/bpf/progs/connect_force_port4.c:
> > sa.sin_port = bpf_htons(22222);
> > tools/testing/selftests/bpf/progs/connect_force_port4.c:        if
> > (ctx->user_port == bpf_htons(60000)) {
> > tools/testing/selftests/bpf/progs/connect_force_port4.c:
> > ctx->user_port = bpf_htons(60123);
> > tools/testing/selftests/bpf/progs/connect_force_port4.c:        if
> > (ctx->user_port == bpf_htons(60123)) {
> > tools/testing/selftests/bpf/progs/connect_force_port4.c:
> > ctx->user_port = bpf_htons(60000);
> > tools/testing/selftests/bpf/progs/connect_force_port4.c:        if
> > (ctx->user_port == bpf_htons(60123)) {
> > tools/testing/selftests/bpf/progs/connect6_prog.c:#define
> > DST_REWRITE_PORT6     6666
> > tools/testing/selftests/bpf/progs/test_sk_lookup.c:static const __u16
> > SRC_PORT = bpf_htons(8008);
> > tools/testing/selftests/bpf/progs/test_sk_lookup.c:static const __u16
> > DST_PORT = 7007; /* Host byte order */
> > tools/testing/selftests/bpf/progs/test_tc_dtime.c:      __u16
> > dst_ns_port = __bpf_htons(50000 + test);
> > tools/testing/selftests/bpf/progs/connect4_dropper.c:   if
> > (ctx->user_port == bpf_htons(60120))
> > tools/testing/selftests/bpf/progs/connect_force_port6.c:
> > sa.sin6_port = bpf_htons(22223);
> > tools/testing/selftests/bpf/progs/connect_force_port6.c:        if
> > (ctx->user_port == bpf_htons(60000)) {
> > tools/testing/selftests/bpf/progs/connect_force_port6.c:
> > ctx->user_port = bpf_htons(60124);
> > tools/testing/selftests/bpf/progs/connect_force_port6.c:        if
> > (ctx->user_port == bpf_htons(60124)) {
> > tools/testing/selftests/bpf/progs/connect_force_port6.c:
> > ctx->user_port = bpf_htons(60000);
> > tools/testing/selftests/bpf/progs/connect_force_port6.c:        if
> > (ctx->user_port == bpf_htons(60124)) {
> > tools/testing/selftests/bpf/progs/test_tunnel_kern.c:#define
> > VXLAN_UDP_PORT 4789
> > tools/testing/selftests/bpf/progs/sendmsg4_prog.c:#define DST_PORT
> >         4040
> > tools/testing/selftests/bpf/progs/sendmsg4_prog.c:#define
> > DST_REWRITE_PORT4     4444
> > tools/testing/selftests/bpf/progs/connect4_prog.c:#define
> > DST_REWRITE_PORT4     4444
> > tools/testing/selftests/bpf/progs/bind6_prog.c:#define SERV6_PORT
> >         6060
> > tools/testing/selftests/bpf/progs/bind6_prog.c:#define
> > SERV6_REWRITE_PORT       6666
> > tools/testing/selftests/bpf/progs/sendmsg6_prog.c:#define
> > DST_REWRITE_PORT6     6666
> > tools/testing/selftests/bpf/progs/recvmsg4_prog.c:#define SERV4_PORT
> >         4040
> > <cut>
> >
> > .... there is much more ...
> >
> >>> Getting the address of the listener (bound to
> >>> port 0) and passing it to the bpf program via global variable should be super
> >>> easy now (with the skeletons and network_helpers).
> >>
> >> Just so that we are on the same page, could you point to which network helpers you are referring to here for passing global variables?
> >
> > Take a look at the following existing tests:
> > * prog_tests/cgroup_skb_sk_lookup.c
> >  * run_lookup_test(&skel->bss->g_serv_port, out_sk);
> > * progs/cgroup_skb_sk_lookup_kern.c
> >  * g_serv_port
> >
> > Fundamentally, here is what's preferable to have:
> >
> > fd = start_server(..., port=0, ...);
> > listener_port = get_port(fd); /* new network_helpers.h helper that
> > calls getsockname */
> > obj->bss->port = listener_port; /* populate the port in the BPF program */
> >
> > Does it make sense?
>
> That makes sense. Good to know for future reference. The client tests don't have hard-coded ports anyway, only the server tests do as they are using the SO_REUSEPORT option. You did mention that this was an optional nit, so I'll leave the hard-coded ports for the server tests for now. Hope that's reasonable.

Sure, up to you, but to clarify: you have the following:

+static void test_tcp_server(struct sock_destroy_prog *skel)
+{
+ int serv = -1, clien = -1, n = 0;
+
+ serv = start_server(AF_INET6, SOCK_STREAM, NULL, SERVER_PORT, 0);

And the following:

+static void test_udp_client(struct sock_destroy_prog *skel)
+{
+ int serv = -1, clien = -1, n = 0;
+
+ serv = start_server(AF_INET6, SOCK_DGRAM, NULL, 6161, 0);

Both have hard-coded ports and are not using reuseport?

> >
> >>>>>
> >>>>> And, unrelated, maybe also fix a bunch of places where the reverse christmas
> >>>>> tree doesn't look reverse anymore?
> >>>
> >>>> Ok. The checks should be part of tooling (e.g., checkpatch) though if they are meant to be enforced consistently, no?
> >>>
> >>> They are networking specific, so they are not part of checkpatch :-(
> >>> I won't say they are consistently enforced, but we try to keep them
> >>> whenever possible.
> >>>
> >>>>>
> >>>>>> +
> >>>>>> +int bpf_sock_destroy(struct sock_common *sk) __ksym;
> >>>>>> +
> >>>>>> +struct {
> >>>>>> + __uint(type, BPF_MAP_TYPE_ARRAY);
> >>>>>> + __uint(max_entries, 1);
> >>>>>> + __type(key, __u32);
> >>>>>> + __type(value, __u64);
> >>>>>> +} tcp_conn_sockets SEC(".maps");
> >>>>>> +
> >>>>>> +struct {
> >>>>>> + __uint(type, BPF_MAP_TYPE_ARRAY);
> >>>>>> + __uint(max_entries, 1);
> >>>>>> + __type(key, __u32);
> >>>>>> + __type(value, __u64);
> >>>>>> +} udp_conn_sockets SEC(".maps");
> >>>>>> +
> >>>>>> +SEC("cgroup/connect6")
> >>>>>> +int sock_connect(struct bpf_sock_addr *ctx)
> >>>>>> +{
> >>>>>> + int key = 0;
> >>>>>> + __u64 sock_cookie = 0;
> >>>>>> + __u32 keyc = 0;
> >>>>>> +
> >>>>>> + if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
> >>>>>> +         return 1;
> >>>>>> +
> >>>>>> + sock_cookie = bpf_get_socket_cookie(ctx);
> >>>>>> + if (ctx->protocol == IPPROTO_TCP)
> >>>>>> +         bpf_map_update_elem(&tcp_conn_sockets, &key, &sock_cookie, 0);
> >>>>>> + else if (ctx->protocol == IPPROTO_UDP)
> >>>>>> +         bpf_map_update_elem(&udp_conn_sockets, &keyc, &sock_cookie, 0);
> >>>>>> + else
> >>>>>> +         return 1;
> >>>>>> +
> >>>>>> + return 1;
> >>>>>> +}
> >>>>>> +
> >>>>>> +SEC("iter/tcp")
> >>>>>> +int iter_tcp6_client(struct bpf_iter__tcp *ctx)
> >>>>>> +{
> >>>>>> + struct sock_common *sk_common = ctx->sk_common;
> >>>>>> + struct seq_file *seq = ctx->meta->seq;
> >>>>>> + __u64 sock_cookie = 0;
> >>>>>> + __u64 *val;
> >>>>>> + int key = 0;
> >>>>>> +
> >>>>>> + if (!sk_common)
> >>>>>> +         return 0;
> >>>>>> +
> >>>>>> + if (sk_common->skc_family != AF_INET6)
> >>>>>> +         return 0;
> >>>>>> +
> >>>>>> + sock_cookie  = bpf_get_socket_cookie(sk_common);
> >>>>>> + val = bpf_map_lookup_elem(&tcp_conn_sockets, &key);
> >>>>>> + if (!val)
> >>>>>> +         return 0;
> >>>>>> + /* Destroy connected client sockets. */
> >>>>>> + if (sock_cookie == *val)
> >>>>>> +         bpf_sock_destroy(sk_common);
> >>>>>> +
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +SEC("iter/tcp")
> >>>>>> +int iter_tcp6_server(struct bpf_iter__tcp *ctx)
> >>>>>> +{
> >>>>>> + struct sock_common *sk_common = ctx->sk_common;
> >>>>>> + struct seq_file *seq = ctx->meta->seq;
> >>>>>> + struct tcp6_sock *tcp_sk;
> >>>>>> + const struct inet_connection_sock *icsk;
> >>>>>> + const struct inet_sock *inet;
> >>>>>> + __u16 srcp;
> >>>>>> +
> >>>>>> + if (!sk_common)
> >>>>>> +         return 0;
> >>>>>> +
> >>>>>> + if (sk_common->skc_family != AF_INET6)
> >>>>>> +         return 0;
> >>>>>> +
> >>>>>> + tcp_sk = bpf_skc_to_tcp6_sock(sk_common);
> >>>>>> + if (!tcp_sk)
> >>>>>> +         return 0;
> >>>>>> +
> >>>>>> + icsk = &tcp_sk->tcp.inet_conn;
> >>>>>> + inet = &icsk->icsk_inet;
> >>>>>> + srcp = bpf_ntohs(inet->inet_sport);
> >>>>>> +
> >>>>>> + /* Destroy server sockets. */
> >>>>>> + if (srcp == SERVER_PORT)
> >>>>>> +         bpf_sock_destroy(sk_common);
> >>>>>> +
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +
> >>>>>> +SEC("iter/udp")
> >>>>>> +int iter_udp6_client(struct bpf_iter__udp *ctx)
> >>>>>> +{
> >>>>>> + struct seq_file *seq = ctx->meta->seq;
> >>>>>> + struct udp_sock *udp_sk = ctx->udp_sk;
> >>>>>> + struct sock *sk = (struct sock *) udp_sk;
> >>>>>> + __u64 sock_cookie = 0, *val;
> >>>>>> + int key = 0;
> >>>>>> +
> >>>>>> + if (!sk)
> >>>>>> +         return 0;
> >>>>>> +
> >>>>>> + sock_cookie  = bpf_get_socket_cookie(sk);
> >>>>>> + val = bpf_map_lookup_elem(&udp_conn_sockets, &key);
> >>>>>> + if (!val)
> >>>>>> +         return 0;
> >>>>>> + /* Destroy connected client sockets. */
> >>>>>> + if (sock_cookie == *val)
> >>>>>> +         bpf_sock_destroy((struct sock_common *)sk);
> >>>>>> +
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +SEC("iter/udp")
> >>>>>> +int iter_udp6_server(struct bpf_iter__udp *ctx)
> >>>>>> +{
> >>>>>> + struct seq_file *seq = ctx->meta->seq;
> >>>>>> + struct udp_sock *udp_sk = ctx->udp_sk;
> >>>>>> + struct sock *sk = (struct sock *) udp_sk;
> >>>>>> + __u16 srcp;
> >>>>>> + struct inet_sock *inet;
> >>>>>> +
> >>>>>> + if (!sk)
> >>>>>> +         return 0;
> >>>>>> +
> >>>>>> + inet = &udp_sk->inet;
> >>>>>> + srcp = bpf_ntohs(inet->inet_sport);
> >>>>>> + if (srcp == SERVER_PORT)
> >>>>>> +         bpf_sock_destroy((struct sock_common *)sk);
> >>>>>> +
> >>>>>> + return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +char _license[] SEC("license") = "GPL";
> >>>>>> --
> >>>>>> 2.34.1
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc
  2023-03-24 21:37   ` Stanislav Fomichev
@ 2023-03-30 14:42     ` Aditi Ghag
  2023-03-30 16:32       ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Aditi Ghag @ 2023-03-30 14:42 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: bpf, kafai, edumazet



> On Mar 24, 2023, at 2:37 PM, Stanislav Fomichev <sdf@google.com> wrote:
> 
> On 03/23, Aditi Ghag wrote:
>> The socket destroy kfunc is used to forcefully terminate sockets from
>> certain BPF contexts. We plan to use the capability in Cilium to force
>> client sockets to reconnect when their remote load-balancing backends are
>> deleted. The other use case is on-the-fly policy enforcement where existing
>> socket connections prevented by policies need to be forcefully terminated.
>> The helper allows terminating sockets that may or may not be actively
>> sending traffic.
> 
>> The helper is currently exposed to certain BPF iterators where users can
>> filter, and terminate selected sockets.  Additionally, the helper can only
>> be called from these BPF contexts that ensure socket locking in order to
>> allow synchronous execution of destroy helpers that also acquire socket
>> locks. The previous commit that batches UDP sockets during iteration
>> facilitated a synchronous invocation of the destroy helper from BPF context
>> by skipping taking socket locks in the destroy handler. TCP iterators
>> already supported batching.
> 
>> The helper takes a `sock_common` type argument, even though it expects, and
>> casts it to, a `sock` pointer. This enables the verifier to allow the
>> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
>> `sock` structs. As a comparison, BPF helpers enable this behavior with the
>> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
>> option available with the verifier logic that handles kfuncs where BTF
>> types are inferred. Furthermore, as `sock_common` only has a subset of
>> the fields of `sock`, casting a pointer to the latter type might not
>> always be safe. Hence, the BPF kfunc converts the argument to a full sock
>> before casting.
> 
>> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
>> ---
>>  net/core/filter.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++
>>  net/ipv4/tcp.c    | 10 ++++++---
>>  net/ipv4/udp.c    |  6 ++++--
>>  3 files changed, 65 insertions(+), 5 deletions(-)
> 
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 1d6f165923bf..ba3e0dac119c 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
> 
>>  	return func;
>>  }
>> +
>> +/* Disables missing prototype warnings */
>> +__diag_push();
>> +__diag_ignore_all("-Wmissing-prototypes",
>> +		  "Global functions as their definitions will be in vmlinux BTF");
>> +
>> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
>> + *
>> + * The helper expects a non-NULL pointer to a socket. It invokes the
>> + * protocol specific socket destroy handlers.
>> + *
>> + * The helper can only be called from BPF contexts that have acquired the socket
>> + * locks.
>> + *
>> + * Parameters:
>> + * @sock: Pointer to socket to be destroyed
>> + *
>> + * Return:
>> + * On error, may return EOPNOTSUPP, EINVAL.
>> + * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
>> + * 0 otherwise
>> + */
>> +__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
>> +{
>> +	struct sock *sk = (struct sock *)sock;
>> +
>> +	if (!sk)
>> +		return -EINVAL;
>> +
>> +	/* The locking semantics that allow for synchronous execution of the
>> +	 * destroy handlers are only supported for TCP and UDP.
>> +	 */
>> +	if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
>> +		return -EOPNOTSUPP;
> 
> Copy-pasting from v3, let's discuss here.
> 
> Maybe make it more opt-in? (vs current "opt ipproto_raw out")
> 
> if (sk->sk_prot->diag_destroy != udp_abort &&
>    sk->sk_prot->diag_destroy != tcp_abort)
>            return -EOPNOTSUPP;
> 
> Is it more robust? Or does it look uglier? )
> But maybe fine as is, I'm just thinking out loud..

Do we expect the handler to be extended for more types? Probably not... So I'll leave it as is.

> 
>> +
>> +	return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
>> +}
>> +
>> +__diag_pop()
>> +
>> +BTF_SET8_START(sock_destroy_kfunc_set)
>> +BTF_ID_FLAGS(func, bpf_sock_destroy)
>> +BTF_SET8_END(sock_destroy_kfunc_set)
>> +
>> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
>> +	.owner = THIS_MODULE,
>> +	.set   = &sock_destroy_kfunc_set,
>> +};
>> +
>> +static int init_subsystem(void)
>> +{
>> +	return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
>> +}
>> +late_initcall(init_subsystem);
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index 33f559f491c8..5df6231016e3 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
>>  		return 0;
>>  	}
> 
>> -	/* Don't race with userspace socket closes such as tcp_close. */
>> -	lock_sock(sk);
>> +	/* BPF context ensures sock locking. */
>> +	if (!has_current_bpf_ctx())
>> +		/* Don't race with userspace socket closes such as tcp_close. */
>> +		lock_sock(sk);
> 
>>  	if (sk->sk_state == TCP_LISTEN) {
>>  		tcp_set_state(sk, TCP_CLOSE);
>> @@ -4701,9 +4703,11 @@ int tcp_abort(struct sock *sk, int err)
>>  	}
> 
>>  	bh_unlock_sock(sk);
>> +
>>  	local_bh_enable();
>>  	tcp_write_queue_purge(sk);
>> -	release_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		release_sock(sk);
>>  	return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(tcp_abort);
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 58c620243e47..408836102e20 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -2925,7 +2925,8 @@ EXPORT_SYMBOL(udp_poll);
> 
>>  int udp_abort(struct sock *sk, int err)
>>  {
>> -	lock_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		lock_sock(sk);
> 
>>  	/* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
>>  	 * with close()
>> @@ -2938,7 +2939,8 @@ int udp_abort(struct sock *sk, int err)
>>  	__udp_disconnect(sk, 0);
> 
>>  out:
>> -	release_sock(sk);
>> +	if (!has_current_bpf_ctx())
>> +		release_sock(sk);
> 
>>  	return 0;
>>  }
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc
  2023-03-30 14:42     ` Aditi Ghag
@ 2023-03-30 16:32       ` Stanislav Fomichev
  2023-03-30 17:30         ` Martin KaFai Lau
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-03-30 16:32 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, kafai, edumazet

On Thu, Mar 30, 2023 at 7:42 AM Aditi Ghag <aditi.ghag@isovalent.com> wrote:
>
>
>
> > On Mar 24, 2023, at 2:37 PM, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/23, Aditi Ghag wrote:
> >> The socket destroy kfunc is used to forcefully terminate sockets from
> >> certain BPF contexts. We plan to use the capability in Cilium to force
> >> client sockets to reconnect when their remote load-balancing backends are
> >> deleted. The other use case is on-the-fly policy enforcement where existing
> >> socket connections prevented by policies need to be forcefully terminated.
> >> The helper allows terminating sockets that may or may not be actively
> >> sending traffic.
> >
> >> The helper is currently exposed to certain BPF iterators where users can
> >> filter, and terminate selected sockets.  Additionally, the helper can only
> >> be called from these BPF contexts that ensure socket locking in order to
> >> allow synchronous execution of destroy helpers that also acquire socket
> >> locks. The previous commit that batches UDP sockets during iteration
> >> facilitated a synchronous invocation of the destroy helper from BPF context
> >> by skipping taking socket locks in the destroy handler. TCP iterators
> >> already supported batching.
> >
> >> The helper takes a `sock_common` type argument, even though it expects, and
> >> casts it to, a `sock` pointer. This enables the verifier to allow the
> >> sock_destroy kfunc to be called for TCP with `sock_common` and UDP with
> >> `sock` structs. As a comparison, BPF helpers enable this behavior with the
> >> `ARG_PTR_TO_BTF_ID_SOCK_COMMON` argument type. However, there is no such
> >> option available with the verifier logic that handles kfuncs where BTF
> >> types are inferred. Furthermore, as `sock_common` only has a subset of
> >> the fields of `sock`, casting a pointer to the latter type might not
> >> always be safe. Hence, the BPF kfunc converts the argument to a full sock
> >> before casting.
> >
> >> Signed-off-by: Aditi Ghag <aditi.ghag@isovalent.com>
> >> ---
> >>  net/core/filter.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  net/ipv4/tcp.c    | 10 ++++++---
> >>  net/ipv4/udp.c    |  6 ++++--
> >>  3 files changed, 65 insertions(+), 5 deletions(-)
> >
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index 1d6f165923bf..ba3e0dac119c 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> >> @@ -11621,3 +11621,57 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
> >
> >>      return func;
> >>  }
> >> +
> >> +/* Disables missing prototype warnings */
> >> +__diag_push();
> >> +__diag_ignore_all("-Wmissing-prototypes",
> >> +              "Global functions as their definitions will be in vmlinux BTF");
> >> +
> >> +/* bpf_sock_destroy: Destroy the given socket with ECONNABORTED error code.
> >> + *
> >> + * The helper expects a non-NULL pointer to a socket. It invokes the
> >> + * protocol specific socket destroy handlers.
> >> + *
> >> + * The helper can only be called from BPF contexts that have acquired the socket
> >> + * locks.
> >> + *
> >> + * Parameters:
> >> + * @sock: Pointer to socket to be destroyed
> >> + *
> >> + * Return:
> >> + * On error, may return EOPNOTSUPP, EINVAL.
> >> + * EOPNOTSUPP if the protocol specific destroy handler is not implemented.
> >> + * 0 otherwise
> >> + */
> >> +__bpf_kfunc int bpf_sock_destroy(struct sock_common *sock)
> >> +{
> >> +    struct sock *sk = (struct sock *)sock;
> >> +
> >> +    if (!sk)
> >> +            return -EINVAL;
> >> +
> >> +    /* The locking semantics that allow for synchronous execution of the
> >> +     * destroy handlers are only supported for TCP and UDP.
> >> +     */
> >> +    if (!sk->sk_prot->diag_destroy || sk->sk_protocol == IPPROTO_RAW)
> >> +            return -EOPNOTSUPP;
> >
> > Copy-pasting from v3, let's discuss here.
> >
> > Maybe make it more opt-in? (vs current "opt ipproto_raw out")
> >
> > if (sk->sk_prot->diag_destroy != udp_abort &&
> >    sk->sk_prot->diag_destroy != tcp_abort)
> >            return -EOPNOTSUPP;
> >
> > Is it more robust? Or does it look uglier? )
> > But maybe fine as is, I'm just thinking out loud..
>
> Do we expect the handler to be extended for more types? Probably not... So I'll leave it as is.

My worry is about somebody adding .diag_destroy to some new/old
protocol in the future, say sctp_prot, without being aware
of this bpf_sock_destroy helper and its locking requirements.

So having an opt-in here (as in sk_protocol == IPPROTO_TCP ||
sk_protocol == IPPROTO_UDP) feels more future-proof than your current
opt-out (sk_proto != IPPROTO_RAW).
WDYT?

> >
> >> +
> >> +    return sk->sk_prot->diag_destroy(sk, ECONNABORTED);
> >> +}
> >> +
> >> +__diag_pop()
> >> +
> >> +BTF_SET8_START(sock_destroy_kfunc_set)
> >> +BTF_ID_FLAGS(func, bpf_sock_destroy)
> >> +BTF_SET8_END(sock_destroy_kfunc_set)
> >> +
> >> +static const struct btf_kfunc_id_set bpf_sock_destroy_kfunc_set = {
> >> +    .owner = THIS_MODULE,
> >> +    .set   = &sock_destroy_kfunc_set,
> >> +};
> >> +
> >> +static int init_subsystem(void)
> >> +{
> >> +    return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_sock_destroy_kfunc_set);
> >> +}
> >> +late_initcall(init_subsystem);
> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >> index 33f559f491c8..5df6231016e3 100644
> >> --- a/net/ipv4/tcp.c
> >> +++ b/net/ipv4/tcp.c
> >> @@ -4678,8 +4678,10 @@ int tcp_abort(struct sock *sk, int err)
> >>              return 0;
> >>      }
> >
> >> -    /* Don't race with userspace socket closes such as tcp_close. */
> >> -    lock_sock(sk);
> >> +    /* BPF context ensures sock locking. */
> >> +    if (!has_current_bpf_ctx())
> >> +            /* Don't race with userspace socket closes such as tcp_close. */
> >> +            lock_sock(sk);
> >
> >>      if (sk->sk_state == TCP_LISTEN) {
> >>              tcp_set_state(sk, TCP_CLOSE);
> >> @@ -4701,9 +4703,11 @@ int tcp_abort(struct sock *sk, int err)
> >>      }
> >
> >>      bh_unlock_sock(sk);
> >> +
> >>      local_bh_enable();
> >>      tcp_write_queue_purge(sk);
> >> -    release_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            release_sock(sk);
> >>      return 0;
> >>  }
> >>  EXPORT_SYMBOL_GPL(tcp_abort);
> >> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> >> index 58c620243e47..408836102e20 100644
> >> --- a/net/ipv4/udp.c
> >> +++ b/net/ipv4/udp.c
> >> @@ -2925,7 +2925,8 @@ EXPORT_SYMBOL(udp_poll);
> >
> >>  int udp_abort(struct sock *sk, int err)
> >>  {
> >> -    lock_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            lock_sock(sk);
> >
> >>      /* udp{v6}_destroy_sock() sets it under the sk lock, avoid racing
> >>       * with close()
> >> @@ -2938,7 +2939,8 @@ int udp_abort(struct sock *sk, int err)
> >>      __udp_disconnect(sk, 0);
> >
> >>  out:
> >> -    release_sock(sk);
> >> +    if (!has_current_bpf_ctx())
> >> +            release_sock(sk);
> >
> >>      return 0;
> >>  }
> >> --
> >> 2.34.1
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc
  2023-03-30 16:32       ` Stanislav Fomichev
@ 2023-03-30 17:30         ` Martin KaFai Lau
  2023-04-03 15:58           ` Aditi Ghag
  0 siblings, 1 reply; 29+ messages in thread
From: Martin KaFai Lau @ 2023-03-30 17:30 UTC (permalink / raw)
  To: Aditi Ghag; +Cc: bpf, edumazet, Stanislav Fomichev

On 3/30/23 9:32 AM, Stanislav Fomichev wrote:
>>> Maybe make it more opt-in? (vs current "opt ipproto_raw out")
>>>
>>> if (sk->sk_prot->diag_destroy != udp_abort &&
>>>     sk->sk_prot->diag_destroy != tcp_abort)
>>>             return -EOPNOTSUPP;
>>>
>>> Is it more robust? Or does it look uglier? )
>>> But maybe fine as is, I'm just thinking out loud..
>>
>> Do we expect the handler to be extended for more types? Probably not... So I'll leave it as is.
> 
> My worry is about somebody adding .diag_destroy to some new/old
> protocol in the future, say sctp_prot, without being aware
> of this bpf_sock_destroy helper and its locking requirements.

Other helpers in filter.c are also opt-in. I think it is better to do the same
here. IPPROTO_TCP and IPPROTO_UDP should have very good use case coverage to
begin with. It can also help to ensure new selftests are written to cover any
protocol that supports bpf_sock_destroy in the future.

I like the comment in bpf_sock_destroy() in this patch. It will be even better
if it can spell out more clearly that any protocol supported in the future needs
to assume the lock_sock has already been done on the bpf side.

> 
> So having an opt-in here (as in sk_protocol == IPPROTO_TCP ||
> sk_protocol == IPPROTO_UDP) feels more future-proof than your current
> opt-out (sk_proto != IPPROTO_RAW).
> WDYT?
>>>> +            release_sock(sk);
>>>
>>>>       return 0;
>>>>   }
>>>> --
>>>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc
  2023-03-30 17:30         ` Martin KaFai Lau
@ 2023-04-03 15:58           ` Aditi Ghag
  0 siblings, 0 replies; 29+ messages in thread
From: Aditi Ghag @ 2023-04-03 15:58 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: bpf, edumazet, Stanislav Fomichev



> On Mar 30, 2023, at 10:30 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 3/30/23 9:32 AM, Stanislav Fomichev wrote:
>>>> Maybe make it more opt-in? (vs current "opt ipproto_raw out")
>>>> 
>>>> if (sk->sk_prot->diag_destroy != udp_abort &&
>>>>    sk->sk_prot->diag_destroy != tcp_abort)
>>>>            return -EOPNOTSUPP;
>>>> 
>>>> Is it more robust? Or does it look uglier? )
>>>> But maybe fine as is, I'm just thinking out loud..
>>> 
>>> Do we expect the handler to be extended for more types? Probably not... So I'll leave it as is.
>> My worry is about somebody adding .diag_destroy to some new/old
>> protocol in the future, say sctp_prot, without being aware
>> of this bpf_sock_destroy helper and its locking requirements.

Ah, sctp!! 

> 
> Other helpers in filter.c are also opt-in. I think it is better to do the same here. IPPROTO_TCP and IPPROTO_UDP should have very good use case coverage to begin with. It can also help to ensure new selftests are written to cover any protocol that supports bpf_sock_destroy in the future.
> 
> I like the comment in bpf_sock_destroy() in this patch. It will be even better if it can spell out more clearly that any protocol supported in the future needs to assume the lock_sock has already been done on the bpf side.

Ack... I'll make it opt-in. 
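
Roughly something like this (only a sketch of what I have in mind; the final
comment wording and placement may differ):

	/* Only TCP and UDP are supported for now. The BPF context (iterator)
	 * already holds the socket lock, so any protocol opting in here must
	 * expect its diag_destroy handler to run with the lock already taken.
	 */
	if (sk->sk_protocol != IPPROTO_TCP && sk->sk_protocol != IPPROTO_UDP)
		return -EOPNOTSUPP;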

> 
>> So having an opt-in here (as in sk_protocol == IPPROTO_TCP ||
>> sk_protocol == IPPROTO_UDP) feels more future-proof than your current
>> opt-out (sk_proto != IPPROTO_RAW).
>> WDYT?
>>>>> +            release_sock(sk);
>>>> 
>>>>>      return 0;
>>>>>  }
>>>>> --
>>>>> 2.34.1
>>> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-04-03 15:59 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-23 20:06 [PATCH v4 bpf-next 0/5] bpf-nex: Add socket destroy capability Aditi Ghag
2023-03-23 20:06 ` [PATCH v4 bpf-next 1/4] bpf: Implement batching in UDP iterator Aditi Ghag
2023-03-24 21:56   ` Stanislav Fomichev
2023-03-27 15:52     ` Aditi Ghag
2023-03-27 16:52       ` Stanislav Fomichev
2023-03-27 22:28   ` Martin KaFai Lau
2023-03-28 17:06     ` Aditi Ghag
2023-03-28 21:33       ` Martin KaFai Lau
2023-03-29 16:20         ` Aditi Ghag
2023-03-23 20:06 ` [PATCH v4 bpf-next 2/4] bpf: Add bpf_sock_destroy kfunc Aditi Ghag
2023-03-23 23:58   ` Martin KaFai Lau
2023-03-24 21:37   ` Stanislav Fomichev
2023-03-30 14:42     ` Aditi Ghag
2023-03-30 16:32       ` Stanislav Fomichev
2023-03-30 17:30         ` Martin KaFai Lau
2023-04-03 15:58           ` Aditi Ghag
2023-03-23 20:06 ` [PATCH v4 bpf-next 3/4] bpf,tcp: Avoid taking fast sock lock in iterator Aditi Ghag
2023-03-24 21:45   ` Stanislav Fomichev
2023-03-28 15:20     ` Aditi Ghag
2023-03-27 22:34   ` Martin KaFai Lau
2023-03-23 20:06 ` [PATCH v4 bpf-next 4/4] selftests/bpf: Add tests for bpf_sock_destroy Aditi Ghag
2023-03-24 21:52   ` Stanislav Fomichev
2023-03-27 15:57     ` Aditi Ghag
2023-03-27 16:54       ` Stanislav Fomichev
2023-03-28 17:50         ` Aditi Ghag
2023-03-28 18:35           ` Stanislav Fomichev
2023-03-29 23:13             ` Aditi Ghag
2023-03-29 23:25               ` Aditi Ghag
2023-03-29 23:25               ` Stanislav Fomichev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).