* [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
@ 2023-12-19 19:32 Martin KaFai Lau
  2023-12-19 19:32 ` [PATCH bpf 2/2] selftests/bpf: Test udp and tcp iter batching Martin KaFai Lau
  2023-12-20 14:24 ` [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp Daniel Borkmann
  0 siblings, 2 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2023-12-19 19:32 UTC (permalink / raw)
  To: bpf
  Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ',
	'Daniel Borkmann ',
	netdev, kernel-team, Aditi Ghag

From: Martin KaFai Lau <martin.lau@kernel.org>

The bpf_iter_udp iterates all udp_sk by walking the udp_table. While
doing so, it stores all udp_sk of a bucket; the term used in the kernel
code is to "batch" the whole bucket. The reason for batching is to allow
lock_sock() on each socket before calling the bpf prog, so that the bpf
prog can safely call helpers/kfuncs that change the sk's state,
e.g. bpf_setsockopt.
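
As an illustration (not part of this series), a bpf prog that relies on
this could look roughly like the sketch below. The prog name, the
SO_SNDBUF choice and the option constants are made up for the example,
and it assumes bpf_setsockopt is usable from the iter/udp context as
described above:

// SPDX-License-Identifier: GPL-2.0
/* Hypothetical sketch, not part of this series. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define SOL_SOCKET	1
#define SO_SNDBUF	7

SEC("iter/udp")
int bump_udp_sndbuf(struct bpf_iter__udp *ctx)
{
        struct sock *sk = (struct sock *)ctx->udp_sk;
        int sndbuf = 1 << 20;

        if (!sk)
                return 0;

        /* bpf_iter_udp batches the bucket and lock_sock()s each sk
         * before running this prog, so a state-changing helper is
         * safe here.
         */
        bpf_setsockopt(sk, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
        return 0;
}

char _license[] SEC("license") = "GPL";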

There is a bug in the bpf_iter_udp_batch() function that prevents
the userspace from making forward progress.

The bug is triggered when the userspace passes in a very small read
buffer. When the bpf prog does bpf_seq_printf, the userspace read
buffer is not large enough to capture the whole "batch".

When the read buffer is not enough for the whole "batch", the kernel
will remember the offset of the batch in iter->offset such that
the next userspace read() can continue from where it left off.

In the next read(), the kernel will skip "iter->offset" sockets.
However, the code directly decrements "--iter->offset" while skipping.
This is incorrect because the next read() may not consume the whole
"batch" either, and the read() after that will then start from offset 0
again.

Doing "--iter->offset" is essentially making backward progress.
The net effect is that the userspace keeps re-reading from the beginning
of a bucket and the iteration never finishes. "iter->offset" must always
go forward until the whole "batch" (or bucket) is consumed by the
userspace.
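
For illustration, a minimal userspace reader of the kind that hits this
case (a hedged sketch, not taken from the selftests; "iter_fd" is
assumed to come from bpf_iter_create() on an attached iter/udp link):

#include <unistd.h>

/* Drain a bpf_iter_udp fd one byte at a time.  Each tiny read() only
 * consumes part of the batched bucket, so the kernel must resume from
 * iter->offset on the next read().  With the "--iter->offset" bug, the
 * resume point moves backward and this loop never reaches EOF once a
 * bucket holds more than one socket.
 */
static ssize_t drain_one_byte_at_a_time(int iter_fd)
{
        char buf[1];
        ssize_t n;

        do {
                n = read(iter_fd, buf, sizeof(buf));
        } while (n > 0);

        return n;       /* 0 on EOF, < 0 on error */
}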

This patch fixes it by doing the decrement in a local stack
variable.

Cc: Aditi Ghag <aditi.ghag@isovalent.com>
Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 net/ipv4/udp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 89e5a806b82e..6cf4151c2eb4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3141,6 +3141,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	unsigned int batch_sks = 0;
 	bool resized = false;
 	struct sock *sk;
+	int offset;
 
 	/* The current batch is done, so advance the bucket. */
 	if (iter->st_bucket_done) {
@@ -3162,6 +3163,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	iter->end_sk = 0;
 	iter->st_bucket_done = false;
 	batch_sks = 0;
+	offset = iter->offset;
 
 	for (; state->bucket <= udptable->mask; state->bucket++) {
 		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
@@ -3177,8 +3179,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 				/* Resume from the last iterated socket at the
 				 * offset in the bucket before iterator was stopped.
 				 */
-				if (iter->offset) {
-					--iter->offset;
+				if (offset) {
+					--offset;
 					continue;
 				}
 				if (iter->end_sk < iter->max_sk) {
-- 
2.34.1



* [PATCH bpf 2/2] selftests/bpf: Test udp and tcp iter batching
  2023-12-19 19:32 [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp Martin KaFai Lau
@ 2023-12-19 19:32 ` Martin KaFai Lau
  2023-12-20 14:24 ` [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp Daniel Borkmann
  1 sibling, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2023-12-19 19:32 UTC (permalink / raw)
  To: bpf
  Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ',
	'Daniel Borkmann ',
	netdev, kernel-team

From: Martin KaFai Lau <martin.lau@kernel.org>

The patch adds a test to exercise the bpf_iter_udp batching
logic. It specifically tests the case where there are multiple
so_reuseport udp_sk in a bucket of the udp_table while the
userspace reads only one udp_sk at a time from the bpf_iter_udp
prog. This is the "read_one == true" case in
"read_batch(..., bool read_one)" and is the buggy case that the
previous patch fixed.

It also tests the "read_one == false" case, where the userspace
reads the whole bucket at once. There is no bug in this case,
but the test is added while at it.

Since the way to get multiple tcp_sk into the same bucket is
similar (by using so_reuseport), this patch also tests
bpf_iter_tcp, even though the bpf_iter_tcp batching logic works
correctly.

Both IPv4 and IPv6 exercise the same bpf_iter batching code
path, so only IPv6 is tested.
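
For reference, a plain-socket sketch of how so_reuseport lands multiple
udp_sk in the same bucket (the selftest uses the start_reuseport_server()
helper from network_helpers.h instead; the function name and explicit
port parameter here are only illustrative):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Bind "n" UDP sockets to the same [::1]:port with SO_REUSEPORT so that
 * they all hash into the same udp_table bucket.  Error handling is
 * omitted to keep the sketch short.
 */
static void make_same_bucket_socks(int *fds, int n, unsigned short port)
{
        struct sockaddr_in6 addr = {
                .sin6_family = AF_INET6,
                .sin6_addr = IN6ADDR_LOOPBACK_INIT,
                .sin6_port = htons(port),
        };
        int one = 1, i;

        for (i = 0; i < n; i++) {
                fds[i] = socket(AF_INET6, SOCK_DGRAM, 0);
                setsockopt(fds[i], SOL_SOCKET, SO_REUSEPORT,
                           &one, sizeof(one));
                bind(fds[i], (struct sockaddr *)&addr, sizeof(addr));
        }
}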

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 .../bpf/prog_tests/sock_iter_batch.c          | 101 ++++++++++++++++++
 .../selftests/bpf/progs/sock_iter_batch.c     |  43 ++++++++
 2 files changed, 144 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
 create mode 100644 tools/testing/selftests/bpf/progs/sock_iter_batch.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
new file mode 100644
index 000000000000..38aa564862e8
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sock_iter_batch.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2023 Meta
+
+#include <test_progs.h>
+#include "network_helpers.h"
+#include "sock_iter_batch.skel.h"
+
+#define TEST_NS "sock_iter_batch_netns"
+
+static const char expected_char = 'x';
+static const int nr_soreuse = 4;
+
+static void read_batch(struct bpf_program *prog, bool read_one)
+{
+	int iter_fd, i, nread, total_nread = 0;
+	struct bpf_link *link;
+	char b[nr_soreuse];
+
+	link = bpf_program__attach_iter(prog, NULL);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
+		return;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create")) {
+		bpf_link__destroy(link);
+		return;
+	}
+
+	do {
+		nread = read(iter_fd, b, read_one ? 1 : nr_soreuse);
+		if (nread <= 0)
+			break;
+
+		for (i = 0; i < nread; i++)
+			ASSERT_EQ(b[i], expected_char, "b[i]");
+
+		total_nread += nread;
+	} while (total_nread <= nr_soreuse);
+
+	ASSERT_EQ(nread, 0, "nread");
+	ASSERT_EQ(total_nread, nr_soreuse, "total_nread");
+
+	close(iter_fd);
+	bpf_link__destroy(link);
+}
+
+static void do_test(int sock_type)
+{
+	struct sock_iter_batch *skel;
+	int *fds, err;
+
+	fds = start_reuseport_server(AF_INET6, sock_type, "::1", 0, 0,
+				     nr_soreuse);
+	if (!ASSERT_OK_PTR(fds, "start_reuseport_server"))
+		return;
+
+	skel = sock_iter_batch__open();
+	if (!ASSERT_OK_PTR(skel, "sock_iter_batch__open"))
+		goto done;
+
+	skel->rodata->local_port = ntohs(get_socket_local_port(fds[0]));
+	skel->rodata->expected_char = expected_char;
+
+	err = sock_iter_batch__load(skel);
+	if (!ASSERT_OK(err, "sock_iter_batch__load"))
+		goto done;
+
+	if (sock_type == SOCK_STREAM) {
+		read_batch(skel->progs.iter_tcp_soreuse, true);
+		read_batch(skel->progs.iter_tcp_soreuse, false);
+	} else {
+		read_batch(skel->progs.iter_udp_soreuse, true);
+		read_batch(skel->progs.iter_udp_soreuse, false);
+	}
+
+done:
+	sock_iter_batch__destroy(skel);
+	free_fds(fds, nr_soreuse);
+}
+
+void test_sock_iter_batch(void)
+{
+	struct nstoken *nstoken = NULL;
+
+	SYS_NOFAIL("ip netns del " TEST_NS " &> /dev/null");
+	SYS(done, "ip netns add %s", TEST_NS);
+	SYS(done, "ip -net %s link set dev lo up", TEST_NS);
+
+	nstoken = open_netns(TEST_NS);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto done;
+
+	if (test__start_subtest("tcp"))
+		do_test(SOCK_STREAM);
+	if (test__start_subtest("udp"))
+		do_test(SOCK_DGRAM);
+
+done:
+	close_netns(nstoken);
+	SYS_NOFAIL("ip netns del " TEST_NS " &> /dev/null");
+}
diff --git a/tools/testing/selftests/bpf/progs/sock_iter_batch.c b/tools/testing/selftests/bpf/progs/sock_iter_batch.c
new file mode 100644
index 000000000000..4264df162d83
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/sock_iter_batch.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2023 Meta
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_tracing_net.h"
+#include "bpf_kfuncs.h"
+
+volatile const __u16 local_port;
+volatile const char expected_char;
+
+SEC("iter/tcp")
+int iter_tcp_soreuse(struct bpf_iter__tcp *ctx)
+{
+	struct sock *sk = (struct sock *)ctx->sk_common;
+
+	if (!sk)
+		return 0;
+
+	sk = bpf_rdonly_cast(sk, bpf_core_type_id_kernel(struct sock));
+	if (sk->sk_family == AF_INET6 && sk->sk_num == local_port)
+		bpf_seq_write(ctx->meta->seq, (void *)&expected_char, 1);
+
+	return 0;
+}
+
+SEC("iter/udp")
+int iter_udp_soreuse(struct bpf_iter__udp *ctx)
+{
+	struct sock *sk = (struct sock *)ctx->udp_sk;
+
+	if (!sk)
+		return 0;
+
+	sk = bpf_rdonly_cast(sk, bpf_core_type_id_kernel(struct sock));
+	if (sk->sk_family == AF_INET6 && sk->sk_num == local_port)
+		bpf_seq_write(ctx->meta->seq, (void *)&expected_char, 1);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.34.1



* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2023-12-19 19:32 [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp Martin KaFai Lau
  2023-12-19 19:32 ` [PATCH bpf 2/2] selftests/bpf: Test udp and tcp iter batching Martin KaFai Lau
@ 2023-12-20 14:24 ` Daniel Borkmann
  2023-12-20 19:10   ` Martin KaFai Lau
  1 sibling, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2023-12-20 14:24 UTC (permalink / raw)
  To: Martin KaFai Lau, bpf
  Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ',
	netdev, kernel-team, Aditi Ghag

On 12/19/23 8:32 PM, Martin KaFai Lau wrote:
> From: Martin KaFai Lau <martin.lau@kernel.org>
> 
> The bpf_iter_udp iterates all udp_sk by iterating the udp_table.
> The bpf_iter_udp stores all udp_sk of a bucket while iterating
> the udp_table. The term used in the kernel code is "batch" the
> whole bucket. The reason for batching is to allow lock_sock() on
> each socket before calling the bpf prog such that the bpf prog can
> safely call helper/kfunc that changes the sk's state,
> e.g. bpf_setsockopt.
> 
> There is a bug in the bpf_iter_udp_batch() function that stops
> the userspace from making forward progress.
> 
> The case that triggers the bug is the userspace passed in
> a very small read buffer. When the bpf prog does bpf_seq_printf,
> the userspace read buffer is not enough to capture the whole "batch".
> 
> When the read buffer is not enough for the whole "batch", the kernel
> will remember the offset of the batch in iter->offset such that
> the next userspace read() can continue from where it left off.
> 
> The kernel will skip the number (== "iter->offset") of sockets in
> the next read(). However, the code directly decrements the
> "--iter->offset". This is incorrect because the next read() may
> not consume the whole "batch" either and the next next read() will
> start from offset 0.
> 
> Doing "--iter->offset" is essentially making backward progress.
> The net effect is the userspace will keep reading from the beginning
> of a bucket and the process will never finish. "iter->offset" must always
> go forward until the whole "batch" (or bucket) is consumed by the
> userspace.
> 
> This patch fixes it by doing the decrement in a local stack
> variable.

nit: Probably makes sense to also state here that bpf_iter_tcp does
not have this issue, so it's clear from the commit message that you
also audited the TCP one.

> Cc: Aditi Ghag <aditi.ghag@isovalent.com>
> Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> ---
>   net/ipv4/udp.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 89e5a806b82e..6cf4151c2eb4 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3141,6 +3141,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>   	unsigned int batch_sks = 0;
>   	bool resized = false;
>   	struct sock *sk;
> +	int offset;
>   
>   	/* The current batch is done, so advance the bucket. */
>   	if (iter->st_bucket_done) {
> @@ -3162,6 +3163,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>   	iter->end_sk = 0;
>   	iter->st_bucket_done = false;
>   	batch_sks = 0;
> +	offset = iter->offset;
>   
>   	for (; state->bucket <= udptable->mask; state->bucket++) {
>   		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
> @@ -3177,8 +3179,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>   				/* Resume from the last iterated socket at the
>   				 * offset in the bucket before iterator was stopped.
>   				 */
> -				if (iter->offset) {
> -					--iter->offset;
> +				if (offset) {
> +					--offset;
>   					continue;
>   				}
>   				if (iter->end_sk < iter->max_sk) {
> 

Do we also need something like:

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6cf4151c2eb4..ef4fc436253d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3169,7 +3169,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
                 struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];

                 if (hlist_empty(&hslot2->head)) {
-                       iter->offset = 0;
+                       iter->offset = offset = 0;
                         continue;
                 }

@@ -3196,7 +3196,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
                         break;

                 /* Reset the current bucket's offset before moving to the next bucket. */
-               iter->offset = 0;
+               iter->offset = offset = 0;
         }

         /* All done: no batch made. */

For the case when, upon retry, the current batch is done earlier than expected?

Thanks,
Daniel


* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2023-12-20 14:24 ` [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp Daniel Borkmann
@ 2023-12-20 19:10   ` Martin KaFai Lau
  2023-12-21  4:45     ` Martin KaFai Lau
  0 siblings, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2023-12-20 19:10 UTC (permalink / raw)
  To: Daniel Borkmann, bpf
  Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ',
	netdev, kernel-team, Aditi Ghag

[-- Attachment #1: Type: text/plain, Size: 7080 bytes --]

On 12/20/23 6:24 AM, Daniel Borkmann wrote:
> On 12/19/23 8:32 PM, Martin KaFai Lau wrote:
>> From: Martin KaFai Lau <martin.lau@kernel.org>
>>
>> The bpf_iter_udp iterates all udp_sk by iterating the udp_table.
>> The bpf_iter_udp stores all udp_sk of a bucket while iterating
>> the udp_table. The term used in the kernel code is "batch" the
>> whole bucket. The reason for batching is to allow lock_sock() on
>> each socket before calling the bpf prog such that the bpf prog can
>> safely call helper/kfunc that changes the sk's state,
>> e.g. bpf_setsockopt.
>>
>> There is a bug in the bpf_iter_udp_batch() function that stops
>> the userspace from making forward progress.
>>
>> The case that triggers the bug is the userspace passed in
>> a very small read buffer. When the bpf prog does bpf_seq_printf,
>> the userspace read buffer is not enough to capture the whole "batch".
>>
>> When the read buffer is not enough for the whole "batch", the kernel
>> will remember the offset of the batch in iter->offset such that
>> the next userspace read() can continue from where it left off.
>>
>> The kernel will skip the number (== "iter->offset") of sockets in
>> the next read(). However, the code directly decrements the
>> "--iter->offset". This is incorrect because the next read() may
>> not consume the whole "batch" either and the next next read() will
>> start from offset 0.
>>
>> Doing "--iter->offset" is essentially making backward progress.
>> The net effect is the userspace will keep reading from the beginning
>> of a bucket and the process will never finish. "iter->offset" must always
>> go forward until the whole "batch" (or bucket) is consumed by the
>> userspace.
>>
>> This patch fixes it by doing the decrement in a local stack
>> variable.
> 
> nit: Probably makes sense to also state here that bpf_iter_tcp does
> not have this issue, so it's clear from commit message that you did
> also audit the TCP one.

Ack.

> 
>> Cc: Aditi Ghag <aditi.ghag@isovalent.com>
>> Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
>> ---
>>   net/ipv4/udp.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 89e5a806b82e..6cf4151c2eb4 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -3141,6 +3141,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
>> *seq)
>>       unsigned int batch_sks = 0;
>>       bool resized = false;
>>       struct sock *sk;
>> +    int offset;
>>       /* The current batch is done, so advance the bucket. */
>>       if (iter->st_bucket_done) {
>> @@ -3162,6 +3163,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
>> *seq)
>>       iter->end_sk = 0;
>>       iter->st_bucket_done = false;
>>       batch_sks = 0;
>> +    offset = iter->offset;
>>       for (; state->bucket <= udptable->mask; state->bucket++) {
>>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>> @@ -3177,8 +3179,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
>> *seq)
>>                   /* Resume from the last iterated socket at the
>>                    * offset in the bucket before iterator was stopped.
>>                    */
>> -                if (iter->offset) {
>> -                    --iter->offset;
>> +                if (offset) {
>> +                    --offset;
>>                       continue;
>>                   }
>>                   if (iter->end_sk < iter->max_sk) {
>>
> 
> Do we also need something like :
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 6cf4151c2eb4..ef4fc436253d 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3169,7 +3169,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>                  struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
> 
>                  if (hlist_empty(&hslot2->head)) {
> -                       iter->offset = 0;
> +                       iter->offset = offset = 0;
>                          continue;
>                  }
> 
> @@ -3196,7 +3196,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>                          break;
> 
>                  /* Reset the current bucket's offset before moving to the next 
> bucket. */
> -               iter->offset = 0;
> +               iter->offset = offset = 0;
>          }
> 
>          /* All done: no batch made. */
> 
> For the case when upon retry the current batch is done earlier than expected ?

Good catch. It will unnecessarily skip sockets in the following batch/bucket if
there are changes in the current batch/bucket.

From looking at the loop again, I think it is better not to change iter->offset
during the for loop and only update it after the for loop has concluded.

The non-zero iter->offset is only useful for the first bucket, so do a test on
the first bucket (state->bucket == bucket) before skipping sockets. Something
like this:

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 89e5a806b82e..a993f364d6ae 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	struct net *net = seq_file_net(seq);
  	struct udp_table *udptable;
  	unsigned int batch_sks = 0;
+	int bucket, bucket_offset;
  	bool resized = false;
  	struct sock *sk;

@@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	iter->end_sk = 0;
  	iter->st_bucket_done = false;
  	batch_sks = 0;
+	bucket = state->bucket;
+	bucket_offset = 0;

  	for (; state->bucket <= udptable->mask; state->bucket++) {
  		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];

-		if (hlist_empty(&hslot2->head)) {
-			iter->offset = 0;
+		if (hlist_empty(&hslot2->head))
  			continue;
-		}

  		spin_lock_bh(&hslot2->lock);
  		udp_portaddr_for_each_entry(sk, &hslot2->head) {
@@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  				/* Resume from the last iterated socket at the
  				 * offset in the bucket before iterator was stopped.
  				 */
-				if (iter->offset) {
-					--iter->offset;
+				if (state->bucket == bucket &&
+				    bucket_offset < iter->offset) {
+					++bucket_offset;
  					continue;
  				}
  				if (iter->end_sk < iter->max_sk) {
@@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)

  		if (iter->end_sk)
  			break;
+	}

-		/* Reset the current bucket's offset before moving to the next bucket. */
+	if (state->bucket != bucket)
  		iter->offset = 0;
-	}

  	/* All done: no batch made. */
  	if (!iter->end_sk)

[-- Attachment #2: 0001-bpf-Avoid-iter-offset-making-backward-progress-in-bp.patch --]
[-- Type: text/plain, Size: 3696 bytes --]

From 2d70cbeb92daff876dcf88733b8d745393f033b0 Mon Sep 17 00:00:00 2001
From: Martin KaFai Lau <martin.lau@kernel.org>
Date: Mon, 18 Dec 2023 17:33:10 -0800
Subject: [PATCH v2 bpf 1/2] bpf: Avoid iter->offset making backward progress
 in bpf_iter_udp

The bpf_iter_udp iterates all udp_sk by iterating the udp_table.
The bpf_iter_udp stores all udp_sk of a bucket while iterating
the udp_table. The term used in the kernel code is "batch" the
whole bucket. The reason for batching is to allow lock_sock() on
each socket before calling the bpf prog such that the bpf prog can
safely call helper/kfunc that changes the sk's state,
e.g. bpf_setsockopt.

There is a bug in the bpf_iter_udp_batch() function that stops
the userspace from making forward progress.

The case that triggers the bug is the userspace passed in
a very small read buffer. When the bpf prog does bpf_seq_printf,
the userspace read buffer is not enough to capture the whole "batch".

When the read buffer is not enough for the whole "batch", the kernel
will remember the offset of the batch in iter->offset such that
the next userspace read() can continue from where it left off.

The kernel will skip the number (== "iter->offset") of sockets in
the next read(). However, the code directly decrements the
"--iter->offset". This is incorrect because the next read() may
not consume the whole "batch" either and the next next read() will
start from offset 0.

Doing "--iter->offset" is essentially making backward progress.
The net effect is the userspace will keep reading from the beginning
of a bucket and the process will never finish. "iter->offset" must always
go forward until the whole "batch" (or bucket) is consumed by the
userspace.

This patch fixes it by doing the decrement in a local stack
variable.

Cc: Aditi Ghag <aditi.ghag@isovalent.com>
Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
---
 net/ipv4/udp.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 89e5a806b82e..a993f364d6ae 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	struct net *net = seq_file_net(seq);
 	struct udp_table *udptable;
 	unsigned int batch_sks = 0;
+	int bucket, bucket_offset;
 	bool resized = false;
 	struct sock *sk;
 
@@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 	iter->end_sk = 0;
 	iter->st_bucket_done = false;
 	batch_sks = 0;
+	bucket = state->bucket;
+	bucket_offset = 0;
 
 	for (; state->bucket <= udptable->mask; state->bucket++) {
 		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
 
-		if (hlist_empty(&hslot2->head)) {
-			iter->offset = 0;
+		if (hlist_empty(&hslot2->head))
 			continue;
-		}
 
 		spin_lock_bh(&hslot2->lock);
 		udp_portaddr_for_each_entry(sk, &hslot2->head) {
@@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 				/* Resume from the last iterated socket at the
 				 * offset in the bucket before iterator was stopped.
 				 */
-				if (iter->offset) {
-					--iter->offset;
+				if (state->bucket == bucket &&
+				    bucket_offset < iter->offset) {
+					++bucket_offset;
 					continue;
 				}
 				if (iter->end_sk < iter->max_sk) {
@@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
 
 		if (iter->end_sk)
 			break;
+	}
 
-		/* Reset the current bucket's offset before moving to the next bucket. */
+	if (state->bucket != bucket)
 		iter->offset = 0;
-	}
 
 	/* All done: no batch made. */
 	if (!iter->end_sk)
-- 
2.34.1



* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2023-12-20 19:10   ` Martin KaFai Lau
@ 2023-12-21  4:45     ` Martin KaFai Lau
  2023-12-21 13:21       ` Daniel Borkmann
  0 siblings, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2023-12-21  4:45 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ',
	netdev, kernel-team, Aditi Ghag, bpf

On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
> Good catch. It will unnecessary skip in the following batch/bucket if there is 
> changes in the current batch/bucket.
> 
>  From looking at the loop again, I think it is better not to change the 
> iter->offset during the for loop. Only update iter->offset after the for loop 
> has concluded.
> 
> The non-zero iter->offset is only useful for the first bucket, so does a test on 
> the first bucket (state->bucket == bucket) before skipping sockets. Something 
> like this:
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 89e5a806b82e..a993f364d6ae 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>       struct net *net = seq_file_net(seq);
>       struct udp_table *udptable;
>       unsigned int batch_sks = 0;
> +    int bucket, bucket_offset;
>       bool resized = false;
>       struct sock *sk;
> 
> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
> *seq)
>       iter->end_sk = 0;
>       iter->st_bucket_done = false;
>       batch_sks = 0;
> +    bucket = state->bucket;
> +    bucket_offset = 0;
> 
>       for (; state->bucket <= udptable->mask; state->bucket++) {
>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
> 
> -        if (hlist_empty(&hslot2->head)) {
> -            iter->offset = 0;
> +        if (hlist_empty(&hslot2->head))
>               continue;
> -        }
> 
>           spin_lock_bh(&hslot2->lock);
>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>                   /* Resume from the last iterated socket at the
>                    * offset in the bucket before iterator was stopped.
>                    */
> -                if (iter->offset) {
> -                    --iter->offset;
> +                if (state->bucket == bucket &&
> +                    bucket_offset < iter->offset) {
> +                    ++bucket_offset;
>                       continue;
>                   }
>                   if (iter->end_sk < iter->max_sk) {
> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
> *seq)
> 
>           if (iter->end_sk)
>               break;
> +    }
> 
> -        /* Reset the current bucket's offset before moving to the next bucket. */
> +    if (state->bucket != bucket)
>           iter->offset = 0;
> -    }
> 
>       /* All done: no batch made. */
>       if (!iter->end_sk)

I think I found another bug in the current bpf_iter_udp_batch(). The
"state->bucket--;" at the end of the batch() function is also wrong. It does not
need to go back to the previous bucket. After realloc with a larger batch array,
it should retry with the current "state->bucket" as is. I forced the bind() to
use bucket 0 and bound a larger so_reuseport set (24 sockets), and
WARN_ON(state->bucket < 0) triggered.

Going back to this bug (backward progress on --iter->offset), I think it is a
bit cleaner to always reset iter->offset to 0 and only advance iter->offset to
the resume_offset when needed. Something like this:

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 89e5a806b82e..184aa966a006 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3137,16 +3137,18 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	struct bpf_udp_iter_state *iter = seq->private;
  	struct udp_iter_state *state = &iter->state;
  	struct net *net = seq_file_net(seq);
+	int resume_bucket, resume_offset;
  	struct udp_table *udptable;
  	unsigned int batch_sks = 0;
  	bool resized = false;
  	struct sock *sk;

+	resume_bucket = state->bucket;
+	resume_offset = iter->offset;
+
  	/* The current batch is done, so advance the bucket. */
-	if (iter->st_bucket_done) {
+	if (iter->st_bucket_done)
  		state->bucket++;
-		iter->offset = 0;
-	}

  	udptable = udp_get_table_seq(seq, net);

@@ -3166,10 +3168,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	for (; state->bucket <= udptable->mask; state->bucket++) {
  		struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];

-		if (hlist_empty(&hslot2->head)) {
-			iter->offset = 0;
+		iter->offset = 0;
+		if (hlist_empty(&hslot2->head))
  			continue;
-		}

  		spin_lock_bh(&hslot2->lock);
  		udp_portaddr_for_each_entry(sk, &hslot2->head) {
@@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  				/* Resume from the last iterated socket at the
  				 * offset in the bucket before iterator was stopped.
  				 */
-				if (iter->offset) {
-					--iter->offset;
+				if (state->bucket == resume_bucket &&
+				    iter->offset < resume_offset) {
+					++iter->offset;
  					continue;
  				}
  				if (iter->end_sk < iter->max_sk) {
@@ -3192,9 +3194,6 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)

  		if (iter->end_sk)
  			break;
-
-		/* Reset the current bucket's offset before moving to the next bucket. */
-		iter->offset = 0;
  	}

  	/* All done: no batch made. */
@@ -3210,10 +3209,6 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	}
  	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
  		resized = true;
-		/* After allocating a larger batch, retry one more time to grab
-		 * the whole bucket.
-		 */
-		state->bucket--;
  		goto again;
  	}
  done:



* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2023-12-21  4:45     ` Martin KaFai Lau
@ 2023-12-21 13:21       ` Daniel Borkmann
  2023-12-21 14:58         ` Martin KaFai Lau
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2023-12-21 13:21 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ',
	netdev, kernel-team, Aditi Ghag, bpf

On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>> Good catch. It will unnecessary skip in the following batch/bucket if there is changes in the current batch/bucket.
>>
>>  From looking at the loop again, I think it is better not to change the iter->offset during the for loop. Only update iter->offset after the for loop has concluded.
>>
>> The non-zero iter->offset is only useful for the first bucket, so does a test on the first bucket (state->bucket == bucket) before skipping sockets. Something like this:
>>
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 89e5a806b82e..a993f364d6ae 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       struct net *net = seq_file_net(seq);
>>       struct udp_table *udptable;
>>       unsigned int batch_sks = 0;
>> +    int bucket, bucket_offset;
>>       bool resized = false;
>>       struct sock *sk;
>>
>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       iter->end_sk = 0;
>>       iter->st_bucket_done = false;
>>       batch_sks = 0;
>> +    bucket = state->bucket;
>> +    bucket_offset = 0;
>>
>>       for (; state->bucket <= udptable->mask; state->bucket++) {
>>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>
>> -        if (hlist_empty(&hslot2->head)) {
>> -            iter->offset = 0;
>> +        if (hlist_empty(&hslot2->head))
>>               continue;
>> -        }
>>
>>           spin_lock_bh(&hslot2->lock);
>>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>                   /* Resume from the last iterated socket at the
>>                    * offset in the bucket before iterator was stopped.
>>                    */
>> -                if (iter->offset) {
>> -                    --iter->offset;
>> +                if (state->bucket == bucket &&
>> +                    bucket_offset < iter->offset) {
>> +                    ++bucket_offset;
>>                       continue;
>>                   }
>>                   if (iter->end_sk < iter->max_sk) {
>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>
>>           if (iter->end_sk)
>>               break;
>> +    }
>>
>> -        /* Reset the current bucket's offset before moving to the next bucket. */
>> +    if (state->bucket != bucket)
>>           iter->offset = 0;
>> -    }
>>
>>       /* All done: no batch made. */
>>       if (!iter->end_sk)
> 
> I think I found another bug in the current bpf_iter_udp_batch(). The "state->bucket--;" at the end of the batch() function is wrong also. It does not need to go back to the previous bucket. After realloc with a larger batch array, it should retry on the "state->bucket" as is. I tried to force the bind() to use bucket 0 and bind a larger so_reuseport set (24 sockets). WARN_ON(state->bucket < 0) triggered.
> 
> Going back to this bug (backward progress on --iter->offset), I think it is a bit cleaner to always reset iter->offset to 0 and advance iter->offset to the resume_offset only when needed. Something like this:

Hm, my assumption was: why not do something like the below and fully start over?

I'm mostly puzzled about the side-effects here, in particular that the sockets
in the bucket could already have changed for the rerun. Maybe I'm still missing
something - what exactly do we need to deal with in the worst case when we have
to retry everything, and what guarantees do we have?

(only compile tested)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 89e5a806b82e..ca62a4bb7bec 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	struct udp_iter_state *state = &iter->state;
  	struct net *net = seq_file_net(seq);
  	struct udp_table *udptable;
-	unsigned int batch_sks = 0;
+	int orig_bucket, orig_offset;
+	unsigned int i, batch_sks = 0;
  	bool resized = false;
  	struct sock *sk;

@@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	}

  	udptable = udp_get_table_seq(seq, net);
-
+	orig_bucket = state->bucket;
+	orig_offset = iter->offset;
  again:
  	/* New batch for the next bucket.
  	 * Iterate over the hash table to find a bucket with sockets matching
@@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
  	if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
  		resized = true;
  		/* After allocating a larger batch, retry one more time to grab
-		 * the whole bucket.
+		 * the whole bucket. Drop the current refs since for the next
+		 * attempt the composition could have changed, thus start over.
  		 */
-		state->bucket--;
+		for (i = 0; i < iter->end_sk; i++) {
+			sock_put(iter->batch[i]);
+			iter->batch[i] = NULL;
+		}
+		state->bucket = orig_bucket;
+		iter->offset = orig_offset;
  		goto again;
  	}
  done:


* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2023-12-21 13:21       ` Daniel Borkmann
@ 2023-12-21 14:58         ` Martin KaFai Lau
  2023-12-21 20:27           ` Daniel Borkmann
  2024-01-04 20:21           ` Aditi Ghag
  0 siblings, 2 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2023-12-21 14:58 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ',
	netdev, kernel-team, Aditi Ghag, bpf

On 12/21/23 5:21 AM, Daniel Borkmann wrote:
> On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
>> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>>> Good catch. It will unnecessary skip in the following batch/bucket if there 
>>> is changes in the current batch/bucket.
>>>
>>>  From looking at the loop again, I think it is better not to change the 
>>> iter->offset during the for loop. Only update iter->offset after the for loop 
>>> has concluded.
>>>
>>> The non-zero iter->offset is only useful for the first bucket, so does a test 
>>> on the first bucket (state->bucket == bucket) before skipping sockets. 
>>> Something like this:
>>>
>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>> index 89e5a806b82e..a993f364d6ae 100644
>>> --- a/net/ipv4/udp.c
>>> +++ b/net/ipv4/udp.c
>>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
>>> *seq)
>>>       struct net *net = seq_file_net(seq);
>>>       struct udp_table *udptable;
>>>       unsigned int batch_sks = 0;
>>> +    int bucket, bucket_offset;
>>>       bool resized = false;
>>>       struct sock *sk;
>>>
>>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct 
>>> seq_file *seq)
>>>       iter->end_sk = 0;
>>>       iter->st_bucket_done = false;
>>>       batch_sks = 0;
>>> +    bucket = state->bucket;
>>> +    bucket_offset = 0;
>>>
>>>       for (; state->bucket <= udptable->mask; state->bucket++) {
>>>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>
>>> -        if (hlist_empty(&hslot2->head)) {
>>> -            iter->offset = 0;
>>> +        if (hlist_empty(&hslot2->head))
>>>               continue;
>>> -        }
>>>
>>>           spin_lock_bh(&hslot2->lock);
>>>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
>>> *seq)
>>>                   /* Resume from the last iterated socket at the
>>>                    * offset in the bucket before iterator was stopped.
>>>                    */
>>> -                if (iter->offset) {
>>> -                    --iter->offset;
>>> +                if (state->bucket == bucket &&
>>> +                    bucket_offset < iter->offset) {
>>> +                    ++bucket_offset;
>>>                       continue;
>>>                   }
>>>                   if (iter->end_sk < iter->max_sk) {
>>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct 
>>> seq_file *seq)
>>>
>>>           if (iter->end_sk)
>>>               break;
>>> +    }
>>>
>>> -        /* Reset the current bucket's offset before moving to the next 
>>> bucket. */
>>> +    if (state->bucket != bucket)
>>>           iter->offset = 0;
>>> -    }
>>>
>>>       /* All done: no batch made. */
>>>       if (!iter->end_sk)
>>
>> I think I found another bug in the current bpf_iter_udp_batch(). The 
>> "state->bucket--;" at the end of the batch() function is wrong also. It does 
>> not need to go back to the previous bucket. After realloc with a larger batch 
>> array, it should retry on the "state->bucket" as is. I tried to force the 
>> bind() to use bucket 0 and bind a larger so_reuseport set (24 sockets). 
>> WARN_ON(state->bucket < 0) triggered.
>>
>> Going back to this bug (backward progress on --iter->offset), I think it is a 
>> bit cleaner to always reset iter->offset to 0 and advance iter->offset to the 
>> resume_offset only when needed. Something like this:
> 
> Hm, my assumption was.. why not do something like the below, and fully start over?
> 
> I'm mostly puzzled about the side-effects here, in particular, if for the rerun 
> the sockets
> in the bucket could already have changed.. maybe I'm still missing something - 
> what do
> we need to deal with exactly worst case when we need to go and retry everything, 
> and what
> guarantees do we have?
> 
> (only compile tested)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 89e5a806b82e..ca62a4bb7bec 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>       struct udp_iter_state *state = &iter->state;
>       struct net *net = seq_file_net(seq);
>       struct udp_table *udptable;
> -    unsigned int batch_sks = 0;
> +    int orig_bucket, orig_offset;
> +    unsigned int i, batch_sks = 0;
>       bool resized = false;
>       struct sock *sk;
> 
> @@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>       }
> 
>       udptable = udp_get_table_seq(seq, net);
> -
> +    orig_bucket = state->bucket;
> +    orig_offset = iter->offset;
>   again:
>       /* New batch for the next bucket.
>        * Iterate over the hash table to find a bucket with sockets matching
> @@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>       if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
>           resized = true;
>           /* After allocating a larger batch, retry one more time to grab
> -         * the whole bucket.
> +         * the whole bucket. Drop the current refs since for the next
> +         * attempt the composition could have changed, thus start over.
>            */
> -        state->bucket--;
> +        for (i = 0; i < iter->end_sk; i++) {
> +            sock_put(iter->batch[i]);
> +            iter->batch[i] = NULL;
> +        }
> +        state->bucket = orig_bucket;
> +        iter->offset = orig_offset;

It does not need to start over from the orig_bucket. Once it has advanced to the
next bucket (state->bucket++), the orig_bucket is done. Otherwise, it may need
to make backward progress here on the state->bucket. The too-small batch happens
on the current state->bucket, so it should retry with the same state->bucket
after realloc_batch(). If the state->bucket happens to be the orig_bucket
(meaning it has not advanced), it will skip the same orig_offset.

If the orig_bucket has changed (e.g. gained more sockets than the last time it
was batched) after state->bucket++, it is arguably fine because the new sockets
were added after the orig_bucket was completely captured in a batch before. The
same goes for (orig_bucket-1), which could have changed during the whole
udp_table iteration.

>           goto again;
>       }
>   done:



* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2023-12-21 14:58         ` Martin KaFai Lau
@ 2023-12-21 20:27           ` Daniel Borkmann
  2023-12-21 22:19             ` Martin KaFai Lau
  2024-01-04 20:21           ` Aditi Ghag
  1 sibling, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2023-12-21 20:27 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ',
	netdev, kernel-team, Aditi Ghag, bpf

On 12/21/23 3:58 PM, Martin KaFai Lau wrote:
> On 12/21/23 5:21 AM, Daniel Borkmann wrote:
>> On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
>>> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>>>> Good catch. It will unnecessary skip in the following batch/bucket if there is changes in the current batch/bucket.
>>>>
>>>>  From looking at the loop again, I think it is better not to change the iter->offset during the for loop. Only update iter->offset after the for loop has concluded.
>>>>
>>>> The non-zero iter->offset is only useful for the first bucket, so does a test on the first bucket (state->bucket == bucket) before skipping sockets. Something like this:
>>>>
>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>> index 89e5a806b82e..a993f364d6ae 100644
>>>> --- a/net/ipv4/udp.c
>>>> +++ b/net/ipv4/udp.c
>>>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>       struct net *net = seq_file_net(seq);
>>>>       struct udp_table *udptable;
>>>>       unsigned int batch_sks = 0;
>>>> +    int bucket, bucket_offset;
>>>>       bool resized = false;
>>>>       struct sock *sk;
>>>>
>>>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>       iter->end_sk = 0;
>>>>       iter->st_bucket_done = false;
>>>>       batch_sks = 0;
>>>> +    bucket = state->bucket;
>>>> +    bucket_offset = 0;
>>>>
>>>>       for (; state->bucket <= udptable->mask; state->bucket++) {
>>>>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>>
>>>> -        if (hlist_empty(&hslot2->head)) {
>>>> -            iter->offset = 0;
>>>> +        if (hlist_empty(&hslot2->head))
>>>>               continue;
>>>> -        }
>>>>
>>>>           spin_lock_bh(&hslot2->lock);
>>>>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>                   /* Resume from the last iterated socket at the
>>>>                    * offset in the bucket before iterator was stopped.
>>>>                    */
>>>> -                if (iter->offset) {
>>>> -                    --iter->offset;
>>>> +                if (state->bucket == bucket &&
>>>> +                    bucket_offset < iter->offset) {
>>>> +                    ++bucket_offset;
>>>>                       continue;
>>>>                   }
>>>>                   if (iter->end_sk < iter->max_sk) {
>>>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>
>>>>           if (iter->end_sk)
>>>>               break;
>>>> +    }
>>>>
>>>> -        /* Reset the current bucket's offset before moving to the next bucket. */
>>>> +    if (state->bucket != bucket)
>>>>           iter->offset = 0;
>>>> -    }
>>>>
>>>>       /* All done: no batch made. */
>>>>       if (!iter->end_sk)
>>>
>>> I think I found another bug in the current bpf_iter_udp_batch(). The "state->bucket--;" at the end of the batch() function is wrong also. It does not need to go back to the previous bucket. After realloc with a larger batch array, it should retry on the "state->bucket" as is. I tried to force the bind() to use bucket 0 and bind a larger so_reuseport set (24 sockets). WARN_ON(state->bucket < 0) triggered.
>>>
>>> Going back to this bug (backward progress on --iter->offset), I think it is a bit cleaner to always reset iter->offset to 0 and advance iter->offset to the resume_offset only when needed. Something like this:
>>
>> Hm, my assumption was.. why not do something like the below, and fully start over?
>>
>> I'm mostly puzzled about the side-effects here, in particular, if for the rerun the sockets
>> in the bucket could already have changed.. maybe I'm still missing something - what do
>> we need to deal with exactly worst case when we need to go and retry everything, and what
>> guarantees do we have?
>>
>> (only compile tested)
>>
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 89e5a806b82e..ca62a4bb7bec 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       struct udp_iter_state *state = &iter->state;
>>       struct net *net = seq_file_net(seq);
>>       struct udp_table *udptable;
>> -    unsigned int batch_sks = 0;
>> +    int orig_bucket, orig_offset;
>> +    unsigned int i, batch_sks = 0;
>>       bool resized = false;
>>       struct sock *sk;
>>
>> @@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       }
>>
>>       udptable = udp_get_table_seq(seq, net);
>> -
>> +    orig_bucket = state->bucket;
>> +    orig_offset = iter->offset;
>>   again:
>>       /* New batch for the next bucket.
>>        * Iterate over the hash table to find a bucket with sockets matching
>> @@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>       if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
>>           resized = true;
>>           /* After allocating a larger batch, retry one more time to grab
>> -         * the whole bucket.
>> +         * the whole bucket. Drop the current refs since for the next
>> +         * attempt the composition could have changed, thus start over.
>>            */
>> -        state->bucket--;
>> +        for (i = 0; i < iter->end_sk; i++) {
>> +            sock_put(iter->batch[i]);
>> +            iter->batch[i] = NULL;
>> +        }
>> +        state->bucket = orig_bucket;
>> +        iter->offset = orig_offset;
> 
> It does not need to start over from the orig_bucket. Once it advanced to the next bucket (state->bucket++), the orig_bucket is done. Otherwise, it may need to make backward progress here on the state->bucket. The batch size too small happens on the current state->bucket, so it should retry with the same state->bucket after realloc_batch(). If the state->bucket happens to be the orig_bucket (mean it has not advanced), it will skip the same orig_offset.
> 
> If the orig_bucket had changed (e.g. having more sockets than the last time it was batched) after state->bucket++, it is arguably fine because it was added after the orig_bucket was completely captured in a batch before. The same goes for (orig_bucket-1) that could have changed during the whole udp_table iteration.

Fair, in relation to the above I was thinking to avoid such inconsistency, hence
the drop of the refs. I think it's probably fine to argue either way on how the
semantics should be, as long as the code doesn't get overly complex for covering
this corner case.

Thanks,
Daniel


* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2023-12-21 20:27           ` Daniel Borkmann
@ 2023-12-21 22:19             ` Martin KaFai Lau
  0 siblings, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2023-12-21 22:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ',
	netdev, kernel-team, Aditi Ghag, bpf

On 12/21/23 12:27 PM, Daniel Borkmann wrote:
> On 12/21/23 3:58 PM, Martin KaFai Lau wrote:
>> On 12/21/23 5:21 AM, Daniel Borkmann wrote:
>>> On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
>>>> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>>>>> Good catch. It will unnecessary skip in the following batch/bucket if there 
>>>>> is changes in the current batch/bucket.
>>>>>
>>>>>  From looking at the loop again, I think it is better not to change the 
>>>>> iter->offset during the for loop. Only update iter->offset after the for 
>>>>> loop has concluded.
>>>>>
>>>>> The non-zero iter->offset is only useful for the first bucket, so does a 
>>>>> test on the first bucket (state->bucket == bucket) before skipping sockets. 
>>>>> Something like this:
>>>>>
>>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>>> index 89e5a806b82e..a993f364d6ae 100644
>>>>> --- a/net/ipv4/udp.c
>>>>> +++ b/net/ipv4/udp.c
>>>>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct 
>>>>> seq_file *seq)
>>>>>       struct net *net = seq_file_net(seq);
>>>>>       struct udp_table *udptable;
>>>>>       unsigned int batch_sks = 0;
>>>>> +    int bucket, bucket_offset;
>>>>>       bool resized = false;
>>>>>       struct sock *sk;
>>>>>
>>>>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct 
>>>>> seq_file *seq)
>>>>>       iter->end_sk = 0;
>>>>>       iter->st_bucket_done = false;
>>>>>       batch_sks = 0;
>>>>> +    bucket = state->bucket;
>>>>> +    bucket_offset = 0;
>>>>>
>>>>>       for (; state->bucket <= udptable->mask; state->bucket++) {
>>>>>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>>>
>>>>> -        if (hlist_empty(&hslot2->head)) {
>>>>> -            iter->offset = 0;
>>>>> +        if (hlist_empty(&hslot2->head))
>>>>>               continue;
>>>>> -        }
>>>>>
>>>>>           spin_lock_bh(&hslot2->lock);
>>>>>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>>>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct 
>>>>> seq_file *seq)
>>>>>                   /* Resume from the last iterated socket at the
>>>>>                    * offset in the bucket before iterator was stopped.
>>>>>                    */
>>>>> -                if (iter->offset) {
>>>>> -                    --iter->offset;
>>>>> +                if (state->bucket == bucket &&
>>>>> +                    bucket_offset < iter->offset) {
>>>>> +                    ++bucket_offset;
>>>>>                       continue;
>>>>>                   }
>>>>>                   if (iter->end_sk < iter->max_sk) {
>>>>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct 
>>>>> seq_file *seq)
>>>>>
>>>>>           if (iter->end_sk)
>>>>>               break;
>>>>> +    }
>>>>>
>>>>> -        /* Reset the current bucket's offset before moving to the next 
>>>>> bucket. */
>>>>> +    if (state->bucket != bucket)
>>>>>           iter->offset = 0;
>>>>> -    }
>>>>>
>>>>>       /* All done: no batch made. */
>>>>>       if (!iter->end_sk)
>>>>
>>>> I think I found another bug in the current bpf_iter_udp_batch(). The 
>>>> "state->bucket--;" at the end of the batch() function is wrong also. It does 
>>>> not need to go back to the previous bucket. After realloc with a larger 
>>>> batch array, it should retry on the "state->bucket" as is. I tried to force 
>>>> the bind() to use bucket 0 and bind a larger so_reuseport set (24 sockets). 
>>>> WARN_ON(state->bucket < 0) triggered.
>>>>
>>>> Going back to this bug (backward progress on --iter->offset), I think it is 
>>>> a bit cleaner to always reset iter->offset to 0 and advance iter->offset to 
>>>> the resume_offset only when needed. Something like this:
>>>
>>> Hm, my assumption was.. why not do something like the below, and fully start 
>>> over?
>>>
>>> I'm mostly puzzled about the side-effects here, in particular, if for the 
>>> rerun the sockets
>>> in the bucket could already have changed.. maybe I'm still missing something 
>>> - what do
>>> we need to deal with exactly worst case when we need to go and retry 
>>> everything, and what
>>> guarantees do we have?
>>>
>>> (only compile tested)
>>>
>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>> index 89e5a806b82e..ca62a4bb7bec 100644
>>> --- a/net/ipv4/udp.c
>>> +++ b/net/ipv4/udp.c
>>> @@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
>>> *seq)
>>>       struct udp_iter_state *state = &iter->state;
>>>       struct net *net = seq_file_net(seq);
>>>       struct udp_table *udptable;
>>> -    unsigned int batch_sks = 0;
>>> +    int orig_bucket, orig_offset;
>>> +    unsigned int i, batch_sks = 0;
>>>       bool resized = false;
>>>       struct sock *sk;
>>>
>>> @@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
>>> *seq)
>>>       }
>>>
>>>       udptable = udp_get_table_seq(seq, net);
>>> -
>>> +    orig_bucket = state->bucket;
>>> +    orig_offset = iter->offset;
>>>   again:
>>>       /* New batch for the next bucket.
>>>        * Iterate over the hash table to find a bucket with sockets matching
>>> @@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file 
>>> *seq)
>>>       if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
>>>           resized = true;
>>>           /* After allocating a larger batch, retry one more time to grab
>>> -         * the whole bucket.
>>> +         * the whole bucket. Drop the current refs since for the next
>>> +         * attempt the composition could have changed, thus start over.
>>>            */
>>> -        state->bucket--;
>>> +        for (i = 0; i < iter->end_sk; i++) {
>>> +            sock_put(iter->batch[i]);
>>> +            iter->batch[i] = NULL;
>>> +        }
>>> +        state->bucket = orig_bucket;
>>> +        iter->offset = orig_offset;
>>
>> It does not need to start over from the orig_bucket. Once it advanced to the 
>> next bucket (state->bucket++), the orig_bucket is done. Otherwise, it may need 
>> to make backward progress here on the state->bucket. The batch size too small 
>> happens on the current state->bucket, so it should retry with the same 
>> state->bucket after realloc_batch(). If the state->bucket happens to be the 
>> orig_bucket (mean it has not advanced), it will skip the same orig_offset.
>>
>> If the orig_bucket had changed (e.g. having more sockets than the last time it 
>> was batched) after state->bucket++, it is arguably fine because it was added 
>> after the orig_bucket was completely captured in a batch before. The same goes 
>> for (orig_bucket-1) that could have changed during the whole udp_table iteration.
> 
> Fair, I was thinking in relation to the above to avoid such inconsistency, hence 
> the drop of the refs. 
Ah, for the ref drop, bpf_iter_udp_realloc_batch() already did the sock_put
on the iter->batch[] entries; it is done in bpf_iter_udp_put_batch(). When it
retries with a larger iter->batch[] array, it retries from the very first
udp_sk of the state->bucket again. Thus, the intention is to make a best effort
to capture the whole bucket under the "spin_lock_bh(&hslot2->lock)".
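
For reference, a rough paraphrase of that put/realloc path (from memory, not
the exact net/ipv4/udp.c source, so details may differ):

	static void bpf_iter_udp_put_batch(struct bpf_udp_iter_state *iter)
	{
		/* Drop the refs on the not-yet-consumed sockets in the batch. */
		while (iter->cur_sk < iter->end_sk)
			sock_put(iter->batch[iter->cur_sk++]);
	}

	static int bpf_iter_udp_realloc_batch(struct bpf_udp_iter_state *iter,
					      unsigned int new_batch_sz)
	{
		struct sock **new_batch;

		new_batch = kvmalloc_array(new_batch_sz, sizeof(*new_batch),
					   GFP_USER | __GFP_NOWARN);
		if (!new_batch)
			return -ENOMEM;

		/* The old batch's refs are dropped here, so the "retry with a
		 * larger batch" path does not need an extra sock_put() loop.
		 */
		bpf_iter_udp_put_batch(iter);
		kvfree(iter->batch);
		iter->batch = new_batch;
		iter->max_sk = new_batch_sz;

		return 0;
	}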

> I think it's probably fine to argue either way on how semantics
> should be as long as the code doesn't get overly complex for covering this corner
> case.
> 
> Thanks,
> Daniel


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2023-12-21 14:58         ` Martin KaFai Lau
  2023-12-21 20:27           ` Daniel Borkmann
@ 2024-01-04 20:21           ` Aditi Ghag
  2024-01-04 22:27             ` Martin KaFai Lau
  1 sibling, 1 reply; 14+ messages in thread
From: Aditi Ghag @ 2024-01-04 20:21 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, netdev,
	kernel-team, bpf



> On Dec 21, 2023, at 6:58 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 12/21/23 5:21 AM, Daniel Borkmann wrote:
>> On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
>>> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>>>> Good catch. It will unnecessary skip in the following batch/bucket if there is changes in the current batch/bucket.
>>>> 
>>>>  From looking at the loop again, I think it is better not to change the iter->offset during the for loop. Only update iter->offset after the for loop has concluded.
>>>> 
>>>> The non-zero iter->offset is only useful for the first bucket, so does a test on the first bucket (state->bucket == bucket) before skipping sockets. Something like this:
>>>> 
>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>> index 89e5a806b82e..a993f364d6ae 100644
>>>> --- a/net/ipv4/udp.c
>>>> +++ b/net/ipv4/udp.c
>>>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>       struct net *net = seq_file_net(seq);
>>>>       struct udp_table *udptable;
>>>>       unsigned int batch_sks = 0;
>>>> +    int bucket, bucket_offset;
>>>>       bool resized = false;
>>>>       struct sock *sk;
>>>> 
>>>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>       iter->end_sk = 0;
>>>>       iter->st_bucket_done = false;
>>>>       batch_sks = 0;
>>>> +    bucket = state->bucket;
>>>> +    bucket_offset = 0;
>>>> 
>>>>       for (; state->bucket <= udptable->mask; state->bucket++) {
>>>>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>> 
>>>> -        if (hlist_empty(&hslot2->head)) {
>>>> -            iter->offset = 0;
>>>> +        if (hlist_empty(&hslot2->head))
>>>>               continue;
>>>> -        }
>>>> 
>>>>           spin_lock_bh(&hslot2->lock);
>>>>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>                   /* Resume from the last iterated socket at the
>>>>                    * offset in the bucket before iterator was stopped.
>>>>                    */
>>>> -                if (iter->offset) {
>>>> -                    --iter->offset;
>>>> +                if (state->bucket == bucket &&
>>>> +                    bucket_offset < iter->offset) {
>>>> +                    ++bucket_offset;
>>>>                       continue;
>>>>                   }
>>>>                   if (iter->end_sk < iter->max_sk) {
>>>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>> 
>>>>           if (iter->end_sk)
>>>>               break;
>>>> +    }
>>>> 
>>>> -        /* Reset the current bucket's offset before moving to the next bucket. */
>>>> +    if (state->bucket != bucket)
>>>>           iter->offset = 0;
>>>> -    }
>>>> 
>>>>       /* All done: no batch made. */
>>>>       if (!iter->end_sk)
>>> 
>>> I think I found another bug in the current bpf_iter_udp_batch(). The "state->bucket--;" at the end of the batch() function is wrong also. It does not need to go back to the previous bucket. After realloc with a larger batch array, it should retry on the "state->bucket" as is. I tried to force the bind() to use bucket 0 and bind a larger so_reuseport set (24 sockets). WARN_ON(state->bucket < 0) triggered.
>>> 
>>> Going back to this bug (backward progress on --iter->offset), I think it is a bit cleaner to always reset iter->offset to 0 and advance iter->offset to the resume_offset only when needed. Something like this:
>> Hm, my assumption was.. why not do something like the below, and fully start over?
>> I'm mostly puzzled about the side-effects here, in particular, if for the rerun the sockets
>> in the bucket could already have changed.. maybe I'm still missing something - what do
>> we need to deal with exactly worst case when we need to go and retry everything, and what
>> guarantees do we have?
>> (only compile tested)
>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>> index 89e5a806b82e..ca62a4bb7bec 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>      struct udp_iter_state *state = &iter->state;
>>      struct net *net = seq_file_net(seq);
>>      struct udp_table *udptable;
>> -    unsigned int batch_sks = 0;
>> +    int orig_bucket, orig_offset;
>> +    unsigned int i, batch_sks = 0;
>>      bool resized = false;
>>      struct sock *sk;
>> @@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>      }
>>      udptable = udp_get_table_seq(seq, net);
>> -
>> +    orig_bucket = state->bucket;
>> +    orig_offset = iter->offset;
>>  again:
>>      /* New batch for the next bucket.
>>       * Iterate over the hash table to find a bucket with sockets matching
>> @@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>      if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
>>          resized = true;
>>          /* After allocating a larger batch, retry one more time to grab
>> -         * the whole bucket.
>> +         * the whole bucket. Drop the current refs since for the next
>> +         * attempt the composition could have changed, thus start over.
>>           */
>> -        state->bucket--;
>> +        for (i = 0; i < iter->end_sk; i++) {
>> +            sock_put(iter->batch[i]);
>> +            iter->batch[i] = NULL;
>> +        }
>> +        state->bucket = orig_bucket;
>> +        iter->offset = orig_offset;
> 
> It does not need to start over from the orig_bucket. Once it advanced to the next bucket (state->bucket++), the orig_bucket is done. Otherwise, it may need to make backward progress here on the state->bucket. The batch size too small happens on the current state->bucket, so it should retry with the same state->bucket after realloc_batch(). If the state->bucket happens to be the orig_bucket (mean it has not advanced), it will skip the same orig_offset.


Thanks for the patch. I was on vacation, hence the delay in reviewing it. The changes around iter->offset match what I had locally (the patch fell off my radar, sorry!).

I'm not sure about the semantics of the resume operation for certain corner cases like these:

- The BPF UDP sockets iterator was stopped while iterating bucket #X, and the offset was set to 2. bpf_iter_udp_seq_stop then released references to the batched sockets, and marked the bucket #X iterator state (aka iter->st_bucket_done) as false.
- Before the iterator is "resumed", bucket #X was mutated such that the previously iterated sockets were removed, and new sockets were added. With the current logic, the iterator will skip the first two sockets in the bucket, which isn't right (a rough sequence is sketched below). This is slightly different from the case where sockets were updated in the X-1 bucket *after* it was fully iterated. Since the bucket and sock locks are released, we don't have any guarantees that the underlying sockets table isn't mutated while userspace holds a valid iterator.

What do you think about such cases?
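
In code terms, the sequence above is roughly the following (the fd variables and the helper are made up for illustration):

	/* One bpf_iter/udp iterator fd, read twice with a mutation in between. */
	read(iter_fd, buf, sizeof(buf));     /* shows sk1, sk2 of bucket #X; stop() leaves iter->offset == 2 */
	close(sk1_fd);
	close(sk2_fd);                       /* the two already-shown sockets leave bucket #X */
	create_two_udp_socks_in_bucket_x();  /* made-up helper: two new sockets hash into bucket #X */
	read(iter_fd, buf, sizeof(buf));     /* resume skips 2 entries, i.e. the two *new* sockets are never shown */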


> 
> If the orig_bucket had changed (e.g. having more sockets than the last time it was batched) after state->bucket++, it is arguably fine because it was added after the orig_bucket was completely captured in a batch before. The same goes for (orig_bucket-1) that could have changed during the whole udp_table iteration.
> 
>>          goto again;
>>      }
>>  done:


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2024-01-04 20:21           ` Aditi Ghag
@ 2024-01-04 22:27             ` Martin KaFai Lau
  2024-01-04 23:38               ` Aditi Ghag
  0 siblings, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2024-01-04 22:27 UTC (permalink / raw)
  To: Aditi Ghag
  Cc: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, netdev,
	kernel-team, bpf

On 1/4/24 12:21 PM, Aditi Ghag wrote:
> 
> 
>> On Dec 21, 2023, at 6:58 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 12/21/23 5:21 AM, Daniel Borkmann wrote:
>>> On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
>>>> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>>>>> Good catch. It will unnecessary skip in the following batch/bucket if there is changes in the current batch/bucket.
>>>>>
>>>>>   From looking at the loop again, I think it is better not to change the iter->offset during the for loop. Only update iter->offset after the for loop has concluded.
>>>>>
>>>>> The non-zero iter->offset is only useful for the first bucket, so does a test on the first bucket (state->bucket == bucket) before skipping sockets. Something like this:
>>>>>
>>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>>> index 89e5a806b82e..a993f364d6ae 100644
>>>>> --- a/net/ipv4/udp.c
>>>>> +++ b/net/ipv4/udp.c
>>>>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>>        struct net *net = seq_file_net(seq);
>>>>>        struct udp_table *udptable;
>>>>>        unsigned int batch_sks = 0;
>>>>> +    int bucket, bucket_offset;
>>>>>        bool resized = false;
>>>>>        struct sock *sk;
>>>>>
>>>>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>>        iter->end_sk = 0;
>>>>>        iter->st_bucket_done = false;
>>>>>        batch_sks = 0;
>>>>> +    bucket = state->bucket;
>>>>> +    bucket_offset = 0;
>>>>>
>>>>>        for (; state->bucket <= udptable->mask; state->bucket++) {
>>>>>            struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>>>
>>>>> -        if (hlist_empty(&hslot2->head)) {
>>>>> -            iter->offset = 0;
>>>>> +        if (hlist_empty(&hslot2->head))
>>>>>                continue;
>>>>> -        }
>>>>>
>>>>>            spin_lock_bh(&hslot2->lock);
>>>>>            udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>>>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>>                    /* Resume from the last iterated socket at the
>>>>>                     * offset in the bucket before iterator was stopped.
>>>>>                     */
>>>>> -                if (iter->offset) {
>>>>> -                    --iter->offset;
>>>>> +                if (state->bucket == bucket &&
>>>>> +                    bucket_offset < iter->offset) {
>>>>> +                    ++bucket_offset;
>>>>>                        continue;
>>>>>                    }
>>>>>                    if (iter->end_sk < iter->max_sk) {
>>>>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>>
>>>>>            if (iter->end_sk)
>>>>>                break;
>>>>> +    }
>>>>>
>>>>> -        /* Reset the current bucket's offset before moving to the next bucket. */
>>>>> +    if (state->bucket != bucket)
>>>>>            iter->offset = 0;
>>>>> -    }
>>>>>
>>>>>        /* All done: no batch made. */
>>>>>        if (!iter->end_sk)
>>>>
>>>> I think I found another bug in the current bpf_iter_udp_batch(). The "state->bucket--;" at the end of the batch() function is wrong also. It does not need to go back to the previous bucket. After realloc with a larger batch array, it should retry on the "state->bucket" as is. I tried to force the bind() to use bucket 0 and bind a larger so_reuseport set (24 sockets). WARN_ON(state->bucket < 0) triggered.
>>>>
>>>> Going back to this bug (backward progress on --iter->offset), I think it is a bit cleaner to always reset iter->offset to 0 and advance iter->offset to the resume_offset only when needed. Something like this:
>>> Hm, my assumption was.. why not do something like the below, and fully start over?
>>> I'm mostly puzzled about the side-effects here, in particular, if for the rerun the sockets
>>> in the bucket could already have changed.. maybe I'm still missing something - what do
>>> we need to deal with exactly worst case when we need to go and retry everything, and what
>>> guarantees do we have?
>>> (only compile tested)
>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>> index 89e5a806b82e..ca62a4bb7bec 100644
>>> --- a/net/ipv4/udp.c
>>> +++ b/net/ipv4/udp.c
>>> @@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>       struct udp_iter_state *state = &iter->state;
>>>       struct net *net = seq_file_net(seq);
>>>       struct udp_table *udptable;
>>> -    unsigned int batch_sks = 0;
>>> +    int orig_bucket, orig_offset;
>>> +    unsigned int i, batch_sks = 0;
>>>       bool resized = false;
>>>       struct sock *sk;
>>> @@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>       }
>>>       udptable = udp_get_table_seq(seq, net);
>>> -
>>> +    orig_bucket = state->bucket;
>>> +    orig_offset = iter->offset;
>>>   again:
>>>       /* New batch for the next bucket.
>>>        * Iterate over the hash table to find a bucket with sockets matching
>>> @@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>       if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
>>>           resized = true;
>>>           /* After allocating a larger batch, retry one more time to grab
>>> -         * the whole bucket.
>>> +         * the whole bucket. Drop the current refs since for the next
>>> +         * attempt the composition could have changed, thus start over.
>>>            */
>>> -        state->bucket--;
>>> +        for (i = 0; i < iter->end_sk; i++) {
>>> +            sock_put(iter->batch[i]);
>>> +            iter->batch[i] = NULL;
>>> +        }
>>> +        state->bucket = orig_bucket;
>>> +        iter->offset = orig_offset;
>>
>> It does not need to start over from the orig_bucket. Once it advanced to the next bucket (state->bucket++), the orig_bucket is done. Otherwise, it may need to make backward progress here on the state->bucket. The batch size too small happens on the current state->bucket, so it should retry with the same state->bucket after realloc_batch(). If the state->bucket happens to be the orig_bucket (mean it has not advanced), it will skip the same orig_offset.
> 
> 
> Thanks for the patch. I was on vacation, hence the delay in reviewing the patch. The changes around iter->offset match with what I had locally (the patch fell off my radar, sorry!).

No problem. I have been hitting it on and off in my local testing for some time, 
so I decided to post the fix and the test. I have added a few more tests in patch 2. 
I will post v2, similar to the diff in my last reply of this thread, early next week.

> 
> I'm not sure about semantics of the resume operation for certain corner cases like these:
> 
> - The BPF UDP sockets iterator was stopped while iterating bucker #X, and the offset was set to 2. bpf_iter_udp_seq_stop then released references to the batched sockets, and marks the bucket X iterator state (aka iter->st_bucket_done) as false.
> - Before the iterator is "resumed", the bucket #X was mutated such that the previously iterated sockets were removed, and new sockets were added.  With the current logic, the iterator will skip the first two sockets in the bucket, which isn't right. This is slightly different from the case where sockets were updated in the X -1 bucket *after* it was fully iterated. Since the bucket and sock locks are released, we don't have any guarantees that the underlying sockets table isn't mutated while the userspace has a valid iterator.
> What do you think about such cases?
I believe it is something orthogonal to the bug fix here but we could use this 
thread to discuss.

This is not something specific to the bpf tcp/udp iter, which uses the offset as 
a best effort to resume (e.g. inet_diag and /proc/net/{tcp[6],udp} use a similar 
strategy to resume). Improving it would require introducing some synchronization 
with the (potentially fast-path) writer side (e.g. bind, after the 3WHS, etc.). 
I am not convinced it is worth it to catch these cases.

For the cases described above, skipping the newer sockets is arguably ok. These 
two new sockets would not have been captured anyway even if the batch had not 
been stop()-ed in the middle. I also don't see how it is semantically different 
from the two new sockets being added to the X-1 bucket: the sockets are added 
after the bpf-iter scanned that spot, regardless of whether they land in an 
earlier bucket or in an earlier location of the same bucket.

That said, bpf_iter_udp_seq_stop() should only happen if the bpf_prog 
bpf_seq_printf()s something AND hits the seq->buf (PAGE_SIZE << 3) limit or the 
count limit in "read(iter_fd, buf, count)". For this case, bpf_iter.c could be 
improved to capture the whole batch's seq_printf() output to seq->buf first, 
even if the userspace buf is full. It would be a separate effort if it is indeed 
needed.
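
As a concrete (and minimal) example of that trigger from the userspace side; 
iter_fd is assumed to come from bpf_iter_create() on an attached bpf_iter/udp 
link whose prog does bpf_seq_printf():

	char buf[16];	/* deliberately smaller than one bucket's printed output */
	ssize_t n;

	/* Each short read() can force bpf_iter_udp_seq_stop() in the middle of
	 * a bucket; the next read() resumes from iter->offset, which is why the
	 * offset must never move backward.
	 */
	while ((n = read(iter_fd, buf, sizeof(buf))) > 0)
		;
	if (n < 0)
		perror("read");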

For the bpf_setsockopt and bpf_sock_destroy use cases, which probably do not 
need to seq_printf() anything, there should be nothing to worry about. 
bpf_iter_udp_batch() should be able to grab the whole bucket, so the bpf_prog 
will not miss a socket when doing bpf_setsockopt or bpf_sock_destroy.

> 
> 
>>
>> If the orig_bucket had changed (e.g. having more sockets than the last time it was batched) after state->bucket++, it is arguably fine because it was added after the orig_bucket was completely captured in a batch before. The same goes for (orig_bucket-1) that could have changed during the whole udp_table iteration.
>>
>>>           goto again;
>>>       }
>>>   done:
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2024-01-04 22:27             ` Martin KaFai Lau
@ 2024-01-04 23:38               ` Aditi Ghag
  2024-01-05  0:33                 ` Martin KaFai Lau
  0 siblings, 1 reply; 14+ messages in thread
From: Aditi Ghag @ 2024-01-04 23:38 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, netdev,
	kernel-team, bpf



> On Jan 4, 2024, at 2:27 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 1/4/24 12:21 PM, Aditi Ghag wrote:
>>> On Dec 21, 2023, at 6:58 AM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>> 
>>> On 12/21/23 5:21 AM, Daniel Borkmann wrote:
>>>> On 12/21/23 5:45 AM, Martin KaFai Lau wrote:
>>>>> On 12/20/23 11:10 AM, Martin KaFai Lau wrote:
>>>>>> Good catch. It will unnecessary skip in the following batch/bucket if there is changes in the current batch/bucket.
>>>>>> 
>>>>>>  From looking at the loop again, I think it is better not to change the iter->offset during the for loop. Only update iter->offset after the for loop has concluded.
>>>>>> 
>>>>>> The non-zero iter->offset is only useful for the first bucket, so does a test on the first bucket (state->bucket == bucket) before skipping sockets. Something like this:
>>>>>> 
>>>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>>>> index 89e5a806b82e..a993f364d6ae 100644
>>>>>> --- a/net/ipv4/udp.c
>>>>>> +++ b/net/ipv4/udp.c
>>>>>> @@ -3139,6 +3139,7 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>>>       struct net *net = seq_file_net(seq);
>>>>>>       struct udp_table *udptable;
>>>>>>       unsigned int batch_sks = 0;
>>>>>> +    int bucket, bucket_offset;
>>>>>>       bool resized = false;
>>>>>>       struct sock *sk;
>>>>>> 
>>>>>> @@ -3162,14 +3163,14 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>>>       iter->end_sk = 0;
>>>>>>       iter->st_bucket_done = false;
>>>>>>       batch_sks = 0;
>>>>>> +    bucket = state->bucket;
>>>>>> +    bucket_offset = 0;
>>>>>> 
>>>>>>       for (; state->bucket <= udptable->mask; state->bucket++) {
>>>>>>           struct udp_hslot *hslot2 = &udptable->hash2[state->bucket];
>>>>>> 
>>>>>> -        if (hlist_empty(&hslot2->head)) {
>>>>>> -            iter->offset = 0;
>>>>>> +        if (hlist_empty(&hslot2->head))
>>>>>>               continue;
>>>>>> -        }
>>>>>> 
>>>>>>           spin_lock_bh(&hslot2->lock);
>>>>>>           udp_portaddr_for_each_entry(sk, &hslot2->head) {
>>>>>> @@ -3177,8 +3178,9 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>>>                   /* Resume from the last iterated socket at the
>>>>>>                    * offset in the bucket before iterator was stopped.
>>>>>>                    */
>>>>>> -                if (iter->offset) {
>>>>>> -                    --iter->offset;
>>>>>> +                if (state->bucket == bucket &&
>>>>>> +                    bucket_offset < iter->offset) {
>>>>>> +                    ++bucket_offset;
>>>>>>                       continue;
>>>>>>                   }
>>>>>>                   if (iter->end_sk < iter->max_sk) {
>>>>>> @@ -3192,10 +3194,10 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>>> 
>>>>>>           if (iter->end_sk)
>>>>>>               break;
>>>>>> +    }
>>>>>> 
>>>>>> -        /* Reset the current bucket's offset before moving to the next bucket. */
>>>>>> +    if (state->bucket != bucket)
>>>>>>           iter->offset = 0;
>>>>>> -    }
>>>>>> 
>>>>>>       /* All done: no batch made. */
>>>>>>       if (!iter->end_sk)
>>>>> 
>>>>> I think I found another bug in the current bpf_iter_udp_batch(). The "state->bucket--;" at the end of the batch() function is wrong also. It does not need to go back to the previous bucket. After realloc with a larger batch array, it should retry on the "state->bucket" as is. I tried to force the bind() to use bucket 0 and bind a larger so_reuseport set (24 sockets). WARN_ON(state->bucket < 0) triggered.

Good catch! This error case would be hit when a batch needs to be resized for the very first bucket during iteration.
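
A rough repro sketch of that case (error handling omitted; the point is just to 
pack more SO_REUSEPORT sockets into a single bucket than the initial batch can 
hold -- Martin used 24 and additionally forced bind() onto bucket 0):

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <sys/socket.h>

	static void fill_one_bucket(void)
	{
		struct sockaddr_in addr = {
			.sin_family = AF_INET,
			.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
			.sin_port = htons(54321),	/* arbitrary test port */
		};
		int i, one = 1;

		for (i = 0; i < 24; i++) {
			int fd = socket(AF_INET, SOCK_DGRAM, 0);

			setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
			bind(fd, (struct sockaddr *)&addr, sizeof(addr));
		}
	}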

>>>>> 
>>>>> Going back to this bug (backward progress on --iter->offset), I think it is a bit cleaner to always reset iter->offset to 0 and advance iter->offset to the resume_offset only when needed. Something like this:
>>>> Hm, my assumption was.. why not do something like the below, and fully start over?
>>>> I'm mostly puzzled about the side-effects here, in particular, if for the rerun the sockets
>>>> in the bucket could already have changed.. maybe I'm still missing something - what do
>>>> we need to deal with exactly worst case when we need to go and retry everything, and what
>>>> guarantees do we have?
>>>> (only compile tested)
>>>> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
>>>> index 89e5a806b82e..ca62a4bb7bec 100644
>>>> --- a/net/ipv4/udp.c
>>>> +++ b/net/ipv4/udp.c
>>>> @@ -3138,7 +3138,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>      struct udp_iter_state *state = &iter->state;
>>>>      struct net *net = seq_file_net(seq);
>>>>      struct udp_table *udptable;
>>>> -    unsigned int batch_sks = 0;
>>>> +    int orig_bucket, orig_offset;
>>>> +    unsigned int i, batch_sks = 0;
>>>>      bool resized = false;
>>>>      struct sock *sk;
>>>> @@ -3149,7 +3150,8 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>      }
>>>>      udptable = udp_get_table_seq(seq, net);
>>>> -
>>>> +    orig_bucket = state->bucket;
>>>> +    orig_offset = iter->offset;
>>>>  again:
>>>>      /* New batch for the next bucket.
>>>>       * Iterate over the hash table to find a bucket with sockets matching
>>>> @@ -3211,9 +3213,15 @@ static struct sock *bpf_iter_udp_batch(struct seq_file *seq)
>>>>      if (!resized && !bpf_iter_udp_realloc_batch(iter, batch_sks * 3 / 2)) {
>>>>          resized = true;
>>>>          /* After allocating a larger batch, retry one more time to grab
>>>> -         * the whole bucket.
>>>> +         * the whole bucket. Drop the current refs since for the next
>>>> +         * attempt the composition could have changed, thus start over.
>>>>           */
>>>> -        state->bucket--;
>>>> +        for (i = 0; i < iter->end_sk; i++) {
>>>> +            sock_put(iter->batch[i]);
>>>> +            iter->batch[i] = NULL;
>>>> +        }
>>>> +        state->bucket = orig_bucket;
>>>> +        iter->offset = orig_offset;
>>> 
>>> It does not need to start over from the orig_bucket. Once it advanced to the next bucket (state->bucket++), the orig_bucket is done. Otherwise, it may need to make backward progress here on the state->bucket. The batch size too small happens on the current state->bucket, so it should retry with the same state->bucket after realloc_batch(). If the state->bucket happens to be the orig_bucket (mean it has not advanced), it will skip the same orig_offset.
>> Thanks for the patch. I was on vacation, hence the delay in reviewing the patch. The changes around iter->offset match with what I had locally (the patch fell off my radar, sorry!).
> 
> No problem. I am hitting it one and off in my local testing for some time, so decided to post the fix and the test. I have added a few more tests in patch 2. I will post v2 similar to the diff in my last reply of this thread in the early next week.
> 
>> I'm not sure about semantics of the resume operation for certain corner cases like these:
>> - The BPF UDP sockets iterator was stopped while iterating bucker #X, and the offset was set to 2. bpf_iter_udp_seq_stop then released references to the batched sockets, and marks the bucket X iterator state (aka iter->st_bucket_done) as false.
>> - Before the iterator is "resumed", the bucket #X was mutated such that the previously iterated sockets were removed, and new sockets were added.  With the current logic, the iterator will skip the first two sockets in the bucket, which isn't right. This is slightly different from the case where sockets were updated in the X -1 bucket *after* it was fully iterated. Since the bucket and sock locks are released, we don't have any guarantees that the underlying sockets table isn't mutated while the userspace has a valid iterator.
>> What do you think about such cases?
> I believe it is something orthogonal to the bug fix here but we could use this thread to discuss.

Yes, indeed! But I piggy-backed on the same thread, as one potential option could be to always start iterating from the beginning of a bucket. (More details below.)
> 
> This is not something specific to the bpf tcp/udp iter which uses the offset as a best effort to resume (e.g. the inet_diag and the /proc/net/{tcp[6],udp} are using similar strategy to resume). To improve it, it will need to introduce some synchronization with the (potentially fast path) writer side (e.g. bind, after 3WHS...etc). Not convinced it is worth it to catch these cases.

Right, synchronizing fast paths with the iterator logic seems like overkill.

If we change the resume semantics and make the iterator always start from the beginning of a bucket, it could solve some of these corner cases (and simplify the batching logic). The last I checked, the TCP (BPF) iterator logic was tightly coupled with the file-based iterator (/proc/net/{tcp,udp}), so I'm not sure it would be an easy change if we were to change the resume semantics for both the TCP and UDP BPF iterators.
Note that this behavior would be similar to the lseek operation with seq_file [1]. Here is a snippet -

The stop() function closes a session; its job, of course, is to clean up. If dynamic memory is allocated for the iterator, stop() is the place to free it; if a lock was taken by start(), stop() must release that lock. The value that *pos was set to by the last next() call before stop() is remembered, and used for the first start() call of the next session unless lseek() has been called on the file; in that case next start() will be asked to start at position zero

[1] https://docs.kernel.org/filesystems/seq_file.html
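
For reference, the resume behavior discussed above sits behind the usual seq_file 
callbacks; the bpf udp iterator wires them up roughly like this (paraphrased from 
memory, not the exact udp.c source):

	static const struct seq_operations bpf_iter_udp_seq_ops = {
		.start	= bpf_iter_udp_seq_start,	/* batches a bucket; skips iter->offset entries on resume */
		.next	= bpf_iter_udp_seq_next,	/* walks the already-batched sockets */
		.stop	= bpf_iter_udp_seq_stop,	/* drops refs; may leave a bucket only partially shown */
		.show	= bpf_iter_udp_seq_show,	/* lock_sock() and run the bpf prog */
	};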

> 
> For the cases described above, skipped the newer sockets is arguably ok. These two new sockets will not be captured anyway even the batch was not stop()-ed in the middle. I also don't see how it is different semantically if the two new sockets are added to the X-1 bucket: the sockets are added after the bpf-iter scanned it regardless they are added to an earlier bucket or to an earlier location of the same bucket.
> 
> That said, the bpf_iter_udp_seq_stop() should only happen if the bpf_prog bpf_seq_printf() something AND hit the seq->buf (PAGE_SIZE) << 3) limit or the count in "read(iter_fd, buf, count)" limit.

Thanks for sharing the additional context. Would you have a link for the set of conditions under which an iterator can be stopped? It would be good to document the API semantics so that users are aware of the implications of setting the read size parameter too low.


> For this case, bpf_iter.c may be improved to capture the whole batch's seq_printf() to seq->buf first even the userspace's buf is full. It would be a separate effort if it is indeed needed.

Interesting proposal... Another option could be to invalidate the userspace iterator if an entire bucket batch cannot be captured, so that userspace can retry with a bigger buffer.
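
A sketch of the userspace pattern that proposal implies; the error value is purely 
hypothetical (in the case discussed in this thread, today's kernel resumes from 
iter->offset rather than failing the read):

	/* Hypothetical: assumes the kernel returned, say, -E2BIG when a whole
	 * bucket batch cannot fit the user buffer (or that the iterator fd is
	 * re-created, depending on the chosen semantics).  Error handling elided.
	 */
	size_t sz = 4096;
	char *buf = malloc(sz);
	ssize_t n;

	while ((n = read(iter_fd, buf, sz)) < 0 && errno == E2BIG) {
		sz *= 2;			/* grow and retry */
		buf = realloc(buf, sz);
	}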


> 
> For the bpf_setsockopt and bpf_sock_destroy use case where it probably does not need to seq_printf() anything, it should not need to worry about it. The bpf_iter_udp_batch() should be able to grab the whole bucket such that the bpf_prog will not miss a socket to do bpf_setsockopt or bpf_sock_destroy.
> 

Yup, agreed, the setsockopt and sock_destroy cases are not affected.

>>> 
>>> If the orig_bucket had changed (e.g. having more sockets than the last time it was batched) after state->bucket++, it is arguably fine because it was added after the orig_bucket was completely captured in a batch before. The same goes for (orig_bucket-1) that could have changed during the whole udp_table iteration.
>>> 
>>>>          goto again;
>>>>      }
>>>>  done:


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2024-01-04 23:38               ` Aditi Ghag
@ 2024-01-05  0:33                 ` Martin KaFai Lau
  2024-01-08 23:24                   ` Aditi Ghag
  0 siblings, 1 reply; 14+ messages in thread
From: Martin KaFai Lau @ 2024-01-05  0:33 UTC (permalink / raw)
  To: Aditi Ghag
  Cc: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, netdev,
	kernel-team, bpf

On 1/4/24 3:38 PM, Aditi Ghag wrote:
>>> I'm not sure about semantics of the resume operation for certain corner cases like these:
>>> - The BPF UDP sockets iterator was stopped while iterating bucker #X, and the offset was set to 2. bpf_iter_udp_seq_stop then released references to the batched sockets, and marks the bucket X iterator state (aka iter->st_bucket_done) as false.
>>> - Before the iterator is "resumed", the bucket #X was mutated such that the previously iterated sockets were removed, and new sockets were added.  With the current logic, the iterator will skip the first two sockets in the bucket, which isn't right. This is slightly different from the case where sockets were updated in the X -1 bucket *after* it was fully iterated. Since the bucket and sock locks are released, we don't have any guarantees that the underlying sockets table isn't mutated while the userspace has a valid iterator.
>>> What do you think about such cases?
>> I believe it is something orthogonal to the bug fix here but we could use this thread to discuss.
> 
> Yes, indeed! But I piggy-backed on the same thread, as one potential option could be to always start iterating from the beginning of a bucket. (More details below.)
>>
>> This is not something specific to the bpf tcp/udp iter which uses the offset as a best effort to resume (e.g. the inet_diag and the /proc/net/{tcp[6],udp} are using similar strategy to resume). To improve it, it will need to introduce some synchronization with the (potentially fast path) writer side (e.g. bind, after 3WHS...etc). Not convinced it is worth it to catch these cases.
> 
> Right, synchronizing fast paths with the iterator logic seems like an overkill.
> 
> If we change the resume semantics, and make the iterator always start from the beginning of a bucket, it could solve some of these corner cases (and simplify the batching logic). The last I checked, the TCP (BPF) iterator logic was tightly coupled with the 

Always resume from the beginning of the bucket? Hmm... then it is making 
backward progress and will hit the same bug again. Or did I misunderstand your 
proposal?

> file based iterator (/proc/net/{tcp,udp}), so I'm not sure if it's an easy change if we were to change the resume semantics for both TCP and UDP BFP iterators?
> Note that, this behavior would be similar to the lseek operation with seq_file [1]. Here is a snippet -

bpf_iter does not support lseek.

> 
> The stop() function closes a session; its job, of course, is to clean up. If dynamic memory is allocated for the iterator, stop() is the place to free it; if a lock was taken by start(), stop() must release that lock. The value that *pos was set to by the last next() call before stop() is remembered, and used for the first start() call of the next session unless lseek() has been called on the file; in that case next start() will be asked to start at position zero
> 
> [1] https://docs.kernel.org/filesystems/seq_file.html
> 
>>
>> For the cases described above, skipped the newer sockets is arguably ok. These two new sockets will not be captured anyway even the batch was not stop()-ed in the middle. I also don't see how it is different semantically if the two new sockets are added to the X-1 bucket: the sockets are added after the bpf-iter scanned it regardless they are added to an earlier bucket or to an earlier location of the same bucket.
>>
>> That said, the bpf_iter_udp_seq_stop() should only happen if the bpf_prog bpf_seq_printf() something AND hit the seq->buf (PAGE_SIZE) << 3) limit or the count in "read(iter_fd, buf, count)" limit.
> 
> Thanks for sharing the additional context. Would you have a link for these set of conditions where an iterator can be stopped? It'll be good to document the API semantics so that users are aware of the implications of setting the read size parameter too low.

Most of the conditions should be in bpf_seq_read() in bpf_iter.c.

Again, this resume on offset behavior is not something specific to 
bpf-{tcp,udp}-iter.
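
Very roughly, and leaving out the buffer growth, overflow, and error handling, 
the read loop there behaves like this (a simplification for orientation only, 
not the actual bpf_seq_read() code):

	/* Pseudo-C of the bpf_seq_read() loop: keep producing objects until the
	 * output accumulated in seq->buf would no longer fit the size requested
	 * by read(); at that point ->stop() runs and a later read() resumes.
	 */
	while (seq->count < user_count) {
		p = seq->op->next(seq, p, &seq->index);
		if (!p)				/* no more objects */
			break;
		seq->op->show(seq, p);		/* runs the bpf prog; bpf_seq_printf() appends to seq->buf */
		if (seq->count >= user_count)	/* user buffer is full */
			break;
	}
	/* seq->buf is copied to the user buffer around this loop, and ->stop()
	 * is always called before returning.
	 */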

> 
> 
>> For this case, bpf_iter.c may be improved to capture the whole batch's seq_printf() to seq->buf first even the userspace's buf is full. It would be a separate effort if it is indeed needed.
> 
> Interesting proposal... Other option could be to invalidate the userspace iterator if an entire bucket batch is not being captured, so that userspace can retry with a bigger buffer.

Not sure how to invalidate the user buffer without breaking the existing 
userspace app though.

The earlier idea on seq->buf was a quick thought. I suspect there are still 
things to think through if we do want to explore how to provide a better 
guarantee that seq_printf() can cover one whole batch. I still feel it is 
overkill.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  2024-01-05  0:33                 ` Martin KaFai Lau
@ 2024-01-08 23:24                   ` Aditi Ghag
  0 siblings, 0 replies; 14+ messages in thread
From: Aditi Ghag @ 2024-01-08 23:24 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, netdev,
	kernel-team, bpf



> On Jan 4, 2024, at 4:33 PM, Martin KaFai Lau <martin.lau@linux.dev> wrote:
> 
> On 1/4/24 3:38 PM, Aditi Ghag wrote:
>>>> I'm not sure about semantics of the resume operation for certain corner cases like these:
>>>> - The BPF UDP sockets iterator was stopped while iterating bucker #X, and the offset was set to 2. bpf_iter_udp_seq_stop then released references to the batched sockets, and marks the bucket X iterator state (aka iter->st_bucket_done) as false.
>>>> - Before the iterator is "resumed", the bucket #X was mutated such that the previously iterated sockets were removed, and new sockets were added.  With the current logic, the iterator will skip the first two sockets in the bucket, which isn't right. This is slightly different from the case where sockets were updated in the X -1 bucket *after* it was fully iterated. Since the bucket and sock locks are released, we don't have any guarantees that the underlying sockets table isn't mutated while the userspace has a valid iterator.
>>>> What do you think about such cases?
>>> I believe it is something orthogonal to the bug fix here but we could use this thread to discuss.
>> Yes, indeed! But I piggy-backed on the same thread, as one potential option could be to always start iterating from the beginning of a bucket. (More details below.)
>>> 
>>> This is not something specific to the bpf tcp/udp iter which uses the offset as a best effort to resume (e.g. the inet_diag and the /proc/net/{tcp[6],udp} are using similar strategy to resume). To improve it, it will need to introduce some synchronization with the (potentially fast path) writer side (e.g. bind, after 3WHS...etc). Not convinced it is worth it to catch these cases.
>> Right, synchronizing fast paths with the iterator logic seems like an overkill.
>> If we change the resume semantics, and make the iterator always start from the beginning of a bucket, it could solve some of these corner cases (and simplify the batching logic). The last I checked, the TCP (BPF) iterator logic was tightly coupled with the 
> 
> Always resume from the beginning of the bucket? hmm... then it is making backward progress and will hit the same bug again. or I miss-understood your proposal?

I presumed that the user would be required to pass a bigger buffer when seq_printf fails to capture the socket data being iterated (this was before I was aware of the logic that decides when to stop the sockets iterator).

Thanks for the code pointer in your last message, so I'll expand on the proposal below.

Also, we could continue to discuss if there are better ways to handle the cases where an iterator is stopped, but I would expect that we still need to fix the broken case in the current code, and get it backported. So I'll keep an eye out for your v2 patch. 

> 
>> file based iterator (/proc/net/{tcp,udp}), so I'm not sure if it's an easy change if we were to change the resume semantics for both TCP and UDP BFP iterators?
>> Note that, this behavior would be similar to the lseek operation with seq_file [1]. Here is a snippet -
> 
> bpf_iter does not support lseek.
> 
>> The stop() function closes a session; its job, of course, is to clean up. If dynamic memory is allocated for the iterator, stop() is the place to free it; if a lock was taken by start(), stop() must release that lock. The value that *pos was set to by the last next() call before stop() is remembered, and used for the first start() call of the next session unless lseek() has been called on the file; in that case next start() will be asked to start at position zero
>> [1] https://docs.kernel.org/filesystems/seq_file.html
>>> 
>>> For the cases described above, skipped the newer sockets is arguably ok. These two new sockets will not be captured anyway even the batch was not stop()-ed in the middle. I also don't see how it is different semantically if the two new sockets are added to the X-1 bucket: the sockets are added after the bpf-iter scanned it regardless they are added to an earlier bucket or to an earlier location of the same bucket.
>>> 
>>> That said, the bpf_iter_udp_seq_stop() should only happen if the bpf_prog bpf_seq_printf() something AND hit the seq->buf (PAGE_SIZE) << 3) limit or the count in "read(iter_fd, buf, count)" limit.
>> Thanks for sharing the additional context. Would you have a link for these set of conditions where an iterator can be stopped? It'll be good to document the API semantics so that users are aware of the implications of setting the read size parameter too low.
> 
> Most of the conditions should be in bpf_seq_read() in bpf_iter.c.

Ah! This is helpful.

> 
> Again, this resume on offset behavior is not something specific to bpf-{tcp,udp}-iter.
> 
>>> For this case, bpf_iter.c may be improved to capture the whole batch's seq_printf() to seq->buf first even the userspace's buf is full. It would be a separate effort if it is indeed needed.
>> Interesting proposal... Other option could be to invalidate the userspace iterator if an entire bucket batch is not being captured, so that userspace can retry with a bigger buffer.
> 
> Not sure how to invalidate the user buffer without breaking the existing userspace app though.

By "invalidate the user buffer", I meant passing an error code to the userspace app, so that the userspace can allocate a bigger buffer accordingly. (I wasn't aware if/of how this was being done behind the scenes in bpf_seq_read, so thanks for the pointer.)
Based on my reading of the code, bpf_seq_read does seem to pass an error code when the passed buffer size isn't enough. When that happens, I would've expected the userspace iterator to be invalidated rather than resumed, and the BPF iterator program to be rerun with a larger buffer.

> 
> The earlier idea on seq->buf was a quick thought. I suspect there is still things that need to think through if we did want to explore how to provide better guarantee to allow seq_printf() for one whole batch. I still feel it is overkill.

I'm still trying to fully grasp the logic in bpf_seq_read, but it seems to be a generic function for all BPF iterators (not just the BPF TCP/UDP iterators). *sigh* So if we wanted to simplify the resume case such that we didn't have to keep track of offsets within a batch, we would have to tailor bpf_seq_read specifically to the batching logic in the BPF TCP/UDP iterators (so that a batch's printf output can be fully captured). That would indeed be a separate effort and would need more discussion. One possible solution could be to handle the "resume" operation in seq->buf without involving the BPF TCP/UDP handlers, but I haven't fully thought this proposal through. /cc Daniel




^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-01-08 23:24 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-19 19:32 [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp Martin KaFai Lau
2023-12-19 19:32 ` [PATCH bpf 2/2] selftests/bpf: Test udp and tcp iter batching Martin KaFai Lau
2023-12-20 14:24 ` [PATCH bpf 1/2] bpf: Avoid iter->offset making backward progress in bpf_iter_udp Daniel Borkmann
2023-12-20 19:10   ` Martin KaFai Lau
2023-12-21  4:45     ` Martin KaFai Lau
2023-12-21 13:21       ` Daniel Borkmann
2023-12-21 14:58         ` Martin KaFai Lau
2023-12-21 20:27           ` Daniel Borkmann
2023-12-21 22:19             ` Martin KaFai Lau
2024-01-04 20:21           ` Aditi Ghag
2024-01-04 22:27             ` Martin KaFai Lau
2024-01-04 23:38               ` Aditi Ghag
2024-01-05  0:33                 ` Martin KaFai Lau
2024-01-08 23:24                   ` Aditi Ghag
