* [PATCH] bpf: call get_random_u32() for random integers
@ 2022-12-05 18:15 Jason A. Donenfeld
  2022-12-05 22:21 ` Daniel Borkmann
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2022-12-05 18:15 UTC
  To: bpf, netdev, linux-kernel
  Cc: Jason A. Donenfeld, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau

Since BPF's bpf_user_rnd_u32() was introduced, there have been three
significant developments in the RNG: 1) get_random_u32() returns the
same types of bytes as /dev/urandom, eliminating the distinction between
"kernel random bytes" and "userspace random bytes", 2) get_random_u32()
operates mostly locklessly over percpu state, 3) get_random_u32() has
become quite fast.

So rather than using the old clunky Tausworthe prandom code, just call
get_random_u32(), which should fit BPF uses perfectly.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 include/linux/bpf.h   |  1 -
 kernel/bpf/core.c     | 17 +----------------
 kernel/bpf/verifier.c |  2 --
 net/core/filter.c     |  1 -
 4 files changed, 1 insertion(+), 20 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0566705c1d4e..aae89318789a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2554,7 +2554,6 @@ const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
 
 /* Shared helpers among cBPF and eBPF. */
-void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 u64 bpf_get_raw_cpu_id(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 38159f39e2af..2cc28d63d761 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2579,14 +2579,6 @@ void bpf_prog_free(struct bpf_prog *fp)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_free);
 
-/* RNG for unpriviledged user space with separated state from prandom_u32(). */
-static DEFINE_PER_CPU(struct rnd_state, bpf_user_rnd_state);
-
-void bpf_user_rnd_init_once(void)
-{
-	prandom_init_once(&bpf_user_rnd_state);
-}
-
 BPF_CALL_0(bpf_user_rnd_u32)
 {
 	/* Should someone ever have the rather unwise idea to use some
@@ -2595,14 +2587,7 @@ BPF_CALL_0(bpf_user_rnd_u32)
 	 * transformations. Register assignments from both sides are
 	 * different, f.e. classic always sets fn(ctx, A, X) here.
 	 */
-	struct rnd_state *state;
-	u32 res;
-
-	state = &get_cpu_var(bpf_user_rnd_state);
-	res = prandom_u32_state(state);
-	put_cpu_var(bpf_user_rnd_state);
-
-	return res;
+	return get_random_u32();
 }
 
 BPF_CALL_0(bpf_get_raw_cpu_id)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 225666307bba..75a1a6526165 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14045,8 +14045,6 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 
 		if (insn->imm == BPF_FUNC_get_route_realm)
 			prog->dst_needed = 1;
-		if (insn->imm == BPF_FUNC_get_prandom_u32)
-			bpf_user_rnd_init_once();
 		if (insn->imm == BPF_FUNC_override_return)
 			prog->kprobe_override = 1;
 		if (insn->imm == BPF_FUNC_tail_call) {
diff --git a/net/core/filter.c b/net/core/filter.c
index bb0136e7a8e4..7a595ac0028d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -443,7 +443,6 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
 			break;
 		case SKF_AD_OFF + SKF_AD_RANDOM:
 			*insn = BPF_EMIT_CALL(bpf_user_rnd_u32);
-			bpf_user_rnd_init_once();
 			break;
 		}
 		break;
-- 
2.38.1



* Re: [PATCH] bpf: call get_random_u32() for random integers
  2022-12-05 18:15 [PATCH] bpf: call get_random_u32() for random integers Jason A. Donenfeld
@ 2022-12-05 22:21 ` Daniel Borkmann
  2022-12-05 22:47   ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Borkmann @ 2022-12-05 22:21 UTC
  To: Jason A. Donenfeld, bpf, netdev, linux-kernel
  Cc: Andrii Nakryiko, Martin KaFai Lau

On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
> Since BPF's bpf_user_rnd_u32() was introduced, there have been three
> significant developments in the RNG: 1) get_random_u32() returns the
> same types of bytes as /dev/urandom, eliminating the distinction between
> "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
> operates mostly locklessly over percpu state, 3) get_random_u32() has
> become quite fast.

Wrt "quite fast", do you have a comparison between the two? Asking as it's
often used in networking worst case on per packet basis (e.g. via XDP), would
be useful to state concrete numbers for the two on a given machine.

> So rather than using the old clunky Tausworthe prandom code, just call
> get_random_u32(), which should fit BPF uses perfectly.


* Re: [PATCH] bpf: call get_random_u32() for random integers
  2022-12-05 22:21 ` Daniel Borkmann
@ 2022-12-05 22:47   ` Jason A. Donenfeld
  2022-12-06 12:50     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2022-12-05 22:47 UTC
  To: Daniel Borkmann
  Cc: bpf, netdev, linux-kernel, Andrii Nakryiko, Martin KaFai Lau

On Mon, Dec 05, 2022 at 11:21:51PM +0100, Daniel Borkmann wrote:
> On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
> > Since BPF's bpf_user_rnd_u32() was introduced, there have been three
> > significant developments in the RNG: 1) get_random_u32() returns the
> > same types of bytes as /dev/urandom, eliminating the distinction between
> > "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
> > operates mostly locklessly over percpu state, 3) get_random_u32() has
> > become quite fast.
> 
> Wrt "quite fast", do you have a comparison between the two? Asking as it's
> often used in networking worst case on per packet basis (e.g. via XDP), would
> be useful to state concrete numbers for the two on a given machine.

Median of 25 cycles vs median of 38, on my Tiger Lake machine. So a
little slower, but too small of a difference to matter.


* Re: [PATCH] bpf: call get_random_u32() for random integers
  2022-12-05 22:47   ` Jason A. Donenfeld
@ 2022-12-06 12:50     ` Toke Høiland-Jørgensen
  2022-12-06 12:59       ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-06 12:50 UTC
  To: Jason A. Donenfeld, Daniel Borkmann
  Cc: bpf, netdev, linux-kernel, Andrii Nakryiko, Martin KaFai Lau

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> On Mon, Dec 05, 2022 at 11:21:51PM +0100, Daniel Borkmann wrote:
>> On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
>> > Since BPF's bpf_user_rnd_u32() was introduced, there have been three
>> > significant developments in the RNG: 1) get_random_u32() returns the
>> > same types of bytes as /dev/urandom, eliminating the distinction between
>> > "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
>> > operates mostly locklessly over percpu state, 3) get_random_u32() has
>> > become quite fast.
>> 
>> Wrt "quite fast", do you have a comparison between the two? Asking as it's
>> often used in networking worst case on per packet basis (e.g. via XDP), would
>> be useful to state concrete numbers for the two on a given machine.
>
> Median of 25 cycles vs median of 38, on my Tiger Lake machine. So a
> little slower, but too small of a difference to matter.

Assuming a 3 GHz CPU clock (so 3 cycles per nanosecond), that's an
additional overhead of ~4.3 ns. When processing 10 Gbps at line rate
with small packets, the per-packet processing budget is 67.2 ns, so
those extra 4.3 ns will eat up ~6.4% of the budget.
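
The arithmetic in the paragraph above can be reproduced with a short C sketch. The helper names are hypothetical and exist only for this illustration; the 3 GHz clock, the 25 and 38 cycle medians, and the 67.2 ns budget are the figures quoted in this thread, not new measurements.

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds, given the clock in GHz
 * (1 GHz is one cycle per nanosecond). Hypothetical helper. */
static double cycles_to_ns(double cycles, double clock_ghz)
{
	assert(clock_ghz > 0.0);
	return cycles / clock_ghz;
}

/* Percentage of a per-packet time budget consumed by an extra cost.
 * Hypothetical helper. */
static double budget_share_pct(double extra_ns, double budget_ns)
{
	assert(budget_ns > 0.0);
	return 100.0 * extra_ns / budget_ns;
}
```

Calling cycles_to_ns(38 - 25, 3.0) gives the extra cost per call in nanoseconds, and passing that result to budget_share_pct() together with 67.2 gives the share of the budget. Both results scale with the assumed clock, so a slower CPU makes the gap proportionally more expensive.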

So in other words, "too small a difference to matter" is definitely not
true in general. It really depends on the use case; if someone is using
this to, say, draw per-packet random numbers to compute a drop frequency
on ingress, that extra processing time will most likely result in a
quite measurable drop in performance.

-Toke


* Re: [PATCH] bpf: call get_random_u32() for random integers
  2022-12-06 12:50     ` Toke Høiland-Jørgensen
@ 2022-12-06 12:59       ` Jason A. Donenfeld
  2022-12-06 13:26         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2022-12-06 12:59 UTC
  To: Toke Høiland-Jørgensen
  Cc: Daniel Borkmann, bpf, netdev, linux-kernel, Andrii Nakryiko,
	Martin KaFai Lau

Hi Toke,

On Tue, Dec 6, 2022 at 1:50 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> "Jason A. Donenfeld" <Jason@zx2c4.com> writes:
>
> > On Mon, Dec 05, 2022 at 11:21:51PM +0100, Daniel Borkmann wrote:
> >> On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
> >> > Since BPF's bpf_user_rnd_u32() was introduced, there have been three
> >> > significant developments in the RNG: 1) get_random_u32() returns the
> >> > same types of bytes as /dev/urandom, eliminating the distinction between
> >> > "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
> >> > operates mostly locklessly over percpu state, 3) get_random_u32() has
> >> > become quite fast.
> >>
> >> Wrt "quite fast", do you have a comparison between the two? Asking as it's
> >> often used in networking worst case on per packet basis (e.g. via XDP), would
> >> be useful to state concrete numbers for the two on a given machine.
> >
> > Median of 25 cycles vs median of 38, on my Tiger Lake machine. So a
> > little slower, but too small of a difference to matter.
>
> Assuming a 3 GHz CPU clock (so 3 cycles per nanosecond), that's an
> additional overhead of ~4.3 ns. When processing 10 Gbps at line rate
> with small packets, the per-packet processing budget is 67.2 ns, so
> those extra 4.3 ns will eat up ~6.4% of the budget.
>
> So in other words, "too small a difference to matter" is definitely not
> true in general. It really depends on the use case; if someone is using
> this to, say, draw per-packet random numbers to compute a drop frequency
> on ingress, that extra processing time will most likely result in a
> quite measurable drop in performance.

Huh, neat calculation, I'll keep that method in mind.

Alright, sorry for the noise here. I'll check back in if I ever manage
to eliminate that performance gap.

Jason


* Re: [PATCH] bpf: call get_random_u32() for random integers
  2022-12-06 12:59       ` Jason A. Donenfeld
@ 2022-12-06 13:26         ` Toke Høiland-Jørgensen
  2022-12-06 13:30           ` Jason A. Donenfeld
  0 siblings, 1 reply; 8+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-06 13:26 UTC
  To: Jason A. Donenfeld
  Cc: Daniel Borkmann, bpf, netdev, linux-kernel, Andrii Nakryiko,
	Martin KaFai Lau

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Hi Toke,
>
> On Tue, Dec 6, 2022 at 1:50 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> "Jason A. Donenfeld" <Jason@zx2c4.com> writes:
>>
>> > On Mon, Dec 05, 2022 at 11:21:51PM +0100, Daniel Borkmann wrote:
>> >> On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
>> >> > Since BPF's bpf_user_rnd_u32() was introduced, there have been three
>> >> > significant developments in the RNG: 1) get_random_u32() returns the
>> >> > same types of bytes as /dev/urandom, eliminating the distinction between
>> >> > "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
>> >> > operates mostly locklessly over percpu state, 3) get_random_u32() has
>> >> > become quite fast.
>> >>
>> >> Wrt "quite fast", do you have a comparison between the two? Asking as it's
>> >> often used in networking worst case on per packet basis (e.g. via XDP), would
>> >> be useful to state concrete numbers for the two on a given machine.
>> >
>> > Median of 25 cycles vs median of 38, on my Tiger Lake machine. So a
>> > little slower, but too small of a difference to matter.
>>
>> Assuming a 3 GHz CPU clock (so 3 cycles per nanosecond), that's an
>> additional overhead of ~4.3 ns. When processing 10 Gbps at line rate
>> with small packets, the per-packet processing budget is 67.2 ns, so
>> those extra 4.3 ns will eat up ~6.4% of the budget.
>>
>> So in other words, "too small a difference to matter" is definitely not
>> true in general. It really depends on the use case; if someone is using
>> this to, say, draw per-packet random numbers to compute a drop frequency
>> on ingress, that extra processing time will most likely result in a
>> quite measurable drop in performance.
>
> Huh, neat calculation, I'll keep that method in mind.

Yeah, I find that thinking in "time budget per packet" helps a lot! For
completeness, the 67.2 ns comes from 10 Gbps / 84 bytes (that's 64-byte
packets + 20 bytes of preamble and inter-packet gap), which is 14.8
Mpps, or 67.2 ns/pkt.
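
As a worked version of that derivation, the following C sketch computes the line-rate packet rate and the resulting time budget per packet. The function names are hypothetical; the inputs (10 Gbps, 64-byte packets plus 20 bytes of preamble and inter-packet gap) are the ones given above.

```c
#include <assert.h>

/* Packets per second at line rate, given the link speed in bits per
 * second and the bytes each packet occupies on the wire (payload plus
 * preamble and inter-packet gap). Hypothetical helper. */
static double line_rate_pps(double link_bps, unsigned int wire_bytes)
{
	assert(wire_bytes > 0);
	return link_bps / (8.0 * wire_bytes);
}

/* Time available to process one packet at line rate, in nanoseconds.
 * Hypothetical helper. */
static double per_packet_budget_ns(double link_bps, unsigned int wire_bytes)
{
	return 1e9 / line_rate_pps(link_bps, wire_bytes);
}
```

With link_bps = 10e9 and wire_bytes = 64 + 20, line_rate_pps() returns about 14.88 million packets per second and per_packet_budget_ns() returns 67.2 ns. Larger packets or slower links widen the budget, which is why the small-packet case is the one to check.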

> Alright, sorry for the noise here. I'll check back in if I ever manage
> to eliminate that performance gap.

One approach that we use generally for XDP (and which may or may not
help for this case) is that we try to amortise as much fixed cost as we
can over a batch of packets. Because XDP processing always happens in
the NAPI receive poll loop, we have a natural batching mechanism, and we
know that for that whole loop we'll keep running on the same CPU.

So for instance, if there's a large fixed component of the overhead of
get_random_u32(), we could have bpf_user_rnd_u32() populate a larger
per-CPU buffer and then just emit u32 chunks of that as long as we're
still in the same NAPI loop as the first call. Or something to that
effect. Not sure if this makes sense for this use case, but figured I'd
throw the idea out there :)

-Toke


* Re: [PATCH] bpf: call get_random_u32() for random integers
  2022-12-06 13:26         ` Toke Høiland-Jørgensen
@ 2022-12-06 13:30           ` Jason A. Donenfeld
  2022-12-06 13:53             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 8+ messages in thread
From: Jason A. Donenfeld @ 2022-12-06 13:30 UTC
  To: Toke Høiland-Jørgensen
  Cc: Daniel Borkmann, bpf, netdev, linux-kernel, Andrii Nakryiko,
	Martin KaFai Lau

Hi Toke,

On Tue, Dec 6, 2022 at 2:26 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> So for instance, if there's a large fixed component of the overhead of
> get_random_u32(), we could have bpf_user_rnd_u32() populate a larger
> per-CPU buffer and then just emit u32 chunks of that as long as we're
> still in the same NAPI loop as the first call. Or something to that
> effect. Not sure if this makes sense for this use case, but figured I'd
> throw the idea out there :)

Actually, this already is how get_random_u32() works! It buffers a
bunch of u32s in percpu batches, and doles them out as requested.
However, this API currently works in all contexts, including in
interrupts. So each call results in disabling irqs and reenabling
them. If I bifurcated batches into irq batches and non-irq batches, so
that we only needed to disable preemption for the non-irq batches,
that'd probably improve things quite a bit, since then the overhead
really would reduce to just a memcpy for the majority of calls. But I
don't know if adding that duplication of all code paths is really
worth the huge hassle.
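
To illustrate the batching idea in isolation, here is a minimal userspace sketch in C. It is not the kernel's implementation: the struct, the function names, and the batch size are hypothetical, the xorshift32 generator is only a stand-in for the expensive refill step and is not cryptographically secure, and the irq or preemption handling discussed above is not modeled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_WORDS 16	/* hypothetical batch size */

struct u32_batch {
	uint32_t words[BATCH_WORDS];
	size_t pos;		/* next unused word; BATCH_WORDS means empty */
	uint32_t seed;		/* state of the stand-in generator */
	unsigned long refills;	/* how often the expensive path ran */
};

/* Stand-in for the expensive source: xorshift32, NOT secure. */
static uint32_t slow_source_next(uint32_t *s)
{
	uint32_t x = *s;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*s = x;
	return x;
}

static void batch_init(struct u32_batch *b, uint32_t seed)
{
	assert(seed != 0);	/* xorshift32 never leaves the zero state */
	b->pos = BATCH_WORDS;	/* start empty so the first get refills */
	b->seed = seed;
	b->refills = 0;
}

/* Expensive path: fill the whole batch in one go. */
static void batch_refill(struct u32_batch *b)
{
	for (size_t i = 0; i < BATCH_WORDS; i++)
		b->words[i] = slow_source_next(&b->seed);
	b->pos = 0;
	b->refills++;
}

/* Cheap path: hand out the next buffered word, refilling when empty. */
static uint32_t batch_get_u32(struct u32_batch *b)
{
	if (b->pos == BATCH_WORDS)
		batch_refill(b);
	return b->words[b->pos++];
}
```

After batch_init(), each call to batch_get_u32() costs an index check and a load except for one refill every BATCH_WORDS calls, so the fixed cost of the source is amortised over the batch. In a per-CPU setting there would be one such struct per CPU, and whatever protects it on each call (disabling irqs versus only disabling preemption) is then the remaining per-call overhead.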

Jason


* Re: [PATCH] bpf: call get_random_u32() for random integers
  2022-12-06 13:30           ` Jason A. Donenfeld
@ 2022-12-06 13:53             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 8+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-06 13:53 UTC
  To: Jason A. Donenfeld
  Cc: Daniel Borkmann, bpf, netdev, linux-kernel, Andrii Nakryiko,
	Martin KaFai Lau

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Hi Toke,
>
> On Tue, Dec 6, 2022 at 2:26 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> So for instance, if there's a large fixed component of the overhead of
>> get_random_u32(), we could have bpf_user_rnd_u32() populate a larger
>> per-CPU buffer and then just emit u32 chunks of that as long as we're
>> still in the same NAPI loop as the first call. Or something to that
>> effect. Not sure if this makes sense for this use case, but figured I'd
>> throw the idea out there :)
>
> Actually, this already is how get_random_u32() works! It buffers a
> bunch of u32s in percpu batches, and doles them out as requested.

Ah, right. Not terribly surprised you already did this!

> However, this API currently works in all contexts, including in
> interrupts. So each call results in disabling irqs and reenabling
> them. If I bifurcated batches into irq batches and non-irq batches, so
> that we only needed to disable preemption for the non-irq batches,
> that'd probably improve things quite a bit, since then the overhead
> really would reduce to just a memcpy for the majority of calls. But I
> don't know if adding that duplication of all code paths is really
> worth the huge hassle.

Right, makes sense; happy to leave that decision entirely up to you :)

-Toke

