stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Flaw in "random32: update the net random state on interrupt and activity"
       [not found] <9f74230f-ba4d-2e19-5751-79dc2ab59877@gmail.com>
@ 2020-08-05  0:57 ` Marc Plumb
  2020-08-05  1:02 ` Linus Torvalds
  2020-08-05  2:49 ` Willy Tarreau
  2 siblings, 0 replies; 32+ messages in thread
From: Marc Plumb @ 2020-08-05  0:57 UTC (permalink / raw)
  To: tytso, w
  Cc: netdev, aksecurity, torvalds, edumazet, Jason, luto, keescook,
	tglx, peterz, stable

Willy and Ted,

This commit has serious security flaws 
f227e3ec3b5cad859ad15666874405e8c1bbc1d4


TL;DR This change takes the seed data from get_random_bytes and 
broadcasts it to the network, thereby destroying the security of 
dev/random. This change needs to be reverted and redesigned.


It is inefficient:

This function is called from an interrupt context, so there is no chance 
of a CPU switch, therefore the this_cpu_add function should be 
__this_cpu_add. This is a sign that the patch may have been rushed and 
may not be suitable for a stable release.


It is fixing the wrong problem:

The net_rand_state PRNG is a weak PRNG for the purpose of avoiding 
collisions, not to be unguessable to an attacker. The network PRNG does 
not need secure seeding. If you need a secure PRNG then you shouldn't be 
using the net_rand_state PRNG. Please reconsider why you think that this 
change is necessary.

It dramatically weakens dev/random:

Seeding two PRNGs with the same entropy causes two problems. The minor 
one is that you're double counting entropy. The major one is that anyone 
who can determine the state of one PRNG can determine the state of the 
other.

The net_rand_state PRNG is effectively a 113 bit LFSR, so anyone who can 
see any 113 bits of output can determine the complete internal state.

The output of the net_rand_state PRNG is used to determine how data is 
sent to the network, so the output is effectively broadcast to anyone 
watching network traffic. Therefore anyone watching the network traffic 
can determine the seed data being fed to the net_rand_state PRNG. Since 
this is the same seed data being fed to get_random_bytes, it allows an 
attacker to determine the state and there output of /dev/random. I 
sincerely hope that this was not the intended goal. :)

Thank you
Marc

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
       [not found] <9f74230f-ba4d-2e19-5751-79dc2ab59877@gmail.com>
  2020-08-05  0:57 ` Flaw in "random32: update the net random state on interrupt and activity" Marc Plumb
@ 2020-08-05  1:02 ` Linus Torvalds
  2020-08-05  2:49 ` Willy Tarreau
  2 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2020-08-05  1:02 UTC (permalink / raw)
  To: Marc Plumb
  Cc: Theodore Ts'o, Willy Tarreau, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable

On Tue, Aug 4, 2020 at 5:52 PM Marc Plumb <lkml.mplumb@gmail.com> wrote:
>
> TL;DR This change takes the seed data from get_random_bytes and broadcasts it to the network, thereby destroying the security of dev/random. This change needs to be reverted and redesigned.

This was discussed.,

It's theoretical, not practical.

The patch improves real security, and the fake "but in theory" kind is
meaningless and people should stop that kind of behavior.

                Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
       [not found] <9f74230f-ba4d-2e19-5751-79dc2ab59877@gmail.com>
  2020-08-05  0:57 ` Flaw in "random32: update the net random state on interrupt and activity" Marc Plumb
  2020-08-05  1:02 ` Linus Torvalds
@ 2020-08-05  2:49 ` Willy Tarreau
  2020-08-05 15:34   ` tytso
  2020-08-05 15:44   ` Marc Plumb
  2 siblings, 2 replies; 32+ messages in thread
From: Willy Tarreau @ 2020-08-05  2:49 UTC (permalink / raw)
  To: Marc Plumb
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

Hi Marc,

On Tue, Aug 04, 2020 at 05:52:36PM -0700, Marc Plumb wrote:
> Seeding two PRNGs with the same entropy causes two problems. The minor one
> is that you're double counting entropy. The major one is that anyone who can
> determine the state of one PRNG can determine the state of the other.
> 
> The net_rand_state PRNG is effectively a 113 bit LFSR, so anyone who can see
> any 113 bits of output can determine the complete internal state.
> 
> The output of the net_rand_state PRNG is used to determine how data is sent
> to the network, so the output is effectively broadcast to anyone watching
> network traffic. Therefore anyone watching the network traffic can determine
> the seed data being fed to the net_rand_state PRNG.

The problem this patch is trying to work around is that the reporter
(Amit) was able to determine the entire net_rand_state after observing
a certain number of packets due to this trivial LFSR and the fact that
its internal state between two reseedings only depends on the number
of calls to read it. (please note that regarding this point I'll
propose a patch to replace that PRNG to stop directly exposing the
internal state to the network).

If you look closer at the patch, you'll see that in one interrupt
the patch only uses any 32 out of the 128 bits of fast_pool to
update only 32 bits of the net_rand_state. As such, the sequence
observed on the network also depends on the remaining bits of
net_rand_state, while the 96 other bits of the fast_pool are not
exposed there.

> Since this is the same
> seed data being fed to get_random_bytes, it allows an attacker to determine
> the state and there output of /dev/random. I sincerely hope that this was
> not the intended goal. :)

Not only was this obviously not the goal, but I'd be particularly
interested in seeing this reality demonstrated, considering that
the whole 128 bits of fast_pool together count as a single bit of
entropy, and that as such, even if you were able to figure the
value of the 32 bits leaked to net_rand_state, you'd still have to
guess the 96 other bits for each single entropy bit :-/

Regards,
Willy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05  2:49 ` Willy Tarreau
@ 2020-08-05 15:34   ` tytso
  2020-08-05 16:06     ` Marc Plumb
                       ` (2 more replies)
  2020-08-05 15:44   ` Marc Plumb
  1 sibling, 3 replies; 32+ messages in thread
From: tytso @ 2020-08-05 15:34 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Marc Plumb, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

On Wed, Aug 05, 2020 at 04:49:41AM +0200, Willy Tarreau wrote:
> 
> Not only was this obviously not the goal, but I'd be particularly
> interested in seeing this reality demonstrated, considering that
> the whole 128 bits of fast_pool together count as a single bit of
> entropy, and that as such, even if you were able to figure the
> value of the 32 bits leaked to net_rand_state, you'd still have to
> guess the 96 other bits for each single entropy bit :-/

Not only that, you'd have to figure out which 32-bits in the fast_pool
actually had gotten leaked to the net_rand_state.

I agree with Willy that I'd love to see an exploit since it would
probably give a lot of insights.  Maybe at a Crypto rump session once
it's safe to have those sorts of things again.  :-)

That being said, it certainly is a certificational / theoretical
weakness, and if the bright boys and girls at Fort Meade did figure
out a way to exploit this, they are very much unlikely to share it at
an open Crypto conference.  So replacing LFSR-based PRnG with
something stronger which didn't release any bits from the fast_pool
would certainly be desireable, and I look forward to seeing what Willy
has in mind.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05  2:49 ` Willy Tarreau
  2020-08-05 15:34   ` tytso
@ 2020-08-05 15:44   ` Marc Plumb
  2020-08-05 16:39     ` Linus Torvalds
  1 sibling, 1 reply; 32+ messages in thread
From: Marc Plumb @ 2020-08-05 15:44 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

Hi Willy,

On 2020-08-04 7:49 p.m., Willy Tarreau wrote:
> Hi Marc,
>
> On Tue, Aug 04, 2020 at 05:52:36PM -0700, Marc Plumb wrote:
>> Seeding two PRNGs with the same entropy causes two problems. The minor one
>> is that you're double counting entropy. The major one is that anyone who can
>> determine the state of one PRNG can determine the state of the other.
>>
>> The net_rand_state PRNG is effectively a 113 bit LFSR, so anyone who can see
>> any 113 bits of output can determine the complete internal state.
>>
>> The output of the net_rand_state PRNG is used to determine how data is sent
>> to the network, so the output is effectively broadcast to anyone watching
>> network traffic. Therefore anyone watching the network traffic can determine
>> the seed data being fed to the net_rand_state PRNG.
> The problem this patch is trying to work around is that the reporter
> (Amit) was able to determine the entire net_rand_state after observing
> a certain number of packets due to this trivial LFSR and the fact that
> its internal state between two reseedings only depends on the number
> of calls to read it.

I thought net_rand_state was assumed to be insecure and that anyone 
could determine the internal state. Isn't this Working as Designed? It's 
a PRNG not a CPRNG. If some aspects of security depends on the sate 
remaining secret then this is fundamentally the wrong tool.

> (please note that regarding this point I'll
> propose a patch to replace that PRNG to stop directly exposing the
> internal state to the network).

I'm glad to hear that. A good option would be SFC32.

> If you look closer at the patch, you'll see that in one interrupt
> the patch only uses any 32 out of the 128 bits of fast_pool to
> update only 32 bits of the net_rand_state. As such, the sequence
> observed on the network also depends on the remaining bits of
> net_rand_state, while the 96 other bits of the fast_pool are not
> exposed there.

The fast pool contains 128 bits of state, not 128 bits of entropy. The 
purpose of the large pool size is to make sure that the entropy is not 
lost due to collisions. The non-linear mixing function (a simplified 
version of a single round of the ChaCha mixing function so the entropy 
diffusion is low) means that the 32 bits leaked are not independent from 
the other 96 bits, and in fact you can reconstruct the entire pool from 
a single reading of 32 bits (as long as there aren't more than 32 bits 
of entropy added during this time -- which isn't the case, see below). 
Please think harder about this part. I think you are misunderstanding 
how this code works.

>> Since this is the same
>> seed data being fed to get_random_bytes, it allows an attacker to determine
>> the state and there output of /dev/random. I sincerely hope that this was
>> not the intended goal. :)
> Not only was this obviously not the goal, but I'd be particularly
> interested in seeing this reality demonstrated, considering that
> the whole 128 bits of fast_pool together count as a single bit of
> entropy, and that as such, even if you were able to figure the
> value of the 32 bits leaked to net_rand_state, you'd still have to
> guess the 96 other bits for each single entropy bit :-/

The code assumes that there is at least 1/64 bit of entropy per sample, 
and at most 2 bits of entropy per sample (which is why it dumps 128 bits 
every 64 samples). If you're extracting 32 bits every sample, which 
means it's leaking 2048 bits in 64 samples (to net_random, how fast it 
leaks to the outside world is a different issue). So the question is if 
an attacker can reconstruct 128 bits of entropy from 2048 bits of 
internal state -- this does not seem obviously impossible, since there 
are no cryptographically vetted operations in this.

The other thing that this misses is that reseeding in dribs and drabs 
makes it much easier to recover the new state. This is is explained well 
in the documentation about catastrophic reseeding in the Fortuna CPRNG. 
This entire approach is flawed. Throwing in single bits of entropy at a 
time doesn't really help since an attacker can brute force a single bit 
at a time. (There is a challenge in deriving the initial fast_pool 
states which this makes more difficult, but there is no real 
crytptographic guarantee about how difficult it is.)

Thanks,

Marc


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 15:34   ` tytso
@ 2020-08-05 16:06     ` Marc Plumb
  2020-08-05 19:38       ` Willy Tarreau
  2020-08-05 22:05       ` tytso
  2020-08-05 16:24     ` Jason A. Donenfeld
  2020-08-05 16:53     ` Willy Tarreau
  2 siblings, 2 replies; 32+ messages in thread
From: Marc Plumb @ 2020-08-05 16:06 UTC (permalink / raw)
  To: tytso, Willy Tarreau
  Cc: netdev, aksecurity, torvalds, edumazet, Jason, luto, keescook,
	tglx, peterz, stable

Hi Ted,

On 2020-08-05 8:34 a.m., tytso@mit.edu wrote:
> On Wed, Aug 05, 2020 at 04:49:41AM +0200, Willy Tarreau wrote:
>> Not only was this obviously not the goal, but I'd be particularly
>> interested in seeing this reality demonstrated, considering that
>> the whole 128 bits of fast_pool together count as a single bit of
>> entropy, and that as such, even if you were able to figure the
>> value of the 32 bits leaked to net_rand_state, you'd still have to
>> guess the 96 other bits for each single entropy bit :-/
> Not only that, you'd have to figure out which 32-bits in the fast_pool
> actually had gotten leaked to the net_rand_state.

That's 2 bits which are already inputs to the fast_pool, so it doesn't 
even make a brute force any more difficult.

> I agree with Willy that I'd love to see an exploit since it would
> probably give a lot of insights.  Maybe at a Crypto rump session once
> it's safe to have those sorts of things again.  :-)

Just because you or I don't have a working exploit doesn't mean that 
someone else isn't more clever. It pays to be paranoid about 
cryptographic primitives and there is nothing more important than the 
entropy pool.

> So replacing LFSR-based PRnG with
> something stronger which didn't release any bits from the fast_pool
> would certainly be desireable, and I look forward to seeing what Willy
> has in mind.

Isn't get_random_u32 the function you wrote to do that? If this needs to 
be cryptographically secure, that's an existing option that's safe.

The fundamental question is: Why is this attack on net_rand_state 
problem? It's Working as Designed. Why is it a major enough problem to 
risk harming cryptographically important functions?

Do you remember how you resisted making dev/urandom fast for large reads 
for a long time to punish stupid uses of the interface? In this case 
anyone who is using net_rand_state assuming it is a CPRNG should stop 
doing that. Don't enable stupidity in the kernel.

This whole thing is making the fundamental mistake of all amateur 
cryptographers of trying to create your own cryptographic primitive. 
You're trying to invent a secure stream cipher. Either don't try to make 
net_rand_state secure, or use a known secure primitive.


Thanks,

Marc



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 15:34   ` tytso
  2020-08-05 16:06     ` Marc Plumb
@ 2020-08-05 16:24     ` Jason A. Donenfeld
  2020-08-05 16:53     ` Willy Tarreau
  2 siblings, 0 replies; 32+ messages in thread
From: Jason A. Donenfeld @ 2020-08-05 16:24 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Willy Tarreau, Marc Plumb, Netdev, Amit Klein, Linus Torvalds,
	Eric Dumazet, Andrew Lutomirski, Kees Cook, Thomas Gleixner,
	Peter Zijlstra, stable

On Wed, Aug 5, 2020 at 6:07 PM <tytso@mit.edu> wrote:
> That being said, it certainly is a certificational / theoretical
> weakness

random.c is filled with super suspicious things that are probably only
correct by accident, or only correct in practice, but in theory it's
just such a mess. Stupid example if I'm remembering correctly: you
fill the sha1 IV with input from rdrand; if rdrand is borked or
backdoored or whatever, then the security of sha1 there reduces to
shacal1, which isn't totally broken, far from it actually, so we're
fine there, but you can't help but look at that and say "ugh." I'll
rewrite that eventually. Anyway, having another "certificational
weakness", as you put it, that isn't a practical one would be par for
the course with random.c

> , and if the bright boys and girls at Fort Meade did figure
> out a way to exploit this, they are very much unlikely to share it at
> an open Crypto conference.  So replacing LFSR-based PRnG with
> something stronger which didn't release any bits from the fast_pool
> would certainly be desireable, and I look forward to seeing what Willy
> has in mind.

This disaster is partially my fault. I was going to make
get_random_u{32,64} fast enough that we could get rid of the fake rng
stuff, and then Covid things got weird and I haven't gotten refocused
on that yet. Andy started that, and I was supposed to pick up his work
and complete it, but I dropped the ball. I kept meaning to get back to
that, but I'd get discouraged every time I saw Willy sending more
messages about improving the fake rng stuff with more fake rng stuff.
But, seems like it's time for me to step on the pedal a bit and get
that working. So hopefully we'll be able to get rid of the code in
question here, and use a good rng everywhere. I'll send an update on
that if I get it working.


Jason

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 15:44   ` Marc Plumb
@ 2020-08-05 16:39     ` Linus Torvalds
  2020-08-05 23:49       ` Stephen Hemminger
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2020-08-05 16:39 UTC (permalink / raw)
  To: Marc Plumb
  Cc: Willy Tarreau, Theodore Ts'o, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable

On Wed, Aug 5, 2020 at 8:44 AM Marc Plumb <lkml.mplumb@gmail.com> wrote:
>
> I thought net_rand_state was assumed to be insecure and that anyone
> could determine the internal state. Isn't this Working as Designed?

I was working as designed - because it wasn't really designed to be
"real crypto" - but sadly it's also the only thing that is fast enough
for a lot of networking.

So it may be _designed_ to be "not real crypto" and to have a
discoverable internal state. But once again, reality interferes, and
it turns out that people really want something very very fast that is
also not deterministic enough to be discoverable at least remotely.

The stuff that is actually designed and intended to be a complete
black box is sadly also much too slow. By about an order of magnitude.

           Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 15:34   ` tytso
  2020-08-05 16:06     ` Marc Plumb
  2020-08-05 16:24     ` Jason A. Donenfeld
@ 2020-08-05 16:53     ` Willy Tarreau
  2 siblings, 0 replies; 32+ messages in thread
From: Willy Tarreau @ 2020-08-05 16:53 UTC (permalink / raw)
  To: tytso
  Cc: Marc Plumb, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

Hi Ted,

On Wed, Aug 05, 2020 at 11:34:32AM -0400, tytso@mit.edu wrote:
> That being said, it certainly is a certificational / theoretical
> weakness, and if the bright boys and girls at Fort Meade did figure
> out a way to exploit this, they are very much unlikely to share it at
> an open Crypto conference.  So replacing LFSR-based PRnG with
> something stronger which didn't release any bits from the fast_pool
> would certainly be desireable, and I look forward to seeing what Willy
> has in mind.

I'll post a proposal patch shortly about this, hopefully this week-end
(got diverted by work lately :-)). Just to give you a few pointers,
it's a small modification of MSWS. It passes the Practrand test suite
on 256 GB of data with zero warning (something that Tausworthe is
supposed to fail at).

By default, MSWS *does* leak its internal state, as Amit showed us (and
seeing that the paper on it suggests it's safe as-is for crypto use is
a bit shocking), but once slightly adjusted, it doesn't reveal its state
anymore and that would constitute a much more future-proof solution for
quite some time. Tausworthe was created something like 20 years ago or
so, hence it's not surprizing that it's a bit dated by now, but if we
can upgrade once every 2 decades I guess it's not that bad.

Cheers,
Willy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 16:06     ` Marc Plumb
@ 2020-08-05 19:38       ` Willy Tarreau
  2020-08-05 22:21         ` Marc Plumb
  2020-08-05 22:05       ` tytso
  1 sibling, 1 reply; 32+ messages in thread
From: Willy Tarreau @ 2020-08-05 19:38 UTC (permalink / raw)
  To: Marc Plumb
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

Hi Mark,

On Wed, Aug 05, 2020 at 09:06:40AM -0700, Marc Plumb wrote:
> Just because you or I don't have a working exploit doesn't mean that someone
> else isn't more clever.

I agree on the principle, but it can be said from many things, including
our respective inability to factor large numbers for example. But for
sure we do need to be careful, and actually picking only some limited
parts of the fast pool (which are only used to update the input pool
and are only made of low-difficulty stuff like instruction pointers,
jiffies and TSC values) is probably not going to disclose an extremely
well guarded secret.

> The fundamental question is: Why is this attack on net_rand_state problem?
> It's Working as Designed. Why is it a major enough problem to risk harming
> cryptographically important functions?

It's not *that* major an issue (in my personal opinion) but the current
net_rand_state is easy enough to guess so that an observer may reduce
the difficulty to build certain attacks (using known source ports for
example). The goal of this change (and the one in update_process_times())
is to disturb the net_rand_state a little bit so that external observations
turn from "this must be that" to "this may be this or maybe that", which
is sufficient to limit the ability to reliably guess a state and reduce
the cost of an attack.

Another approach involving the replacement of the algorithm was considered
but we were working with -stable in mind, trying to limit the backporting
difficulty (and it revealed a circular dependency nightmare that had been
sleeping there for years), and making the changes easier to check (which
is precisely what you're doing).

> Do you remember how you resisted making dev/urandom fast for large reads for
> a long time to punish stupid uses of the interface? In this case anyone who
> is using net_rand_state assuming it is a CPRNG should stop doing that. Don't
> enable stupidity in the kernel.
> 
> This whole thing is making the fundamental mistake of all amateur
> cryptographers of trying to create your own cryptographic primitive. You're
> trying to invent a secure stream cipher. Either don't try to make
> net_rand_state secure, or use a known secure primitive.

We're not trying to invent any stream cipher or whatever, just making
use of a few bits that are never exposed alone as-is to internal nor
external states, to slightly disturb another state that otherwise only
changes once a minute so that there's no more a 100% chance of guessing
a 16-bit port after seeing a few packets. I mean, I'm pretty sure that
even stealing three or four bits only from there would be quite enough
to defeat the attack given that Amit only recovers a few bits per packet.

For me the right longterm solution will be to replace the easily guessable
LFSR. But given the build breakage we got by just adding one include, I
can only guess what we'll see when trying to do more in this area :-/

Regards,
Willy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 16:06     ` Marc Plumb
  2020-08-05 19:38       ` Willy Tarreau
@ 2020-08-05 22:05       ` tytso
  2020-08-05 23:03         ` Andy Lutomirski
  2020-08-06 17:00         ` Marc Plumb
  1 sibling, 2 replies; 32+ messages in thread
From: tytso @ 2020-08-05 22:05 UTC (permalink / raw)
  To: Marc Plumb
  Cc: Willy Tarreau, netdev, aksecurity, torvalds, edumazet, Jason,
	luto, keescook, tglx, peterz, stable

On Wed, Aug 05, 2020 at 09:06:40AM -0700, Marc Plumb wrote:
> Isn't get_random_u32 the function you wrote to do that? If this needs to be
> cryptographically secure, that's an existing option that's safe.
> 
> The fundamental question is: Why is this attack on net_rand_state problem?
> It's Working as Designed. Why is it a major enough problem to risk harming
> cryptographically important functions?

I haven't looked at the users of net_rand_state, but historically, the
networking subsystem has a expressed a (perceived?) need for *very* fast
mostly-random-but-if-doens't-have-to-be-perfect-numbers-our-benchmarks-
are-way-more-important numbers.   As in, if there are extra cache line
misses, our benchmarks would suffer and that's not acceptable.

One of the problems here is that it's not sufficient for the average case
to be fast, but once in every N operations, we need to do something that
requires Real Crypto, and so that Nth time, there would be an extra lag
and that would be the end of the world (at least as far as networking
benchmarks are concerned, anyway).   So in other words, it's not enough for
the average time to run get_random_u32() to be fast, they care about the 95th or
99th percentile number of get_random_u32() to be fast as well.

An example of this would be for TCP sequence number generation; it's
not *really* something that needs to be secure, and if we rekey the
RNG every 5 minutes, so long as the best case attack takes at most,
say, an hour, if the worst the attacker can do is to be able to carry
out an man-in-the-middle attack without being physically in between
the source and the destination --- well, if you *really* cared about
security the TCP connection would be protected using TLS anyway.  See
RFC 1948 (later updated by RFC 6528) for an argument along these
lines.

> This whole thing is making the fundamental mistake of all amateur
> cryptographers of trying to create your own cryptographic primitive. You're
> trying to invent a secure stream cipher. Either don't try to make
> net_rand_state secure, or use a known secure primitive.

Well, technically it's not supposed to be a secure cryptographic
primitive.  net_rand_state is used in the call prandom_u32(), so the
only supposed guarantee is PSEUDO random.

That being said, a quick "get grep prandom_u32" shows that there are a
*huge* number of uses of prandom_u32() and whether they are all
appropriate uses of prandom_u32(), or kernel developers are using it
because "I haz a ne3D for spE3d" but in fact it's for a security
critical application is a pretty terrifying question.  If we start
seeing CVE's getting filed caused by inappropriate uses of
prandom_u32, to be honest, it won't surprise me.

						- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 19:38       ` Willy Tarreau
@ 2020-08-05 22:21         ` Marc Plumb
  2020-08-06  6:30           ` Willy Tarreau
  0 siblings, 1 reply; 32+ messages in thread
From: Marc Plumb @ 2020-08-05 22:21 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

Willy,

On 2020-08-05 12:38 p.m., Willy Tarreau wrote:

> It's not *that* major an issue (in my personal opinion) but the current
> net_rand_state is easy enough to guess so that an observer may reduce
> the difficulty to build certain attacks (using known source ports for
> example). The goal of this change (and the one in update_process_times())
> is to disturb the net_rand_state a little bit so that external observations
> turn from "this must be that" to "this may be this or maybe that", which
> is sufficient to limit the ability to reliably guess a state and reduce
> the cost of an attack.
There is nothing wrong with perturbing net_rand_state, the sin is doing 
it with the raw entropy that is also the seed for your CPRNG. Use the 
output of a CPRNG to perturb the pool all you want, but don't do things 
that bit by bit reveal the entropy that is being fed into the CPRNG.


> Another approach involving the replacement of the algorithm was considered
> but we were working with -stable in mind, trying to limit the backporting
> difficulty (and it revealed a circular dependency nightmare that had been
> sleeping there for years), and making the changes easier to check (which
> is precisely what you're doing).

Really? You can replace the LFSR and only change lib/random32.c. That 
might fix the problem without the need to use the raw fast_pool data for 
seed material. As Linus said, speed is a concern but SFC32 is faster 
than existing Tausworthe generator, and it's a drop-in replacement with 
the same state size if that makes your life easier. If you're willing to 
expand the state there are even better options (like SFC64 or some of 
chaotic generators like Jenkins' Frog).


> We're not trying to invent any stream cipher or whatever, just making
> use of a few bits that are never exposed alone as-is to internal nor
> external states, to slightly disturb another state that otherwise only
> changes once a minute so that there's no more a 100% chance of guessing
> a 16-bit port after seeing a few packets. I mean, I'm pretty sure that
> even stealing three or four bits only from there would be quite enough
> to defeat the attack given that Amit only recovers a few bits per packet.

If Amit's attack can recover that much per packet (in practice not just 
in theory) and there is even one packet per interrupt, then it's a major 
problem. There are at most 2 bits of entropy added per call to 
add_interrupt_randomness, so it you're leaking "a few bits per packet" 
then that's everything. Over 64 interrupts you've lost the 128 bits of 
entropy that the fast_pool has spilled into the input_pool.

Marc


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 22:05       ` tytso
@ 2020-08-05 23:03         ` Andy Lutomirski
  2020-08-06 17:00         ` Marc Plumb
  1 sibling, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2020-08-05 23:03 UTC (permalink / raw)
  To: tytso
  Cc: Marc Plumb, Willy Tarreau, netdev, aksecurity, torvalds,
	edumazet, Jason, luto, keescook, tglx, peterz, stable


>> On Aug 5, 2020, at 3:06 PM, tytso@mit.edu wrote:
>> 
>> On Wed, Aug 05, 2020 at 09:06:40AM -0700, Marc Plumb wrote:
>> Isn't get_random_u32 the function you wrote to do that? If this needs to be
>> cryptographically secure, that's an existing option that's safe.
>> The fundamental question is: Why is this attack on net_rand_state problem?
>> It's Working as Designed. Why is it a major enough problem to risk harming
>> cryptographically important functions?
> 
> I haven't looked at the users of net_rand_state, but historically, the
> networking subsystem has a expressed a (perceived?) need for *very* fast
> mostly-random-but-if-doens't-have-to-be-perfect-numbers-our-benchmarks-
> are-way-more-important numbers.   As in, if there are extra cache line
> misses, our benchmarks would suffer and that's not acceptable.
> 
> One of the problems here is that it's not sufficient for the average case
> to be fast, but once in every N operations, we need to do something that
> requires Real Crypto, and so that Nth time, there would be an extra lag
> and that would be the end of the world (at least as far as networking
> benchmarks are concerned, anyway).

I respectfully disagree with the supposed network people :). I’m working, slowly, on a patch set to make this genuinely fast.  

>  So in other words, it's not enough for
> the average time to run get_random_u32() to be fast, they care about the 95th or
> 99th percentile number of get_random_u32() to be fast as well.
> 
> An example of this would be for TCP sequence number generation; it's
> not *really* something that needs to be secure, and if we rekey the
> RNG every 5 minutes, so long as the best case attack takes at most,
> say, an hour, if the worst the attacker can do is to be able to carry
> out an man-in-the-middle attack without being physically in between
> the source and the destination --- well, if you *really* cared about
> security the TCP connection would be protected using TLS anyway.  See
> RFC 1948 (later updated by RFC 6528) for an argument along these
> lines.
> 
>> This whole thing is making the fundamental mistake of all amateur
>> cryptographers of trying to create your own cryptographic primitive. You're
>> trying to invent a secure stream cipher. Either don't try to make
>> net_rand_state secure, or use a known secure primitive.
> 
> Well, technically it's not supposed to be a secure cryptographic
> primitive.  net_rand_state is used in the call prandom_u32(), so the
> only supposed guarantee is PSEUDO random.
> 
> That being said, a quick "get grep prandom_u32" shows that there are a
> *huge* number of uses of prandom_u32() and whether they are all
> appropriate uses of prandom_u32(), or kernel developers are using it
> because "I haz a ne3D for spE3d" but in fact it's for a security
> critical application is a pretty terrifying question.  If we start
> seeing CVE's getting filed caused by inappropriate uses of
> prandom_u32, to be honest, it won't surprise me.
> 
>                       - Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 16:39     ` Linus Torvalds
@ 2020-08-05 23:49       ` Stephen Hemminger
  0 siblings, 0 replies; 32+ messages in thread
From: Stephen Hemminger @ 2020-08-05 23:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Marc Plumb, Willy Tarreau, Theodore Ts'o, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable

On Wed, 5 Aug 2020 09:39:17 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Aug 5, 2020 at 8:44 AM Marc Plumb <lkml.mplumb@gmail.com> wrote:
> >
> > I thought net_rand_state was assumed to be insecure and that anyone
> > could determine the internal state. Isn't this Working as Designed?  
> 
> I was working as designed - because it wasn't really designed to be
> "real crypto" - but sadly it's also the only thing that is fast enough
> for a lot of networking.
> 
> So it may be _designed_ to be "not real crypto" and to have a
> discoverable internal state. But once again, reality interferes, and
> it turns out that people really want something very very fast that is
> also not deterministic enough to be discoverable at least remotely.

If you turn on the wayback machine, the original net_random came out
of having ok values for network simulation with netem. In that use case,
the point was to just have something with good distribution and reasonably long
period.

Then the idea of TCP ports have to be random came along and that was both
a weak security feature and benchmark sensitive so the net_random got drafted
into more areas.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 22:21         ` Marc Plumb
@ 2020-08-06  6:30           ` Willy Tarreau
  2020-08-06 17:18             ` Marc Plumb
  0 siblings, 1 reply; 32+ messages in thread
From: Willy Tarreau @ 2020-08-06  6:30 UTC (permalink / raw)
  To: Marc Plumb
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote:
> There is nothing wrong with perturbing net_rand_state, the sin is doing it
> with the raw entropy that is also the seed for your CPRNG. Use the output of
> a CPRNG to perturb the pool all you want, but don't do things that bit by
> bit reveal the entropy that is being fed into the CPRNG.

This is interesting because I think some of us considered it exactly the
other way around, i.e. we're not copying exact bits but just taking a
pseudo-random part of such bits at one point in time, to serve as an
increment among other ones. And given that these bits were collected
over time from not very secret sources, they appeared to be of lower
risk than the output.

I mean, if we reimplemented something in parallel just mixing the IRQ
return pointer and TSC, some people could possibly say "is this really
strong enough?" but it wouldn't seem very shocking in terms of disclosure.
But if by doing so we ended up in accident reproducing the same contents
as the fast_pool it could be problematic.

Would you think that using only the input info used to update the
fast_pool would be cleaner ? I mean, instead of :

        fast_pool->pool[0] ^= cycles ^ j_high ^ irq;
        fast_pool->pool[1] ^= now ^ c_high;
        ip = regs ? instruction_pointer(regs) : _RET_IP_;
        fast_pool->pool[2] ^= ip;
        fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 :
                get_reg(fast_pool, regs);

we'd do:

        x0 = cycles ^ j_high ^ irq;
        x1 = now ^ c_high;
        x2 = regs ? instruction_pointer(regs) : _RET_IP_;
        x3 = (sizeof(ip) > 4) ? ip >> 32 : get_reg(fast_pool, regs);

        fast_pool->pool[0] ^= x0;
        fast_pool->pool[1] ^= x1;
        fast_pool->pool[2] ^= x2;
        fast_pool->pool[3] ^= x3;

	this_cpu_add(net_rand_state.s1, x0^x1^x2^x3);

Because that's something easy to do and providing the same level
of perturbation to net_rand_state.

> > Another approach involving the replacement of the algorithm was considered
> > but we were working with -stable in mind, trying to limit the backporting
> > difficulty (and it revealed a circular dependency nightmare that had been
> > sleeping there for years), and making the changes easier to check (which
> > is precisely what you're doing).
> 
> Really? You can replace the LFSR and only change lib/random32.c. That might
> fix the problem without the need to use the raw fast_pool data for seed
> material.

Definitely. I had been working on a variant of MSWS
(https://arxiv.org/pdf/1704.00358.pdf) that I discussed a little bit with
Amit, but I didn't publish it yet since it needs to be cleaned up and to
replace all the test patterns in random32.c. And when starting to rebase
it I figured that updating it from the interrupt handler didn't really
look like it brought anything. I mean that when you're starting to wonder
what to update "to make it better", it likely means that you don't need it.

> As Linus said, speed is a concern but SFC32 is faster than
> existing Tausworthe generator, and it's a drop-in replacement with the same
> state size if that makes your life easier. If you're willing to expand the
> state there are even better options (like SFC64 or some of chaotic
> generators like Jenkins' Frog).

I didn't know about SFC32, it looks like a variation of the large family
of xorshift generators, which is thus probably quite suitable as well
for this task. Having used xoroshiro128** myself in another project, I
found it overkill for this task compared to MSWS but I definitely agree
that any of them is more suited to the task than the current one.

> > We're not trying to invent any stream cipher or whatever, just making
> > use of a few bits that are never exposed alone as-is to internal nor
> > external states, to slightly disturb another state that otherwise only
> > changes once a minute so that there's no more a 100% chance of guessing
> > a 16-bit port after seeing a few packets. I mean, I'm pretty sure that
> > even stealing three or four bits only from there would be quite enough
> > to defeat the attack given that Amit only recovers a few bits per packet.
> 
> If Amit's attack can recover that much per packet (in practice not just in
> theory) and there is even one packet per interrupt, then it's a major
> problem. There are at most 2 bits of entropy added per call to
> add_interrupt_randomness, so it you're leaking "a few bits per packet" then
> that's everything. Over 64 interrupts you've lost the 128 bits of entropy
> that the fast_pool has spilled into the input_pool.

But how do you *figure* the value of these bits and their position in the
fast_pool ? I mean that the attack relies on net_rand_state not being
perturbed. Let's assume you'd be able to brute-force them in linear time
by drawing the list of possible candidates for 32-bit increment on each
packet and you end up with 64 sets of candidates. You'll probably have
quite a number of candidates there since you'll also need to take into
account the CPU the packet ends up on and calls to update_process_times()
on each jiffy. For each of these candidates you don't precisely know what
part of the fast_pool was used so you're at 4^64 combinations, and even
if you can guess which ones they are, they come from different generations
of the fast_pool, which are iterated over on other activities.

Don't get me wrong, I'm not trying to defend the solution or anything,
I find it more like a reasonable short term plaster to at least address
lab-style attacks on almost idle machines. But outside of the lab, when
machines are doing real work, I agree with Linus that both the attack
and the risks mentioned above fly into pieces because there are a lot
more possibilities at each step of the operations that have to be taken
into account. Now if you think that doing some small changes like proposed
above just to better protect the entropy would provide significant help,
feel free to suggest so.

Thanks,
Willy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-05 22:05       ` tytso
  2020-08-05 23:03         ` Andy Lutomirski
@ 2020-08-06 17:00         ` Marc Plumb
  1 sibling, 0 replies; 32+ messages in thread
From: Marc Plumb @ 2020-08-06 17:00 UTC (permalink / raw)
  To: tytso
  Cc: Willy Tarreau, netdev, aksecurity, torvalds, edumazet, Jason,
	luto, keescook, tglx, peterz, stable


On 2020-08-05 3:05 p.m., tytso@mit.edu wrote:
>
> Well, technically it's not supposed to be a secure cryptographic
> primitive.  net_rand_state is used in the call prandom_u32(), so the
> only supposed guarantee is PSEUDO random.
>
> That being said, a quick "get grep prandom_u32" shows that there are a
> *huge* number of uses of prandom_u32() and whether they are all
> appropriate uses of prandom_u32(), or kernel developers are using it
> because "I haz a ne3D for spE3d" but in fact it's for a security
> critical application is a pretty terrifying question.  If we start
> seeing CVE's getting filed caused by inappropriate uses of
> prandom_u32, to be honest, it won't surprise me.

The danger I'm worried about it's misuse of prandom_u32. That would mean 
one function would have weak random numbers. I'm worried about the 
disclosure of the entropy that is the basis for the good random numbers 
because that would undermine the security of the people who are using 
the right functions for their task.

Having said that, auditing all uses of prandom_u32 would be useful, but 
a different issue.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-06  6:30           ` Willy Tarreau
@ 2020-08-06 17:18             ` Marc Plumb
  2020-08-07  7:03               ` Willy Tarreau
  0 siblings, 1 reply; 32+ messages in thread
From: Marc Plumb @ 2020-08-06 17:18 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

Willy,


On 2020-08-05 11:30 p.m., Willy Tarreau wrote:
> On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote:
>> There is nothing wrong with perturbing net_rand_state, the sin is doing it
>> with the raw entropy that is also the seed for your CPRNG. Use the output of
>> a CPRNG to perturb the pool all you want, but don't do things that bit by
>> bit reveal the entropy that is being fed into the CPRNG.
> This is interesting because I think some of us considered it exactly the
> other way around, i.e. we're not copying exact bits but just taking a
> pseudo-random part of such bits at one point in time, to serve as an
> increment among other ones. And given that these bits were collected
> over time from not very secret sources, they appeared to be of lower
> risk than the output.

No. The output of a CPRNG can't be used to determine the internal state. 
The input can. The input entropy is the one thing that cannot be 
produced by a deterministic computer, so they are the crown jewels of 
this. It's much much safer to use the output.


> I mean, if we reimplemented something in parallel just mixing the IRQ
> return pointer and TSC, some people could possibly say "is this really
> strong enough?" but it wouldn't seem very shocking in terms of disclosure.
> But if by doing so we ended up in accident reproducing the same contents
> as the fast_pool it could be problematic.
>
> Would you think that using only the input info used to update the
> fast_pool would be cleaner ? I mean, instead of :
>
>          fast_pool->pool[0] ^= cycles ^ j_high ^ irq;
>          fast_pool->pool[1] ^= now ^ c_high;
>          ip = regs ? instruction_pointer(regs) : _RET_IP_;
>          fast_pool->pool[2] ^= ip;
>          fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 :
>                  get_reg(fast_pool, regs);
>
> we'd do:
>
>          x0 = cycles ^ j_high ^ irq;
>          x1 = now ^ c_high;
>          x2 = regs ? instruction_pointer(regs) : _RET_IP_;
>          x3 = (sizeof(ip) > 4) ? ip >> 32 : get_reg(fast_pool, regs);
>
>          fast_pool->pool[0] ^= x0;
>          fast_pool->pool[1] ^= x1;
>          fast_pool->pool[2] ^= x2;
>          fast_pool->pool[3] ^= x3;
>
> 	this_cpu_add(net_rand_state.s1, x0^x1^x2^x3);

No. That's just as bad. There are two major problems:

It takes the entropy and sends it to the outside world without any 
strong crypto between the seed and the output. Reversing this isn't 
trivial, but it also isn't provably difficult.

It adds small amounts of entropy at a time and exposes it to the outside 
world. No crypto can make this safe (google "catastrophic reseeding"). 
If an attacker can guess the time within 1ms, then on a 4GHz CPU that's 
only 22 bits of uncertainty, so it's possible to brute force the input. 
Any details about which part of the fast_pool are used are irrelevant 
since that's determined by that input also, so it adds no security to 
this type of brute force attack. The only other part is the initial TSC 
offset, but if that were sufficient we wouldn't need the reseeding at all.


> I didn't know about SFC32, it looks like a variation of the large family
> of xorshift generators, which is thus probably quite suitable as well
> for this task. Having used xoroshiro128** myself in another project, I
> found it overkill for this task compared to MSWS but I definitely agree
> that any of them is more suited to the task than the current one.
>
It's actually a chaotic generator (not a linear one like an xorshift 
generator), which gives it weaker period guarantees which makes it more 
difficult to reverse. With a counter added to help the period length.

I'll trust Amit that SFC32 isn't strong enough and look at other options 
-- I just thought of it as better, and faster than the existing one with 
the same state size. Maybe a larger state is needed.


Thanks,

Marc


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-06 17:18             ` Marc Plumb
@ 2020-08-07  7:03               ` Willy Tarreau
  2020-08-07 16:52                 ` Marc Plumb
  0 siblings, 1 reply; 32+ messages in thread
From: Willy Tarreau @ 2020-08-07  7:03 UTC (permalink / raw)
  To: Marc Plumb
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

On Thu, Aug 06, 2020 at 10:18:55AM -0700, Marc Plumb wrote:
> Willy,
> 
> 
> On 2020-08-05 11:30 p.m., Willy Tarreau wrote:
> > On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote:
> > > There is nothing wrong with perturbing net_rand_state, the sin is doing it
> > > with the raw entropy that is also the seed for your CPRNG. Use the output of
> > > a CPRNG to perturb the pool all you want, but don't do things that bit by
> > > bit reveal the entropy that is being fed into the CPRNG.
> > This is interesting because I think some of us considered it exactly the
> > other way around, i.e. we're not copying exact bits but just taking a
> > pseudo-random part of such bits at one point in time, to serve as an
> > increment among other ones. And given that these bits were collected
> > over time from not very secret sources, they appeared to be of lower
> > risk than the output.
> 
> No. The output of a CPRNG can't be used to determine the internal state. The
> input can. The input entropy is the one thing that cannot be produced by a
> deterministic computer, so they are the crown jewels of this. It's much much
> safer to use the output.

OK, noted.

> > I didn't know about SFC32, it looks like a variation of the large family
> > of xorshift generators, which is thus probably quite suitable as well
> > for this task. Having used xoroshiro128** myself in another project, I
> > found it overkill for this task compared to MSWS but I definitely agree
> > that any of them is more suited to the task than the current one.
> > 
> It's actually a chaotic generator (not a linear one like an xorshift
> generator), which gives it weaker period guarantees which makes it more
> difficult to reverse. With a counter added to help the period length.
> 
> I'll trust Amit that SFC32 isn't strong enough and look at other options --
> I just thought of it as better, and faster than the existing one with the
> same state size. Maybe a larger state is needed.

Just to give a heads up on this, here's what I'm having pending regarding
MSWS:

  struct rnd_state {
        uint64_t x, w;
        uint64_t seed;
        uint64_t noise;
  };

  uint32_t msws32(struct rnd_state *state)
  {
        uint64_t x;

        x  = state->w += state->seed;
        x += state->x * state->x;
        x  = state->x = (x >> 32) | (x << 32);
        x -= state->noise++;
        return x ^ (x >> 32);
  }

It passes PractRand without any warning after 1 TB of data:

  rng=RNG_stdin, seed=unknown
  length= 512 megabytes (2^29 bytes), time= 2.0 seconds
    no anomalies in 229 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 1 gigabyte (2^30 bytes), time= 4.3 seconds
    no anomalies in 248 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 2 gigabytes (2^31 bytes), time= 8.3 seconds
    no anomalies in 266 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 4 gigabytes (2^32 bytes), time= 15.8 seconds
    no anomalies in 282 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 8 gigabytes (2^33 bytes), time= 31.3 seconds
    no anomalies in 299 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 16 gigabytes (2^34 bytes), time= 61.9 seconds
    no anomalies in 315 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 32 gigabytes (2^35 bytes), time= 119 seconds
    no anomalies in 328 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 64 gigabytes (2^36 bytes), time= 242 seconds
    no anomalies in 344 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 128 gigabytes (2^37 bytes), time= 483 seconds
    no anomalies in 359 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 256 gigabytes (2^38 bytes), time= 940 seconds
    no anomalies in 372 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 512 gigabytes (2^39 bytes), time= 1906 seconds
    no anomalies in 387 test result(s)
  
  rng=RNG_stdin, seed=unknown
  length= 1 terabyte (2^40 bytes), time= 3826 seconds
    no anomalies in 401 test result(s)

The two modifications compared to the original msws are:

  - mix bits on output so that we don't reveal the internal
    state upon each call ;

  - combination of the output with an independent noise
    variable whose purpose was to be updated upon IRQ
    and/or CPU usage and/or invocations. But on this point,
    while implementing it I figured that updating it on each
    invocation did already provide the frequent updates we
    were missing in Tausworthe that required the interrupt
    updates. I'd definitely update in update_process_times()
    so that it's not reduced to a pure counter, but the
    results, speed and simplicity look encouraging.

I'll try to work on finishing the patch proposal this week-end.

Regards,
Willy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07  7:03               ` Willy Tarreau
@ 2020-08-07 16:52                 ` Marc Plumb
  2020-08-07 17:43                   ` Willy Tarreau
  0 siblings, 1 reply; 32+ messages in thread
From: Marc Plumb @ 2020-08-07 16:52 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

On 2020-08-07 12:03 a.m., Willy Tarreau wrote:

> Just to give a heads up on this, here's what I'm having pending regarding
> MSWS:
>
>    struct rnd_state {
>          uint64_t x, w;
>          uint64_t seed;
>          uint64_t noise;
>    };
>
>    uint32_t msws32(struct rnd_state *state)
>    {
>          uint64_t x;
>
>          x  = state->w += state->seed;
>          x += state->x * state->x;
>          x  = state->x = (x >> 32) | (x << 32);
>          x -= state->noise++;
>          return x ^ (x >> 32);
>    }

A few comments:

This is still another non-cryptographic PRNG. An LFSR can pass PractRand 
(if you do a tweak to get around the specific linear complexity test for 
LFSRs).

On a 64-bit machine it should be fast: 4 adds, 1 multiply, 1 rotate, 1 
shift, 1 xor

This will be much slower on 32-bit machines, if that's still a concern

As long as the noise is the output of a CPRNG, this doesn't hurt the 
security of dev/dandom.

The noise is more like effective 32-bits since you're xoring the low and 
high half of the noise together (ignoring the minor details of carry 
bits). Which means that it's 2^32 effort to brute force this (which Amit 
called "no biggie for modern machines"). If the noise is the raw sample 
data with only a few bits of entropy, then it's even easier to brute force.


Given the uses of this, I think we really should look into a CPRNG for 
this and then completely reseed it periodically. The problem is finding 
one that's fast enough. Is there a hard instruction budget for this, or 
it is just "fast enough to not hurt the network benchmarks" (i.e. if 
Dave Miller screams)?


Marc


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 16:52                 ` Marc Plumb
@ 2020-08-07 17:43                   ` Willy Tarreau
       [not found]                     ` <C74EC3BC-F892-416F-A95C-4ACFC96EEECE@amacapital.net>
  2020-08-07 19:59                     ` Marc Plumb
  0 siblings, 2 replies; 32+ messages in thread
From: Willy Tarreau @ 2020-08-07 17:43 UTC (permalink / raw)
  To: Marc Plumb
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

On Fri, Aug 07, 2020 at 09:52:14AM -0700, Marc Plumb wrote:
> On 2020-08-07 12:03 a.m., Willy Tarreau wrote:
> 
> > Just to give a heads up on this, here's what I'm having pending regarding
> > MSWS:
> > 
> >    struct rnd_state {
> >          uint64_t x, w;
> >          uint64_t seed;
> >          uint64_t noise;
> >    };
> > 
> >    uint32_t msws32(struct rnd_state *state)
> >    {
> >          uint64_t x;
> > 
> >          x  = state->w += state->seed;
> >          x += state->x * state->x;
> >          x  = state->x = (x >> 32) | (x << 32);
> >          x -= state->noise++;
> >          return x ^ (x >> 32);
> >    }
> 
> A few comments:
> 
> This is still another non-cryptographic PRNG.

Absolutely. During some discussions regarding the possibility of using
CSPRNGs, orders around hundreds of CPU cycles were mentioned for them,
which can definitely be a huge waste of precious resources for some
workloads, possibly causing the addition of a few percent extra machines
in certain environments just to keep the average load under a certain
threshold. And the worst there is that such workloads would exactly be
the ones absolutely not affected by the theorical attacks regarding
predictability :-/

> An LFSR can pass PractRand (if
> you do a tweak to get around the specific linear complexity test for LFSRs).

OK.

> On a 64-bit machine it should be fast: 4 adds, 1 multiply, 1 rotate, 1
> shift, 1 xor
> 
> This will be much slower on 32-bit machines, if that's still a concern

Yep, that's something I'm totally aware of and one reason I wanted to
have a public discussion about this. My personal view on this is that
a 64-bit multiply will always be cheaper than a crypto operation and
that the environments where picking a source port or accepting a
connection matters that much are not those running on such low-end
machines, so that's very likely an acceptable tradeoff.

> As long as the noise is the output of a CPRNG, this doesn't hurt the
> security of dev/dandom.
> 
> The noise is more like effective 32-bits since you're xoring the low and
> high half of the noise together (ignoring the minor details of carry bits).

Definitely. The purpose of this noise in fact is more to complicate the
reconstruction of the internal state in case a large number of samples
is used, precisely due to these carry bits that propagate solely depending
on the noise value itself, and the uncertainty about when it changes and
from what to what.

> Which means that it's 2^32 effort to brute force this (which Amit called "no
> biggie for modern machines"). If the noise is the raw sample data with only
> a few bits of entropy, then it's even easier to brute force.

Don't you forget to multiply by another 2^32 for X being folded onto itself ?
Because you have 2^32 possible values of X which will give you a single 32-bit
output value for a given noise value.

> Given the uses of this, I think we really should look into a CPRNG for this
> and then completely reseed it periodically. The problem is finding one
> that's fast enough.

That was precisely the problem that drove me to propose something like
this. And all these cycles wasted everywhere just to protect a 16-bit
source port on a mostly idle machine seem quite overkill to me.

> Is there a hard instruction budget for this, or it is
> just "fast enough to not hurt the network benchmarks" (i.e. if Dave Miller
> screams)?

It's not just Davem. I too am concerned about wasting CPU cycles in fast
path especially in the network code. A few half-percent gains are hardly
won once in a while in this area and in some infrastructures they matter.
Not much but they do.

Willy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
       [not found]                     ` <C74EC3BC-F892-416F-A95C-4ACFC96EEECE@amacapital.net>
@ 2020-08-07 18:04                       ` Willy Tarreau
  2020-08-07 18:10                       ` Linus Torvalds
  1 sibling, 0 replies; 32+ messages in thread
From: Willy Tarreau @ 2020-08-07 18:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Marc Plumb, tytso, netdev, aksecurity, torvalds, edumazet, Jason,
	luto, keescook, tglx, peterz, stable

Hi Andy,

On Fri, Aug 07, 2020 at 10:55:11AM -0700, Andy Lutomirski wrote:
> >> This is still another non-cryptographic PRNG.
> > 
> > Absolutely. During some discussions regarding the possibility of using
> > CSPRNGs, orders around hundreds of CPU cycles were mentioned for them,
> > which can definitely be a huge waste of precious resources for some
> > workloads, possibly causing the addition of a few percent extra machines
> > in certain environments just to keep the average load under a certain
> > threshold.
> 
> I think the real random.c can run plenty fast. It's ChaCha20 plus ludicrous
> overhead right now. I'm working (slowly) on making the overhead go away.  I'm
> hoping to have something testable in a few days.  As it stands, there is a
> ton of indirection, a pile of locks, multiple time comparisons, per-node and
> percpu buffers (why both?), wasted bits due to alignment, and probably other
> things that can be cleaned up.  I'm trying to come up with something that is
> fast and has easy-to-understand semantics.
> 
> You can follow along at:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=random/fast

Thanks, we'll see. I developed a quick test tool that's meant to be easy
to use to measure the performance impact on connect/accept. I have not
yet run it on a modified PRNG to verify if it works. I'll send it once
I've tested. I'd definitely would like to see no measurable performance
drop, and ideally even a small performance increase (as Tausworthe isn't
the lightest thing around either so we do have some little margin).

Willy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
       [not found]                     ` <C74EC3BC-F892-416F-A95C-4ACFC96EEECE@amacapital.net>
  2020-08-07 18:04                       ` Willy Tarreau
@ 2020-08-07 18:10                       ` Linus Torvalds
  2020-08-07 19:08                         ` Andy Lutomirski
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2020-08-07 18:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willy Tarreau, Marc Plumb, Theodore Ts'o, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable

On Fri, Aug 7, 2020 at 10:55 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> I think the real random.c can run plenty fast. It’s ChaCha20 plus ludicrous overhead right now.

I doubt it.

I tried something very much like that in user space to just see how
many cycles it ended up being.

I made a "just raw ChaCha20", and it was already much too slow for
what some of the networking people claim to want.

And maybe they are asking for too much, but if they think it's too
slow, they'll not use it, and then we're back to square one.

Now, what *might* be acceptable is to not do ChaCha20, but simply do a
single double-round of it.

So after doing 10 prandom_u32() calls, you'd have done a full
ChaCha20. I didn't actually try that, but from looking at the costs
from trying the full thing, I think it might be in the right ballpark.

How does that sound to people?

                 Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 18:10                       ` Linus Torvalds
@ 2020-08-07 19:08                         ` Andy Lutomirski
  2020-08-07 19:21                           ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2020-08-07 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, Marc Plumb, Theodore Ts'o, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable



> On Aug 7, 2020, at 11:10 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Fri, Aug 7, 2020 at 10:55 AM Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>> I think the real random.c can run plenty fast. It’s ChaCha20 plus ludicrous overhead right now.
> 
> I doubt it.
> 
> I tried something very much like that in user space to just see how
> many cycles it ended up being.
> 
> I made a "just raw ChaCha20", and it was already much too slow for
> what some of the networking people claim to want.

Do you remember the numbers?

Certainly a full ChaCha20 per random number is too much, but AFAICT the network folks want 16 or 32 bits at a time, which is 1/16 or 1/8 of a ChaCha20. DJB claims 4 cycles per byte on Core 2, and it had better be faster now, although we can’t usefully use XMM regs, so I don’t know the real timings.

But with the current code, the actual crypto will be lost in the noise.  That’s what I’m trying to fix.
> 
> Now, what *might* be acceptable is to not do ChaCha20, but simply do a
> single double-round of it.

We can certainly have a parallel RNG seeded by the main RNG that runs fewer rounds. I’ll do that if benchmarks say I’m still too slow.

All of this is trivial except the locking. If I’m writing this code, I personally refuse to use  the “races just make it more random” strategy. I’m going to do it without data races, and this will take a bit of work.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 19:08                         ` Andy Lutomirski
@ 2020-08-07 19:21                           ` Linus Torvalds
  2020-08-07 19:33                             ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2020-08-07 19:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willy Tarreau, Marc Plumb, Theodore Ts'o, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable

On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> > On Aug 7, 2020, at 11:10 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> >
> > I tried something very much like that in user space to just see how
> > many cycles it ended up being.
> >
> > I made a "just raw ChaCha20", and it was already much too slow for
> > what some of the networking people claim to want.
>
> Do you remember the numbers?

Sorry, no. I wrote a hacky thing in user space, and threw it away.

> Certainly a full ChaCha20 per random number is too much, but AFAICT the network folks want 16 or 32 bits at a time, which is 1/16 or 1/8 of a ChaCha20.

That's what I did (well, I did just the 32-bit one), basically
emulating percpu accesses for incrementing the offset (I didn't
actually *do* percpu accesses, I just did a single-threaded run and
used globals, but wrote it with wrappers so that it would look like it
might work).

> DJB claims 4 cycles per byte on Core 2

I took the reference C implementation as-is, and just compiled it with
O2, so my numbers may not be what some heavily optimized case does.

But it was way more than that, even when amortizing for "only need to
do it every 8 cases". I think the 4 cycles/byte might be some "zero
branch mispredicts" case when you've fully unrolled the thing, but
then you'll be taking I$ misses out of the wazoo, since by definition
this won't be in your L1 I$ at all (only called every 8 times).

Sure, it might look ok on microbenchmarks where it does stay hot the
cache all the time, but that's not realistic. I

               Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 19:21                           ` Linus Torvalds
@ 2020-08-07 19:33                             ` Andy Lutomirski
  2020-08-07 19:56                               ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2020-08-07 19:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, Marc Plumb, Theodore Ts'o, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable



> On Aug 7, 2020, at 12:21 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> 4 cycles per byte on Core 2
> 
> I took the reference C implementation as-is, and just compiled it with
> O2, so my numbers may not be what some heavily optimized case does.
> 
> But it was way more than that, even when amortizing for "only need to
> do it every 8 cases". I think the 4 cycles/byte might be some "zero
> branch mispredicts" case when you've fully unrolled the thing, but
> then you'll be taking I$ misses out of the wazoo, since by definition
> this won't be in your L1 I$ at all (only called every 8 times).
> 
> Sure, it might look ok on microbenchmarks where it does stay hot the
> cache all the time, but that's not realistic. I

No one said we have to do only one ChaCha20 block per slow path hit.  In fact, the more we reduce the number of rounds, the more time we spend on I$ misses, branch mispredictions, etc, so reducing rounds may be barking up the wrong tree entirely.  We probably don’t want to have more than one page 

I wonder if AES-NI adds any value here.  AES-CTR is almost a drop-in replacement for ChaCha20, and maybe the performance for a cache-cold short run is better.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 19:33                             ` Andy Lutomirski
@ 2020-08-07 19:56                               ` Linus Torvalds
  2020-08-07 20:16                                 ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2020-08-07 19:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willy Tarreau, Marc Plumb, Theodore Ts'o, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable

On Fri, Aug 7, 2020 at 12:33 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> No one said we have to do only one ChaCha20 block per slow path hit.

Sure, doing more might be better for amortizing the cost.

But you have to be very careful about latency spikes. I would be
*really* nervous about doing a whole page at a time, when this is
called from routines that literally expect it to be less than 50
cycles.

So I would seriously suggest you look at a much smaller buffer. Maybe
not a single block, but definitely not multiple kB either.

Maybe something like 2 cachelines might be ok, but there's a reason
the current code only works with 16 bytes (or whatever) and only does
simple operations with no looping.

That's why I think you might look at a single double-round ChaCha20
instead. Maybe do it for two blocks - by the time you wrap around,
you'll have done more than a full ChaCaa20.

That would imnsho *much* better than doing some big block, and have
huge latency spikes and flush a large portion of your L1 when they
happen. Nasty nasty behavior.

I really think the whole "we can amortize it with bigger blocks" is
complete and utter garbage. It's classic "benchmarketing" crap.

             Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 17:43                   ` Willy Tarreau
       [not found]                     ` <C74EC3BC-F892-416F-A95C-4ACFC96EEECE@amacapital.net>
@ 2020-08-07 19:59                     ` Marc Plumb
  2020-08-07 22:19                       ` Willy Tarreau
  1 sibling, 1 reply; 32+ messages in thread
From: Marc Plumb @ 2020-08-07 19:59 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable


On 2020-08-07 10:43 a.m., Willy Tarreau wrote:
>
>> Which means that it's 2^32 effort to brute force this (which Amit called "no
>> biggie for modern machines"). If the noise is the raw sample data with only
>> a few bits of entropy, then it's even easier to brute force.
> Don't you forget to multiply by another 2^32 for X being folded onto itself ?
> Because you have 2^32 possible values of X which will give you a single 32-bit
> output value for a given noise value.

If I can figure the state out once, then the only new input is the 
noise, so that's the only part I have to brute force. Throwing the noise 
in makes it more difficult to get that state once, but once I have it 
then this type of reseeding doesn't help.


>> Is there a hard instruction budget for this, or it is
>> just "fast enough to not hurt the network benchmarks" (i.e. if Dave Miller
>> screams)?
> It's not just Davem. I too am concerned about wasting CPU cycles in fast
> path especially in the network code. A few half-percent gains are hardly
> won once in a while in this area and in some infrastructures they matter.
> Not much but they do.

That's why I was asking. I don't have the same experience as you for 
what acceptable is. I think it might be possible to do a decent CPRNG 
(that's at least had some cryptanalys of it) with ~20 instructions per 
word, but if that's not fast enough then I'll think about other options.

Marc

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 19:56                               ` Linus Torvalds
@ 2020-08-07 20:16                                 ` Andy Lutomirski
  2020-08-07 20:24                                   ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2020-08-07 20:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, Marc Plumb, Theodore Ts'o, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable



> On Aug 7, 2020, at 12:57 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Fri, Aug 7, 2020 at 12:33 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>> No one said we have to do only one ChaCha20 block per slow path hit.
> 
> Sure, doing more might be better for amortizing the cost.
> 
> But you have to be very careful about latency spikes. I would be
> *really* nervous about doing a whole page at a time, when this is
> called from routines that literally expect it to be less than 50
> cycles.
> 
> So I would seriously suggest you look at a much smaller buffer. Maybe
> not a single block, but definitely not multiple kB either.
> 
> Maybe something like 2 cachelines might be ok, but there's a reason
> the current code only works with 16 bytes (or whatever) and only does
> simple operations with no looping.
> 
> That's why I think you might look at a single double-round ChaCha20
> instead. Maybe do it for two blocks - by the time you wrap around,
> you'll have done more than a full ChaCaa20.
> 
> That would imnsho *much* better than doing some big block, and have
> huge latency spikes and flush a large portion of your L1 when they
> happen. Nasty nasty behavior.
> 
> I really think the whole "we can amortize it with bigger blocks" is
> complete and utter garbage. It's classic "benchmarketing" crap.
> 

I think this will come down to actual measurements :). If the cost of one block of cache-cold ChaCha20 is 100 cycles of actual computation and 200 cycles of various cache misses, then let’s do more than one block.

I’ll get something working and we’ll see. 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 20:16                                 ` Andy Lutomirski
@ 2020-08-07 20:24                                   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2020-08-07 20:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willy Tarreau, Marc Plumb, Theodore Ts'o, Netdev, Amit Klein,
	Eric Dumazet, Jason A. Donenfeld, Andrew Lutomirski, Kees Cook,
	Thomas Gleixner, Peter Zijlstra, stable

On Fri, Aug 7, 2020 at 1:16 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> I think this will come down to actual measurements :).

Numbers talk, bullshit walks.

But:

> If the cost of one block of cache-cold ChaCha20 is 100 cycles of actual computation and 200 cycles of various cache misses, then let’s do more than one block.

That's *completely* nonsensical thinking and crazy talk.

If the issue is latency (and for a lot of networking, that literally
*is* the issue), your mental model is completely insane.

"Oh, it's so expensive that we should queue *more*" is classic
throughput thinking, and it's wrong.

If you have performance problems, you should look really really hard
at fixing latency.

Because fixing latency helps throughput. But fixing throughput _hurts_ latency.

It really is that simple. Latency is the much more important thing to optimize.

Nobody cares about "bulk TCP sequence numbers". Sure, you'll find
benchmarks for it, because it's a lot easier to benchmark throughput.
But what people _care_ about is generally latency.

              Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 19:59                     ` Marc Plumb
@ 2020-08-07 22:19                       ` Willy Tarreau
  2020-08-07 22:45                         ` Marc Plumb
  0 siblings, 1 reply; 32+ messages in thread
From: Willy Tarreau @ 2020-08-07 22:19 UTC (permalink / raw)
  To: Marc Plumb
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote:
> 
> On 2020-08-07 10:43 a.m., Willy Tarreau wrote:
> > 
> > > Which means that it's 2^32 effort to brute force this (which Amit called "no
> > > biggie for modern machines"). If the noise is the raw sample data with only
> > > a few bits of entropy, then it's even easier to brute force.
> > Don't you forget to multiply by another 2^32 for X being folded onto itself ?
> > Because you have 2^32 possible values of X which will give you a single 32-bit
> > output value for a given noise value.
> 
> If I can figure the state out once,

Yes but how do you take that as granted ? This state doesn't appear
without its noise counterpart, so taking as a prerequisite that you may
guess one separately obviously indicates that you then just have to
deduce the other, but the point of mixing precisely is that we do not
expose individual parts.

This way of thinking is often what results in extreme solutions to be
designed, which are far away from the reality of the field of application,
and result in unacceptable costs that make people turn to other solutions.
Do you think it makes me happy to see people waste their time reimplementing
alternate userland TCP stacks that are supposedly "way faster" by getting
rid of all the useless (to them) stuff that was forced on them at the cost
of performance ? And it makes me even less happy when they ask me why I'm
not spending more time trying to adopt them. The reality is that this time
could be better spent optimizing our stack to be sure that costs are added
where they are relevant, and not just to make sure that when we align 7
conditions starting with "imagine that I could guess that", the 8th could
be guessed as well, except that none of these can really be guessed outside
of a lab :-/

> then the only new input is the noise, so
> that's the only part I have to brute force. Throwing the noise in makes it
> more difficult to get that state once, but once I have it then this type of
> reseeding doesn't help.

> I think it might be possible to do a decent CPRNG (that's at
> least had some cryptanalys of it) with ~20 instructions per word, but if
> that's not fast enough then I'll think about other options.

I think that around 20 instructions for a hash would definitely be nice
(but please be aware that we're speaking about RISC-like instructions,
not SIMD instructions). And also please be careful not to count only
with amortized performance that's only good to show nice openssl
benchmarks, because if that's 1280 instructions for 256 bits that
result in 20 instructions per 32-bit word, it's not the same anymore
at all!

Regards,
Willy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 22:19                       ` Willy Tarreau
@ 2020-08-07 22:45                         ` Marc Plumb
  2020-08-07 23:11                           ` Willy Tarreau
  0 siblings, 1 reply; 32+ messages in thread
From: Marc Plumb @ 2020-08-07 22:45 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

Willy,


On 2020-08-07 3:19 p.m., Willy Tarreau wrote:
> On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote:
>>
>> If I can figure the state out once,
> Yes but how do you take that as granted ? This state doesn't appear
> without its noise counterpart,

Amit has shown attacks that can deduce the full internal state from 4-5 
packets with a weak PRNG. If the noise is added less often than that, an 
attacker can figure out the entire state at which point the partial 
reseeding doesn't help. If the noise is added more often than that, and 
it's raw timing events, then it's only adding a few bits of entropy so 
its easy to guess (and it weakens dev/random). If the noise is added 
more often, and it's from the output of a CPRNG, then we have all the 
performance/latency problems from the CPRNG itself, so we might as well 
use it directly.

>> I think it might be possible to do a decent CPRNG (that's at
>> least had some cryptanalys of it) with ~20 instructions per word, but if
>> that's not fast enough then I'll think about other options.
> I think that around 20 instructions for a hash would definitely be nice
> (but please be aware that we're speaking about RISC-like instructions,
> not SIMD instructions). And also please be careful not to count only
> with amortized performance that's only good to show nice openssl
> benchmarks, because if that's 1280 instructions for 256 bits that
> result in 20 instructions per 32-bit word, it's not the same anymore
> at all!

Understood.

Marc

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Flaw in "random32: update the net random state on interrupt and activity"
  2020-08-07 22:45                         ` Marc Plumb
@ 2020-08-07 23:11                           ` Willy Tarreau
  0 siblings, 0 replies; 32+ messages in thread
From: Willy Tarreau @ 2020-08-07 23:11 UTC (permalink / raw)
  To: Marc Plumb
  Cc: tytso, netdev, aksecurity, torvalds, edumazet, Jason, luto,
	keescook, tglx, peterz, stable

On Fri, Aug 07, 2020 at 03:45:48PM -0700, Marc Plumb wrote:
> Willy,
> 
> 
> On 2020-08-07 3:19 p.m., Willy Tarreau wrote:
> > On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote:
> > > 
> > > If I can figure the state out once,
> > Yes but how do you take that as granted ? This state doesn't appear
> > without its noise counterpart,
> 
> Amit has shown attacks that can deduce the full internal state from 4-5
> packets with a weak PRNG. If the noise is added less often than that, an
> attacker can figure out the entire state at which point the partial
> reseeding doesn't help. If the noise is added more often than that, and it's
> raw timing events, then it's only adding a few bits of entropy so its easy
> to guess (and it weakens dev/random). If the noise is added more often, and
> it's from the output of a CPRNG, then we have all the performance/latency
> problems from the CPRNG itself, so we might as well use it directly.

His attacks is based on the fact that internal state bits leak as-is.
So it is possible to collect them and perform the inverse operation to
reconstruct the input.

Here the output is made by mixing two input bits with two noise bits
and produce one single output bit. So for each output bit you see,
you don't have just one possible input bit, but 16 possibilities.
That's 128 bits of internal changing states that once combined result
in 32 bits. For each 32-bit output you still have 2^96 possible
internal (x,noise) states producing it. And that's without counting
on the 64-bit seed that you still don't know but can be deduced from
two guessed 128 bit states (assuming that can be brute-forced at all,
of course).

For sure there are plenty of number theories allowing you to
significantly reduce the space you need to work on to brute-force
this but it will definitely remain above 2^32 attempts for each
iteration, which is the floor you have without the noise part,
while the whole internal state will be reseeded every minute
anyway.

I mean, I'm not trying to sell this thing, I'm just trying to defend
that we use a reasonable tool for a reasonable protection level. And
yes, probably that 15 years from now, someone will say "hey look, I
can brute-force this thing in less than a minute on my 1024-core 39th
gen core i7 machine with 2^60 operations per second, why don't we use
our CPU's native 10 picosecond AES1024 instruction instead ?" and we'll
say "well, it's an old story and it's probably about time to change it
again".

I don't see anything wrong with evolving this way, matching concrete
needs more than pure theory.

Regards,
Willy

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2020-08-07 23:11 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <9f74230f-ba4d-2e19-5751-79dc2ab59877@gmail.com>
2020-08-05  0:57 ` Flaw in "random32: update the net random state on interrupt and activity" Marc Plumb
2020-08-05  1:02 ` Linus Torvalds
2020-08-05  2:49 ` Willy Tarreau
2020-08-05 15:34   ` tytso
2020-08-05 16:06     ` Marc Plumb
2020-08-05 19:38       ` Willy Tarreau
2020-08-05 22:21         ` Marc Plumb
2020-08-06  6:30           ` Willy Tarreau
2020-08-06 17:18             ` Marc Plumb
2020-08-07  7:03               ` Willy Tarreau
2020-08-07 16:52                 ` Marc Plumb
2020-08-07 17:43                   ` Willy Tarreau
     [not found]                     ` <C74EC3BC-F892-416F-A95C-4ACFC96EEECE@amacapital.net>
2020-08-07 18:04                       ` Willy Tarreau
2020-08-07 18:10                       ` Linus Torvalds
2020-08-07 19:08                         ` Andy Lutomirski
2020-08-07 19:21                           ` Linus Torvalds
2020-08-07 19:33                             ` Andy Lutomirski
2020-08-07 19:56                               ` Linus Torvalds
2020-08-07 20:16                                 ` Andy Lutomirski
2020-08-07 20:24                                   ` Linus Torvalds
2020-08-07 19:59                     ` Marc Plumb
2020-08-07 22:19                       ` Willy Tarreau
2020-08-07 22:45                         ` Marc Plumb
2020-08-07 23:11                           ` Willy Tarreau
2020-08-05 22:05       ` tytso
2020-08-05 23:03         ` Andy Lutomirski
2020-08-06 17:00         ` Marc Plumb
2020-08-05 16:24     ` Jason A. Donenfeld
2020-08-05 16:53     ` Willy Tarreau
2020-08-05 15:44   ` Marc Plumb
2020-08-05 16:39     ` Linus Torvalds
2020-08-05 23:49       ` Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).