George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
@ 2016-12-22  2:07 Andy Lutomirski
  2016-12-22  5:01 ` George Spelvin
  2016-12-22 16:09 ` Andy Lutomirski
  0 siblings, 2 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-12-22  2:07 UTC (permalink / raw)
  To: George Spelvin
  Cc: Ted Ts'o, Andi Kleen, David S. Miller, David Laight,
	D. J. Bernstein, Eric Biggers, Eric Dumazet,
	Hannes Frederic Sowa, Jason A. Donenfeld, Jean-Philippe Aumasson,
	kernel-hardening, Linux Crypto Mailing List, linux-kernel,
	Network Development, Tom Herbert, Linus Torvalds, Vegard Nossum

On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin
<linux@sciencehorizons.net> wrote:
> As a separate message, to disentangle the threads, I'd like to
> talk about get_random_long().
>
> After some thinking, I still like the "state-preserving" construct
> that's equivalent to the current MD5 code.  Yes, we could just do
> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
> preserve a bit more.
>
> It requires library support from the SipHash code to return the full
> SipHash state, but I hope that's a fair thing to ask for.

I don't even think it needs that.  This is just adding a
non-destructive final operation, right?

>
> Here's my current straw man design for comment.  It's very similar to
> the current MD5-based design, but feeds all the seed material in the
> "correct" way, as opposed to Xring directly into the MD5 state.
>
> * Each CPU has a (Half)SipHash state vector,
>   "unsigned long get_random_int_hash[4]".  Unlike the current
>   MD5 code, we take care to initialize it to an asymmetric state.
>
> * There's a global 256-bit random_int_secret (which we could
>   reseed periodically).
>
> To generate a random number:
> * If get_random_int_hash is all-zero, seed it with fresh a half-sized
>   SipHash key and the appropriate XOR constants.
> * Generate three words of random_get_entropy(), jiffies, and current->pid.
>   (This is arbitary seed material, copied from the current code.)
> * Crank through that with (Half)SipHash-1-0.
> * Crank through the random_int_secret with (Half)SipHash-1-0.
> * Return v1 ^ v3.

Just to clarify, if we replace SipHash with a black box, I think this
effectively means, where "entropy" is random_get_entropy() || jiffies
|| current->pid:

The first call returns H(random seed || entropy_0 || secret).  The
second call returns H(random seed || entropy_0 || secret || entropy_1
|| secret).  Etc.

If not, then I have a fairly strong preference to keep whatever
construction we come up with consistent with something that could
actually happen with invocations of unmodified SipHash -- then all the
security analysis on SipHash goes through.

Anyway, I have mixed thoughts about the construction.  It manages to
have a wide state at essentially no cost, which buys us quite a bit of
work factor to break it.  Even with full knowledge of the state, an
output doesn't reveal the entropy except to the extent that it can be
brute-force (this is just whatever the appropriate extended version of
first preimage resistance gives us).  The one thing I don't like is
that I don't see how to prove that you can't run it backwards if you
manage to acquire a memory dump.  In fact, I that that there exist, at
least in theory, hash functions that are secure in the random oracle
model but that *can* be run backwards given the full state.  From
memory, SHA-3 has exactly that property, and it would be a bit sad for
a CSPRNG to be reversible.

We could also periodically mix in a big (128-bit?) chunk of fresh
urandom output to keep the bad guys guessing.

(P.S.  This kind of resembles the duplex sponge construction.  If
hardware SHA-3 ever shows up, a duplex sponge RNG might nice indeed.)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-22  2:07 George's crazy full state idea (Re: HalfSipHash Acceptable Usage) Andy Lutomirski
@ 2016-12-22  5:01 ` George Spelvin
  2016-12-22  5:42   ` Andy Lutomirski
  2016-12-22 16:09 ` Andy Lutomirski
  1 sibling, 1 reply; 19+ messages in thread
From: George Spelvin @ 2016-12-22  5:01 UTC (permalink / raw)
  To: linux, luto
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, netdev, tom, torvalds, tytso, vegard.nossum

Andy Lutomirski wrote:
> I don't even think it needs that.  This is just adding a
> non-destructive final operation, right?

It is, but the problem is that SipHash is intended for *small* inputs,
so the standard implementations aren't broken into init/update/final
functions.

There's just one big function that keeps the state variables in
registers and never stores them anywhere.

If we *had* init/update/final functions, then it would be trivial.

> Just to clarify, if we replace SipHash with a black box, I think this
> effectively means, where "entropy" is random_get_entropy() || jiffies
> || current->pid:

> The first call returns H(random seed || entropy_0 || secret).  The
> second call returns H(random seed || entropy_0 || secret || entropy_1
> || secret).  Etc.

Basically, yes.  I was skipping the padding byte and keying the
finalization rounds on the grounds of "can't hurt and might help",
but we could do it a more standard way.

> If not, then I have a fairly strong preference to keep whatever
> construction we come up with consistent with something that could
> actually happen with invocations of unmodified SipHash -- then all the
> security analysis on SipHash goes through.

Okay.  I don't think it makes a difference, but it's not a *big* waste
of time.  If we have finalization rounds, we can reduce the secret
to 128 bits.

If we include the padding byte, we can do one of two things:
1) Make the secret 184 bits, to fill up the final partial word as
   much as possible, or
2) Make the entropy 1 byte smaller and conceptually misalign the
   secret.  What we'd actually do is remove the last byte of
   the secret and include it in the entropy words, but that's
   just a rotation of the secret between storage and hashing.

Also, I assume you'd like SipHash-2-4, since you want to rely
on a security analysis.

(Regarding the padding byte, getting it right might be annoying
to do exactly.  All of the security analysis depends *only* on
its low 3 bits indicating how much of the final block is used.
As it says in the SipHash paper, they included 8 bits just because
it was easy.  But if you want it exact, it's just one more byte of
state.)

> The one thing I don't like is
> that I don't see how to prove that you can't run it backwards if you
> manage to acquire a memory dump.  In fact, I that that there exist, at
> least in theory, hash functions that are secure in the random oracle
> model but that *can* be run backwards given the full state.  From
> memory, SHA-3 has exactly that property, and it would be a bit sad for
> a CSPRNG to be reversible.

Er...  get_random_int() is specifically *not* designed to be resistant
to state capture, and I didn't try.  Remember, what it's used for
is ASLR, what we're worried about is somene learning the layouts
of still-running processes, and and if you get a memory dump, you have
the memory layout!

If you want anti-backtracking, though, it's easy to add.  What we
hash is:

entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...

You mix the output word right back in to the (unfinalized) state after
generating it.  This is still equivalent to unmodified back-box SipHash,
you're just using a (conceptually independent) SipHash invocation to
produce some of its input.

Each output is produced by copying the state, padding & finalizing after the
secret.

In fact, to make our lives easier, let's define the secret to end with
a counter byte that happens to be equal to the padding byte.  The input
stream will be:

Previous output: 8 (or 4 for HalfSipHash) bytes
Entropy: 15 bytes (8 bytes timer, 4 bytes jiffies, 3 bytes pid)
Secret: 16 bytes
Counter: 1 byte
...repeat...

> We could also periodically mix in a big (128-bit?) chunk of fresh
> urandom output to keep the bad guys guessing.

Simpler and faster to just update the global master secret.
The state is per-CPU, so mixing in has to be repeated per CPU.

With these changes, I'm satisifed that it's secure, cheap, has a
sufficiently wide state size, *and* all standard SipHash analysis applies.

The only remaining issues are:
1) How many rounds, and
2) May we use HalfSipHash?

I'd *like* to persuade you that skipping the padding byte wouldn't
invalidate any security proofs, because it's true and would simplify
the code.  But if you want 100% stock, I'm willing to cater to that.

Ted, what do you think?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-22  5:01 ` George Spelvin
@ 2016-12-22  5:42   ` Andy Lutomirski
  2016-12-22  8:02     ` George Spelvin
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-12-22  5:42 UTC (permalink / raw)
  To: George Spelvin
  Cc: Andrew Lutomirski, Andi Kleen, David S. Miller, David Laight,
	D. J. Bernstein, Eric Biggers, Eric Dumazet,
	Hannes Frederic Sowa, Jason A. Donenfeld, Jean-Philippe Aumasson,
	kernel-hardening, Linux Crypto Mailing List, linux-kernel,
	Network Development, Tom Herbert, Linus Torvalds, Ted Ts'o,
	Vegard Nossum

On Wed, Dec 21, 2016 at 9:01 PM, George Spelvin
<linux@sciencehorizons.net> wrote:
> Andy Lutomirski wrote:
>> I don't even think it needs that.  This is just adding a
>> non-destructive final operation, right?
>
> It is, but the problem is that SipHash is intended for *small* inputs,
> so the standard implementations aren't broken into init/update/final
> functions.
>
> There's just one big function that keeps the state variables in
> registers and never stores them anywhere.
>
> If we *had* init/update/final functions, then it would be trivial.
>
>> Just to clarify, if we replace SipHash with a black box, I think this
>> effectively means, where "entropy" is random_get_entropy() || jiffies
>> || current->pid:
>
>> The first call returns H(random seed || entropy_0 || secret).  The
>> second call returns H(random seed || entropy_0 || secret || entropy_1
>> || secret).  Etc.
>
> Basically, yes.  I was skipping the padding byte and keying the
> finalization rounds on the grounds of "can't hurt and might help",
> but we could do it a more standard way.
>
>> If not, then I have a fairly strong preference to keep whatever
>> construction we come up with consistent with something that could
>> actually happen with invocations of unmodified SipHash -- then all the
>> security analysis on SipHash goes through.
>
> Okay.  I don't think it makes a difference, but it's not a *big* waste
> of time.  If we have finalization rounds, we can reduce the secret
> to 128 bits.
>
> If we include the padding byte, we can do one of two things:
> 1) Make the secret 184 bits, to fill up the final partial word as
>    much as possible, or
> 2) Make the entropy 1 byte smaller and conceptually misalign the
>    secret.  What we'd actually do is remove the last byte of
>    the secret and include it in the entropy words, but that's
>    just a rotation of the secret between storage and hashing.
>
> Also, I assume you'd like SipHash-2-4, since you want to rely
> on a security analysis.

I haven't looked, but I assume that the analysis at least thought
about reduced rounds, so maybe other variants are okay.

>> The one thing I don't like is
>> that I don't see how to prove that you can't run it backwards if you
>> manage to acquire a memory dump.  In fact, I that that there exist, at
>> least in theory, hash functions that are secure in the random oracle
>> model but that *can* be run backwards given the full state.  From
>> memory, SHA-3 has exactly that property, and it would be a bit sad for
>> a CSPRNG to be reversible.
>
> Er...  get_random_int() is specifically *not* designed to be resistant
> to state capture, and I didn't try.  Remember, what it's used for
> is ASLR, what we're worried about is somene learning the layouts
> of still-running processes, and and if you get a memory dump, you have
> the memory layout!

True, but it's called get_random_int(), and it seems like making it
stronger, especially if the performance cost is low to zero, is a good
thing.

>
> If you want anti-backtracking, though, it's easy to add.  What we
> hash is:
>
> entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...
>
> You mix the output word right back in to the (unfinalized) state after
> generating it.  This is still equivalent to unmodified back-box SipHash,
> you're just using a (conceptually independent) SipHash invocation to
> produce some of its input.

Ah, cute.  This could probably be sped up by doing something like:

entropy_0 || secret || output_0 ^ entropy_1 || secret || ...

It's a little weak because the output is only 64 bits, so you could
plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the
old entropy is guessable.  I suspect there are sneaky ways around it
like using output_n-1 ^ output_n-2 or similar.  I'll sleep on it.

>
> The only remaining issues are:
> 1) How many rounds, and
> 2) May we use HalfSipHash?

I haven't looked closely enough to have a real opinion here.  I don't
know what the security margin is believed to be.

>
> I'd *like* to persuade you that skipping the padding byte wouldn't
> invalidate any security proofs, because it's true and would simplify
> the code.  But if you want 100% stock, I'm willing to cater to that.

I lean toward stock in the absence of a particularly good reason.  At
the very least I'd want to read that paper carefully.

>
> Ted, what do you think?



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-22  5:42   ` Andy Lutomirski
@ 2016-12-22  8:02     ` George Spelvin
  0 siblings, 0 replies; 19+ messages in thread
From: George Spelvin @ 2016-12-22  8:02 UTC (permalink / raw)
  To: linux, luto
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, luto, netdev, tom, torvalds, tytso, vegard.nossum

> True, but it's called get_random_int(), and it seems like making it
> stronger, especially if the performance cost is low to zero, is a good
> thing.

If it's cheap enough, I don't mind.  But it's documented as being
marginal-quality, used where speed is more important.

In particular, it's *not* used for key material; only values that matter
only as long as they are in use.  Whule they're in use, they can't be
concealed from an attacker with kernel access, and when they're dne
being used, they're worthless.

>> If you want anti-backtracking, though, it's easy to add.  What we
>> hash is:
>>
>> entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...
>>
>> You mix the output word right back in to the (unfinalized) state after
>> generating it.  This is still equivalent to unmodified back-box SipHash,
>> you're just using a (conceptually independent) SipHash invocation to
>> produce some of its input.

> Ah, cute.  This could probably be sped up by doing something like:
>
> entropy_0 || secret || output_0 ^ entropy_1 || secret || ...

I'm disinclined to do that because that requires deferring the mixing
until the *next* time you generate something.  Storing the value you
don't want revealed by a state capture defeats the purpose.

I'd rather mix it in immediately, so you have anti-backtracking from
the moment of creation.

Also, I don't think it's particularly "cute" or clever; mixing output back
in is the standard way all antibacktracking is accomplished.  It's how
the Davies-Meyer hash construction makes a reversible function one-way.

(There is a second way to do it by throwing away state, but that's
expensive in seed entropy.)

> It's a little weak because the output is only 64 bits, so you could
> plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the
> old entropy is guessable.  I suspect there are sneaky ways around it
> like using output_n-1 ^ output_n-2 or similar.  I'll sleep on it.

Ah, yes, I see.  Given the final state, you guess the output word, go
back one round, then forward the finalization rounds.   Is the output
equal to the guessed output?  You'll find the true value, plus
Poisson(1 - 2^-64) additional.  (Since you have 2^64-1 chances at
something with probability 1 in 2^64.)

And this can be iterated as often as you like to get earlier output words,
as long as you can guess the entropy.  *That's* the part that hurts;
you'd like something that peters out.

You could use the double-length-output SipHash variant (which requires
a second set of finalization rounds) and mix more output back, but
that's expensive.

The challenge is coming up with more unpredictable data to mix in than one
invocation of SipHash returns.  And without storing previous output
anywhere, because that is exactly wrong.

A running sum or xor or whatever of the outputs doesn't help, because
once you've guessed the last output, you can backtrack that for no
additional effort.

State capture is incredibly difficult, our application doesn't require
resistance anyway... unless you can think of something cheap, I think
we can just live with this.

>> I'd *like* to persuade you that skipping the padding byte wouldn't
>> invalidate any security proofs, because it's true and would simplify
>> the code.  But if you want 100% stock, I'm willing to cater to that.

> I lean toward stock in the absence of a particularly good reason.  At
> the very least I'd want to read that paper carefully.

Er... adding the length is standard Merkle-Damgaard strengthening.
Why you do this is described in the original papers by Merkle and Damgaard.

The lay summary is at
https://en.wikipedia.org/wiki/Merkle-Damgard_construction

The original sources are:
http://www.merkle.com/papers/Thesis1979.pdf
http://saluc.engr.uconn.edu/refs/algorithms/hashalg/damgard89adesign.pdf

Merkle describes the construction; Damgaard proves it secure.  Basically,
appending the length is required to handle variable-length input if the
input is not itself self-delimiting.

The proof of security is theorem 3.1 in the latter.  (The first, more
detailed explanation involves the use of an extra bit, which the second
then explains how todo without.)

In particular, see the top of page 420, which notes that the security
proof only requires encoding *how much padding is added* in the final
block, not the overall length of the message, and the second remark on
p. 421 which notes that no such suffix is required if it's not necessary
to distinguish messages with different numbers of trailing null bytes.

The rules are alluded to in the "Choice of padding rule" part of the
"Rationale" section of the SipHash paper (p. 7), but the description is
very brief because it assumes the reader has the background.

That's why they say "We could have chosen a slightly simpler padding rule,
such as appending a <tt>80</tt> byte followed by zeroes."

The thing is, if the amount of the last block that is used is fixed
(within the domain of a particular key), you don't need to encode this
information at all.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-22  2:07 George's crazy full state idea (Re: HalfSipHash Acceptable Usage) Andy Lutomirski
  2016-12-22  5:01 ` George Spelvin
@ 2016-12-22 16:09 ` Andy Lutomirski
  2016-12-22 19:24   ` George Spelvin
  1 sibling, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-12-22 16:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: George Spelvin, Ted Ts'o, Andi Kleen, David S. Miller,
	David Laight, D. J. Bernstein, Eric Biggers, Eric Dumazet,
	Hannes Frederic Sowa, Jason A. Donenfeld, Jean-Philippe Aumasson,
	kernel-hardening, Linux Crypto Mailing List, linux-kernel,
	Network Development, Tom Herbert, Linus Torvalds, Vegard Nossum

On Wed, Dec 21, 2016 at 6:07 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin
> <linux@sciencehorizons.net> wrote:
>> As a separate message, to disentangle the threads, I'd like to
>> talk about get_random_long().
>>
>> After some thinking, I still like the "state-preserving" construct
>> that's equivalent to the current MD5 code.  Yes, we could just do
>> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
>> preserve a bit more.
>>
>> It requires library support from the SipHash code to return the full
>> SipHash state, but I hope that's a fair thing to ask for.
>
> I don't even think it needs that.  This is just adding a
> non-destructive final operation, right?
>
>>
>> Here's my current straw man design for comment.  It's very similar to
>> the current MD5-based design, but feeds all the seed material in the
>> "correct" way, as opposed to Xring directly into the MD5 state.
>>
>> * Each CPU has a (Half)SipHash state vector,
>>   "unsigned long get_random_int_hash[4]".  Unlike the current
>>   MD5 code, we take care to initialize it to an asymmetric state.
>>
>> * There's a global 256-bit random_int_secret (which we could
>>   reseed periodically).
>>
>> To generate a random number:
>> * If get_random_int_hash is all-zero, seed it with fresh a half-sized
>>   SipHash key and the appropriate XOR constants.
>> * Generate three words of random_get_entropy(), jiffies, and current->pid.
>>   (This is arbitary seed material, copied from the current code.)
>> * Crank through that with (Half)SipHash-1-0.
>> * Crank through the random_int_secret with (Half)SipHash-1-0.
>> * Return v1 ^ v3.
>
> Just to clarify, if we replace SipHash with a black box, I think this
> effectively means, where "entropy" is random_get_entropy() || jiffies
> || current->pid:
>
> The first call returns H(random seed || entropy_0 || secret).  The
> second call returns H(random seed || entropy_0 || secret || entropy_1
> || secret).  Etc.

Having slept on this, I like it less.  The problem is that a
backtracking attacker doesn't just learn H(random seed || entropy_0 ||
secret || ...) -- they learn the internal state of the hash function
that generates that value.  This probably breaks any attempt to apply
security properties of the hash function.  For example, the internal
state could easily contain a whole bunch of prior outputs it in
verbatim.

--Andy

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-22 16:09 ` Andy Lutomirski
@ 2016-12-22 19:24   ` George Spelvin
  2016-12-22 19:32     ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: George Spelvin @ 2016-12-22 19:24 UTC (permalink / raw)
  To: luto, luto
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, linux, netdev, tom, torvalds, tytso, vegard.nossum

> Having slept on this, I like it less.  The problem is that a
> backtracking attacker doesn't just learn H(random seed || entropy_0 ||
> secret || ...) -- they learn the internal state of the hash function
> that generates that value.  This probably breaks any attempt to apply
> security properties of the hash function.  For example, the internal
> state could easily contain a whole bunch of prior outputs it in
> verbatim.

The problem is, anti-backtracing is in severe tension with your desire
to use unmodified SipHash.

First of all, I'd like to repeat that it isn't a design goal of the current
generator and isn't necessary.

The current generator just returns hash[0] from the MD5 state, then
leaves the state stored.  The fact that it conceals earlier outputs is
an accident of the Davies-Meyer structure of md5_transform.

It isn't necessary, because every secret generated is stored unencrypted
for as long as it's of value.  A few values are used for retransmit
backoffs and random MAC addresses.  Those are revealed to the world as
soon as they're used.

Most values are used for ASLR.  These address are of interest to an
attacker trying to mount a buffer-overflow attack, but that only lasts
as long as the process is running to receive buffers.  After the process
exits, knowledge of its layout is worthless.

And this is stored as long as it's running in kernel-accessible data,
so storing a *second* copy in less conveniently kernel-accessible data
(the RNG state) doesn't make the situation any worse.

In addition to the above, if you're assuming a state capture, then
since we have (for necessary efficieny reasons) a negligible about of
fresh entropy, an attacker has the secret key and can predict *future*
outputs very easily.

Given that information, an attacker doesn't need to learn the layout of
vulnerable server X.  If they have a buffer overflow, they can crash
the current instance and wait for a fresh image to be started (with a
known address space) to launch their attack at.

Kernel state capture attacks are a very unlikely attack, mostly because
it's a narrow target a hair's breadth away from the much more interesting
outright kernel compromise (attacker gains write access as well as read)
which renders all this fancy cryptanaysis moot.

Now, the main point:  it's not likely to be solvable.

The problem with unmodified SipHash is that is has only 64 bits of
output.  No mix-back mechanism can get around the fundamental problem
that that's too small to prevent a brute-force guessing attack.  You need
wider mix-back.  And getting more output from unmodified SipHash requires
more finalization rounds, which is expensive.

(Aside: 64 bits does have the advantage that it can't be brute-forced on
the attacked machine.  It must be exfiltrated to the attacker, and the
solution returned to the attack code.  But doing this is getting easier
all the time.)

Adding antibacktracking to SipHash is trivial: just add a Davies-Meyer
structure around its internal state.  Remember the internal state before
hashing in the entropy and secret, generate the output, then add the
previous and final states together for storage.

This is a standard textbook construction, very cheap, and doesn't touch
the compression function which is the target of analysis and attacks,
but it requires poking around in SipHash's internal state.  (And people
who read the textbooks without understanding them will get upset because
the textbooks all talk about using this construction with block ciphers,
and SipHash's compression function is not advertised as a block cipher.)

Alternative designs exist; you could extract additional output from
earlier rounds of SipHash, using the duplex sponge construction you
mentioned earlier.  That output would be used for mixback purposes *only*,
so wouldn't affect the security proof of the "primary" output.
But this is also getting creative with SipHash's internals.

Now, you could use a completely *different* cryptographic primitive
to enforce one-way-ness, and apply SipHash as a strong output transform,
but that doesn't feel like good design, and is probably more expensive.

Finally, your discomfort about an attacker learning the internal state...
if an attacker knows the key and the input, they can construct the
internal state.  Yes, we could discard the internal state and construct
a fresh one on the next call to get_random_int, but what are you going
to key it with?  What are you going to feed it?  What keeps *that*
internal state any more secret from an attacker than one that's explicitly
stored?

Keeping the internal state around is a cacheing optimization, that's all.

*If* you're assuming a state capture, the only thing secret from the
attacker is any fresh entropy collected between the time of capture
and the time of generation.  Due to mandatory efficiency requirements,
this is very small.  

I really think you're wishing for the impossible here.

A final note: although I'm disagreeing with you, thank you very much for 
the informed discussion.  Knowing that someone will read and think about
this message carefully has forced me to think it through carefully myself.

For example, clearly stating the concern over starting new processes
with predictable layout, and the limits on the fresh entropy supply,
has made me realize that there *is* a possible source: each exec()
is passed 128 bits from get_random_bytes in the AT_RANDOM element of
its auxv.  Since get_random_int() accesses "current" anyway, we could
store some seed material there rather than using "pid".  While this is
not fresh for each call to get_random_int, it *is* fresh for each new
address space whose layout is being randomized.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-22 19:24   ` George Spelvin
@ 2016-12-22 19:32     ` Andy Lutomirski
  2016-12-22 21:11       ` George Spelvin
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-12-22 19:32 UTC (permalink / raw)
  To: George Spelvin
  Cc: Andrew Lutomirski, Andi Kleen, David S. Miller, David Laight,
	D. J. Bernstein, Eric Biggers, Eric Dumazet,
	Hannes Frederic Sowa, Jason A. Donenfeld, Jean-Philippe Aumasson,
	kernel-hardening, Linux Crypto Mailing List, linux-kernel,
	Network Development, Tom Herbert, Linus Torvalds, Ted Ts'o,
	Vegard Nossum

On Thu, Dec 22, 2016 at 11:24 AM, George Spelvin
<linux@sciencehorizons.net> wrote:
>> Having slept on this, I like it less.  The problem is that a
>> backtracking attacker doesn't just learn H(random seed || entropy_0 ||
>> secret || ...) -- they learn the internal state of the hash function
>> that generates that value.  This probably breaks any attempt to apply
>> security properties of the hash function.  For example, the internal
>> state could easily contain a whole bunch of prior outputs it in
>> verbatim.
>
> The problem is, anti-backtracing is in severe tension with your desire
> to use unmodified SipHash.
>
> First of all, I'd like to repeat that it isn't a design goal of the current
> generator and isn't necessary.

Agreed.

> Now, the main point:  it's not likely to be solvable.
>
> The problem with unmodified SipHash is that is has only 64 bits of
> output.  No mix-back mechanism can get around the fundamental problem
> that that's too small to prevent a brute-force guessing attack.  You need
> wider mix-back.  And getting more output from unmodified SipHash requires
> more finalization rounds, which is expensive.

It could only mix the output back in every two calls, in which case
you can backtrack up to one call but you need to do 2^128 work to
backtrack farther.  But yes, this is getting excessively complicated.

> Finally, your discomfort about an attacker learning the internal state...
> if an attacker knows the key and the input, they can construct the
> internal state.  Yes, we could discard the internal state and construct
> a fresh one on the next call to get_random_int, but what are you going
> to key it with?  What are you going to feed it?  What keeps *that*
> internal state any more secret from an attacker than one that's explicitly
> stored?

I do tend to like Ted's version in which we use batched
get_random_bytes() output.  If it's fast enough, it's simpler and lets
us get the full strength of a CSPRNG.

(Aside: some day I want to move all that code from drivers/ to lib/
and teach it to be buildable in userspace, too, so it's easy to play
with it, feed it test vectors, confirm that it's equivalent to a
reference implementation, write up the actual design and try to get
real cryptographers to analyze it, etc.)

> For example, clearly stating the concern over starting new processes
> with predictable layout, and the limits on the fresh entropy supply,
> has made me realize that there *is* a possible source: each exec()
> is passed 128 bits from get_random_bytes in the AT_RANDOM element of
> its auxv.  Since get_random_int() accesses "current" anyway, we could
> store some seed material there rather than using "pid".  While this is
> not fresh for each call to get_random_int, it *is* fresh for each new
> address space whose layout is being randomized.

Hmm, interesting.  Although, for ASLR, we could use get_random_bytes()
directly and be done with it.  It won't be a bottleneck.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-22 19:32     ` Andy Lutomirski
@ 2016-12-22 21:11       ` George Spelvin
  2016-12-22 21:38         ` Hannes Frederic Sowa
  0 siblings, 1 reply; 19+ messages in thread
From: George Spelvin @ 2016-12-22 21:11 UTC (permalink / raw)
  To: linux, luto
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, netdev, tom, torvalds, tytso, vegard.nossum

> I do tend to like Ted's version in which we use batched
> get_random_bytes() output.  If it's fast enough, it's simpler and lets
> us get the full strength of a CSPRNG.

With the ChaCha20 generator, that's fine, although note that this abandons
anti-backtracking entirely.

It also takes locks, something the previous get_random_int code
path avoided.  Do we need to audit the call sites to ensure that's safe?

And there is the issue that the existing callers assume that there's a
fixed cost per word.  A good half of get_random_long calls are followed by
"& ~PAGE_MASK" to extract the low 12 bits.  Or "& ((1ul << mmap_rnd_bits)
- 1)" to extract the low 28.  If we have a buffer we're going to have to
pay to refill, it would be nice to use less than 8 bytes to satisfy those.

But that can be a followup patch.  I'm thinking

unsigned long get_random_bits(unsigned bits)
	E.g. get_random_bits(PAGE_SHIFT),
	     get_random_bits(mmap_rnd_bits),
	u32 imm_rnd = get_random_bits(32)

unsigned get_random_mod(unsigned modulus)
	E.g. get_random_mod(hole) & ~(alignment - 1);
	     get_random_mod(port_scan_backoff)
	(Althogh probably drivers/s390/scsi/zfcp_fc.c should be changed
	to prandom.)

with, until the audit is completed:
#define get_random_int() get_random_bits(32)
#define get_random_long() get_random_bits(BITS_PER_LONG)

> It could only mix the output back in every two calls, in which case
> you can backtrack up to one call but you need to do 2^128 work to
> backtrack farther.  But yes, this is getting excessively complicated.

No, if you're willing to accept limited backtrack, this is a perfectly
acceptable solution, and not too complicated.  You could do it phase-less
if you like; store the previous output, then after generating the new
one, mix in both.  Then overwrite the previous output.  (But doing two
rounds of a crypto primtive to avoid one conditional jump is stupid,
so forget that.)

>> Hmm, interesting.  Although, for ASLR, we could use get_random_bytes()
>> directly and be done with it.  It won't be a bottleneck.

Isn't that what you already suggested?

I don't mind fewer primtives; I got a bit fixated on "Replace MD5 with
SipHash".  It's just the locking that I want to check isn't a problem.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-22 21:11       ` George Spelvin
@ 2016-12-22 21:38         ` Hannes Frederic Sowa
  2016-12-23  0:07           ` George Spelvin
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-22 21:38 UTC (permalink / raw)
  To: George Spelvin, luto
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, Jason,
	jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, netdev, tom, torvalds, tytso, vegard.nossum

On 22.12.2016 22:11, George Spelvin wrote:
>> I do tend to like Ted's version in which we use batched
>> get_random_bytes() output.  If it's fast enough, it's simpler and lets
>> us get the full strength of a CSPRNG.
> 
> With the ChaCha20 generator, that's fine, although note that this abandons
> anti-backtracking entirely.
> 
> It also takes locks, something the previous get_random_int code
> path avoided.  Do we need to audit the call sites to ensure that's safe?

We have spin_lock_irq* locks on the way. Of course they can hurt in when
contended. The situation should be the same as with get_random_bytes,
callable from every possible situation in the kernel without fear, so
this should also work for get_random_int. A lockdep test should still be
done. ;)

> And there is the issue that the existing callers assume that there's a
> fixed cost per word.  A good half of get_random_long calls are followed by
> "& ~PAGE_MASK" to extract the low 12 bits.  Or "& ((1ul << mmap_rnd_bits)
> - 1)" to extract the low 28.  If we have a buffer we're going to have to
> pay to refill, it would be nice to use less than 8 bytes to satisfy those.
> 
> But that can be a followup patch.  I'm thinking
> 
> unsigned long get_random_bits(unsigned bits)
> 	E.g. get_random_bits(PAGE_SHIFT),
> 	     get_random_bits(mmap_rnd_bits),
> 	u32 imm_rnd = get_random_bits(32)
> 
> unsigned get_random_mod(unsigned modulus)
> 	E.g. get_random_mod(hole) & ~(alignment - 1);
> 	     get_random_mod(port_scan_backoff)
> 	(Althogh probably drivers/s390/scsi/zfcp_fc.c should be changed
> 	to prandom.)
> 
> with, until the audit is completed:
> #define get_random_int() get_random_bits(32)
> #define get_random_long() get_random_bits(BITS_PER_LONG)

Yes, that does look nice indeed. Accounting for bits instead of bytes
shouldn't be a huge problem either. Maybe it gets a bit more verbose in
case you can't satisfy a request with one batched entropy block and have
to consume randomness from two.

>> It could only mix the output back in every two calls, in which case
>> you can backtrack up to one call but you need to do 2^128 work to
>> backtrack farther.  But yes, this is getting excessively complicated.
> 
> No, if you're willing to accept limited backtrack, this is a perfectly
> acceptable solution, and not too complicated.  You could do it phase-less
> if you like; store the previous output, then after generating the new
> one, mix in both.  Then overwrite the previous output.  (But doing two
> rounds of a crypto primtive to avoid one conditional jump is stupid,
> so forget that.)

Can you quickly explain why we lose the backtracking capability?

ChaCha as a block cipher gives a "perfect" permutation from the output
of either the CRNG or the CPRNG, which actually itself has backtracking
protection.

Thanks for explaining,
Hannes

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-22 21:38         ` Hannes Frederic Sowa
@ 2016-12-23  0:07           ` George Spelvin
  2016-12-23 12:05             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 19+ messages in thread
From: George Spelvin @ 2016-12-23  0:07 UTC (permalink / raw)
  To: hannes, linux, luto
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, Jason,
	jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, netdev, tom, torvalds, tytso, vegard.nossum

Hannes Frederic Sowa wrote:
> A lockdep test should still be done. ;)

Adding might_lock() annotations will improve coverage a lot.

> Yes, that does look nice indeed. Accounting for bits instead of bytes
> shouldn't be a huge problem either. Maybe it gets a bit more verbose in
> case you can't satisfy a request with one batched entropy block and have
> to consume randomness from two.

The bit granularity is also for the callers' convenience, so they don't
have to mask again.  Whether get_random_bits rounds up to byte boundaries
internally or not is something else.

When the current batch runs low, I was actually thinking of throwing
away the remaining bits and computing a new batch of 512.  But it's
whatever works best at implementation time.

>>> It could only mix the output back in every two calls, in which case
>>> you can backtrack up to one call but you need to do 2^128 work to
>>> backtrack farther.  But yes, this is getting excessively complicated.
> 
>> No, if you're willing to accept limited backtrack, this is a perfectly
>> acceptable solution, and not too complicated.  You could do it phase-less
>> if you like; store the previous output, then after generating the new
>> one, mix in both.  Then overwrite the previous output.  (But doing two
>> rounds of a crypto primtive to avoid one conditional jump is stupid,
>> so forget that.)

> Can you quickly explain why we lose the backtracking capability?

Sure.  An RNG is (state[i], output[i]) = f(state[i-1]).  The goal of
backtracking is to compute output[i], or better yet state[i-1], given
state[i].

For example, consider an OFB or CTR mode generator.  The state is a key
and and IV, and you encrypt the IV with the key to produce output, then
either replace the IV with the output, or increment it.  Either way,
since you still have the key, you can invert the transformation and
recover the previous IV.

The standard way around this is to use the Davies-Meyer construction:

IV[i] = IV[i-1] + E(IV[i-1], key)

This is the standard way to make a non-invertible random function
out of an invertible random permutation.

>From the sum, there's no easy way to find the ciphertext *or* the
plaintext that was encrypted.  Assuming the encryption is secure,
the only way to reverse it is brute force: guess IV[i-1] and run the
operation forward to see if the resultant IV[i] matches.

There are a variety of ways to organize this computation, since the
guess gives toy both IV[i-1] and E(IV[i-1], key) = IV[i] - IV[i-1], including
running E forward, backward, or starting from both ends to see if you
meet in the middle.

The way you add the encryption output to the IV is not very important.
It can be addition, xor, or some more complex invertible transformation.
In the case of SipHash, the "encryption" output is smaller than the
input, so we have to get a bit more creative, but it's still basically
the same thing.

The problem is that the output which is combined with the IV is too small.
With only 64 bits, trying all possible values is practical.  (The world's
Bitcoin miners are collectively computing SHA-256(SHA-256(input)) 1.7 * 2^64
times per second.)

By basically doing two iterations at once and mixing in 128 bits of
output, the guessing attack is rendered impractical.  The only downside
is that you need to remember and store one result between when it's
computed and last used.  This is part of the state, so an attack can
find output[i-1], but not anything farther back.

> ChaCha as a block cipher gives a "perfect" permutation from the output
> of either the CRNG or the CPRNG, which actually itself has backtracking
> protection.

I'm not quite understanding.  The /dev/random implementation uses some
of the ChaCha output as a new ChaCha key (that's another way to mix output
back into the state) to prevent backtracking.  But this slows it down, and
again if you want to be efficient, you're generating and storing large batches
of entropy and storing it in the RNG state.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-23  0:07           ` George Spelvin
@ 2016-12-23 12:05             ` Hannes Frederic Sowa
  2016-12-23 18:26               ` George Spelvin
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-23 12:05 UTC (permalink / raw)
  To: George Spelvin, luto
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, Jason,
	jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, netdev, tom, torvalds, tytso, vegard.nossum

On Thu, 2016-12-22 at 19:07 -0500, George Spelvin wrote:
> Hannes Frederic Sowa wrote:
> > A lockdep test should still be done. ;)
> 
> Adding might_lock() annotations will improve coverage a lot.

Might be hard to find the correct lock we take later down the code
path, but if that is possible, certainly.

> > Yes, that does look nice indeed. Accounting for bits instead of bytes
> > shouldn't be a huge problem either. Maybe it gets a bit more verbose in
> > case you can't satisfy a request with one batched entropy block and have
> > to consume randomness from two.
> 
> The bit granularity is also for the callers' convenience, so they don't
> have to mask again.  Whether get_random_bits rounds up to byte boundaries
> internally or not is something else.
> 
> When the current batch runs low, I was actually thinking of throwing
> away the remaining bits and computing a new batch of 512.  But it's
> whatever works best at implementation time.
> 
> > > > It could only mix the output back in every two calls, in which case
> > > > you can backtrack up to one call but you need to do 2^128 work to
> > > > backtrack farther.  But yes, this is getting excessively complicated.
> > > No, if you're willing to accept limited backtrack, this is a perfectly
> > > acceptable solution, and not too complicated.  You could do it phase-less
> > > if you like; store the previous output, then after generating the new
> > > one, mix in both.  Then overwrite the previous output.  (But doing two
> > > rounds of a crypto primtive to avoid one conditional jump is stupid,
> > > so forget that.)
> > Can you quickly explain why we lose the backtracking capability?
> 
> Sure.  An RNG is (state[i], output[i]) = f(state[i-1]).  The goal of
> backtracking is to compute output[i], or better yet state[i-1], given
> state[i].
> 
> For example, consider an OFB or CTR mode generator.  The state is a key
> and and IV, and you encrypt the IV with the key to produce output, then
> either replace the IV with the output, or increment it.  Either way,
> since you still have the key, you can invert the transformation and
> recover the previous IV.
> 
> The standard way around this is to use the Davies-Meyer construction:
> 
> IV[i] = IV[i-1] + E(IV[i-1], key)
> 
> This is the standard way to make a non-invertible random function
> out of an invertible random permutation.
> 
> From the sum, there's no easy way to find the ciphertext *or* the
> plaintext that was encrypted.  Assuming the encryption is secure,
> the only way to reverse it is brute force: guess IV[i-1] and run the
> operation forward to see if the resultant IV[i] matches.
> 
> There are a variety of ways to organize this computation, since the
> guess gives toy both IV[i-1] and E(IV[i-1], key) = IV[i] - IV[i-1], including
> running E forward, backward, or starting from both ends to see if you
> meet in the middle.
> 
> The way you add the encryption output to the IV is not very important.
> It can be addition, xor, or some more complex invertible transformation.
> In the case of SipHash, the "encryption" output is smaller than the
> input, so we have to get a bit more creative, but it's still basically
> the same thing.
> 
> The problem is that the output which is combined with the IV is too small.
> With only 64 bits, trying all possible values is practical.  (The world's
> Bitcoin miners are collectively computing SHA-256(SHA-256(input)) 1.7 * 2^64
> times per second.)
> 
> By basically doing two iterations at once and mixing in 128 bits of
> output, the guessing attack is rendered impractical.  The only downside
> is that you need to remember and store one result between when it's
> computed and last used.  This is part of the state, so an attack can
> find output[i-1], but not anything farther back.

Thanks a lot for the explanation!

> > ChaCha as a block cipher gives a "perfect" permutation from the output
> > of either the CRNG or the CPRNG, which actually itself has backtracking
> > protection.
> 
> I'm not quite understanding.  The /dev/random implementation uses some
> of the ChaCha output as a new ChaCha key (that's another way to mix output
> back into the state) to prevent backtracking.  But this slows it down, and
> again if you want to be efficient, you're generating and storing large batches
> of entropy and storing it in the RNG state.

I was actually referring to the anti-backtrack protection in
/dev/random and also /dev/urandom, from where we reseed every 300
seconds and if our batched entropy runs low with Ted's/Jason's current
patch for get_random_int.

As far as I can understand it, backtracking is not a problem in case of
a reseed event inside extract_crng.

When we hit the chacha20 without doing a reseed we only mutate the
state of chacha, but being an invertible function in its own, a
proposal would be to mix parts of the chacha20 output back into the
state, which, as a result, would cause slowdown because we couldn't
propagate the complete output of the cipher back to the caller (looking
at the function _extract_crng).

Or are you referring that the anti-backtrack protection should happen
in every call from get_random_int?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-23 12:05             ` Hannes Frederic Sowa
@ 2016-12-23 18:26               ` George Spelvin
  2016-12-23 20:48                 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 19+ messages in thread
From: George Spelvin @ 2016-12-23 18:26 UTC (permalink / raw)
  To: hannes, linux
  Cc: ak, davem, David.Laight, ebiggers3, eric.dumazet, Jason,
	kernel-hardening, linux-crypto, linux-kernel, luto, netdev, tom,
	tytso, vegard.nossum

(Cc: list trimmed slightly as the topic is wandering a bit.)

Hannes Frederic Sowa wrote:
> On Thu, 2016-12-22 at 19:07 -0500, George Spelvin wrote:
>> Adding might_lock() annotations will improve coverage a lot.
>
> Might be hard to find the correct lock we take later down the code
> path, but if that is possible, certainly.

The point of might_lock() is that you don't have to.  You find the
worst case (most global) lock that the code *might* take if all the
buffer-empty conditions are true, and tell lockdep "assume this lock is
taken all the time".

>> Hannes Frederic Sowa wrote:
>>> Yes, that does look nice indeed. Accounting for bits instead of bytes
>>> shouldn't be a huge problem either. Maybe it gets a bit more verbose in
>>> case you can't satisfy a request with one batched entropy block and have
>>> to consume randomness from two.

For example, here's a simple bit-buffer implementation that wraps around
a get_random_long.  The bitbuffer is of the form "00001xxxx", where the
x bits are valid, and the position of the msbit indicates how many bits
are valid.

extern unsigned long get_random_long();
static unsigned long bitbuffer = 1;	/* Holds 0..BITS_PER_LONG-1 bits */
unsigned long get_random_bits(unsigned char bits)
{
	/* We handle bits == BITS_PER_LONG,and not bits == 0 */
	unsigned long mask = -1ul >> (BITS_PER_LONG - bits);
	unsigned long val;

	if (bitbuffer > mask) {
		/* Request can be satisfied out of the bit buffer */
		val = bitbuffer;
		bitbuffer >>= bits;
	} else {
		/*
		 * Not enough bits, but enough room in bitbuffer for the
		 * leftovers.  avail < bits, so avail + 64 <= bits + 63.
		 */
		val = get_random_long();
		bitbuffer = bitbuffer << (BITS_PER_LONG - bits)
			| val >> 1 >> (bits-1);
	}
	return val & mask;
}

> When we hit the chacha20 without doing a reseed we only mutate the
> state of chacha, but being an invertible function in its own, a
> proposal would be to mix parts of the chacha20 output back into the
> state, which, as a result, would cause slowdown because we couldn't
> propagate the complete output of the cipher back to the caller (looking
> at the function _extract_crng).

Basically, yes.  Half of the output goes to rekeying itself.

But, I just realized I've been overlooking something glaringly obvious...
there's no reason you can't compute multple blocks in advance.

The standard assumption in antibacktracking is that you'll *notice* the
state capture and stop trusting the random numbers afterward; you just
want the output *before* to be secure.  In other words, cops busting
down the door can't find the session key used in the message you just sent.

So you can compute and store random numbers ahead of need.

This can amortize the antibacktracking as much as you'd like.

For example, suppose we gave each CPU a small pool to minimize locking.
When one runs out and calls the global refill, it could refill *all*
of the CPU pools.  (You don't even need locking; there may be a race to
determine *which* random numbers the reader sees, but they're equally
good either way.)

> Or are you referring that the anti-backtrack protection should happen
> in every call from get_random_int?

If you ask for anti-backtracking without qualification, that's the
goal, since you don't know how long will elapse until the next call.  

It's like fsync().  There are lots of more limited forms of "keep my
data safe in case of a crash", but the most basic one is "if we lost
power the very instant the call returned, the data would be safe."

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-23 18:26               ` George Spelvin
@ 2016-12-23 20:48                 ` Hannes Frederic Sowa
  2016-12-23 23:39                   ` George Spelvin
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-23 20:48 UTC (permalink / raw)
  To: George Spelvin
  Cc: ak, davem, David.Laight, ebiggers3, eric.dumazet, Jason,
	kernel-hardening, linux-crypto, linux-kernel, luto, netdev, tom,
	tytso, vegard.nossum

Hi,

On Fri, 2016-12-23 at 13:26 -0500, George Spelvin wrote:
> (Cc: list trimmed slightly as the topic is wandering a bit.)
> 
> Hannes Frederic Sowa wrote:
> > On Thu, 2016-12-22 at 19:07 -0500, George Spelvin wrote:
> > > Adding might_lock() annotations will improve coverage a lot.
> > 
> > Might be hard to find the correct lock we take later down the code
> > path, but if that is possible, certainly.
> 
> The point of might_lock() is that you don't have to.  You find the
> worst case (most global) lock that the code *might* take if all the
> buffer-empty conditions are true, and tell lockdep "assume this lock is
> taken all the time".

Yes, of course. Probably indicating input_pool's and primary_crng's
locks are good to indicate here, but on NUMA other locks might be
taken.

> > > Hannes Frederic Sowa wrote:
> > > > Yes, that does look nice indeed. Accounting for bits instead of bytes
> > > > shouldn't be a huge problem either. Maybe it gets a bit more verbose in
> > > > case you can't satisfy a request with one batched entropy block and have
> > > > to consume randomness from two.
> 
> For example, here's a simple bit-buffer implementation that wraps around
> a get_random_long.  The bitbuffer is of the form "00001xxxx", where the
> x bits are valid, and the position of the msbit indicates how many bits
> are valid.
> 
> extern unsigned long get_random_long();
> static unsigned long bitbuffer = 1;	/* Holds 0..BITS_PER_LONG-1 bits */
> unsigned long get_random_bits(unsigned char bits)
> {
> 	/* We handle bits == BITS_PER_LONG,and not bits == 0 */
> 	unsigned long mask = -1ul >> (BITS_PER_LONG - bits);
> 	unsigned long val;
> 
> 	if (bitbuffer > mask) {
> 		/* Request can be satisfied out of the bit buffer */
> 		val = bitbuffer;
> 		bitbuffer >>= bits;
> 	} else {
> 		/*
> 		 * Not enough bits, but enough room in bitbuffer for the
> 		 * leftovers.  avail < bits, so avail + 64 <= bits + 63.
> 		 */
> 		val = get_random_long();
> 		bitbuffer = bitbuffer << (BITS_PER_LONG - bits)
> 			| val >> 1 >> (bits-1);
> 	}
> 	return val & mask;
> }

In general this looks good, but bitbuffer needs to be protected from
concurrent access, thus needing at least some atomic instruction and
disabling of interrupts for the locking if done outside of
get_random_long. Thus I liked your previous approach more where you
simply embed this in the already necessary get_random_long and aliased
get_random_long as get_random_bits(BITS_PER_LONG) accordingly, IMHO.

> > When we hit the chacha20 without doing a reseed we only mutate the
> > state of chacha, but being an invertible function in its own, a
> > proposal would be to mix parts of the chacha20 output back into the
> > state, which, as a result, would cause slowdown because we couldn't
> > propagate the complete output of the cipher back to the caller (looking
> > at the function _extract_crng).
> 
> Basically, yes.  Half of the output goes to rekeying itself.

Okay, thanks and understood. :)

> But, I just realized I've been overlooking something glaringly obvious...
> there's no reason you can't compute multple blocks in advance.

In the current code on the cost of per cpu allocations thus memory.

> The standard assumption in antibacktracking is that you'll *notice* the
> state capture and stop trusting the random numbers afterward; you just
> want the output *before* to be secure.  In other words, cops busting
> down the door can't find the session key used in the message you just sent.

Yep, analogous to forward secrecy and future secrecy (self healing) is
provided by catastrophic reseeding from /dev/random.

In the extract_crng case, couldn't we simply use 8 bytes of the 64 byte
return block to feed it directly back into the state chacha? So we pass
on 56 bytes into the pcpu buffer, and consume 8 bytes for the next
state. This would make the window max shorter than the anti
backtracking protection right now from 300s to 14 get_random_int calls.
Not sure if this is worth it.

> So you can compute and store random numbers ahead of need.
> 
> This can amortize the antibacktracking as much as you'd like.
> 
> For example, suppose we gave each CPU a small pool to minimize locking.
> When one runs out and calls the global refill, it could refill *all*
> of the CPU pools.  (You don't even need locking; there may be a race to
> determine *which* random numbers the reader sees, but they're equally
> good either way.)

Yes, but still disabled interrupts, otherwise the same state could be
used twice on the same CPU. Uff, I think we have this problem in
prandom_u32.

> > Or are you referring that the anti-backtrack protection should happen
> > in every call from get_random_int?
> 
> If you ask for anti-backtracking without qualification, that's the
> goal, since you don't know how long will elapse until the next call.  
> 
> It's like fsync().  There are lots of more limited forms of "keep my
> data safe in case of a crash", but the most basic one is "if we lost
> power the very instant the call returned, the data would be safe."

Yes, it absolutely makes sense.

Thanks a lot,
Hannes

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-23 20:48                 ` Hannes Frederic Sowa
@ 2016-12-23 23:39                   ` George Spelvin
  2016-12-24  0:12                     ` Hannes Frederic Sowa
  0 siblings, 1 reply; 19+ messages in thread
From: George Spelvin @ 2016-12-23 23:39 UTC (permalink / raw)
  To: daniel, hannes, linux
  Cc: ak, davem, David.Laight, ebiggers3, eric.dumazet, Jason,
	kernel-hardening, linux-crypto, linux-kernel, luto, netdev, tom,
	tytso, vegard.nossum

Hannes Frederic Sowa wrote:
> In general this looks good, but bitbuffer needs to be protected from
> concurrent access, thus needing at least some atomic instruction and
> disabling of interrupts for the locking if done outside of
> get_random_long. Thus I liked your previous approach more where you
> simply embed this in the already necessary get_random_long and aliased
> get_random_long as get_random_bits(BITS_PER_LONG) accordingly, IMHO.

It's meant to be part of the same approach, and I didn't include locking
because that's a requirement for *any* solution, and isn't specific
to the part I was trying to illustrate.

(As for re-using the name "get_random_long", that was just so
I didn't have to explain it.  Call it "get_random_long_internal"
if you like.)

Possible locking implementations include:
1) Use the same locking as applies to get_random_long_internal(), or
2) Make bitbuffer a per-CPU variable (note that we currently have 128
   bits of per-CPU state in get_random_int_hash[]), and this is all a
   fast-path to bypass heavier locking in get_random_long_internal().

>> But, I just realized I've been overlooking something glaringly obvious...
>> there's no reason you can't compute multple blocks in advance.
>
> In the current code on the cost of per cpu allocations thus memory.

Yes, but on 4-core machines it's still not much, and 4096-core
behemoths have RAM to burn.

> In the extract_crng case, couldn't we simply use 8 bytes of the 64 byte
> return block to feed it directly back into the state chacha? So we pass
> on 56 bytes into the pcpu buffer, and consume 8 bytes for the next
> state. This would make the window max shorter than the anti
> backtracking protection right now from 300s to 14 get_random_int calls.
> Not sure if this is worth it.

We just finished discussing why 8 bytes isn't enough.  If you only
feed back 8 bytes, an attacker who can do 2^64 computation can find it
(by guessing and computing forward to verify the guess) and recover the
previous state.  You need to feed back at least as much output as your
security targete.  For /dev/urandom's ChaCha20, that's 256 bits.

>> For example, suppose we gave each CPU a small pool to minimize locking.
>> When one runs out and calls the global refill, it could refill *all*
>> of the CPU pools.  (You don't even need locking; there may be a race to
>> determine *which* random numbers the reader sees, but they're equally
>> good either way.)

> Yes, but still disabled interrupts, otherwise the same state could be
> used twice on the same CPU. Uff, I think we have this problem in
> prandom_u32.

There are some locking gotchas, but it is doable lock-free.

Basically, it's a seqlock.  The writer increments it once (to an odd
number) before starting to overwrite the buffer, and a second time (to
an even number) after.  "Before" and "after" mean smp_wmb().

The reader can use this to figure out how much of the data in the buffer
is safely fresh.  The full sequence of checks is a bit intricate,
but straightforward.

I didn't discuss the locking because I'm confident it's solvable,
not because I wasn't aware it has to be solved.

As for prandom_u32(), what's the problem?  Are you worried that
get_cpu_var disables preemption but not interrupts, and so an
ISR might return the same value as process-level code?

First of all, that's not a problem because prandom_u32() doesn't
have security guarantees.  Occasionally returning a dupicate number
is okay.

Second, if you do care, that could be trivially fixed by throwing
a barrier() in the middle of the code.  (Patch appended; S-o-B
if anyone wants it.)

diff --git a/lib/random32.c b/lib/random32.c
index c750216d..6bee4a36 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -55,16 +55,29 @@ static DEFINE_PER_CPU(struct rnd_state, net_rand_state);
  *
  *	This is used for pseudo-randomness with no outside seeding.
  *	For more random results, use prandom_u32().
+ *
+ *	The barrier() is to allow prandom_u32() to be called from interupt
+ *	context without locking.  An interrupt will run to completion,
+ *	updating all four state variables.  The barrier() ensures that
+ *	the interrupted code will compute a different result.  Either it
+ *	will have written s1 and s2 (so the interrupt will start with
+ *	the updated values), or it will use the values of s3 and s4
+ *	updated by the interrupt.
+ *
+ *	(The same logic applies recursively to nested interrupts, trap
+ *	handlers, and NMIs.)
  */
 u32 prandom_u32_state(struct rnd_state *state)
 {
+	register u32 x;
 #define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
-	state->s1 = TAUSWORTHE(state->s1,  6, 13, 4294967294U, 18U);
-	state->s2 = TAUSWORTHE(state->s2,  2, 27, 4294967288U,  2U);
-	state->s3 = TAUSWORTHE(state->s3, 13, 21, 4294967280U,  7U);
-	state->s4 = TAUSWORTHE(state->s4,  3, 12, 4294967168U, 13U);
+	x  = state->s1 = TAUSWORTHE(state->s1,  6, 13,   ~1u, 18);
+	x += state->s2 = TAUSWORTHE(state->s2,  2, 27,   ~7u,  2);
+	barrier();
+	x ^= state->s3 = TAUSWORTHE(state->s3, 13, 21,  ~15u,  7);
+	x += state->s4 = TAUSWORTHE(state->s4,  3, 12, ~127u, 13);

-	return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
+	return x;
 }
 EXPORT_SYMBOL(prandom_u32_state);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-23 23:39                   ` George Spelvin
@ 2016-12-24  0:12                     ` Hannes Frederic Sowa
  2016-12-24  1:17                       ` George Spelvin
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-24  0:12 UTC (permalink / raw)
  To: George Spelvin, daniel
  Cc: ak, davem, David.Laight, ebiggers3, eric.dumazet, Jason,
	kernel-hardening, linux-crypto, linux-kernel, luto, netdev, tom,
	tytso, vegard.nossum

Hi,

On 24.12.2016 00:39, George Spelvin wrote:
> Hannes Frederic Sowa wrote:
>> In general this looks good, but bitbuffer needs to be protected from
>> concurrent access, thus needing at least some atomic instruction and
>> disabling of interrupts for the locking if done outside of
>> get_random_long. Thus I liked your previous approach more where you
>> simply embed this in the already necessary get_random_long and aliased
>> get_random_long as get_random_bits(BITS_PER_LONG) accordingly, IMHO.
> 
> It's meant to be part of the same approach, and I didn't include locking
> because that's a requirement for *any* solution, and isn't specific
> to the part I was trying to illustrate.
> 
> (As for re-using the name "get_random_long", that was just so
> I didn't have to explain it.  Call it "get_random_long_internal"
> if you like.)
> 
> Possible locking implementations include:
> 1) Use the same locking as applies to get_random_long_internal(), or
> 2) Make bitbuffer a per-CPU variable (note that we currently have 128
>    bits of per-CPU state in get_random_int_hash[]), and this is all a
>    fast-path to bypass heavier locking in get_random_long_internal().

I understood that this is definitely a solvable problem.

>>> But, I just realized I've been overlooking something glaringly obvious...
>>> there's no reason you can't compute multple blocks in advance.
>>
>> In the current code on the cost of per cpu allocations thus memory.
> 
> Yes, but on 4-core machines it's still not much, and 4096-core
> behemoths have RAM to burn.
> 
>> In the extract_crng case, couldn't we simply use 8 bytes of the 64 byte
>> return block to feed it directly back into the state chacha? So we pass
>> on 56 bytes into the pcpu buffer, and consume 8 bytes for the next
>> state. This would make the window max shorter than the anti
>> backtracking protection right now from 300s to 14 get_random_int calls.
>> Not sure if this is worth it.
> 
> We just finished discussing why 8 bytes isn't enough.  If you only
> feed back 8 bytes, an attacker who can do 2^64 computation can find it
> (by guessing and computing forward to verify the guess) and recover the
> previous state.  You need to feed back at least as much output as your
> security targete.  For /dev/urandom's ChaCha20, that's 256 bits.

I followed the discussion but it appeared to me that this has the
additional constraint of time until the next reseeding event happenes,
which is 300s (under the assumption that calls to get_random_int happen
regularly, which I expect right now). After that the existing reseeding
mechansim will ensure enough backtracking protection. The number of
bytes can easily be increased here, given that reseeding was shown to be
quite fast already and we produce enough output. But I am not sure if
this is a bit overengineered in the end?

>>> For example, suppose we gave each CPU a small pool to minimize locking.
>>> When one runs out and calls the global refill, it could refill *all*
>>> of the CPU pools.  (You don't even need locking; there may be a race to
>>> determine *which* random numbers the reader sees, but they're equally
>>> good either way.)
> 
>> Yes, but still disabled interrupts, otherwise the same state could be
>> used twice on the same CPU. Uff, I think we have this problem in
>> prandom_u32.
> 
> There are some locking gotchas, but it is doable lock-free.
> 
> Basically, it's a seqlock.  The writer increments it once (to an odd
> number) before starting to overwrite the buffer, and a second time (to
> an even number) after.  "Before" and "after" mean smp_wmb().
> 
> The reader can use this to figure out how much of the data in the buffer
> is safely fresh.  The full sequence of checks is a bit intricate,
> but straightforward.
> 
> I didn't discuss the locking because I'm confident it's solvable,
> not because I wasn't aware it has to be solved.

Also agreed. Given your solution below to prandom_u32, I do think it
might also work without the seqlock now.

> As for prandom_u32(), what's the problem?  Are you worried that
> get_cpu_var disables preemption but not interrupts, and so an
> ISR might return the same value as process-level code?

Yes, exactly those were my thoughts.

> First of all, that's not a problem because prandom_u32() doesn't
> have security guarantees.  Occasionally returning a dupicate number
> is okay.
>
> Second, if you do care, that could be trivially fixed by throwing
> a barrier() in the middle of the code.  (Patch appended; S-o-B
> if anyone wants it.)

I wouldn't have added a disable irqs, but given that I really like your
proposal, I would take it in my todo branch and submit it when net-next
window opens.

> diff --git a/lib/random32.c b/lib/random32.c
> index c750216d..6bee4a36 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> [...]

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-24  0:12                     ` Hannes Frederic Sowa
@ 2016-12-24  1:17                       ` George Spelvin
  2016-12-28  5:23                         ` Hannes Frederic Sowa
  0 siblings, 1 reply; 19+ messages in thread
From: George Spelvin @ 2016-12-24  1:17 UTC (permalink / raw)
  To: daniel, hannes, linux
  Cc: ak, davem, David.Laight, ebiggers3, eric.dumazet, Jason,
	kernel-hardening, linux-crypto, linux-kernel, luto, netdev, tom,
	tytso, vegard.nossum

Hannes Frederic Sowa wrote:
> On 24.12.2016 00:39, George Spelvin wrote:
>> We just finished discussing why 8 bytes isn't enough.  If you only
>> feed back 8 bytes, an attacker who can do 2^64 computation can find it
>> (by guessing and computing forward to verify the guess) and recover the
>> previous state.  You need to feed back at least as much output as your
>> security targete.  For /dev/urandom's ChaCha20, that's 256 bits.

> I followed the discussion but it appeared to me that this has the
> additional constraint of time until the next reseeding event happenes,
> which is 300s (under the assumption that calls to get_random_int happen
> regularly, which I expect right now). After that the existing reseeding
> mechansim will ensure enough backtracking protection. The number of
> bytes can easily be increased here, given that reseeding was shown to be
> quite fast already and we produce enough output. But I am not sure if
> this is a bit overengineered in the end?

I'm not following your description of how the time-based and call-based
mechanisms interact, but for any mix-back, you should either do enough
or none at all.  (Also called "catastrophic reseeding".)

For example, two mix-backs of 64 bits gives you 65 bit security, not 128.
(Because each mixback can be guessed and verified separately.)

> Also agreed. Given your solution below to prandom_u32, I do think it
> might also work without the seqlock now.

It's not technically a seqlock; in particular the reader doesn't
spin.  But the write side, and general logic is so similar it's
a good mental model.

Basically, assume a 64-byte buffer.  The reader has gone through
32 bytes of it, and has 32 left, and when he reads another 8 bytes,
has to distinguish three cases:

1) No update; we read the old bytes and there are now 32 - 24 bytes left.
2) Update completed while we weren't looking.  There are now new
   bytes in the buffer, and we took 8 leaving 64 - 8 = 56.
3) Update in progress at the time of the read.  We don't know if we
   are seeing old bytes or new bytes, so we have to assume the worst
   and not proceeed unless 32 >= 8, but assume at the end there are
   64 - 8 = 56 new bytes left.

> I wouldn't have added a disable irqs, but given that I really like your
> proposal, I would take it in my todo branch and submit it when net-next
> window opens.

If you want that, I have a pile of patches to prandom I really
should push upstream.  Shall I refresh them and send them to you?


commit 4cf1b3d9f4fbccc29ffc2fe4ca4ff52ea77253f1
Author: George Spelvin <linux@horizon.com>
Date:   Mon Aug 31 00:05:00 2015 -0400

    net: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    The net/802 code was already efficient, but prandom_u32_max() is simpler.
    
    In net/batman-adv/bat_iv_ogm.c, batadv_iv_ogm_fwd_send_time() got changed
    from picking a random number of milliseconds and converting to jiffies to
    picking a random number of jiffies, since the number of milliseconds (and
    thus the conversion to jiffies) is a compile-time constant.  The equivalent
    code in batadv_iv_ogm_emit_send_time was not changed, because the number
    of milliseconds is variable.
    
    In net/ipv6/ip6_flowlabel.c, ip6_flowlabel had htonl(prandom_u32()),
    which is silly.  Just cast to __be32 without permuting the bits.
    
    net/sched/sch_netem.c got adjusted to only need one call to prandom_u32
    instead of 2.  (Assuming skb_headlen can't exceed 512 MiB, which is
    hopefully safe for some time yet.)
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 9c8fb80e1fd2be42c35cab1af27187d600fd85e3
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 15:20:47 2014 -0400

    mm/swapfile.c: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 2743eb01e5c5958fd88ae78d19c5fea772d4b117
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 15:19:53 2014 -0400

    lib: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 6a5e91bf395060a3351bfe5efc40ac20ffba2c1b
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 15:18:50 2014 -0400

    fs/xfs: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range".
    
    Also changed the last argument of xfs_error_test() from "unsigned long"
    to "unsigned", since the code never did support values > 2^32, and
    the largest value ever passed is 100.
    
    The code could be improved even further by passing in 2^32/rf rather
    than rf, but I'll leave that to some XFS developers.
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 6f6d485d9179ca6ec4e30caa06ade0e0c6931810
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 15:00:17 2014 -0400

    fs/ubifs: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range".
    
    In fs/ubifs/debug.c, the chance() function got rewritten entirely
    to take advantage of the fact its arguments are always constant.
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 0b6bf2c874bbbcfa74f6be0413c772b3ac134d12
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:52:17 2014 -0400

    fs/extX: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range".
    
    In ext4/ialloc.c:436, there's one more chance to do that, but
    the modulo is required to keep the deterministic option consistent,
    so I left it alone.
    
    Switching to "parent_group = ((u64)grp * ngroups) >> 32;" would be more
    efficient, but would change user-visible behavior.
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit e79e0e8b491bc976c0b4e1b2070eccdf55b34f43
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:47:15 2014 -0400

    fs/ceph: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit fc628326d8cf4abe364ea01259bc392c0bbaf5a0
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:46:29 2014 -0400

    drivers/scsi/fcoe/fcoe_ctlr.c: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 4810d67dd2edf08e7801ef47550d46b7397fe2dc
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:45:55 2014 -0400

    drivers/ntb/ntb_hw.c: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit f4a806abbc0785e8f0363e02fac613246eed9e34
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:45:27 2014 -0400

    drivers/net: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    In some cases (like ethernet/broadcom/cnic.c and drivers/net/hamradio),
    the range is a compile-time constant power of 2, so the code doesn't
    get any better, but I'm trying to do a full sweep.
    
    In drivers/net/wireless/brcm80211/brcmfmac/p2p.c, a timeout is selected from
    100, 200 or 300 ms.  It would be easy enough to make the granularity finer,
    but I assume the existing code is that way for a reason.
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 603e0c311fac09d633ff6c0fd9242108f1c52ead
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:55:09 2014 -0400

    drivers/mtd: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range".
    
    And rewrote the 1-in-N code in drivers/mtd/ubi/debug.h to avoid
    even more arithmetic.
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit e0657cc865e8e02768906935b8e8bf63af58aa46
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:52:13 2014 -0400

    drivers/mmc/core: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 017ee6841ec8d416093fc1f18bdd3df0dfc6aacc
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:51:33 2014 -0400

    drivers/lguest/page_tables.c: Use prandom_u32_max()
    
    This doesn't actually change the code because the array size is
    a power of 2 (it's a 4-element array).  But I'm on a roll.
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 87556165f46eb42c756bcb94e062c3fbd272a4bf
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:49:41 2014 -0400

    drivers/infiniband: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 1eafe1d429f442218810e8c604d4e7c466414cf3
Author: George Spelvin <linux@horizon.com>
Date:   Sun Aug 30 23:42:41 2015 -0400

    block/blk-mq-tag.c: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit f03ee59a63098d244d5b8932fc68c9fc3e2bb222
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:46:52 2014 -0400

    arch/x86: Use prandom_u32_max()
    
    It's slightly more efficient than "prandom_u32() % range"
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit feafd3a3fb09924ea633d677a7ab8a25a817f39d
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:44:49 2014 -0400

    lib/random32.c: Remove redundant U suffixes on integers
    
    Get rid of a few of the extraneous U suffixes on ordinary integers.
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit f14328d248e59c862478633479528c9cb8554d7a
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 12:40:19 2014 -0400

    lib/random32.c: Randomize timeout to the millisecond, not the second.
    
    If you're going to bother randomizing it, do it right.
    And use prandom_u32_max(), which is designed for the job, rather
    than "% 40".
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 143342006adfff718afedf58f639b72337d7c816
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 12:51:26 2014 -0400

    lib/random32.c: Make prandom_u32_max efficient for powers of 2
    
    The multiply-and-shift is efficient in the general case, but slower
    than a simple bitwise AND if the range is a power of 2.  Make the function
    handle this case so callers don't have to worry about micro-optimizing it.
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 99864db5cb023e6b09d71cde4997a5f6cafb5cca
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 12:02:17 2014 -0400

    lib/random32.c: Use <asm/unaligned.h> instead of hand-rolling it
    
    The functions exist for a reason; the manual byte-at-a-time code
    is unnecessarily slow (and bloated).
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 4ecd45f6eadd4d171782dc6b451ed1040e47d419
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 11:55:59 2014 -0400

    lib/random32.c: Debloat non-speed-critical code
    
    Unrolling code in rarely-used code paths is just silly.  There are
    two places that static calls to prandom_u32_state() can be removed:
    
    1) prandom_warmup() calls prandom_u32_state 10 times.
    2) prandom_state_selftest() can avoid one call and simplify the
       loop logic by repeating an assignment to a local variable
       (which probably adds zero code anyway).
    
    Signed-off-by: George Spelvin <linux@horizon.com>

commit 8765cff45da1d96e4310d50dd49231790c49b612
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 11:52:34 2014 -0400

    lib/random32.c: Mark self-test data as __initconst
    
    So it can be thrown away along with the code that uses it.
    
    Signed-off-by: George Spelvin <linux@horizon.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-24  1:17                       ` George Spelvin
@ 2016-12-28  5:23                         ` Hannes Frederic Sowa
  2016-12-28 10:04                           ` George Spelvin
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-28  5:23 UTC (permalink / raw)
  To: George Spelvin, daniel
  Cc: ak, davem, David.Laight, ebiggers3, eric.dumazet, Jason,
	kernel-hardening, linux-crypto, linux-kernel, luto, netdev, tom,
	tytso, vegard.nossum

Hello,

On Fri, 2016-12-23 at 20:17 -0500, George Spelvin wrote:
> Hannes Frederic Sowa wrote:
> > On 24.12.2016 00:39, George Spelvin wrote:
> > > We just finished discussing why 8 bytes isn't enough.  If you only
> > > feed back 8 bytes, an attacker who can do 2^64 computation can find it
> > > (by guessing and computing forward to verify the guess) and recover the
> > > previous state.  You need to feed back at least as much output as your
> > > security targete.  For /dev/urandom's ChaCha20, that's 256 bits.
> > I followed the discussion but it appeared to me that this has the
> > additional constraint of time until the next reseeding event happenes,
> > which is 300s (under the assumption that calls to get_random_int happen
> > regularly, which I expect right now). After that the existing reseeding
> > mechansim will ensure enough backtracking protection. The number of
> > bytes can easily be increased here, given that reseeding was shown to be
> > quite fast already and we produce enough output. But I am not sure if
> > this is a bit overengineered in the end?
> 
> I'm not following your description of how the time-based and call-based
> mechanisms interact, but for any mix-back, you should either do enough
> or none at all.  (Also called "catastrophic reseeding".)

We call extract_crng when we run out of batched entropy and reseed. How
often we call down to extract_crng depends on how much entropy we
extracted by calls to get_random_int/long, so the number of calls into
those functions matter.

In extract_crng we have a timer which reseeds every 300s the CPRNG and
either uses completely new entropy from the CRNG or calls down into the
CPRNG while also doing backtracing protection (which feeds chacha's
block size / 2 back into chacha, if I read the code correctly, thus
1024 bits, which should be enough).

> For example, two mix-backs of 64 bits gives you 65 bit security, not 128.
> (Because each mixback can be guessed and verified separately.)

Exactly, but the full reseed after running out of entropy is strong
enough to not be defeated by your argumentation. Neither the reseed
from the CRNG.

> > Also agreed. Given your solution below to prandom_u32, I do think it
> > might also work without the seqlock now.
> 
> It's not technically a seqlock; in particular the reader doesn't
> spin.  But the write side, and general logic is so similar it's
> a good mental model.
> 
> Basically, assume a 64-byte buffer.  The reader has gone through
> 32 bytes of it, and has 32 left, and when he reads another 8 bytes,
> has to distinguish three cases:
> 
> 1) No update; we read the old bytes and there are now 32 - 24 bytes left.
> 2) Update completed while we weren't looking.  There are now new
>    bytes in the buffer, and we took 8 leaving 64 - 8 = 56.
> 3) Update in progress at the time of the read.  We don't know if we
>    are seeing old bytes or new bytes, so we have to assume the worst
>    and not proceeed unless 32 >= 8, but assume at the end there are
>    64 - 8 = 56 new bytes left.
> 
> > I wouldn't have added a disable irqs, but given that I really like your
> > proposal, I would take it in my todo branch and submit it when net-next
> > window opens.
> 
> If you want that, I have a pile of patches to prandom I really
> should push upstream.  Shall I refresh them and send them to you?

I would like to have a look at them in the new year, certainly! I can
also take care about the core prandom patches, but don't know if I have
time to submit the others to the different subsystems.

Maybe, if David would be okay with that, we can submit all patches
through his tree, as he is also the dedicated maintainer for prandom.

> [... patch descriptions ...]

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
  2016-12-28  5:23                         ` Hannes Frederic Sowa
@ 2016-12-28 10:04                           ` George Spelvin
  0 siblings, 0 replies; 19+ messages in thread
From: George Spelvin @ 2016-12-28 10:04 UTC (permalink / raw)
  To: daniel, hannes, linux
  Cc: ak, davem, David.Laight, ebiggers3, eric.dumazet, Jason,
	kernel-hardening, linux-crypto, linux-kernel, luto, netdev, tom,
	tytso, vegard.nossum

Hannes Frederic Sowa wrote:
> We call extract_crng when we run out of batched entropy and reseed. How
> often we call down to extract_crng depends on how much entropy we
> extracted by calls to get_random_int/long, so the number of calls into
> those functions matter.
> 
> In extract_crng we have a timer which reseeds every 300s the CPRNG and
> either uses completely new entropy from the CRNG or calls down into the
> CPRNG while also doing backtracing protection (which feeds chacha's
> block size / 2 back into chacha, if I read the code correctly, thus
> 1024 bits, which should be enough).

In the current code, _extract_crng checks to see if more than 300 s
have elapsed since last time it was reseeded, and if so, reseeds with
fresh entropy.

In addition, on every read (or get_random_bytes), if the request leaves
enough ranfom bits in the last ChaCha block, it feeds back 256 bits
(the ChaCha block size is 16*32 = 512 bits) for anti-backtracking.

If the last read happened to not fit under that limit (size % 512 >
256), *and* there are no calls for RNG output for a long time, there is
no  upper limit to how long the old ChaCha key can hang around.

> On Fri, 2016-12-23 at 20:17 -0500, George Spelvin wrote:
>> For example, two mix-backs of 64 bits gives you 65 bit security, not 128.
>> (Because each mixback can be guessed and verified separately.)

> Exactly, but the full reseed after running out of entropy is strong
> enough to not be defeated by your argumentation. Neither the reseed
> from the CRNG.

Yes, I was just reacting to your original statement:

>>>>> couldn't we simply use 8 bytes of the 64 byte
>>>>> return block to feed it directly back into the state chacha?

It's not the idea that's bad, just the proposed quantity.


>> If you want that, I have a pile of patches to prandom I really
>> should push upstream.  Shall I refresh them and send them to you?

> I would like to have a look at them in the new year, certainly! I can
> also take care about the core prandom patches, but don't know if I have
> time to submit the others to the different subsystems.
>
> Maybe, if David would be okay with that, we can submit all patches
> through his tree, as he is also the dedicated maintainer for prandom.

Amazing, thank you very much!  They're just minor cleanups, nothing
too exciting.  I'll put it in the queue to make sure they're up to
date.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
@ 2016-12-22  2:40 Jason A. Donenfeld
  0 siblings, 0 replies; 19+ messages in thread
From: Jason A. Donenfeld @ 2016-12-22  2:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: George Spelvin, Ted Ts'o, Andi Kleen, David S. Miller,
	David Laight, D. J. Bernstein, Eric Biggers, Eric Dumazet,
	Hannes Frederic Sowa, Jean-Philippe Aumasson, kernel-hardening,
	Linux Crypto Mailing List, linux-kernel, Network Development,
	Tom Herbert, Linus Torvalds, Vegard Nossum

> On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin
>> After some thinking, I still like the "state-preserving" construct
>> that's equivalent to the current MD5 code.  Yes, we could just do
>> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
>> preserve a bit more.
>>
>> It requires library support from the SipHash code to return the full
>> SipHash state, but I hope that's a fair thing to ask for.

This is not a good idea. If I understand correctly, the idea here is
to just keep around SipHash's internal state variables, and chain them
over to the next call, sort of like how md5_transform with the current
code works on the same scratch space. There has been no security
analysis in the literature on this use of the primitive, and I have no
confidence that this is a secure use of the function. Unless somebody
can point me toward a paper I missed or a comment from a real
cryptographer about the specifics of SipHash, I think I'm right to
admonish against this dangerous road.

Let's talk about constructions. And let's only decide on a
construction that we're actually equipped to analyze. Let's definitely
not talk about making our own primitives, or retrofitting nice
primitive's internals into our own Frankenstein.

Alternatively, if I'm wrong, please send me an eprint/arXiv link to a
paper that discusses this use of SipHash.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-12-28 10:04 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-22  2:07 George's crazy full state idea (Re: HalfSipHash Acceptable Usage) Andy Lutomirski
2016-12-22  5:01 ` George Spelvin
2016-12-22  5:42   ` Andy Lutomirski
2016-12-22  8:02     ` George Spelvin
2016-12-22 16:09 ` Andy Lutomirski
2016-12-22 19:24   ` George Spelvin
2016-12-22 19:32     ` Andy Lutomirski
2016-12-22 21:11       ` George Spelvin
2016-12-22 21:38         ` Hannes Frederic Sowa
2016-12-23  0:07           ` George Spelvin
2016-12-23 12:05             ` Hannes Frederic Sowa
2016-12-23 18:26               ` George Spelvin
2016-12-23 20:48                 ` Hannes Frederic Sowa
2016-12-23 23:39                   ` George Spelvin
2016-12-24  0:12                     ` Hannes Frederic Sowa
2016-12-24  1:17                       ` George Spelvin
2016-12-28  5:23                         ` Hannes Frederic Sowa
2016-12-28 10:04                           ` George Spelvin
2016-12-22  2:40 Jason A. Donenfeld

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).