From: "George Spelvin" <linux@horizon.com>
To: linux@horizon.com, tytso@mit.edu
Cc: ahferroin7@gmail.com, andi@firstfloor.org, jepler@unpythonic.net,
	linux-kernel@vger.kernel.org, linux@rasmusvillemoes.dk
Subject: Re: Updated scalable urandom patchkit
Date: 12 Oct 2015 16:30:59 -0400	[thread overview]
Message-ID: <20151012203059.26372.qmail@ns.horizon.com> (raw)
In-Reply-To: <20151012135434.GA3548@thunk.org>

Theodore Ts'o wrote:
> On Mon, Oct 12, 2015 at 03:49:08AM -0400, George Spelvin wrote:
>> Words 5..9 never got mixed at all.

> Hmm, good point.  It *shouldn't* matter if the hash is secure, but the
> potentially uneven mixing would be a concern.

I don't quite follow the "shouldn't matter" part.  The problem is that
insufficient input mixing can cause collisions, which are a loss of entropy.
The output hash (the secure one) can't fix that.

The whole reason for the 5-tap LFSR is to ensure that each input word
affects a significant number of *other* words before it is overwritten
by fresh input.
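
For reference, the core of the existing mixing loop (paraphrased from
_mix_pool_bytes() in drivers/char/random.c of this era; the tap setup
and twist_table are elided) is roughly:

	while (nbytes--) {
		w = rol32(*bytes++, input_rotate);
		i = (i - 1) & wordmask;

		/* XOR in the five taps */
		w ^= r->pool[i];
		w ^= r->pool[(i + tap1) & wordmask];
		w ^= r->pool[(i + tap2) & wordmask];
		w ^= r->pool[(i + tap3) & wordmask];
		w ^= r->pool[(i + tap4) & wordmask];
		w ^= r->pool[(i + tap5) & wordmask];

		/* Twist and store */
		r->pool[i] = (w >> 3) ^ twist_table[w & 7];

		input_rotate = (input_rotate + (i ? 7 : 14)) & 31;
	}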

I just realized a worse problem: suppose CPU 1 has an offset of -1.
It will proceed to *undo* the mixing just done by CPU 0.  I don't just
mean that it'll store its input over top, but it will also undo the
LFSR XOR.  Ack!

(Maybe I'll retract that "bow before the master" remark. :-))

> They used to go through the input pool, and when we added the per-CPU
> fast pool, we didn't revisit the question of whether the input_rotate
> was still needed.  So yes, at this point it might be worthwhile to
> remove the input_rotate part of the mixing scheme.

The fast pools actually just extended an issue which the three-pool
design created: when spilling from one pool to another, the whole
"expensive output, cheap input" optimization makes no sense.

For raw sampled data, where interrupt overhead matters, the cheap input
hash obviously makes sense.  But when it costs 8 sha_transform() calls
to generate 10 bytes, the design of the subsequent input hash needs
some rethinking.

It may make sense to have a separate path for "known high-quality"
input that puts more effort into diffusion.

Idea 1: Given that we do have an entropy estimate with every write to a
        pool, what if we did an input hash that did work proportional
        to the *entropy* of the input?  (With some lower bound for
        zero-entropy writes like plain /dev/random writes and mixback.)

Idea 2: For input mixing not in interrupt context, which includes all
        input to the output pools, it would be smaller and simpler code
        to just XOR into the pool and defer mixing, doing one whole-pool
        mixing pass once the XOR position reaches the end of the pool.

        Concurrent writers could do a single atomic_add_return to
        find an add position, and if that extended past the end of the
        buffer, they'd obtain the lock, stir the pool, decrement the add
        position (which would permit other concurrent writers) and then
        do their XORs.

        There's still the risk of races between delayed non-locking
        inputters and stirrers, but maybe if non-locking inputters
        were limited to non-critical mixbacks, that would be manageable.
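
To make Idea 2 concrete, here's a rough sketch of the claim-and-stir
protocol (hypothetical names and sizes, not a drop-in patch; stir_pool()
stands in for one whole-pool mixing pass):

#include <linux/atomic.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define POOL_WORDS 128				/* illustrative size */

static u32 mix_pool[POOL_WORDS];
static atomic_t add_pos = ATOMIC_INIT(0);
static DEFINE_SPINLOCK(stir_lock);

static void stir_pool(void)
{
	/* One whole-pool mixing pass (omitted in this sketch). */
}

static void pool_xor_in(const u32 *in, int nwords)
{
	int end = atomic_add_return(nwords, &add_pos);
	int pos = end - nwords;

	if (end > POOL_WORDS) {
		spin_lock(&stir_lock);
		/* Re-check: another writer may have stirred already. */
		if (atomic_read(&add_pos) > POOL_WORDS) {
			stir_pool();
			/* Move the position back; unblocks other writers. */
			atomic_sub(POOL_WORDS, &add_pos);
		}
		spin_unlock(&stir_lock);
	}
	while (nwords--)
		mix_pool[pos++ & (POOL_WORDS - 1)] ^= *in++;
}

The delayed-XOR race mentioned above is still there; this only shows the
position-claiming part.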

>> However, I *also* have to use atomic_add to write r->pool[i],
>> which is indeed expensive.  PoC diff attached, but I'm not sure if
>> you'll like it.

> Yeah, that was the part that worried me.  Whether we use an atomic add
> or an atomic xor, it's going to be very expensive. Using an atomic add
> means that it will work with arm, but there's an explicit loop with
> arm, so on an arm system, it effectively turns into a more efficient
> spinlock, except you're spinning on each byte, so instead of bouncing
> cache lines with the locking granularity of the extract, we would be
> bouncing cache lines for every single byte of the mixback operation.

Yes.  The current ticket spinlocks are little more than a single
atomic_add_return, and the dangers of putting multiple locks in the same
cache line are well known.
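
(For the peanut gallery, a ticket lock is essentially this toy built on
that one primitive; the real kernel implementations are arch-specific:)

#include <linux/atomic.h>
#include <asm/processor.h>	/* cpu_relax() */

struct ticket_lock {
	atomic_t next;		/* next ticket to hand out */
	atomic_t owner;		/* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *l)
{
	int me = atomic_add_return(1, &l->next) - 1;

	while (atomic_read(&l->owner) != me)
		cpu_relax();
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_inc(&l->owner);
}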

> Segregating abusers solves both problems.  If we do this then we don't
> need to drop the locks from the nonblocking pool, which solves the
> security problem.

Er, sort of.  I still think my points were valid, but they're
about a particular optimization suggestion you had.  By avoiding
the need for the optimization, the entire issue is mooted.

I still say a separate firewall is a bad way to *solve* security problems,
but it can avoid performance pressure to *create* them in the first place.

>> I can imagine that one /dev/urandom abuser can monopolize the lock and
>> cause problems for a shell script.

> Well, that's my point.  If we restrict the nonblocking pool to kernel
> users, then a /dev/urandom abuser won't be able to cause problems for
> the shell script.

And I quite agree with that.  Sorry, sometimes a discussion goes in
multiple directions and it's hard to keep track of the point of any
particular sentence.

I get fixated on one point and interpret comments about something else
as if they pertained to the issue I'm thinking about.

> The trylock() scheme is a bit dangerous, since we would need to make
> sure that *all* times when other callers take the lock, the contents
> of the pool would have changed.  Or else, a subsequent call to extract
> entropy from that CPU will return the same value.  And even if that
> were true today (or we could make it be true) if that ever changed, it
> would break the code.  So it makes it pretty fragile.

Sorry, yes, I'm perfectly aware of this; I just glossed over it.
Each extraction has to have a unique nonce; that's non-negotiable, but
not hard to achieve.  We'd need to add a per-cpu counter to go along
with the CPU id to make the extraction nonce.

Doing this would be valuable *anyway*, because it would let us
reduce the mixback to once per read() rather than once per 10 bytes.

(Also, what would you say to reducing the arch_get_random_long() seed
material to 16 bytes, and reserving hash.w[5] for the nonce?  That would
ensure that even a malicious arch_get_random_long() couldn't force a
collision.)
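
The nonce itself could be as simple as this sketch (hypothetical names;
the caller would need preemption disabled so the CPU id and counter stay
paired):

#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/types.h>

static DEFINE_PER_CPU(u32, extract_count);

/* Unique per extraction: CPU id in the high half, per-CPU count below. */
static u64 extraction_nonce(void)
{
	u64 nonce = (u64)smp_processor_id() << 32;

	return nonce | this_cpu_inc_return(extract_count);
}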

> The ChaCha 20 state block is 64 bytes, and then I was assuming two 4
> byte control variables, for the number of cranks and the time since
> last reseed.  We could actually store those control variables into the
> IV portion of the ChaCha state structure, which would make it be 64
> bytes --- or we could generate a new state structure each time and not
> bother storing the 16 bytes of constants in the state structure, which
> trades a tiny amount of time and stack space for memory.  These are
> all implementation details, though....

You have to copy the state *anyway* because you don't want it overwritten
by the ChaCha output, so there's really no point storing the constants.
(Also, ChaCha has a simpler input block structure than Salsa20; the
constants are all adjacent.)

And ChaCha includes a stream position counter itself, so you don't need
a separate one.
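
For reference, the input block in DJB's original parameterization (the
IETF variant splits words 12..15 differently):

	/*
	 * ChaCha input block: sixteen 32-bit words.
	 *
	 *  x[0..3]   constants "expa" "nd 3" "2-by" "te k" (adjacent,
	 *            unlike Salsa20's diagonal placement)
	 *  x[4..11]  256-bit key
	 *  x[12..13] 64-bit block counter (the stream position)
	 *  x[14..15] 64-bit nonce
	 */
	u32 x[16];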

(Note: one problem with ChaCha specifically is that it needs 16x32 bits
of registers, and Arm32 doesn't quite have enough.  We may want to provide
an arch CPRNG hook so people can plug in other algorithms with good
platform support, like x86 AES instructions.)

>> But why bother with even that much?  If you're willing to discard
>> antibacktracking, just have *one* key for the kernel, reseeded more
>> often, and a per-thread nonce and stream position.

> Well, we're not completely discarding backtracking protection.  We are
> discarding backtracking protection between successive reads from a single
> process, and even there we would be reseeding every five minutes (and
> this could be tuned), so there is *some* anti-backtracking protection.

Yeah, I realized this a few hours after sending that e-mail.
Anti-backtracking between processes is more important than within a
single process.

> I'm going to make the claim that if a particular process is reading
> from /dev/urandom in a tight loop, you probably don't *need*
> backtracking.  You're probably doing something silly such as using
> /dev/urandom to overwrite a disk, or some such.

I fully agree.

> On the flip side, the time when you might care about anti-backtracking
> protection is say, when you're generating a new session key for a new
> connection.  So perhaps one approach is to use some kind of
> ratelimiter algorithm so that if you're using /dev/urandom "carefully"
> (say, no more than ten times a second), we'll use the non-blocking
> pool.  But once a process exceeds that limit, it will switch over to
> the CRNG, and then the only performance that abuser process will hurt
> is its own (because it would be even faster if they were running
> something like arc4random in userspace).

I consider the size of the read at least as much as the rate.
As mentioned in the random(4) man page, I don't think more than 256
bits of computational security even exists.  All of Bitcoin mining
combined is (per http://bitcoincharts.com/) doing 440051 Thash/s, which
works out to about 2^80 hashes (80 bits' worth) per month.
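
(Roughly: 440051 Thash/s ≈ 4.4e17 hashes/s; times ~2.6e6 seconds in a
month gives ~1.1e24 ≈ 2^80.)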

We can extrapolate computational feasibility beyond what's currently
been done.  I think 128-bit claims are defensible.  But once we get beyond
the square of anything ever attempted (170 bits or so), nobody making
security claims really knows what they're talking about.  256 bits is
the *cube*.  That's just crazy talk.

This in turn means that someone asking for more than 256 bits of seed
material is doing something stupid.  (In the more diplomatic language of
the man page, "should be taken as a sign that its cryptography is _not_
skillfully implemented."  I didn't want to totally condemn quick one-time
hacks.)

>> But without interfering with legitimate users of /dev/urandom at all,
>> I'd be quite willing to say that as soon as you read more than 32 bytes
>> (current request plus recent history) from /dev/urandom, you get a
>> private ChaCha20 structure to read from.

> Yeah, that's what I was thinking, although I was willing to be a bit
> more generous about when we switch them over to using the ChaCha pool.
> I think doing this automatically is going to be much better than
> creating a new /dev/frandom, since it's going to be hard to get
> applications to switch over.  We're still several years away from
> switching people over to the new getrandom(2) system call, after all.

I worry this might attract the same sort of shitstorm as the SHA-3
proposal to tweak Keccak's preimage resistance (which, for the benefit of
the peanut gallery, was solid cryptographic engineering that belatedly
fixed a stupidity in the original SHA-3 contest rules), but if you're
up for it, I'm not going to object.

I'm not too hung up on the thresholds, as long as we catch bulk
readers on the first read, rather than wait until *after* they've
drained /dev/urandom.  I'd prefer that a single 64-byte read request
move us to mitigation mode, but I could be talked into 128.  (Actually,
it would probably make more sense to have the threshold be a multiple
of EXTRACT_SIZE.)

History is probably best kept in a leaky bucket of some sort: a current
level plus the jiffies of the last update, and on each read we subtract
the allowed maximum rate (times the elapsed time) from the level.

(Or would an exponential average be better?  That would make the maximum
read rate more strongly dependent on the evenness of the spacing.)

Then add that (scaled somehow?) to the current read size and compare
with the threshold.
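
A minimal sketch of that leaky bucket (hypothetical names and constants;
the scaling question above is ignored):

#include <linux/jiffies.h>
#include <linux/types.h>

struct urandom_bucket {
	unsigned long last;	/* jiffies at last update */
	u32 level;		/* bytes still "unforgiven" */
};

#define DRAIN_PER_SEC	64	/* allowed steady-state rate, bytes/sec */
#define ABUSE_THRESHOLD	256	/* above this, switch to a private CRNG */

static bool urandom_is_abuser(struct urandom_bucket *b, size_t nbytes)
{
	unsigned long now = jiffies;
	u32 drained = (u32)((now - b->last) / HZ) * DRAIN_PER_SEC;

	b->last = now;
	b->level = b->level > drained ? b->level - drained : 0;
	b->level += nbytes;

	return b->level > ABUSE_THRESHOLD;
}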

The same variables can be used (with different parameters) to decide if
we want to get out of mitigation mode.  The one thing to watch out for
is that "cat </dev/urandom >/dev/sdX" may have some huge pauses once
the buffer cache fills.  We don't want to forgive after too small a
fixed interval.

Finally, we have the issue of where to attach this rate-limiting structure
and crypto context.  My idea was to use the struct file.  But now that
we have getrandom(2), it's harder.  mm, task_struct, signal_struct, what?

(Post-finally, do we want this feature to be configurable under
CONFIG_EMBEDDED?  I know keeping the /dev/random code size small is
a specific design goal, and abuse mitigation is optional.)
