Re: Updated scalable urandom patchkit

From: "Theodore Ts'o" <tytso@mit.edu>
To: George Spelvin <linux@horizon.com>
Cc: ahferroin7@gmail.com, andi@firstfloor.org, jepler@unpythonic.net,
	linux-kernel@vger.kernel.org, linux@rasmusvillemoes.dk
Subject: Re: Updated scalable urandom patchkit
Date: Mon, 12 Oct 2015 00:05:00 -0400
Message-ID: <20151012040500.GD5341@thunk.org>
In-Reply-To: <20151012001601.18155.qmail@ns.horizon.com>

On Sun, Oct 11, 2015 at 08:16:01PM -0400, George Spelvin wrote:
> 
> I'm not thrilled with incrementing the pointer from i to len, but mixing
> at positions i+k to i+k+len.  The whole LFSR scheme relies on a regular
> pass structure.

That part I'm not worried about.  We still have a regular pass
structure --- since for each CPU, we are still iterating over the pool
in a regular fashion.

> How about this instead: drop the hashed offset, and instead let each
> writer do an atomic_add_return on the index, then iterate over the
> "reserved" slots.  Concurrent additions will at least do non-overlapping
> writes until the number equals the pool size.

One atomic operation per byte that we're mixing in?  That's quite
expensive.
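
For concreteness, here is a minimal userspace sketch of one reading of
the quoted proposal, in which a writer reserves a whole run of slots
with a single fetch-and-add rather than one per byte.  The names
(mix_pool, pool_add, POOL_BYTES) are hypothetical, C11 atomics stand in
for the kernel's atomic_add_return, and the plain XOR replaces the real
LFSR mixing.  The data writes are not atomic, so concurrent writers are
only safe while their reserved ranges do not overlap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BYTES 128u   /* hypothetical pool size; a power of two */

struct mix_pool {
    atomic_uint next;             /* free-running shared write cursor */
    uint8_t     data[POOL_BYTES];
};

/*
 * Reserve len consecutive slots with one atomic operation, then XOR the
 * input into them.  Returns the (unwrapped) start of the reservation.
 */
static unsigned pool_add(struct mix_pool *p, const uint8_t *in, size_t len)
{
    assert(len <= POOL_BYTES);    /* one call never laps the pool */

    unsigned start = atomic_fetch_add(&p->next, (unsigned)len);

    for (size_t k = 0; k < len; k++)
        p->data[(start + k) % POOL_BYTES] ^= in[k];

    return start;
}
```

In this form the cost is one atomic per call to pool_add(), so how
expensive it is depends on how many bytes each call mixes in.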

> Personally, I hate the input_rotate.  It's not that it's harmful, just
> that it doesn't do much good compared to the cost; for the number of cycles
> and context space required there are more efficient mixing operations.

The input_rotate is useful for the input pool, for things like
keyboard code so we can make sure they enter the pool at more points
than just the low bits of each word.  But for the output pools, it
really doesn't make any sense.  And we are getting to the point where
we may end up having different mixing algorithms for the nonblocking
pool, and in that case I have absolutely no trouble dropping the
input_rotate part of the mixing algorithm for the non-blocking pool.
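
To illustrate what the rotation buys us, here is a deliberately
simplified sketch.  It keeps only the rotate-then-XOR step and leaves
out the taps and the twist table entirely; the struct, the pool size
and the helper names are assumptions made for the example, not the
actual driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_WORDS 32u   /* hypothetical pool size in 32-bit words */

struct rot_pool {
    uint32_t words[POOL_WORDS];
    unsigned pos;            /* next word to mix into */
    unsigned input_rotate;   /* current rotation, 0..31 */
};

/* Rotate a 32-bit word left by s bits (s is reduced mod 32). */
static uint32_t rol32(uint32_t w, unsigned s)
{
    s &= 31;
    return s ? (w << s) | (w >> (32 - s)) : w;
}

/*
 * Mix bytes in one per word.  Each byte is rotated before the XOR, so
 * successive bytes land at different bit positions instead of always
 * hitting the low 8 bits of a word.
 */
static void mix_bytes(struct rot_pool *p, const uint8_t *in, size_t len)
{
    assert(p->input_rotate < 32);

    while (len--) {
        p->words[p->pos] ^= rol32(*in++, p->input_rotate);
        p->pos = (p->pos + 1) % POOL_WORDS;
        /* Step by 7 bits per byte, and by 14 when the position wraps,
         * so each pass over the pool starts at a different offset. */
        p->input_rotate = (p->input_rotate + (p->pos ? 7 : 14)) & 31;
    }
}
```

Without the rol32() call, a source that only ever supplies small
values would touch nothing but the low byte of each word.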

> Or a small machine with a couple of concurrent /dev/urandom abusers.
> Remember, it's globally readable, so it has to be resistant to malicious
> abuse.

One of the other ways we could solve this is by hanging a struct off
the task structure, and if we detect that we have a /dev/urandom
abuser, we give that process its own urandom pool, so any pissing that
it does will be in its own pool.  (So to speak.)

Most processes don't actually use that much randomness, and I'm not
that worried about in-kernel users of the nonblocking pool.  Even with
the most exec-heavy workload, setting up a new exec image is
heavyweight enough that you're really not going to be contending on
the lock.  I also have trouble imagining someone spending $$$$ on a
system with 1K cpu cores and wasting all of their CPU power by running
shell scripts that fork and exec a lot.  :-)

The reality is that most processes don't use /dev/urandom or
getrandom(2) at all, and many of those that do use it only sparingly.
So maybe the right answer is to do something simple which takes care
of the abusers.

> You can add rather than XOR, and we have atomic add primitives.

Atomic-add primitives aren't portable either.  The representation
isn't guaranteed to be 32 bits, and on some platforms an atomic int is
only 24 bits wide (the top 8 bits being used for locking purposes).

> There are several possible solutions that don't need separate pools
> (including separate add-back pools, with a shared seeded pool that
> is never touched by add-back), so I don't think it's necessary to
> give up yet.

Hmm, maybe.  I'm a bit worried about the amount of complexity that
this entails, and the reality is that the urandom pool or pools don't
provide anything other than cryptographic randomness.

At this point, I wonder if it might not be simpler to restrict the
current nonblocking pool to kernel users, and for userspace users, the
first time a process reads from /dev/urandom or calls getrandom(2), we
create for them a ChaCha20 CRNG, which hangs off of the task
structure.  This would require about 72 bytes of state per process,
but normally very few processes are reading from /dev/urandom or
calling getrandom(2) from userspace.

The CRNG would be initialized from the non-blocking pool, and
reseeded after, say, 2**24 cranks or five minutes.  It's essentially
an OpenBSD-style arc4random in the kernel.  (Arguably the right answer
is to put arc4random in libc, where it can automatically handle
forks/clones/pthreads, but it seems pretty clear *that* train left a
long time ago.)
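
A rough sketch of the reseed bookkeeping, just to show how little is
involved.  Everything here is hypothetical: the struct layout (which is
not meant to match the 72-byte estimate exactly), the function names,
and the use of a caller-supplied monotonic timestamp in seconds.  The
ChaCha20 block function itself is left out; crng_crank() only does the
accounting that would accompany each block generated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CRNG_MAX_CRANKS   (UINT32_C(1) << 24)   /* 2**24 blocks */
#define CRNG_MAX_AGE_SECS (5 * 60)              /* five minutes */

/* Hypothetical per-task state; field layout is illustrative only. */
struct task_crng {
    uint32_t key[8];      /* 256-bit ChaCha20 key */
    uint64_t counter;     /* block counter */
    uint32_t cranks;      /* blocks produced since the last reseed */
    uint64_t seeded_at;   /* monotonic time of the last reseed, seconds */
};

/* True once either limit has been reached. */
static bool crng_needs_reseed(const struct task_crng *c, uint64_t now)
{
    return c->cranks >= CRNG_MAX_CRANKS ||
           now - c->seeded_at >= CRNG_MAX_AGE_SECS;
}

/* Install a fresh key (e.g. drawn from the shared pool) and reset. */
static void crng_reseed(struct task_crng *c, const uint32_t seed[8],
                        uint64_t now)
{
    for (int i = 0; i < 8; i++)
        c->key[i] = seed[i];
    c->counter = 0;
    c->cranks = 0;
    c->seeded_at = now;
}

/* Accounting for one generated block; the cipher call is omitted. */
static void crng_crank(struct task_crng *c)
{
    assert(c->cranks < CRNG_MAX_CRANKS);
    c->counter++;
    c->cranks++;
}
```

A reader path would check crng_needs_reseed() before each block, pull
32 fresh bytes from the non-blocking pool when it returns true, and
only then crank.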

I have a feeling this may be less code and complexity, and it nicely
handles the case where we have a /dev/urandom abuser who feels that
they want to read /dev/urandom in a tight loop, even on a 4 socket
Xeon system.  :-)

							- Ted