* Re: Updated scalable urandom patchkit
@ 2015-10-10 18:45 George Spelvin
  2015-10-11  2:31 ` Theodore Ts'o
  0 siblings, 1 reply; 30+ messages in thread
From: George Spelvin @ 2015-10-10 18:45 UTC (permalink / raw)
  To: andi; +Cc: ahferroin7, jepler, linux, linux-kernel, linux, tytso

I'm very very sorry to be so late to the party; I didn't see this thread
until the week-delayed LWN article on the subject drew my attention to it.

There's a fundamental design issue with this patch set that seems
unnecessary to me: multiple nonblocking pools.

I know, I know, that seems like the whole point of it, but hear me out.

Entropy is a holographic property of the pool, not located in any
particular bit position.

And the most basic operation, of reading from the pool, can easily be
done by multiple readers at the same time from the same bit pattern.
They just need to be salted with distinguishing nonces (CPU IDs will do
nicely) to ensure they all get different results.  So a reader doesn't
just get hash(pool[]), but hash(unique_nonce, pool[]).

This in turn lets the spin_lock_irqsave() in extract_buf be moved to
after the sha_transform() calls.

I think the whole thing can be done, more elegantly, using a single pool
and judiciously relaxed locking rules.  You don't even need reader-writer
locks; it's actually beneficial rather than harmful if there's a race
between a pool reader and writer.

In general, fewer larger pools is better for entropy storage.  The only
reason for the current three-pool design is to achieve strict isolation
of the /dev/random pool.

The two operations that require locking are:

1. Entropy accounting.  However, that only has to be strict
   for /dev/random.  For /dev/urandom, it can be late in various ways.
   One possibility is to have separate entropy accounting per NUMA
   node; only the pool proper has to be shared.
  
2. Add-back of the output to the pool.  This involves writing to the pool,
   but for non-invertibility, it only has to be done by one of a number of
   concurrent readers.  I think significant cleverness can be applied
   here.

There are a lot of things that can be done about this second point:

2a. The obvious one is to batch the add-back.  For backtracking
    protection, we only need to do one add-back per read, not one per
    10 bytes of output.  Combined with the above change that locks only
    around the add-back and not the pool hashing, this greatly reduces
    both the lock window and the number of times it's taken.

2b. Another idea would be to optimize for 16 bytes.  I like the basic
    concept of having a larger hash internal state than the output, but
    would it be possible to output 16 of the 20 bytes of SHA-1 output
    rather than the current 10?  This is obviously a Ted question (the
    current folding is his idea), but even 32 bits is enough to protect
    against internal collisions in the hash state.

2c. Another possibility would be to add small separate "salt piles"
    which are initialized to the CPU id, and stirred by reads.  They only
    need to be large enough to not experience birthday collisions even
    assuming a stupid number of reads between stirs of the main pool.
    (So 128 bits would be more than ample.)  A read would consist of:

	hash[5] = { initial values }
	sha1(hash, my_salt)
	sha1(hash, main_pool)
	if (spin_trylock(main_pool)) {
		add_back(hash, main_pool);
		spin_unlock(main_pool);
	} else {
		add_back(hash, my_salt);
	}

2d. A more complete solution involves noting that, if there are multiple
    concurrent readers, only one has to add back its output to prevent
    backtracking, because all of the concurrent reads are equivalent
    in that sense.  (Even though, because they're salted with separate
    nonces like the CPU id, they're not identical.)

I'm thinking in terms of a "generation counter" for the pool.  (Perhaps
implemented as a seqlock.)  The readers of the generation-n pool must
ensure that *someone* is responsible for bumping the generation to n+1
to prevent backtracking of the generation-n state, but only one.

A key point is that the add-back that bumps the generation to n+2 may
actually be a hash of the generation-n pool state, the generation-n+1
state, or some transient intermediate state due to races with the
generation-n+1 add-back.  It is not necessary for readers to wait for
the pool to be stable, only that it's a non-invertible transformation
based on some earlier generation.

Now, to actually implement that, a "spin lock while checking extra
condition" would be ideal, but I do not want to add Yet Another Locking
Primitive to the kernel.

One option would be to accept the possibility of a race condition
breaking anti-backtracking.  Another reader will come along soon enough
and fix it.

But if that's not okay, consider the following:

We add a second lock, the "responsibility lock", the holder of which
is responsible for doing anti-backtracking.

After hashing the pool, each reader does the following:
1. (Optional) trylock() the pool lock.
   This is a low-load optimization.  If it succeeds, go
   directly to step 5.
2. Trylock() the responsibility lock.
   If this fails, we're done; return.
3. Wait for the pool lock.
4. Drop the responsibility lock.  This must be done before
   updating the pool proper.
5. Add our anti-backtracking bytes to the pool.
6. Release the pool lock.

(I originally thought I'd need to ping-pong between two responsibility
locks based on the generation counter % 2, but now I don't think so.)

Under high load, you have one processor adding back, a second waiting
to add back, and everyone else just sails right through without
waiting at all.
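
In pseudo-C, with pool_lock, resp_lock and add_back() as illustrative
names rather than existing code, the reader path would look like:

	static void mix_back(struct entropy_store *r, const __u32 *hash)
	{
		if (!spin_trylock(&r->pool_lock)) {		/* step 1 */
			if (!spin_trylock(&r->resp_lock))	/* step 2 */
				return;	/* someone else will anti-backtrack */
			spin_lock(&r->pool_lock);		/* step 3 */
			spin_unlock(&r->resp_lock);		/* step 4 */
		}
		add_back(r, hash);				/* step 5 */
		spin_unlock(&r->pool_lock);			/* step 6 */
	}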

Details to be filled in:
- Each reader needs some sort of nonce to distinguish multiple reads
  if there's no main pool add-back between them.  RDTSC, per-cpu
  counter, or whatever.
- Entropy accounting.  As mentioned, some sloppiness is okay, so I
  think it's solvable, but I haven't looked at it yet.


* Re: Updated scalable urandom patchkit
  2015-10-10 18:45 Updated scalable urandom patchkit George Spelvin
@ 2015-10-11  2:31 ` Theodore Ts'o
  2015-10-11  2:53   ` Theodore Ts'o
  2015-10-11  4:35   ` George Spelvin
  0 siblings, 2 replies; 30+ messages in thread
From: Theodore Ts'o @ 2015-10-11  2:31 UTC (permalink / raw)
  To: George Spelvin; +Cc: andi, ahferroin7, jepler, linux-kernel, linux

On Sat, Oct 10, 2015 at 02:45:46PM -0400, George Spelvin wrote:
> In general, fewer larger pools is better for entropy storage.  The only
> reason for the current three-pool design is to achieve strict isolation
> of the /dev/random pool.

You're absolutely right, and I would like to see if we can get away
with keeping a single pool design.  Your idea of using the CPU id
as a nonce is a good one, and I think we can do something similar that
should be good enough for mixing the hash back into the pool.  We can
simply use a hash of the CPU id to change the offset where we do the
mixing, and this should be good enough to avoid collisions when we do
the add-back.

HOWEVER.  One downside of this approach is that, especially on a NUMA
system, the costs of the cache coherency across a single pool which is
constantly being modified and shared by all of the NUMA nodes could be
a killer, even if we go completely lockless.

To that end, Andi, can you try benchmarking the scalability of the
patch below?  I really hope it will be good enough, since besides
using less memory, there are security advantages in not spreading the
entropy across N pools.

If it isn't, we might be able to play a trick where we sample the
r->add_ptr value before we start hashing the pool, and then check to
see if it's changed afterwards.  If it has, we could skip doing the
hash back, which could reduce the cache coherency traffic, since as
you point out:

> 2d. A more complete solution involves noting that, if there are multiple
>     concurrent readers, only one has to add back its output to prevent
>     backtracking, because all of the concurrent reads are equivalent
>     in that sense.  (Even though, because they're salted with separate
>     nonces like the CPU id, they're not identical.)

However, even if we put in that optimization, the primary question is
how good are Intel's cache coherency protocols on their NUMA systems?
I'm pretty sure this would be a disaster on, say, Sequent's old NUMA
machines, but I'm quite confident all of those servers are dead and
buried by now.  :-)

						- Ted

commit 3cb51896deab45bddc1b8f571b1103eae8f50e0e
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Oct 10 22:03:53 2015 -0400

    random: make the reading from the non-blocking pool more scalable
    
    Andi Kleen reported a case where a 4 socket system spent >80% of its
    total CPU time contending on the global urandom nonblocking pool
    spinlock. While the application could probably have used its own PRNG,
    it may have valid reasons to use the best possible key for different
    session keys.
    
    Instead of creating separate multiple per-NUMA node non-blocking
    pools, use a trick suggested by George Spelvin:
    
       Entropy is a holographic property of the pool, not located in any
       particular bit position.
    
       And the most basic operation, of reading from the pool, can easily be
       done by multiple readers at the same time from the same bit pattern.
       They just need to be salted with distinguishing nonces (CPU IDs will do
       nicely) to ensure they all get different results....
    
    We use this trick, and in addition use a hash of the cpu id to change
    where we mix the hash back into the pool to avoid collisions.
    
    Since we are already using a lockless technique (cmpxchg) to update
    the entropy accounting, we don't need to change this around.
    
    This technique won't be quite as scalable since on a NUMA node we will
    still be forcing cache lines to bounce around, but from the
    perspective of entropy storage we're much better off using a single pool
    rather than spreading it across multiple pools.
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..be6b315 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -260,6 +260,7 @@
 #include <linux/irq.h>
 #include <linux/syscalls.h>
 #include <linux/completion.h>
+#include <linux/hash.h>
 
 #include <asm/processor.h>
 #include <asm/uaccess.h>
@@ -494,7 +495,9 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 {
 	unsigned long i, tap1, tap2, tap3, tap4, tap5;
 	int input_rotate;
+	int is_nonblock = (r == &nonblocking_pool);
 	int wordmask = r->poolinfo->poolwords - 1;
+	int poolbits = r->poolinfo->poolbitshift - 5;
 	const char *bytes = in;
 	__u32 w;
 
@@ -506,6 +509,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 
 	input_rotate = r->input_rotate;
 	i = r->add_ptr;
+	if (is_nonblock)
+		i += hash_32(get_cpu(), poolbits);
 
 	/* mix one byte at a time to simplify size handling and churn faster */
 	while (nbytes--) {
@@ -534,6 +539,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 
 	r->input_rotate = input_rotate;
 	r->add_ptr = i;
+	if (is_nonblock)
+		put_cpu();
 }
 
 static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1090,6 +1097,7 @@ retry:
 static void extract_buf(struct entropy_store *r, __u8 *out)
 {
 	int i;
+	int is_nonblock = (r == &nonblocking_pool);
 	union {
 		__u32 w[5];
 		unsigned long l[LONGS(20)];
@@ -1108,9 +1116,12 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 			break;
 		hash.l[i] = v;
 	}
+	if (is_nonblock)
+		hash.w[0] ^= hash_32(get_cpu(), 32);
+	else
+		spin_lock_irqsave(&r->lock, flags);
 
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);
 
@@ -1124,7 +1135,10 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 	 * hash.
 	 */
 	__mix_pool_bytes(r, hash.w, sizeof(hash.w));
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (is_nonblock)
+		put_cpu();
+	else
+		spin_unlock_irqrestore(&r->lock, flags);
 
 	memzero_explicit(workspace, sizeof(workspace));
 


* Re: Updated scalable urandom patchkit
  2015-10-11  2:31 ` Theodore Ts'o
@ 2015-10-11  2:53   ` Theodore Ts'o
  2015-10-11  4:35   ` George Spelvin
  1 sibling, 0 replies; 30+ messages in thread
From: Theodore Ts'o @ 2015-10-11  2:53 UTC (permalink / raw)
  To: George Spelvin, andi, ahferroin7, jepler, linux-kernel, linux

On Sat, Oct 10, 2015 at 10:31:46PM -0400, Theodore Ts'o wrote:
> To that end, Andi, can you try benchmarking the scalability of the
> patch below?  I really hope it will be good enough, since besides
> using less memory, there are security advantages in not spreading the
> entropy across N pools.

Andi, I forgot to mention --- if you could benchmark using both your
"run get_entropy() in a tight loop" test and the actual real-life
application --- since if CPU cache behavior is going to be a large
factor in the results, a real application is going to be more
representative than a micro-benchmark --- that would be much appreciated.

Thanks!

					- Ted
					



* Re: Updated scalable urandom patchkit
  2015-10-11  2:31 ` Theodore Ts'o
  2015-10-11  2:53   ` Theodore Ts'o
@ 2015-10-11  4:35   ` George Spelvin
  2015-10-11 22:25     ` Theodore Ts'o
  1 sibling, 1 reply; 30+ messages in thread
From: George Spelvin @ 2015-10-11  4:35 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

Damn, I bow before the master.  That is a much neater solution
than mine; I had assumed a lock was required for writing.

While it's good enough for benchmarking, there are a few leftover
problems I mention so they don't get missed.

One is the final write back of add_ptr on the last line of
_mix_pool_bytes.  It actually writes i, which includes the per-cpu nonce,
and will have it jumping all over without the steady progression that
the mixing polynomial assumes.

(There's a similar, lesser problem with input_rotate.)

The second, less obvious, problem is that by calling _mix_pool_bytes
completely lockless, there's the risk that it will race with and overwrite
the addition of new seed material to the pool.

The add-back is not critical, and races between two writers don't really
do any harm.  But seed entropy is valuable.

And unfortunately, transferring 256 bits (32 bytes) to the output pool
will try to write every word, so *any* concurrent add-back is risky; there's
no "safe" part of the pool that can be accessed lockless.

(The first crude hack that comes to mind is to double the size
of the output pool, without increasing its nominal capacity, and
do seeding and add-back to different halves.  Hopefully there's
something more elegant.)


Two minor suggestions about is_nonblock:
1) Rather than using (r == &nonblocking_pool), how about !r->limit?

2) Would you mind making is_nonblock bool?

   I know back before the sacred text of the prophets Kernighan and
   Ritchie was corrupted by modern heresies, we used "int" for everything,
   and we liked it.  5 miles in the snow, uphill both ways, yadda yadda.

   But I like to document range restrictions as much as possible, and "bool"
   makes it clearer to both the compiler and the reader of the code.


* Re: Updated scalable urandom patchkit
  2015-10-11  4:35   ` George Spelvin
@ 2015-10-11 22:25     ` Theodore Ts'o
  2015-10-12  0:16       ` George Spelvin
  0 siblings, 1 reply; 30+ messages in thread
From: Theodore Ts'o @ 2015-10-11 22:25 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Sun, Oct 11, 2015 at 12:35:46AM -0400, George Spelvin wrote:
> 
> One is the final write back of add_ptr on the last line of
> _mix_pool_bytes.  It actually writes i, which includes the per-cpu nonce,
> and will have it jumping all over without the steady progression that
> the mixing polynomial assumes.

Yep, good catch; I need to subtract that off.

> (There's a similar, lesser problem with input_rotate.)

I didn't bother messing with input_rotate at all, and I don't think
that will be a problem.

> The second, less obvious, problem is that by calling _mix_pool_bytes
> completely lockless, there's the risk that it will race with and overwrite
> the addition of new seed material to the pool.
> 
> The add-back is not critical, and races between two writers don't really
> do any harm.  But seed entropy is valuable.

That's true, but I've been going back and forth about how much it's
worth to fix this.  After all, the non-blocking pool is guaranteed
to have cryptographic randomness, and so it's a question of how much
effort we want to spend to avoid the possibility of losing some seed entropy.

One approach might be to take a reader lock when mixing back
entropy, but take an exclusive lock in the case when seeding the
non-blocking pool.

Another approach is to have an alternate mechanism for the
non-blocking pool which uses the LOCK prefix when XOR'ing into the
pool.  This would mean that we would have to drop the twisted LFSR and
replace it with something simpler but so long as the mixing algorithm
eventually involves all of the bits in the pool, that will probably be
OK.

Of course, using the LOCK prefix is CPU architecture specific, and in
particular, there is no equivalent on the ARM architecture.

The final thing we could do is to just throw in a smp_mb() before the
write into the pool, and just call it a day, and accept that while we
might lose a small amount of entropy due to a race, it will only be a
byte's worth each time we lose a race (instead of the whole 32 byte
cache line).

None of the alternatives are all that satisfying, and they will all be
a bit more expensive than the first patch.  The question is how to
weigh these problems against just simply using a separate pool for
each NUMA node, and how much scalability do we really need for
"realistic" workloads?

						- Ted
						


* Re: Updated scalable urandom patchkit
  2015-10-11 22:25     ` Theodore Ts'o
@ 2015-10-12  0:16       ` George Spelvin
  2015-10-12  4:05         ` Theodore Ts'o
  0 siblings, 1 reply; 30+ messages in thread
From: George Spelvin @ 2015-10-12  0:16 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

Ted Ts'o wrote:
> Yep, good catch; I need to subtract that off.

I'm not thrilled with incrementing the pointer from i to len, but mixing
at positions i+k to i+k+len.  The whole LFSR scheme relies on a regular
pass structure.

How about this instead: drop the hashed offset, and instead let each
writer do an atomic_add_return on the index, then iterate over the
"reserved" slots.  Concurrent additions will at least do non-overlapping
writes until the number equals the pool size.

(Admittedly, if the pool is 32 words and we're adding back 20 bytes
per reader, overlap is hard to avoid.  Could we do seeding a byte at a
time with input rotate, but add-back 32 bits at a time for simplicity
and speed?)
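
For concreteness, the reservation would be one atomic op per call,
something like the following sketch (r->add_words is the hypothetical
atomic counter mentioned below, and mix_one_word() stands in for the
existing per-word mixing):

	unsigned start = atomic_add_return(nwords, &r->add_words) - nwords;

	for (n = 0; n < nwords; n++)
		mix_one_word(r, (start + n) & wordmask, in[n]);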

The other one would be to have separate indexes for add-back and seed
addition, with the latter protected by a lock.

>> (There's a similar, lesser problem with input_rotate.)

> I didn't bother messing with input_rotate at all, and I don't think
> that will be a problem.

It was more that it's going to do strange non-deterministic things.

Personally, I hate the input_rotate.  It's not that it's harmful, just
that it doesn't do much good compared to the cost; for the number of cycles
and context space required there are more efficient mixing operations.

But if you want, you can easily compute it on demand.
If you avoid reducing r->add_words mod poolwords, then
	input_rotate = 7*(r->add_words + r->add_words/poolwords)

>> The add-back is not critical, and races between two writers don't really
>> do any harm.  But seed entropy is valuable.

> That's true, but I've been going back and forth about how much it's
> worth to fix this.  After all, the non-blocking pool is guaranteed
> to have cryptographic randomness, and so it's a question of how much
> effort we want to spend to avoid the possibility of losing some seed entropy.

I worry that with only 32 words of pool, on a large (1K CPUs) machine
that's execv()ing and generating AT_RANDOM vectors a lot, the traffic
could be pretty heavy, and the odds of stomping a byte of seed material
could get substantial.

Or a small machine with a couple of concurrent /dev/urandom abusers.
Remember, it's globally readable, so it has to be resistant to malicious
abuse.

> One approach might be to take a reader lock when mixing back
> entropy, but take an exclusive lock in the case when seeding the
> non-blocking pool.

Even a reader lock is at least one atomic operation.

> Another approach is to have an alternate mechanism for the
> non-blocking pool which uses the LOCK prefix when XOR'ing into the
> pool.  This would mean that we would have to drop the twisted LFSR and
> replace it with something simpler but so long as the mixing algorithm
> eventually involves all of the bits in the pool, that will probably be
> OK.
>
> Of course, using the LOCK prefix is CPU architecture specific, and in
> particular, there is no equivalent on the ARM architecture.

You can add rather than XOR, and we have atomic add primitives.

(You could even, if you wanted to preserve the period proof of the
mixing function, compute the value which, when added to the original
word, produces the same result as the XOR, and then atomically add that.)
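
Concretely, a sketch (with the pool as an atomic_t array; absent a race
this stores exactly the value the XOR would have):

	__u32 old = atomic_read(&r->pool[i]);
	__u32 want = old ^ w;			/* desired post-XOR value */
	atomic_add(want - old, &r->pool[i]);	/* old + (want - old) == want */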

> The final thing we could do is to just throw in a smp_mb() before the
> write into the pool, and just call it day, and accept that while we
> might lose a small amount of entropy due to a race, it will only a
> byte's worth each time we lose a race (instead of the whole 32 byte
> cache line).

If you're willing to accept "probably" solutions, just add the seed
material twice.  If one byte is lost, the other copy will probably
survive.

Or, if you want to be cleverer, any form of error-correcting code
(duplication is just a simple form of it) will add enough redundant
information that a few erasures won't lose entropy.

But the atomic add seems safer.  I don't have a good feel for the fraction
of lost entropy a deliberate attack on a large machine could cause and
/dev/urandom is not an area where I'm comfortable hoping.

> None of the alternatives are all that satisfying, and they will all be
> a bit more expensive than the first patch.  The question is how to
> weigh these problems against just simply using a separate pool for
> each NUMA node, and how much scalability do we really need for
> "realistic" workloads?

There are several possible solutions that don't need separate pools
(including separate add-back pools, with a shared seeded pool that
is never touched by add-back), so I don't think it's necessary to
give up yet.


* Re: Updated scalable urandom patchkit
  2015-10-12  0:16       ` George Spelvin
@ 2015-10-12  4:05         ` Theodore Ts'o
  2015-10-12  7:49           ` George Spelvin
  0 siblings, 1 reply; 30+ messages in thread
From: Theodore Ts'o @ 2015-10-12  4:05 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Sun, Oct 11, 2015 at 08:16:01PM -0400, George Spelvin wrote:
> 
> I'm not thrilled with incrementing the pointer from i to len, but mixing
> at positions i+k to i+k+len.  The whole LFSR scheme relies on a regular
> pass structure.

That part I'm not worried about.  We still have a regular pass
structure --- since for each CPU, we are still iterating over the pool
in a regular fashion.

> How about this instead: drop the hashed offset, and instead let each
> writer do an atomic_add_return on the index, then iterate over the
> "reserved" slots.  Concurrent additions will at least do non-overlapping
> writes until the number equals the pool size.

One atomic operation per byte that we're mixing in?  That's quite
expensive.

> Personally, I hate the input_rotate.  It's not that it's harmful, just
> that it doesn't do much good compared to the cost; for the number of cycles
> and context space required there are more efficient mixing operations.

The input_rotate is useful for the input pool, for things like
keyboard code so we can make sure they enter the pool at more points
than just the low bits of each word.  But for the output pools, it
really doesn't make any sense.  And we are getting to the point where
we may end up having different mixing algorithms for the nonblocking
pool, and in that case I have absolutely no trouble dropping the
input_rotate part of the mixing algorithm for the non-blocking pool.

> Or a small machine with a couple of concurrent /dev/urandom abusers.
> Remember, it's globally readable, so it has to be resistant to malicious
> abuse.

One of the other ways we could solve this is by hanging a struct off
the task structure, and if we detect that we have a /dev/urandom
abuser, we give that process its own urandom pool, so any pissing that
it does will be in its own pool.  (So to speak.)

Most processes don't actually use that much randomness, and I'm not
that worried about in-kernel users of the nonblocking pool.  Even with
the most exec-heavy workload, setting up a new exec image is
heavyweight enough that you're really not going to be contending on
the lock.  I also have trouble with someone spending $$$$ on a system
with 1K cpu cores and wasting all of their CPU power with running
shell scripts that fork and exec a lot.  :-)

The reality is that most processes don't use /dev/urandom or
getrandom(2) at all, and those that do, many of them only use it
sparingly.  So maybe the right answer is to do something simple which
takes care of the abusers.

> You can add rather than XOR, and we have atomic add primitives.

Atomic-add primitives aren't portable either.  The representation
isn't guaranteed to be 32-bits, and on some platforms an atomic int is
only 24 bits wide (the top 8 bits being used for locking purposes).

> There are several possible solutions that don't need separate pools
> (including separate add-back pools, with a shared seeded pool that
> is never touched by add-back), so I don't think it's necessary to
> give up yet.

Hmm, maybe.  I'm a bit worried about the amount of complexity that
this entails, and the reality is that the urandom pool or pools don't
provide anything other than cryptographic randomness.

At this point, I wonder if it might not be simpler to restrict the
current nonblocking pool to kernel users, and for userspace users, the
first time a process reads from /dev/urandom or calls getrandom(2), we
create for them a ChaCha20 CRNG, which hangs off of the task
structure.  This would require about 72 bytes of state per process,
but normally very few processes are reading from /dev/urandom or
calling getrandom(2) from userspace.

The CRNG would be initialized from the non-blocking pool, and is
reseeded after, say, 2**24 cranks or five minutes.  It's essentially
an OpenBSD-style arc4random in the kernel.  (Arguably the right answer
is to put arc4random in libc, where it can automatically handle
forks/clones/pthread automatically, but it seems pretty clear *that*
train has left a long time ago.)

I have a feeling this may be less code and complexity, and it nicely
handles the case where we have a /dev/urandom abuser who feels that
they want to call /dev/urandom in a tight loop, even on a 4 socket
Xeon system.  :-)

							- Ted


* Re: Updated scalable urandom patchkit
  2015-10-12  4:05         ` Theodore Ts'o
@ 2015-10-12  7:49           ` George Spelvin
  2015-10-12 13:54             ` Theodore Ts'o
  0 siblings, 1 reply; 30+ messages in thread
From: George Spelvin @ 2015-10-12  7:49 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

>> I'm not thrilled with incrementing the pointer from i to len, but mixing
>> at positions i+k to i+k+len.  The whole LFSR scheme relies on a regular
>> pass structure.

> That part I'm not worried about.  We still have a regular pass
> structure --- since for each CPU, we are still iterating over the pool
> in a regular fashion.

Not if I understand what you are doing.

Suppose CPU 0 has an offset of 0, and CPU 1 has an offset of 12.
(I'll also use a count-up index even though I know it's actually
count-down in the code.)

CPU0 writes 5 words at 0..4, leaving add_ptr = 5.
Then CPU 1 writes 5 words at 17..21, leaving add_ptr = 10.
Then CPU 0 writes its next word at 10.

Words 5..9 never got mixed at all.

>> How about this instead: drop the hashed offset, and instead let each
>> writer do an atomic_add_return on the index, then iterate over the
>> "reserved" slots.  Concurrent additions will at least do non-overlapping
>> writes until the number equals the pool size.

> One atomic operation per byte that we're mixing in?  That's quite
> expensive.

Well, for this part, it's one atomic operation per _mix_pool_bytes:
	i = (unsigned)atomic_sub_return(nbytes, &r->add_ptr) + nbytes;

However, I *also* have to use atomic_add to write r->pool[i],
which is indeed expensive.  PoC diff attached, but I'm not sure if
you'll like it.

> The input_rotate is useful for the input pool, for things like
> keyboard code so we can make sure they enter the pool at more points
> than just the low bits of each word.  But for the output pools, it
> really doesn't make any sense.  And we are getting to the point where
> we may end up having different mixing algorithms for the nonblocking
> pool, and in that case I have absolutely no trouble dropping the
> input_rotate part of the mixing algorithm for the non-blocking pool.

You're forgetting: keyboard codes don't go into the input pool.
They go into the per-CPU fast pool, which does some quite thorough
mixing and delivers 128 bits with the entropy effectively distributed
across it.

These days, the only thing that goes into the large pools are
the output from other pools and /dev/random writes.

But also, that's what the twist_table achieves.  By the time the
pool wraps around, the previous value at a given array position
has been read and twisted many times, and is now smeared across
all 32 bits.

If you want larger shifts *and* greater efficiency by removing a table
lookup from the critical path, I'll happily replace it with an Xorshift
structure (will take me a little while; I've generated primitive
polynomials in the past but need to rewrite the code).
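
That is, replace the table lookup with a few shift-and-XOR steps of the
form below (these particular constants are just the classic Marsaglia
example, not ones vetted for this job):

	w ^= w << 13;
	w ^= w >> 17;
	w ^= w << 5;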

>> Or a small machine with a couple of concurrent /dev/urandom abusers.
> Remember, it's globally readable, so it has to be resistant to malicious
>> abuse.

> One of the other ways we could solve this is by hanging a struct off
> the task structure, and if we detect that we have a /dev/urandom
> abuser, we give that process its own urandom pool, so any pissing that
> it does will be in its own pool.  (So to speak.)

Let me make sure we're talking about the same thing here.

Segregating abusers so they don't cause *performance* problems
for other threads is a reasonable optimization.

But the quoted conversation was talking about a *security*
problem that could be created by /dev/urandom abusers if some
of your lock-relaxing ideas were followed.

I was observing that the failure case you were hoping was rare could
be made common by a malicious user.

And for that, I don't think a caching wrapper is the way to solve
the problem.

For security, the underlying pool is either robust or fragile.  If it's
robust, we don't need this extra layer; abusers will get shitty
performance but won't hurt the quality of the output.

If we need the layer, then we have to ask if we understand the
vulnerability to abuse and have successfully segregated all damaging
forms of abuse.  That's more analysis and design effort than simply
eliminating the problem in the first place.

> Most processes don't actually use that much randomness, and I'm not
> that worried about in-kernel users of the nonblocking pool.  Even with
> the most exec-heavy workload, setting up a new exec image is
> heavyweight enough that you're really not going to be contending on
> the lock.  I also have trouble with someone spending $$$$ on a system
> with 1K cpu cores and wasting all of their CPU power with running
> shell scripts that fork and exec a lot.  :-)

But I just re-read Andi's original trouble report, and although he mentions
AT_RANDOM, it's an application reading from /dev/urandom a lot that's
causing the problem, *not* execve.  My memory may have gotten confused.

I can imagine that one /dev/urandom abuser can monopolize the lock and
cause problems for a shell script.

But yes, for this performance problem, a segregate-the-abusers solution
is quite attractive.

> Atomic-add primitives aren't portable either.  The representation
> isn't guaranteed to be 32-bits, and on some platforms an atomic int is
> only 24 bits wide (the top 8 bits being used for locking purposes).

No longer true; see Documentation/atomic_ops.txt.

But yes, atomic ops are *ridiculously* expensive on SMP 32-bit SPARC.

Do we care?  Is anyone using multiprocessor V7 SPARC in anger these days?

>> There are several possible solutions that don't need separate pools
>> (including separate add-back pools, with a shared seeded pool that
>> is never touched by add-back), so I don't think it's necessary to
>> give up yet.

> Hmm, maybe.  I'm a bit worried about the amount of complexity that
> this entails, and the reality is that the urandom pool or pools don't
> provide anything other than cryptographic randomness.

Well, *one* add-back pool would also do, and be a quite simple addition
to the code.

> the urandom pool or pools don't
> provide anything other than cryptographic randomness.

Well, there's cryptographic and there's cryptographic.
/dev/urandom is deliberately *ridiculously* overengineered;
it's not dependent on the security of a single primitive.

I'd rather not downgrade it to depending on the security of AES, or
ChaCha, or Keccak, or any other performance-optimized primitive.

For my own code, sure.  But /dev/random is sold as "you *really* don't
have to worry about it".

> At this point, I wonder if it might not be simpler to restrict the
> current nonblocking pool to kernel users, and for userspace users, the
> first time a process reads from /dev/urandom or calls getrandom(2), we
> create for them a ChaCha20 CRNG, which hangs off of the task
> structure.  This would require about 72 bytes of state per process,
> but normally very few processes are reading from /dev/urandom or
> calling getrandom(2) from userspace.

Interesting idea, although the loss of backtracking protection for
long-running servers is a significant semantic change.

Remember that making antibacktracking work is the entire problem we're
struggling to solve.  If discarding it is on the table, I can solve
the scalability problems extremely easily.  Even without giving it up
completely, just put the add-back inside if(trylock()) and the problem
goes away.

I'm also not sure where you get 72 bytes from.  You need 32 bytes of
key, 8 bytes of nonce, and 4 or 8 bytes of stream position to form the
ChaCha input block.

Then optionally add 64 bytes of output buffer if you don't want to
discard generated bytes after an unaligned read.

It adds up to 44/48, or 108/112 bytes.  Where does 72 come from?

But why bother with even that much?  If you're willing to discard
antibacktracking, just have *one* key for the kernel, reseeded more
often, and a per-thread nonce and stream position.

> (Arguably the right answer is to put arc4random in libc, where it can
> automatically handle forks/clones/pthread automatically, but it seems
> pretty clear *that* train has left a long time ago.)

I'm not quite sure which viatical incident you're alluding to, but yes that
would be the preferred solution.

> I have a feeling this may be less code and complexity, and it nicely
> handles the case where we have a /dev/urandom abuser who feels that
> they want to call /dev/urandom in a tight loop, even on a 4 socket
> Xeon system.  :-)

I'm not sanguine about its simplicity; it's a whole other layer.

Might be a good idea anyway.  I have a similar patch set in the works, but
it's called /dev/frandom because I wasn't willing to suggest such a
semantic change.

But without interfering with legitimate users of /dev/urandom at all,
I'd be quite willing to say that as soon as you read more than 32 bytes
(current request plus recent history) from /dev/urandom, you get a
private ChaCha20 structure to read from.

(This is just enforcing the usage rules from the random(4) man page.
Since I wrote that text, I'm hardly going to object!)



Here's that atomic_t patch I mentioned above.

diff --git a/drivers/char/random.c b/drivers/char/random.c
index e62b30ba..e65357c4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -260,6 +260,7 @@
 #include <linux/irq.h>
 #include <linux/syscalls.h>
 #include <linux/completion.h>
+#include <linux/hash.h>
 
 #include <asm/processor.h>
 #include <asm/uaccess.h>
@@ -423,7 +424,7 @@ struct entropy_store;
 struct entropy_store {
 	/* read-only data: */
 	const struct poolinfo *poolinfo;
-	__u32 *pool;
+	atomic_t *pool;
 	const char *name;
 	struct entropy_store *pull;
 	struct work_struct push_work;
@@ -431,20 +432,20 @@ struct entropy_store {
 	/* read-write data: */
 	unsigned long last_pulled;
 	spinlock_t lock;
-	unsigned short add_ptr;
-	unsigned short input_rotate;
+	atomic_t add_ptr;
 	int entropy_count;
 	int entropy_total;
 	unsigned int initialized:1;
 	unsigned int limit:1;
 	unsigned int last_data_init:1;
+	unsigned char input_rotate;
 	__u8 last_data[EXTRACT_SIZE];
 };
 
 static void push_to_pool(struct work_struct *work);
-static __u32 input_pool_data[INPUT_POOL_WORDS];
-static __u32 blocking_pool_data[OUTPUT_POOL_WORDS];
-static __u32 nonblocking_pool_data[OUTPUT_POOL_WORDS];
+static atomic_t input_pool_data[INPUT_POOL_WORDS];
+static atomic_t blocking_pool_data[OUTPUT_POOL_WORDS];
+static atomic_t nonblocking_pool_data[OUTPUT_POOL_WORDS];
 
 static struct entropy_store input_pool = {
 	.poolinfo = &poolinfo_table[0],
@@ -488,6 +489,10 @@ static __u32 const twist_table[8] = {
  * degree, and then twisted.  We twist by three bits at a time because
  * it's cheap to do so and helps slightly in the expected case where
  * the entropy is concentrated in the low-order bits.
+ *
+ * This function is designed to be safe to call without locking the
+ * entropy_store.  Race conditions might do strange things, but will
+ * leave the pool valid and not lose entropy.
  */
 static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 			    int nbytes)
@@ -496,7 +501,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 	int input_rotate;
 	int wordmask = r->poolinfo->poolwords - 1;
 	const char *bytes = in;
-	__u32 w;
+
+	BUILD_BUG_ON(sizeof(int) != sizeof(__u32));
 
 	tap1 = r->poolinfo->tap1;
 	tap2 = r->poolinfo->tap2;
@@ -505,23 +511,31 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 	tap5 = r->poolinfo->tap5;
 
 	input_rotate = r->input_rotate;
-	i = r->add_ptr;
+	i = (unsigned)atomic_sub_return(nbytes, &r->add_ptr) + nbytes;
 
 	/* mix one byte at a time to simplify size handling and churn faster */
 	while (nbytes--) {
-		w = rol32(*bytes++, input_rotate);
-		i = (i - 1) & wordmask;
+		__u32 v, w = rol32(*bytes++, input_rotate);
 
 		/* XOR in the various taps */
-		w ^= r->pool[i];
-		w ^= r->pool[(i + tap1) & wordmask];
-		w ^= r->pool[(i + tap2) & wordmask];
-		w ^= r->pool[(i + tap3) & wordmask];
-		w ^= r->pool[(i + tap4) & wordmask];
-		w ^= r->pool[(i + tap5) & wordmask];
+		i = (i - 1) & wordmask;
+		w ^= atomic_read(r->pool + ((i + tap1) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap2) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap3) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap4) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap5) & wordmask));
+		w ^= v = atomic_read(r->pool + i);
+		/* Twist the result to mix bits within words */
+		w = (w >> 3) ^ twist_table[w & 7];
 
-		/* Mix the result back in with a twist */
-		r->pool[i] = (w >> 3) ^ twist_table[w & 7];
+		/*
+		 * Now, to put the result back, we don't have an
+		 * atomic_xor operation, but we do have an atomic_add.
+		 * Atomically add the value which would cause the desired
+		 * change.  If there's a race, something strange will happen,
+		 * but we won't lose entropy.
+		 */
+		atomic_add(w - v, r->pool + i);
 
 		/*
 		 * Normally, we add 7 bits of rotation to the pool.
@@ -532,8 +546,13 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 		input_rotate = (input_rotate + (i ? 7 : 14)) & 31;
 	}
 
-	r->input_rotate = input_rotate;
-	r->add_ptr = i;
+	/*
+	 * Access to input_rotate is not thread-safe, because the
+	 * actual value doesn't matter to the security guarantees,
+	 * and the fact that it changes is enough for its purpose.
+	 * So if there's a race condition, it doesn't matter who wins.
+	 */
+	r->input_rotate = (unsigned char)input_rotate;
 }
 
 static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1109,17 +1128,26 @@ retry:
  * This function does the actual extraction for extract_entropy and
  * extract_entropy_user.
  *
+ * For the blocking pools, we lock the pool until we feed back the
+ * extracted entropy, so the next extract_buf will return different
+ * results.
+ *
+ * For the non-blocking pool, concurrent access to the pool is allowed,
+ * and the CPU id is used as a salt to ensure that concurrent readers
+ * will get different results.
+ *
  * Note: we assume that .poolwords is a multiple of 16 words.
  */
 static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 {
 	int i;
+	bool is_blocking = r->limit;
 	union {
 		__u32 w[5];
 		unsigned long l[LONGS(20)];
 	} hash;
 	__u32 workspace[SHA_WORKSPACE_WORDS];
-	unsigned long flags;
+	unsigned long flags = flags;	/* "initialization" shuts up GCC */
 
 	/*
 	 * If we have an architectural hardware random number
@@ -1133,8 +1161,11 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 		hash.l[i] = v;
 	}
 
+	if (is_blocking)
+		spin_lock_irqsave(&r->lock, flags);
+	else
+		hash.w[0] ^= hash_32(get_cpu(), 32);
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);
 
@@ -1148,7 +1179,10 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 	 * hash.
 	 */
 	__mix_pool_bytes(r, hash.w, sizeof(hash.w));
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (is_blocking)
+		spin_unlock_irqrestore(&r->lock, flags);
+	else
+		put_cpu();
 
 	memzero_explicit(workspace, sizeof(workspace));
 


* Re: Updated scalable urandom patchkit
  2015-10-12  7:49           ` George Spelvin
@ 2015-10-12 13:54             ` Theodore Ts'o
  2015-10-12 20:30               ` George Spelvin
  0 siblings, 1 reply; 30+ messages in thread
From: Theodore Ts'o @ 2015-10-12 13:54 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Mon, Oct 12, 2015 at 03:49:08AM -0400, George Spelvin wrote:
> 
> Suppose CPU 0 has an offset of 0, and CPU 1 has an offset of 12.
> (I'll also use a count-up index even though I know it's actually
> count-down in the code.)
> 
> CPU0 writes 5 words at 0..4, leaving add_ptr = 5.
> Then CPU 1 writes 5 words at 17..21, leaving add_ptr = 10.
> Then CPU 0 writes its next word at 10.
> 
> Words 5..9 never got mixed at all.

Hmm, good point.  It *shouldn't* matter if the hash is secure, but the
potentially uneven mixing would be a concern.

> However, I *also* have to use atomic_add to write r->pool[i],
> which is indeed expensive.  PoC diff attached, but I'm not sure if
> you'll like it.

Yeah, that was the part that worried me.  Whether we use an atomic add
or an atomic xor, it's going to be very expensive. Using an atomic add
means that it will work with arm, but there's an explicit loop with
arm, so on an arm system, it effectively turns into a more efficient
spinlock, except you're spinning on each byte, so instead of bouncing
cache lines with the locking granularity of the extract, we would be
bouncing cache lines for every single byte of the mixback operation.

> You're forgetting: keyboard codes don't go into the input pool.
> They go into the per-CPU fast pool, which does some quite thorough
> mixing and delivers 128 bits with the entropy effectively distributed
> across it.

They used to go through the input pool, and when we added the per-CPU
fast pool, we didn't revisit the question of whether the input_rotate
was still needed.  So yes, at this point it might be worthwhile to
remove the input_rotate part of the mixing scheme.

> Segregating abusers so they don't cause *performance* problems
> for other threads is a reasonable optimization.
> 
> But the quoted conversation was talking about a *security*
> problem that could be created by /dev/urandom abusers if some
> of your lock-relaxing ideas were followed.

Segregating abusers solves both problems.  If we do this then we don't
need to drop the locks from the nonblocking pool, which solves the
security problem.  And it also means that each process has its own
CRNG, which is way faster than /dev/urandom even if you don't have the
locking problem, and it provides per-process isolation, which means
that process A can't carry out a backtracking attack on process B.

> > Most processes don't actually use that much randomness, and I'm not
> > that worried about in-kernel users of the nonblocking pool.  Even with
> > the most exec-heavy workload, setting up a new exec image is
> > heavyweight enough that you're really not going to be contending on
> > the lock.  I also have trouble with someone spending $$$$ on a system
> > with 1K cpu cores and wasting all of their CPU power with running
> > shell scripts that fork and exec a lot.  :-)
> 
> But I just re-read Andi's original trouble report, and although he mentions
> AT_RANDOM, it's an application reading from /dev/urandom a lot that's
> causing the problem, *not* execve.  My memory may have gotten confused.
> 
> I can imagine that one /dev/urandom abuser can monopolize the lock and
> cause problems for a shell script.

Well, that's my point.  If we restrict the nonblocking pool to kernel
users, then a /dev/urandom abuser won't be able to cause problems for
the shell script.  And even the most inefficient shell script which is
firing off hundreds of processes per second is unlikely to be able to
swamp the spin loop, *and* I would hope that someone who spent $$$ on
a super-expensive 1K cpu system would also have the wit to actually
recode the shell script in a more efficient compiled language, instead
of wasting all of that expensive hardware.

> Remember that making antibacktracking work is the entire problem we're
> struggling to solve.  If discarding it is on the table, I can solve
> the scalability problems extremely easily.  Even without giving it up
> completely, just put the add-back inside if(trylock()) and the problem
> goes away.

The trylock() scheme is a bit dangerous, since we would need to make
sure that at *all* times when other callers take the lock, the contents
of the pool would have changed.  Or else, a subsequent call to extract
entropy from that CPU will return the same value.  And even if that
were true today (or we could make it be true) if that ever changed, it
would break the code.  So it makes it pretty fragile.

> I'm also not sure where you get 72 bytes from.  You need 32 bytes of
> key, 8 bytes of nonce, and 4 or 8 bytes of stream position to form the
> ChaCha input block.

The ChaCha 20 state block is 64 bytes, and then I was assuming two 4
byte control variables, for the number of cranks and the time since
last reseed.  We could actually store those control variables into the
IV portion of the ChaCha state structure, which would make it be 64
bytes --- or we could generate a new state structure each time and not
bother storing the 16 bytes of constants in the state structure, which
trades a tiny amount of time and stack space for memory.  These are
all implementation details, though....
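
(So the per-task state would be roughly the following hypothetical
structure, which is where the 72 bytes come from:

	struct task_crng {
		__u32	state[16];	/* ChaCha20 input block: 64 bytes */
		__u32	cranks;		/* blocks generated since reseeding */
		__u32	last_reseed;	/* jiffies at the last reseed */
	};

with the two control words optionally folded into the IV words of
state[], as described above.)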

> But why bother with even that much?  If you're willing to discard
> antibacktracking, just have *one* key for the kernel, reseeded more
> often, and a per-thread nonce and stream position.

Well, we're not completely discarding backtracking protection.  We are
discarding backtracking protection between successive reads from a
single process, and even there we would be reseeding every five
minutes (and this could be tuned), so there is *some*
anti-backtracking protection.

I'm going to make the claim that if a particular process is reading
from /dev/urandom in a tight loop, you probably don't *need*
backtracking protection.  You're probably doing something silly such as using
/dev/urandom to overwrite a disk, or some such.

On the flip side, the time when you might care about anti-backtracking
protection is say, when you're generating a new session key for a new
connection.  So perhaps one approach is to use some kind of
ratelimiter algorithm so that if you're using /dev/urandom "carefully"
(say, no more than ten times a second), we'll use the non-blocking
pool.  But once a process exceeds that limit, it will switch over to
the CRNG, and then the only performance that abuser process will hurt
is its own (because it would be even faster if they were running
something like arc4random in userspace).
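
As a sketch, the check could be as simple as the following, with the
two per-task fields being hypothetical additions to task_struct:

	static bool urandom_abuser(struct task_struct *tsk)
	{
		if (time_after(jiffies, tsk->urandom_window + HZ)) {
			tsk->urandom_window = jiffies;	/* new one-second window */
			tsk->urandom_reads = 0;
		}
		return ++tsk->urandom_reads > 10;	/* over the limit: use the CRNG */
	}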

> But without interfering with legitimate users of /dev/urandom at all,
> I'd be quite willing to say that as soon as you read more than 32 bytes
> (current request plus recent history) from /dev/urandom, you get a
> private ChaCha20 structure to read from.

Yeah, that's what I was thinking, although I was willing to be a bit
more generous about when we switch them over to using the ChaCha pool.
I think doing this automatically is going to be much better than
creating a new /dev/frandom, since it's going to be hard to get
applications to switch over.  We're still several years away from
switching people over to the new getrandom(2) system call, after all.

	  	      	     	 	      - Ted


* Re: Updated scalable urandom patchkit
  2015-10-12 13:54             ` Theodore Ts'o
@ 2015-10-12 20:30               ` George Spelvin
  2015-10-12 20:34                 ` George Spelvin
  2015-10-13  2:46                 ` Theodore Ts'o
  0 siblings, 2 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-12 20:30 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

Theodore Ts'o wrote:
> On Mon, Oct 12, 2015 at 03:49:08AM -0400, George Spelvin wrote:
>> Words 5..9 never got mixed at all.

> Hmm, good point.  It *shouldn't* matter if the hash is secure, but the
> potentially uneven mixing would be a concern.

I don't quite follow the "shouldn't matter" part.  The problem is that
insufficient input mixing can cause collisions, which are a loss of entropy.
The output hash (the secure one) can't fix that.

The whole reason for the 5-tap LFSR is to ensure that each input word
affects a significant number of *other* words before getting stored into
by fresh input.

I just realized a worse problem: suppose CPU 1 has an offset of -1.
It will proceed to *undo* the mixing just done by CPU 0.  I don't just
mean that it'll store its input over top, but it will also undo the
LFSR XOR.  Ack!

(Maybe I'll retract that "bow before the master" remark. :-))

> They used to go through the input pool, and when we added the per-CPU
> fast pool, we didn't revisit the question of whether the input_rotate
> was still needed.  So yes, at this point it might be worthwhile to
> remove the input_rotate part of the mixing scheme.

The fast pools actually just extended an issue which the three-pool
design created: when spilling from one pool to the other, the whole
"expensive output, cheap input" optimization makes no sense.

For raw data and interrupt overhead, yes obviously.  But when it's
8 sha_transform() calls to generate 10 bytes, the design of the
subsequent input hash needs some rethinking.

It may make sense to have a separate path for "known high-quality"
input that puts more effort into diffusion.

Idea 1: Given that we do have an entropy estimate with every write to a
        pool, what if we did an input hash that did work proportional
        to the *entropy* of the input?  (With some lower bound for
        zero-entropy writes like plain /dev/random writes and mixback.)

Idea 2: For input mixing not in interrupt context, which includes all
        input to the output pools, it would be smaller and simpler code
        to just XOR into the pool, and defer mixing until one whole-pool
        mixing pass after the XOR position reached the end of the pool.

        Concurrent writers could do a single atomic_add_return to
        find an add position, and if that extended past the end of the
        buffer, they'd obtain the lock, stir the pool, decrement the add
        position (which would permit other concurrent writers) and then
        do their XORs.

        There's still the risk of races between delayed non-locking
        inputters and stirrers, but maybe if non-locking inputters
        were limited to non-critical mixbacks, that would be manageable.

>> However, I *also* have to use atomic_add to write r->pool[i],
>> which is indeed expensive.  PoC diff attached, but I'm not sure if
>> you'll like it.

> Yeah, that was the part that worried me.  Whether we use an atomic add
> or an atomic xor, it's going to be very expensive. Using an atomic add
> means that it will work with arm, but there's an explicit loop with
> arm, so on an arm system, it effectively turns into a more efficient
> spinlock, except you're spinning on each byte, so instead of bouncing
> cache lines with the locking granularity of the extract, we would be
> bouncing cache lines for every single byte of the mixback operation.

Yes.  The current ticket spinlocks are little more than a single
atomic_add_return, and the dangers of putting multiple locks in the same
cache line are well known.

> Segregating abusers solves both problems.  If we do this then we don't
> need to drop the locks from the nonblocking pool, which solves the
> security problem.

Er, sort of.  I still think my points were valid, but they're
about a particular optimization suggestion you had.  By avoiding
the need for the optimization, the entire issue is mooted.

I still say a separate firewall is a bad way to *solve* security problems,
but it can avoid performance pressure to *create* them in the first place.

>> I can imagine that one /dev/urandom abuser can monopolize the lock and
>> cause problems for a shell script.

> Well, that's my point.  If we restrict the nonblocking pool to kernel
> users, then a /dev/urandom abuser won't be able to cause problems for
> the shell script.

And I quite agree with that.  Sorry, sometimes a discussion goes in
multiple directions and it's hard to keep track of the point of any
particular sentence.

I get fixated on one point and interpret comments about something else
as if they pertained to the issue I'm thinking about.

> The trylock() scheme is a bit dangerous, since we would need to make
> sure that at *all* times when other callers take the lock, the contents
> of the pool would have changed.  Or else, a subsequent call to extract
> entropy from that CPU will return the same value.  And even if that
> were true today (or we could make it be true) if that ever changed, it
> would break the code.  So it makes it pretty fragile.

Sorry, yes, I'm perfectly aware of this, I just glossed over it.
Each extraction has to have a unique nonce; that's non-negotiable, but
not hard to achieve.  We'd need to add a per-cpu counter to go along
with the CPU id to make the extraction nonce.

Doing this would be valuable *anyway*, because it would let us
reduce the mixback to once per read() rather than once per 10 bytes.

(Also, what would you say to reducing arch_get_random_long() to 16
bytes, and reserving hash.w[5] for the nonce?  That would ensure that
even a malicious arch_get_random_long() couldn't force a collision.)

> The ChaCha 20 state block is 64 bytes, and then I was assuming two 4
> byte control variables, for the number of cranks and the time since
> last reseed.  We could actually store those control variables into the
> IV portion of the ChaCha state structure, which would make it be 64
> bytes --- or we could generate a new state structure each time and not
> bother storing the 16 bytes of constants in the state structure, which
> trades a tiny amount of time and stack space for memory.  These are
> all implementation details, though....

You have to copy the state *anyway* because you don't want it overwritten
by the ChaCha output, so there's really no point storing the constants.
(Also, ChaCha has a simpler input block structure than Salsa20; the
constants are all adjacent.)

And ChaCha includes a stream position counter itself, so you don't need
a separate one.
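
For concreteness, the sort of per-context layout I'm imagining (all the
names are invented, and whether the crank/reseed counters live in the IV
words is exactly the implementation detail in question):

	/*
	 * The ChaCha input block is 16 words: 4 constants ("expand 32-byte k"),
	 * 8 key words, then counter and nonce words (the counter/nonce split
	 * differs between DJB's original and the RFC variant).  Since the
	 * constants are adjacent, they can be dropped from the stored state
	 * and re-inserted at crank time.
	 */
	struct chacha_crng {
		u32		key[8];		/* reseeded from the input pool */
		u64		counter;	/* ChaCha's own stream position */
		u32		cranks;		/* output blocks since last reseed */
		unsigned long	last_reseed;	/* jiffies at the last reseed */
	};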

(Note: one problem with ChaCha specifically is that it needs 16x32 bits
of registers, and Arm32 doesn't quite have enough.  We may want to provide
an arch CPRNG hook so people can plug in other algorithms with good
platform support, like x86 AES instructions.)

>> But why bother with even that much?  If you're willing to discard
>> antibacktracking, just have *one* key for the kernel, reseeded more
>> often, and a per-thread nonce and stream position.

> Well, we're not completely discarding backtracking protection.  We are
> discarding backtracking protection between successive reads from a single
> process, and even there we would be reseeding every five minutes (and
> this could be tuned), so there is *some* anti-backtracking protection.

Yeah, I realized this a few hours after sending that e-mail.
Anti-backtracking between processes is more important than within.

> I'm going to make the claim that if a particular process is reading
> from /dev/urandom in a tight loop, you probably don't *need*
> backtracking.  You're probably doing something silly such as using
> /dev/urandom to overwrite a disk, or some such.

I fully agree.

> On the flip side, the time when you might care about anti-backtracing
> protection is say, when you're generating a new session key for a new
> connection.  So perhaps one approach is to use some kind of
> ratelimiter algorithm so that if you're using /dev/urandom "carefully"
> (say, no more than ten times a second), we'll use the non-blocking
> pool.  But once a process exceeds that limit, it will switch over to
> the CRNG, and then the only performance that abuser process will hurt
> is its own (because it would be even faster if they were running
> something like arc4random in userspace).

I consider the size of the read at least as important as the rate.
As mentioned in the random(4) man page, I don't think more than 256
bits of computational security even exists.  Humanity's Bitcoin mining is
(per http://bitcoincharts.com/) doing 440051 Thash/s, which is 80 bits
per month.
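
(Back-of-envelope, for anyone who wants to check that figure:

	440051 Thash/s ~= 4.4e17 hashes/s
	4.4e17 hashes/s * ~2.6e6 s/month ~= 1.2e24 ~= 2^80 hashes/month

i.e. the whole network performs roughly one 80-bit brute-force search
per month.)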

We can extrapolate computational feasibility beyond what's currently
been done.  I think 128-bit claims are defensible.  But once we get beyond
the square of anything ever attempted (170 bits or so), security claims
really don't know what they're talking about.  256 bits is the *cube*.
That's just crazy talk.

This in turn means that someone asking for more than 256 bits of seed
material is doing something stupid.  (In the more diplomatic language of
the man page, "should be taken as a sign that its cryptography is _not_
skillfully implemented."  I didn't want to totally condemn quick one-time
hacks.)

>> But without interfering with legitimate users of /dev/urandom at all,
>> I'd be quite willing to say that as soon as you read more than 32 bytes
>> (current request plus recent history) from /dev/urandom, you get a
>> private ChaCha20 structure to read from.

> Yeah, that's what I was thinking, although I was willing to be a bit
> more generous about when we switch them over to using the ChaCha pool.
> I think doing this automatically is going to be much better than
> creating a new /dev/frandom, since it's going to be hard to get
> applications to switch over.  We're still several years away from
> switching people over to the new getrandom(2) system call, after all.

I worry this might attract the same sort of shitstorm as the SHA-3
proposal to tweak Keccak's preimage resistance (which, for the benefit of
the peanut gallery, was solid cryptographic engineering that belatedly
fixed a stupidity in the original SHA-3 contest rules), but if you're
up for it, I'm not going to object.

I'm not too hung up on the thresholds, as long as we catch bulk
readers on the first read, rather than wait until *after* they've
drained /dev/urandom.  I'd prefer a single 64-byte read request would
move to mitigation mode, but I could be talked into 128.  (Actually,
it would probably make more sense to have the threshold be a multiple
of EXTRACT_SIZE.)

History is probably best kept in a leaky bucket of some sort: it has a
current level and a last-update jiffies timestamp, and on each read we
subtract the allowed maximum rate (scaled by the elapsed time) from the
level.

(Or would an exponential average be better?  That would make the maximum
read rate more strongly dependent on the evenness of the spacing.)

Then add that (scaled somehow?) to the current read size and compare
with the threshold.
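
A minimal sketch of that bucket, with invented constants and names, and
with the scaling question above left open:

	#include <linux/jiffies.h>
	#include <linux/types.h>

	#define ABUSE_LEAK_BYTES_PER_SEC	4096	/* forgiven per second */
	#define ABUSE_THRESHOLD			65536	/* bytes before mitigation */

	struct urandom_bucket {
		unsigned long	last;	/* jiffies at last update */
		unsigned int	level;	/* bytes currently charged */
	};

	/* Returns true if this read should push the caller into mitigation mode. */
	static bool urandom_read_is_abusive(struct urandom_bucket *b, size_t nbytes)
	{
		unsigned long leaked = (jiffies - b->last) * ABUSE_LEAK_BYTES_PER_SEC / HZ;

		b->last = jiffies;
		b->level = leaked >= b->level ? 0 : b->level - leaked;

		/* Add the current read size and compare with the threshold. */
		b->level += nbytes;
		return b->level > ABUSE_THRESHOLD;
	}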

The same variables can be used (with different parameters) to decide if
we want to get out of mitigation mode.  The one thing to watch out for
is that "cat </dev/urandom >/dev/sdX" may have some huge pauses once
the buffer cache fills.  We don't want to forgive after too small a
fixed interval.

Finally, we have the issue of where to attach this rate-limiting structure
and crypto context.  My idea was to use the struct file.  But now that
we have getrandom(2), it's harder.  mm, task_struct, signal_struct, what?

(Post-finally, do we want this feature to be configurable under
CONFIG_EMBEDDED?  I know keeping the /dev/random code size small is
a specific design goal, and abuse mitigation is optional.)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-12 20:30               ` George Spelvin
@ 2015-10-12 20:34                 ` George Spelvin
  2015-10-13  2:46                 ` Theodore Ts'o
  1 sibling, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-12 20:34 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

(BTW, my previous e-mail went out early due to a mis-click.  It was almost
done, but please excuse any unfinished sentences.)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-12 20:30               ` George Spelvin
  2015-10-12 20:34                 ` George Spelvin
@ 2015-10-13  2:46                 ` Theodore Ts'o
  2015-10-13  3:50                   ` Raymond Jennings
                                     ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Theodore Ts'o @ 2015-10-13  2:46 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Mon, Oct 12, 2015 at 04:30:59PM -0400, George Spelvin wrote:
> > Segregating abusers solves both problems.  If we do this then we don't
> > need to drop the locks from the nonblocking pool, which solves the
> > security problem.
> 
> Er, sort of.  I still think my points were valid, but they're
> about a particular optimization suggestion you had.  By avoiding
> the need for the optimization, the entire issue is mooted.

Sure, I'm not in love with anyone's particular optimization, whether
it's mine, yours, or Andi's.  I'm just trying to solve the scalability
problem while also trying to keep the code maintainable and easy to
understand (and over the years we've actually made things worse, to
the extent that having a single mixing function for the input and output
pools is starting to be more of a problem than a feature, since we're coding
in a bunch of exceptions when it's the output pool, etc.).

So if we can solve a problem by routing around it, that's fine in my
book.

> You have to copy the state *anyway* because you don't want it overwritten
> by the ChaCha output, so there's really no point storing the constants.
> (Also, ChaCha has a simpler input block structure than Salsa20; the
> constants are all adjacent.)

We're really getting into low-level implementations here, and I think
it's best to worry about these sorts of things when we have a patch to
review.....

> (Note: one problem with ChaCha specifically is that it needs 16x32 bits
> of registers, and Arm32 doesn't quite have enough.  We may want to provide
> an arch CPRNG hook so people can plug in other algorithms with good
> platform support, like x86 AES instructions.)

So while a ChaCha20-based CRNG should be faster than a SHA-1 based
CRNG, and I consider this a good thing, for me speed is **not** more
important than keeping the underlying code maintainable and simple.
This is one of the reasons why I looked at, and then discarded, using
x86 accelerated AES as the basis for a CRNG.  Setting up AES so that
it can be used easily with or without hardware acceleration looks very
complicated to do in a cross-architectural way, and I don't want to
drag in all of the crypto layer for /dev/random.

> The same variables can be used (with different parameters) to decide if
> we want to get out of mitigation mode.  The one thing to watch out for
> is that "cat </dev/urandom >/dev/sdX" may have some huge pauses once
> the buffer cache fills.  We don't want to forgive after too small a
> fixed interval.

At least initially, once we go into mitigation mode for a particular
process, it's probably safer to simply not exit it.

> Finally, we have the issue of where to attach this rate-limiting structure
> and crypto context.  My idea was to use the struct file.  But now that
> we have getrandom(2), it's harder.  mm, task_struct, signal_struct, what?

I'm personally more inclined to keep it with the task struct, so that
different threads will use different crypto contexts, just from
simplicity point of view since we won't need to worry about locking.

Since many processes don't use /dev/urandom or getrandom(2) at all,
the first time they do, we'd allocate a structure and hang it off the
task_struct.  When the process exits, we would explicitly memzero it
and then release the memory.
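
Roughly like this; the task_struct member and the context type are both
invented names, and the exit-path hook is the other piece of plumbing
needed:

	/* Sketch only: "urandom_ctx" is an invented task_struct member. */
	static struct task_rnd_ctx *get_task_rnd_ctx(void)
	{
		struct task_rnd_ctx *ctx = current->urandom_ctx;

		if (!ctx) {
			ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
			if (!ctx)
				return NULL;	/* caller falls back to the pool */
			current->urandom_ctx = ctx;	/* only "current" touches it */
		}
		return ctx;
	}

	/* Called from the task exit path. */
	static void free_task_rnd_ctx(struct task_struct *tsk)
	{
		struct task_rnd_ctx *ctx = tsk->urandom_ctx;

		tsk->urandom_ctx = NULL;
		if (ctx) {
			memzero_explicit(ctx, sizeof(*ctx));
			kfree(ctx);
		}
	}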

> (Post-finally, do we want this feature to be configurable under
> CONFIG_EMBEDDED?  I know keeping the /dev/random code size small is
> a specific design goal, and abuse mitigation is optional.)

Once we code it up we can see how many bytes this takes, we can have
this discussion.  I'll note that ChaCha20 is much more compact than SHA1:

   text	   data	    bss	    dec	    hex	filename
   4230	      0	      0	   4230	   1086	/build/ext4-64/lib/sha1.o
   1152	    304	      0	   1456	    5b0	/build/ext4-64/crypto/chacha20_generic.o

... and I've thought about this as being the first step towards
potentially replacing SHA1 with something ChaCha20 based, in light of
the SHAppening attack.  Unfortunately, BLAKE2s is similar to ChaCha
only from a design perspective, not an implementation perspective.
Still, I suspect that, just looking at the crypto primitives, even if we
need to include two independent copies of the ChaCha20 core crypto and
the BLAKE2s core crypto, it should still be about half the size of the
SHA-1 crypto primitive.

And from the non-plumbing side of things, Andi's patchset increases
the size of /dev/random by a bit over 6%, or 974 bytes from a starting
base of 15719 bytes.  It ought to be possible to implement a ChaCha20
based CRNG (ignoring the crypto primitives) in less than 974 bytes of
x86_64 assembly.  :-)

						- Ted


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13  2:46                 ` Theodore Ts'o
@ 2015-10-13  3:50                   ` Raymond Jennings
  2015-10-13  7:50                     ` George Spelvin
  2015-10-13  6:24                   ` George Spelvin
  2015-10-13 16:20                   ` Andi Kleen
  2 siblings, 1 reply; 30+ messages in thread
From: Raymond Jennings @ 2015-10-13  3:50 UTC (permalink / raw)
  To: Theodore Ts'o, George Spelvin, ahferroin7, andi, jepler,
	linux-kernel, linux



On Mon, Oct 12, 2015 at 7:46 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Oct 12, 2015 at 04:30:59PM -0400, George Spelvin wrote:
>>  > Segregating abusers solves both problems.  If we do this then we 
>> don't
>>  > need to drop the locks from the nonblocking pool, which solves the
>>  > security problem.
>> 
>>  Er, sort of.  I still think my points were valid, but they're
>>  about a particular optimization suggestion you had.  By avoiding
>>  the need for the optimization, the entire issue is mooted.
> 
> Sure, I'm not in love with anyone's particular optimization, whether
> it's mine, yours, or Andi's.  I'm just trying to solve the scalability
> problem while also trying to keep the code maintainable and easy to
> understand (and over the years we've actually made things worse, to
> the extent that having a single mixing for the input and output pools
> is starting to be more of problem than a feature, since we're coding
> in a bunch of exceptions when it's the output pool, etc.).
> 
> So if we can solve a problem by routing around it, that's fine in my
> book.
> 
>>  You have to copy the state *anyway* because you don't want it 
>> overwritten
>>  by the ChaCha output, so there's really no point storing the 
>> constants.
>>  (Also, ChaCha has a simpler input block structure than Salsa20; the
>>  constants are all adjacent.)
> 
> We're really getting into low-level implementations here, and I think
> it's best to worry about these sorts of things when we have a patch to
> review.....
> 
>>  (Note: one problem with ChaCha specifically is that is needs 16x32 
>> bits
>>  of registers, and Arm32 doesn't quite have enough.  We may want to 
>> provide
>>  an arch CPRNG hook so people can plug in other algorithms with good
>>  platform support, like x86 AES instructions.)
> 
> So while a ChaCha20-based CRNG should be faster than a SHA-1 based
> CRNG, and I consider this a good thing, for me speed is **not** more
> important than keeping the underlying code maintainable and simple.
> This is one of the reasons why I looked at, and then discarded, to use
> x86 accelerated AES as the basis for a CRNG.  Setting up AES so that
> it can be used easily with or without hardware acceleration looks very
> complicated to do in a cross-architectural way, and I don't want to
> drag in all of the crypto layer for /dev/random.
> 
>>  The same variables can be used (with different parameters) to 
>> decide if
>>  we want to get out of mitigation mode.  The one thing to watch out 
>> for
>>  is that "cat </dev/urandom >/dev/sdX" may have some huge pauses once
>>  the buffer cache fills.  We don't want to forgive after too small a
>>  fixed interval.
> 
> At least initially, once we go into mitigation mode for a particular
> process, it's probably safer to simply not exit it.
> 
>>  Finally, we have the issue of where to attach this rate-limiting 
>> structure
>>  and crypto context.  My idea was to use the struct file.  But now 
>> that
>>  we have getrandom(2), it's harder.  mm, task_struct, signal_struct, 
>> what?
> 
> I'm personally more inclined to keep it with the task struct, so that
> different threads will use different crypto contexts, just from
> simplicity point of view since we won't need to worry about locking.
> 
> Since many processes don't use /dev/urandom or getrandom(2) at all,
> the first time they do, we'd allocate a structure and hang it off the
> task_struct.  When the process exits, we would explicitly memzero it
> and then release the memory.
> 
>>  (Post-finally, do we want this feature to be configurable under
>>  CONFIG_EMBEDDED?  I know keeping the /dev/random code size small is
>>  a speficic design goal, and abuse mitigation is optional.)
> 
> Once we code it up we can see how many bytes this takes, we can have
> this discussion.  I'll note that ChaCha20 is much more compact than 
> SHA1:
> 
>    text	   data	    bss	    dec	    hex	filename
>    4230	      0	      0	   4230	   1086	/build/ext4-64/lib/sha1.o
>    1152	    304	      0	   1456	    
> 5b0	/build/ext4-64/crypto/chacha20_generic.o
> 
> ... and I've thought about this as being the first step towards
> potentially replacing SHA1 with something ChaCha20 based, in light of
> the SHAppening attack.  Unfortunately, BLAKE2s is similar to ChaCha
> only from design perspective, not an implementation perspective.
> Still, I suspect the just looking at the crypto primitives, even if we
> need to include two independent copies of the ChaCha20 core crypto and
> the Blake2s core crypto, it still should be about half the size of the
> SHA-1 crypto primitive.
> 
> And from the non-plumbing side of things, Andi's patchset increases
> the size of /dev/random by a bit over 6%, or 974 bytes from a starting
> base of 15719 bytes.  It ought to be possible to implement a ChaCha20
> based CRNG (ignoring the crypto primitives) in less than 974 bytes of
> x86_64 assembly.  :-)
> 
> 						- Ted
> 
> --
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

This might be stupid, but could something asynchronous work?  Perhaps 
have the entropy generators dump their entropy into a central pool via 
a cycbuf, and have a background kthread manage the per-cpu or 
per-process entropy pools?




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13  2:46                 ` Theodore Ts'o
  2015-10-13  3:50                   ` Raymond Jennings
@ 2015-10-13  6:24                   ` George Spelvin
  2015-10-13 16:20                   ` Andi Kleen
  2 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-13  6:24 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

> We're really getting into low-level implementations here, and I think
> it's best to worry about these sorts of things when we have a patch to
> review.....

> it's probably safer to simply not exit it.

> I'm personally more inclined to keep it with the task struct, so that
> different threads will use different crypto contexts, just from
> simplicity point of view since we won't need to worry about locking.

> Once we code it up we can see how many bytes this takes, we can have
> this discussion.

I'm fine with all of these; thank you.

> This is one of the reasons why I looked at, and then discarded, to use
> x86 accelerated AES as the basis for a CRNG.  Setting up AES so that
> it can be used easily with or without hardware acceleration looks very
> complicated to do in a cross-architectural way, and 

I haven't looked as deeply, but it didn't look too hard.  Is it possible
to briefly explain the problem?

I assumed you'd have an arch-specific capabilities probe function that
would set up an operations structure.  That would provide the various
buffer sizes required, along with setup (kernel_fpu_begin() and key
scheduling), CPRNG core, and teardown (kernel_fpu_end()) functions.

Is there some huge gotcha I'm overlooking?
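
Something like this is what I was picturing (every name below is
invented):

	struct arch_cprng_ops {
		unsigned int	key_bytes;	/* seed material consumed per reseed */
		unsigned int	block_bytes;	/* output produced per crank */
		/* kernel_fpu_begin() + key scheduling; returns an opaque context */
		void *(*begin)(const u8 *key);
		void (*crank)(void *ctx, u8 *out, const u8 *nonce);
		/* kernel_fpu_end() + wipe any expanded key material */
		void (*end)(void *ctx);
	};

	/* Arch probe picks an implementation; generic ChaCha20 is the fallback. */
	const struct arch_cprng_ops *arch_cprng_probe(void);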

> I don't want to drag in all of the crypto layer for /dev/random.

Oh, gods, no; the crypto layer drives me nuts.  Truthfully, the main
hair of the crypto layer is all the modular cipher modes on top of block
ciphers, and the scatterlist stuff to handle arbitrarily fragmented
input and output buffers for the benefit of the network layer, but the
code is horrible reading.

Every time I look, I find something I want to fix (the CTS mode
implementation uses 6 blocks worth of stack buffer; I have a patch to
reduce that to 3) but then I get lost in the morass of structures and
wrappers trying to make the code fit in with the rest.

There's a struct crypto_tfm, crypto_alg, crypto_instance, cipher_desc,
crypto_type, crypto_template, crypto_spawn... I've been trying to read it
and I still have no idea what half of them are for.

And I *still* haven't figured out how to get the self-test code to tell
me that test X was performed and passed.  I ended up writing my own test,
which seems wrong.

> ... and I've thought about this as being the first step towards
> potentially replacing SHA1 with something ChaCha20 based, in light of
> the SHAppening attack.  Unfortunately, BLAKE2s is similar to ChaCha
> only from design perspective, not an implementation perspective.
> Still, I suspect the just looking at the crypto primitives, even if we
> need to include two independent copies of the ChaCha20 core crypto and
> the Blake2s core crypto, it still should be about half the size of the
> SHA-1 crypto primitive.

Well, the SHAppening doesn't really change much except a slight schedule
tweak, but yeah, it's one of those things that would be nice to get
around to.

I'm not sure what you expect to do with ChaCha, though; it's really
an expansion function, not compression, and not easily adapted to be one.

BLAKE2 is a bit ugly.  I'm generally not liking MD5/SHA-like designs
that dribble the message and some round constants in; I'm much preferring
the large-state "add a bunch of input all at once" designs like Keccak,
SipHash and DJB's SHA-3 entry, CubeHash.

Have you seen it?  It's quite similar to Keccak, just using a 32x32 = 1024
bit state rather than Keccak's 25*64=1600.

The reason it got dumped is that, like Keccak, to get n bits of
preimage resistance, it requires 2n "capacity" bits left unused each
round.  When you ask for 512 bits of preimage resistance, you can
only import a few bits of message each block.

Keccak has the same problem, but it has a big enough block that it can handle
it.

In Dan's submission, you'll see his usual "this is a stupid request
which I'm only paying lip service to", and his "SHA-3-512-formal"
proposal was dog-slow.  To quote:

	The "SHA-3-512-formal" proposal is aimed at users who are
	(1) concerned with attacks using 2^384 operations,
	(2) unconcerned with quantum attacks that cost far less, and
	(3) unaware that attackers able to carry out 2^256 operations
	    would wreak havoc on the entire SHA-3 landscape, forcing SHA-3
	    to be replaced no matter which function is selected as SHA-3.
	The "SHA-3-512-normal" proposal is aimed at sensible users.
	For all real-world cryptographic applications, the "formal"
	versions here can be ignored, and the tweak amounts to a proposal
	of CubeHash16/32 as SHA-3."

If NIST had proposed changing the preimage resistance rules *before*
the final decision, things would have gone a lot differently.

> And from the non-plumbing side of things, Andi's patchset increases
> the size of /dev/random by a bit over 6%, or 974 bytes from a starting
> base of 15719 bytes.  It ought to be possible to implement a ChaCha20
> based CRNG (ignoring the crypto primitives) in less than 974 bytes of
> x86_64 assembly.  :-)

Yes, not hard.  Are you inspired, or would you like me to put together
a patch?

And should I do moderately evil space-saving things like store the
pointer to the crypto state and the leaky bucket in the same task slot,
distinguished by the pointer lsbit?
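
(I.e. something like the following, with an invented task slot; both
structures are at least 4-byte aligned, so bit 0 is free as a tag:)

	#define TASK_RND_IS_BUCKET	1UL

	static inline bool task_rnd_is_bucket(unsigned long slot)
	{
		return slot & TASK_RND_IS_BUCKET;
	}

	static inline void *task_rnd_ptr(unsigned long slot)
	{
		return (void *)(slot & ~TASK_RND_IS_BUCKET);
	}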

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13  3:50                   ` Raymond Jennings
@ 2015-10-13  7:50                     ` George Spelvin
  0 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-13  7:50 UTC (permalink / raw)
  To: ahferroin7, andi, jepler, linux-kernel, linux, linux, shentino, tytso

> This might be stupid, but could something asynchronous work?  Perhaps 
> have the entropy generators dump their entropy into a central pool via 
> a cycbuf, and have a background kthread manage the per-cpu or 
> per-process entropy pools?

No for two reasons:

(Minor): One of the functions of the mixback is to ensure that the next
	reader hashes a *different* pool state.  If the mixback is
	delayed, the next reader might hash the *same* pool state and
	return the same numbers.  (There are easy workarounds for this.)
(Major): What do you do when the circular buffer is full?  If it's not safe
	to skip the mixback, then we have to block and get into the same
	lock-contention problem.


But... this suggestion of having a separate thread do the mixback gives
me an idea.  In fact, I think it's a good idea.

Ted (or anyone else listening), what do you think of the following?
I think it would solve Andi's problem and be a smaller code change than
the abuse mitigation mode.  (Which is still a good idea, but is off the
critical path.)

- Associated with a pool is an atomic "mixback needed" flag.
- Also an optional mixback buffer.  (Optional because the mixback
  could just recompute it.)

- Dropping the lock requires the following operations:
  - Test the mixback needed flag.  If set,
    - Copy out and wipe the buffer,
    - smp_wmb()
    - Clear the flag
    - smp_wmb()
    - Do the mixback, and
    - Re-check again before dropping the lock.
    (This check before dropping the lock is technically an optional
    optimization.)
  - Drop the lock.
  - smp_mb()	(Since it's write-to-read, we can't use _rmb or _wmb.)
  - Test the mixback needed flag again.
  - If it's set, trylock().  If that succeeds, go do the mixback as above.
  - If it fails, return.

Each reader uses already-discussed nonce techniques to safely do concurrent
reads from the same pool.  Then, at the end:
- (Optional) trylock() and, if it succeeds,
  do mixback directly.
- Copy our mixback data to the buffer (race conditions be damned)
- smp_wmb()
- set the mixback needed flag
- smp_mb()	(Since it's write-to-read; or use smp_store_mb())
- trylock()
  - If that fails, return.
  - If that succeeds (and the flag is still set) do the mixback

This is based on the fact that if there are multiple concurrent reads,
we only need one mixback (thus, only one buffer/flag), but the "last one
out the door" has to do it.

Also, we don't care if we mis-count and end up doing it twice.

Each reader sets the flag and *then* does a trylock.  If the trylock fails,
it's guaranteed that the lock-holder will see the flag and take care of
the mixback for us.

The writers drop the lock and *then* test the flag.


The result is that readers *never* do a blocking acquire of the pool
lock.  Which should solve all the contention problems.  Andi's stupid
application will still be stupid, but won't fall off a locking cliff.


(We could also use w[5] as the "mixback needed" flag and just
force it to 1 on the off chance it's zero with negligible loss
of entropy and zero security loss.)


The one thing I worry about is livelock keeping one thread in the
mixback code indefinitely, which can be mitigated by dropping the lock
and waiting before re-testing and re-acquiring if we loop too often.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13  2:46                 ` Theodore Ts'o
  2015-10-13  3:50                   ` Raymond Jennings
  2015-10-13  6:24                   ` George Spelvin
@ 2015-10-13 16:20                   ` Andi Kleen
  2015-10-13 21:10                     ` George Spelvin
  2 siblings, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2015-10-13 16:20 UTC (permalink / raw)
  To: Theodore Ts'o, George Spelvin, ahferroin7, andi, jepler,
	linux-kernel, linux


I tested the two proposed patches from earlier this thread on a 4S
system. This was just with my (worst case) microbenchmark.

Unfortunately both patches scale much worse than the duplicated pools,
and can be even worse than the baseline (not sure why).

The base line peaks at slightly above 200K ops/s with less than 20 CPUs.
And then it gets slower until settling around 100K ops/s with more CPUs.

Ted's patch peaks at 350K with four CPUs, but then quickly degrades to
50K ops/s at 20+ CPUs. At 144 CPUs it is slightly faster again at ~80K.

Spelvin's patch peaks at only 140K at 2 CPUs (so it's slower than base line),
stays around 120K up to 20, then degrades quickly to 50K and then slowly
improves again to ~80K.

The duplicated pool patch is ~200K up to 20 CPUs, 400K up to 40, 600K at
slightly below 60 CPUs, and then very slowly degrades to 520K at 144. 

-Andi



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13 16:20                   ` Andi Kleen
@ 2015-10-13 21:10                     ` George Spelvin
  2015-10-14  2:15                       ` Andi Kleen
  2015-10-21  8:27                       ` Updated scalable urandom patchkit George Spelvin
  0 siblings, 2 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-13 21:10 UTC (permalink / raw)
  To: ahferroin7, andi, jepler, linux-kernel, linux, linux, tytso

> Ted's patch peaks at 350K with four CPUs, but then quickly degrades to
> 50K ops/s at 20+ CPUs. At 144 CPUs it is slightly faster again at ~80K.

Good to know, thanks!  With its race conditions, it's basically a "best
case" for that particular design, which tells me that more significant
changes are required.

Off hand, do you know how large a read each operation is?  I want to
reduce mixback from once per 10 bytes to once per read, and the size
ratio will give me some idea of how large an improvement to expect.

> Spelvin's patch peaks at only 140K at 2 CPUs (so it's slower than base line),
> stays around 120K up to 20, then degrades quickly to 50K and then slowly
> improves again to ~80K.

Sorry to make you go to the trouble; I knew from discussions with Ted
that it wasn't going to work.  It was mostly just in the form of a patch
for the sake of a more concrete discussion.

I'll have a patch that I hope will do some good for testing in a couple
of hours.

> The duplicated pool patch is ~200K up to 20 CPUs, 400K up to 40, 600K at
> slightly below 60 CPUs, and then very slowly degrades to 520K at 144. 

Shitty performance is practically a design goal of /dev/urandom.  You
are NOT supposed to hit it more than once per minute per thread.

But since we have a real-world problem with it, Ted's "abuse mitigation
mode" idea (where the kernel does what the app should do: seed a private
CPRNG and use that) will provide good security at extremely high access
rates.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13 21:10                     ` George Spelvin
@ 2015-10-14  2:15                       ` Andi Kleen
  2015-10-16  5:28                         ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
  2015-10-21  8:27                       ` Updated scalable urandom patchkit George Spelvin
  1 sibling, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2015-10-14  2:15 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux, tytso

> Off hand, do you know how large a read each operation is?  I want to
> reduce mixback from once per 10 bytes to once per read, and the size
> ratio will give me some idea of how large an improvement to expect.

My test reads 64 bytes using the syscall.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 0/4] Alternate sclable urandom patchset
  2015-10-14  2:15                       ` Andi Kleen
@ 2015-10-16  5:28                         ` George Spelvin
  2015-10-16  5:29                           ` [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool George Spelvin
                                             ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16  5:28 UTC (permalink / raw)
  To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Sent as separate patches to make things easier to review; the action is
all in #4.

The first two are minor cleanups I've had in my tree for a while and
felt like working on top of rather than backing out.

The third I'm not happy with, but will serve as a proof-of-concept stub
that I'll rewrite if the rest meets with approval.

Since writing it, I've remembered that we already have a per-CPU "nonce
pool" in the form of the get_random_int_hash array, and I should take
advantage of that.  Another thing I don't like is that the mixback is only
160 bits, 80 of which are private.  I'd rather have twice that, if the
read is large enough.  But that requires more code refactoring.

But it does illustrate the basic idea of using a nonce to avoid the
need to do synchronous mixback.  We can do many calls to extract_buf,
from the same or different processors, without doing mixback.

The fourth is where the fun is.

The names of things are certainly up for debate; creating a subclass
of struct entropy_store called "entropy_store_plus" is not my proudest
moment.  I also considered just making the extra fields global variables.
If someone wants to suggest an arrangement that's good for cache line
sharing, that would be appreciated.

But I think the nonblocking_mixback() function came out rather nice.

This uses two locks, rather than the one in the idea I most recently
posted, because add_interrupt_randomness() needs to take the lock,
and I didn't want to force it to do an unbounded amount of mixback
work.

Using two locks avoids the need for interrupt handlers to do
any mixback, and for readers to loop and thus possibly livelock.

George Spelvin (4):
  random: Reduce stack usage in _xfer_secondary_pool
  random: Remove two unused arguments from extract_entropy()
  random: Only do mixback once per read
  random: Make non-blocking mixback non-blocking

 drivers/char/random.c | 218 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 184 insertions(+), 34 deletions(-)

-- 
2.6.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool
  2015-10-16  5:28                         ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
@ 2015-10-16  5:29                           ` George Spelvin
  2015-10-16  5:30                           ` [RFC PATCH 2/4] random: Remove two unused arguments from extract_entropy() George Spelvin
                                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16  5:29 UTC (permalink / raw)
  To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Rather than asking extract_entropy to fill a large buffer, transfer
bytes in EXTRACT_SIZE chunks using multiple calls to extract_buf.
(Which is what extract_entropy does internally.)

Signed-off-by: George Spelvin <linux@horizon.com>
---
 drivers/char/random.c | 48 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d85..c8ad49ba 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -964,8 +964,9 @@ EXPORT_SYMBOL_GPL(add_disk_randomness);
  *
  *********************************************************************/
 
-static ssize_t extract_entropy(struct entropy_store *r, void *buf,
-			       size_t nbytes, int min, int rsvd);
+static size_t account(struct entropy_store *r, size_t nbytes, int min,
+		      int reserved);
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE]);
 
 /*
  * This utility inline function is responsible for transferring entropy
@@ -994,23 +995,46 @@ static void xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
 
 static void _xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
 {
-	__u32	tmp[OUTPUT_POOL_WORDS];
-
-	/* For /dev/random's pool, always leave two wakeups' worth */
-	int rsvd_bytes = r->limit ? 0 : random_read_wakeup_bits / 4;
+	u8	tmp[EXTRACT_SIZE];
 	int bytes = nbytes;
 
 	/* pull at least as much as a wakeup */
 	bytes = max_t(int, bytes, random_read_wakeup_bits / 8);
 	/* but never more than the buffer size */
-	bytes = min_t(int, bytes, sizeof(tmp));
+	bytes = min_t(int, bytes, OUTPUT_POOL_WORDS*sizeof(u32));
 
+	/*
+	 * FIXME: Move this to after account(), so it shows the true amount
+	 * transferred?
+	 */
 	trace_xfer_secondary_pool(r->name, bytes * 8, nbytes * 8,
 				  ENTROPY_BITS(r), ENTROPY_BITS(r->pull));
-	bytes = extract_entropy(r->pull, tmp, bytes,
-				random_read_wakeup_bits / 8, rsvd_bytes);
-	mix_pool_bytes(r, tmp, bytes);
-	credit_entropy_bits(r, bytes*8);
+
+	/*
+	 * This is the only place we call account() with non-zero
+	 * "min" and "reserved" values.  The minimum is used to
+	 * enforce catastrophic reseeding: if we can't get at least
+	 * random_read_wakeup_bits of entropy, don't bother reseeding
+	 * at all, but wait until a useful amount is available.
+	 *
+	 * The "reserved" is used to prevent reads from /dev/urandom
+	 * from emptying the input pool; leave two wakeups' worth
+	 * for /dev/random.
+	 */
+	bytes = account(r->pull, bytes, random_read_wakeup_bits / 8,
+	                r->limit ? 0 : random_read_wakeup_bits / 4);
+
+	/* Now to the actual transfer, in EXTRACT_SIZE units */
+	while (bytes) {
+		int i = min_t(int, bytes, EXTRACT_SIZE);
+
+		extract_buf(r->pull, tmp);
+		mix_pool_bytes(r, tmp, i);
+		credit_entropy_bits(r, i*8);
+		bytes -= i;
+	}
+
+	memzero_explicit(tmp, sizeof(tmp));
 }
 
 /*
@@ -1087,7 +1111,7 @@ retry:
  *
  * Note: we assume that .poolwords is a multiple of 16 words.
  */
-static void extract_buf(struct entropy_store *r, __u8 *out)
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 {
 	int i;
 	union {
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 2/4] random: Remove two unused arguments from extract_entropy()
  2015-10-16  5:28                         ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
  2015-10-16  5:29                           ` [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool George Spelvin
@ 2015-10-16  5:30                           ` George Spelvin
  2015-10-16  5:33                           ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin
  2015-10-16  5:34                           ` [RFC PATCH 4/4] random: Make non-blocking mixback non-blocking George Spelvin
  3 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16  5:30 UTC (permalink / raw)
  To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Now that _xfer_secondary_pool doesn't need the magic functionality,
the "min" and "reserved" arguments are always zero, so stop passing
them.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 drivers/char/random.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index c8ad49ba..e62b30ba 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1175,7 +1175,7 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
  * pool after each pull to avoid starving other readers.
  */
 static ssize_t extract_entropy(struct entropy_store *r, void *buf,
-				 size_t nbytes, int min, int reserved)
+				 size_t nbytes)
 {
 	ssize_t ret = 0, i;
 	__u8 tmp[EXTRACT_SIZE];
@@ -1199,7 +1199,7 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 
 	trace_extract_entropy(r->name, nbytes, ENTROPY_BITS(r), _RET_IP_);
 	xfer_secondary_pool(r, nbytes);
-	nbytes = account(r, nbytes, min, reserved);
+	nbytes = account(r, nbytes, 0, 0);
 
 	while (nbytes) {
 		extract_buf(r, tmp);
@@ -1284,7 +1284,7 @@ void get_random_bytes(void *buf, int nbytes)
 		       nonblocking_pool.entropy_total);
 #endif
 	trace_get_random_bytes(nbytes, _RET_IP_);
-	extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+	extract_entropy(&nonblocking_pool, buf, nbytes);
 }
 EXPORT_SYMBOL(get_random_bytes);
 
@@ -1374,7 +1374,7 @@ void get_random_bytes_arch(void *buf, int nbytes)
 	}
 
 	if (nbytes)
-		extract_entropy(&nonblocking_pool, p, nbytes, 0, 0);
+		extract_entropy(&nonblocking_pool, p, nbytes);
 }
 EXPORT_SYMBOL(get_random_bytes_arch);
 
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 3/4] random: Only do mixback once per read
  2015-10-16  5:28                         ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
  2015-10-16  5:29                           ` [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool George Spelvin
  2015-10-16  5:30                           ` [RFC PATCH 2/4] random: Remove two unused arguments from extract_entropy() George Spelvin
@ 2015-10-16  5:33                           ` George Spelvin
  2015-10-16  6:12                             ` kbuild test robot
  2015-10-16  6:23                             ` kbuild test robot
  2015-10-16  5:34                           ` [RFC PATCH 4/4] random: Make non-blocking mixback non-blocking George Spelvin
  3 siblings, 2 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16  5:33 UTC (permalink / raw)
  To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Anti-backtracking is only required on read request boundaries, not on
each few bytes of output.

This reduces contention on the pool lock.  Without mixback for each
block of output, some other mechanism must guarantee the hash result
will change each time the pool is hashed.  This is provided by the
CPU ID and a per-CPU counter acting as a nonce.

Although a per-read nonce would be simpler for the current operation,
doing it per-CPU helps later.

This is a draft patch; this change has security implications which I'm
not entirely happy with in its current state, but it serves as a stub
for testing performance improvements.  (The output is, at least, not
so bad as to cause problems in limited deployment for testing.)

Also, allowing concurrent output from a single pool breaks the FIPS
mode anti-repetition test in a fundamental way.  I'm not sure how to
fix that.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 drivers/char/random.c | 70 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index e62b30ba..cf34e83d 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -966,7 +966,8 @@ EXPORT_SYMBOL_GPL(add_disk_randomness);
 
 static size_t account(struct entropy_store *r, size_t nbytes, int min,
 		      int reserved);
-static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE]);
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE],
+			bool mixback);
 
 /*
  * This utility inline function is responsible for transferring entropy
@@ -1028,10 +1029,10 @@ static void _xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
 	while (bytes) {
 		int i = min_t(int, bytes, EXTRACT_SIZE);
 
-		extract_buf(r->pull, tmp);
+		bytes -= i;
+		extract_buf(r->pull, tmp, bytes == 0);
 		mix_pool_bytes(r, tmp, i);
 		credit_entropy_bits(r, i*8);
-		bytes -= i;
 	}
 
 	memzero_explicit(tmp, sizeof(tmp));
@@ -1110,31 +1111,54 @@ retry:
  * extract_entropy_user.
  *
  * Note: we assume that .poolwords is a multiple of 16 words.
+ *
+ * The "mixback" flag enables anti-backtracking.  This need only be
+ * set on the last extract_buf in a contiguous series of requests.
+ * Ensuring distinct outputs is done by including a unique nonce
+ * (consisting of the CPU ID and a per-CPU counter that will not wrap
+ * before a mixback occurs)
+ *
+ * (FIXME: Is one hash's worth of mixback sufficient anti-backtracking
+ * protection?  Should we feed back more?)
  */
-static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE],
+	bool mixback)
 {
 	int i;
 	union {
 		__u32 w[5];
-		unsigned long l[LONGS(20)];
+		unsigned long l[LONGS(16)];
 	} hash;
+	__u32 nonce;
 	__u32 workspace[SHA_WORKSPACE_WORDS];
-	unsigned long flags;
+	static DEFINE_PER_CPU(__u32, random_nonce);
 
 	/*
 	 * If we have an architectural hardware random number
 	 * generator, use it for SHA's initial vector
 	 */
 	sha_init(hash.w);
-	for (i = 0; i < LONGS(20); i++) {
+	for (i = 0; i < LONGS(16); i++) {
 		unsigned long v;
 		if (!arch_get_random_long(&v))
 			break;
 		hash.l[i] = v;
 	}
 
+	/* Add the current CPU ID and nonce */
+	hash.w[3] += get_cpu();
+	nonce = __this_cpu_inc_return(random_nonce);
+	put_cpu();
+
+	/*
+	 * Theoretically, it's possible on a 64-bit system for someone to
+	 * request EXTRACT_SIZE << 32 bytes in one read.  So force mixback
+	 * to be true each time the nonce wraps.
+	 */
+	hash.w[4] += nonce;
+	mixback |= !nonce;
+
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);
 
@@ -1146,9 +1170,11 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 	 * mixing at least a SHA1 worth of hash data back, we make
 	 * brute-forcing the feedback as hard as brute-forcing the
 	 * hash.
+	 *
+	 * FIXME: update security analysis in light of reduced mixback.
 	 */
-	__mix_pool_bytes(r, hash.w, sizeof(hash.w));
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (mixback)
+		mix_pool_bytes(r, hash.w, sizeof(hash.w));
 
 	memzero_explicit(workspace, sizeof(workspace));
 
@@ -1177,7 +1203,7 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 				 size_t nbytes)
 {
-	ssize_t ret = 0, i;
+	ssize_t ret = 0;
 	__u8 tmp[EXTRACT_SIZE];
 	unsigned long flags;
 
@@ -1190,7 +1216,7 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 			trace_extract_entropy(r->name, EXTRACT_SIZE,
 					      ENTROPY_BITS(r), _RET_IP_);
 			xfer_secondary_pool(r, EXTRACT_SIZE);
-			extract_buf(r, tmp);
+			extract_buf(r, tmp, nbytes == 0);
 			spin_lock_irqsave(&r->lock, flags);
 			memcpy(r->last_data, tmp, EXTRACT_SIZE);
 		}
@@ -1202,7 +1228,10 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 	nbytes = account(r, nbytes, 0, 0);
 
 	while (nbytes) {
-		extract_buf(r, tmp);
+		size_t i = min_t(size_t, nbytes, EXTRACT_SIZE);
+
+		nbytes -= i;
+		extract_buf(r, tmp, nbytes == 0);
 
 		if (fips_enabled) {
 			spin_lock_irqsave(&r->lock, flags);
@@ -1211,9 +1240,8 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 			memcpy(r->last_data, tmp, EXTRACT_SIZE);
 			spin_unlock_irqrestore(&r->lock, flags);
 		}
-		i = min_t(int, nbytes, EXTRACT_SIZE);
+
 		memcpy(buf, tmp, i);
-		nbytes -= i;
 		buf += i;
 		ret += i;
 	}
@@ -1231,15 +1259,17 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
 				    size_t nbytes)
 {
-	ssize_t ret = 0, i;
+	ssize_t ret = 0;
 	__u8 tmp[EXTRACT_SIZE];
-	int large_request = (nbytes > 256);
+	bool large_request = (nbytes > 256);
 
 	trace_extract_entropy_user(r->name, nbytes, ENTROPY_BITS(r), _RET_IP_);
 	xfer_secondary_pool(r, nbytes);
 	nbytes = account(r, nbytes, 0, 0);
 
 	while (nbytes) {
+		size_t i;
+
 		if (large_request && need_resched()) {
 			if (signal_pending(current)) {
 				if (ret == 0)
@@ -1249,14 +1279,14 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
 			schedule();
 		}
 
-		extract_buf(r, tmp);
-		i = min_t(int, nbytes, EXTRACT_SIZE);
+		i = min_t(size_t, nbytes, EXTRACT_SIZE);
+		nbytes -= i;
+		extract_buf(r, tmp, i == 0);
 		if (copy_to_user(buf, tmp, i)) {
 			ret = -EFAULT;
 			break;
 		}
 
-		nbytes -= i;
 		buf += i;
 		ret += i;
 	}
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 4/4] random: Make non-blocking mixback non-blocking
  2015-10-16  5:28                         ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
                                             ` (2 preceding siblings ...)
  2015-10-16  5:33                           ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin
@ 2015-10-16  5:34                           ` George Spelvin
  3 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16  5:34 UTC (permalink / raw)
  To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Andi Kleen has reported that some loads cause hideous lock congestion
on the /dev/urandom pool lock.  This patch uses some trickery to
allow extract_buf to avoid taking the pool lock to do mixback in case
of contention.

The fundamental idea is that, if there are multiple concurrent
readers, only one mixback is necessary for antibacktracking.
What a reader has to be sure of is that *some* mixback happens
after the reader is done hashing the pool.

So if one CPU is holding the mixback_lock, the others just spam their
mixback data into a global buffer (races be damned) and set a flag.
The lock holder promises to check that flag and do the mixback
before returning.  This promise applies the entire time the
mixback_lock is held, so the flag must be checked after dropping it.

In theory this could be applied to the /dev/random pool as well, but
that's not subject to the same kind of abuse as urandom, so leave it
alone for the sake of conservatism.

The first two spin_trylock() operations in nonblocking_mixback() are
not strictly necessary; if either one succeeds, that shortens the
code path, but things would work if they failed unconditionally.

Another legal change would be to move the release of the mixback_lock
after the first __mix_pool_bytes.  But since there's a shortcut
optimization if the pool lock is available, that would require keeping
a flag to indicate if the mixback_lock is held and needs to be released.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 drivers/char/random.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 100 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index cf34e83d..54df2982 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -419,7 +419,6 @@ static LIST_HEAD(random_ready_list);
  *
  **********************************************************************/
 
-struct entropy_store;
 struct entropy_store {
 	/* read-only data: */
 	const struct poolinfo *poolinfo;
@@ -465,7 +464,17 @@ static struct entropy_store blocking_pool = {
 					push_to_pool),
 };
 
-static struct entropy_store nonblocking_pool = {
+/* This is a variant with some extra fields for nonblocking mixback */
+struct entropy_store_plus {
+	struct entropy_store store;
+	bool mixback_flag;
+	__u32 mixback[SHA_DIGEST_WORDS];
+	spinlock_t mixback_lock;
+};
+
+#define nonblocking_pool nonblocking_plus.store
+static struct entropy_store_plus nonblocking_plus = {
+    .store = {
 	.poolinfo = &poolinfo_table[1],
 	.name = "nonblocking",
 	.pull = &input_pool,
@@ -473,6 +482,9 @@ static struct entropy_store nonblocking_pool = {
 	.pool = nonblocking_pool_data,
 	.push_work = __WORK_INITIALIZER(nonblocking_pool.push_work,
 					push_to_pool),
+    },
+    .mixback_flag = false,
+    .mixback_lock = __SPIN_LOCK_UNLOCKED(nonblocking_plus.mixback_lock),
 };
 
 static __u32 const twist_table[8] = {
@@ -554,6 +566,83 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,
 	spin_unlock_irqrestore(&r->lock, flags);
 }
 
+/*
+ * Because the /dev/urandom pool is by far the busiest, large machines
+ * with applications that hit the pool hard can fall off the locking
+ * cliff on the pool lock.  (They *should* use a private CPRNG reseeded
+ * infrequently, but there are lazy programmers in the world.)
+ *
+ * This function avoids contention on the pool's lock, by taking
+ * advantage of the fact that mixback from a given pool state is logically
+ * idempotent: if there are multiple threads wanting to mix back, only
+ * one of them has to succeed.  A non-deterministic mixture of the
+ * competing mixback data is also okay, so we use a common temporary
+ * buffer written without locking.
+ *
+ * It is also not necessary that a thread confirm that mixback has
+ * happened before returning to user space; a backtracking window a few
+ * microseconds long is not a concern, as long as it is strongly bounded.
+ * In this case, because the mixback lock is taken without disabling
+ * interrupts, the window is up to the interrupt latency long, but that's
+ * considered acceptable.
+ *
+ * What is important is that, after a mixback has started, any later
+ * pool readers will schedule an additional mixback.  That's what all
+ * the dancing around with mixback_flag and mixback_lock is about.
+ */
+static void nonblocking_mixback(struct entropy_store_plus *r,
+				const void *in, int nbytes)
+{
+	unsigned long flags;
+
+	if (!spin_trylock_irqsave(&r->store.lock, flags)) {
+		/* Someone's currently writing to the pool.  Anyone waiting? */
+		if (!spin_trylock(&r->mixback_lock)) {
+			/* Leave it for the lock holder to take care of. */
+			nbytes = min_t(int, nbytes, sizeof(r->mixback));
+			memcpy(r->mixback, in, nbytes);
+			smp_store_release(&r->mixback_flag, true);
+			smp_wmb();
+			/*
+			 * If the lock is still held, we are guaranteed that
+			 * the lock holder will promptly do the mixback for us.
+			 */
+			if (!spin_trylock(&r->mixback_lock))
+				return;
+		}
+		/*
+		 * Wait in line to do mixback.  This has to be a separate
+		 * lock because we don't want to oblige interrupt entropy
+		 * sources which also take the pool lock to do mixback.
+		 */
+		spin_lock_irqsave(&r->store.lock, flags);
+		spin_unlock(&r->mixback_lock);
+		smp_mb();	/* Between unlock and read of r->mixback flag */
+	}
+
+	__mix_pool_bytes(&r->store, in, nbytes);
+
+	/*
+	 * Do any buffered mixback.  Holding the mixback_lock obliges
+	 * us to check after dropping it, but we only have to check once.
+	 * (We aren't required to check at all if we never took the
+	 * mixback_lock, but it does no harm to.)
+	 */
+	if (READ_ONCE_CTRL(r->mixback_flag)) {
+		__u32 mixback[ARRAY_SIZE(r->mixback)];
+
+		memcpy(mixback, r->mixback, sizeof mixback);
+		memset(r->mixback, 0, sizeof r->mixback);
+		smp_store_release(&r->mixback_flag, false);
+
+		smp_wmb();	/* Ensure release is seen before mixing */
+		__mix_pool_bytes(&r->store, mixback, sizeof mixback);
+		memzero_explicit(mixback, sizeof mixback);
+	}
+	spin_unlock_irqrestore(&r->store.lock, flags);
+}
+
+
 struct fast_pool {
 	__u32		pool[4];
 	unsigned long	last;
@@ -1173,8 +1262,15 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE],
 	 *
 	 * FIXME: update security analysis in light of reduced mixback.
 	 */
-	if (mixback)
-		mix_pool_bytes(r, hash.w, sizeof(hash.w));
+	if (mixback) {
+		if (r->limit) {
+			mix_pool_bytes(r, hash.w, sizeof(hash.w));
+		} else {
+			struct entropy_store_plus *p =
+			    container_of(r, struct entropy_store_plus, store);
+			nonblocking_mixback(p, hash.w, sizeof(hash.w));
+		}
+	}
 
 	memzero_explicit(workspace, sizeof(workspace));
 
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] random: Only do mixback once per read
  2015-10-16  5:33                           ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin
@ 2015-10-16  6:12                             ` kbuild test robot
  2015-10-16  8:11                               ` George Spelvin
  2015-10-16  6:23                             ` kbuild test robot
  1 sibling, 1 reply; 30+ messages in thread
From: kbuild test robot @ 2015-10-16  6:12 UTC (permalink / raw)
  To: George Spelvin
  Cc: kbuild-all, andi, tytso, ahferroin7, jepler, linux-kernel, linux, linux

[-- Attachment #1: Type: text/plain, Size: 6019 bytes --]

Hi George,

[auto build test ERROR on v4.3-rc5 -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url:    https://github.com/0day-ci/linux/commits/George-Spelvin/Alternate-sclable-urandom-patchset/20151016-133627
config: i386-randconfig-i1-201541 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from include/asm-generic/percpu.h:6:0,
                    from arch/x86/include/asm/percpu.h:551,
                    from arch/x86/include/asm/preempt.h:5,
                    from include/linux/preempt.h:64,
                    from include/linux/spinlock.h:50,
                    from include/linux/seqlock.h:35,
                    from include/linux/time.h:5,
                    from include/uapi/linux/timex.h:56,
                    from include/linux/timex.h:56,
                    from include/linux/sched.h:19,
                    from include/linux/utsname.h:5,
                    from drivers/char/random.c:238:
   drivers/char/random.c: In function 'extract_buf':
   include/linux/percpu-defs.h:91:33: error: section attribute cannot be specified for local variables
     extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;  \
                                    ^
   include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION'
     DEFINE_PER_CPU_SECTION(type, name, "")
     ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   include/linux/percpu-defs.h:92:26: error: section attribute cannot be specified for local variables
     __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;   \
                             ^
   include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION'
     DEFINE_PER_CPU_SECTION(type, name, "")
     ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   include/linux/percpu-defs.h:92:26: error: declaration of '__pcpu_unique_random_nonce' with no linkage follows extern declaration
     __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;   \
                             ^
   include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION'
     DEFINE_PER_CPU_SECTION(type, name, "")
     ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   include/linux/percpu-defs.h:91:33: note: previous declaration of '__pcpu_unique_random_nonce' was here
     extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;  \
                                    ^
   include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION'
     DEFINE_PER_CPU_SECTION(type, name, "")
     ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
>> drivers/char/random.c:1134:31: error: section attribute cannot be specified for local variables
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:93:44: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     extern __PCPU_ATTRS(sec) __typeof__(type) name;   \
                                               ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
>> drivers/char/random.c:1134:31: error: section attribute cannot be specified for local variables
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     __typeof__(type) name
                      ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
>> drivers/char/random.c:1134:31: error: weak declaration of 'random_nonce' must be public
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     __typeof__(type) name
                      ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
>> drivers/char/random.c:1134:31: error: declaration of 'random_nonce' with no linkage follows extern declaration
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     __typeof__(type) name
                      ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   drivers/char/random.c:1134:31: note: previous declaration of 'random_nonce' was here
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:93:44: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     extern __PCPU_ATTRS(sec) __typeof__(type) name;   \
                                               ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^

vim +1134 drivers/char/random.c

  1128		union {
  1129			__u32 w[5];
  1130			unsigned long l[LONGS(16)];
  1131		} hash;
  1132		__u32 nonce;
  1133		__u32 workspace[SHA_WORKSPACE_WORDS];
> 1134		static DEFINE_PER_CPU(__u32, random_nonce);
  1135	
  1136		/*
  1137		 * If we have an architectural hardware random number

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 18058 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] random: Only do mixback once per read
  2015-10-16  5:33                           ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin
  2015-10-16  6:12                             ` kbuild test robot
@ 2015-10-16  6:23                             ` kbuild test robot
  1 sibling, 0 replies; 30+ messages in thread
From: kbuild test robot @ 2015-10-16  6:23 UTC (permalink / raw)
  To: George Spelvin
  Cc: kbuild-all, andi, tytso, ahferroin7, jepler, linux-kernel, linux, linux

[-- Attachment #1: Type: text/plain, Size: 6886 bytes --]

Hi George,

[auto build test ERROR on v4.3-rc5 -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url:    https://github.com/0day-ci/linux/commits/George-Spelvin/Alternate-sclable-urandom-patchset/20151016-133627
config: x86_64-randconfig-x010-10130227 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from include/asm-generic/percpu.h:6:0,
                    from arch/x86/include/asm/percpu.h:551,
                    from arch/x86/include/asm/preempt.h:5,
                    from include/linux/preempt.h:64,
                    from include/linux/spinlock.h:50,
                    from include/linux/seqlock.h:35,
                    from include/linux/time.h:5,
                    from include/uapi/linux/timex.h:56,
                    from include/linux/timex.h:56,
                    from include/linux/sched.h:19,
                    from include/linux/utsname.h:5,
                    from drivers/char/random.c:238:
   drivers/char/random.c: In function 'extract_buf':
>> include/linux/percpu-defs.h:91:33: error: section attribute cannot be specified for local variables
     extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;  \
                                    ^
   include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION'
     DEFINE_PER_CPU_SECTION(type, name, "")
     ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   include/linux/percpu-defs.h:92:26: error: section attribute cannot be specified for local variables
     __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;   \
                             ^
   include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION'
     DEFINE_PER_CPU_SECTION(type, name, "")
     ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
>> include/linux/percpu-defs.h:92:26: error: declaration of '__pcpu_unique_random_nonce' with no linkage follows extern declaration
     __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;   \
                             ^
   include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION'
     DEFINE_PER_CPU_SECTION(type, name, "")
     ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   include/linux/percpu-defs.h:91:33: note: previous declaration of '__pcpu_unique_random_nonce' was here
     extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;  \
                                    ^
   include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION'
     DEFINE_PER_CPU_SECTION(type, name, "")
     ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   drivers/char/random.c:1134:31: error: section attribute cannot be specified for local variables
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:93:44: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     extern __PCPU_ATTRS(sec) __typeof__(type) name;   \
                                               ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   drivers/char/random.c:1134:31: error: section attribute cannot be specified for local variables
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     __typeof__(type) name
                      ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   drivers/char/random.c:1134:31: error: weak declaration of 'random_nonce' must be public
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     __typeof__(type) name
                      ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   drivers/char/random.c:1134:31: error: declaration of 'random_nonce' with no linkage follows extern declaration
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     __typeof__(type) name
                      ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^
   drivers/char/random.c:1134:31: note: previous declaration of 'random_nonce' was here
     static DEFINE_PER_CPU(__u32, random_nonce);
                                  ^
   include/linux/percpu-defs.h:93:44: note: in definition of macro 'DEFINE_PER_CPU_SECTION'
     extern __PCPU_ATTRS(sec) __typeof__(type) name;   \
                                               ^
   drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU'
     static DEFINE_PER_CPU(__u32, random_nonce);
            ^

vim +91 include/linux/percpu-defs.h

7c756e6e Tejun Heo     2009-06-24  85  #define DECLARE_PER_CPU_SECTION(type, name, sec)			\
7c756e6e Tejun Heo     2009-06-24  86  	extern __PCPU_DUMMY_ATTRS char __pcpu_scope_##name;		\
dd17c8f7 Rusty Russell 2009-10-29  87  	extern __PCPU_ATTRS(sec) __typeof__(type) name
7c756e6e Tejun Heo     2009-06-24  88  
7c756e6e Tejun Heo     2009-06-24  89  #define DEFINE_PER_CPU_SECTION(type, name, sec)				\
7c756e6e Tejun Heo     2009-06-24  90  	__PCPU_DUMMY_ATTRS char __pcpu_scope_##name;			\
0f5e4816 Tejun Heo     2009-10-29 @91  	extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;		\
7c756e6e Tejun Heo     2009-06-24 @92  	__PCPU_DUMMY_ATTRS char __pcpu_unique_##name;			\
b1a0fbfd Tejun Heo     2013-12-04  93  	extern __PCPU_ATTRS(sec) __typeof__(type) name;			\
c43768cb Tejun Heo     2009-07-04  94  	__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES __weak			\
dd17c8f7 Rusty Russell 2009-10-29  95  	__typeof__(type) name

:::::: The code at line 91 was first introduced by commit
:::::: 0f5e4816dbf38ce9488e611ca2296925c1e90d5e percpu: remove some sparse warnings

:::::: TO: Tejun Heo <tj@kernel.org>
:::::: CC: Tejun Heo <tj@kernel.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 30837 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/4] random: Only do mixback once per read
  2015-10-16  6:12                             ` kbuild test robot
@ 2015-10-16  8:11                               ` George Spelvin
  0 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16  8:11 UTC (permalink / raw)
  To: linux, lkp
  Cc: ahferroin7, andi, jepler, kbuild-all, linux-kernel, linux, tytso

If anyone has the same compile problem, move the "random_nonce"
definition up 10 lines to before the definition of extract_buf().

It compiles (and boots; I'm running it right now) for me as posted;
I'm not sure what the difference is.  Compiler version?  I'm running
gcc 5.2.1.
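
Roughly, the fix looks like this (the get_extract_nonce() helper is just
for illustration, not what I posted); the point is that DEFINE_PER_CPU
must appear at file scope rather than inside extract_buf(), which is what
the "section attribute cannot be specified for local variables" errors
are complaining about:

	#include <linux/percpu.h>

	static DEFINE_PER_CPU(__u32, random_nonce);	/* file scope */

	static __u32 get_extract_nonce(void)
	{
		/* One increment per extraction: a unique, changing per-CPU salt. */
		return this_cpu_inc_return(random_nonce);
	}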

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13 21:10                     ` George Spelvin
  2015-10-14  2:15                       ` Andi Kleen
@ 2015-10-21  8:27                       ` George Spelvin
  2015-10-21 11:47                         ` Andi Kleen
  1 sibling, 1 reply; 30+ messages in thread
From: George Spelvin @ 2015-10-21  8:27 UTC (permalink / raw)
  To: ahferroin7, andi, jepler, linux-kernel, linux, linux, tytso

After having worked out the basics of a non-blocking /dev/urandom read,
(patches posted; I'm hoping someone will comment on them!), I need to
work out the crypto.

Patch 3/4 of that series showed how to have concurrent pool readers not
separated by mixback operations, but it was a stub, not really well
thought through.  Since then, I've been trying to think of how to do
it properly.  That entails figuring out exactly what the design goals
of the mixback are.

The current mixback operation works as follows:
* Generate 160 bits of hash output
* Produce 80 bits of that as user output
* Mix all 160 bits back into the pool.  This achieves two things:
  * Salting.  It changes the pool state so the next hash will
    produce different output.
  * Anti-backtracking protection.  By adding the output back to the input,
    it makes it difficult for an attacker who captures the later pool
    state to work backward and recover earlier outputs.  (A condensed
    sketch of this flow follows the list.)
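
That flow, condensed (a simplification of extract_buf() in the current
drivers/char/random.c; the loop that hashes the whole pool, the locking,
and the hardware-RNG seeding are elided):

	static void extract_sketch(struct entropy_store *r, __u8 out[10])
	{
		__u32 hash[5];

		sha_init(hash);
		/* ... sha_transform(hash, ...) over every 512-bit block of the pool ... */

		/* Mix all 160 bits back in: salting plus anti-backtracking. */
		mix_pool_bytes(r, hash, sizeof(hash));

		/* Fold the hash in half and hand out only 80 bits. */
		hash[0] ^= hash[3];
		hash[1] ^= hash[4];
		hash[2] ^= rol32(hash[2], 16);
		memcpy(out, hash, 10);

		memzero_explicit(hash, sizeof(hash));
	}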

Now, the salting goal can be achieved just as well with a simple
unique nonce.  In the example, it's an incrementing counter and CPU ID.

But the anti-backtracking is more complex.

First of all, we note that the current design is only computationally
secure, not information theoretically.  To achieve the latter, we'd
have to actually destroy entropy in the pool (by zeroing it or doing
something like Spritz's crush() operation), which doesn't seem worth it.

The basic idea of having as much mixback as output seems good.
If you're going to do it at all, you might as well do it right,
and you don't want a 256-bit key to have a 160-bit attack from
a captured pool state.

But there is a practical maximum.  I can't think of a reason why you'd
need more than the output pool size (128 bytes = 1024 bits), but maybe
256 bits is good enough.  Thoughts?

(256 bits also corresponds to one full cycle of the input mix
over the pool, FWIW.)


One thing that's questionable is the fact that the entire 20 bytes is mixed
back.  Because the input hash is not very rapid, it's possible for the
mixback to land on a low-entropy part of the pool and not be well hidden.

If that happens, the previous random output (or partial information about
it) would be recoverable, even if the full previous pool state is not.

I'm not sure how big a problem this is, but only mixing back 80
bits (independent of the output bits) might be better.


A change I'm thinking about, which would simplify implementation
considerably, would be to use all 160 bits of hash output for the user
output, and then do additional hashing, to do the mixback.

This greatly simplifies the contended-pool case by only requiring
readers to pass the *amount* of mixback to the lock holder, which
can be done with an atomic_add.  There's no need to pass actual data,
because the lock-holder can just generate it themselves.

Specifically, in the low-contention case, the reader would use any
leftover hash bytes for mixback if it could get the lock, but it wouldn't
be a big performance hit to throw them away if someone else was holding
the lock.

In the case of extremely high contention, the limit on maximum total
mixback would let the lock holder do less mixback than the total
that would be done by the concurrent readers.


But that requires understanding the reasons for only outputting
half of the hash result, to know if it's safe to make the change.
Three possibilities come to mind:

1. To produce enough mixback.  This is preserved by the proposed change.

2. To avoid entropy loss due to collisions in the non-invertible final
   addition in a "narrow-pipe" hash like SHA.

   This is a generic issue with hashes which do not maintain an internal
   state wider than the input.  This includes MD5, SHA1, SHA2, and
   BLAKE/BLAKE2.  (But not Keccak, SipHash, JH, Gr{\o}stl, Skein, etc.)

   Assume the first half of the pool has lots of entropy, so the 160-bit
   output of the first sha_transform() call is perfectly uniformly
   distributed.  (Each of the 2^160 possible outputs occurs with probability
   2^-160, so 160 bits of entropy by any measure.)

   If the second half of the pool has no entropy (a known, fixed value),
   then sha_transform approximates a 160-to-160 bit random function.
   Such a function has output collisions with probabilities described
   by the Poisson distribution with mean (lambda) 1.  (This number is
   the ratio of input and output state sizes.)

   In particular, 1/e = e^-1 = 36.788% of the outputs do not appear for
   any input.  (If there are k times more inputs than outputs, all equally
   likely, it's e^-k.  So just a few bits reduces that drastically.)

   If you do the Shannon entropy math (the expected value of K*log2(K)
   for K ~ Poisson(1)), the output has 0.827245389... fewer bits of
   entropy than the input.  (A quick numeric check of this constant
   appears after this list.)
   (If there are k states in the second half of the pool, the entropy loss
   is slightly less than 1/k of this value.  So 163 bits in produces 160 -
   0.827/8 = 159.9 bits out.)

   If you're going by min-entropy, it's worse.  Over 2^160 inputs,
   the Poisson distribution predicts 3.5 41-way collisions, and 0.0866
   42-way collisions.  So the loss is log2(41) = 5.358 bits.  (If there
   are 8 states in the second half, then we expect a 77-way collision, for
   a min-entropy of 160 - log2(77/8) = 160 - 3.2668 = 156.73 bits.)

   (For more on this, the work of Andrea R\"ock, e.g. "Collision Attacks
   based on the Entropy Loss caused by Random Functions", is good reading.)

   Although this affects all narrow-pipe hashes, you only need to chop
   a few bits off the state size to be nearly perfect in Shannon terms,
   and 8 bits gets you less than 1 bit of min-entropy lost.  2 bytes
   makes the largest collision that you expect to find at least 1 of among
   2^176 inputs is a 69270-way, resulting in a loss of log2(69270/65536)
   = 0.08 bit of min-entropy.

   So while I can see the point to truncating the hash, cutting it in half
   seems unnecessary.

3. General "I don't trust SHA-1" principles without a more specific
   reason.  If this is all there is, would substituting a different hash
   (I'm looking into BLAKE2 at Ted's suggestion) help?

   Note that the entire SHAppening issue is not particularly relevant
   to /dev/urandom attacks.  Attacking the /dev/urandom output mix is
   more like a pre-image attack, but considerably harder, since you're
   trying to find not just *a* pre-image, but *the* pre-image.

   Using the expected pre-image resistance of the hash is insanely
   conservative.  Perhaps not a bad thing, though.  But could we use
   128 bits rather than 80?
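
As a sanity check on that 0.827 figure, a trivial user-space computation
of E[K*log2(K)] for K ~ Poisson(1) (just arithmetic, not kernel code):

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
		double pk = exp(-1.0);	/* P(K = 0) for Poisson with mean 1 */
		double loss = 0.0;
		int k;

		for (k = 1; k <= 40; k++) {
			pk /= k;	/* now pk = P(K = k) = e^-1 / k! */
			loss += k * log2((double)k) * pk;
		}
		printf("Shannon entropy loss: %.9f bits\n", loss);	/* ~0.8272 */
		return 0;
	}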


Finally, I'm trying to figure out how to use the get_random_int_hash
array for salt material.  We already have it, so it would be nice to
use it.  But the way it's currently used (basically, using
random_int_secret as a key and running in OFB mode) leaves the last
random number visible.

A pool reader must change its salt if it can't get the mixback lock.
There may be a few bytes of leftover entropy available for the purpose
if the hash size doesn't divide the read size.

I could always do another hash iteration to reseed get_random_int_hash if
I can't get the mixback lock, but is there a cheaper way?
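
One cheap possibility (only a sketch of the idea, not something I've
posted or measured) is to fold any leftover hash words into this CPU's
get_random_int_hash state and skip the extra hash iteration entirely:

	static void resalt_cheap(const __u32 *leftover, int nwords)
	{
		__u32 *h = get_cpu_var(get_random_int_hash);
		int i;

		/* XOR the unused hash output into the per-CPU state; for pure
		 * salting purposes, even this simple mixing suffices. */
		for (i = 0; i < nwords && i < MD5_DIGEST_WORDS; i++)
			h[i] ^= leftover[i];
		put_cpu_var(get_random_int_hash);
	}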

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-21  8:27                       ` Updated scalable urandom patchkit George Spelvin
@ 2015-10-21 11:47                         ` Andi Kleen
  2015-10-21 18:10                           ` George Spelvin
  0 siblings, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2015-10-21 11:47 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux, tytso



On Wed, Oct 21, 2015 at 04:27:43AM -0400, George Spelvin wrote:
> After having worked out the basics of a non-blocking /dev/urandom read,
> (patches poated; I'm hoping someone will comment on them!), I need to
> work out the crypto.

Can we just use my multi pool patch for now? It works and is actually scalable
and does not require any new "cryptographic research" or other risks.

The main concern I heard was code size. I addressed this in the latest
version by making the new code depend on CONFIG_NUMA, which is generally
not set on small systems.

-Andi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-21 11:47                         ` Andi Kleen
@ 2015-10-21 18:10                           ` George Spelvin
  0 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-21 18:10 UTC (permalink / raw)
  To: andi, linux; +Cc: ahferroin7, jepler, linux-kernel, linux, tytso

> Can we just use my multi pool patch for now? It works and is actually scalable
> and does not require any new "cryptographic research" or other risks.

To clarify, it's actually quite easy to come up with something that works,
cryptographically.  All the discussion is:
1) Erring on the side of public discussion for a security-sensitive area, and
2) Trying to find "The Right Thing" among the many workable solutions.

Earlier, Ted seemed to be interested in more significant changes like
replacing sha_transform and "abuse mitigation mode", so I've been
thinking expansively.  The immediate plans aren't necessarily quite
as complex.

Mostly, it's an embarrassment of ways to do it, and trying to find
a reason to choose one over the other.  I wrote one very quickly, said
"I can do better", thought about coding that up, thought of an even
better solution, and decided to post the performance-affecting part
while I played donkey-between-two-piles-of-hay.

> or other risks.

Ted gets to decide, but my objection is that the already-sparse entropy
supply is considerably diluted.  That *is* a risk.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Updated scalable urandom patchkit
@ 2015-09-24 17:19 Andi Kleen
  0 siblings, 0 replies; 30+ messages in thread
From: Andi Kleen @ 2015-09-24 17:19 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, linux

Updated patchkit to make /dev/urandom scale on larger systems.
I addressed all the review comments. See ChangeLogs in the individual
patches.

-Andi


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2015-10-21 18:10 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-10 18:45 Updated scalable urandom patchkit George Spelvin
2015-10-11  2:31 ` Theodore Ts'o
2015-10-11  2:53   ` Theodore Ts'o
2015-10-11  4:35   ` George Spelvin
2015-10-11 22:25     ` Theodore Ts'o
2015-10-12  0:16       ` George Spelvin
2015-10-12  4:05         ` Theodore Ts'o
2015-10-12  7:49           ` George Spelvin
2015-10-12 13:54             ` Theodore Ts'o
2015-10-12 20:30               ` George Spelvin
2015-10-12 20:34                 ` George Spelvin
2015-10-13  2:46                 ` Theodore Ts'o
2015-10-13  3:50                   ` Raymond Jennings
2015-10-13  7:50                     ` George Spelvin
2015-10-13  6:24                   ` George Spelvin
2015-10-13 16:20                   ` Andi Kleen
2015-10-13 21:10                     ` George Spelvin
2015-10-14  2:15                       ` Andi Kleen
2015-10-16  5:28                         ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
2015-10-16  5:29                           ` [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool George Spelvin
2015-10-16  5:30                           ` [RFC PATCH 2/4] random: Remove two unused arguments from extract_entropy() George Spelvin
2015-10-16  5:33                           ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin
2015-10-16  6:12                             ` kbuild test robot
2015-10-16  8:11                               ` George Spelvin
2015-10-16  6:23                             ` kbuild test robot
2015-10-16  5:34                           ` [RFC PATCH 4/4] random: Make non-blocking mixback non-blocking George Spelvin
2015-10-21  8:27                       ` Updated scalable urandom patchkit George Spelvin
2015-10-21 11:47                         ` Andi Kleen
2015-10-21 18:10                           ` George Spelvin
  -- strict thread matches above, loose matches on Subject: below --
2015-09-24 17:19 Andi Kleen
