From: "Theodore Ts'o" <tytso@mit.edu>
To: George Spelvin <linux@horizon.com>
Cc: andi@firstfloor.org, ahferroin7@gmail.com, jepler@unpythonic.net,
	linux-kernel@vger.kernel.org, linux@rasmusvillemoes.dk
Subject: Re: Updated scalable urandom patchkit
Date: Sat, 10 Oct 2015 22:31:46 -0400
Message-ID: <20151011023146.GA5341@thunk.org>
In-Reply-To: <20151010184546.759.qmail@ns.horizon.com>

On Sat, Oct 10, 2015 at 02:45:46PM -0400, George Spelvin wrote:
> In general, fewer larger pools is better for entropy storage.  The only
> reason for the current three-pool design is to achieve strict isolation
> of the /dev/random pool.

You're absolutely right, and I would like to see if we can get away
with keeping a single-pool design.  Your idea of using the CPU id
as a nonce is a good one, and I think we can do something similar that
should be good enough for mixing the hash back into the pool.  We can
simply use a hash of the CPU id to change the offset where we do the
mixing, and this should be good enough to avoid collisions when we do
the add-back.

HOWEVER.  One downside of this approach is that, especially on a NUMA
system, the costs of the cache coherency across a single pool which is
constantly being modified and shared by all of the NUMA nodes could be
a killer, even if we go completely lockless.

To that end, Andi, can you try benchmarking the scalability of the
patch below?  I really hope it will be good enough, since besides
using less memory, there are security advantages in not spreading the
entropy across N pools.

If it isn't, we might be able to play a trick where we sample the
r->add_ptr value before we start hashing the pool, and then check to
see if it's changed afterwards.  If it has, we could skip doing the
hash back, which could reduce the cache coherency traffic, since as
you point out:

> 2d. A more complete solution involves noting that, if there are multiple
>     concurrent readers, only one has to add back its output to prevent
>     backtracking, because all of the concurrent reads are equivalent
>     in that sense.  (Even though, because they're salted with separate
>     nonces like the CPU id, they're not identical.)

However, even if we put in that optimization, the primary question is
how good Intel's cache coherency protocols are on their NUMA systems.
I'm pretty sure this would be a disaster on, say, Sequent's old NUMA
machines, but I'm quite confident all of those servers are dead and
buried by now.  :-)

						- Ted

commit 3cb51896deab45bddc1b8f571b1103eae8f50e0e
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Oct 10 22:03:53 2015 -0400

    random: make the reading from the non-blocking pool more scalable
    
    Andi Kleen reported a case where a 4 socket system spent >80% of its
    total CPU time contending on the global urandom nonblocking pool
    spinlock. While the application could probably have used its own PRNG,
    it may have valid reasons to use the best possible key for different
    session keys.
    
    Instead of creating separate multiple per-NUMA node non-blocking
    pools, use a trick suggested by George Spelvin:
    
       Entropy is a holographic property of the pool, not located in any
       particular bit position.
    
       And the most basic operation, of reading from the pool, can easily be
       done by multiple readers at the same time from the same bit pattern.
       They just need to be salted with distinguishing nonces (CPU IDs will do
       nicely) to ensure they all get different results....
    
    We use this trick, and in addition use a hash of the cpu id to change
    where we mix the hash back into the pool to avoid collisions.
    
    Since we are already using a lockless technique (cmpxchg) to update
    the entropy accounting, we don't need to change this around.
    
    This technique won't be quite as scalable, since on a NUMA system we
    will still force cache lines to bounce between nodes, but from the
    perspective of entropy storage we're much better off using a single
    pool rather than spreading it across multiple pools.
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..be6b315 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -260,6 +260,7 @@
 #include <linux/irq.h>
 #include <linux/syscalls.h>
 #include <linux/completion.h>
+#include <linux/hash.h>
 
 #include <asm/processor.h>
 #include <asm/uaccess.h>
@@ -494,7 +495,9 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 {
 	unsigned long i, tap1, tap2, tap3, tap4, tap5;
 	int input_rotate;
+	int is_nonblock = (r == &nonblocking_pool);
 	int wordmask = r->poolinfo->poolwords - 1;
+	int poolbits = r->poolinfo->poolbitshift - 5;
 	const char *bytes = in;
 	__u32 w;
 
@@ -506,6 +509,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 
 	input_rotate = r->input_rotate;
 	i = r->add_ptr;
+	if (is_nonblock)
+		i += hash_32(get_cpu(), poolbits);
 
 	/* mix one byte at a time to simplify size handling and churn faster */
 	while (nbytes--) {
@@ -534,6 +539,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 
 	r->input_rotate = input_rotate;
 	r->add_ptr = i;
+	if (is_nonblock)
+		put_cpu();
 }
 
 static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1090,6 +1097,7 @@ retry:
 static void extract_buf(struct entropy_store *r, __u8 *out)
 {
 	int i;
+	int is_nonblock = (r == &nonblocking_pool);
 	union {
 		__u32 w[5];
 		unsigned long l[LONGS(20)];
@@ -1108,9 +1116,12 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 			break;
 		hash.l[i] = v;
 	}
+	if (is_nonblock)
+		hash.w[0] ^= hash_32(get_cpu(), 32);
+	else
+		spin_lock_irqsave(&r->lock, flags);
 
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);
 
@@ -1124,7 +1135,10 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 	 * hash.
 	 */
 	__mix_pool_bytes(r, hash.w, sizeof(hash.w));
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (is_nonblock)
+		put_cpu();
+	else
+		spin_unlock_irqrestore(&r->lock, flags);
 
 	memzero_explicit(workspace, sizeof(workspace));
 
