* [PATCH 1/3] Make /dev/urandom scalable
@ 2015-09-22 23:16 Andi Kleen
  2015-09-22 23:16 ` [PATCH 2/3] random: Make input to output pool balancing per cpu Andi Kleen
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: Andi Kleen @ 2015-09-22 23:16 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, kirill.shutemov, herbert, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

We had a case where a 4 socket system spent >80% of its total CPU time
contending on the global urandom nonblocking pool spinlock. While the
application could probably have used its own PRNG, it has valid
reasons to want the best possible randomness for its session keys.

The application still ran acceptably on 2S, but fell over
the locking cliff on 4S.

Implementation
==============

The non blocking pool is used widely these days, from every execve() (to
set up AT_RANDOM for ld.so randomization), to getrandom(2), to frequent
/dev/urandom users in user space. Clearly having such a popular resource
under a global lock is a bad thing.

This patch changes the random driver to use distributed per NUMA node
nonblocking pools. The basic structure is not changed: entropy is
first fed into the input pool and later distributed from there
round-robin into the blocking and non blocking pools. This patch extends
this to use a dedicated non blocking pool for each node, and to distribute
evenly from the input pool into these distributed pools, in
addition to the blocking pool.

Every urandom/getrandom user then fetches data from its node-local
pool. At boot time, when users may still be waiting for the non
blocking pool initialization, we use the node 0 non blocking pool,
to avoid the need for different wake up queues.

For single node systems (like the vast majority of non server systems)
nothing changes. There is still only a single non blocking pool.

The different per-node pools also start with different initial
states and diverge more and more over time, as they get
fed different input data. So "replay" attacks become
difficult after some time.

Without hardware random number seed support the start states
(until enough real entropy is collected) are not very random, but
that's no worse than before.

Since we still have a global input pool, there are no problems
with load balancing entropy data between nodes. Any node that never
runs any interrupts still gets the same amount of entropy as
other nodes.

Entropy is preferentially fed to the nodes that need it more, using
the existing 75% threshold.

For saving/restoring /dev/urandom, there is currently no mechanism
to access the non-local node pools (short of setting task affinity).
This implies that the standard init/exit random save/restore
scripts would only save node 0. On restore all pools are updated,
so the entropy of the non-0 nodes gets lost over a reboot. That seems
acceptable to me for now (fixing this would need a new, separate
save/restore interface).
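
For illustration, a save script that wanted to cover all nodes could do
something like the following from user space. This is only a rough
sketch using libnuma, not part of this patch; the seed file name and the
per-node read size are made up:

	/* sketch: read some urandom output while pinned to each node in turn */
	#include <numa.h>
	#include <stdio.h>

	int main(void)
	{
		char buf[512];
		FILE *in, *out = fopen("/var/lib/random-node-seeds", "w");
		int node;

		if (!out || numa_available() < 0)
			return 1;
		for (node = 0; node <= numa_max_node(); node++) {
			if (numa_run_on_node(node) < 0)
				continue;	/* node has no usable CPUs */
			in = fopen("/dev/urandom", "r");
			if (!in)
				break;
			if (fread(buf, 1, sizeof(buf), in) == sizeof(buf))
				fwrite(buf, 1, sizeof(buf), out);
			fclose(in);
		}
		fclose(out);
		return 0;
	}

On restore, simply writing the saved data to /dev/urandom reaches all
pools anyway, since random_write() feeds every node pool.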

Scalability
===========

I tested the patch with a simple will-it-scale test banging
on get_random() in parallel on more and more CPUs. Of course
that is not a realistic scenario, as real programs should
do some work between getting random numbers, but it is a worst
case for random scalability.
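
The per-thread loop is essentially just the following (simplified
sketch, not the actual will-it-scale harness; the thread count, buffer
size and iteration count are arbitrary). Throughput here is total
iterations divided by wall-clock time:

	/* sketch: N threads each read /dev/urandom in a tight loop */
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdlib.h>
	#include <unistd.h>

	static void *bench_thread(void *arg)
	{
		char buf[64];
		unsigned long i;
		int fd = open("/dev/urandom", O_RDONLY);

		if (fd < 0)
			return NULL;
		for (i = 0; i < 1000000; i++)	/* fixed work per thread */
			read(fd, buf, sizeof(buf));
		close(fd);
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
		pthread_t *tids = calloc(nthreads, sizeof(*tids));

		for (i = 0; i < nthreads; i++)
			pthread_create(&tids[i], NULL, bench_thread, NULL);
		for (i = 0; i < nthreads; i++)
			pthread_join(tids[i], NULL);
		return 0;
	}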

On a 4S Xeon v3 system _without_ the patchkit the benchmark
maxes out when using all the threads of one node. After
that it quickly settles at about half the throughput of
one node with 2-4 nodes.

(all throughput factors, bigger is better)
Without patchkit:

1 node:  1x
2 nodes: 0.75x
3 nodes: 0.55x
4 nodes: 0.42x

With the patchkit applied:

1 node:  1x
2 nodes: 2x
3 nodes: 3.4x
4 nodes: 6x

So it's not quite linear scalability, but 6x maximum throughput
is already a lot better.

A node can still have a large number of CPUs: on my test system 36
logical CPU threads (18C * 2T). In principle it may make sense to
split further, but per logical CPU would clearly be overkill, and
it would also add more pressure on the input pool. For now per node
seems like an acceptable compromise.

/dev/random still uses a single global lock. For now that seems
acceptable, as it normally cannot be used for real high-volume
accesses anyway.

The input pool also still uses a global lock. The existing per-CPU
fast pools and the "give up when busy" mechanism seem to scale well
enough even on larger systems.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c | 142 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 114 insertions(+), 28 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..d0302be 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -156,6 +156,17 @@
  * particular randomness source.  They do this by keeping track of the
  * first and second order deltas of the event timings.
  *
+ * Distributed pools
+ * =================
+ *
+ * On larger systems the locking on the single non blocking pool can
 + * become a bottleneck. To avoid this, we use a separate non blocking pool
+ * for each NUMA node. The distributed pools are fed round robin from
+ * the input pool. Each user then only reads entropy from their local
+ * pool.
+ *
+ * For the blocking pool there is still only a single instance.
+ *
  * Ensuring unpredictability at system startup
  * ============================================
  *
@@ -467,7 +478,7 @@ static struct entropy_store blocking_pool = {
 
 static struct entropy_store nonblocking_pool = {
 	.poolinfo = &poolinfo_table[1],
-	.name = "nonblocking",
+	.name = "nonblocking 0",
 	.pull = &input_pool,
 	.lock = __SPIN_LOCK_UNLOCKED(nonblocking_pool.lock),
 	.pool = nonblocking_pool_data,
@@ -475,6 +486,32 @@ static struct entropy_store nonblocking_pool = {
 					push_to_pool),
 };
 
+/*
+ * Per NUMA node nonblocking pool. This avoids lock contention
+ * when many processes extract data from /dev/urandom in parallel.
+ * /dev/random stays single instance for now.
+ */
+static struct entropy_store **nonblocking_node_pool __read_mostly;
+
+#define for_each_nb_pool(i, pool) for (i = 0; i < num_possible_nodes(); i++) { \
+	pool = nonblocking_node_pool[i];
+#define end_for_each_nb() }
+
+static inline struct entropy_store *get_nonblocking_pool(void)
+{
+	struct entropy_store *pool = &nonblocking_pool;
+
+	/*
+	 * Non node 0 pools may take longer to initialize. Keep using
+	 * the boot nonblocking pool while this happens.
+	 */
+	if (nonblocking_node_pool)
+		pool = nonblocking_node_pool[numa_node_id()];
+	if (!pool->initialized)
+		pool = &nonblocking_pool;
+	return pool;
+}
+
 static __u32 const twist_table[8] = {
 	0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
 	0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
@@ -698,18 +735,23 @@ retry:
 			kill_fasync(&fasync, SIGIO, POLL_IN);
 		}
 		/* If the input pool is getting full, send some
-		 * entropy to the two output pools, flipping back and
-		 * forth between them, until the output pools are 75%
+		 * entropy to the other output pools, cycling
+		 * between them, until the output pools are 75%
 		 * full.
+		 * Feed into all pools round robin.
 		 */
 		if (entropy_bits > random_write_wakeup_bits &&
 		    r->initialized &&
 		    r->entropy_total >= 2*random_read_wakeup_bits) {
 			static struct entropy_store *last = &blocking_pool;
+			static int next_pool = -1;
 			struct entropy_store *other = &blocking_pool;
 
-			if (last == &blocking_pool)
-				other = &nonblocking_pool;
+			/* -1: use blocking pool, 0<=max_node: node nb pool */
+			if (next_pool > -1)
+				other = nonblocking_node_pool[next_pool];
+			if (++next_pool >= num_possible_nodes())
+				next_pool = -1;
 			if (other->entropy_count <=
 			    3 * other->poolinfo->poolfracbits / 4)
 				last = other;
@@ -748,6 +790,17 @@ struct timer_rand_state {
 
 #define INIT_TIMER_RAND_STATE { INITIAL_JIFFIES, };
 
+static void pool_add_buf(struct entropy_store *pool,
+			 const void *buf, unsigned int size)
+{
+	unsigned long flags;
+	unsigned long time = random_get_entropy() ^ jiffies;
+	spin_lock_irqsave(&pool->lock, flags);
+	_mix_pool_bytes(pool, buf, size);
+	_mix_pool_bytes(pool, &time, sizeof(time));
+	spin_unlock_irqrestore(&pool->lock, flags);
+}
+
 /*
  * Add device- or boot-specific data to the input and nonblocking
  * pools to help initialize them to unique values.
@@ -758,19 +811,18 @@ struct timer_rand_state {
  */
 void add_device_randomness(const void *buf, unsigned int size)
 {
-	unsigned long time = random_get_entropy() ^ jiffies;
-	unsigned long flags;
+	struct entropy_store *pool;
+	int i;
 
 	trace_add_device_randomness(size, _RET_IP_);
-	spin_lock_irqsave(&input_pool.lock, flags);
-	_mix_pool_bytes(&input_pool, buf, size);
-	_mix_pool_bytes(&input_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&input_pool.lock, flags);
-
-	spin_lock_irqsave(&nonblocking_pool.lock, flags);
-	_mix_pool_bytes(&nonblocking_pool, buf, size);
-	_mix_pool_bytes(&nonblocking_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&nonblocking_pool.lock, flags);
+	pool_add_buf(&input_pool, buf, size);
+	if (!nonblocking_node_pool) {
+		pool_add_buf(&nonblocking_pool, buf, size);
+	} else {
+		for_each_nb_pool (i, pool) {
+			pool_add_buf(pool, buf, size);
+		} end_for_each_nb()
+	}
 }
 EXPORT_SYMBOL(add_device_randomness);
 
@@ -1252,15 +1304,16 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
  */
 void get_random_bytes(void *buf, int nbytes)
 {
+	struct entropy_store *pool = get_nonblocking_pool();
 #if DEBUG_RANDOM_BOOT > 0
-	if (unlikely(nonblocking_pool.initialized == 0))
+	if (unlikely(pool->initialized == 0))
 		printk(KERN_NOTICE "random: %pF get_random_bytes called "
 		       "with %d bits of entropy available\n",
 		       (void *) _RET_IP_,
-		       nonblocking_pool.entropy_total);
+		       pool->entropy_total);
 #endif
 	trace_get_random_bytes(nbytes, _RET_IP_);
-	extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+	extract_entropy(pool, buf, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes);
 
@@ -1350,7 +1403,7 @@ void get_random_bytes_arch(void *buf, int nbytes)
 	}
 
 	if (nbytes)
-		extract_entropy(&nonblocking_pool, p, nbytes, 0, 0);
+		extract_entropy(get_nonblocking_pool(), p, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes_arch);
 
@@ -1393,9 +1446,32 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+	int i;
+	int num_nodes = num_possible_nodes();
+	char name[40];
+
+	nonblocking_node_pool = kzalloc(num_nodes * sizeof(void *),
+					GFP_KERNEL|__GFP_NOFAIL);
+
 	init_std_data(&input_pool);
 	init_std_data(&blocking_pool);
+	nonblocking_node_pool[0] = &nonblocking_pool;
 	init_std_data(&nonblocking_pool);
+	for (i = 1; i < num_nodes; i++) {
+		struct entropy_store *pool = kzalloc(sizeof(struct entropy_store),
+						     GFP_KERNEL|__GFP_NOFAIL);
+		nonblocking_node_pool[i] = pool;
+		pool->poolinfo = &poolinfo_table[1];
+		pool->pull = &input_pool;
+		spin_lock_init(&pool->lock);
+		/* pool data not cleared intentionally */
+		pool->pool = kmalloc(sizeof(nonblocking_pool_data),
+				     GFP_KERNEL|__GFP_NOFAIL);
+		INIT_WORK(&pool->push_work, push_to_pool);
+		snprintf(name, sizeof name, "nonblocking pool %d", i);
+		pool->name = kstrdup(name, GFP_KERNEL|__GFP_NOFAIL);
+		init_std_data(pool);
+	}
 	return 0;
 }
 early_initcall(rand_initialize);
@@ -1458,17 +1534,19 @@ static ssize_t
 urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
 	int ret;
+	struct entropy_store *pool;
 
-	if (unlikely(nonblocking_pool.initialized == 0))
+	pool = get_nonblocking_pool();
+	if (unlikely(pool->initialized == 0))
 		printk_once(KERN_NOTICE "random: %s urandom read "
 			    "with %d bits of entropy available\n",
 			    current->comm, nonblocking_pool.entropy_total);
 
 	nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
-	ret = extract_entropy_user(&nonblocking_pool, buf, nbytes);
+	ret = extract_entropy_user(pool, buf, nbytes);
 
-	trace_urandom_read(8 * nbytes, ENTROPY_BITS(&nonblocking_pool),
-			   ENTROPY_BITS(&input_pool));
+	trace_urandom_read(8 * nbytes, ENTROPY_BITS(pool),
+			   ENTROPY_BITS(pool));
 	return ret;
 }
 
@@ -1513,22 +1591,28 @@ static ssize_t random_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
 	size_t ret;
+	struct entropy_store *pool;
+	int i;
 
 	ret = write_pool(&blocking_pool, buffer, count);
 	if (ret)
 		return ret;
-	ret = write_pool(&nonblocking_pool, buffer, count);
-	if (ret)
-		return ret;
+	for_each_nb_pool (i, pool) {
+		ret = write_pool(pool, buffer, count);
+		if (ret)
+			return ret;
+	} end_for_each_nb()
 
 	return (ssize_t)count;
 }
 
 static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 {
+	int i;
 	int size, ent_count;
 	int __user *p = (int __user *)arg;
 	int retval;
+	struct entropy_store *pool;
 
 	switch (cmd) {
 	case RNDGETENTCNT:
@@ -1568,7 +1652,9 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
 		input_pool.entropy_count = 0;
-		nonblocking_pool.entropy_count = 0;
+		for_each_nb_pool (i, pool) {
+			pool->entropy_count = 0;
+		} end_for_each_nb();
 		blocking_pool.entropy_count = 0;
 		return 0;
 	default:
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/3] random: Make input to output pool balancing per cpu
  2015-09-22 23:16 [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
@ 2015-09-22 23:16 ` Andi Kleen
  2015-09-22 23:16 ` [PATCH 3/3] random: Add pool name to urandom_read trace point Andi Kleen
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2015-09-22 23:16 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, kirill.shutemov, herbert, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The load balancing from the input pool to the output pools was
essentially unlocked. Before, it didn't matter much because
there were only two choices (blocking and non blocking).

But now, with the distributed non blocking pools, we have
a lot more pools, and unlocked access to the counters
may systematically deprive some nodes of their share of
the entropy.

Turn the round-robin state into per-CPU variables
to avoid any possibility of races. This code already
runs with preemption disabled.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0302be..b74919a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -743,15 +743,20 @@ retry:
 		if (entropy_bits > random_write_wakeup_bits &&
 		    r->initialized &&
 		    r->entropy_total >= 2*random_read_wakeup_bits) {
-			static struct entropy_store *last = &blocking_pool;
-			static int next_pool = -1;
-			struct entropy_store *other = &blocking_pool;
+			static DEFINE_PER_CPU(struct entropy_store *, lastp) =
+				&blocking_pool;
+			static DEFINE_PER_CPU(int, next_pool);
+			struct entropy_store *other = &blocking_pool, *last;
+			int np;
 
 			/* -1: use blocking pool, 0<=max_node: node nb pool */
-			if (next_pool > -1)
-				other = nonblocking_node_pool[next_pool];
-			if (++next_pool >= num_possible_nodes())
-				next_pool = -1;
+			np = __this_cpu_read(next_pool);
+			if (np > -1)
+				other = nonblocking_node_pool[np];
+			if (++np >= num_possible_nodes())
+				np = -1;
+			__this_cpu_write(next_pool, np);
+			last = __this_cpu_read(lastp);
 			if (other->entropy_count <=
 			    3 * other->poolinfo->poolfracbits / 4)
 				last = other;
@@ -760,6 +765,7 @@ retry:
 				schedule_work(&last->push_work);
 				r->entropy_total = 0;
 			}
+			__this_cpu_write(lastp, last);
 		}
 	}
 }
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/3] random: Add pool name to urandom_read trace point
  2015-09-22 23:16 [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
  2015-09-22 23:16 ` [PATCH 2/3] random: Make input to output pool balancing per cpu Andi Kleen
@ 2015-09-22 23:16 ` Andi Kleen
  2015-09-22 23:25 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2015-09-22 23:16 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, kirill.shutemov, herbert, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Now that we have multiple nonblocking pools, it makes sense to report
the name of the pool in the urandom_read trace point. Extend
the trace point to report the name as well.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c         |  2 +-
 include/trace/events/random.h | 10 ++++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b74919a..23ec90e 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1551,7 +1551,7 @@ urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 	nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
 	ret = extract_entropy_user(pool, buf, nbytes);
 
-	trace_urandom_read(8 * nbytes, ENTROPY_BITS(pool),
+	trace_urandom_read(pool->name, 8 * nbytes, ENTROPY_BITS(pool),
 			   ENTROPY_BITS(pool));
 	return ret;
 }
diff --git a/include/trace/events/random.h b/include/trace/events/random.h
index 4684de3..3b5ab5a 100644
--- a/include/trace/events/random.h
+++ b/include/trace/events/random.h
@@ -288,25 +288,27 @@ TRACE_EVENT(random_read,
 );
 
 TRACE_EVENT(urandom_read,
-	TP_PROTO(int got_bits, int pool_left, int input_left),
+	TP_PROTO(const char *pool, int got_bits, int pool_left, int input_left),
 
-	TP_ARGS(got_bits, pool_left, input_left),
+	TP_ARGS(pool, got_bits, pool_left, input_left),
 
 	TP_STRUCT__entry(
+		__field( const char *,  pool			)
 		__field(	  int,	got_bits		)
 		__field(	  int,	pool_left		)
 		__field(	  int,	input_left		)
 	),
 
 	TP_fast_assign(
+		__entry->pool		= pool;
 		__entry->got_bits	= got_bits;
 		__entry->pool_left	= pool_left;
 		__entry->input_left	= input_left;
 	),
 
 	TP_printk("got_bits %d nonblocking_pool_entropy_left %d "
-		  "input_entropy_left %d", __entry->got_bits,
-		  __entry->pool_left, __entry->input_left)
+		  "input_entropy_left %d from %s", __entry->got_bits,
+		  __entry->pool_left, __entry->input_left, __entry->pool)
 );
 
 #endif /* _TRACE_RANDOM_H */
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-22 23:16 [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
  2015-09-22 23:16 ` [PATCH 2/3] random: Make input to output pool balancing per cpu Andi Kleen
  2015-09-22 23:16 ` [PATCH 3/3] random: Add pool name to urandom_read trace point Andi Kleen
@ 2015-09-22 23:25 ` Andi Kleen
  2015-09-23 10:32 ` Rasmus Villemoes
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2015-09-22 23:25 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, kirill.shutemov, herbert

Andi Kleen <andi@firstfloor.org> writes:
>
> With the patchkit applied:
>
> 1 node:  1x
> 2 nodes: 2x
> 3 nodes: 3.4x
> 4 nodes: 6x

Sorry, there was a typo in the numbers. The correct results are:

With the patchkit applied:

    1 node:  1x
    2 nodes: 2x
    3 nodes: 2.4x
    4 nodes: 3x

So it's not quite linear scalability, but 3x maximum throughput
is already a lot better.
        


-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-22 23:16 [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
                   ` (2 preceding siblings ...)
  2015-09-22 23:25 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
@ 2015-09-23 10:32 ` Rasmus Villemoes
  2015-09-23 21:54   ` Andi Kleen
  2015-09-23 19:40 ` Austin S Hemmelgarn
  2015-09-23 21:10 ` Theodore Ts'o
  5 siblings, 1 reply; 27+ messages in thread
From: Rasmus Villemoes @ 2015-09-23 10:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: tytso, linux-kernel, kirill.shutemov, herbert, Andi Kleen

On Wed, Sep 23 2015, Andi Kleen <andi@firstfloor.org> wrote:

> @@ -467,7 +478,7 @@ static struct entropy_store blocking_pool = {
>  
>  static struct entropy_store nonblocking_pool = {
>  	.poolinfo = &poolinfo_table[1],
> -	.name = "nonblocking",
> +	.name = "nonblocking 0",
>  	.pull = &input_pool,
>  	.lock = __SPIN_LOCK_UNLOCKED(nonblocking_pool.lock),
>  	.pool = nonblocking_pool_data,
> @@ -475,6 +486,32 @@ static struct entropy_store nonblocking_pool = {
>  					push_to_pool),
>  };
>  
> +/*
> + * Per NUMA node nonblocking pool. This avoids lock contention
> + * when many processes extract data from /dev/urandom in parallel.
> + * /dev/random stays single instance for now.
> + */
> +static struct entropy_store **nonblocking_node_pool __read_mostly;
> +
> +#define for_each_nb_pool(i, pool) for (i = 0; i < num_possible_nodes(); i++) { \
> +	pool = nonblocking_node_pool[i];
> +#define end_for_each_nb() }

You can avoid the need for end_for_each_nb (end_for_each_nb_pool?) by
writing the condition

  i < num_possible_nodes() && (pool = nonblocking_node_pool[i], 1)

[if you keep it, please be consistent in whether end_for_each_nb() is
followed by ; or not, or add do{}while(0) and make the rule that it is].
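
i.e. something like this (untested):

  #define for_each_nb_pool(i, pool) \
	for (i = 0; \
	     i < num_possible_nodes() && (pool = nonblocking_node_pool[i], 1); \
	     i++)

so callers can use a plain brace-delimited loop body.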

But more importantly: Won't this generate a function call (ultimately to
bitmap_weight) for each iteration, at least for large enough
CONFIG_NODES_SHIFT? It's probably not very expensive, but would be nice
to avoid.

> +static inline struct entropy_store *get_nonblocking_pool(void)
> +{
> +	struct entropy_store *pool = &nonblocking_pool;
> +
> +	/*
> +	 * Non node 0 pools may take longer to initialize. Keep using
> +	 * the boot nonblocking pool while this happens.
> +	 */
> +	if (nonblocking_node_pool)
> +		pool = nonblocking_node_pool[numa_node_id()];
> +	if (!pool->initialized)
> +		pool = &nonblocking_pool;
> +	return pool;
> +}

I assume this can't get called concurrently with rand_initialize
(otherwise pool may be NULL even if nonblocking_node_pool is non-NULL). 

>  
> @@ -1393,9 +1446,32 @@ static void init_std_data(struct entropy_store *r)
>   */
>  static int rand_initialize(void)
>  {
> +	int i;
> +	int num_nodes = num_possible_nodes();
> +	char name[40];
> +
> +	nonblocking_node_pool = kzalloc(num_nodes * sizeof(void *),
> +					GFP_KERNEL|__GFP_NOFAIL);
> +

Why kzalloc, when you immediately initialize all elements? New uses of
__GFP_NOFAIL seem to be frowned upon. How hard would it be to just fall
back to only using the single statically allocated pool?
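
E.g., a rough sketch (reordered so the static pools are set up first;
on allocation failure nonblocking_node_pool simply stays NULL and
get_nonblocking_pool() keeps returning the static pool):

	init_std_data(&input_pool);
	init_std_data(&blocking_pool);
	init_std_data(&nonblocking_pool);

	nonblocking_node_pool = kcalloc(num_nodes, sizeof(void *), GFP_KERNEL);
	if (!nonblocking_node_pool)
		return 0;	/* fall back to the single static pool */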

Does rand_initialize get called before or after other initialization
code updates node_possible_map to reflect the actual possible number of
nodes? If before, won't we be wasting a lot of memory (not to mention
that we then might as well allocate all the nonblocking pools statically
based on MAX_NUMNODES).

>  	init_std_data(&input_pool);
>  	init_std_data(&blocking_pool);
> +	nonblocking_node_pool[0] = &nonblocking_pool;
>  	init_std_data(&nonblocking_pool);
> +	for (i = 1; i < num_nodes; i++) {
> +		struct entropy_store *pool = kzalloc(sizeof(struct entropy_store),
> +						     GFP_KERNEL|__GFP_NOFAIL);
> +		nonblocking_node_pool[i] = pool;
> +		pool->poolinfo = &poolinfo_table[1];
> +		pool->pull = &input_pool;
> +		spin_lock_init(&pool->lock);
> +		/* pool data not cleared intentionally */
> +		pool->pool = kmalloc(sizeof(nonblocking_pool_data),
> +				     GFP_KERNEL|__GFP_NOFAIL);
> +		INIT_WORK(&pool->push_work, push_to_pool);
> +		snprintf(name, sizeof name, "nonblocking pool %d", i);
> +		pool->name = kstrdup(name, GFP_KERNEL|__GFP_NOFAIL);

kasprintf(). Also, you renamed the static pool to "nonblocking 0".
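
i.e. something like this (also keeping the name consistent with the
renamed static pool):

	pool->name = kasprintf(GFP_KERNEL, "nonblocking %d", i);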


Rasmus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-22 23:16 [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
                   ` (3 preceding siblings ...)
  2015-09-23 10:32 ` Rasmus Villemoes
@ 2015-09-23 19:40 ` Austin S Hemmelgarn
  2015-09-23 23:28   ` Andi Kleen
  2015-09-23 21:10 ` Theodore Ts'o
  5 siblings, 1 reply; 27+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-23 19:40 UTC (permalink / raw)
  To: Andi Kleen, tytso; +Cc: linux-kernel, kirill.shutemov, herbert, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 5318 bytes --]

On 2015-09-22 19:16, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> We had a case where a 4 socket system spent >80% of its total CPU time
> contending on the global urandom nonblocking pool spinlock. While the
> application could probably have used an own PRNG, it may have valid
> reasons to use the best possible key for different session keys.
>
> The application still ran acceptable under 2S, but just fell over
> the locking cliff on 4S.
>
> Implementation
> ==============
>
> The non blocking pool is used widely these days, from every execve() (to
> set up AT_RANDOM for ld.so randomization), to getrandom(3) and to frequent
> /dev/urandom users in user space. Clearly having such a popular resource
> under a global lock is bad thing.
>
> This patch changes the random driver to use distributed per NUMA node
> nonblocking pools. The basic structure is not changed: entropy is
> first fed into the input pool and later from there distributed
> round-robin into the blocking and non blocking pools. This patch extends
> this to use an dedicated non blocking pool for each node, and distribute
> evenly from the input pool into these distributed pools, in
> addition to the blocking pool.
>
> Then every urandom/getrandom user fetches data from its node local
> pool. At boot time when users may be still waiting for the non
> blocking pool initialization we use the node 0 non blocking pool,
> to avoid the need for different wake up queues.
>
> For single node systems (like the vast majority of non server systems)
> nothing changes. There is still only a single non blocking pool.
>
> The different per-node pools also start with different start
> states and diverge more and more over time, as they get
> feed different input data. So "replay" attacks are
> difficult after some time.
I really like this idea, as it both makes getting random numbers on busy 
servers faster, and makes replay attacks more difficult.
>
> Without hardware random number seed support the start states
> (until enough real entropy is collected) are not very random, but
> that's not worse than before
>
> Since we still have a global input pool there are no problems
> with load balancing entropy data between nodes. Any node that never
> runs any interrupts would still get the same amount of entropy as
> other nodes.
>
> Entropy is fed preferably to nodes that need it more using
> the existing 75% threshold.
>
> For saving/restoring /dev/urandom, there is currently no mechanism
> to access the non local node pool (short of setting task affinity).
> This implies that currently the standard init/exit random save/restore
> scripts would only save node 0. On restore all pools are updates.
> So the entropy of non 0 gets lost over reboot. That seems acceptable
> to me for now (fixing this would need a new separate save/restore interface)
I agree that this is acceptable; it wouldn't be hard for someone who
wants to just modify the script to set its own task affinity and loop
through the nodes (although that might get confusing with
hot-plugged/hot-removed nodes).
>
> Scalability
> ===========
>
> I tested the patch with a simple will-it-scale test banging
> on get_random() in parallel on more and more CPUs. Of course
> that is not a realistic scenario, as real programs should
> do some work between getting random numbers. But it's a worst
> case for the random scalability.
>
> On a 4S Xeon v3 system _without_ the patchkit the benchmark
> maxes out when using all the threads on one node. After
> that it quickly settles to about half the throughput of
> one node with 2-4 nodes.
>
> (all throughput factors, bigger is better)
> Without patchkit:
>
> 1 node:  1x
> 2 nodes: 0.75x
> 3 nodes: 0.55x
> 4 nodes: 0.42x
>
> With the patchkit applied:
>
> 1 node:  1x
> 2 nodes: 2x
> 3 nodes: 3.4x
> 4 nodes: 6x
>
> So it's not quite linear scalability, but 6x maximum throughput
> is already a lot better.
>
> A node can still have a large number of CPUs: on my test system 36
> logical software threads (18C * 2T). In principle it may make
> sense to split it up further. Per logical CPU would be clearly
> overkill. But that would also add more pressure on the input
> pools. For now per node seems like a acceptable compromise.
I'd almost say that making the partitioning level configurable at build
time might be useful.  I can see possible value in being able to at
least partition down to physical cores (i.e., shared between HyperThreads
on Intel processors, and between Compute Module cores on AMD
processors), as that could potentially help people running large numbers
of simulations in parallel.

Personally, I'm the type who would be willing to take the performance 
hit to do it per logical CPU just for the fact that it would make replay 
attacks more difficult, but I'm probably part of a very small minority 
in that case.
>
> /dev/random still uses a single global lock. For now that seems
> acceptable as it normally cannot be used for real high volume
> accesses anyways.
>
> The input pool also still uses a global lock. The existing per CPU
> fast pool and "give up when busy" mechanism seems to scale well enough
> even on larger systems.
>



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-22 23:16 [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
                   ` (4 preceding siblings ...)
  2015-09-23 19:40 ` Austin S Hemmelgarn
@ 2015-09-23 21:10 ` Theodore Ts'o
  2015-09-23 21:25   ` Andi Kleen
  5 siblings, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2015-09-23 21:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, kirill.shutemov, herbert, Andi Kleen

On Tue, Sep 22, 2015 at 04:16:05PM -0700, Andi Kleen wrote:
> 
> This patch changes the random driver to use distributed per NUMA node
> nonblocking pools. The basic structure is not changed: entropy is
> first fed into the input pool and later from there distributed
> round-robin into the blocking and non blocking pools. This patch extends
> this to use an dedicated non blocking pool for each node, and distribute
> evenly from the input pool into these distributed pools, in
> addition to the blocking pool.
> 
> Then every urandom/getrandom user fetches data from its node local
> pool. At boot time when users may be still waiting for the non
> blocking pool initialization we use the node 0 non blocking pool,
> to avoid the need for different wake up queues.

What I would suggest is that we only bring the per-NUMA node pools
into use once the original node 0 pool is marked as initialized (i.e.,
it has been initialized with 128 bits of randomness).  At that point,
initialize pools 1..n from the original non-blocking pool by using
get_random_bytes() to fill up each pool.  Once all of the pools are
initialized, only then set the nonblocking_node_pool variable.  In
practice, especially for large server systems, getting 128 bits of
randomness to initialize the primary non-blocking pool shouldn't take
long.  On my laptop, it takes 4 seconds.
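
Roughly something like this (just a sketch, reusing the field and
helper names from your patch; pools[] here stands for a private array
that only becomes visible through nonblocking_node_pool at the end):

	/* Seed pools 1..n from the initialized node 0 pool, then publish */
	static void init_node_pools(struct entropy_store **pools, int num_nodes)
	{
		__u32 seed[4];		/* 128 bits per secondary pool */
		int i;

		for (i = 1; i < num_nodes; i++) {
			get_random_bytes(seed, sizeof(seed));
			mix_pool_bytes(pools[i], seed, sizeof(seed));
			pools[i]->initialized = 1;
		}
		memzero_explicit(seed, sizeof(seed));
		smp_wmb();		/* all pools seeded before publication */
		nonblocking_node_pool = pools;
	}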

My concern, though, is that things like ssh host key generation happen
at boot time, so anything that weakens the random number generator
there is a bad thing.  (In practice, this isn't a problem
pre-systemd, but systemd speeds up the boot sequence enough
that this is in fact a potential security problem.)  We already have
the problem that systemd-udevd grabs random numbers before the random
pool is initialized:

[    1.124926] random: systemd-udevd urandom read with 13 bits of entropy available
[    4.137543] random: nonblocking pool is initialized

> The different per-node pools also start with different start
> states and diverge more and more over time, as they get
> feed different input data. So "replay" attacks are
> difficult after some time.

The problem is that "after some time" happens after public keys get
generated, so this isn't good for anything other than giving starving
academics papers to publish about weaknesses in Linux's random number
generator.  :-)

So I'd strongly prefer that we not weaken the random number generator's
boot-time initialization story, and the best way to do that is to wait
for a single non-blocking pool to be completely initialized, and then
use that pool to initialize the rest of the pools.  If that means that
we don't have the full /dev/urandom scalability during the first few
seconds after the system is booted, to my mind that's a fair tradeoff.

Does that sound reasonable?

> For saving/restoring /dev/urandom, there is currently no mechanism
> to access the non local node pool (short of setting task affinity).
> This implies that currently the standard init/exit random save/restore
> scripts would only save node 0. On restore all pools are updates.
> So the entropy of non 0 gets lost over reboot. That seems acceptable
> to me for now (fixing this would need a new separate save/restore interface)

Yes, that's fine; I'm not really worried about that, since
getrandom(2) only provides a guarantee of initialized cryptographic
randomness, and that's what we should most care about.

					- Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-23 21:10 ` Theodore Ts'o
@ 2015-09-23 21:25   ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2015-09-23 21:25 UTC (permalink / raw)
  To: Theodore Ts'o, Andi Kleen, linux-kernel, kirill.shutemov, herbert

> Does that sound reasonable?

Sounds good. I can do that.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-23 10:32 ` Rasmus Villemoes
@ 2015-09-23 21:54   ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2015-09-23 21:54 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andi Kleen, tytso, linux-kernel, kirill.shutemov, herbert, Andi Kleen

> > +{
> > +	struct entropy_store *pool = &nonblocking_pool;
> > +
> > +	/*
> > +	 * Non node 0 pools may take longer to initialize. Keep using
> > +	 * the boot nonblocking pool while this happens.
> > +	 */
> > +	if (nonblocking_node_pool)
> > +		pool = nonblocking_node_pool[numa_node_id()];
> > +	if (!pool->initialized)
> > +		pool = &nonblocking_pool;
> > +	return pool;
> > +}
> 
> I assume this can't get called concurrently with rand_initialize
> (otherwise pool may be NULL even if nonblocking_node_pool is non-NULL). 

Yes. I can move the assignment to the global variable last and add a
memory barrier.
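
Something like this (sketch): the writer in rand_initialize() builds
the array in a local pools variable and only publishes it at the end,

	smp_wmb();			/* pool setup visible before pointer */
	nonblocking_node_pool = pools;

while the reader loads the pointer exactly once:

	struct entropy_store **pools = READ_ONCE(nonblocking_node_pool);
	struct entropy_store *pool = &nonblocking_pool;

	if (pools)
		pool = pools[numa_node_id()];
	if (!pool->initialized)
		pool = &nonblocking_pool;
	return pool;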

> > +	char name[40];
> > +
> > +	nonblocking_node_pool = kzalloc(num_nodes * sizeof(void *),
> > +					GFP_KERNEL|__GFP_NOFAIL);
> > +
> 
> Why kzalloc, when you immediately initialize all elements? New uses of
> __GFP_NOFAIL seem to be frowned upon. How hard would it be to just fall
> back to only using the single statically allocated pool?

It's already doing that.

> 
> Does rand_initialize get called before or after other initialization
> code updates node_possible_map to reflect the actual possible number of
> nodes? If before, won't we be wasting a lot of memory (not to mention
> that we then might as well allocate all the nonblocking pools statically
> based on MAX_NUMNODES).

I'll check.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-23 19:40 ` Austin S Hemmelgarn
@ 2015-09-23 23:28   ` Andi Kleen
  2015-09-24 11:37     ` Austin S Hemmelgarn
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2015-09-23 23:28 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Andi Kleen, tytso, linux-kernel, kirill.shutemov, herbert, Andi Kleen

> I'd almost say that making the partitioning level configurable at
> build time might be useful.  I can see possible value to being able
> to at least partition down to physical cores (so, shared between
> HyperThreads on Intel processors, and between Compute Module cores
> on AMD processors), as that could potentially help people running
> large numbers of simulations in parallel.

I don't like build-time size configurations. It doesn't make sense
for simulations to use urandom. It may make sense to have
some run-time tunable, but for now that's too much complexity.
So I'll stay with the simpler per-node pool.

-Andi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-23 23:28   ` Andi Kleen
@ 2015-09-24 11:37     ` Austin S Hemmelgarn
  2015-09-24 13:12       ` Theodore Ts'o
  0 siblings, 1 reply; 27+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-24 11:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: tytso, linux-kernel, kirill.shutemov, herbert, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1399 bytes --]

On 2015-09-23 19:28, Andi Kleen wrote:
>> I'd almost say that making the partitioning level configurable at
>> build time might be useful.  I can see possible value to being able
>> to at least partition down to physical cores (so, shared between
>> HyperThreads on Intel processors, and between Compute Module cores
>> on AMD processors), as that could potentially help people running
>> large numbers of simulations in parallel.
>
> I don't like build time size configurations. It doesn't make sense
> for simulations to use urandom. It may make sense to have
> some run time tunable, but for now that's too much complexity.
> So I'll stay with the simpler per node pool for now.
Using /dev/urandom directly, yes, that doesn't make sense, because it 
consistently returns non-uniformly random numbers when used to generate 
larger amounts of entropy than the blocking pool can source. But most 
simulations that use their own PRNGs seed them off of /dev/urandom, and 
I've seen cases of people doing very large numbers (on the order of 
millions) of short (15 or so minutes of run time on decent hardware) 
simulations that do this.

As for a run-time tunable, I agree that that would be better than 
build-time, and I also agree that it should be done later.  The only 
reason I suggested a build-time tunable was that I figured it would be a 
lot easier to do than a run-time one.



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 11:37     ` Austin S Hemmelgarn
@ 2015-09-24 13:12       ` Theodore Ts'o
  2015-09-24 16:00         ` Austin S Hemmelgarn
  0 siblings, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2015-09-24 13:12 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Andi Kleen, linux-kernel, kirill.shutemov, herbert, Andi Kleen

On Thu, Sep 24, 2015 at 07:37:39AM -0400, Austin S Hemmelgarn wrote:
> Using /dev/urandom directly, yes that doesn't make sense because it
> consistent returns non-uniformly random numbers when used to generate larger
> amounts of entropy than the blocking pool can source

Why do you think this is the case?   Reproduction, please?

					- Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 13:12       ` Theodore Ts'o
@ 2015-09-24 16:00         ` Austin S Hemmelgarn
  2015-09-24 16:52           ` Jeff Epler
  0 siblings, 1 reply; 27+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-24 16:00 UTC (permalink / raw)
  To: Theodore Ts'o, Andi Kleen, linux-kernel, kirill.shutemov,
	herbert, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1877 bytes --]

On 2015-09-24 09:12, Theodore Ts'o wrote:
> On Thu, Sep 24, 2015 at 07:37:39AM -0400, Austin S Hemmelgarn wrote:
>> Using /dev/urandom directly, yes that doesn't make sense because it
>> consistent returns non-uniformly random numbers when used to generate larger
>> amounts of entropy than the blocking pool can source
>
> Why do you think this is the case?   Reproduction, please?
>
> 					- Ted
Aside from the literature scattered across the web and the fact that it 
fails Dieharder tests far more often than a high-quality RNG should? 
(Even a good one should fail from time to time; one that never does is 
inherently flawed for other reasons. But I've had cases where I've done 
thousands of dieharder runs and it failed almost 10% of the time, while 
stuff like mt19937 fails in otherwise identical tests only about 1-2% of 
the time.)  I will admit that it is significantly better than any libc 
implementation of rand() that I've seen, and many other PRNGs (notably, 
it is significantly more random than the FIPS 140 DRBGs), but it does 
not usually do as well as stuff like OpenBSD's /dev/arandom (which is 
also much more processor intensive from what I've seen) or some of the 
high-quality RNGs found in the GSL.

And it's also worth noting that this is with regard to systems that are 
consistently getting significantly less entropy into the blocking pool 
than is being sourced from the non-blocking pool by userspace (by a 
factor of 100 or more).

In short, I would not trust it as a CSPRNG (although I wouldn't trust 
most things touted as CSPRNGs either), or even for important 
simulations that need _lots_ of random numbers.  I'm not saying that it 
shouldn't be used for stuff like seeding other PRNGs, however (and TBH, 
I do trust it more for that than I trust stuff like RDSEED or RDRAND).


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 16:00         ` Austin S Hemmelgarn
@ 2015-09-24 16:52           ` Jeff Epler
  2015-09-24 19:11             ` Austin S Hemmelgarn
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Epler @ 2015-09-24 16:52 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Theodore Ts'o, Andi Kleen, linux-kernel, kirill.shutemov,
	herbert, Andi Kleen

On Thu, Sep 24, 2015 at 12:00:44PM -0400, Austin S Hemmelgarn wrote:
> I've had cases where I've done thousands of dieharder runs, and it
> failed almost 10% of the time, while stuff like mt19937 fails in
> otherwise identical tests only about 1-2% of the time

That is a startling result.  Please say what architecture, kernel
version, dieharder version and commandline arguments you are using to
get 10% WEAK or FAILED assessments from dieharder on /dev/urandom.

Since the structure of Linux urandom involves taking a cryptographic
hash, the basic expectation is that it would fail statistical randomness
tests at similar rates to, e.g., dieharder's AES_OFB (-g 205), even in
the absence of any entropy in the kernel pools.

So if 10% failures at correct statistical tests can be replicated it is
important and needs attention.

I did take a few moments to look into this today and got startling
failures (p-value 0.00000000) with e.g., 
    dieharder -g 501 -d 10
(and a few other tests) using dieharder 3.31.1 on both Debian
linux-4.1-rt-amd64 and Debian kfreebsd-10-amd64, but this seems to be an
upstream bug known at least to Debian and Red Hat, possibly fixed in
current Fedora but apparently not in Debian.
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=745742
    https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=803292
If you have an affected version, these failures are seen only with -g
501, not with -g 200 < /dev/urandom.  They are probably also not seen
with 32-bit dieharder.

 diehard_parking_lot|   0|     12000|     100|0.00000000|  FAILED  
    diehard_2dsphere|   2|      8000|     100|0.00000000|  FAILED  
    diehard_3dsphere|   3|      4000|     100|0.00000000|  FAILED  
     diehard_squeeze|   0|    100000|     100|0.00000000|  FAILED  
        diehard_sums|   0|       100|     100|0.00000000|  FAILED  

Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 16:52           ` Jeff Epler
@ 2015-09-24 19:11             ` Austin S Hemmelgarn
  2015-09-24 20:00               ` Jeff Epler
  2015-09-24 20:14               ` Theodore Ts'o
  0 siblings, 2 replies; 27+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-24 19:11 UTC (permalink / raw)
  To: Jeff Epler
  Cc: Theodore Ts'o, Andi Kleen, linux-kernel, kirill.shutemov,
	herbert, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 3945 bytes --]

On 2015-09-24 12:52, Jeff Epler wrote:
> On Thu, Sep 24, 2015 at 12:00:44PM -0400, Austin S Hemmelgarn wrote:
>> I've had cases where I've done thousands of dieharder runs, and it
>> failed almost 10% of the time, while stuff like mt19937 fails in
>> otherwise identical tests only about 1-2% of the time
>
> That is a startling result.  Please say what architecture, kernel
> version, dieharder version and commandline arguments you are using to
> get 10% WEAK or FAILED assessments from dieharder on /dev/urandom.
I do not remember the exact dieharder version or command-line arguments 
(this was almost a decade ago), except that I compiled it from source 
myself. I do remember it was a 32-bit x86 processor (as that was sadly 
all I had to run Linux on at the time), and an early 2.6 series kernel 
(which, if I remember correctly, was already EOL by the time I was using 
it).  It may have been impacted by the fact that I did the testing in 
QEMU, but I would not expect that to affect things that much.  It is 
worth noting that I only saw this happen three times, and each time 
it was in a sample of 2000 runs (which has always been the sample size 
I've used, as that's the point at which I tend to get impatient).

I don't tend to do that type of testing anymore (at least, not 
since I started donating spare cycles to various BOINC projects).  I 
will, however, make a point of running some tests over the weekend on a 
current kernel version (4.2.1), with the current dieharder version I 
have available (3.31.1).
>
> Since the structure of linux urandom involves taking a cryptographic
> hash the basic expectation is that it would fail statistical randomness
> tests at similar rates to e.g., dieharder's AES_OFB (-g 205) even in the
> absence of any entropy in the kernel pools.
>
> So if 10% failures at correct statistical tests can be replicated it is
> important and needs attention.
>
> I did take a few moments to look into this today and got starling
> failures (p-value 0.00000000) with e.g.,
>      dieharder -g 501 -d 10
> (and a few other tests) using dieharder 3.31.1 on both debian
> linux-4.1-rt-amd64 and debian kfreebsd-10-amd64, but this seems to be an
> upstream bug known at least to debian and redhat, possibly fixed in
> current Fedora but apparently not in Debian.
>      https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=745742
>      https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=803292
> if you have an affected version, these failures are seen only with -g
> 501, not with -g 200 < /dev/urandom.  They are probably also not seen
> with 32-bit dieharder.
>
>   diehard_parking_lot|   0|     12000|     100|0.00000000|  FAILED
>      diehard_2dsphere|   2|      8000|     100|0.00000000|  FAILED
>      diehard_3dsphere|   3|      4000|     100|0.00000000|  FAILED
>       diehard_squeeze|   0|    100000|     100|0.00000000|  FAILED
>          diehard_sums|   0|       100|     100|0.00000000|  FAILED
The diehard_sums test is known and documented to be a flawed test.  As 
for the other failures, even a top-quality RNG should get them 
sometimes, because a good RNG _should_ spit out long runs of identical 
bits from time to time (which is why the absolute insanity that is the 
FIPS cryptography standards should never be considered when doing 
anything other than security work, and only cautiously even there). 
Based on what I've seen with the AES_OFB generator, 'perfect' generators 
should be getting WEAK results about 1% of the time, and FAILED results 
about 0.1% of the time (except on diehard_sums).

A generator never getting FAILED or WEAK results over thousands of runs 
should be an indication that either the generator is flawed in some way 
(i.e., it's actively trying to produce numbers that pass the tests, which 
means it's not really an RNG), or the test itself is flawed in some way.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 19:11             ` Austin S Hemmelgarn
@ 2015-09-24 20:00               ` Jeff Epler
  2015-09-24 20:14               ` Theodore Ts'o
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff Epler @ 2015-09-24 20:00 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Theodore Ts'o, Andi Kleen, linux-kernel, kirill.shutemov,
	herbert, Andi Kleen

On Thu, Sep 24, 2015 at 03:11:23PM -0400, Austin S Hemmelgarn wrote:
> I will make a point however to run some tests over the weekend on a
> current kernel version (4.2.1), with the current dieharder version I
> have available (3.31.1).

Please report your findings.  If urandom is worse than AES_OFB in
statistical tests, we need to know about it.

I'm an hour into some test-to-failure runs of diehard_count_1s_str on
various RNGs -- urandom, AES_OFB, mt19937_1999, and rand48.  So far at
psamples >=29000 none have failed, so there's no result to report. (test
8 was chosen by mere human pseudorandomness; hey, it finds in <2s that
RANDU is a flawed generator)

    dieharder -d 8 -g 205 -Y 2 -k 2
    dieharder -d 8 -g 200 -Y 2 -k 2
    dieharder -d 8 -g 14 -Y 2 -k 2
    dieharder -d 8 -g 22 -Y 2 -k 2

If the results are other than "both urandom and aes_ofb were running
when I had to reboot my laptop", I'll report my results as well.

Jeff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 19:11             ` Austin S Hemmelgarn
  2015-09-24 20:00               ` Jeff Epler
@ 2015-09-24 20:14               ` Theodore Ts'o
  2015-09-25 11:41                 ` Austin S Hemmelgarn
  1 sibling, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2015-09-24 20:14 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Jeff Epler, Andi Kleen, linux-kernel, kirill.shutemov, herbert,
	Andi Kleen

On Thu, Sep 24, 2015 at 03:11:23PM -0400, Austin S Hemmelgarn wrote:
> >That is a startling result.  Please say what architecture, kernel
> >version, dieharder version and commandline arguments you are using to
> >get 10% WEAK or FAILED assessments from dieharder on /dev/urandom.
>
> I do not remember what exact dieharder version or command-line arguments
> (this was almost a decade ago), except that I compiled it from source
> myself, I do remember it was a 32-bit x86 processor (as that was sadly all I
> had to run Linux on at the time), and an early 2.6 series kernel (which if I
> remember correctly was already EOL by the time I was using it).

It might have been nice if you had said this from the beginning
instead of making an unqualified statement with the assumption that it
was applicable to kernels likely to be used today in non-obsolete
systems.  Otherwise it risks generating a click-bait article on
Phoronix that would get people really worried for no good reason...

There was a bug a long, long time ago (where we weren't doing
sufficient locking, and if two processes raced reading from
/dev/urandom at the same time, it was possible that the two processes
would get the same value read out from /dev/urandom).  This was fixed
a long time ago, though, and in fact the scalability problem which
Andi is trying to fix was caused by the extra locking that was added.  :-)

It's possible that is what you saw.  I don't know, since there was no
reproduction information to back up your rather startling claim.

If you can reproduce consistent Dieharder failures, please do let us
know with detailed reproduction instructions.

Many thanks,

					- Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 20:14               ` Theodore Ts'o
@ 2015-09-25 11:41                 ` Austin S Hemmelgarn
  2015-09-25 19:07                   ` Austin S Hemmelgarn
  0 siblings, 1 reply; 27+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-25 11:41 UTC (permalink / raw)
  To: Theodore Ts'o, Jeff Epler, Andi Kleen, linux-kernel,
	kirill.shutemov, herbert, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 2350 bytes --]

On 2015-09-24 16:14, Theodore Ts'o wrote:
> On Thu, Sep 24, 2015 at 03:11:23PM -0400, Austin S Hemmelgarn wrote:
>>> That is a startling result.  Please say what architecture, kernel
>>> version, dieharder version and commandline arguments you are using to
>>> get 10% WEAK or FAILED assessments from dieharder on /dev/urandom.
>>
>> I do not remember what exact dieharder version or command-line arguments
>> (this was almost a decade ago), except that I compiled it from source
>> myself, I do remember it was a 32-bit x86 processor (as that was sadly all I
>> had to run Linux on at the time), and an early 2.6 series kernel (which if I
>> remember correctly was already EOL by the time I was using it).
>
> It might have been nice if you had said this from the beginning
> instead of making an unqualified statement with the assumption that it
> was applicable to kernels likely to be used today in non-obsolete
> systems.  Otherwise it risks generating a click-bait article on
> Phoronix that would get people really worried for no good reason...
I sincerely apologize for this; I should have been more specific right 
from the beginning.  (I need to get better about that when talking to 
people; I'm so used to dealing with some of my friends who couldn't 
even tell you the difference between RAM and a hard drive, think a bus 
is only something you use for transportation, and get confused when I 
try to properly explain even relatively simple CS and statistics concepts.)
>
> There was a bug a long, long time ago (which where we weren't doing
> sufficient locking and if two processes raced reading from
> /dev/urandom at the same time, it was possible that the two processes
> would get the same value read out from /dev/urandom).  This was fixed
> a long time ago, though, and in fact the scalability problem which
> Andi is trying to fix was caused by that extra locking that was added.  :-)
>
> It's possible that is what you saw.  I don't know, since there was no
> reproduction information to back up your rather startling claim.
I don't think this was what I hit; I'm pretty sure I had serialized the 
dieharder runs.
>
> If you can reproduce consistent Dieharder failures, please do let us
> know with detailed reproduction instructures.
Will do.
>
> Many thanks,
>
> 					- Ted
>



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-25 11:41                 ` Austin S Hemmelgarn
@ 2015-09-25 19:07                   ` Austin S Hemmelgarn
  2015-09-25 20:24                     ` Theodore Ts'o
  2015-09-29 11:57                     ` Austin S Hemmelgarn
  0 siblings, 2 replies; 27+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-25 19:07 UTC (permalink / raw)
  To: Theodore Ts'o, Jeff Epler, Andi Kleen, linux-kernel,
	kirill.shutemov, herbert, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 3281 bytes --]

On 2015-09-25 07:41, Austin S Hemmelgarn wrote:
> On 2015-09-24 16:14, Theodore Ts'o wrote:
>> On Thu, Sep 24, 2015 at 03:11:23PM -0400, Austin S Hemmelgarn wrote:
>>>> That is a startling result.  Please say what architecture, kernel
>>>> version, dieharder version and commandline arguments you are using to
>>>> get 10% WEAK or FAILED assessments from dieharder on /dev/urandom.
>>>
>>> I do not remember what exact dieharder version or command-line arguments
>>> (this was almost a decade ago), except that I compiled it from source
>>> myself, I do remember it was a 32-bit x86 processor (as that was
>>> sadly all I
>>> had to run Linux on at the time), and an early 2.6 series kernel
>>> (which if I
>>> remember correctly was already EOL by the time I was using it).
>>
>> It might have been nice if you had said this from the beginning
>> instead of making an unqualified statement with the assumption that it
>> was applicable to kernels likely to be used today in non-obsolete
>> systems.  Otherwise it risks generating a click-bait article on
>> Phoronix that would get people really worried for no good reason...
> I sincerely apologize about this, I should have been more specific right
> from the beginning (I need to get better about that when talking to
> people, I'm so used to dealing with some of my friends who couldn't
> even tell you the difference between RAM and a hard drive, think a bus
> is only something you use for transportation, and get confused when I
> try to properly explain even relatively simple CS and statistics concepts).
>>
>> There was a bug a long, long time ago (where we weren't doing
>> sufficient locking and if two processes raced reading from
>> /dev/urandom at the same time, it was possible that the two processes
>> would get the same value read out from /dev/urandom).  This was fixed
>> a long time ago, though, and in fact the scalability problem which
>> Andi is trying to fix was caused by that extra locking that was
>> added.  :-)
>>
>> It's possible that is what you saw.  I don't know, since there was no
>> reproduction information to back up your rather startling claim.
> I don't think this was what I hit, I'm pretty sure I had serialized the
> dieharder runs.
>>
>> If you can reproduce consistent Dieharder failures, please do let us
>> know with detailed reproduction instructions.
> Will do.
OK, I've just started a couple of runs in parallel with different 
generators, using the following command line:
dieharder -a -m 32 -k 1 -Y 1 -g XXX
with one each for:
/dev/urandom (502)
AES_OFB (205)
glibc random() (039)
mt19937 (013)
The above command line will run all dieharder tests with 12800 psamples, 
using higher than default precision, and re-run tests that return 
WEAK until they get a PASS or FAIL.  Even on the relatively fast (at 
least, fast for a desktop) system I'm running them on, I expect it will 
take quite some time to finish (although regardless of that I'm probably 
not going to get back to it until Monday).

Interestingly, based on what dieharder is already saying about 
performance, /dev/urandom is slower than AES_OFB (at least, on this 
particular system, happy to provide hardware specs if someone wants).


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-25 19:07                   ` Austin S Hemmelgarn
@ 2015-09-25 20:24                     ` Theodore Ts'o
  2015-09-29 12:06                       ` Austin S Hemmelgarn
  2015-09-29 11:57                     ` Austin S Hemmelgarn
  1 sibling, 1 reply; 27+ messages in thread
From: Theodore Ts'o @ 2015-09-25 20:24 UTC (permalink / raw)
  To: Austin S Hemmelgarn
  Cc: Jeff Epler, Andi Kleen, linux-kernel, kirill.shutemov, herbert,
	Andi Kleen

On Fri, Sep 25, 2015 at 03:07:54PM -0400, Austin S Hemmelgarn wrote:
> 
> Interestingly, based on what dieharder is already saying about performance,
> /dev/urandom is slower than AES_OFB (at least, on this particular system,
> happy to provide hardware specs if someone wants).

Yeah, not surprised by that.  We're currently using a crypto hash
instead of AES, which means we're not doing any kind of hardware
acceleration.

Crazy applications that want to spend 100% of the CPU generating
random numbers instead of, you know, doing _useful_ work
notwithstanding, /dev/urandom never had high performance as one of its
design goals.  The assumption was that if you needed that kind of
performance, you would use a user-space cryptographic random number
generator.


					- Ted

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-25 19:07                   ` Austin S Hemmelgarn
  2015-09-25 20:24                     ` Theodore Ts'o
@ 2015-09-29 11:57                     ` Austin S Hemmelgarn
  1 sibling, 0 replies; 27+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-29 11:57 UTC (permalink / raw)
  To: Theodore Ts'o, Jeff Epler, Andi Kleen, linux-kernel,
	kirill.shutemov, herbert, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 4111 bytes --]

On 2015-09-25 15:07, Austin S Hemmelgarn wrote:
> On 2015-09-25 07:41, Austin S Hemmelgarn wrote:
>> On 2015-09-24 16:14, Theodore Ts'o wrote:
>>> On Thu, Sep 24, 2015 at 03:11:23PM -0400, Austin S Hemmelgarn wrote:
>>>>> That is a startling result.  Please say what architecture, kernel
>>>>> version, dieharder version and commandline arguments you are using to
>>>>> get 10% WEAK or FAILED assessments from dieharder on /dev/urandom.
>>>>
>>>> I do not remember what exact dieharder version or command-line
>>>> arguments
>>>> (this was almost a decade ago), except that I compiled it from source
>>>> myself, I do remember it was a 32-bit x86 processor (as that was
>>>> sadly all I
>>>> had to run Linux on at the time), and an early 2.6 series kernel
>>>> (which if I
>>>> remember correctly was already EOL by the time I was using it).
>>>
>>> It might have been nice if you had said this from the beginning
>>> instead of making an unqualified statement with the assumption that it
>>> was applicable to kernels likely to be used today in non-obsolete
>>> systems.  Otherwise it risks generating a click-bait article on
>>> Phoronix that would get people really worried for no good reason...
>> I sincerely apologize about this, I should have been more specific right
>> from the beginning (I need to get better about that when talking to
>> people, I'm so used to dealing with some of my friends who couldn't
>> even tell you the difference between RAM and a hard drive, think a bus
>> is only something you use for transportation, and get confused when I
>> try to properly explain even relatively simple CS and statistics
>> concepts).
>>>
>>> There was a bug a long, long time ago (where we weren't doing
>>> sufficient locking and if two processes raced reading from
>>> /dev/urandom at the same time, it was possible that the two processes
>>> would get the same value read out from /dev/urandom).  This was fixed
>>> a long time ago, though, and in fact the scalability problem which
>>> Andi is trying to fix was caused by that extra locking that was
>>> added.  :-)
>>>
>>> It's possible that is what you saw.  I don't know, since there was no
>>> reproduction information to back up your rather startling claim.
>> I don't think this was what I hit, I'm pretty sure I had serialized the
>> dieharder runs.
>>>
>>> If you can reproduce consistent Dieharder failures, please do let us
>>> know with detailed reproduction instructions.
>> Will do.
> OK, just started a couple of runs in parallel using different generators
> using the following command line:
> dieharder -a -m 32 -k 1 -Y 1 -g XXX
> with one each for:
> /dev/urandom (502)
> AES_OFB (205)
> glibc random() (039)
> mt19937 (013)
> The above command line will run all dieharder tests with 12800 psamples,
> using a higher than default precision, and re-running tests that return
> WEAK until it gets a PASS or FAIL.  Even on the relatively fast (at
> least, fast for a desktop) system I'm running them on, I expect it will
> take quite some time to finish (although regardless of that I'm probably
> not going to be getting back to it until Monday).
>
> Interestingly, based on what dieharder is already saying about
> performance, /dev/urandom is slower than AES_OFB (at least, on this
> particular system, happy to provide hardware specs if someone wants).
>
Apologies for not replying yesterday like I said I would.

I actually didn't get a chance to run the tests to completion, as the 
wifi card in the system I was running them on lost its mind about 
55 hours in and I had to cold reboot the system to reset it.  I would 
give the results here, except that people probably don't want 110kb of 
data in the e-mail body, and Thunderbird is for some reason choking on 
trying to attach files.  In general, the results were pretty typical of 
a good PRNG, performance differences notwithstanding.  In other words, 
don't use /dev/urandom except for seeding other PRNGs, but because of 
the speed, not the quality.


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-25 20:24                     ` Theodore Ts'o
@ 2015-09-29 12:06                       ` Austin S Hemmelgarn
  0 siblings, 0 replies; 27+ messages in thread
From: Austin S Hemmelgarn @ 2015-09-29 12:06 UTC (permalink / raw)
  To: Theodore Ts'o, Jeff Epler, Andi Kleen, linux-kernel,
	kirill.shutemov, herbert, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1498 bytes --]

On 2015-09-25 16:24, Theodore Ts'o wrote:
> On Fri, Sep 25, 2015 at 03:07:54PM -0400, Austin S Hemmelgarn wrote:
>>
>> Interestingly, based on what dieharder is already saying about performance,
>> /dev/urandom is slower than AES_OFB (at least, on this particular system,
>> happy to provide hardware specs if someone wants).
>
> Yeah, not surprised by that.  We're currently using a crypto hash
> instead of AES, which means we're not doing any kind of hardware
> acceleration.
>
> Crazy applications that want to spend 100% of the CPU generating
> random numbers instead of, you know, doing _useful_ work
> notwithstanding, /dev/urandom never had high performance as one of its
> design goals.  The assumption was that if you needed that kind of
> performance, you would use a user-space cryptographic random number
> generator.
While I do understand that, its performance is abysmal compared to any of 
the others I tested.  Part of the standard testing in dieharder is 
reporting how many random numbers it can source from the generator per 
second (it's some bit-width of integers; I just don't remember which). 
Here are the actual numbers I got:

         AES_OFB|  1.11e+07
   random-glibc2|  6.11e+07
         mt19937|  3.30e+07
    /dev/urandom|  6.53e+05

That much difference in speed is kind of interesting, and reinforces my 
statement that you should just use /dev/urandom for seeding other RNGs, 
albeit for a different reason than the one I originally gave.
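
To make the "seed a userspace PRNG from /dev/urandom" pattern concrete,
here is a minimal, self-contained sketch; the xorshift64 step function is
only a hypothetical stand-in for whatever PRNG or CSPRNG an application
would actually use:

/*
 * Minimal sketch: pull a small seed from /dev/urandom once, then
 * generate bulk randomness in userspace.  xorshift64 is just a
 * placeholder generator, not a recommendation.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t xorshift64(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	return *state = x;
}

int main(void)
{
	uint64_t seed = 0;
	int fd = open("/dev/urandom", O_RDONLY);

	if (fd < 0 || read(fd, &seed, sizeof(seed)) != sizeof(seed))
		return 1;
	close(fd);
	if (seed == 0)			/* xorshift state must be non-zero */
		seed = 0x9e3779b97f4a7c15ULL;

	for (int i = 0; i < 4; i++)	/* the "bulk" generation step */
		printf("%016llx\n", (unsigned long long)xorshift64(&seed));
	return 0;
}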


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/3] Make /dev/urandom scalable
@ 2016-03-01  5:17 Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2016-03-01  5:17 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

We had a case where a 4 socket system spent >80% of its total CPU time
contending on the global urandom nonblocking pool spinlock. While the
application could probably have used its own PRNG, it may have valid
reasons to use the best possible key for different session keys.

The application still ran acceptably on 2S, but just fell over
the locking cliff on 4S.

Implementation
==============

The non blocking pool is used widely these days, from every execve() (to
set up AT_RANDOM for ld.so randomization), to getrandom(3) and to frequent
/dev/urandom users in user space. Clearly having such a popular resource
under a global lock is a bad thing.

This patch changes the random driver to use distributed per NUMA node
nonblocking pools. The basic structure is not changed: entropy is
first fed into the input pool and later from there distributed
round-robin into the blocking and non blocking pools. This patch extends
this to use a dedicated non blocking pool for each node, and distributes
evenly from the input pool into these distributed pools, in
addition to the blocking pool.

Then every urandom/getrandom user fetches data from its node local
pool. At boot time, when users may still be waiting for the non
blocking pool to be initialized, we use the node 0 non blocking pool
to avoid the need for different wake up queues.

For single node systems (like the vast majority of non server systems)
nothing changes. There is still only a single non blocking pool.

For other systems the original nonblocking pool is used until this
pool has 128 bits worth of entropy. After that it is used
to initialize the other pools. This already gives them
different init states, so they don't run in lock-step,
which helps to avoid "replay" attacks.

Since we still have a global input pool there are no problems
with load balancing entropy data between nodes. Any node that never
runs any interrupts would still get the same amount of entropy as
other nodes.

Entropy is preferentially fed to nodes that need it more, using
the existing 75% threshold.

For saving/restoring /dev/urandom, there is currently no mechanism
to access the non-local node pools (short of setting task affinity).
This implies that currently the standard init/exit random save/restore
scripts would only save node 0. On restore, all pools are updated,
so the entropy of the non-0 pools gets lost over a reboot. That seems acceptable
to me for now (fixing this would need a new separate save/restore interface).

Scalability
===========

I tested the patch with a simple will-it-scale test banging
on get_random() in parallel on more and more CPUs. Of course
that is not a realistic scenario, as real programs should
do some work between getting random numbers. But it's a worst
case for the random scalability.

On a 4S Xeon v3 system _without_ the patchkit the benchmark
maxes out when using all the threads on one node. After
that it quickly settles to about half the throughput of
one node with 2-4 nodes.

(all throughput factors, bigger is better)
Without patchkit:

1 node:  1x
2 nodes: 0.75x
3 nodes: 0.55x
4 nodes: 0.42x

With the patchkit applied:

1 node:  1x
2 nodes: 2x
3 nodes: 2.4x
4 nodes: 3x

So it's not quite linear scalability, but 3x maximum throughput
is already a lot better.
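
As a rough illustration of the kind of worst case described above (this
is not the actual will-it-scale harness; the thread count and read size
are arbitrary assumptions), every thread does nothing but small reads
from /dev/urandom, so essentially all CPU time lands in the nonblocking
pool path:

/* Worst-case sketch: all CPU time goes into the urandom read path. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8
#define ITERS    100000

static void *reader(void *arg)
{
	char buf[16];
	int fd = open("/dev/urandom", O_RDONLY);

	(void)arg;
	if (fd < 0)
		return NULL;
	for (long i = 0; i < ITERS; i++)
		if (read(fd, buf, sizeof(buf)) != sizeof(buf))
			break;
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	puts("done");
	return 0;
}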

A node can still have a large number of CPUs: on my test system 36
logical software threads (18C * 2T). In principle it may make
sense to split it up further. Per logical CPU would clearly be
overkill. But that would also add more pressure on the input
pools. For now per node seems like an acceptable compromise.

/dev/random still uses a single global lock. For now that seems
acceptable, as it normally cannot be used for truly high volume
accesses anyway.

The input pool also still uses a global lock. The existing per CPU
fast pool and "give up when busy" mechanism seem to scale well enough
even on larger systems.

v2: Fix name of pool 0. Fix race with interrupts. Make
iteration loops slightly more efficient. Add ifdefs to avoid
any extra code on non-NUMA. Delay other pool use until
the original pool is initialized, and initialize the pools from
pool 0. Add comments on memory allocation.

v3: Minor changes from review. Consistent pool names.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c | 187 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 160 insertions(+), 27 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..e7e02c0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -156,6 +156,17 @@
  * particular randomness source.  They do this by keeping track of the
  * first and second order deltas of the event timings.
  *
+ * Distributed pools
+ * =================
+ *
+ * On larger systems the locking on the single non blocking pool can
+ * become a bottleneck. To avoid this, we use an own non blocking pool
+ * for each NUMA node. The distributed pools are fed round robin from
+ * the input pool. Each user then only reads entropy from their local
+ * pool.
+ *
+ * For the blocking pool there is still only a single instance.
+ *
  * Ensuring unpredictability at system startup
  * ============================================
  *
@@ -467,7 +478,7 @@ static struct entropy_store blocking_pool = {
 
 static struct entropy_store nonblocking_pool = {
 	.poolinfo = &poolinfo_table[1],
-	.name = "nonblocking",
+	.name = "nonblocking 0",
 	.pull = &input_pool,
 	.lock = __SPIN_LOCK_UNLOCKED(nonblocking_pool.lock),
 	.pool = nonblocking_pool_data,
@@ -475,6 +486,41 @@ static struct entropy_store nonblocking_pool = {
 					push_to_pool),
 };
 
+/*
+ * Per NUMA node nonblocking pool. This avoids lock contention
+ * when many processes extract data from /dev/urandom in parallel.
+ * /dev/random stays single instance for now.
+ *
+ * This variable is only set up once all non node 0 pools
+ * are initialized.
+ */
+#ifdef CONFIG_NUMA
+static struct entropy_store **nonblocking_node_pool __read_mostly;
+static struct entropy_store **early_nonblocking_node_pool __read_mostly;
+#else
+#define nonblocking_node_pool ((struct entropy_store **)NULL)
+#endif
+
+/* User must check for zero nonblocking_node_pool */
+#define for_each_nb_pool(i, pool) \
+	{ int num_nodes = num_possible_nodes(); \
+		for (i = 0; i < num_nodes; i++) { \
+			pool = nonblocking_node_pool[i];
+#define end_for_each_nb() } }
+
+static inline struct entropy_store *get_nonblocking_pool(void)
+{
+	struct entropy_store *pool = &nonblocking_pool;
+
+	/*
+	 * Non node 0 pools may take longer to initialize. Keep using
+	 * the boot nonblocking pool while this happens.
+	 */
+	if (nonblocking_node_pool)
+		pool = nonblocking_node_pool[numa_node_id()];
+	return pool;
+}
+
 static __u32 const twist_table[8] = {
 	0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
 	0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
@@ -608,6 +654,27 @@ static void process_random_ready_list(void)
 	spin_unlock_irqrestore(&random_ready_list_lock, flags);
 }
 
+#define POOL_INIT_BYTES (128 / 8)
+
+void init_node_pools(void)
+{
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+
+	for (i = 0; i < num_nodes; i++) {
+		struct entropy_store *pool = early_nonblocking_node_pool[i];
+		char buf[POOL_INIT_BYTES];
+
+		get_random_bytes(buf, sizeof buf);
+		mix_pool_bytes(pool, buf, sizeof buf);
+	}
+	/* Initialize first before setting variable */
+	mb();
+	nonblocking_node_pool = early_nonblocking_node_pool;
+#endif
+}
+
 /*
  * Credit (or debit) the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
@@ -681,6 +748,7 @@ retry:
 			prandom_reseed_late();
 			process_random_ready_list();
 			wake_up_all(&urandom_init_wait);
+			init_node_pools();
 			pr_notice("random: %s pool is initialized\n", r->name);
 		}
 	}
@@ -698,18 +766,23 @@ retry:
 			kill_fasync(&fasync, SIGIO, POLL_IN);
 		}
 		/* If the input pool is getting full, send some
-		 * entropy to the two output pools, flipping back and
-		 * forth between them, until the output pools are 75%
+		 * entropy to the other output pools, cycling
+		 * between them, until the output pools are 75%
 		 * full.
+		 * Feed into all pools round robin.
 		 */
 		if (entropy_bits > random_write_wakeup_bits &&
 		    r->initialized &&
 		    r->entropy_total >= 2*random_read_wakeup_bits) {
 			static struct entropy_store *last = &blocking_pool;
+			static int next_pool = -1;
 			struct entropy_store *other = &blocking_pool;
 
-			if (last == &blocking_pool)
-				other = &nonblocking_pool;
+			/* -1: use blocking pool, 0<=max_node: node nb pool */
+			if (next_pool > -1)
+				other = nonblocking_node_pool[next_pool];
+			if (++next_pool >= num_possible_nodes())
+				next_pool = -1;
 			if (other->entropy_count <=
 			    3 * other->poolinfo->poolfracbits / 4)
 				last = other;
@@ -748,6 +821,17 @@ struct timer_rand_state {
 
 #define INIT_TIMER_RAND_STATE { INITIAL_JIFFIES, };
 
+static void pool_add_buf(struct entropy_store *pool,
+			 const void *buf, unsigned int size)
+{
+	unsigned long flags;
+	unsigned long time = random_get_entropy() ^ jiffies;
+	spin_lock_irqsave(&pool->lock, flags);
+	_mix_pool_bytes(pool, buf, size);
+	_mix_pool_bytes(pool, &time, sizeof(time));
+	spin_unlock_irqrestore(&pool->lock, flags);
+}
+
 /*
  * Add device- or boot-specific data to the input and nonblocking
  * pools to help initialize them to unique values.
@@ -758,19 +842,18 @@ struct timer_rand_state {
  */
 void add_device_randomness(const void *buf, unsigned int size)
 {
-	unsigned long time = random_get_entropy() ^ jiffies;
-	unsigned long flags;
+	struct entropy_store *pool;
+	int i;
 
 	trace_add_device_randomness(size, _RET_IP_);
-	spin_lock_irqsave(&input_pool.lock, flags);
-	_mix_pool_bytes(&input_pool, buf, size);
-	_mix_pool_bytes(&input_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&input_pool.lock, flags);
-
-	spin_lock_irqsave(&nonblocking_pool.lock, flags);
-	_mix_pool_bytes(&nonblocking_pool, buf, size);
-	_mix_pool_bytes(&nonblocking_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&nonblocking_pool.lock, flags);
+	pool_add_buf(&input_pool, buf, size);
+	if (!nonblocking_node_pool) {
+		pool_add_buf(&nonblocking_pool, buf, size);
+	} else {
+		for_each_nb_pool (i, pool) {
+			pool_add_buf(pool, buf, size);
+		} end_for_each_nb()
+	}
 }
 EXPORT_SYMBOL(add_device_randomness);
 
@@ -1252,15 +1335,16 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
  */
 void get_random_bytes(void *buf, int nbytes)
 {
+	struct entropy_store *pool = get_nonblocking_pool();
 #if DEBUG_RANDOM_BOOT > 0
-	if (unlikely(nonblocking_pool.initialized == 0))
+	if (unlikely(pool->initialized == 0))
 		printk(KERN_NOTICE "random: %pF get_random_bytes called "
 		       "with %d bits of entropy available\n",
 		       (void *) _RET_IP_,
-		       nonblocking_pool.entropy_total);
+		       pool->entropy_total);
 #endif
 	trace_get_random_bytes(nbytes, _RET_IP_);
-	extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+	extract_entropy(pool, buf, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes);
 
@@ -1350,7 +1434,7 @@ void get_random_bytes_arch(void *buf, int nbytes)
 	}
 
 	if (nbytes)
-		extract_entropy(&nonblocking_pool, p, nbytes, 0, 0);
+		extract_entropy(get_nonblocking_pool(), p, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes_arch);
 
@@ -1390,12 +1474,43 @@ static void init_std_data(struct entropy_store *r)
  * initializations complete at compile time. We should also
  * take care not to overwrite the precious per platform data
  * we were given.
+ *
+ * This is early initialization, if memory allocations of a few
+ * bytes fails the kernel is doomed anyways. So use __GFP_NOFAIL.
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+	struct entropy_store **pool_list;
+#endif
+
 	init_std_data(&input_pool);
 	init_std_data(&blocking_pool);
 	init_std_data(&nonblocking_pool);
+#ifdef CONFIG_NUMA
+	pool_list = kmalloc(num_nodes * sizeof(void *),
+					GFP_KERNEL|__GFP_NOFAIL);
+	pool_list[0] = &nonblocking_pool;
+	for (i = 1; i < num_nodes; i++) {
+		struct entropy_store *pool;
+
+		pool = kzalloc(sizeof(struct entropy_store),
+				GFP_KERNEL|__GFP_NOFAIL);
+		pool_list[i] = pool;
+		pool->poolinfo = &poolinfo_table[1];
+		pool->pull = &input_pool;
+		spin_lock_init(&pool->lock);
+		/* pool data not cleared intentionally */
+		pool->pool = kmalloc(sizeof(nonblocking_pool_data),
+				     GFP_KERNEL|__GFP_NOFAIL);
+		INIT_WORK(&pool->push_work, push_to_pool);
+		pool->name = kasprintf(GFP_KERNEL|__GFP_NOFAIL,
+				"nonblocking %d", i);
+	}
+	early_nonblocking_node_pool = pool_list;
+#endif
 	return 0;
 }
 early_initcall(rand_initialize);
@@ -1458,17 +1573,19 @@ static ssize_t
 urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
 	int ret;
+	struct entropy_store *pool;
 
-	if (unlikely(nonblocking_pool.initialized == 0))
+	pool = get_nonblocking_pool();
+	if (unlikely(pool->initialized == 0))
 		printk_once(KERN_NOTICE "random: %s urandom read "
 			    "with %d bits of entropy available\n",
 			    current->comm, nonblocking_pool.entropy_total);
 
 	nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
-	ret = extract_entropy_user(&nonblocking_pool, buf, nbytes);
+	ret = extract_entropy_user(pool, buf, nbytes);
 
-	trace_urandom_read(8 * nbytes, ENTROPY_BITS(&nonblocking_pool),
-			   ENTROPY_BITS(&input_pool));
+	trace_urandom_read(8 * nbytes, ENTROPY_BITS(pool),
+			   ENTROPY_BITS(pool));
 	return ret;
 }
 
@@ -1513,22 +1630,34 @@ static ssize_t random_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
 	size_t ret;
+	struct entropy_store *pool;
+	int i;
 
 	ret = write_pool(&blocking_pool, buffer, count);
 	if (ret)
 		return ret;
-	ret = write_pool(&nonblocking_pool, buffer, count);
-	if (ret)
-		return ret;
+	if (nonblocking_node_pool) {
+		for_each_nb_pool (i, pool) {
+			ret = write_pool(pool, buffer, count);
+			if (ret)
+				return ret;
+		} end_for_each_nb()
+	} else {
+		ret = write_pool(&nonblocking_pool, buffer, count);
+		if (ret)
+			return ret;
+	}
 
 	return (ssize_t)count;
 }
 
 static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 {
+	int i;
 	int size, ent_count;
 	int __user *p = (int __user *)arg;
 	int retval;
+	struct entropy_store *pool;
 
 	switch (cmd) {
 	case RNDGETENTCNT:
@@ -1569,6 +1698,10 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 			return -EPERM;
 		input_pool.entropy_count = 0;
 		nonblocking_pool.entropy_count = 0;
+		if (nonblocking_node_pool)
+			for_each_nb_pool (i, pool) {
+				pool->entropy_count = 0;
+			} end_for_each_nb()
 		blocking_pool.entropy_count = 0;
 		return 0;
 	default:
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 1/3] Make /dev/urandom scalable
  2016-02-10 23:01 Scalable random patchkit revisited Andi Kleen
@ 2016-02-10 23:01 ` Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2016-02-10 23:01 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

We had a case where a 4 socket system spent >80% of its total CPU time
contending on the global urandom nonblocking pool spinlock. While the
application could probably have used its own PRNG, it may have valid
reasons to use the best possible key for different session keys.

The application still ran acceptably on 2S, but just fell over
the locking cliff on 4S.

Implementation
==============

The non blocking pool is used widely these days, from every execve() (to
set up AT_RANDOM for ld.so randomization), to getrandom(3) and to frequent
/dev/urandom users in user space. Clearly having such a popular resource
under a global lock is a bad thing.

This patch changes the random driver to use distributed per NUMA node
nonblocking pools. The basic structure is not changed: entropy is
first fed into the input pool and later from there distributed
round-robin into the blocking and non blocking pools. This patch extends
this to use a dedicated non blocking pool for each node, and distributes
evenly from the input pool into these distributed pools, in
addition to the blocking pool.

Then every urandom/getrandom user fetches data from its node local
pool. At boot time when users may be still waiting for the non
blocking pool initialization we use the node 0 non blocking pool,
to avoid the need for different wake up queues.

For single node systems (like the vast majority of non server systems)
nothing changes. There is still only a single non blocking pool.

For other systems the original nonblocking pool is used until this
pool has 128 bits worth of entropy. After that it is used
to initialize the other pools. This already gives them
different init states, so they don't run in lock-step
to avoid "replay" attacks.

Since we still have a global input pool there are no problems
with load balancing entropy data between nodes. Any node that never
runs any interrupts would still get the same amount of entropy as
other nodes.

Entropy is fed preferably to nodes that need it more using
the existing 75% threshold.

For saving/restoring /dev/urandom, there is currently no mechanism
to access the non local node pool (short of setting task affinity).
This implies that currently the standard init/exit random save/restore
scripts would only save node 0. On restore, all pools are updated,
so the entropy of the non-0 pools gets lost over a reboot. That seems acceptable
to me for now (fixing this would need a new separate save/restore interface).

Scalability
===========

I tested the patch with a simple will-it-scale test banging
on get_random() in parallel on more and more CPUs. Of course
that is not a realistic scenario, as real programs should
do some work between getting random numbers. But it's a worst
case for the random scalability.

On a 4S Xeon v3 system _without_ the patchkit the benchmark
maxes out when using all the threads on one node. After
that it quickly settles to about half the throughput of
one node with 2-4 nodes.

(all throughput factors, bigger is better)
Without patchkit:

1 node:  1x
2 nodes: 0.75x
3 nodes: 0.55x
4 nodes: 0.42x

With the patchkit applied:

1 node:  1x
2 nodes: 2x
3 nodes: 2.4x
4 nodes: 3x

So it's not quite linear scalability, but 3x maximum throughput
is already a lot better.

A node can still have a large number of CPUs: on my test system 36
logical software threads (18C * 2T). In principle it may make
sense to split it up further. Per logical CPU would be clearly
overkill. But that would also add more pressure on the input
pools. For now per node seems like an acceptable compromise.

/dev/random still uses a single global lock. For now that seems
acceptable as it normally cannot be used for real high volume
accesses anyways.

The input pool also still uses a global lock. The existing per CPU
fast pool and "give up when busy" mechanism seems to scale well enough
even on larger systems.

v2: Fix name of pool 0. Fix race with interrupts. Make
iteration loops slightly more efficient. Add ifdefs to avoid
any extra code on non-NUMA. Delay other pool use to when
the original pool initialized and initialize the pools from
pool 0. Add comments on memory allocation.

v3: Minor changes from review. Consistent pool names.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c | 187 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 160 insertions(+), 27 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..e7e02c0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -156,6 +156,17 @@
  * particular randomness source.  They do this by keeping track of the
  * first and second order deltas of the event timings.
  *
+ * Distributed pools
+ * =================
+ *
+ * On larger systems the locking on the single non blocking pool can
+ * become a bottleneck. To avoid this, we use an own non blocking pool
+ * for each NUMA node. The distributed pools are fed round robin from
+ * the input pool. Each user then only reads entropy from their local
+ * pool.
+ *
+ * For the blocking pool there is still only a single instance.
+ *
  * Ensuring unpredictability at system startup
  * ============================================
  *
@@ -467,7 +478,7 @@ static struct entropy_store blocking_pool = {
 
 static struct entropy_store nonblocking_pool = {
 	.poolinfo = &poolinfo_table[1],
-	.name = "nonblocking",
+	.name = "nonblocking 0",
 	.pull = &input_pool,
 	.lock = __SPIN_LOCK_UNLOCKED(nonblocking_pool.lock),
 	.pool = nonblocking_pool_data,
@@ -475,6 +486,41 @@ static struct entropy_store nonblocking_pool = {
 					push_to_pool),
 };
 
+/*
+ * Per NUMA node nonblocking pool. This avoids lock contention
+ * when many processes extract data from /dev/urandom in parallel.
+ * /dev/random stays single instance for now.
+ *
+ * This variable is only set up once all non node 0 pools
+ * are initialized.
+ */
+#ifdef CONFIG_NUMA
+static struct entropy_store **nonblocking_node_pool __read_mostly;
+static struct entropy_store **early_nonblocking_node_pool __read_mostly;
+#else
+#define nonblocking_node_pool ((struct entropy_store **)NULL)
+#endif
+
+/* User must check for zero nonblocking_node_pool */
+#define for_each_nb_pool(i, pool) \
+	{ int num_nodes = num_possible_nodes(); \
+		for (i = 0; i < num_nodes; i++) { \
+			pool = nonblocking_node_pool[i];
+#define end_for_each_nb() } }
+
+static inline struct entropy_store *get_nonblocking_pool(void)
+{
+	struct entropy_store *pool = &nonblocking_pool;
+
+	/*
+	 * Non node 0 pools may take longer to initialize. Keep using
+	 * the boot nonblocking pool while this happens.
+	 */
+	if (nonblocking_node_pool)
+		pool = nonblocking_node_pool[numa_node_id()];
+	return pool;
+}
+
 static __u32 const twist_table[8] = {
 	0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
 	0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
@@ -608,6 +654,27 @@ static void process_random_ready_list(void)
 	spin_unlock_irqrestore(&random_ready_list_lock, flags);
 }
 
+#define POOL_INIT_BYTES (128 / 8)
+
+void init_node_pools(void)
+{
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+
+	for (i = 0; i < num_nodes; i++) {
+		struct entropy_store *pool = early_nonblocking_node_pool[i];
+		char buf[POOL_INIT_BYTES];
+
+		get_random_bytes(buf, sizeof buf);
+		mix_pool_bytes(pool, buf, sizeof buf);
+	}
+	/* Initialize first before setting variable */
+	mb();
+	nonblocking_node_pool = early_nonblocking_node_pool;
+#endif
+}
+
 /*
  * Credit (or debit) the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
@@ -681,6 +748,7 @@ retry:
 			prandom_reseed_late();
 			process_random_ready_list();
 			wake_up_all(&urandom_init_wait);
+			init_node_pools();
 			pr_notice("random: %s pool is initialized\n", r->name);
 		}
 	}
@@ -698,18 +766,23 @@ retry:
 			kill_fasync(&fasync, SIGIO, POLL_IN);
 		}
 		/* If the input pool is getting full, send some
-		 * entropy to the two output pools, flipping back and
-		 * forth between them, until the output pools are 75%
+		 * entropy to the other output pools, cycling
+		 * between them, until the output pools are 75%
 		 * full.
+		 * Feed into all pools round robin.
 		 */
 		if (entropy_bits > random_write_wakeup_bits &&
 		    r->initialized &&
 		    r->entropy_total >= 2*random_read_wakeup_bits) {
 			static struct entropy_store *last = &blocking_pool;
+			static int next_pool = -1;
 			struct entropy_store *other = &blocking_pool;
 
-			if (last == &blocking_pool)
-				other = &nonblocking_pool;
+			/* -1: use blocking pool, 0<=max_node: node nb pool */
+			if (next_pool > -1)
+				other = nonblocking_node_pool[next_pool];
+			if (++next_pool >= num_possible_nodes())
+				next_pool = -1;
 			if (other->entropy_count <=
 			    3 * other->poolinfo->poolfracbits / 4)
 				last = other;
@@ -748,6 +821,17 @@ struct timer_rand_state {
 
 #define INIT_TIMER_RAND_STATE { INITIAL_JIFFIES, };
 
+static void pool_add_buf(struct entropy_store *pool,
+			 const void *buf, unsigned int size)
+{
+	unsigned long flags;
+	unsigned long time = random_get_entropy() ^ jiffies;
+	spin_lock_irqsave(&pool->lock, flags);
+	_mix_pool_bytes(pool, buf, size);
+	_mix_pool_bytes(pool, &time, sizeof(time));
+	spin_unlock_irqrestore(&pool->lock, flags);
+}
+
 /*
  * Add device- or boot-specific data to the input and nonblocking
  * pools to help initialize them to unique values.
@@ -758,19 +842,18 @@ struct timer_rand_state {
  */
 void add_device_randomness(const void *buf, unsigned int size)
 {
-	unsigned long time = random_get_entropy() ^ jiffies;
-	unsigned long flags;
+	struct entropy_store *pool;
+	int i;
 
 	trace_add_device_randomness(size, _RET_IP_);
-	spin_lock_irqsave(&input_pool.lock, flags);
-	_mix_pool_bytes(&input_pool, buf, size);
-	_mix_pool_bytes(&input_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&input_pool.lock, flags);
-
-	spin_lock_irqsave(&nonblocking_pool.lock, flags);
-	_mix_pool_bytes(&nonblocking_pool, buf, size);
-	_mix_pool_bytes(&nonblocking_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&nonblocking_pool.lock, flags);
+	pool_add_buf(&input_pool, buf, size);
+	if (!nonblocking_node_pool) {
+		pool_add_buf(&nonblocking_pool, buf, size);
+	} else {
+		for_each_nb_pool (i, pool) {
+			pool_add_buf(pool, buf, size);
+		} end_for_each_nb()
+	}
 }
 EXPORT_SYMBOL(add_device_randomness);
 
@@ -1252,15 +1335,16 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
  */
 void get_random_bytes(void *buf, int nbytes)
 {
+	struct entropy_store *pool = get_nonblocking_pool();
 #if DEBUG_RANDOM_BOOT > 0
-	if (unlikely(nonblocking_pool.initialized == 0))
+	if (unlikely(pool->initialized == 0))
 		printk(KERN_NOTICE "random: %pF get_random_bytes called "
 		       "with %d bits of entropy available\n",
 		       (void *) _RET_IP_,
-		       nonblocking_pool.entropy_total);
+		       pool->entropy_total);
 #endif
 	trace_get_random_bytes(nbytes, _RET_IP_);
-	extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+	extract_entropy(pool, buf, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes);
 
@@ -1350,7 +1434,7 @@ void get_random_bytes_arch(void *buf, int nbytes)
 	}
 
 	if (nbytes)
-		extract_entropy(&nonblocking_pool, p, nbytes, 0, 0);
+		extract_entropy(get_nonblocking_pool(), p, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes_arch);
 
@@ -1390,12 +1474,43 @@ static void init_std_data(struct entropy_store *r)
  * initializations complete at compile time. We should also
  * take care not to overwrite the precious per platform data
  * we were given.
+ *
+ * This is early initialization, if memory allocations of a few
+ * bytes fails the kernel is doomed anyways. So use __GFP_NOFAIL.
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+	struct entropy_store **pool_list;
+#endif
+
 	init_std_data(&input_pool);
 	init_std_data(&blocking_pool);
 	init_std_data(&nonblocking_pool);
+#ifdef CONFIG_NUMA
+	pool_list = kmalloc(num_nodes * sizeof(void *),
+					GFP_KERNEL|__GFP_NOFAIL);
+	pool_list[0] = &nonblocking_pool;
+	for (i = 1; i < num_nodes; i++) {
+		struct entropy_store *pool;
+
+		pool = kzalloc(sizeof(struct entropy_store),
+				GFP_KERNEL|__GFP_NOFAIL);
+		pool_list[i] = pool;
+		pool->poolinfo = &poolinfo_table[1];
+		pool->pull = &input_pool;
+		spin_lock_init(&pool->lock);
+		/* pool data not cleared intentionally */
+		pool->pool = kmalloc(sizeof(nonblocking_pool_data),
+				     GFP_KERNEL|__GFP_NOFAIL);
+		INIT_WORK(&pool->push_work, push_to_pool);
+		pool->name = kasprintf(GFP_KERNEL|__GFP_NOFAIL,
+				"nonblocking %d", i);
+	}
+	early_nonblocking_node_pool = pool_list;
+#endif
 	return 0;
 }
 early_initcall(rand_initialize);
@@ -1458,17 +1573,19 @@ static ssize_t
 urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
 	int ret;
+	struct entropy_store *pool;
 
-	if (unlikely(nonblocking_pool.initialized == 0))
+	pool = get_nonblocking_pool();
+	if (unlikely(pool->initialized == 0))
 		printk_once(KERN_NOTICE "random: %s urandom read "
 			    "with %d bits of entropy available\n",
 			    current->comm, nonblocking_pool.entropy_total);
 
 	nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
-	ret = extract_entropy_user(&nonblocking_pool, buf, nbytes);
+	ret = extract_entropy_user(pool, buf, nbytes);
 
-	trace_urandom_read(8 * nbytes, ENTROPY_BITS(&nonblocking_pool),
-			   ENTROPY_BITS(&input_pool));
+	trace_urandom_read(8 * nbytes, ENTROPY_BITS(pool),
+			   ENTROPY_BITS(pool));
 	return ret;
 }
 
@@ -1513,22 +1630,34 @@ static ssize_t random_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
 	size_t ret;
+	struct entropy_store *pool;
+	int i;
 
 	ret = write_pool(&blocking_pool, buffer, count);
 	if (ret)
 		return ret;
-	ret = write_pool(&nonblocking_pool, buffer, count);
-	if (ret)
-		return ret;
+	if (nonblocking_node_pool) {
+		for_each_nb_pool (i, pool) {
+			ret = write_pool(pool, buffer, count);
+			if (ret)
+				return ret;
+		} end_for_each_nb()
+	} else {
+		ret = write_pool(&nonblocking_pool, buffer, count);
+		if (ret)
+			return ret;
+	}
 
 	return (ssize_t)count;
 }
 
 static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 {
+	int i;
 	int size, ent_count;
 	int __user *p = (int __user *)arg;
 	int retval;
+	struct entropy_store *pool;
 
 	switch (cmd) {
 	case RNDGETENTCNT:
@@ -1569,6 +1698,10 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 			return -EPERM;
 		input_pool.entropy_count = 0;
 		nonblocking_pool.entropy_count = 0;
+		if (nonblocking_node_pool)
+			for_each_nb_pool (i, pool) {
+				pool->entropy_count = 0;
+			} end_for_each_nb()
 		blocking_pool.entropy_count = 0;
 		return 0;
 	default:
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 1/3] Make /dev/urandom scalable
@ 2015-10-06 22:05 Andi Kleen
  0 siblings, 0 replies; 27+ messages in thread
From: Andi Kleen @ 2015-10-06 22:05 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

We had a case where a 4 socket system spent >80% of its total CPU time
contending on the global urandom nonblocking pool spinlock. While the
application could probably have used its own PRNG, it may have valid
reasons to use the best possible key for different session keys.

The application still ran acceptably on 2S, but just fell over
the locking cliff on 4S.

Implementation
==============

The non blocking pool is used widely these days, from every execve() (to
set up AT_RANDOM for ld.so randomization), to getrandom(3) and to frequent
/dev/urandom users in user space. Clearly having such a popular resource
under a global lock is a bad thing.

This patch changes the random driver to use distributed per NUMA node
nonblocking pools. The basic structure is not changed: entropy is
first fed into the input pool and later from there distributed
round-robin into the blocking and non blocking pools. This patch extends
this to use a dedicated non blocking pool for each node, and distributes
evenly from the input pool into these distributed pools, in
addition to the blocking pool.

Then every urandom/getrandom user fetches data from its node local
pool. At boot time when users may be still waiting for the non
blocking pool initialization we use the node 0 non blocking pool,
to avoid the need for different wake up queues.

For single node systems (like the vast majority of non server systems)
nothing changes. There is still only a single non blocking pool.

For other systems the original nonblocking pool is used until this
pool has 128 bits worth of entropy. After that it is used
to initialize the other pools. This already gives them
different init states, so they don't run in lock-step
to avoid "replay" attacks.

Since we still have a global input pool there are no problems
with load balancing entropy data between nodes. Any node that never
runs any interrupts would still get the same amount of entropy as
other nodes.

Entropy is fed preferably to nodes that need it more using
the existing 75% threshold.

For saving/restoring /dev/urandom, there is currently no mechanism
to access the non local node pool (short of setting task affinity).
This implies that currently the standard init/exit random save/restore
scripts would only save node 0. On restore, all pools are updated,
so the entropy of the non-0 pools gets lost over a reboot. That seems acceptable
to me for now (fixing this would need a new separate save/restore interface).

Scalability
===========

I tested the patch with a simple will-it-scale test banging
on get_random() in parallel on more and more CPUs. Of course
that is not a realistic scenario, as real programs should
do some work between getting random numbers. But it's a worst
case for the random scalability.

On a 4S Xeon v3 system _without_ the patchkit the benchmark
maxes out when using all the threads on one node. After
that it quickly settles to about half the throughput of
one node with 2-4 nodes.

(all throughput factors, bigger is better)
Without patchkit:

1 node:  1x
2 nodes: 0.75x
3 nodes: 0.55x
4 nodes: 0.42x

With the patchkit applied:

1 node:  1x
2 nodes: 2x
3 nodes: 2.4x
4 nodes: 3x

So it's not quite linear scalability, but 3x maximum throughput
is already a lot better.

A node can still have a large number of CPUs: on my test system 36
logical software threads (18C * 2T). In principle it may make
sense to split it up further. Per logical CPU would be clearly
overkill. But that would also add more pressure on the input
pools. For now per node seems like an acceptable compromise.

/dev/random still uses a single global lock. For now that seems
acceptable as it normally cannot be used for real high volume
accesses anyways.

The input pool also still uses a global lock. The existing per CPU
fast pool and "give up when busy" mechanism seems to scale well enough
even on larger systems.

v2: Fix name of pool 0. Fix race with interrupts. Make
iteration loops slightly more efficient. Add ifdefs to avoid
any extra code on non-NUMA. Delay other pool use to when
the original pool initialized and initialize the pools from
pool 0. Add comments on memory allocation.

v3: Minor changes from review. Consistent pool names.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c | 187 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 160 insertions(+), 27 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..e7e02c0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -156,6 +156,17 @@
  * particular randomness source.  They do this by keeping track of the
  * first and second order deltas of the event timings.
  *
+ * Distributed pools
+ * =================
+ *
+ * On larger systems the locking on the single non blocking pool can
+ * become a bottleneck. To avoid this, we use an own non blocking pool
+ * for each NUMA node. The distributed pools are fed round robin from
+ * the input pool. Each user then only reads entropy from their local
+ * pool.
+ *
+ * For the blocking pool there is still only a single instance.
+ *
  * Ensuring unpredictability at system startup
  * ============================================
  *
@@ -467,7 +478,7 @@ static struct entropy_store blocking_pool = {
 
 static struct entropy_store nonblocking_pool = {
 	.poolinfo = &poolinfo_table[1],
-	.name = "nonblocking",
+	.name = "nonblocking 0",
 	.pull = &input_pool,
 	.lock = __SPIN_LOCK_UNLOCKED(nonblocking_pool.lock),
 	.pool = nonblocking_pool_data,
@@ -475,6 +486,41 @@ static struct entropy_store nonblocking_pool = {
 					push_to_pool),
 };
 
+/*
+ * Per NUMA node nonblocking pool. This avoids lock contention
+ * when many processes extract data from /dev/urandom in parallel.
+ * /dev/random stays single instance for now.
+ *
+ * This variable is only set up once all non node 0 pools
+ * are initialized.
+ */
+#ifdef CONFIG_NUMA
+static struct entropy_store **nonblocking_node_pool __read_mostly;
+static struct entropy_store **early_nonblocking_node_pool __read_mostly;
+#else
+#define nonblocking_node_pool ((struct entropy_store **)NULL)
+#endif
+
+/* User must check for zero nonblocking_node_pool */
+#define for_each_nb_pool(i, pool) \
+	{ int num_nodes = num_possible_nodes(); \
+		for (i = 0; i < num_nodes; i++) { \
+			pool = nonblocking_node_pool[i];
+#define end_for_each_nb() } }
+
+static inline struct entropy_store *get_nonblocking_pool(void)
+{
+	struct entropy_store *pool = &nonblocking_pool;
+
+	/*
+	 * Non node 0 pools may take longer to initialize. Keep using
+	 * the boot nonblocking pool while this happens.
+	 */
+	if (nonblocking_node_pool)
+		pool = nonblocking_node_pool[numa_node_id()];
+	return pool;
+}
+
 static __u32 const twist_table[8] = {
 	0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
 	0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
@@ -608,6 +654,27 @@ static void process_random_ready_list(void)
 	spin_unlock_irqrestore(&random_ready_list_lock, flags);
 }
 
+#define POOL_INIT_BYTES (128 / 8)
+
+void init_node_pools(void)
+{
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+
+	for (i = 0; i < num_nodes; i++) {
+		struct entropy_store *pool = early_nonblocking_node_pool[i];
+		char buf[POOL_INIT_BYTES];
+
+		get_random_bytes(buf, sizeof buf);
+		mix_pool_bytes(pool, buf, sizeof buf);
+	}
+	/* Initialize first before setting variable */
+	mb();
+	nonblocking_node_pool = early_nonblocking_node_pool;
+#endif
+}
+
 /*
  * Credit (or debit) the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
@@ -681,6 +748,7 @@ retry:
 			prandom_reseed_late();
 			process_random_ready_list();
 			wake_up_all(&urandom_init_wait);
+			init_node_pools();
 			pr_notice("random: %s pool is initialized\n", r->name);
 		}
 	}
@@ -698,18 +766,23 @@ retry:
 			kill_fasync(&fasync, SIGIO, POLL_IN);
 		}
 		/* If the input pool is getting full, send some
-		 * entropy to the two output pools, flipping back and
-		 * forth between them, until the output pools are 75%
+		 * entropy to the other output pools, cycling
+		 * between them, until the output pools are 75%
 		 * full.
+		 * Feed into all pools round robin.
 		 */
 		if (entropy_bits > random_write_wakeup_bits &&
 		    r->initialized &&
 		    r->entropy_total >= 2*random_read_wakeup_bits) {
 			static struct entropy_store *last = &blocking_pool;
+			static int next_pool = -1;
 			struct entropy_store *other = &blocking_pool;
 
-			if (last == &blocking_pool)
-				other = &nonblocking_pool;
+			/* -1: use blocking pool, 0<=max_node: node nb pool */
+			if (next_pool > -1)
+				other = nonblocking_node_pool[next_pool];
+			if (++next_pool >= num_possible_nodes())
+				next_pool = -1;
 			if (other->entropy_count <=
 			    3 * other->poolinfo->poolfracbits / 4)
 				last = other;
@@ -748,6 +821,17 @@ struct timer_rand_state {
 
 #define INIT_TIMER_RAND_STATE { INITIAL_JIFFIES, };
 
+static void pool_add_buf(struct entropy_store *pool,
+			 const void *buf, unsigned int size)
+{
+	unsigned long flags;
+	unsigned long time = random_get_entropy() ^ jiffies;
+	spin_lock_irqsave(&pool->lock, flags);
+	_mix_pool_bytes(pool, buf, size);
+	_mix_pool_bytes(pool, &time, sizeof(time));
+	spin_unlock_irqrestore(&pool->lock, flags);
+}
+
 /*
  * Add device- or boot-specific data to the input and nonblocking
  * pools to help initialize them to unique values.
@@ -758,19 +842,18 @@ struct timer_rand_state {
  */
 void add_device_randomness(const void *buf, unsigned int size)
 {
-	unsigned long time = random_get_entropy() ^ jiffies;
-	unsigned long flags;
+	struct entropy_store *pool;
+	int i;
 
 	trace_add_device_randomness(size, _RET_IP_);
-	spin_lock_irqsave(&input_pool.lock, flags);
-	_mix_pool_bytes(&input_pool, buf, size);
-	_mix_pool_bytes(&input_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&input_pool.lock, flags);
-
-	spin_lock_irqsave(&nonblocking_pool.lock, flags);
-	_mix_pool_bytes(&nonblocking_pool, buf, size);
-	_mix_pool_bytes(&nonblocking_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&nonblocking_pool.lock, flags);
+	pool_add_buf(&input_pool, buf, size);
+	if (!nonblocking_node_pool) {
+		pool_add_buf(&nonblocking_pool, buf, size);
+	} else {
+		for_each_nb_pool (i, pool) {
+			pool_add_buf(pool, buf, size);
+		} end_for_each_nb()
+	}
 }
 EXPORT_SYMBOL(add_device_randomness);
 
@@ -1252,15 +1335,16 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
  */
 void get_random_bytes(void *buf, int nbytes)
 {
+	struct entropy_store *pool = get_nonblocking_pool();
 #if DEBUG_RANDOM_BOOT > 0
-	if (unlikely(nonblocking_pool.initialized == 0))
+	if (unlikely(pool->initialized == 0))
 		printk(KERN_NOTICE "random: %pF get_random_bytes called "
 		       "with %d bits of entropy available\n",
 		       (void *) _RET_IP_,
-		       nonblocking_pool.entropy_total);
+		       pool->entropy_total);
 #endif
 	trace_get_random_bytes(nbytes, _RET_IP_);
-	extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+	extract_entropy(pool, buf, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes);
 
@@ -1350,7 +1434,7 @@ void get_random_bytes_arch(void *buf, int nbytes)
 	}
 
 	if (nbytes)
-		extract_entropy(&nonblocking_pool, p, nbytes, 0, 0);
+		extract_entropy(get_nonblocking_pool(), p, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes_arch);
 
@@ -1390,12 +1474,43 @@ static void init_std_data(struct entropy_store *r)
  * initializations complete at compile time. We should also
  * take care not to overwrite the precious per platform data
  * we were given.
+ *
+ * This is early initialization, if memory allocations of a few
+ * bytes fails the kernel is doomed anyways. So use __GFP_NOFAIL.
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+	struct entropy_store **pool_list;
+#endif
+
 	init_std_data(&input_pool);
 	init_std_data(&blocking_pool);
 	init_std_data(&nonblocking_pool);
+#ifdef CONFIG_NUMA
+	pool_list = kmalloc(num_nodes * sizeof(void *),
+					GFP_KERNEL|__GFP_NOFAIL);
+	pool_list[0] = &nonblocking_pool;
+	for (i = 1; i < num_nodes; i++) {
+		struct entropy_store *pool;
+
+		pool = kzalloc(sizeof(struct entropy_store),
+				GFP_KERNEL|__GFP_NOFAIL);
+		pool_list[i] = pool;
+		pool->poolinfo = &poolinfo_table[1];
+		pool->pull = &input_pool;
+		spin_lock_init(&pool->lock);
+		/* pool data not cleared intentionally */
+		pool->pool = kmalloc(sizeof(nonblocking_pool_data),
+				     GFP_KERNEL|__GFP_NOFAIL);
+		INIT_WORK(&pool->push_work, push_to_pool);
+		pool->name = kasprintf(GFP_KERNEL|__GFP_NOFAIL,
+				"nonblocking %d", i);
+	}
+	early_nonblocking_node_pool = pool_list;
+#endif
 	return 0;
 }
 early_initcall(rand_initialize);
@@ -1458,17 +1573,19 @@ static ssize_t
 urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
 	int ret;
+	struct entropy_store *pool;
 
-	if (unlikely(nonblocking_pool.initialized == 0))
+	pool = get_nonblocking_pool();
+	if (unlikely(pool->initialized == 0))
 		printk_once(KERN_NOTICE "random: %s urandom read "
 			    "with %d bits of entropy available\n",
 			    current->comm, nonblocking_pool.entropy_total);
 
 	nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
-	ret = extract_entropy_user(&nonblocking_pool, buf, nbytes);
+	ret = extract_entropy_user(pool, buf, nbytes);
 
-	trace_urandom_read(8 * nbytes, ENTROPY_BITS(&nonblocking_pool),
-			   ENTROPY_BITS(&input_pool));
+	trace_urandom_read(8 * nbytes, ENTROPY_BITS(pool),
+			   ENTROPY_BITS(pool));
 	return ret;
 }
 
@@ -1513,22 +1630,34 @@ static ssize_t random_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
 	size_t ret;
+	struct entropy_store *pool;
+	int i;
 
 	ret = write_pool(&blocking_pool, buffer, count);
 	if (ret)
 		return ret;
-	ret = write_pool(&nonblocking_pool, buffer, count);
-	if (ret)
-		return ret;
+	if (nonblocking_node_pool) {
+		for_each_nb_pool (i, pool) {
+			ret = write_pool(pool, buffer, count);
+			if (ret)
+				return ret;
+		} end_for_each_nb()
+	} else {
+		ret = write_pool(&nonblocking_pool, buffer, count);
+		if (ret)
+			return ret;
+	}
 
 	return (ssize_t)count;
 }
 
 static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 {
+	int i;
 	int size, ent_count;
 	int __user *p = (int __user *)arg;
 	int retval;
+	struct entropy_store *pool;
 
 	switch (cmd) {
 	case RNDGETENTCNT:
@@ -1569,6 +1698,10 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 			return -EPERM;
 		input_pool.entropy_count = 0;
 		nonblocking_pool.entropy_count = 0;
+		if (nonblocking_node_pool)
+			for_each_nb_pool (i, pool) {
+				pool->entropy_count = 0;
+			} end_for_each_nb()
 		blocking_pool.entropy_count = 0;
 		return 0;
 	default:
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 17:19 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
@ 2015-09-30 14:40   ` Rasmus Villemoes
  0 siblings, 0 replies; 27+ messages in thread
From: Rasmus Villemoes @ 2015-09-30 14:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: tytso, linux-kernel, Andi Kleen

On Thu, Sep 24 2015, Andi Kleen <andi@firstfloor.org> wrote:

>
> v2: Fix name of pool 0. Fix race with interrupts. Make
> iteration loops slightly more efficient. Add ifdefs to avoid
> any extra code on non-NUMA. Delay other pool use to when
> the original pool initialized and initialize the pools from
> pool 0. Add comments on memory allocation.

More bikeshedding. Please ignore unless you do a v3 anyway.

>  static struct entropy_store nonblocking_pool = {
>  	.poolinfo = &poolinfo_table[1],
> -	.name = "nonblocking",
> +	.name = "nonblocking pool 0",

Hm, yeah, but then you should probably also update the blocking pool's
name (or use the format "nonblocking %d").

> +#define POOL_INIT_BYTES (128 / 8)
> +
> +void init_node_pools(void)
> +{
> +#ifdef CONFIG_NUMA
> +	int i;
> +	for (i = 0; i < num_possible_nodes(); i++) {

int num_nodes = num_possible_nodes();
for (i = 0; i < num_nodes; i++) {

>  static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
>  {
> +	int i;
>  	int size, ent_count;
>  	int __user *p = (int __user *)arg;
>  	int retval;
> +	struct entropy_store *pool;
>  
>  	switch (cmd) {
>  	case RNDGETENTCNT:
> @@ -1569,6 +1696,10 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
>  			return -EPERM;
>  		input_pool.entropy_count = 0;
>  		nonblocking_pool.entropy_count = 0;
> +		if (nonblocking_node_pool)
> +			for_each_nb_pool (i, pool) {
> +				pool->entropy_count = 0;
> +			} end_for_each_nb();

Extra ';'. Harmless in this case, but it could cause a slightly
hard-to-understand build failure if that 'if' ever grows an 'else'.
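
For example (a minimal sketch of the hazard, with a made-up condition;
the macro pair expands to "{ ... } }" as defined in the patch):

	if (some_condition)
		for_each_nb_pool (i, pool) {
			pool->entropy_count = 0;
		} end_for_each_nb();	/* the ';' leaves an empty statement
					 * after the closing "} }" ...      */
	else				/* ... so this 'else' no longer has
					 * a matching 'if': compile error   */
		do_something_else();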

Rasmus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 17:19 Updated scalable urandom patchkit Andi Kleen
@ 2015-09-24 17:19 ` Andi Kleen
  2015-09-30 14:40   ` Rasmus Villemoes
  0 siblings, 1 reply; 27+ messages in thread
From: Andi Kleen @ 2015-09-24 17:19 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, linux, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

We had a case where a 4 socket system spent >80% of its total CPU time
contending on the global urandom nonblocking pool spinlock. While the
application could probably have used its own PRNG, it may have valid
reasons to want the best possible randomness for its session keys.

The application still ran acceptably on 2S, but just fell over
the locking cliff on 4S.

Implementation
==============

The non blocking pool is used widely these days, from every execve() (to
set up AT_RANDOM for ld.so randomization), to getrandom(3) and to frequent
/dev/urandom users in user space. Clearly having such a popular resource
under a global lock is a bad thing.

This patch changes the random driver to use distributed per NUMA node
nonblocking pools. The basic structure is not changed: entropy is
first fed into the input pool and later from there distributed
round-robin into the blocking and non blocking pools. This patch extends
this to use a dedicated non blocking pool for each node, and distributes
evenly from the input pool into these distributed pools, in
addition to the blocking pool.

Then every urandom/getrandom user fetches data from its node local
pool. At boot time, when users may still be waiting for the non
blocking pool to initialize, we use the node 0 non blocking pool,
to avoid the need for different wake up queues.

For single node systems (like the vast majority of non server systems)
nothing changes. There is still only a single non blocking pool.

For other systems the original nonblocking pool is used until this
pool has 128 bits worth of entropy. After that it is used
to initialize the other pools. This already gives them
different initial states, so they don't run in lock-step,
which makes "replay" attacks harder.

Since we still have a global input pool there are no problems
with load balancing entropy data between nodes. Any node that never
runs any interrupts would still get the same amount of entropy as
other nodes.

Entropy is preferentially fed to nodes that need it more, using
the existing 75% threshold.

For saving/restoring /dev/urandom, there is currently no mechanism
to access the non local node pool (short of setting task affinity).
This implies that currently the standard init/exit random save/restore
scripts would only save node 0. On restore all pools are updated,
so the entropy of the non-0 pools is lost over a reboot. That seems acceptable
to me for now (fixing this would need a new, separate save/restore interface).
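
For illustration, a minimal stand-in for what such a save/restore script
does (a hypothetical standalone example, not any distro's actual script;
the seed file name and size are arbitrary). The save side reads
/dev/urandom, so with this patch it is served from the caller's
node-local pool; the restore side writes the seed back, which
random_write() mixes into every pool (without crediting entropy):

#include <fcntl.h>
#include <unistd.h>

#define SEED_FILE  "/var/lib/random-seed"	/* arbitrary path */
#define SEED_BYTES 512				/* arbitrary size */

/* Shutdown: save a seed taken from the (node-local) nonblocking pool. */
static void save_seed(void)
{
	char buf[SEED_BYTES];
	int in = open("/dev/urandom", O_RDONLY);
	int out = open(SEED_FILE, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (in >= 0 && out >= 0 && read(in, buf, sizeof(buf)) == sizeof(buf))
		write(out, buf, sizeof(buf));
	if (in >= 0)
		close(in);
	if (out >= 0)
		close(out);
}

/* Boot: write the seed back; the kernel mixes it into all pools. */
static void restore_seed(void)
{
	char buf[SEED_BYTES];
	ssize_t n;
	int in = open(SEED_FILE, O_RDONLY);
	int out = open("/dev/urandom", O_WRONLY);

	if (in >= 0 && out >= 0 && (n = read(in, buf, sizeof(buf))) > 0)
		write(out, buf, n);
	if (in >= 0)
		close(in);
	if (out >= 0)
		close(out);
}

int main(int argc, char **argv)
{
	if (argc > 1 && argv[1][0] == 's')
		save_seed();
	else
		restore_seed();
	return 0;
}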

Scalability
===========

I tested the patch with a simple will-it-scale test banging
on get_random() in parallel on more and more CPUs. Of course
that is not a realistic scenario, as real programs should
do some work between getting random numbers. But it's a worst
case for the random driver's scalability.
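
A minimal stand-in for that kind of test (not the actual will-it-scale
harness; the thread count, buffer size and runtime below are arbitrary)
just hammers getrandom(2) from many threads at once:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define NTHREADS 64		/* arbitrary; use one per CPU in practice */

static volatile int stop;

static void *worker(void *arg)
{
	unsigned long *count = arg;
	char buf[16];

	/* Tight getrandom() loop: worst case for the nonblocking pool lock.
	 * (A real harness would also pad the counters to avoid false
	 * sharing.) */
	while (!stop) {
		if (syscall(SYS_getrandom, buf, sizeof(buf), 0) < 0)
			break;
		(*count)++;
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	unsigned long counts[NTHREADS] = { 0 };
	unsigned long total = 0;
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, &counts[i]);
	sleep(10);
	stop = 1;
	for (i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		total += counts[i];
	}
	printf("%lu getrandom() calls in 10s with %d threads\n",
	       total, NTHREADS);
	return 0;
}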

On a 4S Xeon v3 system _without_ the patchkit the benchmark
maxes out when using all the threads on one node. After
that it quickly settles at about half the throughput of
a single node when running on 2-4 nodes.

(all throughput factors, bigger is better)
Without patchkit:

1 node:  1x
2 nodes: 0.75x
3 nodes: 0.55x
4 nodes: 0.42x

With the patchkit applied:

1 node:  1x
2 nodes: 2x
3 nodes: 2.4x
4 nodes: 3x

So it's not quite linear scalability, but 3x maximum throughput
is already a lot better.

A node can still have a large number of CPUs: my test system has 36
logical CPUs per node (18C * 2T). In principle it may make
sense to split it up further. Per logical CPU would clearly be
overkill, but that would also add more pressure on the input
pool. For now per node seems like an acceptable compromise.

/dev/random still uses a single global lock. For now that seems
acceptable, as it normally cannot be used for really high volume
accesses anyway.

The input pool also still uses a global lock. The existing per CPU
fast pool and "give up when busy" mechanism seems to scale well enough
even on larger systems.
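
Roughly, that mechanism is the pattern below (a simplified sketch of the
existing add_interrupt_randomness() behaviour, not the exact kernel code):
the per-CPU fast pool is mixed without taking any lock, and spilling it
into the shared input pool is simply skipped whenever the lock is busy:

	fast_mix(fast_pool);			/* lockless, per-CPU mixing */

	if (!spin_trylock(&input_pool.lock))
		return;				/* contended: give up, retry
						 * on a later interrupt     */
	_mix_pool_bytes(&input_pool, &fast_pool->pool,
			sizeof(fast_pool->pool));
	spin_unlock(&input_pool.lock);
	credit_entropy_bits(&input_pool, 1);	/* modest credit for the spill */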

v2: Fix name of pool 0. Fix race with interrupts. Make
iteration loops slightly more efficient. Add ifdefs to avoid
any extra code on non-NUMA. Delay other pool use to when
the original pool initialized and initialize the pools from
pool 0. Add comments on memory allocation.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c | 185 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 158 insertions(+), 27 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..333a70c 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -156,6 +156,17 @@
  * particular randomness source.  They do this by keeping track of the
  * first and second order deltas of the event timings.
  *
+ * Distributed pools
+ * =================
+ *
+ * On larger systems the locking on the single non blocking pool can
+ * become a bottleneck. To avoid this, we use an own non blocking pool
+ * for each NUMA node. The distributed pools are fed round robin from
+ * the input pool. Each user then only reads entropy from their local
+ * pool.
+ *
+ * For the blocking pool there is still only a single instance.
+ *
  * Ensuring unpredictability at system startup
  * ============================================
  *
@@ -467,7 +478,7 @@ static struct entropy_store blocking_pool = {
 
 static struct entropy_store nonblocking_pool = {
 	.poolinfo = &poolinfo_table[1],
-	.name = "nonblocking",
+	.name = "nonblocking pool 0",
 	.pull = &input_pool,
 	.lock = __SPIN_LOCK_UNLOCKED(nonblocking_pool.lock),
 	.pool = nonblocking_pool_data,
@@ -475,6 +486,41 @@ static struct entropy_store nonblocking_pool = {
 					push_to_pool),
 };
 
+/*
+ * Per NUMA node nonblocking pool. This avoids lock contention
+ * when many processes extract data from /dev/urandom in parallel.
+ * /dev/random stays single instance for now.
+ *
+ * This variable is only set up once all non node 0 pools
+ * are initialized.
+ */
+#ifdef CONFIG_NUMA
+static struct entropy_store **nonblocking_node_pool __read_mostly;
+static struct entropy_store **early_nonblocking_node_pool __read_mostly;
+#else
+#define nonblocking_node_pool ((struct entropy_store **)NULL)
+#endif
+
+/* User must check for zero nonblocking_node_pool */
+#define for_each_nb_pool(i, pool) \
+	{ int num_nodes = num_possible_nodes(); \
+		for (i = 0; i < num_nodes; i++) { \
+			pool = nonblocking_node_pool[i];
+#define end_for_each_nb() } }
+
+static inline struct entropy_store *get_nonblocking_pool(void)
+{
+	struct entropy_store *pool = &nonblocking_pool;
+
+	/*
+	 * Non node 0 pools may take longer to initialize. Keep using
+	 * the boot nonblocking pool while this happens.
+	 */
+	if (nonblocking_node_pool)
+		pool = nonblocking_node_pool[numa_node_id()];
+	return pool;
+}
+
 static __u32 const twist_table[8] = {
 	0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
 	0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
@@ -608,6 +654,25 @@ static void process_random_ready_list(void)
 	spin_unlock_irqrestore(&random_ready_list_lock, flags);
 }
 
+#define POOL_INIT_BYTES (128 / 8)
+
+void init_node_pools(void)
+{
+#ifdef CONFIG_NUMA
+	int i;
+	for (i = 0; i < num_possible_nodes(); i++) {
+		struct entropy_store *pool = early_nonblocking_node_pool[i];
+		char buf[POOL_INIT_BYTES];
+
+		get_random_bytes(buf, sizeof buf);
+		mix_pool_bytes(pool, buf, sizeof buf);
+	}
+	/* Initialize first before setting variable */
+	mb();
+	nonblocking_node_pool = early_nonblocking_node_pool;
+#endif
+}
+
 /*
  * Credit (or debit) the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
@@ -681,6 +746,7 @@ retry:
 			prandom_reseed_late();
 			process_random_ready_list();
 			wake_up_all(&urandom_init_wait);
+			init_node_pools();
 			pr_notice("random: %s pool is initialized\n", r->name);
 		}
 	}
@@ -698,18 +764,23 @@ retry:
 			kill_fasync(&fasync, SIGIO, POLL_IN);
 		}
 		/* If the input pool is getting full, send some
-		 * entropy to the two output pools, flipping back and
-		 * forth between them, until the output pools are 75%
+		 * entropy to the other output pools, cycling
+		 * between them, until the output pools are 75%
 		 * full.
+		 * Feed into all pools round robin.
 		 */
 		if (entropy_bits > random_write_wakeup_bits &&
 		    r->initialized &&
 		    r->entropy_total >= 2*random_read_wakeup_bits) {
 			static struct entropy_store *last = &blocking_pool;
+			static int next_pool = -1;
 			struct entropy_store *other = &blocking_pool;
 
-			if (last == &blocking_pool)
-				other = &nonblocking_pool;
+			/* -1: use blocking pool, 0<=max_node: node nb pool */
+			if (next_pool > -1)
+				other = nonblocking_node_pool[next_pool];
+			if (++next_pool >= num_possible_nodes())
+				next_pool = -1;
 			if (other->entropy_count <=
 			    3 * other->poolinfo->poolfracbits / 4)
 				last = other;
@@ -748,6 +819,17 @@ struct timer_rand_state {
 
 #define INIT_TIMER_RAND_STATE { INITIAL_JIFFIES, };
 
+static void pool_add_buf(struct entropy_store *pool,
+			 const void *buf, unsigned int size)
+{
+	unsigned long flags;
+	unsigned long time = random_get_entropy() ^ jiffies;
+	spin_lock_irqsave(&pool->lock, flags);
+	_mix_pool_bytes(pool, buf, size);
+	_mix_pool_bytes(pool, &time, sizeof(time));
+	spin_unlock_irqrestore(&pool->lock, flags);
+}
+
 /*
  * Add device- or boot-specific data to the input and nonblocking
  * pools to help initialize them to unique values.
@@ -758,19 +840,18 @@ struct timer_rand_state {
  */
 void add_device_randomness(const void *buf, unsigned int size)
 {
-	unsigned long time = random_get_entropy() ^ jiffies;
-	unsigned long flags;
+	struct entropy_store *pool;
+	int i;
 
 	trace_add_device_randomness(size, _RET_IP_);
-	spin_lock_irqsave(&input_pool.lock, flags);
-	_mix_pool_bytes(&input_pool, buf, size);
-	_mix_pool_bytes(&input_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&input_pool.lock, flags);
-
-	spin_lock_irqsave(&nonblocking_pool.lock, flags);
-	_mix_pool_bytes(&nonblocking_pool, buf, size);
-	_mix_pool_bytes(&nonblocking_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&nonblocking_pool.lock, flags);
+	pool_add_buf(&input_pool, buf, size);
+	if (!nonblocking_node_pool) {
+		pool_add_buf(&nonblocking_pool, buf, size);
+	} else {
+		for_each_nb_pool (i, pool) {
+			pool_add_buf(pool, buf, size);
+		} end_for_each_nb()
+	}
 }
 EXPORT_SYMBOL(add_device_randomness);
 
@@ -1252,15 +1333,16 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
  */
 void get_random_bytes(void *buf, int nbytes)
 {
+	struct entropy_store *pool = get_nonblocking_pool();
 #if DEBUG_RANDOM_BOOT > 0
-	if (unlikely(nonblocking_pool.initialized == 0))
+	if (unlikely(pool->initialized == 0))
 		printk(KERN_NOTICE "random: %pF get_random_bytes called "
 		       "with %d bits of entropy available\n",
 		       (void *) _RET_IP_,
-		       nonblocking_pool.entropy_total);
+		       pool->entropy_total);
 #endif
 	trace_get_random_bytes(nbytes, _RET_IP_);
-	extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+	extract_entropy(pool, buf, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes);
 
@@ -1350,7 +1432,7 @@ void get_random_bytes_arch(void *buf, int nbytes)
 	}
 
 	if (nbytes)
-		extract_entropy(&nonblocking_pool, p, nbytes, 0, 0);
+		extract_entropy(get_nonblocking_pool(), p, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes_arch);
 
@@ -1390,12 +1472,43 @@ static void init_std_data(struct entropy_store *r)
  * initializations complete at compile time. We should also
  * take care not to overwrite the precious per platform data
  * we were given.
+ *
+ * This is early initialization, if memory allocations of a few
+ * bytes fails the kernel is doomed anyways. So use __GFP_NOFAIL.
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+	struct entropy_store **pool_list;
+#endif
+
 	init_std_data(&input_pool);
 	init_std_data(&blocking_pool);
 	init_std_data(&nonblocking_pool);
+#ifdef CONFIG_NUMA
+	pool_list = kmalloc(num_nodes * sizeof(void *),
+					GFP_KERNEL|__GFP_NOFAIL);
+	pool_list[0] = &nonblocking_pool;
+	for (i = 1; i < num_nodes; i++) {
+		struct entropy_store *pool;
+
+		pool = kzalloc(sizeof(struct entropy_store),
+				GFP_KERNEL|__GFP_NOFAIL);
+		pool_list[i] = pool;
+		pool->poolinfo = &poolinfo_table[1];
+		pool->pull = &input_pool;
+		spin_lock_init(&pool->lock);
+		/* pool data not cleared intentionally */
+		pool->pool = kmalloc(sizeof(nonblocking_pool_data),
+				     GFP_KERNEL|__GFP_NOFAIL);
+		INIT_WORK(&pool->push_work, push_to_pool);
+		pool->name = kasprintf(GFP_KERNEL|__GFP_NOFAIL,
+				"nonblocking pool %d", i);
+	}
+	early_nonblocking_node_pool = pool_list;
+#endif
 	return 0;
 }
 early_initcall(rand_initialize);
@@ -1458,17 +1571,19 @@ static ssize_t
 urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
 	int ret;
+	struct entropy_store *pool;
 
-	if (unlikely(nonblocking_pool.initialized == 0))
+	pool = get_nonblocking_pool();
+	if (unlikely(pool->initialized == 0))
 		printk_once(KERN_NOTICE "random: %s urandom read "
 			    "with %d bits of entropy available\n",
 			    current->comm, nonblocking_pool.entropy_total);
 
 	nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
-	ret = extract_entropy_user(&nonblocking_pool, buf, nbytes);
+	ret = extract_entropy_user(pool, buf, nbytes);
 
-	trace_urandom_read(8 * nbytes, ENTROPY_BITS(&nonblocking_pool),
-			   ENTROPY_BITS(&input_pool));
+	trace_urandom_read(8 * nbytes, ENTROPY_BITS(pool),
+			   ENTROPY_BITS(pool));
 	return ret;
 }
 
@@ -1513,22 +1628,34 @@ static ssize_t random_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
 	size_t ret;
+	struct entropy_store *pool;
+	int i;
 
 	ret = write_pool(&blocking_pool, buffer, count);
 	if (ret)
 		return ret;
-	ret = write_pool(&nonblocking_pool, buffer, count);
-	if (ret)
-		return ret;
+	if (nonblocking_node_pool) {
+		for_each_nb_pool (i, pool) {
+			ret = write_pool(pool, buffer, count);
+			if (ret)
+				return ret;
+		} end_for_each_nb()
+	} else {
+		ret = write_pool(&nonblocking_pool, buffer, count);
+		if (ret)
+			return ret;
+	}
 
 	return (ssize_t)count;
 }
 
 static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 {
+	int i;
 	int size, ent_count;
 	int __user *p = (int __user *)arg;
 	int retval;
+	struct entropy_store *pool;
 
 	switch (cmd) {
 	case RNDGETENTCNT:
@@ -1569,6 +1696,10 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 			return -EPERM;
 		input_pool.entropy_count = 0;
 		nonblocking_pool.entropy_count = 0;
+		if (nonblocking_node_pool)
+			for_each_nb_pool (i, pool) {
+				pool->entropy_count = 0;
+			} end_for_each_nb();
 		blocking_pool.entropy_count = 0;
 		return 0;
 	default:
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-03-01  5:17 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-22 23:16 [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
2015-09-22 23:16 ` [PATCH 2/3] random: Make input to output pool balancing per cpu Andi Kleen
2015-09-22 23:16 ` [PATCH 3/3] random: Add pool name to urandom_read trace point Andi Kleen
2015-09-22 23:25 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
2015-09-23 10:32 ` Rasmus Villemoes
2015-09-23 21:54   ` Andi Kleen
2015-09-23 19:40 ` Austin S Hemmelgarn
2015-09-23 23:28   ` Andi Kleen
2015-09-24 11:37     ` Austin S Hemmelgarn
2015-09-24 13:12       ` Theodore Ts'o
2015-09-24 16:00         ` Austin S Hemmelgarn
2015-09-24 16:52           ` Jeff Epler
2015-09-24 19:11             ` Austin S Hemmelgarn
2015-09-24 20:00               ` Jeff Epler
2015-09-24 20:14               ` Theodore Ts'o
2015-09-25 11:41                 ` Austin S Hemmelgarn
2015-09-25 19:07                   ` Austin S Hemmelgarn
2015-09-25 20:24                     ` Theodore Ts'o
2015-09-29 12:06                       ` Austin S Hemmelgarn
2015-09-29 11:57                     ` Austin S Hemmelgarn
2015-09-23 21:10 ` Theodore Ts'o
2015-09-23 21:25   ` Andi Kleen
2015-09-24 17:19 Updated scalable urandom patchkit Andi Kleen
2015-09-24 17:19 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
2015-09-30 14:40   ` Rasmus Villemoes
2015-10-06 22:05 Andi Kleen
2016-02-10 23:01 Scalable random patchkit revisited Andi Kleen
2016-02-10 23:01 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
2016-03-01  5:17 Andi Kleen
