From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935057AbbIVXQ1 (ORCPT ); Tue, 22 Sep 2015 19:16:27 -0400 Received: from mga03.intel.com ([134.134.136.65]:29698 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935008AbbIVXQV (ORCPT ); Tue, 22 Sep 2015 19:16:21 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.17,575,1437462000"; d="scan'208";a="566560124" From: Andi Kleen To: tytso@mit.edu Cc: linux-kernel@vger.kernel.org, kirill.shutemov@linux.intel.com, herbert@gondor.apana.org.au, Andi Kleen Subject: [PATCH 1/3] Make /dev/urandom scalable Date: Tue, 22 Sep 2015 16:16:05 -0700 Message-Id: <1442963767-14945-1-git-send-email-andi@firstfloor.org> X-Mailer: git-send-email 2.4.3 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Andi Kleen We had a case where a 4 socket system spent >80% of its total CPU time contending on the global urandom nonblocking pool spinlock. While the application could probably have used an own PRNG, it may have valid reasons to use the best possible key for different session keys. The application still ran acceptable under 2S, but just fell over the locking cliff on 4S. Implementation ============== The non blocking pool is used widely these days, from every execve() (to set up AT_RANDOM for ld.so randomization), to getrandom(3) and to frequent /dev/urandom users in user space. Clearly having such a popular resource under a global lock is bad thing. This patch changes the random driver to use distributed per NUMA node nonblocking pools. The basic structure is not changed: entropy is first fed into the input pool and later from there distributed round-robin into the blocking and non blocking pools. This patch extends this to use an dedicated non blocking pool for each node, and distribute evenly from the input pool into these distributed pools, in addition to the blocking pool. Then every urandom/getrandom user fetches data from its node local pool. At boot time when users may be still waiting for the non blocking pool initialization we use the node 0 non blocking pool, to avoid the need for different wake up queues. For single node systems (like the vast majority of non server systems) nothing changes. There is still only a single non blocking pool. The different per-node pools also start with different start states and diverge more and more over time, as they get feed different input data. So "replay" attacks are difficult after some time. Without hardware random number seed support the start states (until enough real entropy is collected) are not very random, but that's not worse than before Since we still have a global input pool there are no problems with load balancing entropy data between nodes. Any node that never runs any interrupts would still get the same amount of entropy as other nodes. Entropy is fed preferably to nodes that need it more using the existing 75% threshold. For saving/restoring /dev/urandom, there is currently no mechanism to access the non local node pool (short of setting task affinity). This implies that currently the standard init/exit random save/restore scripts would only save node 0. On restore all pools are updates. So the entropy of non 0 gets lost over reboot. That seems acceptable to me for now (fixing this would need a new separate save/restore interface) Scalability =========== I tested the patch with a simple will-it-scale test banging on get_random() in parallel on more and more CPUs. Of course that is not a realistic scenario, as real programs should do some work between getting random numbers. But it's a worst case for the random scalability. On a 4S Xeon v3 system _without_ the patchkit the benchmark maxes out when using all the threads on one node. After that it quickly settles to about half the throughput of one node with 2-4 nodes. (all throughput factors, bigger is better) Without patchkit: 1 node: 1x 2 nodes: 0.75x 3 nodes: 0.55x 4 nodes: 0.42x With the patchkit applied: 1 node: 1x 2 nodes: 2x 3 nodes: 3.4x 4 nodes: 6x So it's not quite linear scalability, but 6x maximum throughput is already a lot better. A node can still have a large number of CPUs: on my test system 36 logical software threads (18C * 2T). In principle it may make sense to split it up further. Per logical CPU would be clearly overkill. But that would also add more pressure on the input pools. For now per node seems like a acceptable compromise. /dev/random still uses a single global lock. For now that seems acceptable as it normally cannot be used for real high volume accesses anyways. The input pool also still uses a global lock. The existing per CPU fast pool and "give up when busy" mechanism seems to scale well enough even on larger systems. Signed-off-by: Andi Kleen --- drivers/char/random.c | 142 ++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 114 insertions(+), 28 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index d0da5d8..d0302be 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -156,6 +156,17 @@ * particular randomness source. They do this by keeping track of the * first and second order deltas of the event timings. * + * Distributed pools + * ================= + * + * On larger systems the locking on the single non blocking pool can + * become a bottleneck. To avoid this, we use an own non blocking pool + * for each NUMA node. The distributed pools are fed round robin from + * the input pool. Each user then only reads entropy from their local + * pool. + * + * For the blocking pool there is still only a single instance. + * * Ensuring unpredictability at system startup * ============================================ * @@ -467,7 +478,7 @@ static struct entropy_store blocking_pool = { static struct entropy_store nonblocking_pool = { .poolinfo = &poolinfo_table[1], - .name = "nonblocking", + .name = "nonblocking 0", .pull = &input_pool, .lock = __SPIN_LOCK_UNLOCKED(nonblocking_pool.lock), .pool = nonblocking_pool_data, @@ -475,6 +486,32 @@ static struct entropy_store nonblocking_pool = { push_to_pool), }; +/* + * Per NUMA node nonblocking pool. This avoids lock contention + * when many processes extract data from /dev/urandom in parallel. + * /dev/random stays single instance for now. + */ +static struct entropy_store **nonblocking_node_pool __read_mostly; + +#define for_each_nb_pool(i, pool) for (i = 0; i < num_possible_nodes(); i++) { \ + pool = nonblocking_node_pool[i]; +#define end_for_each_nb() } + +static inline struct entropy_store *get_nonblocking_pool(void) +{ + struct entropy_store *pool = &nonblocking_pool; + + /* + * Non node 0 pools may take longer to initialize. Keep using + * the boot nonblocking pool while this happens. + */ + if (nonblocking_node_pool) + pool = nonblocking_node_pool[numa_node_id()]; + if (!pool->initialized) + pool = &nonblocking_pool; + return pool; +} + static __u32 const twist_table[8] = { 0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158, 0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 }; @@ -698,18 +735,23 @@ retry: kill_fasync(&fasync, SIGIO, POLL_IN); } /* If the input pool is getting full, send some - * entropy to the two output pools, flipping back and - * forth between them, until the output pools are 75% + * entropy to the other output pools, cycling + * between them, until the output pools are 75% * full. + * Feed into all pools round robin. */ if (entropy_bits > random_write_wakeup_bits && r->initialized && r->entropy_total >= 2*random_read_wakeup_bits) { static struct entropy_store *last = &blocking_pool; + static int next_pool = -1; struct entropy_store *other = &blocking_pool; - if (last == &blocking_pool) - other = &nonblocking_pool; + /* -1: use blocking pool, 0<=max_node: node nb pool */ + if (next_pool > -1) + other = nonblocking_node_pool[next_pool]; + if (++next_pool >= num_possible_nodes()) + next_pool = -1; if (other->entropy_count <= 3 * other->poolinfo->poolfracbits / 4) last = other; @@ -748,6 +790,17 @@ struct timer_rand_state { #define INIT_TIMER_RAND_STATE { INITIAL_JIFFIES, }; +static void pool_add_buf(struct entropy_store *pool, + const void *buf, unsigned int size) +{ + unsigned long flags; + unsigned long time = random_get_entropy() ^ jiffies; + spin_lock_irqsave(&pool->lock, flags); + _mix_pool_bytes(pool, buf, size); + _mix_pool_bytes(pool, &time, sizeof(time)); + spin_unlock_irqrestore(&pool->lock, flags); +} + /* * Add device- or boot-specific data to the input and nonblocking * pools to help initialize them to unique values. @@ -758,19 +811,18 @@ struct timer_rand_state { */ void add_device_randomness(const void *buf, unsigned int size) { - unsigned long time = random_get_entropy() ^ jiffies; - unsigned long flags; + struct entropy_store *pool; + int i; trace_add_device_randomness(size, _RET_IP_); - spin_lock_irqsave(&input_pool.lock, flags); - _mix_pool_bytes(&input_pool, buf, size); - _mix_pool_bytes(&input_pool, &time, sizeof(time)); - spin_unlock_irqrestore(&input_pool.lock, flags); - - spin_lock_irqsave(&nonblocking_pool.lock, flags); - _mix_pool_bytes(&nonblocking_pool, buf, size); - _mix_pool_bytes(&nonblocking_pool, &time, sizeof(time)); - spin_unlock_irqrestore(&nonblocking_pool.lock, flags); + pool_add_buf(&input_pool, buf, size); + if (!nonblocking_node_pool) { + pool_add_buf(&nonblocking_pool, buf, size); + } else { + for_each_nb_pool (i, pool) { + pool_add_buf(pool, buf, size); + } end_for_each_nb() + } } EXPORT_SYMBOL(add_device_randomness); @@ -1252,15 +1304,16 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf, */ void get_random_bytes(void *buf, int nbytes) { + struct entropy_store *pool = get_nonblocking_pool(); #if DEBUG_RANDOM_BOOT > 0 - if (unlikely(nonblocking_pool.initialized == 0)) + if (unlikely(pool->initialized == 0)) printk(KERN_NOTICE "random: %pF get_random_bytes called " "with %d bits of entropy available\n", (void *) _RET_IP_, - nonblocking_pool.entropy_total); + pool->entropy_total); #endif trace_get_random_bytes(nbytes, _RET_IP_); - extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0); + extract_entropy(pool, buf, nbytes, 0, 0); } EXPORT_SYMBOL(get_random_bytes); @@ -1350,7 +1403,7 @@ void get_random_bytes_arch(void *buf, int nbytes) } if (nbytes) - extract_entropy(&nonblocking_pool, p, nbytes, 0, 0); + extract_entropy(get_nonblocking_pool(), p, nbytes, 0, 0); } EXPORT_SYMBOL(get_random_bytes_arch); @@ -1393,9 +1446,32 @@ static void init_std_data(struct entropy_store *r) */ static int rand_initialize(void) { + int i; + int num_nodes = num_possible_nodes(); + char name[40]; + + nonblocking_node_pool = kzalloc(num_nodes * sizeof(void *), + GFP_KERNEL|__GFP_NOFAIL); + init_std_data(&input_pool); init_std_data(&blocking_pool); + nonblocking_node_pool[0] = &nonblocking_pool; init_std_data(&nonblocking_pool); + for (i = 1; i < num_nodes; i++) { + struct entropy_store *pool = kzalloc(sizeof(struct entropy_store), + GFP_KERNEL|__GFP_NOFAIL); + nonblocking_node_pool[i] = pool; + pool->poolinfo = &poolinfo_table[1]; + pool->pull = &input_pool; + spin_lock_init(&pool->lock); + /* pool data not cleared intentionally */ + pool->pool = kmalloc(sizeof(nonblocking_pool_data), + GFP_KERNEL|__GFP_NOFAIL); + INIT_WORK(&pool->push_work, push_to_pool); + snprintf(name, sizeof name, "nonblocking pool %d", i); + pool->name = kstrdup(name, GFP_KERNEL|__GFP_NOFAIL); + init_std_data(pool); + } return 0; } early_initcall(rand_initialize); @@ -1458,17 +1534,19 @@ static ssize_t urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) { int ret; + struct entropy_store *pool; - if (unlikely(nonblocking_pool.initialized == 0)) + pool = get_nonblocking_pool(); + if (unlikely(pool->initialized == 0)) printk_once(KERN_NOTICE "random: %s urandom read " "with %d bits of entropy available\n", current->comm, nonblocking_pool.entropy_total); nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3)); - ret = extract_entropy_user(&nonblocking_pool, buf, nbytes); + ret = extract_entropy_user(pool, buf, nbytes); - trace_urandom_read(8 * nbytes, ENTROPY_BITS(&nonblocking_pool), - ENTROPY_BITS(&input_pool)); + trace_urandom_read(8 * nbytes, ENTROPY_BITS(pool), + ENTROPY_BITS(pool)); return ret; } @@ -1513,22 +1591,28 @@ static ssize_t random_write(struct file *file, const char __user *buffer, size_t count, loff_t *ppos) { size_t ret; + struct entropy_store *pool; + int i; ret = write_pool(&blocking_pool, buffer, count); if (ret) return ret; - ret = write_pool(&nonblocking_pool, buffer, count); - if (ret) - return ret; + for_each_nb_pool (i, pool) { + ret = write_pool(pool, buffer, count); + if (ret) + return ret; + } end_for_each_nb() return (ssize_t)count; } static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg) { + int i; int size, ent_count; int __user *p = (int __user *)arg; int retval; + struct entropy_store *pool; switch (cmd) { case RNDGETENTCNT: @@ -1568,7 +1652,9 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg) if (!capable(CAP_SYS_ADMIN)) return -EPERM; input_pool.entropy_count = 0; - nonblocking_pool.entropy_count = 0; + for_each_nb_pool (i, pool) { + pool->entropy_count = 0; + } end_for_each_nb(); blocking_pool.entropy_count = 0; return 0; default: -- 2.4.3