* Re: Updated scalable urandom patchkit
From: George Spelvin @ 2015-10-10 18:45 UTC
To: andi; +Cc: ahferroin7, jepler, linux, linux-kernel, linux, tytso

I'm very very sorry to be so late to the party; I didn't see this thread
until the week-delayed LWN article on the subject drew my attention to it.

There's a fundamental design issue with this patch set that seems
unnecessary to me: multiple nonblocking pools.

I know, I know, that seems like the whole point of it, but hear me out.

Entropy is a holographic property of the pool, not located in any
particular bit position.  And the most basic operation, of reading from
the pool, can easily be done by multiple readers at the same time from
the same bit pattern.  They just need to be salted with distinguishing
nonces (CPU IDs will do nicely) to ensure they all get different results.
So a reader doesn't just get hash(pool[]), but hash(unique_nonce, pool[]).

This in turn lets the spin_lock_irqsave() in extract_buf be moved to
after the sha_transform() calls.

I think the whole thing can be done, more elegantly, using a single pool
and judiciously relaxed locking rules.  You don't even need reader-writer
locks; it's actually beneficial rather than harmful if there's a race
between a pool reader and writer.

In general, fewer larger pools is better for entropy storage.  The only
reason for the current three-pool design is to achieve strict isolation
of the /dev/random pool.

The two operations that require locking are:

1. Entropy accounting.  However, that only has to be strict for
   /dev/random.  For /dev/urandom, it can be late in various ways.
   One possibility is to have separate entropy accounting per NUMA
   node; only the pool proper has to be shared.

2. Add-back of the output to the pool.  This involves writing to the
   pool, but for non-invertibility, it only has to be done by one of
   a number of concurrent readers.  I think significant cleverness
   can be applied here.

There are a lot of things that can be done about this second point:

2a. The obvious one is to batch the add-back.  For backtracking
    protection, we only need to do one add-back per read, not one per
    10 bytes of output.  Combined with the above change that locks only
    around the add-back and not the pool hashing, this greatly reduces
    both the lock window and the number of times it's taken.

2b. Another idea would be to optimize for 16 bytes.  I like the basic
    concept of having a larger hash internal state than the output, but
    would it be possible to output 16 of the 20 bytes of SHA-1 output
    rather than the current 10?  This is obviously a Ted question (the
    current folding is his idea), but even 32 bits is enough to protect
    against internal collisions in the hash state.

2c. Another possibility would be to add small separate "salt piles"
    which are initialized to the CPU id, and stirred by reads.  They only
    need to be large enough to not experience birthday collisions even
    assuming a stupid number of reads between stirs of the main pool.
    (So 128 bits would be more than ample.)  A read would consist of:

	hash[5] = { initial values }
	sha1(hash, my_salt)
	sha1(hash, main_pool)
	if (spin_trylock(main_pool)) {
		add_back(hash, main_pool);
		spin_unlock(main_pool);
	} else {
		add_back(hash, my_salt);
	}
2d. A more complete solution involves noting that, if there are multiple
    concurrent readers, only one has to add back its output to prevent
    backtracking, because all of the concurrent reads are equivalent in
    that sense.  (Even though, because they're salted with separate
    nonces like the CPU id, they're not identical.)

I'm thinking in terms of a "generation counter" for the pool.  (Perhaps
implemented as a seqlock.)  The readers of the generation-n pool must
ensure that *someone* is responsible for bumping the generation to n+1
to prevent backtracking of the generation-n state, but only one.

A key point is that the add-back that bumps the generation to n+2 may
actually be a hash of the generation-n pool state, the generation-n+1
state, or some transient intermediate state due to races with the
generation-n+1 add-back.  It is not necessary for readers to wait for
the pool to be stable, only that it's a non-invertible transformation
based on some earlier generation.

Now, to actually implement that, a "spin lock while checking extra
condition" would be ideal, but I do not want to add Yet Another Locking
Primitive to the kernel.

One possibility would be to accept the possibility of a race condition
breaking anti-backtracking.  Another reader will come along soon enough
and fix it.  But if that's not okay, consider the following:

We add a second lock, the "responsibility lock", the holder of which is
responsible for doing anti-backtracking.  After hashing the pool, each
reader does the following:

1. (Optional) trylock() the pool lock.  This is a low-load optimization.
   If it succeeds, go directly to step 5.
2. Trylock() the responsibility lock.  If this fails, we're done; return.
3. Wait for the pool lock.
4. Drop the responsibility lock.  This must be done before updating the
   pool proper.
5. Add our anti-backtracking bytes to the pool.
6. Release the pool lock.

(I originally thought I'd need to ping-pong between two responsibility
locks based on the generation counter % 2, but now I don't think so.)

Under high load, you have one processor adding back, a second waiting
to add back, and everyone else just sails right through without waiting
at all.  (A sketch of this in C appears below.)

Details to be filled in:

- Each reader needs some sort of nonce to distinguish multiple reads if
  there's no main pool add-back between them.  RDTSC, per-cpu counter,
  or whatever.

- Entropy accounting.  As mentioned, some sloppiness is okay, so I think
  it's solvable, but I haven't looked at it yet.
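A minimal C sketch of that scheme, assuming ordinary kernel spinlocks;
pool_lock, resp_lock, and add_back() are illustrative names, not
existing symbols in random.c:

static DEFINE_SPINLOCK(pool_lock);	/* guards the pool proper */
static DEFINE_SPINLOCK(resp_lock);	/* "someone will add back" promise */

static void antibacktrack_add_back(struct entropy_store *r, const __u32 *hash)
{
	/* 1. Low-load optimization: if the pool is free, just take it. */
	if (spin_trylock(&pool_lock))
		goto add;

	/* 2. Another reader already took responsibility; we may leave. */
	if (!spin_trylock(&resp_lock))
		return;

	/* 3. We are responsible now; wait for the pool lock. */
	spin_lock(&pool_lock);

	/* 4. Hand off responsibility before updating the pool proper. */
	spin_unlock(&resp_lock);
add:
	/* 5. Mix our output back in to prevent backtracking. */
	add_back(r, hash);

	/* 6. Done. */
	spin_unlock(&pool_lock);
}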
* Re: Updated scalable urandom patchkit
From: Theodore Ts'o @ 2015-10-11 2:31 UTC
To: George Spelvin; +Cc: andi, ahferroin7, jepler, linux-kernel, linux

On Sat, Oct 10, 2015 at 02:45:46PM -0400, George Spelvin wrote:
> In general, fewer larger pools is better for entropy storage.  The only
> reason for the current three-pool design is to achieve strict isolation
> of the /dev/random pool.

You're absolutely right, and I would like to see if we can get away
with keeping a single-pool design.

Your idea of using the CPU id as a nonce is a good one, and I think we
can do something similar that should be good enough for mixing the hash
back into the pool.  We can simply use a hash of the CPU id to change
the offset where we do the mixing, and this should be good enough to
avoid collisions when we do the add-back.

HOWEVER.  One downside of this approach is that, especially on a NUMA
system, the cost of the cache coherency across a single pool which is
constantly being modified and shared by all of the NUMA nodes could be
a killer, even if we go completely lockless.  To that end, Andi, can
you try benchmarking the scalability of the patch below?  I really hope
it will be good enough, since besides using less memory, there are
security advantages in not spreading the entropy across N pools.

If it isn't, we might be able to play a trick where we sample the
r->add_ptr value before we start hashing the pool, and then check to
see if it's changed afterwards.  If it has, we could skip doing the
hash back, which could reduce the cache coherency traffic, since as
you point out:

> 2d. A more complete solution involves noting that, if there are multiple
>     concurrent readers, only one has to add back its output to prevent
>     backtracking, because all of the concurrent reads are equivalent
>     in that sense.  (Even though, because they're salted with separate
>     nonces like the CPU id, they're not identical.)

However, even if we put in that optimization, the primary question is
how good Intel's cache coherency protocols are on their NUMA systems.
I'm pretty sure this would be a disaster on, say, Sequent's old NUMA
machines, but I'm quite confident all of those servers are dead and
buried by now.  :-)

- Ted

commit 3cb51896deab45bddc1b8f571b1103eae8f50e0e
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Oct 10 22:03:53 2015 -0400

    random: make the reading from the non-blocking pool more scalable
    
    Andi Kleen reported a case where a 4 socket system spent >80% of its
    total CPU time contending on the global urandom nonblocking pool
    spinlock.  While the application could probably have used its own
    PRNG, it may have valid reasons to use the best possible key for
    different session keys.
    
    Instead of creating multiple separate per-NUMA-node non-blocking
    pools, use a trick suggested by George Spelvin:
    
        Entropy is a holographic property of the pool, not located in
        any particular bit position.  And the most basic operation, of
        reading from the pool, can easily be done by multiple readers
        at the same time from the same bit pattern.  They just need to
        be salted with distinguishing nonces (CPU IDs will do nicely)
        to ensure they all get different results....
    
    We use this trick, and in addition use a hash of the cpu id to
    change where we mix the hash back into the pool to avoid collisions.
    Since we are already using a lockless technique (cmpxchg) to update
    the entropy accounting, we don't need to change this around.
    
    This technique won't be quite as scalable, since on a NUMA system we
    will still be forcing cache lines to bounce around, but from the
    perspective of entropy storage we're much better off using a single
    pool rather than spreading it across multiple pools.
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..be6b315 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -260,6 +260,7 @@
 #include <linux/irq.h>
 #include <linux/syscalls.h>
 #include <linux/completion.h>
+#include <linux/hash.h>
 
 #include <asm/processor.h>
 #include <asm/uaccess.h>
@@ -494,7 +495,9 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 {
 	unsigned long i, tap1, tap2, tap3, tap4, tap5;
 	int input_rotate;
+	int is_nonblock = (r == &nonblocking_pool);
 	int wordmask = r->poolinfo->poolwords - 1;
+	int poolbits = r->poolinfo->poolbitshift - 5;
 	const char *bytes = in;
 	__u32 w;
 
@@ -506,6 +509,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 	input_rotate = r->input_rotate;
 	i = r->add_ptr;
+	if (is_nonblock)
+		i += hash_32(get_cpu(), poolbits);
 
 	/* mix one byte at a time to simplify size handling and churn faster */
 	while (nbytes--) {
@@ -534,6 +539,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 
 	r->input_rotate = input_rotate;
 	r->add_ptr = i;
+	if (is_nonblock)
+		put_cpu();
 }
 
 static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1090,6 +1097,7 @@ retry:
 static void extract_buf(struct entropy_store *r, __u8 *out)
 {
 	int i;
+	int is_nonblock = (r == &nonblocking_pool);
 	union {
 		__u32 w[5];
 		unsigned long l[LONGS(20)];
@@ -1108,9 +1116,12 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 			break;
 		hash.l[i] = v;
 	}
+	if (is_nonblock)
+		hash.w[0] ^= hash_32(get_cpu(), 32);
+	else
+		spin_lock_irqsave(&r->lock, flags);
 
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);
 
@@ -1124,7 +1135,10 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 	 * hash.
 	 */
 	__mix_pool_bytes(r, hash.w, sizeof(hash.w));
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (is_nonblock)
+		put_cpu();
+	else
+		spin_unlock_irqrestore(&r->lock, flags);
 
 	memzero_explicit(workspace, sizeof(workspace));
* Re: Updated scalable urandom patchkit
From: Theodore Ts'o @ 2015-10-11 2:53 UTC
To: George Spelvin, andi, ahferroin7, jepler, linux-kernel, linux

On Sat, Oct 10, 2015 at 10:31:46PM -0400, Theodore Ts'o wrote:
> To that end, Andi, can you try benchmarking the scalability of the
> patch below?  I really hope it will be good enough, since besides
> using less memory, there are security advantages in not spreading the
> entropy across N pools.

Andi, I forgot to mention --- if you could benchmark using both your
"run get_entropy() in a tight loop" test and the actual real-life
application, that would be much appreciated --- since if CPU cache
behavior is going to be a large factor in the results, a real
application is going to be more representative than a micro-benchmark.

Thanks!

- Ted
* Re: Updated scalable urandom patchkit
From: George Spelvin @ 2015-10-11 4:35 UTC
To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

Damn, I bow before the master.  That is a much neater solution than
mine; I had assumed a lock was required for writing.

While it's good enough for benchmarking, there are a few leftover
problems I mention so they don't get missed.

One is the final write back of add_ptr on the last line of
_mix_pool_bytes.  It actually writes i, which includes the per-cpu
nonce, and will have it jumping all over without the steady progression
that the mixing polynomial assumes.  (There's a similar, lesser problem
with input_rotate.)

The second, less obvious, problem is that by calling _mix_pool_bytes
completely lockless, there's the risk that it will race with and
overwrite the addition of new seed material to the pool.

The add-back is not critical, and races between two writers don't
really do any harm.  But seed entropy is valuable.

And unfortunately, transferring 256 bits (32 bytes) to the output pool
will try to write every word, so *any* concurrent add-back is risky;
there's no "safe" part of the pool that can be accessed lockless.

(The first crude hack that comes to mind is to double the size of the
output pool, without increasing its nominal capacity, and do seeding
and add-back to different halves.  Hopefully there's something more
elegant.)

Two minor suggestions about is_nonblock:

1) Rather than using (r == &nonblocking_pool), how about !r->limit?

2) Would you mind making is_nonblock bool?  I know back before the
   sacred text of the prophets Kernighan and Ritchie was corrupted by
   modern heresies, we used "int" for everything, and we liked it.
   5 miles in the snow, uphill both ways, yadda yadda.  But I like to
   document range restrictions as much as possible, and "bool" makes
   it clearer to both the compiler and the reader of the code.
* Re: Updated scalable urandom patchkit
From: Theodore Ts'o @ 2015-10-11 22:25 UTC
To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Sun, Oct 11, 2015 at 12:35:46AM -0400, George Spelvin wrote:
> One is the final write back of add_ptr on the last line of
> _mix_pool_bytes.  It actually writes i, which includes the per-cpu
> nonce, and will have it jumping all over without the steady progression
> that the mixing polynomial assumes.

Yep, good catch; I need to subtract that off.

> (There's a similar, lesser problem with input_rotate.)

I didn't bother messing with input_rotate at all, and I don't think
that will be a problem.

> The second, less obvious, problem is that by calling _mix_pool_bytes
> completely lockless, there's the risk that it will race with and
> overwrite the addition of new seed material to the pool.
>
> The add-back is not critical, and races between two writers don't
> really do any harm.  But seed entropy is valuable.

That's true, but I've been going back and forth about how much it's
worth to fix this.  After all, the non-blocking pool is guaranteed to
have cryptographic randomness, and so it's a question of how much
effort we want to spend to avoid the possibility of losing some seed
entropy.

One approach might be to take a reader lock when mixing back entropy,
but take an exclusive lock in the case when seeding the non-blocking
pool.

Another approach is to have an alternate mechanism for the non-blocking
pool which uses the LOCK prefix when XOR'ing into the pool.  This would
mean that we would have to drop the twisted LFSR and replace it with
something simpler, but so long as the mixing algorithm eventually
involves all of the bits in the pool, that will probably be OK.

Of course, using the LOCK prefix is CPU architecture specific, and in
particular, there is no equivalent on the ARM architecture.

The final thing we could do is to just throw in a smp_mb() before the
write into the pool, and just call it a day, and accept that while we
might lose a small amount of entropy due to a race, it will only be a
byte's worth each time we lose a race (instead of the whole 32 byte
cache line).

None of the alternatives are all that satisfying, and they will all be
a bit more expensive than the first patch.  The question is how to
weigh these problems against just simply using a separate pool for
each NUMA node, and how much scalability do we really need for
"realistic" workloads?

- Ted
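A minimal sketch of the first alternative, using an rwlock so that
concurrent add-backs can proceed while seeding excludes everyone;
mix_lock is a hypothetical field, not part of the real struct
entropy_store:

static void mix_back(struct entropy_store *r, const __u32 *hash, int nbytes)
{
	/* Add-backs may race with each other; those races are benign. */
	read_lock(&r->mix_lock);
	_mix_pool_bytes(r, hash, nbytes);
	read_unlock(&r->mix_lock);
}

static void seed_pool(struct entropy_store *r, const void *in, int nbytes)
{
	/* Seed entropy is valuable: exclude all add-backs while mixing. */
	write_lock(&r->mix_lock);
	_mix_pool_bytes(r, in, nbytes);
	write_unlock(&r->mix_lock);
}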
* Re: Updated scalable urandom patchkit
From: George Spelvin @ 2015-10-12 0:16 UTC
To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

Ted Ts'o wrote:
> Yep, good catch; I need to subtract that off.

I'm not thrilled with incrementing the pointer from i to len, but
mixing at positions i+k to i+k+len.  The whole LFSR scheme relies on a
regular pass structure.

How about this instead: drop the hashed offset, and instead let each
writer do an atomic_add_return on the index, then iterate over the
"reserved" slots.  Concurrent additions will at least do
non-overlapping writes until the number equals the pool size.

(Admittedly, if the pool is 32 words and we're adding back 20 bytes per
reader, overlap is hard to avoid.  Could we do seeding a byte at a time
with input rotate, but add-back 32 bits at a time for simplicity and
speed?)

The other option would be to have separate indexes for add-back and
seed addition, with the latter protected by a lock.

>> (There's a similar, lesser problem with input_rotate.)
> I didn't bother messing with input_rotate at all, and I don't think
> that will be a problem.

It was more that it's going to do strange non-deterministic things.

Personally, I hate the input_rotate.  It's not that it's harmful, just
that it doesn't do much good compared to the cost; for the number of
cycles and context space required there are more efficient mixing
operations.  But if you want, you can easily compute it on demand.
If you avoid reducing r->add_words mod poolwords, then

	input_rotate = 7*(r->add_words + r->add_words/poolwords)

>> The add-back is not critical, and races between two writers don't
>> really do any harm.  But seed entropy is valuable.
> That's true, but I've been going back and forth about how much it's
> worth to fix this.  After all, the non-blocking pool is guaranteed to
> have cryptographic randomness, and so it's a question of how much
> effort we want to spend to avoid the possibility of losing some seed
> entropy.

I worry that with only 32 words of pool, on a large (1K CPUs) machine
that's execv()ing and generating AT_RANDOM vectors a lot, the traffic
could be pretty heavy, and the odds of stomping a byte of seed material
could get substantial.

Or a small machine with a couple of concurrent /dev/urandom abusers.
Remember, it's globally readable, so it has to be resistant to
malicious abuse.

> One approach might be to take a reader lock when mixing back entropy,
> but take an exclusive lock in the case when seeding the non-blocking
> pool.

Even a reader lock is at least one atomic operation.

> Another approach is to have an alternate mechanism for the
> non-blocking pool which uses the LOCK prefix when XOR'ing into the
> pool.  This would mean that we would have to drop the twisted LFSR and
> replace it with something simpler, but so long as the mixing algorithm
> eventually involves all of the bits in the pool, that will probably be
> OK.
>
> Of course, using the LOCK prefix is CPU architecture specific, and in
> particular, there is no equivalent on the ARM architecture.

You can add rather than XOR, and we have atomic add primitives.  (You
could even, if you wanted to preserve the period proof of the mixing
function, compute the value which, when added to the original word,
would produce the same result as the XOR, and then atomically add
that.)
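That compute-the-delta trick can be sketched in a few lines; this is an
illustration of the idea, not code from any posted patch:

/*
 * Emulate an atomic XOR using atomic_add: compute the delta that
 * turns the old word into old ^ x, then add it atomically.  If a
 * racing writer changes the word in between, the stored value is
 * not the intended XOR result, but no earlier mixing is erased.
 */
static inline void atomic_xor_via_add(atomic_t *word, __u32 x)
{
	__u32 old = atomic_read(word);

	atomic_add((int)((old ^ x) - old), word);
}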
> The final thing we could do is to just throw in a smp_mb() before the
> write into the pool, and just call it a day, and accept that while we
> might lose a small amount of entropy due to a race, it will only be a
> byte's worth each time we lose a race (instead of the whole 32 byte
> cache line).

If you're willing to accept "probably" solutions, just add the seed
material twice.  If one byte is lost, the other copy will probably
survive.  Or, if you want to be cleverer, any form of error-correcting
code (duplication is just a simple form of it) will add enough
redundant information that a few erasures won't lose entropy.

But the atomic add seems safer.  I don't have a good feel for the
fraction of lost entropy a deliberate attack on a large machine could
cause, and /dev/urandom is not an area where I'm comfortable hoping.

> None of the alternatives are all that satisfying, and they will all be
> a bit more expensive than the first patch.  The question is how to
> weigh these problems against just simply using a separate pool for
> each NUMA node, and how much scalability do we really need for
> "realistic" workloads?

There are several possible solutions that don't need separate pools
(including separate add-back pools, with a shared seeded pool that is
never touched by add-back), so I don't think it's necessary to give up
yet.
* Re: Updated scalable urandom patchkit
From: Theodore Ts'o @ 2015-10-12 4:05 UTC
To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Sun, Oct 11, 2015 at 08:16:01PM -0400, George Spelvin wrote:
> I'm not thrilled with incrementing the pointer from i to len, but
> mixing at positions i+k to i+k+len.  The whole LFSR scheme relies on a
> regular pass structure.

That part I'm not worried about.  We still have a regular pass
structure --- since for each CPU, we are still iterating over the pool
in a regular fashion.

> How about this instead: drop the hashed offset, and instead let each
> writer do an atomic_add_return on the index, then iterate over the
> "reserved" slots.  Concurrent additions will at least do
> non-overlapping writes until the number equals the pool size.

One atomic operation per byte that we're mixing in?  That's quite
expensive.

> Personally, I hate the input_rotate.  It's not that it's harmful, just
> that it doesn't do much good compared to the cost; for the number of
> cycles and context space required there are more efficient mixing
> operations.

The input_rotate is useful for the input pool, for things like keyboard
codes, so we can make sure they enter the pool at more points than just
the low bits of each word.  But for the output pools, it really doesn't
make any sense.  And we are getting to the point where we may end up
having different mixing algorithms for the nonblocking pool, and in
that case I have absolutely no trouble dropping the input_rotate part
of the mixing algorithm for the non-blocking pool.

> Or a small machine with a couple of concurrent /dev/urandom abusers.
> Remember, it's globally readable, so it has to be resistant to
> malicious abuse.

One of the other ways we could solve this is by hanging a struct off
the task structure, and if we detect that we have a /dev/urandom
abuser, we give that process its own urandom pool, so any pissing that
it does will be in its own pool.  (So to speak.)

Most processes don't actually use that much randomness, and I'm not
that worried about in-kernel users of the nonblocking pool.  Even with
the most exec-heavy workload, setting up a new exec image is
heavyweight enough that you're really not going to be contending on the
lock.  I also have trouble with someone spending $$$$ on a system with
1K cpu cores and wasting all of their CPU power running shell scripts
that fork and exec a lot.  :-)

The reality is that most processes don't use /dev/urandom or
getrandom(2) at all, and those that do, many of them only use it
sparingly.  So maybe the right answer is to do something simple which
takes care of the abusers.

> You can add rather than XOR, and we have atomic add primitives.

Atomic-add primitives aren't portable either.  The representation isn't
guaranteed to be 32-bits, and on some platforms an atomic int is only
24-bits wide (the top 8 bits being used for locking purposes).

> There are several possible solutions that don't need separate pools
> (including separate add-back pools, with a shared seeded pool that is
> never touched by add-back), so I don't think it's necessary to give up
> yet.

Hmm, maybe.  I'm a bit worried about the amount of complexity that this
entails, and the reality is that the urandom pool or pools don't
provide anything other than cryptographic randomness.
At this point, I wonder if it might not be simpler to restrict the
current nonblocking pool to kernel users, and for userspace users, the
first time a process reads from /dev/urandom or calls getrandom(2), we
create for them a ChaCha20 CRNG, which hangs off of the task structure.
This would require about 72 bytes of state per process, but normally
very few processes are reading from /dev/urandom or calling
getrandom(2) from userspace.

The CRNG would be initialized from the non-blocking pool, and reseeded
after, say, 2**24 cranks or five minutes.  It's essentially an
OpenBSD-style arc4random in the kernel.

(Arguably the right answer is to put arc4random in libc, where it can
handle forks/clones/pthreads automatically, but it seems pretty clear
*that* train left a long time ago.)

I have a feeling this may be less code and complexity, and it nicely
handles the case where we have a /dev/urandom abuser who feels that
they want to call /dev/urandom in a tight loop, even on a 4 socket Xeon
system.  :-)

- Ted
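A rough sketch of what such per-task state could look like; the struct
and field names are invented for illustration, with the "about 72
bytes" figure read as the 64-byte ChaCha20 input block plus two 4-byte
control variables:

struct task_crng {
	__u32 state[16];	/* ChaCha20 input block: constants,
				 * 256-bit key, counter, nonce (64 bytes) */
	__u32 cranks;		/* output blocks since the last reseed */
	__u32 reseed_time;	/* when we last reseeded from the
				 * non-blocking pool */
};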
* Re: Updated scalable urandom patchkit
From: George Spelvin @ 2015-10-12 7:49 UTC
To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

>> I'm not thrilled with incrementing the pointer from i to len, but
>> mixing at positions i+k to i+k+len.  The whole LFSR scheme relies on
>> a regular pass structure.
> That part I'm not worried about.  We still have a regular pass
> structure --- since for each CPU, we are still iterating over the pool
> in a regular fashion.

Not if I understand what you are doing.

Suppose CPU 0 has an offset of 0, and CPU 1 has an offset of 12.  (I'll
also use a count-up index even though I know it's actually count-down
in the code.)

CPU 0 writes 5 words at 0..4, leaving add_ptr = 5.
Then CPU 1 writes 5 words at 17..21, leaving add_ptr = 10.
Then CPU 0 writes its next word at 10.

Words 5..9 never got mixed at all.

>> How about this instead: drop the hashed offset, and instead let each
>> writer do an atomic_add_return on the index, then iterate over the
>> "reserved" slots.  Concurrent additions will at least do
>> non-overlapping writes until the number equals the pool size.
> One atomic operation per byte that we're mixing in?  That's quite
> expensive.

Well, for this part, it's one atomic operation per _mix_pool_bytes:

	i = (unsigned)atomic_sub_return(nbytes, &r->add_ptr) + nbytes;

However, I *also* have to use atomic_add to write r->pool[i], which is
indeed expensive.  PoC diff attached, but I'm not sure if you'll like
it.

> The input_rotate is useful for the input pool, for things like
> keyboard codes, so we can make sure they enter the pool at more points
> than just the low bits of each word.  But for the output pools, it
> really doesn't make any sense.  And we are getting to the point where
> we may end up having different mixing algorithms for the nonblocking
> pool, and in that case I have absolutely no trouble dropping the
> input_rotate part of the mixing algorithm for the non-blocking pool.

You're forgetting: keyboard codes don't go into the input pool.  They
go into the per-CPU fast pool, which does some quite thorough mixing
and delivers 128 bits with the entropy effectively distributed across
it.  These days, the only things that go into the large pools are the
output from other pools and /dev/random writes.

But also, that's what the twist_table achieves.  By the time the pool
wraps around, the previous value at a given array position has been
read and twisted many times, and is now smeared across all 32 bits.
If you want larger shifts *and* greater efficiency by removing a table
lookup from the critical path, I'll happily replace it with an xorshift
structure (it will take me a little while; I've generated primitive
polynomials in the past but need to rewrite the code).

>> Or a small machine with a couple of concurrent /dev/urandom abusers.
>> Remember, it's globally readable, so it has to be resistant to
>> malicious abuse.
> One of the other ways we could solve this is by hanging a struct off
> the task structure, and if we detect that we have a /dev/urandom
> abuser, we give that process its own urandom pool, so any pissing that
> it does will be in its own pool.  (So to speak.)

Let me make sure we're talking about the same thing here.

Segregating abusers so they don't cause *performance* problems for
other threads is a reasonable optimization.
But the quoted conversation was talking about a *security* problem that
could be created by /dev/urandom abusers if some of your lock-relaxing
ideas were followed.  I was observing that the failure case you were
hoping was rare could be made common by a malicious user.  And for
that, I don't think a caching wrapper is the way to solve the problem.

For security, either the underlying pool is robust or fragile.  If it's
robust, we don't need this extra layer; abusers will get shitty
performance but won't hurt the quality of the output.  If we need the
layer, then we have to ask if we understand the vulnerability to abuse
and have successfully segregated all damaging forms of abuse.  That's
more analysis and design effort than simply eliminating the problem in
the first place.

> Most processes don't actually use that much randomness, and I'm not
> that worried about in-kernel users of the nonblocking pool.  Even with
> the most exec-heavy workload, setting up a new exec image is
> heavyweight enough that you're really not going to be contending on
> the lock.  I also have trouble with someone spending $$$$ on a system
> with 1K cpu cores and wasting all of their CPU power running shell
> scripts that fork and exec a lot.  :-)

But I just re-read Andi's original trouble report, and although he
mentions AT_RANDOM, it's an application reading from /dev/urandom a lot
that's causing the problem, *not* execve.  My memory may have gotten
confused.

I can imagine that one /dev/urandom abuser can monopolize the lock and
cause problems for a shell script.  But yes, for this performance
problem, a segregate-the-abusers solution is quite attractive.

> Atomic-add primitives aren't portable either.  The representation
> isn't guaranteed to be 32-bits, and on some platforms an atomic int is
> only 24-bits wide (the top 8 bits being used for locking purposes).

No longer true; see Documentation/atomic_ops.txt.  But yes, atomic ops
are *ridiculously* expensive on SMP 32-bit SPARC.  Do we care?  Is
anyone using multiprocessor V7 SPARC in anger these days?

>> There are several possible solutions that don't need separate pools
>> (including separate add-back pools, with a shared seeded pool that is
>> never touched by add-back), so I don't think it's necessary to give
>> up yet.
> Hmm, maybe.  I'm a bit worried about the amount of complexity that
> this entails, and the reality is that the urandom pool or pools don't
> provide anything other than cryptographic randomness.

Well, *one* add-back pool would also do, and would be a quite simple
addition to the code.

> the urandom pool or pools don't provide anything other than
> cryptographic randomness.

Well, there's cryptographic and there's cryptographic.  /dev/urandom is
deliberately *ridiculously* overengineered; it's not dependent on the
security of a single primitive.  I'd rather not downgrade it to
depending on the security of AES, or ChaCha, or Keccak, or any other
performance-optimized primitive.  For my own code, sure.  But
/dev/random is sold as "you *really* don't have to worry about it".

> At this point, I wonder if it might not be simpler to restrict the
> current nonblocking pool to kernel users, and for userspace users, the
> first time a process reads from /dev/urandom or calls getrandom(2), we
> create for them a ChaCha20 CRNG, which hangs off of the task
> structure.  This would require about 72 bytes of state per process,
> but normally very few processes are reading from /dev/urandom or
> calling getrandom(2) from userspace.
Interesting idea, although the loss of backtracking protection for
long-running servers is a significant semantic change.

Remember that making antibacktracking work is the entire problem we're
struggling to solve.  If discarding it is on the table, I can solve the
scalability problems extremely easily.  Even without giving it up
completely, just put the add-back inside if (trylock()) and the problem
goes away.

I'm also not sure where you get 72 bytes from.  You need 32 bytes of
key, 8 bytes of nonce, and 4 or 8 bytes of stream position to form the
ChaCha input block.  Then optionally add 64 bytes of output buffer if
you don't want to discard generated bytes after an unaligned read.
It adds up to 44/48, or 108/112 bytes.  Where does 72 come from?

But why bother with even that much?  If you're willing to discard
antibacktracking, just have *one* key for the kernel, reseeded more
often, and a per-thread nonce and stream position.

> (Arguably the right answer is to put arc4random in libc, where it can
> handle forks/clones/pthreads automatically, but it seems pretty clear
> *that* train left a long time ago.)

I'm not quite sure which viatical incident you're alluding to, but yes,
that would be the preferred solution.

> I have a feeling this may be less code and complexity, and it nicely
> handles the case where we have a /dev/urandom abuser who feels that
> they want to call /dev/urandom in a tight loop, even on a 4 socket
> Xeon system.  :-)

I'm not sanguine about its simplicity; it's a whole other layer.  Might
be a good idea anyway.  I have a similar patch set in work, but it's
called /dev/frandom because I wasn't willing to suggest such a semantic
change.

But without interfering with legitimate users of /dev/urandom at all,
I'd be quite willing to say that as soon as you read more than 32 bytes
(current request plus recent history) from /dev/urandom, you get a
private ChaCha20 structure to read from.  (This is just enforcing the
usage rules from the random(4) man page.  Since I wrote that text, I'm
hardly going to object!)

Here's that atomic_t patch I mentioned above.
diff --git a/drivers/char/random.c b/drivers/char/random.c
index e62b30ba..e65357c4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -260,6 +260,7 @@
 #include <linux/irq.h>
 #include <linux/syscalls.h>
 #include <linux/completion.h>
+#include <linux/hash.h>
 
 #include <asm/processor.h>
 #include <asm/uaccess.h>
@@ -423,7 +424,7 @@ struct entropy_store;
 struct entropy_store {
 	/* read-only data: */
 	const struct poolinfo *poolinfo;
-	__u32 *pool;
+	atomic_t *pool;
 	const char *name;
 	struct entropy_store *pull;
 	struct work_struct push_work;
@@ -431,20 +432,20 @@ struct entropy_store {
 	/* read-write data: */
 	unsigned long last_pulled;
 	spinlock_t lock;
-	unsigned short add_ptr;
-	unsigned short input_rotate;
+	atomic_t add_ptr;
 	int entropy_count;
 	int entropy_total;
 	unsigned int initialized:1;
 	unsigned int limit:1;
 	unsigned int last_data_init:1;
+	unsigned char input_rotate;
 	__u8 last_data[EXTRACT_SIZE];
 };
 
 static void push_to_pool(struct work_struct *work);
-static __u32 input_pool_data[INPUT_POOL_WORDS];
-static __u32 blocking_pool_data[OUTPUT_POOL_WORDS];
-static __u32 nonblocking_pool_data[OUTPUT_POOL_WORDS];
+static atomic_t input_pool_data[INPUT_POOL_WORDS];
+static atomic_t blocking_pool_data[OUTPUT_POOL_WORDS];
+static atomic_t nonblocking_pool_data[OUTPUT_POOL_WORDS];
 
 static struct entropy_store input_pool = {
 	.poolinfo = &poolinfo_table[0],
@@ -488,6 +489,10 @@ static __u32 const twist_table[8] = {
  * degree, and then twisted.  We twist by three bits at a time because
  * it's cheap to do so and helps slightly in the expected case where
  * the entropy is concentrated in the low-order bits.
+ *
+ * This function is designed to be safe to call without locking the
+ * entropy_store.  Race conditions might do strange things, but will
+ * leave the pool valid and not lose entropy.
  */
 static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 			    int nbytes)
@@ -496,7 +501,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 	int input_rotate;
 	int wordmask = r->poolinfo->poolwords - 1;
 	const char *bytes = in;
-	__u32 w;
+
+	BUILD_BUG_ON(sizeof(int) != sizeof(__u32));
 
 	tap1 = r->poolinfo->tap1;
 	tap2 = r->poolinfo->tap2;
@@ -505,23 +511,31 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 	tap5 = r->poolinfo->tap5;
 
 	input_rotate = r->input_rotate;
-	i = r->add_ptr;
+	i = (unsigned)atomic_sub_return(nbytes, &r->add_ptr) + nbytes;
 
 	/* mix one byte at a time to simplify size handling and churn faster */
 	while (nbytes--) {
-		w = rol32(*bytes++, input_rotate);
-		i = (i - 1) & wordmask;
+		__u32 v, w = rol32(*bytes++, input_rotate);
 
 		/* XOR in the various taps */
-		w ^= r->pool[i];
-		w ^= r->pool[(i + tap1) & wordmask];
-		w ^= r->pool[(i + tap2) & wordmask];
-		w ^= r->pool[(i + tap3) & wordmask];
-		w ^= r->pool[(i + tap4) & wordmask];
-		w ^= r->pool[(i + tap5) & wordmask];
+		i = (i - 1) & wordmask;
+		w ^= atomic_read(r->pool + ((i + tap1) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap2) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap3) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap4) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap5) & wordmask));
+		w ^= v = atomic_read(r->pool + i);
+		/* Twist the result to mix bits within words */
+		w = (w >> 3) ^ twist_table[w & 7];
 
-		/* Mix the result back in with a twist */
-		r->pool[i] = (w >> 3) ^ twist_table[w & 7];
+		/*
+		 * Now, to put the result back, we don't have an
+		 * atomic_xor operation, but we do have an atomic_add.
+		 * Atomically add the value which would cause the desired
+		 * change.  If there's a race, something strange will happen,
+		 * but we won't lose entropy.
+		 */
+		atomic_add(w - v, r->pool + i);
 
 		/*
 		 * Normally, we add 7 bits of rotation to the pool.
@@ -532,8 +546,13 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 		input_rotate = (input_rotate + (i ? 7 : 14)) & 31;
 	}
 
-	r->input_rotate = input_rotate;
-	r->add_ptr = i;
+	/*
+	 * Access to input_rotate is not thread-safe, because the
+	 * actual value doesn't matter to the security guarantees,
+	 * and the fact that it changes is enough for its purpose.
+	 * So if there's a race condition, it doesn't matter who wins.
+	 */
+	r->input_rotate = (unsigned char)input_rotate;
 }
 
 static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1109,17 +1128,26 @@ retry:
  * This function does the actual extraction for extract_entropy and
  * extract_entropy_user.
  *
+ * For the blocking pools, we lock the pool until we feed back the
+ * extracted entropy, so the next extract_buf will return different
+ * results.
+ *
+ * For the non-blocking pool, concurrent access to the pool is allowed,
+ * and the CPU id is used as a salt to ensure that concurrent readers
+ * will get different results.
+ *
  * Note: we assume that .poolwords is a multiple of 16 words.
  */
 static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 {
 	int i;
+	bool is_blocking = r->limit;
 	union {
 		__u32 w[5];
 		unsigned long l[LONGS(20)];
 	} hash;
 	__u32 workspace[SHA_WORKSPACE_WORDS];
-	unsigned long flags;
+	unsigned long flags = flags;	/* "initialization" shuts up GCC */
 
 	/*
 	 * If we have an architectural hardware random number
@@ -1133,8 +1161,11 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 		hash.l[i] = v;
 	}
 
+	if (is_blocking)
+		spin_lock_irqsave(&r->lock, flags);
+	else
+		hash.w[0] ^= hash_32(get_cpu(), 32);
 
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);
 
@@ -1148,7 +1179,10 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 	 * hash.
 	 */
 	__mix_pool_bytes(r, hash.w, sizeof(hash.w));
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (is_blocking)
+		spin_unlock_irqrestore(&r->lock, flags);
+	else
+		put_cpu();
 
 	memzero_explicit(workspace, sizeof(workspace));
* Re: Updated scalable urandom patchkit
From: Theodore Ts'o @ 2015-10-12 13:54 UTC
To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Mon, Oct 12, 2015 at 03:49:08AM -0400, George Spelvin wrote:
> Suppose CPU 0 has an offset of 0, and CPU 1 has an offset of 12.
> (I'll also use a count-up index even though I know it's actually
> count-down in the code.)
>
> CPU 0 writes 5 words at 0..4, leaving add_ptr = 5.
> Then CPU 1 writes 5 words at 17..21, leaving add_ptr = 10.
> Then CPU 0 writes its next word at 10.
>
> Words 5..9 never got mixed at all.

Hmm, good point.  It *shouldn't* matter if the hash is secure, but the
potentially uneven mixing would be a concern.

> However, I *also* have to use atomic_add to write r->pool[i], which is
> indeed expensive.  PoC diff attached, but I'm not sure if you'll like
> it.

Yeah, that was the part that worried me.  Whether we use an atomic add
or an atomic xor, it's going to be very expensive.  Using an atomic add
means that it will work with arm, but there's an explicit loop with
arm, so on an arm system it effectively turns into a more efficient
spinlock, except you're spinning on each byte; so instead of bouncing
cache lines with the locking granularity of the extract, we would be
bouncing cache lines for every single byte of the mixback operation.

> You're forgetting: keyboard codes don't go into the input pool.  They
> go into the per-CPU fast pool, which does some quite thorough mixing
> and delivers 128 bits with the entropy effectively distributed across
> it.

They used to go through the input pool, and when we added the per-CPU
fast pool, we didn't revisit the question of whether the input_rotate
was still needed.  So yes, at this point it might be worthwhile to
remove the input_rotate part of the mixing scheme.

> Segregating abusers so they don't cause *performance* problems for
> other threads is a reasonable optimization.
>
> But the quoted conversation was talking about a *security* problem
> that could be created by /dev/urandom abusers if some of your
> lock-relaxing ideas were followed.

Segregating abusers solves both problems.  If we do this then we don't
need to drop the locks from the nonblocking pool, which solves the
security problem.  And it also means that each process has its own
CRNG, which is way faster than /dev/urandom even if you don't have the
locking problem, and it provides per-process isolation, which means
that process A can't carry out a backtracking attack on process B.

> > Most processes don't actually use that much randomness, and I'm not
> > that worried about in-kernel users of the nonblocking pool.  Even
> > with the most exec-heavy workload, setting up a new exec image is
> > heavyweight enough that you're really not going to be contending on
> > the lock.  I also have trouble with someone spending $$$$ on a
> > system with 1K cpu cores and wasting all of their CPU power running
> > shell scripts that fork and exec a lot.  :-)
>
> But I just re-read Andi's original trouble report, and although he
> mentions AT_RANDOM, it's an application reading from /dev/urandom a
> lot that's causing the problem, *not* execve.  My memory may have
> gotten confused.
>
> I can imagine that one /dev/urandom abuser can monopolize the lock and
> cause problems for a shell script.

Well, that's my point.
If we restrict the nonblocking pool to kernel users, then a
/dev/urandom abuser won't be able to cause problems for the shell
script.  And even the most inefficient shell script which is firing off
hundreds of processes per second is unlikely to be able to swamp the
spin loop, *and* I would hope that someone who spent $$$ on a
super-expensive 1K cpu system would also have the wit to actually
recode the shell script in a more efficient compiled language, instead
of wasting all of that expensive hardware.

> Remember that making antibacktracking work is the entire problem we're
> struggling to solve.  If discarding it is on the table, I can solve
> the scalability problems extremely easily.  Even without giving it up
> completely, just put the add-back inside if (trylock()) and the
> problem goes away.

The trylock() scheme is a bit dangerous, since we would need to make
sure that at *all* times when other callers take the lock, the contents
of the pool would have changed.  Or else a subsequent call to extract
entropy from that CPU will return the same value.  And even if that
were true today (or we could make it be true), if that ever changed, it
would break the code.  So it makes it pretty fragile.

> I'm also not sure where you get 72 bytes from.  You need 32 bytes of
> key, 8 bytes of nonce, and 4 or 8 bytes of stream position to form the
> ChaCha input block.

The ChaCha20 state block is 64 bytes, and then I was assuming two
4-byte control variables, for the number of cranks and the time since
the last reseed.  We could actually store those control variables in
the IV portion of the ChaCha state structure, which would make it 64
bytes --- or we could generate a new state structure each time and not
bother storing the 16 bytes of constants in the state structure, which
trades a tiny amount of time and stack space for memory.  These are all
implementation details, though....

> But why bother with even that much?  If you're willing to discard
> antibacktracking, just have *one* key for the kernel, reseeded more
> often, and a per-thread nonce and stream position.

Well, we're not completely discarding backtracking protection.  We are
discarding backtracking protection between successive reads from a
single process, and even there we would be reseeding every five minutes
(and this could be tuned), so there is *some* anti-backtracking
protection.

I'm going to make the claim that if a particular process is reading
from /dev/urandom in a tight loop, you probably don't *need*
backtracking protection.  You're probably doing something silly such as
using /dev/urandom to overwrite a disk, or some such.

On the flip side, the time when you might care about anti-backtracking
protection is, say, when you're generating a new session key for a new
connection.  So perhaps one approach is to use some kind of ratelimiter
algorithm so that if you're using /dev/urandom "carefully" (say, no
more than ten times a second), we'll use the non-blocking pool.  But
once a process exceeds that limit, it will switch over to the CRNG, and
then the only performance that abuser process will hurt is its own
(because it would be even faster if they were running something like
arc4random in userspace).

> But without interfering with legitimate users of /dev/urandom at all,
> I'd be quite willing to say that as soon as you read more than 32
> bytes (current request plus recent history) from /dev/urandom, you get
> a private ChaCha20 structure to read from.
Yeah, that's what I was thinking, although I was willing to be a bit
more generous about when we switch them over to using the ChaCha pool.
I think doing this automatically is going to be much better than
creating a new /dev/frandom, since it's going to be hard to get
applications to switch over.  We're still several years away from
switching people over to the new getrandom(2) system call, after all.

- Ted
* Re: Updated scalable urandom patchkit
From: George Spelvin @ 2015-10-12 20:30 UTC
To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

Theodore Ts'o wrote:
> On Mon, Oct 12, 2015 at 03:49:08AM -0400, George Spelvin wrote:
>> Words 5..9 never got mixed at all.
> Hmm, good point.  It *shouldn't* matter if the hash is secure, but the
> potentially uneven mixing would be a concern.

I don't quite follow the "shouldn't matter" part.  The problem is that
insufficient input mixing can cause collisions, which are loss of
entropy.  The output hash (the secure one) can't fix that.  The whole
reason for the 5-tap LFSR is to ensure that each input word affects a
significant number of *other* words before getting stored into by
fresh input.

I just realized a worse problem: suppose CPU 1 has an offset of -1.
It will proceed to *undo* the mixing just done by CPU 0.  I don't just
mean that it'll store its input over top, but it will also undo the
LFSR XOR.  Ack!  (Maybe I'll retract that "bow before the master"
remark.  :-))

> They used to go through the input pool, and when we added the per-CPU
> fast pool, we didn't revisit the question of whether the input_rotate
> was still needed.  So yes, at this point it might be worthwhile to
> remove the input_rotate part of the mixing scheme.

The fast pools actually just extended an issue which the three-pool
design created: when spilling from one pool to the other, the whole
"expensive output, cheap input" optimization makes no sense.  For raw
data and interrupt overhead, yes, obviously.  But when it's 8
sha_transform() calls to generate 10 bytes, the design of the
subsequent input hash needs some rethinking.  It may make sense to have
a separate path for "known high-quality" input that puts more effort
into diffusion.

Idea 1: Given that we do have an entropy estimate with every write to a
pool, what if we did an input hash that did work proportional to the
*entropy* of the input?  (With some lower bound for zero-entropy writes
like plain /dev/random writes and mixback.)

Idea 2: For input mixing not in interrupt context, which includes all
input to the output pools, it would be smaller and simpler code to just
XOR into the pool, and defer mixing until one whole-pool mixing pass
after the XOR position reached the end of the pool.  Concurrent writers
could do a single atomic_add_return to find an add position, and if
that extended past the end of the buffer, they'd obtain the lock, stir
the pool, decrement the add position (which would permit other
concurrent writers), and then do their XORs.  There's still the risk of
races between delayed non-locking inputters and stirrers, but maybe if
non-locking inputters were limited to non-critical mixbacks, that would
be manageable.

>> However, I *also* have to use atomic_add to write r->pool[i], which
>> is indeed expensive.  PoC diff attached, but I'm not sure if you'll
>> like it.
> Yeah, that was the part that worried me.  Whether we use an atomic add
> or an atomic xor, it's going to be very expensive.
> Using an atomic add means that it will work with arm, but there's an
> explicit loop with arm, so on an arm system it effectively turns into
> a more efficient spinlock, except you're spinning on each byte; so
> instead of bouncing cache lines with the locking granularity of the
> extract, we would be bouncing cache lines for every single byte of the
> mixback operation.

Yes.  The current ticket spinlocks are little more than a single
atomic_add_return, and the danger of putting multiple locks in the same
cache line is well known.

> Segregating abusers solves both problems.  If we do this then we don't
> need to drop the locks from the nonblocking pool, which solves the
> security problem.

Er, sort of.  I still think my points were valid, but they're about a
particular optimization suggestion you had.  By avoiding the need for
the optimization, the entire issue is mooted.

I still say a separate firewall is a bad way to *solve* security
problems, but it can avoid performance pressure to *create* them in the
first place.

>> I can imagine that one /dev/urandom abuser can monopolize the lock
>> and cause problems for a shell script.
> Well, that's my point.  If we restrict the nonblocking pool to kernel
> users, then a /dev/urandom abuser won't be able to cause problems for
> the shell script.

And I quite agree with that.  Sorry, sometimes a discussion goes in
multiple directions and it's hard to keep track of the point of any
particular sentence.  I get fixated on one point and interpret comments
about something else as if they pertained to the issue I'm thinking
about.

> The trylock() scheme is a bit dangerous, since we would need to make
> sure that at *all* times when other callers take the lock, the
> contents of the pool would have changed.  Or else a subsequent call to
> extract entropy from that CPU will return the same value.  And even if
> that were true today (or we could make it be true), if that ever
> changed, it would break the code.  So it makes it pretty fragile.

Sorry, yes, I'm perfectly aware of this; I just glossed over it.  Each
extraction has to have a unique nonce; that's non-negotiable, but not
hard to achieve.  We'd need to add a per-cpu counter to go along with
the CPU id to make the extraction nonce.  Doing this would be valuable
*anyway*, because it would let us reduce the mixback to once per read()
rather than once per 10 bytes.

(Also, what would you say to reducing arch_get_random_long() to 16
bytes, and reserving hash.w[5] for the nonce?  That would ensure that
even a malicious arch_get_random_long() couldn't force a collision.)

> The ChaCha20 state block is 64 bytes, and then I was assuming two
> 4-byte control variables, for the number of cranks and the time since
> the last reseed.  We could actually store those control variables in
> the IV portion of the ChaCha state structure, which would make it 64
> bytes --- or we could generate a new state structure each time and not
> bother storing the 16 bytes of constants in the state structure, which
> trades a tiny amount of time and stack space for memory.  These are
> all implementation details, though....

You have to copy the state *anyway*, because you don't want it
overwritten by the ChaCha output, so there's really no point storing
the constants.  (Also, ChaCha has a simpler input block structure than
Salsa20; the constants are all adjacent.)  And ChaCha includes a stream
position counter itself, so you don't need a separate one.
(Note: one problem with ChaCha specifically is that it needs 16x32 bits
of registers, and Arm32 doesn't quite have enough.  We may want to
provide an arch CPRNG hook so people can plug in other algorithms with
good platform support, like x86 AES instructions.)

>> But why bother with even that much?  If you're willing to discard
>> antibacktracking, just have *one* key for the kernel, reseeded more
>> often, and a per-thread nonce and stream position.
> Well, we're not completely discarding backtracking protection.  We are
> discarding backtracking protection between successive reads from a
> single process, and even there we would be reseeding every five
> minutes (and this could be tuned), so there is *some*
> anti-backtracking protection.

Yeah, I realized this a few hours after sending that e-mail.
Anti-backtracking between processes is more important than within.

> I'm going to make the claim that if a particular process is reading
> from /dev/urandom in a tight loop, you probably don't *need*
> backtracking protection.  You're probably doing something silly such
> as using /dev/urandom to overwrite a disk, or some such.

I fully agree.

> On the flip side, the time when you might care about anti-backtracking
> protection is, say, when you're generating a new session key for a new
> connection.  So perhaps one approach is to use some kind of
> ratelimiter algorithm so that if you're using /dev/urandom "carefully"
> (say, no more than ten times a second), we'll use the non-blocking
> pool.  But once a process exceeds that limit, it will switch over to
> the CRNG, and then the only performance that abuser process will hurt
> is its own (because it would be even faster if they were running
> something like arc4random in userspace).

I consider the size of the read at least as much as the rate.

As mentioned in the random(4) man page, I don't think more than 256
bits of computational security even exists.  Bitcoin mining is (per
http://bitcoincharts.com/) doing 440051 Thash/s, which is 80 bits per
month (about 2^58.6 hashes per second times about 2^21.3 seconds per
month is roughly 2^80).  We can extrapolate computational feasibility
beyond what's currently been done; I think 128-bit claims are
defensible.  But once we get beyond the square of anything ever
attempted (170 bits or so), security claims really don't know what
they're talking about.  256 bits is the *cube*.  That's just crazy
talk.

This in turn means that someone asking for more than 256 bits of seed
material is doing something stupid.  (In the more diplomatic language
of the man page, it "should be taken as a sign that its cryptography is
_not_ skillfully implemented."  I didn't want to totally condemn quick
one-time hacks.)
I worry this might attract the same sort of shitstorm as the SHA-3
proposal to tweak Keccak's preimage resistance (which, for the benefit
of the peanut gallery, was solid cryptographic engineering that
belatedly fixed a stupidity in the original SHA-3 contest rules), but
if you're up for it, I'm not going to object.

I'm not too hung up on the thresholds, as long as we catch bulk readers
on the first read, rather than wait until *after* they've drained
/dev/urandom.  I'd prefer that a single 64-byte read request move to
mitigation mode, but I could be talked into 128.  (Actually, it would
probably make more sense to have the threshold be a multiple of
EXTRACT_SIZE.)

History is probably best kept in a leaky bucket of some sort: keep a
current level and the jiffies of the last update, and on each read
subtract the allowed maximum rate times the elapsed time from the
level.  (Or would an exponential average be better?  That would make
the maximum read rate more strongly dependent on the evenness of the
spacing.)  Then add that (scaled somehow?) to the current read size and
compare with the threshold.

The same variables can be used (with different parameters) to decide if
we want to get out of mitigation mode.  The one thing to watch out for
is that "cat </dev/urandom >/dev/sdX" may have some huge pauses once
the buffer cache fills.  We don't want to forgive after too small a
fixed interval.

Finally, we have the issue of where to attach this rate-limiting
structure and crypto context.  My idea was to use the struct file.  But
now that we have getrandom(2), it's harder.  mm, task_struct,
signal_struct, what?

(Post-finally, do we want this feature to be configurable under
CONFIG_EMBEDDED?  I know keeping the /dev/random code size small is a
specific design goal, and abuse mitigation is optional.)

^ permalink raw reply	[flat|nested] 30+ messages in thread
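A minimal kernel-style sketch of the leaky bucket described in the
message above (struct urandom_bucket, both constants, and the helper
name are all hypothetical placeholders; where it would hang is exactly
the open question):

#include <linux/jiffies.h>
#include <linux/types.h>

struct urandom_bucket {
	unsigned long level;		/* bytes of recent read history */
	unsigned long last_update;	/* jiffies at the last read */
};

#define URB_LEAK_BYTES_PER_SEC	64	/* forgiven steady-state rate */
#define URB_THRESHOLD		256	/* mitigation trip point, bytes */

static bool urandom_read_is_abusive(struct urandom_bucket *b, size_t nbytes)
{
	unsigned long now = jiffies;
	/* Leak out the allowed rate for the time since the last read. */
	unsigned long leaked = (now - b->last_update) / HZ
				* URB_LEAK_BYTES_PER_SEC;

	b->level -= min(leaked, b->level);
	b->last_update = now;
	b->level += nbytes;

	return b->level > URB_THRESHOLD;
}

Getting out of mitigation mode could reuse the same two variables with
a much slower leak and a long required quiet period, which is what
guards against forgiving the paused bulk writer too quickly.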
* Re: Updated scalable urandom patchkit
  2015-10-12 20:30 ` George Spelvin
@ 2015-10-12 20:34 ` George Spelvin
  2015-10-13 2:46 ` Theodore Ts'o
  1 sibling, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-12 20:34 UTC (permalink / raw)
To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

(BTW, my previous e-mail went out early due to a mis-click.  It was
almost done, but please excuse any unfinished sentences.)

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Updated scalable urandom patchkit
  2015-10-12 20:30 ` George Spelvin
  2015-10-12 20:34 ` George Spelvin
@ 2015-10-13 2:46 ` Theodore Ts'o
  2015-10-13 3:50 ` Raymond Jennings
  ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Theodore Ts'o @ 2015-10-13 2:46 UTC (permalink / raw)
To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Mon, Oct 12, 2015 at 04:30:59PM -0400, George Spelvin wrote:
> > Segregating abusers solves both problems.  If we do this then we don't
> > need to drop the locks from the nonblocking pool, which solves the
> > security problem.
>
> Er, sort of.  I still think my points were valid, but they're
> about a particular optimization suggestion you had.  By avoiding
> the need for the optimization, the entire issue is mooted.

Sure, I'm not in love with anyone's particular optimization, whether
it's mine, yours, or Andi's.  I'm just trying to solve the scalability
problem while also trying to keep the code maintainable and easy to
understand (and over the years we've actually made things worse, to the
extent that having a single mixing function for the input and output
pools is starting to be more of a problem than a feature, since we're
coding in a bunch of exceptions for when it's the output pool, etc.).

So if we can solve a problem by routing around it, that's fine in my
book.

> You have to copy the state *anyway* because you don't want it
> overwritten by the ChaCha output, so there's really no point storing
> the constants.  (Also, ChaCha has a simpler input block structure than
> Salsa20; the constants are all adjacent.)

We're really getting into low-level implementations here, and I think
it's best to worry about these sorts of things when we have a patch to
review.....

> (Note: one problem with ChaCha specifically is that it needs 16x32
> bits of registers, and Arm32 doesn't quite have enough.  We may want
> to provide an arch CPRNG hook so people can plug in other algorithms
> with good platform support, like x86 AES instructions.)

So while a ChaCha20-based CRNG should be faster than a SHA-1 based
CRNG, and I consider this a good thing, for me speed is **not** more
important than keeping the underlying code maintainable and simple.
This is one of the reasons why I looked at, and then discarded, the
idea of using x86-accelerated AES as the basis for a CRNG.  Setting up
AES so that it can be used easily with or without hardware acceleration
looks very complicated to do in a cross-architectural way, and I don't
want to drag in all of the crypto layer for /dev/random.

> The same variables can be used (with different parameters) to decide
> if we want to get out of mitigation mode.  The one thing to watch out
> for is that "cat </dev/urandom >/dev/sdX" may have some huge pauses
> once the buffer cache fills.  We don't want to forgive after too small
> a fixed interval.

At least initially, once we go into mitigation mode for a particular
process, it's probably safer to simply not exit it.

> Finally, we have the issue of where to attach this rate-limiting
> structure and crypto context.  My idea was to use the struct file.
> But now that we have getrandom(2), it's harder.  mm, task_struct,
> signal_struct, what?

I'm personally more inclined to keep it with the task struct, so that
different threads will use different crypto contexts, just from a
simplicity point of view, since we won't need to worry about locking.

Since many processes don't use /dev/urandom or getrandom(2) at all, the
first time they do, we'd allocate a structure and hang it off the
task_struct.
When the process exits, we would explicitly memzero it and then release
the memory.

> (Post-finally, do we want this feature to be configurable under
> CONFIG_EMBEDDED?  I know keeping the /dev/random code size small is
> a specific design goal, and abuse mitigation is optional.)

Once we code it up and see how many bytes this takes, we can have this
discussion.  I'll note that ChaCha20 is much more compact than SHA1:

   text	   data	    bss	    dec	    hex	filename
   4230	      0	      0	   4230	   1086	/build/ext4-64/lib/sha1.o
   1152	    304	      0	   1456	    5b0	/build/ext4-64/crypto/chacha20_generic.o

... and I've thought about this as being the first step towards
potentially replacing SHA1 with something ChaCha20 based, in light of
the SHAppening attack.  Unfortunately, BLAKE2s is similar to ChaCha
only from a design perspective, not an implementation perspective.
Still, I suspect that, just looking at the crypto primitives, even if
we need to include two independent copies of the ChaCha20 core crypto
and the BLAKE2s core crypto, it still should be about half the size of
the SHA-1 crypto primitive.

And from the non-plumbing side of things, Andi's patchset increases the
size of /dev/random by a bit over 6%, or 974 bytes from a starting base
of 15719 bytes.  It ought to be possible to implement a ChaCha20 based
CRNG (ignoring the crypto primitives) in less than 974 bytes of x86_64
assembly.  :-)

					- Ted

^ permalink raw reply	[flat|nested] 30+ messages in thread
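A kernel-style sketch of the lazy per-task allocation Ted describes
(struct urandom_ctx and the task_struct->urandom_ctx field are both
hypothetical; no such field exists):

#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/random.h>

struct urandom_ctx {
	__u32 chacha_state[16];		/* ChaCha20 input block */
	unsigned int cranks;		/* outputs since last reseed */
	unsigned long last_reseed;	/* jiffies */
};

static struct urandom_ctx *get_urandom_ctx(void)
{
	struct urandom_ctx *ctx = current->urandom_ctx;

	if (!ctx) {
		ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
		if (!ctx)
			return NULL;
		/* Seed the per-task state from the nonblocking pool. */
		get_random_bytes(ctx->chacha_state,
				 sizeof(ctx->chacha_state));
		current->urandom_ctx = ctx;
	}
	return ctx;
}

/* Called from the task teardown path: */
static void free_urandom_ctx(struct task_struct *tsk)
{
	if (tsk->urandom_ctx) {
		memzero_explicit(tsk->urandom_ctx,
				 sizeof(*tsk->urandom_ctx));
		kfree(tsk->urandom_ctx);
		tsk->urandom_ctx = NULL;
	}
}

Because the context is only ever touched by the owning task, no locking
is needed, which is the simplicity argument made above.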
* Re: Updated scalable urandom patchkit
  2015-10-13 2:46 ` Theodore Ts'o
@ 2015-10-13 3:50 ` Raymond Jennings
  2015-10-13 7:50 ` George Spelvin
  0 siblings, 1 reply; 30+ messages in thread
From: Raymond Jennings @ 2015-10-13 3:50 UTC (permalink / raw)
To: Theodore Ts'o, George Spelvin, ahferroin7, andi, jepler,
	linux-kernel, linux

This might be stupid, but could something asynchronous work?  Perhaps
have the entropy generators dump their entropy into a central pool via
a cycbuf, and have a background kthread manage the per-cpu or
per-process entropy pools?

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Updated scalable urandom patchkit
  2015-10-13 3:50 ` Raymond Jennings
@ 2015-10-13 7:50 ` George Spelvin
  0 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-13 7:50 UTC (permalink / raw)
To: ahferroin7, andi, jepler, linux-kernel, linux, linux, shentino, tytso

> This might be stupid, but could something asynchronous work?  Perhaps
> have the entropy generators dump their entropy into a central pool via
> a cycbuf, and have a background kthread manage the per-cpu or
> per-process entropy pools?

No, for two reasons:

(Minor): One of the functions of the mixback is to ensure that the next
reader hashes a *different* pool state.  If the mixback is delayed, the
next reader might hash the *same* pool state and return the same
numbers.  (There are easy workarounds for this.)

(Major): What do you do when the circular buffer is full?  If it's not
safe to skip the mixback, then we have to block and get into the same
lock-contention problem.

But... this suggestion of having a separate thread do the mixback gives
me an idea.  In fact, I think it's a good idea.  Ted (or anyone else
listening), what do you think of the following?  I think it would solve
Andi's problem and be a smaller code change than the abuse mitigation
mode.  (Which is still a good idea, but is off the critical path.)

- Associated with a pool is an atomic "mixback needed" flag.
- Also an optional mixback buffer.  (Optional because the mixback could
  just recompute it.)
- Dropping the lock requires the following operations:
  - Test the mixback needed flag.  If set:
    - Copy out and wipe the buffer,
    - smp_wmb()
    - Clear the flag,
    - smp_wmb()
    - Do the mixback, and
    - Re-check again before dropping the lock.  (This check before
      dropping the lock is technically an optional optimization.)
  - Drop the lock.
  - smp_mb()  (Since it's write-to-read, we can't use _rmb or _wmb.)
  - Test the mixback pending flag again.
  - If it's set, trylock().  If that succeeds, go do the mixback as
    above.  If it fails, return.

Each reader uses already-discussed nonce techniques to safely do
concurrent reads from the same pool.  Then, at the end:

- (Optional) trylock() and, if it succeeds, do mixback directly.
- Copy our mixback data to the buffer (race conditions be damned)
- smp_wmb()
- Set the mixback needed flag
- smp_mb()  (Since it's write-to-read; or use smp_store_mb())
- trylock()
  - If that fails, return.
  - If that succeeds (and the flag is still set), do the mixback.

This is based on the fact that if there are multiple concurrent reads,
we only need one mixback (thus, only one buffer/flag), but the "last
one out the door" has to do it.  Also, we don't care if we miscount and
end up doing it twice.

Each reader sets the flag and *then* does a trylock.  If the trylock
fails, it's guaranteed that the lock holder will see the flag and take
care of the mixback for us.  The writers drop the lock and *then* test
the flag.

The result is that readers *never* do a blocking acquire of the pool
lock, which should solve all the contention problems.  Andi's stupid
application will still be stupid, but won't fall off a locking cliff.

(We could also use w[5] as the "mixback needed" flag and just force it
to 1 on the off chance it's zero, with negligible loss of entropy and
zero security loss.)

The one thing I worry about is livelock keeping one thread in the
mixback code indefinitely, which can be mitigated by dropping the lock
and waiting before re-testing and re-acquiring if we loop too often.

^ permalink raw reply	[flat|nested] 30+ messages in thread
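A compilable sketch of the flag/trylock handoff described in the
message above (the mixback_needed and mixback_buf fields are
hypothetical additions to struct entropy_store; __mix_pool_bytes() is
the driver's existing unlocked mixer):

static void do_pending_mixback(struct entropy_store *r)
{
	/* Called with r->lock held. */
	if (READ_ONCE(r->mixback_needed)) {
		__u32 buf[5];

		/* Copy out and wipe the buffer, then clear the flag. */
		memcpy(buf, r->mixback_buf, sizeof(buf));
		memset(r->mixback_buf, 0, sizeof(buf));
		smp_wmb();
		WRITE_ONCE(r->mixback_needed, 0);
		smp_wmb();
		__mix_pool_bytes(r, buf, sizeof(buf));
		memzero_explicit(buf, sizeof(buf));
	}
}

static void mixback_or_delegate(struct entropy_store *r, const __u32 *hash)
{
	memcpy(r->mixback_buf, hash, 5 * sizeof(__u32)); /* races be damned */
	smp_wmb();			/* data before flag */
	WRITE_ONCE(r->mixback_needed, 1);
	smp_mb();			/* flag write before trylock */

	while (spin_trylock(&r->lock)) {
		do_pending_mixback(r);
		spin_unlock(&r->lock);
		smp_mb();	/* write-to-read: unlock, then re-test flag */
		if (!READ_ONCE(r->mixback_needed))
			return;
	}
	/*
	 * Trylock failed with the flag set: the current lock holder is
	 * guaranteed to test the flag after unlocking, and will do the
	 * mixback for us.
	 */
}

The while loop is also where the livelock worry shows up: a loop
counter with a brief backoff would bound how long any one thread spends
settling other readers' mixback.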
* Re: Updated scalable urandom patchkit
  2015-10-13 2:46 ` Theodore Ts'o
  2015-10-13 3:50 ` Raymond Jennings
@ 2015-10-13 6:24 ` George Spelvin
  2015-10-13 16:20 ` Andi Kleen
  2 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-13 6:24 UTC (permalink / raw)
To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

> We're really getting into low-level implementations here, and I think
> it's best to worry about these sorts of things when we have a patch to
> review.....

> it's probably safer to simply not exit it.

> I'm personally more inclined to keep it with the task struct, so that
> different threads will use different crypto contexts, just from a
> simplicity point of view, since we won't need to worry about locking.

> Once we code it up and see how many bytes this takes, we can have
> this discussion.

I'm fine with all of these; thank you.

> This is one of the reasons why I looked at, and then discarded, the
> idea of using x86-accelerated AES as the basis for a CRNG.  Setting up
> AES so that it can be used easily with or without hardware
> acceleration looks very complicated to do in a cross-architectural
> way, and

I haven't looked as deeply, but it didn't look too hard.  Is it
possible to briefly explain the problem?

I assumed you'd have an arch-specific capabilities probe function that
would set up an operations structure.  That would provide the various
buffer sizes required, plus the setup (kernel_fpu_begin() and key
scheduling), CPRNG core, and teardown (kernel_fpu_end()) functions.
Is there some huge gotcha I'm overlooking?

> I don't want to drag in all of the crypto layer for /dev/random.

Oh, gods, no; the crypto layer drives me nuts.

Truthfully, the main hair of the crypto layer is all the modular cipher
modes on top of block ciphers, and the scatterlist stuff to handle
arbitrarily fragmented input and output buffers for the benefit of the
network layer, but the code is horrible reading.  Every time I look, I
find something I want to fix (the CTS mode implementation uses 6 blocks
worth of stack buffer; I have a patch to reduce that to 3) but then I
get lost in the morass of structures and wrappers trying to make the
code fit in with the rest.  There's a struct crypto_tfm, crypto_alg,
crypto_instance, cipher_desc, crypto_type, crypto_template,
crypto_spawn...  I've been trying to read it and I still have no idea
what half of them are for.

And I *still* haven't figured out how to get the self-test code to tell
me that test X was performed and passed.  I ended up writing my own
test, which seems wrong.

> ... and I've thought about this as being the first step towards
> potentially replacing SHA1 with something ChaCha20 based, in light of
> the SHAppening attack.  Unfortunately, BLAKE2s is similar to ChaCha
> only from a design perspective, not an implementation perspective.
> Still, I suspect that, just looking at the crypto primitives, even if
> we need to include two independent copies of the ChaCha20 core crypto
> and the BLAKE2s core crypto, it still should be about half the size of
> the SHA-1 crypto primitive.

Well, the SHAppening doesn't really change much except a slight
schedule tweak, but yeah, it's one of those things that would be nice
to get around to.

I'm not sure what you expect to do with ChaCha, though; it's really an
expansion function, not compression, and not easily adapted to be one.
BLAKE2 is a bit ugly.
I'm generally not liking MD5/SHA-like designs that dribble the message
and some round constants in; I'm much preferring the large-state "add a
bunch of input all at once" designs like Keccak, SipHash and DJB's
SHA-3 entry, CubeHash.  Have you seen it?  It's quite similar to
Keccak, just using a 32x32 = 1024 bit state rather than Keccak's
25*64 = 1600.

The reason it got dumped is that, like Keccak, to get n bits of
preimage resistance, it requires 2n "capacity" bits left unused each
round.  When you ask for 512 bits of preimage resistance, you can only
import a few bits of message each block.  Keccak has the same problem,
but it has a big enough block that it can handle it.

In Dan's submission, you'll see his usual "this is a stupid request
which I'm only paying lip service to", and his "SHA-3-512-formal"
proposal was dog-slow.  To quote:

    The "SHA-3-512-formal" proposal is aimed at users who are
    (1) concerned with attacks using 2^384 operations, (2) unconcerned
    with quantum attacks that cost far less, and (3) unaware that
    attackers able to carry out 2^256 operations would wreak havoc on
    the entire SHA-3 landscape, forcing SHA-3 to be replaced no matter
    which function is selected as SHA-3.  The "SHA-3-512-normal"
    proposal is aimed at sensible users.  For all real-world
    cryptographic applications, the "formal" versions here can be
    ignored, and the tweak amounts to a proposal of CubeHash16/32 as
    SHA-3.

If NIST had proposed changing the preimage resistance rules *before*
the final decision, things would have gone a lot differently.

> And from the non-plumbing side of things, Andi's patchset increases
> the size of /dev/random by a bit over 6%, or 974 bytes from a starting
> base of 15719 bytes.  It ought to be possible to implement a ChaCha20
> based CRNG (ignoring the crypto primitives) in less than 974 bytes of
> x86_64 assembly.  :-)

Yes, not hard.  Are you inspired, or would you like me to put together
a patch?  And should I do moderately evil space-saving things like
store the pointer to the crypto state and the leaky bucket in the same
task slot, distinguished by the pointer lsbit?

^ permalink raw reply	[flat|nested] 30+ messages in thread
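For concreteness, the sort of arch-specific probe-and-operations
structure suggested earlier in this message might look like the
following (entirely hypothetical; no such hook exists in the tree):

#include <linux/types.h>

struct arch_cprng_ops {
	const char *name;
	unsigned int state_size;	/* bytes of per-context state */
	/* e.g. kernel_fpu_begin() plus key scheduling: */
	void (*setup)(void *state, const u8 *seed);
	void (*generate)(void *state, u8 *out, unsigned int nbytes);
	/* e.g. kernel_fpu_end(): */
	void (*teardown)(void *state);
};

/* Each arch returns NULL, or its best implementation (AES-NI, ...): */
const struct arch_cprng_ops *arch_cprng_probe(void);

A generic ChaCha20 implementation would be the fallback whenever the
probe returns NULL, so the core code never has to care which one it
got.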
* Re: Updated scalable urandom patchkit
  2015-10-13 2:46 ` Theodore Ts'o
  2015-10-13 3:50 ` Raymond Jennings
  2015-10-13 6:24 ` George Spelvin
@ 2015-10-13 16:20 ` Andi Kleen
  2015-10-13 21:10 ` George Spelvin
  2 siblings, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2015-10-13 16:20 UTC (permalink / raw)
To: Theodore Ts'o, George Spelvin, ahferroin7, andi, jepler,
	linux-kernel, linux

I tested the two proposed patches from earlier this thread on a 4S
system.  This was just with my (worst-case) microbenchmark.
Unfortunately both patches scale much worse than the duplicated pools,
and can be even worse than the baseline (not sure why).

The baseline peaks at slightly above 200K ops/s with less than 20 CPUs,
and then it gets slower, settling around 100K ops/s with more CPUs.

Ted's patch peaks at 350K with four CPUs, but then quickly degrades to
50K ops/s at 20+ CPUs.  At 144 CPUs it is slightly faster again at
~80K.

Spelvin's patch peaks at only 140K at 2 CPUs (so it's slower than the
baseline), stays around 120K up to 20, then degrades quickly to 50K and
then slowly improves again to ~80K.

The duplicated pool patch is ~200K up to 20 CPUs, 400K up to 40, 600K
at slightly below 60 CPUs, and then very slowly degrades to 520K at
144.

-Andi

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Updated scalable urandom patchkit
  2015-10-13 16:20 ` Andi Kleen
@ 2015-10-13 21:10 ` George Spelvin
  2015-10-14 2:15 ` Andi Kleen
  2015-10-21 8:27 ` Updated scalable urandom patchkit George Spelvin
  0 siblings, 2 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-13 21:10 UTC (permalink / raw)
To: ahferroin7, andi, jepler, linux-kernel, linux, linux, tytso

> Ted's patch peaks at 350K with four CPUs, but then quickly degrades to
> 50K ops/s at 20+ CPUs.  At 144 CPUs it is slightly faster again at ~80K.

Good to know, thanks!  With its race conditions, it's basically a "best
case" for that particular design, which tells me that more significant
changes are required.

Off hand, do you know how large a read each operation is?  I want to
reduce mixback from once per 10 bytes to once per read, and the size
ratio will give me some idea of how large an improvement to expect.

> Spelvin's patch peaks at only 140K at 2 CPUs (so it's slower than the
> baseline), stays around 120K up to 20, then degrades quickly to 50K
> and then slowly improves again to ~80K.

Sorry to make you go to the trouble; I knew from discussions with Ted
that it wasn't going to work.  It was mostly just in the form of a
patch for the sake of a more concrete discussion.

I'll have a patch that I hope will do some good for testing in a couple
of hours.

> The duplicated pool patch is ~200K up to 20 CPUs, 400K up to 40, 600K
> at slightly below 60 CPUs, and then very slowly degrades to 520K at
> 144.

Shitty performance is practically a design goal of /dev/urandom.  You
are NOT supposed to hit it more than once per minute per thread.  But
since we have a real-world problem with it, Ted's "abuse mitigation
mode" idea (where the kernel does what the app should do: seed a
private CPRNG and use that) will provide good security at extremely
high access rates.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: Updated scalable urandom patchkit
  2015-10-13 21:10 ` George Spelvin
@ 2015-10-14 2:15 ` Andi Kleen
  2015-10-16 5:28 ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
  2015-10-21 8:27 ` Updated scalable urandom patchkit George Spelvin
  1 sibling, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2015-10-14 2:15 UTC (permalink / raw)
To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux, tytso

> Off hand, do you know how large a read each operation is?  I want to
> reduce mixback from once per 10 bytes to once per read, and the size
> ratio will give me some idea of how large an improvement to expect.

My test reads 64 bytes using the syscall.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* [RFC PATCH 0/4] Alternate sclable urandom patchset
  2015-10-14 2:15 ` Andi Kleen
@ 2015-10-16 5:28 ` George Spelvin
  2015-10-16 5:29 ` [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool George Spelvin
  ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16 5:28 UTC (permalink / raw)
To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Sent as separate patches to make things easier to review; the action is
all in #4.

The first two are minor cleanups I've had in my tree for a while and
felt like working on top of rather than backing out.

The third I'm not happy with, but it will serve as a proof-of-concept
stub that I'll rewrite if the rest meets with approval.  Since writing
it, I've remembered that we already have a per-CPU "nonce pool" in the
form of the get_random_int_hash array, and I should take advantage of
that.

Another thing I don't like is that the mixback is only 160 bits, 80 of
which are private.  I'd rather have twice that, if the read is large
enough.  But that requires more code refactoring.  But it does
illustrate the basic idea of using a nonce to avoid the need to do
synchronous mixback.  We can do many calls to extract_buf, from the
same or different processors, without doing mixback.

The fourth is where the fun is.  The names of things are certainly up
for debate; creating a subclass of struct entropy_store called
"entropy_store_plus" is not my proudest moment.  I also considered just
making the extra fields global variables.  If someone wants to suggest
an arrangement that's good for cache line sharing, that would be
appreciated.  But I think the nonblocking_mixback() function came out
rather nice.

This uses two locks, rather than the one in the idea I most recently
posted, because add_interrupt_randomness() needs to take the lock, and
I didn't want to force it to do an unbounded amount of mixback work.
Using two locks avoids the need for interrupt handlers to do any
mixback, and for readers to loop and thus possibly livelock.

George Spelvin (4):
  random: Reduce stack usage in _xfer_secondary_pool
  random: Remove two unused arguments from extract_entropy()
  random: Only do mixback once per read
  random: Make non-blocking mixback non-blocking

 drivers/char/random.c | 218 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 184 insertions(+), 34 deletions(-)

-- 
2.6.1

^ permalink raw reply	[flat|nested] 30+ messages in thread
* [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool
  2015-10-16 5:28 ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
@ 2015-10-16 5:29 ` George Spelvin
  2015-10-16 5:30 ` [RFC PATCH 2/4] random: Remove two unused arguments from extract_entropy() George Spelvin
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16 5:29 UTC (permalink / raw)
To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Rather than asking extract_entropy to fill a large buffer, transfer
bytes in EXTRACT_SIZE chunks using multiple calls to extract_buf.
(Which is what extract_entropy does internally.)

Signed-off-by: George Spelvin <linux@horizon.com>
---
 drivers/char/random.c | 48 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d85..c8ad49ba 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -964,8 +964,9 @@ EXPORT_SYMBOL_GPL(add_disk_randomness);
  *
  *********************************************************************/
 
-static ssize_t extract_entropy(struct entropy_store *r, void *buf,
-			       size_t nbytes, int min, int rsvd);
+static size_t account(struct entropy_store *r, size_t nbytes, int min,
+		      int reserved);
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE]);
 
 /*
  * This utility inline function is responsible for transferring entropy
@@ -994,23 +995,46 @@ static void xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
 
 static void _xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
 {
-	__u32	tmp[OUTPUT_POOL_WORDS];
-
-	/* For /dev/random's pool, always leave two wakeups' worth */
-	int rsvd_bytes = r->limit ? 0 : random_read_wakeup_bits / 4;
+	u8 tmp[EXTRACT_SIZE];
 	int bytes = nbytes;
 
 	/* pull at least as much as a wakeup */
 	bytes = max_t(int, bytes, random_read_wakeup_bits / 8);
 	/* but never more than the buffer size */
-	bytes = min_t(int, bytes, sizeof(tmp));
+	bytes = min_t(int, bytes, OUTPUT_POOL_WORDS*sizeof(u32));
 
+	/*
+	 * FIXME: Move this to after account(), so it shows the true amount
+	 * transferred?
+	 */
 	trace_xfer_secondary_pool(r->name, bytes * 8, nbytes * 8,
 				  ENTROPY_BITS(r), ENTROPY_BITS(r->pull));
-	bytes = extract_entropy(r->pull, tmp, bytes,
-				random_read_wakeup_bits / 8, rsvd_bytes);
-	mix_pool_bytes(r, tmp, bytes);
-	credit_entropy_bits(r, bytes*8);
+
+	/*
+	 * This is the only place we call account() with non-zero
+	 * "min" and "reserved" values.  The minimum is used to
+	 * enforce catastrophic reseeding: if we can't get at least
+	 * random_read_wakeup_bits of entropy, don't bother reseeding
+	 * at all, but wait until a useful amount is available.
+	 *
+	 * The "reserved" is used to prevent reads from /dev/urandom
+	 * from emptying the input pool; leave two wakeups' worth
+	 * for /dev/random.
+	 */
+	bytes = account(r->pull, bytes, random_read_wakeup_bits / 8,
+			r->limit ? 0 : random_read_wakeup_bits / 4);
+
+	/* Now to the actual transfer, in EXTRACT_SIZE units */
+	while (bytes) {
+		int i = min_t(int, bytes, EXTRACT_SIZE);
+
+		extract_buf(r->pull, tmp);
+		mix_pool_bytes(r, tmp, i);
+		credit_entropy_bits(r, i*8);
+		bytes -= i;
+	}
+
+	memzero_explicit(tmp, sizeof(tmp));
 }
 
 /*
@@ -1087,7 +1111,7 @@ retry:
  *
  * Note: we assume that .poolwords is a multiple of 16 words.
  */
-static void extract_buf(struct entropy_store *r, __u8 *out)
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 {
 	int i;
 	union {
-- 
2.6.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* [RFC PATCH 2/4] random: Remove two unused arguments from extract_entropy()
  2015-10-16 5:28 ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
  2015-10-16 5:29 ` [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool George Spelvin
@ 2015-10-16 5:30 ` George Spelvin
  2015-10-16 5:33 ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin
  2015-10-16 5:34 ` [RFC PATCH 4/4] random: Make non-blocking mixback non-blocking George Spelvin
  3 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16 5:30 UTC (permalink / raw)
To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Now that _xfer_secondary_pool doesn't need the magic functionality, the
"min" and "reserved" arguments are always zero, so stop passing them.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 drivers/char/random.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index c8ad49ba..e62b30ba 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1175,7 +1175,7 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
  * pool after each pull to avoid starving other readers.
  */
 static ssize_t extract_entropy(struct entropy_store *r, void *buf,
-			       size_t nbytes, int min, int reserved)
+			       size_t nbytes)
 {
 	ssize_t ret = 0, i;
 	__u8 tmp[EXTRACT_SIZE];
@@ -1199,7 +1199,7 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 
 	trace_extract_entropy(r->name, nbytes, ENTROPY_BITS(r), _RET_IP_);
 	xfer_secondary_pool(r, nbytes);
-	nbytes = account(r, nbytes, min, reserved);
+	nbytes = account(r, nbytes, 0, 0);
 
 	while (nbytes) {
 		extract_buf(r, tmp);
@@ -1284,7 +1284,7 @@ void get_random_bytes(void *buf, int nbytes)
 		       nonblocking_pool.entropy_total);
 #endif
 	trace_get_random_bytes(nbytes, _RET_IP_);
-	extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+	extract_entropy(&nonblocking_pool, buf, nbytes);
 }
 EXPORT_SYMBOL(get_random_bytes);
 
@@ -1374,7 +1374,7 @@ void get_random_bytes_arch(void *buf, int nbytes)
 	}
 
 	if (nbytes)
-		extract_entropy(&nonblocking_pool, p, nbytes, 0, 0);
+		extract_entropy(&nonblocking_pool, p, nbytes);
 }
 EXPORT_SYMBOL(get_random_bytes_arch);
 
-- 
2.6.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* [RFC PATCH 3/4] random: Only do mixback once per read
  2015-10-16 5:28 ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
  2015-10-16 5:29 ` [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool George Spelvin
  2015-10-16 5:30 ` [RFC PATCH 2/4] random: Remove two unused arguments from extract_entropy() George Spelvin
@ 2015-10-16 5:33 ` George Spelvin
  2015-10-16 6:12 ` kbuild test robot
  2015-10-16 6:23 ` kbuild test robot
  2015-10-16 5:34 ` [RFC PATCH 4/4] random: Make non-blocking mixback non-blocking George Spelvin
  3 siblings, 2 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16 5:33 UTC (permalink / raw)
To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Anti-backtracking is only required on read request boundaries, not on
each few bytes of output.  This reduces contention on the pool lock.

Without mixback for each block of output, some other mechanism must
guarantee the hash result will change each time the pool is hashed.
This is provided by the CPU ID and a per-CPU counter acting as a nonce.
Although a per-read nonce would be simpler for the current operation,
doing it per-CPU helps later.

This is a draft patch; this change has security implications which I'm
not entirely happy with in its current state, but it serves as a stub
for testing performance improvements.  (The output is, at least, not so
bad as to cause problems in limited deployment for testing.)

Also, allowing concurrent output from a single pool breaks the FIPS
mode anti-repetition test in a fundamental way.  I'm not sure how to
fix that.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 drivers/char/random.c | 70 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index e62b30ba..cf34e83d 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -966,7 +966,8 @@ EXPORT_SYMBOL_GPL(add_disk_randomness);
 
 static size_t account(struct entropy_store *r, size_t nbytes, int min,
 		      int reserved);
-static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE]);
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE],
+			bool mixback);
 
 /*
  * This utility inline function is responsible for transferring entropy
@@ -1028,10 +1029,10 @@ static void _xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
 	while (bytes) {
 		int i = min_t(int, bytes, EXTRACT_SIZE);
 
-		extract_buf(r->pull, tmp);
+		bytes -= i;
+		extract_buf(r->pull, tmp, bytes == 0);
 		mix_pool_bytes(r, tmp, i);
 		credit_entropy_bits(r, i*8);
-		bytes -= i;
 	}
 
 	memzero_explicit(tmp, sizeof(tmp));
@@ -1110,31 +1111,54 @@ retry:
  * extract_entropy_user.
  *
  * Note: we assume that .poolwords is a multiple of 16 words.
+ *
+ * The "mixback" flag enables anti-backtracking.  This need only be
+ * set on the last extract_buf in a contiguous series of requests.
+ * Ensuring distinct outputs is done by including a unique nonce
+ * (consisting of the CPU ID and a per-CPU counter that will not wrap
+ * before a mixback occurs).
+ *
+ * (FIXME: Is one hash's worth of mixback sufficient anti-backtracking
+ * protection?  Should we feed back more?)
  */
-static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE],
+			bool mixback)
 {
 	int i;
 	union {
 		__u32 w[5];
-		unsigned long l[LONGS(20)];
+		unsigned long l[LONGS(16)];
 	} hash;
+	__u32 nonce;
 	__u32 workspace[SHA_WORKSPACE_WORDS];
-	unsigned long flags;
+	static DEFINE_PER_CPU(__u32, random_nonce);
 
 	/*
	 * If we have an architectural hardware random number
	 * generator, use it for SHA's initial vector
	 */
 	sha_init(hash.w);
-	for (i = 0; i < LONGS(20); i++) {
+	for (i = 0; i < LONGS(16); i++) {
 		unsigned long v;
 		if (!arch_get_random_long(&v))
 			break;
 		hash.l[i] = v;
 	}
 
+	/* Add the current CPU ID and nonce */
+	hash.w[3] += get_cpu();
+	nonce = __this_cpu_inc_return(random_nonce);
+	put_cpu();
+
+	/*
+	 * Theoretically, it's possible on a 64-bit system for someone to
+	 * request EXTRACT_SIZE << 32 bytes in one read.  So force mixback
+	 * to be true each time the nonce wraps.
+	 */
+	hash.w[4] += nonce;
+	mixback |= !nonce;
+
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);
 
@@ -1146,9 +1170,11 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
	 * mixing at least a SHA1 worth of hash data back, we make
	 * brute-forcing the feedback as hard as brute-forcing the
	 * hash.
+	 *
+	 * FIXME: update security analysis in light of reduced mixback.
	 */
-	__mix_pool_bytes(r, hash.w, sizeof(hash.w));
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (mixback)
+		mix_pool_bytes(r, hash.w, sizeof(hash.w));
 
 	memzero_explicit(workspace, sizeof(workspace));
 
@@ -1177,7 +1203,7 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 			       size_t nbytes)
 {
-	ssize_t ret = 0, i;
+	ssize_t ret = 0;
 	__u8 tmp[EXTRACT_SIZE];
 	unsigned long flags;
 
@@ -1190,7 +1216,7 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 		trace_extract_entropy(r->name, EXTRACT_SIZE,
 				      ENTROPY_BITS(r), _RET_IP_);
 		xfer_secondary_pool(r, EXTRACT_SIZE);
-		extract_buf(r, tmp);
+		extract_buf(r, tmp, nbytes == 0);
 		spin_lock_irqsave(&r->lock, flags);
 		memcpy(r->last_data, tmp, EXTRACT_SIZE);
 	}
@@ -1202,7 +1228,10 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 	nbytes = account(r, nbytes, 0, 0);
 
 	while (nbytes) {
-		extract_buf(r, tmp);
+		size_t i = min_t(size_t, nbytes, EXTRACT_SIZE);
+
+		nbytes -= i;
+		extract_buf(r, tmp, nbytes == 0);
 
 		if (fips_enabled) {
 			spin_lock_irqsave(&r->lock, flags);
@@ -1211,9 +1240,8 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 			memcpy(r->last_data, tmp, EXTRACT_SIZE);
 			spin_unlock_irqrestore(&r->lock, flags);
 		}
-		i = min_t(int, nbytes, EXTRACT_SIZE);
+
 		memcpy(buf, tmp, i);
-		nbytes -= i;
 		buf += i;
 		ret += i;
 	}
@@ -1231,15 +1259,17 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
 				    size_t nbytes)
 {
-	ssize_t ret = 0, i;
+	ssize_t ret = 0;
 	__u8 tmp[EXTRACT_SIZE];
-	int large_request = (nbytes > 256);
+	bool large_request = (nbytes > 256);
 
 	trace_extract_entropy_user(r->name, nbytes, ENTROPY_BITS(r), _RET_IP_);
 	xfer_secondary_pool(r, nbytes);
 	nbytes = account(r, nbytes, 0, 0);
 
 	while (nbytes) {
+		size_t i;
+
 		if (large_request && need_resched()) {
 			if (signal_pending(current)) {
 				if (ret == 0)
@@ -1249,14 +1279,14 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
 			schedule();
 		}
 
-		extract_buf(r, tmp);
-		i = min_t(int, nbytes, EXTRACT_SIZE);
+		i = min_t(size_t, nbytes, EXTRACT_SIZE);
+		nbytes -= i;
+		extract_buf(r, tmp, nbytes == 0);
 		if (copy_to_user(buf, tmp, i)) {
 			ret = -EFAULT;
 			break;
 		}
 
-		nbytes -= i;
 		buf += i;
 		ret += i;
 	}
-- 
2.6.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* Re: [RFC PATCH 3/4] random: Only do mixback once per read 2015-10-16 5:33 ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin @ 2015-10-16 6:12 ` kbuild test robot 2015-10-16 8:11 ` George Spelvin 2015-10-16 6:23 ` kbuild test robot 1 sibling, 1 reply; 30+ messages in thread From: kbuild test robot @ 2015-10-16 6:12 UTC (permalink / raw) To: George Spelvin Cc: kbuild-all, andi, tytso, ahferroin7, jepler, linux-kernel, linux, linux [-- Attachment #1: Type: text/plain, Size: 6019 bytes --] Hi George, [auto build test ERROR on v4.3-rc5 -- if it's inappropriate base, please suggest rules for selecting the more suitable base] url: https://github.com/0day-ci/linux/commits/George-Spelvin/Alternate-sclable-urandom-patchset/20151016-133627 config: i386-randconfig-i1-201541 (attached as .config) reproduce: # save the attached .config to linux build tree make ARCH=i386 All errors (new ones prefixed by >>): In file included from include/asm-generic/percpu.h:6:0, from arch/x86/include/asm/percpu.h:551, from arch/x86/include/asm/preempt.h:5, from include/linux/preempt.h:64, from include/linux/spinlock.h:50, from include/linux/seqlock.h:35, from include/linux/time.h:5, from include/uapi/linux/timex.h:56, from include/linux/timex.h:56, from include/linux/sched.h:19, from include/linux/utsname.h:5, from drivers/char/random.c:238: drivers/char/random.c: In function 'extract_buf': include/linux/percpu-defs.h:91:33: error: section attribute cannot be specified for local variables extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ ^ include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION' DEFINE_PER_CPU_SECTION(type, name, "") ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:92:26: error: section attribute cannot be specified for local variables __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ ^ include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION' DEFINE_PER_CPU_SECTION(type, name, "") ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:92:26: error: declaration of '__pcpu_unique_random_nonce' with no linkage follows extern declaration __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ ^ include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION' DEFINE_PER_CPU_SECTION(type, name, "") ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:91:33: note: previous declaration of '__pcpu_unique_random_nonce' was here extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ ^ include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION' DEFINE_PER_CPU_SECTION(type, name, "") ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ >> drivers/char/random.c:1134:31: error: section attribute cannot be specified for local variables static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:93:44: note: in definition of macro 'DEFINE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name; \ ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ >> drivers/char/random.c:1134:31: error: section attribute cannot be specified for local 
variables static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION' __typeof__(type) name ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ >> drivers/char/random.c:1134:31: error: weak declaration of 'random_nonce' must be public static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION' __typeof__(type) name ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ >> drivers/char/random.c:1134:31: error: declaration of 'random_nonce' with no linkage follows extern declaration static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION' __typeof__(type) name ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ drivers/char/random.c:1134:31: note: previous declaration of 'random_nonce' was here static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:93:44: note: in definition of macro 'DEFINE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name; \ ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ vim +1134 drivers/char/random.c 1128 union { 1129 __u32 w[5]; 1130 unsigned long l[LONGS(16)]; 1131 } hash; 1132 __u32 nonce; 1133 __u32 workspace[SHA_WORKSPACE_WORDS]; > 1134 static DEFINE_PER_CPU(__u32, random_nonce); 1135 1136 /* 1137 * If we have an architectural hardware random number --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/octet-stream, Size: 18058 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH 3/4] random: Only do mixback once per read
  2015-10-16 6:12 ` kbuild test robot
@ 2015-10-16 8:11 ` George Spelvin
  0 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16 8:11 UTC (permalink / raw)
To: linux, lkp
Cc: ahferroin7, andi, jepler, kbuild-all, linux-kernel, linux, tytso

If anyone has the same compile problem, move the "random_nonce"
definition up 10 lines, to before the definition of extract_buf().

It compiles (and boots; I'm running it right now) for me as posted;
I'm not sure what the difference is.  Compiler version?  I'm running
gcc 5.2.1.

^ permalink raw reply	[flat|nested] 30+ messages in thread
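The shape of that fix, for anyone applying it by hand (a sketch only;
the rest of extract_buf() is exactly as in patch 3/4):

/*
 * DEFINE_PER_CPU() expands to section-attributed definitions, which
 * some gcc configurations reject at block scope, so define the nonce
 * at file scope rather than inside extract_buf():
 */
static DEFINE_PER_CPU(__u32, random_nonce);

static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE],
			bool mixback)
{
	__u32 nonce;

	nonce = __this_cpu_inc_return(random_nonce);
	/* ... CPU ID + nonce into the hash, then as in patch 3/4 ... */
}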
* Re: [RFC PATCH 3/4] random: Only do mixback once per read 2015-10-16 5:33 ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin 2015-10-16 6:12 ` kbuild test robot @ 2015-10-16 6:23 ` kbuild test robot 1 sibling, 0 replies; 30+ messages in thread From: kbuild test robot @ 2015-10-16 6:23 UTC (permalink / raw) To: George Spelvin Cc: kbuild-all, andi, tytso, ahferroin7, jepler, linux-kernel, linux, linux [-- Attachment #1: Type: text/plain, Size: 6886 bytes --] Hi George, [auto build test ERROR on v4.3-rc5 -- if it's inappropriate base, please suggest rules for selecting the more suitable base] url: https://github.com/0day-ci/linux/commits/George-Spelvin/Alternate-sclable-urandom-patchset/20151016-133627 config: x86_64-randconfig-x010-10130227 (attached as .config) reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All errors (new ones prefixed by >>): In file included from include/asm-generic/percpu.h:6:0, from arch/x86/include/asm/percpu.h:551, from arch/x86/include/asm/preempt.h:5, from include/linux/preempt.h:64, from include/linux/spinlock.h:50, from include/linux/seqlock.h:35, from include/linux/time.h:5, from include/uapi/linux/timex.h:56, from include/linux/timex.h:56, from include/linux/sched.h:19, from include/linux/utsname.h:5, from drivers/char/random.c:238: drivers/char/random.c: In function 'extract_buf': >> include/linux/percpu-defs.h:91:33: error: section attribute cannot be specified for local variables extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ ^ include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION' DEFINE_PER_CPU_SECTION(type, name, "") ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:92:26: error: section attribute cannot be specified for local variables __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ ^ include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION' DEFINE_PER_CPU_SECTION(type, name, "") ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ >> include/linux/percpu-defs.h:92:26: error: declaration of '__pcpu_unique_random_nonce' with no linkage follows extern declaration __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ ^ include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION' DEFINE_PER_CPU_SECTION(type, name, "") ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:91:33: note: previous declaration of '__pcpu_unique_random_nonce' was here extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ ^ include/linux/percpu-defs.h:116:2: note: in expansion of macro 'DEFINE_PER_CPU_SECTION' DEFINE_PER_CPU_SECTION(type, name, "") ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ drivers/char/random.c:1134:31: error: section attribute cannot be specified for local variables static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:93:44: note: in definition of macro 'DEFINE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name; \ ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ drivers/char/random.c:1134:31: error: section attribute cannot be specified for local variables static 
DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION' __typeof__(type) name ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ drivers/char/random.c:1134:31: error: weak declaration of 'random_nonce' must be public static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION' __typeof__(type) name ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ drivers/char/random.c:1134:31: error: declaration of 'random_nonce' with no linkage follows extern declaration static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:95:19: note: in definition of macro 'DEFINE_PER_CPU_SECTION' __typeof__(type) name ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ drivers/char/random.c:1134:31: note: previous declaration of 'random_nonce' was here static DEFINE_PER_CPU(__u32, random_nonce); ^ include/linux/percpu-defs.h:93:44: note: in definition of macro 'DEFINE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name; \ ^ drivers/char/random.c:1134:9: note: in expansion of macro 'DEFINE_PER_CPU' static DEFINE_PER_CPU(__u32, random_nonce); ^ vim +91 include/linux/percpu-defs.h 7c756e6e Tejun Heo 2009-06-24 85 #define DECLARE_PER_CPU_SECTION(type, name, sec) \ 7c756e6e Tejun Heo 2009-06-24 86 extern __PCPU_DUMMY_ATTRS char __pcpu_scope_##name; \ dd17c8f7 Rusty Russell 2009-10-29 87 extern __PCPU_ATTRS(sec) __typeof__(type) name 7c756e6e Tejun Heo 2009-06-24 88 7c756e6e Tejun Heo 2009-06-24 89 #define DEFINE_PER_CPU_SECTION(type, name, sec) \ 7c756e6e Tejun Heo 2009-06-24 90 __PCPU_DUMMY_ATTRS char __pcpu_scope_##name; \ 0f5e4816 Tejun Heo 2009-10-29 @91 extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ 7c756e6e Tejun Heo 2009-06-24 @92 __PCPU_DUMMY_ATTRS char __pcpu_unique_##name; \ b1a0fbfd Tejun Heo 2013-12-04 93 extern __PCPU_ATTRS(sec) __typeof__(type) name; \ c43768cb Tejun Heo 2009-07-04 94 __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES __weak \ dd17c8f7 Rusty Russell 2009-10-29 95 __typeof__(type) name :::::: The code at line 91 was first introduced by commit :::::: 0f5e4816dbf38ce9488e611ca2296925c1e90d5e percpu: remove some sparse warnings :::::: TO: Tejun Heo <tj@kernel.org> :::::: CC: Tejun Heo <tj@kernel.org> --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/octet-stream, Size: 30837 bytes --] ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH 4/4] random: Make non-blocking mixback non-blocking
  2015-10-16 5:28 ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
  ` (2 preceding siblings ...)
  2015-10-16 5:33 ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin
@ 2015-10-16 5:34 ` George Spelvin
  3 siblings, 0 replies; 30+ messages in thread
From: George Spelvin @ 2015-10-16 5:34 UTC (permalink / raw)
To: andi, tytso; +Cc: ahferroin7, jepler, linux-kernel, linux, linux

Andi Kleen has reported that some loads cause hideous lock congestion
on the /dev/urandom pool lock.  This patch uses some trickery to allow
extract_buf to avoid taking the pool lock to do mixback in case of
contention.

The fundamental idea is that, if there are multiple concurrent readers,
only one mixback is necessary for antibacktracking.  What a reader has
to be sure of is that *some* mixback happens after the reader is done
hashing the pool.

So if one CPU is holding the mixback_lock, the others just spam their
mixback data into a global buffer (races be damned) and set a flag.
The lock holder promises to check that flag and do the mixback before
returning.  This promise applies the entire time the mixback_lock is
held, so the flag must be checked after dropping it.

In theory this could be applied to the /dev/random pool as well, but
that's not subject to the same kind of abuse as urandom, so leave it
alone for the sake of conservatism.

The first two spin_trylock() operations in nonblocking_mixback() are
not strictly necessary; if either one succeeds, that shortens the code
path, but things would work if they failed unconditionally.

Another legal change would be to move the release of the mixback_lock
after the first __mix_pool_bytes.  But since there's a shortcut
optimization if the pool lock is available, that would require keeping
a flag to indicate if the mixback_lock is held and needs to be
released.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 drivers/char/random.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 100 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index cf34e83d..54df2982 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -419,7 +419,6 @@ static LIST_HEAD(random_ready_list);
  *
  **********************************************************************/
 
-struct entropy_store;
 struct entropy_store {
 	/* read-only data: */
 	const struct poolinfo *poolinfo;
@@ -465,7 +464,17 @@ static struct entropy_store blocking_pool = {
 					push_to_pool),
 };
 
-static struct entropy_store nonblocking_pool = {
+/* This is a variant with some extra fields for nonblocking mixback */
+struct entropy_store_plus {
+	struct entropy_store store;
+	bool mixback_flag;
+	__u32 mixback[SHA_DIGEST_WORDS];
+	spinlock_t mixback_lock;
+};
+
+#define nonblocking_pool nonblocking_plus.store
+static struct entropy_store_plus nonblocking_plus = {
+	.store = {
 	.poolinfo = &poolinfo_table[1],
 	.name = "nonblocking",
 	.pull = &input_pool,
@@ -473,6 +482,9 @@ static struct entropy_store nonblocking_pool = {
 	.pool = nonblocking_pool_data,
 	.push_work = __WORK_INITIALIZER(nonblocking_pool.push_work,
 					push_to_pool),
+	},
+	.mixback_flag = false,
+	.mixback_lock = __SPIN_LOCK_UNLOCKED(nonblocking_plus.mixback_lock),
 };
 
 static __u32 const twist_table[8] = {
@@ -554,6 +566,83 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,
 	spin_unlock_irqrestore(&r->lock, flags);
 }
 
+/*
+ * Because the /dev/urandom pool is by far the busiest, large machines
+ * with applications that hit the pool hard can fall off the locking
+ * cliff on the pool lock.  (They *should* use a private CPRNG reseeded
+ * infrequently, but there are lazy programmers in the world.)
+ *
+ * This function avoids contention on the pool's lock, by taking
+ * advantage of the fact that mixback from a given pool state is logically
+ * idempotent: if there are multiple threads wanting to mix back, only
+ * one of them has to succeed.  A non-deterministic mixture of the
+ * competing mixback data is also okay, so we use a common temporary
+ * buffer written without locking.
+ *
+ * It is also not necessary that a thread confirm that mixback has
+ * happened before returning to user space; a backtracking window a few
+ * microseconds long is not a concern, as long as it is strongly bounded.
+ * In this case, because the mixback lock is taken without disabling
+ * interrupts, the window is up to the interrupt latency long, but that's
+ * considered acceptable.
+ *
+ * What is important is that, after a mixback has started, any later
+ * pool readers will schedule an additional mixback.  That's what all
+ * the dancing around with mixback_flag and mixback_lock is about.
+ */
+static void nonblocking_mixback(struct entropy_store_plus *r,
+				const void *in, int nbytes)
+{
+	unsigned long flags;
+
+	if (!spin_trylock_irqsave(&r->store.lock, flags)) {
+		/* Someone's currently writing to the pool.  Anyone waiting? */
+		if (!spin_trylock(&r->mixback_lock)) {
+			/* Leave it for the lock holder to take care of. */
+			nbytes = min_t(int, nbytes, sizeof(r->mixback));
+			memcpy(r->mixback, in, nbytes);
+			smp_store_release(&r->mixback_flag, true);
+			smp_wmb();
+			/*
+			 * If the lock is still held, we are guaranteed that
+			 * the lock holder will promptly do the mixback for us.
+			 */
+			if (!spin_trylock(&r->mixback_lock))
+				return;
+		}
+		/*
+		 * Wait in line to do mixback.  This has to be a separate
+		 * lock because we don't want to oblige interrupt entropy
+		 * sources which also take the pool lock to do mixback.
+		 */
+		spin_lock_irqsave(&r->store.lock, flags);
+		spin_unlock(&r->mixback_lock);
+		smp_mb();  /* Between unlock and read of r->mixback flag */
+	}
+
+	__mix_pool_bytes(&r->store, in, nbytes);
+
+	/*
+	 * Do any buffered mixback.  Holding the mixback_lock obliges
+	 * us to check after dropping it, but we only have to check once.
+	 * (We aren't required to check at all if we never took the
+	 * mixback_lock, but it does no harm to.)
+	 */
+	if (READ_ONCE_CTRL(r->mixback_flag)) {
+		__u32 mixback[ARRAY_SIZE(r->mixback)];
+
+		memcpy(mixback, r->mixback, sizeof mixback);
+		memset(r->mixback, 0, sizeof r->mixback);
+		smp_store_release(&r->mixback_flag, false);
+
+		smp_wmb();  /* Ensure release is seen before mixing */
+		__mix_pool_bytes(&r->store, mixback, sizeof mixback);
+		memzero_explicit(mixback, sizeof mixback);
+	}
+	spin_unlock_irqrestore(&r->store.lock, flags);
+}
+
+
 struct fast_pool {
 	__u32		pool[4];
 	unsigned long	last;
@@ -1173,8 +1262,15 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE],
	 *
	 * FIXME: update security analysis in light of reduced mixback.
	 */
-	if (mixback)
-		mix_pool_bytes(r, hash.w, sizeof(hash.w));
+	if (mixback) {
+		if (r->limit) {
+			mix_pool_bytes(r, hash.w, sizeof(hash.w));
+		} else {
+			struct entropy_store_plus *p =
+				container_of(r, struct entropy_store_plus, store);
+			nonblocking_mixback(p, hash.w, sizeof(hash.w));
+		}
+	}
 
 	memzero_explicit(workspace, sizeof(workspace));
 
-- 
2.6.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread
* Re: Updated scalable urandom patchkit 2015-10-13 21:10 ` George Spelvin 2015-10-14 2:15 ` Andi Kleen @ 2015-10-21 8:27 ` George Spelvin 2015-10-21 11:47 ` Andi Kleen 1 sibling, 1 reply; 30+ messages in thread From: George Spelvin @ 2015-10-21 8:27 UTC (permalink / raw) To: ahferroin7, andi, jepler, linux-kernel, linux, linux, tytso

After having worked out the basics of a non-blocking /dev/urandom read (patches posted; I'm hoping someone will comment on them!), I need to work out the crypto.

Patch 3/4 of that series showed how to have concurrent pool readers not separated by mixback operations, but it was a stub, not really well thought through. Since then, I've been trying to think of how to do it properly. That entails figuring out exactly what the design goals of the mixback are.

The current mixback operation works as follows:
* Generate 160 bits of hash output.
* Produce 80 bits of that as user output.
* Mix all 160 bits back into the pool.

This achieves two things:
* Salting. It changes the pool state so the next hash will produce different output.
* Anti-backtracking protection. By adding the output back to the input, it makes it difficult for an attacker who captures a later pool state to figure out the earlier outputs.

Now, the salting goal can be achieved just as well with a simple unique nonce. In the example, it's an incrementing counter and CPU ID. But the anti-backtracking is more complex.

First of all, we note that the current design is only computationally secure, not information-theoretically. To achieve the latter, we'd have to actually destroy entropy in the pool (by zeroing it or doing something like Spritz's crush() operation), which doesn't seem worth it.

The basic idea of having as much mixback as output seems good. If you're going to do it at all, you might as well do it right, and you don't want a 256-bit key to have a 160-bit attack from a captured pool state. But there is a practical maximum. I can't think of a reason why you'd need more than the output pool size (128 bytes = 1024 bits), but maybe 256 bits is good enough. Thoughts? (256 bits also corresponds to one full cycle of the input mix over the pool, FWIW.)

One thing that's questionable is the fact that the entire 20 bytes is mixed back. Because the input mix diffuses across the pool slowly, it's possible for the mixback to land on a low-entropy part of the pool and not be well hidden. If that happens, the previous random output (or partial information about it) would be recoverable, even if the full previous pool state is not. I'm not sure how big a problem this is, but only mixing back 80 bits (independent of the output bits) might be better.

A change I'm thinking about, which would simplify implementation considerably, would be to use all 160 bits of hash output for the user output, and then do additional hashing to do the mixback. This greatly simplifies the contended-pool case by only requiring readers to pass the *amount* of mixback to the lock holder, which can be done with an atomic_add. There's no need to pass actual data, because the lock holder can just generate it themselves.

Specifically, in the low-contention case, the reader would use any leftover hash bytes for mixback if it could get the lock, but it wouldn't be a big performance hit to throw them away if someone else was holding the lock. In the case of extremely high contention, the limit on maximum total mixback would let the lock holder do less mixback than the total that would be done by the concurrent readers.
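To make the bookkeeping concrete, here is a hypothetical sketch of that scheme. All names here are invented for illustration -- mixback_owed, MIXBACK_MAX, and the extract_pool_hash() helper do not exist in the posted patches -- and barriers, irq flags, and entropy accounting are omitted:

	static atomic_t mixback_owed;	/* bytes of mixback not yet done */
	#define MIXBACK_MAX 32		/* cap total mixback at 256 bits */

	static void mixback_by_amount(struct entropy_store *r, int nbytes)
	{
		__u32 hash[SHA_DIGEST_WORDS];

		/*
		 * Readers record only how *much* mixback is owed; no data
		 * is passed, because the lock holder generates its own.
		 */
		atomic_add(nbytes, &mixback_owed);

		if (!spin_trylock(&r->lock))
			return;	/* the current holder settles the debt */

		/* Clamp the debt: concurrent reads of one pool state
		 * need only a bounded total amount of mixback. */
		nbytes = min_t(int, atomic_xchg(&mixback_owed, 0), MIXBACK_MAX);
		while (nbytes > 0) {
			extract_pool_hash(r, hash);	/* invented helper */
			__mix_pool_bytes(r, hash, sizeof(hash));
			nbytes -= sizeof(hash);
		}
		spin_unlock(&r->lock);
	}

Any debt added between the atomic_xchg and the unlock is simply picked up by the next reader who gets the lock, which is the same bounded sloppiness tolerated elsewhere in this design.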
But that requires understanding the reasons for only outputting half of the hash result, to know if it's safe to make the change. Three possibilities come to mind:

1. To produce enough mixback. This is preserved by the proposed change.

2. To avoid entropy loss due to collisions in the non-invertible final addition in a "narrow-pipe" hash like SHA. This is a generic issue with hashes which do not maintain an internal state wider than the output. This includes MD5, SHA-1, SHA-2, and BLAKE/BLAKE2. (But not Keccak, SipHash, JH, Grøstl, Skein, etc.)

Assume the first half of the pool has lots of entropy, so the 160-bit output of the first sha_transform() call is perfectly uniformly distributed. (Each of the 2^160 possible outputs occurs with probability 2^-160, so 160 bits of entropy by any measure.) If the second half of the pool has no entropy (a known, fixed value), then sha_transform() approximates a 160-to-160-bit random function.

Such a function has output collisions with probabilities described by the Poisson distribution with mean (lambda) 1. (This number is the ratio of input and output state sizes.) In particular, 1/e = 36.788% of the outputs do not appear for any input. (If there are k times more inputs than outputs, all equally likely, it's e^-k. So just a few extra bits reduces that drastically.)

If you do the Shannon entropy math, the output has about 0.827245389... fewer bits of entropy than the input. (If there are k states in the second half of the pool, the entropy loss is slightly less than 1/k of this value. So 163 bits in produces 160 - 0.827/8 = 159.9 bits out.)

If you're going by min-entropy, it's worse. Over 2^160 inputs, the Poisson distribution predicts 3.5 41-way collisions, and 0.0866 42-way collisions. So the loss is log2(41) = 5.358 bits. (If there are 8 states in the second half, then we expect a 77-way collision, for a min-entropy of 160 - log2(77/8) = 160 - 3.2668 = 156.73 bits.)

(For more on this, the work of Andrea Röck, e.g. "Collision Attacks based on the Entropy Loss caused by Random Functions", is good reading.)

Although this affects all narrow-pipe hashes, you only need to chop a few bits off the state size to be nearly perfect in Shannon terms, and 8 bits gets you less than 1 bit of min-entropy lost. With 2 bytes chopped, the largest collision you expect to find among 2^176 inputs is a 69270-way one, resulting in a loss of log2(69270/65536) = 0.08 bits of min-entropy. So while I can see the point in truncating the hash, cutting it in half seems unnecessary.

3. General "I don't trust SHA-1" principles without a more specific reason. If this is all there is, would substituting a different hash (I'm looking into BLAKE2 at Ted's suggestion) help? Note that the entire SHAppening issue is not particularly relevant to /dev/urandom attacks. Attacking the /dev/urandom output mix is more like a pre-image attack, but considerably harder, since you're trying to find not just *a* pre-image, but *the* pre-image. Using the expected pre-image resistance of the hash is insanely conservative. Perhaps not a bad thing, though. But could we use 128 bits rather than 80?

Finally, I'm trying to figure out how to use the get_random_int_hash array for salt material. We already have it, so it would be nice to use. But the current way it's used (basically, it uses random_int_secret as a key and runs in OFB mode) leaves the last random number visible. A pool reader must change its salt if it can't get the mixback lock.
There may be a few bytes of leftover entropy available for the purpose if the hash size doesn't divide the read size. I could always do another hash iteration to reseed get_random_int_hash if I can't get the mixback lock, but is there a cheaper way? ^ permalink raw reply [flat|nested] 30+ messages in thread
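For what it's worth, the 0.827245389-bit Shannon entropy loss figure cited in the message above is easy to verify numerically: with preimage counts k ~ Poisson(1), the loss works out to e^-1 * sum_{k>=2} k*log2(k)/k!. A stand-alone user-space check (not part of any patch):

	#include <math.h>
	#include <stdio.h>

	/*
	 * Shannon entropy loss (in bits) of an n-to-n-bit random function:
	 * each output has k preimages with k ~ Poisson(1), giving a loss
	 * of e^-1 * sum_{k>=2} k*log2(k)/k!.  The series converges fast,
	 * so 30 terms is far more than enough.
	 */
	int main(void)
	{
		double sum = 0.0, kfact = 1.0;
		int k;

		for (k = 2; k <= 30; k++) {
			kfact *= k;	/* kfact == k! */
			sum += k * log2(k) / kfact;
		}
		/* Prints ~0.827245389, matching the figure quoted above. */
		printf("entropy loss = %.9f bits\n", sum * exp(-1.0));
		return 0;
	}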
* Re: Updated scalable urandom patchkit 2015-10-21 8:27 ` Updated scalable urandom patchkit George Spelvin @ 2015-10-21 11:47 ` Andi Kleen 2015-10-21 18:10 ` George Spelvin 0 siblings, 1 reply; 30+ messages in thread From: Andi Kleen @ 2015-10-21 11:47 UTC (permalink / raw) To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux, tytso On Wed, Oct 21, 2015 at 04:27:43AM -0400, George Spelvin wrote: > After having worked out the basics of a non-blocking /dev/urandom read, > (patches poated; I'm hoping someone will comment on them!), I need to > work out the crypto. Can we just use my multi pool patch for now? It works and is actually scalable and does not require any new "cryptographic research" or other risks. The main concern I heard was code size. I addressed this in the latest version by making the new code depend on CONFIG_NUMA, which is generally not set on small systems. -Andi ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Updated scalable urandom patchkit 2015-10-21 11:47 ` Andi Kleen @ 2015-10-21 18:10 ` George Spelvin 0 siblings, 0 replies; 30+ messages in thread From: George Spelvin @ 2015-10-21 18:10 UTC (permalink / raw) To: andi, linux; +Cc: ahferroin7, jepler, linux-kernel, linux, tytso > Can we just use my multi pool patch for now? It works and is actually scalable > and does not require any new "cryptographic research" or other risks. To clarify, it's actually quite easy to come up with something that works, cryptographically. All the discussion is: 1) Erring on the side of public discussion for a security-sensitive area, and 2) Trying to find "The Right Thing" among the many workable solutions. Earlier, Ted seemed to be interested in more significant changes like replacing sha_transform and "abuse mitigation mode", so I've been thinking expansively. The immediate plans aren't necessarily quite as complex. Mostly, it's an embarrassment of ways to do it, and trying to find a reason to choose one over the other. I wrote one very quickly, said "I can do better", thought about coding that up, thought of an even better solution, and decided to post the performance-affecting part while I played donkey-between-two-piles-of-hay. > or other risks. Ted gets to decide, but my objection is that the already-sparse entropy supply is considerably diluted. That *is* a risk. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Updated scalable urandom patchkit @ 2015-09-24 17:19 Andi Kleen 0 siblings, 0 replies; 30+ messages in thread From: Andi Kleen @ 2015-09-24 17:19 UTC (permalink / raw) To: tytso; +Cc: linux-kernel, linux

Updated patchkit to make /dev/urandom scale on larger systems. I addressed all the review comments. See ChangeLogs in the individual patches.

-Andi

^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2015-10-21 18:10 UTC | newest]

Thread overview: 30+ messages
2015-10-10 18:45 Updated scalable urandom patchkit George Spelvin
2015-10-11  2:31 ` Theodore Ts'o
2015-10-11  2:53 ` Theodore Ts'o
2015-10-11  4:35 ` George Spelvin
2015-10-11 22:25 ` Theodore Ts'o
2015-10-12  0:16 ` George Spelvin
2015-10-12  4:05 ` Theodore Ts'o
2015-10-12  7:49 ` George Spelvin
2015-10-12 13:54 ` Theodore Ts'o
2015-10-12 20:30 ` George Spelvin
2015-10-12 20:34 ` George Spelvin
2015-10-13  2:46 ` Theodore Ts'o
2015-10-13  3:50 ` Raymond Jennings
2015-10-13  7:50 ` George Spelvin
2015-10-13  6:24 ` George Spelvin
2015-10-13 16:20 ` Andi Kleen
2015-10-13 21:10 ` George Spelvin
2015-10-14  2:15 ` Andi Kleen
2015-10-16  5:28 ` [RFC PATCH 0/4] Alternate sclable urandom patchset George Spelvin
2015-10-16  5:29 ` [RFC PATCH 1/4] random: Reduce stack usage in _xfer_secondary_pool George Spelvin
2015-10-16  5:30 ` [RFC PATCH 2/4] random: Remove two unused arguments from extract_entropy() George Spelvin
2015-10-16  5:33 ` [RFC PATCH 3/4] random: Only do mixback once per read George Spelvin
2015-10-16  6:12 ` kbuild test robot
2015-10-16  8:11 ` George Spelvin
2015-10-16  6:23 ` kbuild test robot
2015-10-16  5:34 ` [RFC PATCH 4/4] random: Make non-blocking mixback non-blocking George Spelvin
2015-10-21  8:27 ` Updated scalable urandom patchkit George Spelvin
2015-10-21 11:47 ` Andi Kleen
2015-10-21 18:10 ` George Spelvin

-- strict thread matches above, loose matches on Subject: below --
2015-09-24 17:19 Andi Kleen