* [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
@ 2014-06-09  0:05 George Spelvin
  2014-06-09  1:35 ` Theodore Ts'o
  0 siblings, 1 reply; 57+ messages in thread
From: George Spelvin @ 2014-06-09  0:05 UTC (permalink / raw)
To: hpa, price, tytso; +Cc: linux, linux-kernel, mingo

I have a few assorted cleanups in-work, but I was thinking about the following, and it's led to some thinking about the (non-)usefulness of the out[] buffer from _mix_pool_bytes.

The problem I started trying to solve is readers (entropy extractors) monopolizing the pool lock and stalling writers (entropy mixers), which are supposed to be fast and low-overhead. So I thought I could rely on the "out" buffer returned from _mix_pool_bytes to make concurrent extractors produce different output. That led to the patch below, which drops the lock during the main pool hash and only locks around the mix-back. (Using the mix_pool_bytes wrapper rather than explicit locking and __mix_pool_bytes.)

But I'm not sure it's quite right. Suppose we have a pool that's mostly all-zero, but has a small dense high-entropy section, and no arch_get_random_long(). Narrowing the lock lets concurrent readers get to the __mix_pool_bytes() mix-back (last hunk of the diff) with identical SHA states. With the original code, which fills out[] with data from just *after* the add_ptr (i.e. the oldest portion of the pool), it's possible for multiple readers to obtain all-zero out[] buffers and return identical hashes. (Or, in a less extreme case, very low-entropy out[] buffers and strongly correlated outputs. I'll keep using "all-zero" to make the examples simpler, and assume people can see the extrapolation.)

Well, okay, I think, easily fixed (included in the patch below): change it to return the most recently modified section just *before* the add_ptr, which includes the previous extractor's SHA hash. That way, I'm guaranteed to get different data on multiple concurrent calls and everything is okay.
But is it? Suppose I have three concurrent callers. (I need three because the hash folding covers up the problem with two.) The pool contains 30 bytes of entropy, so it should be possible to return uncorrelated outputs to each of them, but as mentioned that's in a small dense region of the pool and the area around add_ptr is all zeros. Each hashes the pool to obtain identical 20-byte SHA-1 hashes. Then they serialize around calls to _mix_pool_bytes. The first gets zeros in out[], the second gets the first's SHA-1 hash and so generates a different hash, and the third gets the second's SHA-1 hash. So they all produce different outputs, but the outputs (totalling 30 bytes) passed through a 20-byte choke point and so have at most 20 bytes of entropy between them. This violates the /dev/random security goals.

Here are the solutions I've been thinking about:

1) Continue to lock the entire hashing operation. As long as we do this, the out[] parameter to _mix_pool_bytes is unnecessary and should be deleted. Since the pool can't change between the bulk hash and the hash of out[], there's no point to re-hashing some data that was just hashed. (You are sampling the add_ptr, which differs between callers, but that's negligible.)

One way to address my concern would be to split r->lock into r->mix_lock and r->extract_lock. Extractors would obtain the latter for the entire hash, and only obtain the mix_lock for the brief mix-back operation. The downside, of course, is a second lock on the extract path, but given that the efficiency of the extract path is not a design goal, perhaps that's fine.

2) Shrink the lock and add a nonce (like a CPU ID) to the *initial* SHA state. This is fun because of its greater cryptographic cleverness, but that means it should be discussed carefully first. Due to the non-linearity of SHA hashing, this would result in the concurrent hashes sampling different parts of the pool's entropy, and the 20-byte entropy bottleneck would disappear.
But in this case, the out[] buffer also does not contribute to the security guarantee and could also disappear.

Thoughts, anyone?

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 102c50d..d847367 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -499,6 +499,15 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 	input_rotate = ACCESS_ONCE(r->input_rotate);
 	i = ACCESS_ONCE(r->add_ptr);
 
+	/*
+	 * Out gets a copy of the portion of the pool most recently
+	 * modified by previous writers.  Since add_ptr counts down,
+	 * this is at positive offsets from the initial value.
+	 */
+	if (out)
+		for (j = 0; j < 16; j++)
+			((__u32 *)out)[j] = r->pool[(i + j) & wordmask];
+
 	/* mix one byte at a time to simplify size handling and churn faster */
 	while (nbytes--) {
 		w = rol32(*bytes++, input_rotate);
@@ -527,10 +536,6 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 	ACCESS_ONCE(r->input_rotate) = input_rotate;
 	ACCESS_ONCE(r->add_ptr) = i;
 	smp_wmb();
-
-	if (out)
-		for (j = 0; j < 16; j++)
-			((__u32 *)out)[j] = r->pool[(i - j) & wordmask];
 }
 
 static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1043,7 +1048,6 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 	}
 
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);
 
@@ -1056,8 +1060,7 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 	 * brute-forcing the feedback as hard as brute-forcing the
 	 * hash.
 	 */
-	__mix_pool_bytes(r, hash.w, sizeof(hash.w), extract);
-	spin_unlock_irqrestore(&r->lock, flags);
+	mix_pool_bytes(r, hash.w, sizeof(hash.w), extract);
 
 	/*
 	 * To avoid duplicates, we atomically extract a portion of the

^ permalink raw reply related	[flat|nested] 57+ messages in thread
* Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
  2014-06-09  0:05 [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe? George Spelvin
@ 2014-06-09  1:35 ` Theodore Ts'o
  2014-06-09  2:10   ` George Spelvin
  2014-06-09 13:17   ` drivers/char/random.c: More futzing about George Spelvin
  0 siblings, 2 replies; 57+ messages in thread
From: Theodore Ts'o @ 2014-06-09  1:35 UTC (permalink / raw)
To: George Spelvin; +Cc: hpa, price, linux-kernel, mingo

On Sun, Jun 08, 2014 at 08:05:24PM -0400, George Spelvin wrote:
> The problem I started trying to solve is readers (entropy extractors)
> monopolizing the pool lock and stalling writers (entropy mixers), which
> are supposed to be fast and low overhead.

Which writer are you worried about, specifically? A userspace write to /dev/random from rngd? And have you measured this to be something significant, or is this a theoretical concern? If you've measured it, what are the conditions where this is stalling an entropy mixer a significant amount of time?

- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
  2014-06-09  1:35 ` Theodore Ts'o
@ 2014-06-09  2:10   ` George Spelvin
  2014-06-09  2:18     ` George Spelvin
  2014-06-09 13:17   ` drivers/char/random.c: More futzing about George Spelvin
  1 sibling, 1 reply; 57+ messages in thread
From: George Spelvin @ 2014-06-09  2:10 UTC (permalink / raw)
To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> Which writer are you worried about, specifically? A userspace write
> to /dev/random from rngd?

Actually, the part I'm worried about is latency in add_interrupt_randomness. But I may have not understood how the locking works.

There's a per-CPU fast pool which isn't locked at all, which feeds the input_pool, and the add to the input pool would be slow if it were locked, but it seems hard for user space to force a lot of locking activity on the input_pool. A bulk read will drain the secondary pool, but random_min_urandom_seed will prevent reseeding, so lock contention on input_pool will be almost nil. Maybe I need to turn this into a documentation patch explaining all of this.

> And have you measured this to be something significant, or is this a
> theoretical concern? If you've measured it, what are the conditions
> where this is stalling an entropy mixer a significant amount of time?

It's a theoretical extrapolation from timing of user-space writes.
Some time comparisons (on a multicore machine so the two threads should run on separate processors, and with scaling_governor = performance):

$ dd if=/dev/zero of=/dev/random count=65536
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.356898 s, 94.0 MB/s
$ dd if=/dev/zero of=/dev/random count=65536
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.357693 s, 93.8 MB/s
$ dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/zero of=/dev/random count=65536 ; sleep 4
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.505941 s, 66.3 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.715132 s, 11.7 MB/s
$ dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/zero of=/dev/random count=65536 ; sleep 4
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.509075 s, 65.9 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.734479 s, 11.4 MB/s
$ dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/zero of=/dev/random count=65536
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.66224 s, 50.7 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.894693 s, 9.4 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.895628 s, 9.4 MB/s
$ dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/urandom of=/dev/null count=16384 & dd if=/dev/zero of=/dev/random count=65536 ; sleep 4
65536+0 records in
65536+0 records out
33554432 bytes (34 MB) copied, 0.657055 s, 51.1 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.908407 s, 9.2 MB/s
16384+0 records in
16384+0 records out
8388608 bytes (8.4 MB) copied, 0.908989 s, 9.2 MB/s

Summarizing that, time to feed in 32 MiB of zeros (from user-space):

0 concurrent reads: 0.356898 0.357693
1 concurrent read:  0.505941 0.509075 (+42%)
2 concurrent reads: 0.662240 0.657055 (+84%)

... so it seems like there are some noticeable contention effects.

And then I started thinking, and realized that the out[] parameter wasn't doing anything useful at all, and even if we don't change anything else, maybe it should be deleted to de-clutter the code? It dates back to the days when entropy adding tried to be lockless, but that was a long time ago. And once you have locking, it no longer serves any function.

It's a bit of a meandering stream of consciousness, sorry. The latency issue is where I started, but the general uselessness of out[] is what I felt really needed discussing before I proposed a patch to dike it out.

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
  2014-06-09  2:10 ` George Spelvin
@ 2014-06-09  2:18   ` George Spelvin
  2014-06-09  4:03     ` George Spelvin
  0 siblings, 1 reply; 57+ messages in thread
From: George Spelvin @ 2014-06-09  2:18 UTC (permalink / raw)
To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> Summarizing that, time to feed in 32 MiB of zeros (from user-space):
>
> 0 concurrent reads: 0.356898 0.357693
> 1 concurrent read:  0.505941 0.509075 (+42%)
> 2 concurrent reads: 0.662240 0.657055 (+84%)

Er, wait a minute... I'm not sure which kernel (patched or unpatched) I did that measurement on. That may be a completely useless number. I need to compile some more kernels and reboot a few more times.

Sorry about that.

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
  2014-06-09  2:18 ` George Spelvin
@ 2014-06-09  4:03   ` George Spelvin
  2014-06-09  9:23     ` George Spelvin
  2014-06-09 13:34     ` Theodore Ts'o
  0 siblings, 2 replies; 57+ messages in thread
From: George Spelvin @ 2014-06-09  4:03 UTC (permalink / raw)
To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

Sigh, adventures in "unable to mount root filesystem" currently underway. Tested on a different computer without patches (half size, since it's an older machine):

Writing 16 MiB:
16777216 bytes (17 MB) copied, 0.289169 s, 58.0 MB/s
16777216 bytes (17 MB) copied, 0.289378 s, 58.0 MB/s

Writing while reading:
16777216 bytes (17 MB) copied, 0.538839 s, 31.1 MB/s
4194304 bytes (4.2 MB) copied, 0.544769 s, 7.7 MB/s
16777216 bytes (17 MB) copied, 0.537425 s, 31.2 MB/s
4194304 bytes (4.2 MB) copied, 0.544259 s, 7.7 MB/s
16777216 bytes (17 MB) copied, 0.740495 s, 22.7 MB/s
4194304 bytes (4.2 MB) copied, 0.879353 s, 4.8 MB/s
4194304 bytes (4.2 MB) copied, 0.879629 s, 4.8 MB/s
16777216 bytes (17 MB) copied, 0.7262 s, 23.1 MB/s
4194304 bytes (4.2 MB) copied, 0.877035 s, 4.8 MB/s
4194304 bytes (4.2 MB) copied, 0.880627 s, 4.8 MB/s
16777216 bytes (17 MB) copied, 0.996933 s, 16.8 MB/s
4194304 bytes (4.2 MB) copied, 1.24551 s, 3.4 MB/s
4194304 bytes (4.2 MB) copied, 1.26138 s, 3.3 MB/s
4194304 bytes (4.2 MB) copied, 1.2664 s, 3.3 MB/s
16777216 bytes (17 MB) copied, 0.969144 s, 17.3 MB/s
4194304 bytes (4.2 MB) copied, 1.25311 s, 3.3 MB/s
4194304 bytes (4.2 MB) copied, 1.26076 s, 3.3 MB/s
4194304 bytes (4.2 MB) copied, 1.25887 s, 3.3 MB/s

Summarized:
0 readers: 0.289169 0.289378
1 reader:  0.538839 0.537425 (+86%)
2 readers: 0.740495 0.726200 (+153%)
3 readers: 0.996933 0.969144 (+240%)

That seems... noticeable. Causing interrupt latency problems is definitely a theoretical extrapolation, however.

For comparison, on this system, dd from /dev/zero runs at 1 GB/s per thread for up to 4 threads with no interference.
*Really* confusingly, dd from /dev/zero to tmpfs runs at 450 MB/s (per thread) for 2 to 4 threads, but 325 MB/s for 1 thread. NFclue. (This is writing to separate files; writing to the same file is slower.) dd from tmpfs to tmpfs runs at about 380 MB/s, again independent of the number of threads up to the number of CPUs.

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
  2014-06-09  4:03 ` George Spelvin
@ 2014-06-09  9:23   ` George Spelvin
  2014-06-09 13:34   ` Theodore Ts'o
  1 sibling, 0 replies; 57+ messages in thread
From: George Spelvin @ 2014-06-09  9:23 UTC (permalink / raw)
To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> Sigh, adventures in "unable to mount root filesystem" currently underway.

PEBKAC resolved. (I had CONFIG_ATA=m for some testing, but forgot to change it back to y when compiling a kernel I expected to boot.)

32 MiB write:  Without patch        With patch
0 readers:     0.495523 0.494998    0.515995 0.516026
1 reader:      0.842245 0.845308    0.704283 0.704318
2 readers:     1.18658  1.18762     0.904635 0.904844
3 readers:     1.49615  1.49616     1.14311  1.16836

I haven't rebooted back to the without-patch kernel to see if the 20 ms loss is consistent, but obviously that needs fixing. (There's more than just that patch I posted in there.)

Anyway, it's noticeable, but maybe not worth fixing. The interesting question is whether we should get rid of out[]. That's obscurity that contributes no security, and thus a Bad Thing by definition.

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
  2014-06-09  4:03 ` George Spelvin
  2014-06-09  9:23   ` George Spelvin
@ 2014-06-09 13:34   ` Theodore Ts'o
  2014-06-09 15:04     ` George Spelvin
  1 sibling, 1 reply; 57+ messages in thread
From: Theodore Ts'o @ 2014-06-09 13:34 UTC (permalink / raw)
To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Mon, Jun 09, 2014 at 12:03:55AM -0400, George Spelvin wrote:
> That seems... noticeable. Causing interrupt latency problems is definitely
> a theoretical extrapolation, however.

Actually, if you look very closely, or take a look at the commit description for 902c098a3663de, you'll see that we're actually updating the entropy pool from add_interrupt_randomness() w/o taking any locks. So that's actually not a problem.

What could be a problem is if we get really unlucky (an interrupt update happening at the same time as an update from rngd), it's possible we could end up with an entropy addition getting lost. It's not such a big deal if a contribution from add_interrupt_randomness() gets lost, since we're only giving one bit's worth of entropy credit. But it's possible that a much larger input from rngd could potentially end up getting overwritten by the fast_pool contribution.

This isn't a disaster per se, but it probably means that it's worth taking a closer look at how we do the entropy pool mixing input to see if we can actually make a fully lockless add_entropy work correctly. One approach might be to use atomics, but of course then we have to balance the increased overhead of using atomic_t types versus the locking overhead.

Cheers,

- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
  2014-06-09 13:34 ` Theodore Ts'o
@ 2014-06-09 15:04   ` George Spelvin
  2014-06-09 15:50     ` Theodore Ts'o
  0 siblings, 1 reply; 57+ messages in thread
From: George Spelvin @ 2014-06-09 15:04 UTC (permalink / raw)
To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> Actually, if you look very closely, or take a look at the commit
> description for 902c098a3663de, you'll see that we're actually
> updating the entropy pool from add_interrupt_randomness() w/o taking
> any locks. So that's actually not a problem.

Arrgh... I could swear I saw a path that locked, but now that I look again I can't see it. I think you're right and I was hallucinating. Nice work on credit_entropy_bits, BTW.

> What could be a problem is if we get really unlucky (an interrupt
> update happening at the same time as an update from rngd), it's
> possible we could end up with an entropy addition getting lost. It's
> not such a big deal if a contribution from add_interrupt_randomness()
> gets lost, since we're only giving one bit's worth of entropy credit. But
> it's possible that a much larger input from rngd could potentially end
> up getting overwritten by the fast_pool contribution.

What I can't see in the code is how that case is detected and the entropy credit suppressed.

I just want to be sure I understand your use of the word "overwritten". There's the entropy itself, and the associated credit. I'm talking about the fact that _mix_pool_bytes can cause the actual entropy bits to get overwritten in the pool if two writers race each other. As you say, it's always possible to fall back to zero credit, but you have to *detect* the situation, and I'm not sure how that happens. While it's easy enough to have a special case for one interrupt handler, there's one per processor that can be trying to write.
The obvious way is to do a trylock of some sort from the interrupt handler and hold off spilling the fast_pool if the attempt fails. So there's a lock of a sort, but no spinning on it.

> This isn't a disaster per se, but it probably means that it's worth
> taking a closer look at how we do the entropy pool mixing input to see
> if we can actually make a fully lockless add_entropy work correctly.
> One approach might be to use atomics but of course then we have to
> balance the increased overhead of using atomic_t types versus the
> locking overhead.

Doing it really right, including clamping at the endpoints, is extremely tricky in the face of concurrent access. Consider the case of a full pool, and a concurrent write and read of a full pool's worth of entropy. If the write wins the race, it overfills the pool, so may not give any credit, and the read then empties it. If the read wins the race, it empties the pool, and then the write refills it. But if there's no enforced serialization, the final condition is that the amount of entropy in the pool is completely unknown; it could be anything.

The implementation that seems most obvious to me, if we can have the readers and the writers cooperating at least, uses monotonically increasing head and tail counters, with (head - tail) being the current entropy budget. To handle concurrent update, the head counter is split in two, into head_max and head_min. The "max" is incremented before the pool itself is accessed, and the "min" is incremented after.

When adding, the writer looks at the tail and chooses an amount to credit that will not overflow the pool. Then it increments head_max, mixes in to the pool, and finally lets head_min catch up. When extracting, the reader first blocks (if desired) on head_min. Then the extractor does the extract and advances the tail by the amount extracted, but no more than head_max after the extract.

Oh! I just realized that concurrent mix_pool_bytes can be handled with a similar scheme.
(To be clear, I've jumped from talking about entropy accounting back to pool mixing.)

*If* we can bound the total number of bytes to be concurrently mixed in to the pool to the size of the pool, then you can assign ranges of bytes to write using a sort of ticket lock. Each adder bumps the add_ptr atomically by the amount of data they want to add, masks the pre-increment counter value, and uses that to determine the area of the pool they will be updating.

What gets nasty is if you have a large SMP system and more writers than can be handled thus. Because there's no guarantee on the order that writers will *finish*, there's no simple way to keep track of adds completed and thereby know if an additional writer has room to work. I can't just atomically increment the completed index when I finish adding, because that could be misleading. If processor A reads the add_ptr as 10 and increments it to 20, then processor B reads the add_ptr as 20 and increments it to 30, everything is fine so far. But if processor B finishes first and increments the adds_completed index to 20, it's lying: the range from 10 to 20 is still busy and attempting to access it will lead to trouble.

Ah! It turns out that there is a simple technique after all. I have three add counters, and due to lack of imagination thinking of good descriptive names, I'll call them add1, add2 and add3.

add1-add2 is the amount of outstanding writes in progress. The location of the writes is not known, except that it's somewhere between add3 and add1. The invariant is add3 <= add2 <= add1.

add1 is the ticket lock. A would-be writer checks that it is not trying to increment add1 to more than add3+POOLSIZE, and if so does an atomic fetch-and-add on add1. This tells it the range of the pool it's allowed to write to. All good, and it does the appropriate update. When the write completes, atomically bump add2. Then check if add2 == add1. If not, return.
If they are equal, set add3 to the shared value, collapsing the "writes in progress somewhere in this range" window. This would be bad TCP, periodically collapsing the window, but it should be fine for an entropy pool.

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
  2014-06-09 15:04 ` George Spelvin
@ 2014-06-09 15:50   ` Theodore Ts'o
  2014-06-09 16:11     ` George Spelvin
  2014-06-10  0:20     ` drivers/char/random.c: more ruminations George Spelvin
  0 siblings, 2 replies; 57+ messages in thread
From: Theodore Ts'o @ 2014-06-09 15:50 UTC (permalink / raw)
To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Mon, Jun 09, 2014 at 11:04:26AM -0400, George Spelvin wrote:
> > What could be a problem is if we get really unlucky (an interrupt
> > update happening at the same time as an update from rngd), it's
> > possible we could end up with an entropy addition getting lost. It's
> > not such a big deal if a contribution from add_interrupt_randomness()
> > gets lost, since we're only giving one bit's worth of entropy credit. But
> > it's possible that a much larger input from rngd could potentially end
> > up getting overwritten by the fast_pool contribution.
>
> What I can't see in the code is how that case is detected and the entropy
> credit suppressed.

Sorry, I didn't make myself clear. In the common case, where we don't have rngd running and so we only have add_interrupt_randomness() racing with itself, what's at stake is only one bit of entropy credit.

In the absolute worst case, where two CPUs run add_interrupt_randomness() at the exact same time (after servicing different irqs), what can happen is that both CPUs will get the same add_ptr, and so both CPUs will end up updating the same portion of the entropy pool, and so one CPU's worth of add_interrupt_randomness() will get lost by being overwritten by the other CPU's updates. This means that we will have given two bits' worth of entropy credit, but only one add_interrupt_randomness()'s worth of updates. Given that the single bit of entropy credit was being very conservative to begin with, it's not such a big deal.
However, and this is the case I failed to consider when I made this change almost two years ago, is what happens if you have one updater coming from add_interrupt_randomness(), and another one coming from rngd calling the RNDADDENTROPY ioctl. Now if add_interrupt_randomness() wins the race and all of the entropy pool updates are coming from it, then the entropy credit coming from RNDADDENTROPY will be credited to the pool without the contribution from rngd actually being added to the pool, since those contributions might have gotten overwritten by add_interrupt_randomness().

So I wasn't saying that we could detect the situation; it's just that in cases where we aren't adding any credit anyway, or as in the case of add_interrupt_randomness() calls racing with each other, it's 50-50 which CPU will win the race, and so an adversary which can't predict which CPU's contribution to the entropy pool will actually "win" has to explore both possibilities, and that mitigates the fact that the entropy count might be bumped without the corresponding entropy having been added. But there is the RNDADDENTROPY ioctl case where this reasoning doesn't hold, and that's the case which I'm now worried about.

> Consider the case of a full pool, and a concurrent write and read
> of a full pool's worth of entropy.
>
> If the write wins the race, it overfills the pool, so may not give
> any credit, and the read then empties it.
>
> If the read wins the race, it empties the pool, and then the write
> refills it.

Yes, but again, if we're only adding a single bit's worth of entropy credit, then at worst we'll only be off by one. And if the race is a 50-50 proposition (and I know life is not that simple; for example, it doesn't deal with N-way races), then we might not even be off by one. :-)

One other thing to consider is that we don't really need to care about how efficient RNDADDENTROPY needs to be.
So if necessary, we can put a mutex around that so that we know that there's only a single RNDADDENTROPY being processed at a time. So if we need to play cmpxchg games to make sure the RNDADDENTROPY contribution doesn't get lost, and retry the input mixing multiple times if necessary, that's quite acceptable, so long as we can minimize the overhead on the add_interrupt_randomness() side of the path (and where again, if there is a 50-50 chance that we lose the contribution from add_interrupt_randomness(), that's probably acceptable so long as we don't lose the RNDADDENTROPY contribution). Come to think of it, something that we

> Ah! It turns out that there is a simple technique after all.
> I have three add counters, and due to lack of imagination thinking
> of good descriptive names, I'll call them add1, add2 and add3.
>
> add1-add2 is the amount of outstanding writes in progress. The location
> of the writes is not known, except that it's somewhere between add3 and add1.
>
> The invariant is add3 <= add2 <= add1.
>
> add1 is the ticket lock. A would-be writer checks that it is
> not trying to increment add1 to more than add3+POOLSIZE, and
> if so does an atomic fetch-and-add on add1.
>
> This tells it the range of the pool it's allowed to write to. All good,
> and it does the appropriate update.
>
> When the write completes, atomically bump add2. Then check if add2 ==
> add1. If not, return. If they are equal, set add3 to the shared value,
> collapsing the "writes in progress somewhere in this range" window.

I need to think about this more, but that's clever! And if we end up having to throw away a contribution from add_interrupt_entropy(), that's fine.

Speaking of which, there's an even simpler solution that I've failed to consider. We could simply use a trylock in add_interrupt_randomness(), and if we can't get the lock, decrement fast_pool->count so we try to transfer the entropy from the fast_pool to the entropy pool on the next interrupt.

Doh!
That solves all of the problems, and it's much simpler and easier to reason about.

- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe?
  2014-06-09 15:50 ` Theodore Ts'o
@ 2014-06-09 16:11   ` George Spelvin
  0 siblings, 0 replies; 57+ messages in thread
From: George Spelvin @ 2014-06-09 16:11 UTC (permalink / raw)
To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> Yes, but again, if we're only adding a single bit's worth of entropy
> credit, then at worst we'll only be off by one. And if the race is a
> 50-50 proposition (and I know life is not that simple; for example, it
> doesn't deal with N-way races), then we might not even be off by one. :-)

Indeed.

> One other thing to consider is that we don't really need to care about
> how efficient RNDADDENTROPY needs to be. So if necessary, we can put
> a mutex around that so that we know that there's only a single
> RNDADDENTROPY being processed at a time. So if we need to play
> cmpxchg games to make sure the RNDADDENTROPY contribution doesn't get
> lost, and retry the input mixing multiple times if necessary, that's
> quite acceptable, so long as we can minimize the overhead on the
> add_interrupt_randomness() side of the path (and where again, if there
> is a 50-50 chance that we lose the contribution from
> add_interrupt_randomness(), that's probably acceptable so long as we
> don't lose the RNDADDENTROPY contribution).

Yes, that was something I was thinking: if you can just *detect* the collision, you can always retry the mixing in. But getting the entropy accounting right is tricky.

I completely fail to see how the current RNDADDENTROPY is race-free. There's no attempt at locking between the write_pool() and credit_entropy_bits_safe(). So a reader could sneak in and drain the pool, only to have us credit it back up later. Crediting either before or after the write_pool is unsafe; you have to either wrap mutual exclusion around the whole thing, or do a sort of 2-phase commit.
> Speaking of which, there's an even simpler solution that I've failed
> to consider. We could simply use a trylock in
> add_interrupt_randomness(), and if we can't get the lock, decrement
> fast_pool->count so we try to transfer the entropy from the fast_pool
> to the entropy pool on the next interrupt.
>
> Doh! That solves all of the problems, and it's much simpler and
> easier to reason about.

That's exactly what I was talking about when I wrote (two e-mails ago):

>> While it's easy enough to have a special case for one interrupt handler,
>> there's one per processor that can be trying to write. The obvious way
>> is to do a trylock of some sort from the interrupt handler and hold off
>> spilling the fast_pool if the attempt fails. So there's a lock
>> of a sort, but no spinning on it.

We just used different terminology. I used the more abstract "hold off spilling the fast_pool" and you described a specific implementation technique with "decrement fast_pool->count".

The whole "multiple concurrent writers" thing is probably overkill, but it's fun to figure out how to do some of these things anyway.

^ permalink raw reply	[flat|nested] 57+ messages in thread
* drivers/char/random.c: more ruminations 2014-06-09 15:50 ` Theodore Ts'o 2014-06-09 16:11 ` George Spelvin @ 2014-06-10 0:20 ` George Spelvin 2014-06-10 1:20 ` Theodore Ts'o 1 sibling, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-10 0:20 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

My original reason for writing turned into a never mind...

I came up with a clever and intricate algorithm for preventing a bulk writer from DoSing add_interrupt_randomness via the proposed additional locking on the input_pool.  Then I re-checked the code and saw that only the (trusted) RNDADDENTROPY call is allowed to write to the input_pool.  A plain write(2) is copied to the two output pools.  And writing to the input_pool can never try to take a lock on an output pool; spilling is done in a workqueue.

I'm getting to a dangerous place: I think I'm starting to understand this.  I should probably turn that into a doc patch.

I have an idea for a patch to change _xfer_secondary_pool to use extract_buf rather than extract_entropy; is all that FIPS stuff needed for purely internal transfers?

Also, shouldn't the r->last_pulled holdoff in xfer_secondary_pool be really limited to actual transfers?  I.e. reorder the conditions as:

static void xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
{
	if (!r->pull ||
	    r->entropy_count >= (nbytes << (ENTROPY_SHIFT + 3)) ||
	    r->entropy_count >= r->poolinfo->poolfracbits)
		return;

	if (r->limit == 0 && random_min_urandom_seed) {
		unsigned long now = jiffies;

		if (time_before(now,
				r->last_pulled + random_min_urandom_seed * HZ))
			return;
		r->last_pulled = now;
	}

	_xfer_secondary_pool(r, nbytes);
}

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-10 0:20 ` drivers/char/random.c: more ruminations George Spelvin @ 2014-06-10 1:20 ` Theodore Ts'o 2014-06-10 3:10 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-10 1:20 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Mon, Jun 09, 2014 at 08:20:57PM -0400, George Spelvin wrote:
>
> I have an idea for a patch to change _xfer_secondary_pool
> to use extract_buf rather than extract_entropy; is all that
> FIPS stuff needed for purely internal transfers?

That's not the part of extract_entropy() which is critical.  What's critical is the control over only transferring entropy when there is at least a certain minimum amount of entropy.  This provides the Yarrow-like avalanche property which is required to provide recovery after the internal state of the entropy pools has been exposed.

> Also, shouldn't the r->last_pulled holdoff in xfer_secondary_pool be
> really limited to actual transfers?  I.e. reorder the conditions as...

Yes, that makes sense.

Cheers,

- Ted

^ permalink raw reply [flat|nested] 57+ messages in thread
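The "all or nothing" transfer policy Ted describes (later in the thread he points at account() as its implementation) can be sketched as follows. This is a simplified model of the idea, not the kernel's actual account() code: the function name, the plain byte counter, and the parameter handling are mine.

```c
#include <stddef.h>

/* Sketch of catastrophic-reseed accounting: a transfer request is
 * either filled with at least `min` bytes or not at all, which delays
 * reseeding until a useful amount of entropy has accumulated.
 * `reserved` holds back bytes so /dev/urandom readers cannot drain
 * what is meant for /dev/random.  Illustrative names throughout. */
static size_t account_sketch(size_t *entropy_bytes, size_t nbytes,
			     size_t min, size_t reserved)
{
	size_t avail = *entropy_bytes > reserved
		     ? *entropy_bytes - reserved : 0;

	if (nbytes > avail)
		nbytes = avail;
	if (nbytes < min)	/* all or nothing: delay the reseed */
		return 0;

	*entropy_bytes -= nbytes;
	return nbytes;
}
```

Returning 0 rather than a trickle is what provides the Yarrow-like avalanche property: an attacker who has captured the pool state never sees the pool reseeded a few guessable bits at a time.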
* Re: drivers/char/random.c: more ruminations 2014-06-10 1:20 ` Theodore Ts'o @ 2014-06-10 3:10 ` George Spelvin 2014-06-10 15:25 ` Theodore Ts'o 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-10 3:10 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

>> I have an idea for a patch to change _xfer_secondary_pool
>> to use extract_buf rather than extract_entropy; is all that
>> FIPS stuff needed for purely internal transfers?

> That's not the part of extract_entropy() which is critical.  What's
> critical is the control over only transferring entropy when there is at
> least a certain minimum amount of entropy.  This provides the
> Yarrow-like avalanche property which is required to provide recovery
> after the internal state of the entropy pools have been exposed.

I do understand the importance of catastrophic reseeding, but I don't see that implemented in extract_entropy().

Extract_entropy() itself contains a call to xfer_secondary_pool, but that is a no-op in the case I'm considering (when it's called from _xfer_secondary_pool) because in that case, r is the input_pool, which doesn't have an r->pull pool.

The number of bytes transferred is adjusted by account(), but it's only adjusted downward.  (If it were adjusted up, that would be a buffer overrun.)

Normal reads seem to ask for a reseed equal in size to the read itself, which is "catastrophic" enough.  The only *minimum* reseed transfer size I can see is enforced in _xfer_secondary_pool and push_to_pool, which use random_read_wakeup_bits (default 64) as a minimum reseed.  (Perhaps increase this?)

Now, having said all that, I'll try to avoid embarrassing myself in public again and ask... what am I missing?

On other matters...

If I play around with getting the entropy locking right, do you have a strong preference for the single-lock or two-lock design?  I can't decide which to start with.
The latter is an extensive core code change, but I can easily convince myself there are no deadlock issues.  That's the one I described with separate adder and extractor locks and code paths, two "amount of entropy ever added" variables, and one "amount of entropy ever removed".

The single-lock design is hopefully less code, but (to me, at least) less obviously deadlock-free.

^ permalink raw reply [flat|nested] 57+ messages in thread
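The two-lock, monotonic-counter design described above can be sketched like this. Everything here is hypothetical, my reading of the proposal rather than any code that was posted: derived state ("available entropy") is computed from "ever added" and "ever removed" counters, so the adder and extractor paths never need to share one lock.

```c
#include <pthread.h>

#define POOL_CAP_BITS 4096	/* illustrative pool capacity */

/* Sketch of the two-lock accounting design: mixers serialize on
 * add_lock, extractors on extract_lock, and the current entropy
 * estimate is derived from two monotonic counters rather than stored.
 * A real implementation would need atomic reads of the counters; this
 * sketch ignores that to show the accounting shape. */
struct pool_acct {
	pthread_mutex_t add_lock;	/* serializes entropy mixers */
	pthread_mutex_t extract_lock;	/* serializes extractors */
	unsigned long ever_added;	/* total bits ever credited */
	unsigned long ever_removed;	/* total bits ever extracted */
};

static unsigned long avail(const struct pool_acct *p)
{
	unsigned long a = p->ever_added - p->ever_removed;
	return a > POOL_CAP_BITS ? POOL_CAP_BITS : a;
}

static void credit(struct pool_acct *p, unsigned long bits)
{
	pthread_mutex_lock(&p->add_lock);
	/* Clamp so ever_added never runs more than the pool capacity
	 * ahead of ever_removed; excess entropy is simply lost. */
	{
		unsigned long room = POOL_CAP_BITS - avail(p);
		p->ever_added += bits < room ? bits : room;
	}
	pthread_mutex_unlock(&p->add_lock);
}

static unsigned long debit(struct pool_acct *p, unsigned long bits)
{
	unsigned long got;

	pthread_mutex_lock(&p->extract_lock);
	got = avail(p);
	if (got > bits)
		got = bits;
	p->ever_removed += got;
	pthread_mutex_unlock(&p->extract_lock);
	return got;
}
```

Because each counter only ever grows and is written under exactly one lock, neither path ever takes the other's lock, which is what makes the deadlock argument easy.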
* Re: drivers/char/random.c: more ruminations 2014-06-10 3:10 ` George Spelvin @ 2014-06-10 15:25 ` Theodore Ts'o 2014-06-10 20:40 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-10 15:25 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Mon, Jun 09, 2014 at 11:10:17PM -0400, George Spelvin wrote:
>
> Extract_entropy() itself contains a call to xfer_secondary_pool, but
> that is a no-op in the case I'm considering (when it's called from
> _xfer_secondary_pool) because in that case, r is the input_pool,
> which doesn't have an r->pull pool.
>
> The number of bytes transferred is adjusted by account(), but
> it's only adjusted downward.  (If it were adjusted up, that would be
> a buffer overrun.)

It's the adjustment downward which is what gives us the catastrophic reseeding --- specifically, it's adjusting the number of bytes down to zero until we can transfer enough entropy to make it intractable for the adversary to test the 2**N possible pool states that might correspond to the observed /dev/random output.

In general, the way to mitigate a state compromise scenario is to delay reseeding until we have confidence that there is enough entropy built up that the bad guys can't maintain their model of the entropy pool state by observing some or all of the RNG output.  Adjusting the bytes down to zero is basically a way of delaying reseeding.

As far as the FIPS wankery is concerned, *all* of FIPS is wankery, and most serious practitioners believe that the FIPS requirements have been a net negative for overall security.  ("Do not touch a line of this file, including the comments, else it will cost us hundreds of thousands of dollars in recertification fees" -- Apple, in their SSL library code where the "goto fail" bug was found.)
The only reason why the FIPS code is there is because apparently it's necessary for those US government customers who require FIPS certification of OpenSSL, and /dev/random is an input to OpenSSL.  (It was also added while Matt Mackall was the maintainer; I'm not entirely certain I would have accepted it, because I think FIPS is actively harmful; OTOH, I was working for IBM at the time, and IBM does make money from selling to customers who require FIPS, so maybe I would have.)

I suspect that the FIPS check is not necessary for intra-pool transfers, but quite frankly, I can't be bothered to care about improving the efficiency of systems put into FIPS mode.  And there is a possibility that changing it might break the FIPS certification.  Not that I care either way, but I'm just not motivated to verify that a change made to improve FIPS efficiency doesn't break security for the use cases that I actually care about. :-/

> If I play around with getting the entropy locking right, do you
> have a strong preference for the single-lock or two-lock design?
> I can't decide which to start with.
>
> The latter is an extensive core code change, but I can easily convince
> myself there are no deadlock issues.  That's the one I described
> with separate adder and extractor locks and code paths, two "amount
> of entropy ever added" variables, and one "amount of entropy ever removed".
>
> The single-lock design is hopefully less code, but (to me, at least) less
> obviously deadlock-free.

Since as near as I can tell, there isn't really a real performance issue here, what's most important to me is that (a) the changes are easily auditable, and (b) the resulting code is easily auditable and Obviously Correct.

The null hypothesis that any change would have to compete against is adding a trylock to add_interrupt_randomness(), since the change is small, and obviously not going to make things worse.
If we can do better than that in terms of the "the commit is easy to read/audit" and "the resulting code is easy to read/audit", then great. But otherwise, it seems that the "add a trylock" is the simplest way to make things better. Does that make sense? - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-10 15:25 ` Theodore Ts'o @ 2014-06-10 20:40 ` George Spelvin 2014-06-10 21:20 ` Theodore Ts'o 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-10 20:40 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> It's the adjustment downward which is what gives us the catastrophic
> reseeding --- specifically, it's adjusting the number of bytes down to
> zero until we can transfer enough entropy to make it intractable for
> the adversary to test the 2**N possible pool states that might
> correspond to the observed /dev/random output.

Er, yes, I understand that... as I said, I understand the purpose and importance of catastrophic reseeding.

Look, maybe it's easier to see in the draft patch appended.  I added a comment pointing out the catastrophic reseeding.  (Your opinions on the issues labelled FIXME in the comments are definitely solicited.)

An obvious follow-up patch is to delete the last two arguments from extract_entropy(), since they're always zero in all remaining calls.

> I suspect that the FIPS check is not necessary for intra-pool
> transfers, but quite frankly, I can't be bothered to care about
> improving the efficiency of systems put into FIPS mode.  And there is
> a possibility that changing it might break the FIPS certification.
> Not that I care either way, but I'm just not motivated to verify that
> any change to improve FIPS efficiency might not break security for the
> use cases that I actually care about. :-/

I don't care about the efficiency either; I just wanted to avoid the stack usage of a "u32 tmp[OUTPUT_POOL_WORDS]" when the actual extraction is done in EXTRACT_SIZE chunks anyway.  (This stuff is called from fs code in some cases, although I haven't checked if it's on the page-out path that's always the stack smasher.)

My question was "do I have to preserve that crap when it would be easier to dike it out?"  Look at the example patch below.
(If I did have to preserve it, then I'd move the check into extract_buf.)

> The null hypothesis that any change would have to compete against is
> adding a trylock to add_interrupt_randomness(), since the change is
> small, and obviously not going to make things worse.

Er, you seem to underestimate the changes.  It also involves moving the existing locks outward to encompass entropy accounting in many places in the code (after which the cmpxchg in credit_entropy is no longer needed and may be deleted).

> Does that make sense?

The goals make perfect sense, I'm just saying that "add a trylock" is not a literal implementation guide; there are a lot of consequent changes that are being glossed over.  I really think the result will be much clearer, but I'll know for sure when I get there.

Draft patch to reduce stack usage in _xfer_secondary_pool:

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 102c50d3..5bbe5167 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -910,8 +910,9 @@ void add_disk_randomness(struct gendisk *disk)
  *
  *********************************************************************/
 
-static ssize_t extract_entropy(struct entropy_store *r, void *buf,
-			       size_t nbytes, int min, int rsvd);
+static size_t account(struct entropy_store *r, size_t nbytes, int min,
+		      int reserved);
+static void extract_buf(struct entropy_store *r, u8 out[EXTRACT_SIZE]);
 
 /*
  * This utility inline function is responsible for transferring entropy
@@ -937,23 +938,51 @@ static void xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
 
 static void _xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
 {
-	__u32 tmp[OUTPUT_POOL_WORDS];
-
-	/* For /dev/random's pool, always leave two wakeups' worth */
-	int rsvd_bytes = r->limit ? 0 : random_read_wakeup_bits / 4;
+	u8 tmp[EXTRACT_SIZE];
 	int bytes = nbytes;
 
 	/* pull at least as much as a wakeup */
 	bytes = max_t(int, bytes, random_read_wakeup_bits / 8);
 	/* but never more than the buffer size */
-	bytes = min_t(int, bytes, sizeof(tmp));
+	bytes = min_t(int, bytes, OUTPUT_POOL_WORDS*sizeof(u32));
 
+	/*
+	 * FIXME: Move this to after account(), so it shows the true amount
+	 * transferred?
+	 */
 	trace_xfer_secondary_pool(r->name, bytes * 8, nbytes * 8,
 				  ENTROPY_BITS(r), ENTROPY_BITS(r->pull));
-	bytes = extract_entropy(r->pull, tmp, bytes,
-				random_read_wakeup_bits / 8, rsvd_bytes);
-	mix_pool_bytes(r, tmp, bytes, NULL);
-	credit_entropy_bits(r, bytes*8);
+
+	/*
+	 * This is the only place we call account() with non-zero
+	 * "min" and "reserved" values.  The minimum is used to
+	 * enforce catastrophic reseeding: if we can't get at least
+	 * random_read_wakeup_bits of entropy, don't bother reseeding
+	 * at all, but wait until a useful amount is available.
+	 *
+	 * The "reserved" is used to prevent reads from /dev/urandom
+	 * from emptying the input pool; leave two wakeups' worth
+	 * for /dev/random.
+	 */
+	bytes = account(r->pull, bytes, random_read_wakeup_bits / 8,
+			r->limit ? 0 : random_read_wakeup_bits / 4);
+
+	/* Now to the actual transfer, in EXTRACT_SIZE units */
+	while (bytes) {
+		int i = min_t(int, bytes, EXTRACT_SIZE);
+
+		extract_buf(r->pull, tmp);
+		mix_pool_bytes(r, tmp, i, NULL);
+		credit_entropy_bits(r, i*8);
+		bytes -= i;
+	}
+
+	/*
+	 * Wipe data just returned from memory.
+	 * FIXME: gcc understands memset and will probably optimize
+	 * this out.  Use OpenBSD-style explicit_bzero()?
+	 */
+	memset(tmp, 0, sizeof(tmp));
 }
 
 /*
@@ -1019,7 +1048,7 @@ retry:
  *
  * Note: we assume that .poolwords is a multiple of 16 words.
  */
-static void extract_buf(struct entropy_store *r, __u8 *out)
+static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 {
 	int i;
 	union {

^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-10 20:40 ` George Spelvin @ 2014-06-10 21:20 ` Theodore Ts'o 2014-06-11 0:10 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-10 21:20 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Tue, Jun 10, 2014 at 04:40:28PM -0400, George Spelvin wrote:
>
> Look, maybe it's easier to see in the draft patch appended.
> I added a comment pointing out the catastrophic reseeding.

Yes, I see what you're trying to do.  What I'd suggest is that you do this in a series of small patches.  Each patch should solve one problem at a time, so it's easy to audit each one.

For example, adding a trylock to add_interrupt_randomness() should do that, and ***only*** that.  This fixes the case where add_interrupt_randomness() races with RNDADDENTROPY's call to write_pool().

Yes, there are other potential problems, especially relating to whether we need to atomically update the entropy credit and the input pool.  And actually, it doesn't matter, since the input pool has its limit flag set, and so we never extract more entropy than the entropy counter allows.  So if we add the entropy, it's impossible that it will get used before we update the entropy counter.  This is why changes should be separate commits, so we can review each one of them separately.

Similarly, your comment about memset vs. explicit_bzero is a valid one, but that's a problem that should be fixed in a separate commit.  And yes, that's a real problem --- I've confirmed this using gcc -S, and it's not just a problem in drivers/char/random.c, but also in a number of use cases under crypto/*.c.  But that's a problem that should be fixed separately, and to be honest, I'd actually consider this problem more serious than some of the other issues we've talked about in this thread.

This is also why I objected to your change to use the stale stack contents for the input array.
That does something different from the rest of the changes in the patch.  I'm sorry if I keep harping on this, but this is critically important.  It also means that we can accept the small, obviously correct changes, while we discuss the larger, more invasive changes.  It also makes it easier to backport the smaller and more important changes to stable kernels.

> I don't care about the efficiency either; I just wanted to avoid the
> stack usage of a "u32 tmp[OUTPUT_POOL_WORDS]" when the actual extraction
> is done in EXTRACT_SIZE chunks anyway.

Huh?  The FIPS wankery requires:

	__u8 tmp[10];

not

	u32 tmp[OUTPUT_POOL_WORDS];

and when you replace the former with the latter but still try to move OUTPUT_POOL_WORDS*sizeof(u32) bytes at a time, it looks like you'll end up overflowing the stack: the xfer_secondary_pool code does use OUTPUT_POOL_WORDS*sizeof(u32), but only declares tmp as a 10-byte char array.  (Have you actually tested your draft patch?)

In any case, this is another reason why I really request people to send a series of git commits, where each one is small, easy to read, and Obviously Correct.  When you try to make multiple changes in a single commit, it makes things harder to review, and that opens up "goto fail" sorts of errors.

(Well, hopefully not, because I spend a lot of time very carefully scrutinizing patches, and if I don't have time, I won't apply the patch until I do have time.  So put another way, if you want to increase the likelihood that I'll process your patches quickly, it's to your advantage to keep each separate commit small and obviously correct(tm).  Yes, this means that sometimes when you start making changes, and run into other cleanups, you may need to use commands like "git stash", create a separate commit that does just the cleanup, and then "git stash pop" to resume work --- or use quilt or guilt if that's more convenient for you, but small patches really, REALLY, help the review process.)
> > The null hypothesis that any change would have to compete against is > > adding a trylock to add_interrupt_randomness(), since the change is > > small, and obviously not going to make things worse. > > Er, you seem to underestimate the changes. It also involves moving the > existing locks outward to encompass entropy accounting in many places in > the code (after which the cmpxchg in credit_entropy is no longer needed > and may be deleted). Sure, but you can put the deletion of the cmpxchg and the out[] array in separate commits, where the tree remains building and secure at each step. I generally have the biggest problems with academics who want modify the random number generator, where they dump a several thousand line diff on my porch, and then wonder why I don't just blindly apply it.... - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
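Ted's memset-elision point above (confirmed with gcc -S) is worth a concrete illustration. One common technique for a wipe the compiler cannot elide, used in spirit by OpenBSD's explicit_bzero() and later by the kernel's memzero_explicit(), is to follow the memset with a compiler barrier that claims to read memory. This is a sketch; the barrier syntax shown is gcc/clang specific, and the function name is mine.

```c
#include <string.h>
#include <stddef.h>

/* Wipe a buffer in a way the optimizer cannot prove is dead.  The
 * empty asm with a "memory" clobber tells the compiler that p's
 * contents may be observed afterward, so the memset must be kept. */
static void explicit_bzero_sketch(void *p, size_t n)
{
	memset(p, 0, n);
	__asm__ __volatile__("" : : "r"(p) : "memory");
}
```

Without the barrier, a memset of a local buffer immediately before return is exactly the pattern dead-store elimination removes, which is why the fix has to be a separate, auditable primitive rather than scattered memset calls.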
* Re: drivers/char/random.c: more ruminations 2014-06-10 21:20 ` Theodore Ts'o @ 2014-06-11 0:10 ` George Spelvin 2014-06-11 2:08 ` Theodore Ts'o 2014-06-11 2:21 ` Theodore Ts'o 0 siblings, 2 replies; 57+ messages in thread From: George Spelvin @ 2014-06-11 0:10 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

>> I don't care about the efficiency either; I just wanted to avoid the
>> stack usage of a "u32 tmp[OUTPUT_POOL_WORDS]" when the actual extraction
>> is done in EXTRACT_SIZE chunks anyway.

> Huh?  The FIPS wankery requires:
>
> 	__u8 tmp[10];
>
> not
>
> 	u32 tmp[OUTPUT_POOL_WORDS];

Sorry, we're still not communicating right.  You seem to think I'm trying to optimize the FIPS code, and nothing could be further from the truth.  Please consider the merits of my patch as if the FIPS stuff was #ifdefed out.  I don't care about it, I'm not trying to optimize it, it was just *in my way* when I was trying to do something else, and I only had to ask about it because it's a legal/regulatory requirement rather than a technical one, so I couldn't understand it by logic.

What I wanted to do was eliminate that huge tmp buffer from _xfer_secondary_pool.  There's no good reason why it needs to be there, and several reasons for getting rid of it.  A big one for me is auditability, a.k.a. the current code confused me when I was trying to understand it.  Based on static code analysis, extract_entropy and xfer_secondary_pool appear to be recursive; the fact that they're not in practice takes more insight.
extract_entropy_user(&nonblocking_pool)
  -> xfer_secondary_pool(&nonblocking_pool)
    -> _xfer_secondary_pool(&nonblocking_pool)
      -> extract_entropy(&input_pool)
        -> xfer_secondary_pool(&input_pool)
           (recursion terminates because input_pool.pull == NULL)

> and when you replace the former with the latter, but still try to move
>
> The xfer_secondary_pool code does use OUTPUT_POOL_WORDS*sizeof(u32), but
> only declares tmp as a 10-byte char array, it looks like you'll end up
> overflowing the stack (have you actually tested your draft patch?)

Please look again; it doesn't overflow.  I hadn't tested the patch when I mailed it to you (I prepared it in order to reply to your e-mail, and it's annoying to reboot the machine I'm composing an e-mail on), but I have since.  It works.

> This is also why I objected to your change to use the stale stack
> contents for the input array.  That does something different from the
> rest of the changes in the patch.  I'm sorry if I keep harping on this,
> but this is critically important.  It also means that we can accept
> the small, obviously correct changes, while we discuss the larger,
> more invasive changes.  It also makes it easier to backport the
> smaller and more important changes to stable kernels.
[...]
> In any case, this is another reason why I really request people to
> send a series of git commits, where each one is small, easy to read,
> and Obviously Correct.
[...]
> small patches really, REALLY, help the review process.)

I'm very aware of this.  As I've said before on the git mailing list, the reason that git bisect is such a killer feature is due to an even more important, but also more subtle, killer feature of git: it makes it practical to use lots of small commits, so it can point you at a very small change.  If you had CVS-style weeks-in-preparation-and-testing "code drop" commits, it wouldn't be nearly as useful.
Indeed, if you look at some of the 10-patch series I've posted recently (I'm on a "clear my tree of private patches" binge), I think I'm doing fairly well.  If anything, I worry about erring on the side of too minor.  Does this meet with your approval: http://www.spinics.net/lists/linux-media/msg76435.html

But even I get annoyed when I have a 1-line comment typo fix and wonder if it really deserves its own commit or if I can just include it with the other changes I'm making to that file.

Anyway, please stop wearing out your fingers and consider the point made.  I will do my utmost to divide things as finely as possible.  The one problem is that too small a change can be Obviously Correct (to the extent that the overall code complexity allows *anything* to be obvious), but not Obviously Useful.  Still, combining patches is a whole lot less effort than splitting them, so I'll err far on that side.

> Sure, but you can put the deletion of the cmpxchg and the out[] array
> in separate commits, where the tree remains building and secure at
> each step.

Definitely!  I'm going to produce as long a patch series as I possibly can.  But there's always an overall goal to the entire series, and that's what I'm trying to discuss beforehand.

Looking back, I understand now how my words could lead to confusion.  To me, the race in add_interrupt_randomness is fairly inconsequential and fixing it is a minor side effect of my overall goal of fixing *all* the races.

The *fundamental* race, as I see it, is the one between modifying pools and crediting entropy.  As I noted, you can't safely do the credit either before *or* after modifying the pool; you will always end up with the wrong answer in some situation.  This is why the RNDADDENTROPY ioctl is required and the RNDADDTOENTCNT ioctl is old legacy cruft that should never be used.  (Note to self: add a warning if it's ever used, or even disable it entirely.)
When I said "it's a lot bigger job than that" I was thinking about *my* agenda for the entire project, when I had just asked "which path do you think I should take?".  I wasn't trying to imply it would be one patch.

I have half a dozen patches to random.c already waiting.  For example, one is a bulk conversion of __u8 and __u32 to u8 and u32.  The underscore versions are only intended for public header files where namespace pollution is a problem.  But I have to think about ordering and how to present them; a patch like that is annoying to reorder with others.

(Honestly, my personal preference is for the longer names from <stdint.h> just because they *are* standard and so very widely used.  But that's an issue for a maintainer's esthetic judgement.)

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-11 0:10 ` George Spelvin @ 2014-06-11 2:08 ` Theodore Ts'o 2014-06-11 3:58 ` George Spelvin 2014-06-11 4:34 ` George Spelvin 2014-06-11 2:21 ` Theodore Ts'o 1 sibling, 2 replies; 57+ messages in thread From: Theodore Ts'o @ 2014-06-11 2:08 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Tue, Jun 10, 2014 at 08:10:03PM -0400, George Spelvin wrote:
> What I wanted to do was eliminate that huge tmp buffer from
> _xfer_secondary_pool.  There's no good reason why it needs to be there,
> and several reasons for getting rid of it.

So have you actually instrumented the kernel to demonstrate that in fact we have super deep stack call paths where the 128 bytes worth of stack actually matters?

Premature optimization being the root of all evil (not to mention wasting a lot of time of kernel developers) and all that....

> I hadn't tested the patch when I mailed it to you (I prepared it in
> order to reply to your e-mail, and it's annoying to reboot the machine
> I'm composing an e-mail on), but I have since.  It works.

As an aside, I'd strongly suggest that you use kvm to do your kernel testing.  It means you can do a lot more testing which is always a good thing....

> The *fundamental* race, as I see it, is the one between modifying pools
> and crediting entropy.
>
> As I noted, you can't safely do the credit either before *or* after modifying
> the pool; you will always end up with the wrong answer in some situation.

Actually, it's **fine**.  That's because RNDADDENTROPY adds the entropy to the input pool, which has the limit flag set.  So we will never pull more entropy than the pool is credited as having.  This means that race can't happen.  It ***is*** safe.

1) Assume the entropy count starts at 10 bytes.

2) Random writer mixes in 20 bytes of entropy into the entropy pool.

3) Random extractor tries to extract 32 bytes of entropy.  Since the entropy count is still 10, it will only get 10 bytes.  (And if the entropy count had started at zero, we wouldn't extract any entropy at all.)

4) Random writer credits the entropy counter with the 20 bytes mixed in step #2.

See?  No problems!

- Ted

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-11 2:08 ` Theodore Ts'o @ 2014-06-11 3:58 ` George Spelvin 2014-06-11 13:11 ` Theodore Ts'o 2014-06-11 4:34 ` George Spelvin 1 sibling, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-11 3:58 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> Actually, it's **fine**.  That's because RNDADDENTROPY adds the
> entropy to the input pool, which has the limit flag set.  So we
> will never pull more entropy than the pool is credited as having.
> This means that race can't happen.  It ***is*** safe.
>
> 1) Assume the entropy count starts at 10 bytes.
>
> 2) Random writer mixes in 20 bytes of entropy into the entropy pool.
>
> 3) Random extractor tries to extract 32 bytes of entropy.  Since the
> entropy count is still 10, it will only get 10 bytes.  (And if the
> entropy count had started at zero, we wouldn't extract any entropy
> at all.)
>
> 4) Random writer credits the entropy counter with the 20 bytes mixed in
> step #2.
>
> See?  No problems!

You can forbid underflows, but the code doesn't forbid overflows.

1.  Assume the entropy count starts at 512 bytes (input pool full).
2.  Random writer mixes in 20 bytes of entropy into the input pool.
2a. Input pool entropy is, however, capped at 512 bytes.
3.  Random extractor extracts 32 bytes of entropy from the pool.
    Succeeds because 32 < 512.  Pool is left with 480 bytes of entropy.
3a. Random extractor decrements pool entropy estimate to 480 bytes.
    This is accurate.
4.  Random writer credits pool with 20 bytes of entropy.
5.  Input pool entropy is now 480 bytes, estimate is 500 bytes.

Problem, no?

^ permalink raw reply [flat|nested] 57+ messages in thread
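The five-step overflow scenario above can be replayed as a deterministic sketch. The struct and function names are illustrative, not kernel code; `actual` models the entropy physically in the pool and `estimate` models the accounting, with the cap applied at mix time (step 2a) but the credit applied late (step 4).

```c
#define POOL_CAP 512	/* input pool capacity, in bytes, as in the example */

/* Minimal model of the credit-after-mix race: because the pool was
 * already at capacity when the mix happened, the late credit leaves
 * the estimate above the entropy actually present. */
struct pool_model {
	unsigned actual;	/* entropy really in the pool */
	unsigned estimate;	/* what the accounting believes */
};

static void mix_bytes(struct pool_model *p, unsigned bytes)
{
	p->actual += bytes;
	if (p->actual > POOL_CAP)	/* step 2a: physical cap */
		p->actual = POOL_CAP;
}

static unsigned extract_bytes(struct pool_model *p, unsigned bytes)
{
	if (bytes > p->estimate)
		bytes = p->estimate;
	p->actual -= bytes;		/* step 3 */
	p->estimate -= bytes;		/* step 3a */
	return bytes;
}

static void credit_bytes(struct pool_model *p, unsigned bytes)
{
	p->estimate += bytes;		/* step 4: the late credit */
	if (p->estimate > POOL_CAP)
		p->estimate = POOL_CAP;
}
```

Run in the order of the example (mix, extract, then credit), the model ends with `actual == 480` but `estimate == 500`: the 20 credited bytes were already lost to the cap at step 2a, so the estimate overstates the pool by exactly that amount.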
* Re: drivers/char/random.c: more ruminations 2014-06-11 3:58 ` George Spelvin @ 2014-06-11 13:11 ` Theodore Ts'o 2014-06-12 0:42 ` George Spelvin 2014-06-12 1:03 ` H. Peter Anvin 0 siblings, 2 replies; 57+ messages in thread From: Theodore Ts'o @ 2014-06-11 13:11 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Tue, Jun 10, 2014 at 11:58:06PM -0400, George Spelvin wrote:
> You can forbid underflows, but the code doesn't forbid overflows.
>
> 1.  Assume the entropy count starts at 512 bytes (input pool full).
> 2.  Random writer mixes in 20 bytes of entropy into the input pool.
> 2a. Input pool entropy is, however, capped at 512 bytes.
> 3.  Random extractor extracts 32 bytes of entropy from the pool.
>     Succeeds because 32 < 512.  Pool is left with 480 bytes of entropy.
> 3a. Random extractor decrements pool entropy estimate to 480 bytes.
>     This is accurate.
> 4.  Random writer credits pool with 20 bytes of entropy.
> 5.  Input pool entropy is now 480 bytes, estimate is 500 bytes.

Good point, that's a potential problem, although messing up the accounting between 480 and 500 bytes is not nearly as bad as messing up 0 and 20.

It's not something where, if the fix required massive changes, I'd necessarily feel the need to backport them to stable.  It's a certificational weakness, but it's not a disaster.

- Ted

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-11 13:11 ` Theodore Ts'o @ 2014-06-12 0:42 ` George Spelvin 2014-06-12 1:03 ` H. Peter Anvin 1 sibling, 0 replies; 57+ messages in thread From: George Spelvin @ 2014-06-12 0:42 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> It's not something where, if the fix required massive changes, I'd
> necessarily feel the need to backport them to stable.  It's a
> certificational weakness, but it's not a disaster.

Agreed!  It's been there for years, and I'm not too worried.  It takes a pretty tight race to cause the problem in the first place.  As you note, it only happens with a full pool (already a very secure situation), and the magnitude is limited by the size of entropy additions, which are normally small.

I'm just never happy with bugs in security-critical code.  "I don't think that bug is exploitable" is almost as ominous a phrase as "Y'all watch this!"

^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-11 13:11 ` Theodore Ts'o 2014-06-12 0:42 ` George Spelvin @ 2014-06-12 1:03 ` H. Peter Anvin 1 sibling, 0 replies; 57+ messages in thread From: H. Peter Anvin @ 2014-06-12 1:03 UTC (permalink / raw) To: Theodore Ts'o, George Spelvin, linux-kernel, mingo, price On 06/11/2014 06:11 AM, Theodore Ts'o wrote: > On Tue, Jun 10, 2014 at 11:58:06PM -0400, George Spelvin wrote: >> You can forbid underflows, but the code doesn't forbid overflows. >> >> 1. Assume the entropy count starts at 512 bytes (input pool full) >> 2. Random writer mixes in 20 bytes of entropy into the input pool. >> 2a. Input pool entropy is, however, capped at 512 bytes. >> 3. Random extractor extracts 32 bytes of entropy from the pool. >> Succeeds because 32 < 512. Pool is left with 480 bytes of >> entropy. >> 3a. Random extractor decrements pool entropy estimate to 480 bytes. >> This is accurate. >> 4. Random writer credits pool with 20 bytes of entropy. >> 5. Input pool entropy is now 480 bytes, estimate is 500 bytes. > > Good point, that's a potential problem, although messing up the > accounting between 480 and 500 bytes is not nearly as bad as messing > up 0 and 20. > > It's not the sort of thing where, if the fix required massive changes, > I'd necessarily feel the need to backport them to stable. It's a > certificational weakness, but it's not a disaster. > Actually, with the new accounting code it will be even less serious, because mixing into a nearly full pool is discounted heavily: it is not like filling a queue; the mixing function will probabilistically overwrite existing pool entropy. So it is still a race condition, and still wrong, but it is a lot less wrong. -hpa ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-11 2:08 ` Theodore Ts'o 2014-06-11 3:58 ` George Spelvin @ 2014-06-11 4:34 ` George Spelvin 2014-06-11 13:09 ` Theodore Ts'o 1 sibling, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-11 4:34 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price > So have you actually instrumented the kernel to demonstrate that in > fact we have super deep stack call paths where the 128 bytes worth of > stack actually matters? I haven't got a specific call chain where 128 bytes pushes it over a limit. But kernel stack usage is a perennial problem. Wasn't there some discussion about that just recently? 6538b8ea8: "x86_64: expand kernel stack to 16K" I agree a 128 byte stack frame is not one of the worst offenders, but it's enough to try to clean up if possible. You can search LKML for a bunch of discussion of 176 bytes in __alloc_pages_slowpath(). And in this case, it's so *easy*. extract_buf() works 10 bytes at a time anyway, and _mix_pool_bytes works a byte at a time. >> I hadn't tested the patch when I mailed it to you (I prepared it in >> order to reply to your e-mail, and it's annoying to reboot the machine >> I'm composing an e-mail on), but I have since. It works. > As an aside, I'd strongly suggest that you use kvm to do your kernel > testing. It means you can do a lot more testing which is always a > good thing.... H'mmm. I need to learn what KVM *is*. Apparently there's a second meaning other than "keyboard, video & mouse". :-) Normally, I just test using modules. Especially when working on a driver for a hardware device, virtualization makes life difficult. But /dev/random is (for good reasons) not modularizable. (I can see how it'd be useful for filesystem development, however.) ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-11 4:34 ` George Spelvin @ 2014-06-11 13:09 ` Theodore Ts'o 0 siblings, 0 replies; 57+ messages in thread From: Theodore Ts'o @ 2014-06-11 13:09 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price On Wed, Jun 11, 2014 at 12:34:16AM -0400, George Spelvin wrote: > > I haven't got a specific call chain where 128 bytes pushes it > over a limit. But kernel stack usage is a perennial problem. > Wasn't there some discussion about that just recenty? > 6538b8ea8: "x86_64: expand kernel stack to 16K" Yes, but that was a call path involving file systems writepage/writepages, block devices, and the writeback code paths. None of which need random numbers... > Normally, I just test using modules. Especially when working on a > driver for a hardware device, virtualization makes life difficult. > But /dev/random is (for good reasons) not modularizable. Using virtualization is much faster because you don't have to reboot your system. And if you want to test what happens to the random driver at boot time, again, booting a guest kernel under KVM is much faster, especially if you screw up and the system locks up.... Sure, if you are testing an actual hardware device, it's harder to use virtualization. But if you are doing any kind of core kernel work (and /dev/random counts as core kernel), virtualization is really convenient. Cheers, - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: more ruminations 2014-06-11 0:10 ` George Spelvin 2014-06-11 2:08 ` Theodore Ts'o @ 2014-06-11 2:21 ` Theodore Ts'o 1 sibling, 0 replies; 57+ messages in thread From: Theodore Ts'o @ 2014-06-11 2:21 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price On Tue, Jun 10, 2014 at 08:10:03PM -0400, George Spelvin wrote: > > But even I get annoyed when I have a 1-line comment typo fix and wonder > if it really deserves its own commit or if I can just include it with > the other changes I'm making to that file. Unless you're actually modifying that section of code, I usually recommend that people just not bother. The fact that someone included a one-line comment fix caused a merge conflict with the ext4 pull in this merge window that Linus had to fix up. Not that it's a big deal, but unless it's something that's really going to confuse the reader, I treat it as white space fixes; something that's only worth fixing if that particular function is being modified for a "real" change. > I have half a dozen patches to random.c already waiting. For example, > one is a bulk conversion of __u8 and __u32 to u8 and u32. The underscore > versions are only intended for public header files where namespace > pollution is a problem. And sorry, I consider this sort of thing is to be just code wankery. We use __u32 in a number of places in the kernel, and it doesn't really affect code readability. A change like this almost guarantees that stable patches won't apply manually, and will have to be ported manually. It's just not worth it. - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* drivers/char/random.c: More futzing about 2014-06-09 1:35 ` Theodore Ts'o 2014-06-09 2:10 ` George Spelvin @ 2014-06-09 13:17 ` George Spelvin 2014-06-11 16:38 ` Theodore Ts'o 1 sibling, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-09 13:17 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price Just as an example of some more ambitious changes I'm playing with... I really think the polynomial + twist has outlived its usefulness. In particular, table lookups in infrequently accessed code are cold D-cache misses and are undesirable. And the input_rotate is an ugly kludge to compensate for inadequate mixing. Here's an example of a smaller, faster, and better fast_mix() function. The mix is invertible (thus preserving entropy), but causes each input bit or pair of bits to avalanche to at least 43 bits after 2 rounds and 120 bits after 3. For comparison, with the current linear fast_mix function, bits above the 12th in the data words only affect 4 other bits after one repetition. With 3 iterations, it runs in 2/3 the time of the current fast_mix and is half the size: 84 bytes of object code as opposed to 168. void fast_mix2(struct fast_pool *f, u32 const input[4]) { u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1]; u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3]; int i; for (i = 0; i < 3; i++) { /* * Inspired by ChaCha's QuarterRound, but * modified for much greater parallelism. * Surprisingly, rotating a and c seems to work * better than b and d. And it runs faster. 
*/ a += b; c += d; d ^= a; b ^= c; a = rol32(a, 15); c = rol32(c, 21); a += b; c += d; d ^= a; b ^= c; a = rol32(a, 3); c = rol32(c, 7); } f->pool[0] = a; f->pool[1] = b; f->pool[2] = c; f->pool[3] = d; f->count++; } Other good rotate sets: score = 43/120/121: 23 6 18 11 score = 42/120/123: 17 15 4 24 score = 42/120/122: 4 7 19 17 score = 43/122/122: 24 6 16 12 score = 42/121/122: 25 22 11 8 score = 42/122/121: 16 20 11 23 score = 43/120/122: 8 11 17 19 score = 46/121/123: 15 21 3 7 score = 43/120/122: 7 11 15 21 score = 42/120/122: 7 10 17 13 score = 42/120/123: 11 3 18 23 score = 44/121/122: 20 10 26 24 score = 42/123/122: 10 5 18 21 score = 44/120/122: 26 21 7 9 score = 42/121/122: 21 11 7 8 score = 42/120/122: 11 11 27 19 score = 42/121/122: 6 18 4 27 score = 42/121/122: 13 23 3 4 If you want smaller code, or more flexibility in the number of rounds, try (24 5 24 5) or (27 8 27 8). The avalanche is only slightly worse, but unrolling twice shaves a smidgen off the run time, so the extra 2 rotate constants come for free. If you want something linear, there are better ways to get better mixing. Here's one based on a period-2^128-1 xorshift generator: void fast_mix3(struct fast_pool *f, u32 const input[4]) { u32 a = f->pool[0] ^ input[0]; u32 b = f->pool[1] ^ input[1]; u32 c = f->pool[2] ^ input[2]; u32 d = f->pool[3] ^ input[3]; /* * See Marsaglia, "Xorshift RNGs". Possible shift amounts * are [5, 14, 1], [15, 4, 21], [23, 24, 3], [5, 12, 29]. */ a ^= a<<15; a ^= a>>4 ^ d ^ d>>21; f->pool[0] = a; b ^= b<<15; b ^= b>>4 ^ a ^ a>>21; f->pool[1] = b; c ^= c<<15; c ^= c>>4 ^ b ^ b>>21; f->pool[2] = c; d ^= d<<15; d ^= d>>4 ^ c ^ c>>21; f->pool[3] = d; f->count++; } ... However this is slower than 2 iterations of fast_mix2, and doesn't avalanche as much. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: More futzing about 2014-06-09 13:17 ` drivers/char/random.c: More futzing about George Spelvin @ 2014-06-11 16:38 ` Theodore Ts'o 2014-06-11 16:48 ` H. Peter Anvin ` (2 more replies) 0 siblings, 3 replies; 57+ messages in thread From: Theodore Ts'o @ 2014-06-11 16:38 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price On Mon, Jun 09, 2014 at 09:17:38AM -0400, George Spelvin wrote: > Here's an example of a smaller, faster, and better fast_mix() function. > The mix is invertible (thus preserving entropy), but causes each input > bit or pair of bits to avalanche to at least 43 bits after 2 rounds and > 120 bits after 3. I've been looking at your fast_mix2(), and it does look interesting. > For comparison, with the current linear fast_mix function, bits above > the 12th in the data words only affect 4 other bits after one repetition. > > With 3 iterations, it runs in 2/3 the time of the current fast_mix > and is half the size: 84 bytes of object code as opposed to 168. ... but how did you measure the "2/3 the time"? I've done some measurements, timing calls to fast_mix() and fast_mix2() N times and dividing by N (where N needs to be quite large). Using that metric, fast_mix2() takes seven times as long to run. If I only run the two mixing functions once, and use RDTSC to measure the time, fast_mix2() takes only three times as long. (That's because the memory cache effects are much less, which favors fast_mix2). But either way, fast_mix2() is slower than the current fast_mix(), and using the measurements that are as advantageous (and most realistic) as I could come up with, it's still three times slower. My measurements were done using an Intel 2.8 GHz quad-core i7-4900MQ CPU. - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: More futzing about 2014-06-11 16:38 ` Theodore Ts'o @ 2014-06-11 16:48 ` H. Peter Anvin 2014-06-11 19:25 ` Theodore Ts'o 2014-06-12 0:32 ` George Spelvin 2014-06-12 3:43 ` drivers/char/random.c: More futzing about George Spelvin 2 siblings, 1 reply; 57+ messages in thread From: H. Peter Anvin @ 2014-06-11 16:48 UTC (permalink / raw) To: Theodore Ts'o, George Spelvin, linux-kernel, mingo, price On 06/11/2014 09:38 AM, Theodore Ts'o wrote: > On Mon, Jun 09, 2014 at 09:17:38AM -0400, George Spelvin wrote: >> Here's an example of a smaller, faster, and better fast_mix() function. >> The mix is invertible (thus preserving entropy), but causes each input >> bit or pair of bits to avalanche to at least 43 bits after 2 rounds and >> 120 bit0 after 3. > > I've been looking at your fast_mix2(), and it does look interesting. > >> For comparison, with the current linear fast_mix function, bits above >> the 12th in the data words only affect 4 other bits after one repetition. >> >> With 3 iterations, it runs in 2/3 the time of the current fast_mix >> and is half the size: 84 bytes of object code as opposed to 168. > > ... but how did you measure the "2/3 the time"? I've done some > measurements, using both "time calling fast_mix() and fast_mix2() N > times and divide by N (where N needs to be quite large). Using that > metric, fast_mix2() takes seven times as long to run. > > If I only run the two mixing functions once, and use RDTSC to measure > the time, fast_mix2() takes only three times as long. (That's because > the memory cache effects are much less, which favors fast_mix2). > > But either way, fast_mix2() is slower than the current fast_mix(), and > using the measurements that are as advantageous (and most realistic) > that I could come up with, it's still three times slower. > > My measurements were done using Intel 2.8 GHz quad-core i7-4900MQ CPU. 
> While talking about performance, I did a quick prototype of random using Skein instead of SHA-1, and it was measurably faster, in part because Skein produces more output per hash. -hpa ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: More futzing about 2014-06-11 16:48 ` H. Peter Anvin @ 2014-06-11 19:25 ` Theodore Ts'o 2014-06-11 20:41 ` H. Peter Anvin 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-11 19:25 UTC (permalink / raw) To: H. Peter Anvin; +Cc: George Spelvin, linux-kernel, mingo, price On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote: > While talking about performance, I did a quick prototype of random using > Skein instead of SHA-1, and it was measurably faster, in part because > Skein produces more output per hash. Which Skein parameters did you use, and how much stack space was required for it? Skein-512 is described as needing 200 bytes of state, IIRC (which I assume most of which comes from Threefish key schedule). - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: More futzing about 2014-06-11 19:25 ` Theodore Ts'o @ 2014-06-11 20:41 ` H. Peter Anvin 2014-06-12 0:44 ` H. Peter Anvin 0 siblings, 1 reply; 57+ messages in thread From: H. Peter Anvin @ 2014-06-11 20:41 UTC (permalink / raw) To: Theodore Ts'o, George Spelvin, linux-kernel, mingo, price On 06/11/2014 12:25 PM, Theodore Ts'o wrote: > On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote: >> While talking about performance, I did a quick prototype of random using >> Skein instead of SHA-1, and it was measurably faster, in part because >> Skein produces more output per hash. > > Which Skein parameters did you use, and how much stack space was > required for it? Skein-512 is described as needing 200 bytes of > state, IIRC (which I assume most of which comes from Threefish key > schedule). > I believe I used Skein-256, but I'd have to dig to find it again. -hpa ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: More futzing about 2014-06-11 20:41 ` H. Peter Anvin @ 2014-06-12 0:44 ` H. Peter Anvin 2014-06-12 1:51 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: H. Peter Anvin @ 2014-06-12 0:44 UTC (permalink / raw) To: H. Peter Anvin, Theodore Ts'o, George Spelvin, linux-kernel, mingo, price On 06/11/2014 01:41 PM, H. Peter Anvin wrote: > On 06/11/2014 12:25 PM, Theodore Ts'o wrote: >> On Wed, Jun 11, 2014 at 09:48:31AM -0700, H. Peter Anvin wrote: >>> While talking about performance, I did a quick prototype of random using >>> Skein instead of SHA-1, and it was measurably faster, in part because >>> Skein produces more output per hash. >> >> Which Skein parameters did you use, and how much stack space was >> required for it? Skein-512 is described as needing 200 bytes of >> state, IIRC (which I assume most of which comes from Threefish key >> schedule). >> > > I believe I used Skein-256, but I'd have to dig to find it again. > > -hpa > Sadly I can't find the tree, but I'm 94% sure it was Skein-256 (specifically the SHA3-256 candidate parameter set.) -hpa ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: More futzing about 2014-06-12 0:44 ` H. Peter Anvin @ 2014-06-12 1:51 ` George Spelvin 0 siblings, 0 replies; 57+ messages in thread From: George Spelvin @ 2014-06-12 1:51 UTC (permalink / raw) To: hpa, hpa, linux-kernel, linux, mingo, price, tytso > Sadly I can't find the tree, but I'm 94% sure it was Skein-256 > (specifically the SHA3-256 candidate parameter set.) It would be nice to have two hash functions, optimized separately for 32- and 64-bit processors. As the Skein report says, the algorithm can be adapted to 32 bits easily enough. I also did some work a while ago to adapt the Skein parameter search code to develop a Skein-192 (6x32 bits) that would fit into registers on x86-32. (It got stalled when I e-mailed Niels Ferguson about it and never heard back; it fell off the to-do list while I was waiting.) The intended target was IPv6 address hashing for sequence number randomization, but it could be used for pool hashing, too. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: More futzing about 2014-06-11 16:38 ` Theodore Ts'o 2014-06-11 16:48 ` H. Peter Anvin @ 2014-06-12 0:32 ` George Spelvin 2014-06-12 3:22 ` Theodore Ts'o 2014-06-12 3:43 ` drivers/char/random.c: More futzing about George Spelvin 2 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-12 0:32 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price > ... but how did you measure the "2/3 the time"? I've done some > measurements, timing calls to fast_mix() and fast_mix2() N > times and dividing by N (where N needs to be quite large). Using that > metric, fast_mix2() takes seven times as long to run. Wow, *massively* different results. Using a large iteration count, it's definitely faster. Here's with 1,000,000 iterations: # ./perftest ./random pool 1 = 6b469698 596ceb8f 70536d2d e262b8ed pool 2 = 72e2554d fd20e020 a51baf43 19472f63 0: 40587696 19952756 (-20634940) 1: 40569444 19986700 (-20582744) 2: 40529396 19983108 (-20546288) 3: 40492468 19959528 (-20532940) 4: 40461808 19977316 (-20484492) 5: 40421980 19977436 (-20444544) 6: 40495440 19959904 (-20535536) 7: 40436676 19975400 (-20461276) 8: 40454912 19971360 (-20483552) 9: 40463260 19957324 (-20505936) Here's with 1 iteration (which, you're quite right, I failed to test; instruction counts matter more when code is not hot in cache): # ./perftest ./random pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 76 84 (+8) 1: 36 36 (+0) 2: 32 36 (+4) 3: 32 40 (+8) 4: 28 40 (+12) 5: 32 36 (+4) 6: 32 36 (+4) 7: 28 40 (+12) 8: 32 40 (+8) 9: 28 40 (+12) Comparable, but slightly slower. Clearly, I need to do better. And you can see the first-iteration effects clearly. Still, nothing *remotely* like 7x! That seems crazy, as the overall operation counts are not that dissimilar. 
Here's the "perftest" shell wrapper: #!/bin/sh old="`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`" for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $i done "$@" for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo "$old" > $i done And "random.c" is appended. Apologies for the comments and #ifdefs I used while experimenting; I wanted to give you *exactly* what I'm running. Do you see a bug in it anywhere? I know P4s have slow rotates, so the larger number of them compared to shifts in the original will definitely cause a hit. But neither of us is using one. Compile line and platform: cc -W -Wall -m64 -O -march=native random.c -o random model name : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz #include <stdint.h> typedef uint32_t u32; static u32 const twist_table[8] = { 0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158, 0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 }; struct fast_pool { u32 pool[4]; unsigned long last; unsigned short count; unsigned char rotate; unsigned char last_timer_intr; }; static inline u32 rol32(u32 word, unsigned int shift) { return (word << shift) | (word >> (32 - shift)); } /* * This is a fast mixing routine used by the interrupt randomness * collector. It's hardcoded for an 128 bit pool and assumes that any * locks that might be needed are taken by the caller. 
*/ void fast_mix(struct fast_pool *f, u32 const input[4]) { u32 w; unsigned input_rotate = f->rotate; w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3]; f->pool[0] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 14) & 31; w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0]; f->pool[1] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1]; f->pool[2] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2]; f->pool[3] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; f->rotate = input_rotate; f->count++; } void fast_mix2(struct fast_pool *f, u32 const input[4]) { /* Copy pool to registers */ u32 a = f->pool[0] ^ input[0]; u32 b = f->pool[1] ^ input[1]; u32 c = f->pool[2] ^ input[2]; u32 d = f->pool[3] ^ input[3]; #if 1 int i; for (i = 0; i < 2; i++) { /* * Inspired by ChaCha QuarterRound, but * modified for much greater parallelism. 
*/ a += b; c += d; d ^= a; b ^= c; // a = rol32(a, 24); c = rol32(c, 5); a = rol32(a, 27); c = rol32(c, 7); // d = rol32(d, 27); b = rol32(b, 7); a += b; c += d; d ^= a; b ^= c; a = rol32(a, 17); c = rol32(c, 11); // d = rol32(d, 17); b = rol32(b, 11); #if 0 a += b; c += d; d ^= a; b ^= c; a = rol32(a, 24); c = rol32(c, 5); #endif } f->pool[0] = a; f->pool[1] = b; f->pool[2] = c; f->pool[3] = d; #elif 1 a ^= a<<15; a ^= a>>4 ^ d ^ d>>21; f->pool[0] = a; b ^= b<<15; b ^= b>>4 ^ a ^ a>>21; f->pool[1] = b; c ^= c<<15; c ^= c>>4 ^ b ^ b>>21; f->pool[2] = c; d ^= d<<15; d ^= d>>4 ^ c ^ c>>21; f->pool[3] = d; #else f->pool[0] = a ^= a<<20 ^ b ^ b>>11 ^ c ^ c<<27 ^ d ^ d >> 6; f->pool[1] = b ^= b<<20 ^ c ^ c>>11 ^ d ^ d<<27 ^ a ^ a >> 6; f->pool[2] = c ^= c<<20 ^ d ^ d>>11 ^ a ^ a<<27 ^ b ^ b >> 6; f->pool[3] = d ^= d<<20 ^ a ^ a>>11 ^ b ^ b<<27 ^ c ^ c >> 6; #endif f->count++; } static uint64_t rdtsc(void) { uint32_t low, high; __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high)); return ((uint64_t)high << 32) | low; } #include <stdio.h> #include <stdlib.h> #define ITER 1 #define N 10 int main(void) { struct fast_pool fp1 = { .pool = {0}, .count = 0, .rotate = 0 }; struct fast_pool fp2 = { .pool = {0}, .count = 0 }; uint32_t times[N][2]; int i, j; for (i = 0; i < N; i++) { uint64_t t1, t2, t3; uint32_t data[4]; for (j = 0; j < 4; j++) data[j] = random(); t1 = rdtsc(); for (j = 0; j < ITER; j++) fast_mix(&fp1, data); t2 = rdtsc(); for (j = 0; j < ITER; j++) fast_mix2(&fp2, data); t3 = rdtsc(); times[i][0] = t2 - t1; times[i][1] = t3 - t2; } printf("pool 1 = %08x %08x %08x %08x\n", fp1.pool[0], fp1.pool[1], fp1.pool[2], fp1.pool[3]); printf("pool 2 = %08x %08x %08x %08x\n", fp2.pool[0], fp2.pool[1], fp2.pool[2], fp2.pool[3]); for (i = 0; i < N; i++) printf("%2d: %10u %10u (%+d)\n", i, times[i][0], times[i][1], (int)(times[i][1] - times[i][0])); return 0; } ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: More futzing about 2014-06-12 0:32 ` George Spelvin @ 2014-06-12 3:22 ` Theodore Ts'o 2014-06-12 4:13 ` random: Benchamrking fast_mix2 George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-12 3:22 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price On Wed, Jun 11, 2014 at 08:32:49PM -0400, George Spelvin wrote: > Comparable, but slightly slower. Clearly, I need to do better. > And you can see the first-iteration effects clearly. Still, > noting *remotely* like 7x! I redid my numbers, and I can no longer reproduce the 7x slowdown. I do see that if you compile w/o -O2, fast_mix2 is twice as slow. But it's not 7x slower. When compiling w/o -O2: fast_mix fast_mix2 task-clock 221.3 ms 460.7 ms When compiling with -O2 -Os: fast_mix fast_mix2 task-clock 115.4 ms 71.5 ms And here's the numbers I got with a single iteration using rdtsc: fast_mix: 164 fast_mix2: 237 fast_mix: 168 fast_mix2: 230 fast_mix: 166 fast_mix2: 228 fast_mix: 164 fast_mix2: 230 fast_mix: 166 fast_mix2: 230 fast_mix: 168 fast_mix2: 232 fast_mix: 166 fast_mix2: 228 fast_mix: 164 fast_mix2: 228 fast_mix: 166 fast_mix2: 234 fast_mix: 166 fast_mix2: 230 - Ted #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <string.h> typedef unsigned int __u32; struct fast_pool { __u32 pool[4]; unsigned long last; unsigned short count; unsigned char rotate; unsigned char last_timer_intr; }; /** * rol32 - rotate a 32-bit value left * @word: value to rotate * @shift: bits to roll */ static inline __u32 rol32(__u32 word, unsigned int shift) { return (word << shift) | (word >> (32 - shift)); } static __u32 const twist_table[8] = { 0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158, 0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 }; /* * This is a fast mixing routine used by the interrupt randomness * collector. It's hardcoded for an 128 bit pool and assumes that any * locks that might be needed are taken by the caller. 
*/ extern void fast_mix(struct fast_pool *f, __u32 input[4]) { __u32 w; unsigned input_rotate = f->rotate; w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3]; f->pool[0] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 14) & 31; w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0]; f->pool[1] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1]; f->pool[2] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2]; f->pool[3] = (w >> 3) ^ twist_table[w & 7]; input_rotate = (input_rotate + 7) & 31; f->rotate = input_rotate; f->count++; } extern fast_mix2(struct fast_pool *f, __u32 const input[4]) { __u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1]; __u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3]; int i; for (i = 0; i < 3; i++) { /* * Inspired by ChaCha's QuarterRound, but * modified for much greater parallelism. * Surprisingly, rotating a and c seems to work * better than b and d. And it runs faster. 
*/ a += b; c += d; d ^= a; b ^= c; a = rol32(a, 15); c = rol32(c, 21); a += b; c += d; d ^= a; b ^= c; a = rol32(a, 3); c = rol32(c, 7); } f->pool[0] = a; f->pool[1] = b; f->pool[2] = c; f->pool[3] = d; f->count++; } static __inline__ volatile unsigned long long rdtsc(void) { unsigned long long int x; __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x)); return x; } int main(int argc, char **argv) { struct fast_pool f; int i; __u32 input[4]; unsigned volatile long long start_time, end_time; memset(&f, 0, sizeof(f)); memset(&input, 0, sizeof(input)); f.pool[0] = 1; #if !defined(BENCH_FASTMIX) && !defined(BENCH_FASTMIX2) for (i=0; i < 10; i++) { usleep(50000); start_time = rdtsc(); fast_mix(&f, input); end_time = rdtsc(); printf("fast_mix: %llu\t", end_time - start_time); usleep(50000); start_time = rdtsc(); fast_mix2(&f, input); end_time = rdtsc(); printf("fast_mix2: %llu\n", end_time - start_time); } #endif #ifdef BENCH_FASTMIX for (i=0; i < 10240000; i++) { fast_mix(&f, input); } #endif #ifdef BENCH_FASTMIX2 for (i=0; i < 10240000; i++) { fast_mix2(&f, input); } #endif } ^ permalink raw reply [flat|nested] 57+ messages in thread
* random: Benchmarking fast_mix2 2014-06-12 3:22 ` Theodore Ts'o @ 2014-06-12 4:13 ` George Spelvin 2014-06-12 11:18 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-12 4:13 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price > I redid my numbers, and I can no longer reproduce the 7x slowdown. I > do see that if you compile w/o -O2, fast_mix2 is twice as slow. But > it's not 7x slower. For my single-round version, I needed to drop to 2 loops rather than 3 to match the speed. That's in the source I posted, but I didn't point it out. (It wasn't an attempt to be deceptive, that's just how I happened to have left the file when I was experimenting with various options. I figured if we were looking for 7x, 1.5x wasn't all that important.) That explains some of the residual difference between our figures. When developing, I was using a many-iteration benchmark, and I suspect it fitted in the Ivy Bridge uop cache, which let it saturate the execution resources. Sorry for the premature alarm; I'll go back to work and find something better. 
I still get comparable speed for 2 loops and -O2: $ cc -W -Wall -m32 -O2 -march=native random.c -o random32 # ./perftest ../spooky/random32 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 148 124 (-24) 1: 48 36 (-12) 2: 40 36 (-4) 3: 44 40 (-4) 4: 44 40 (-4) 5: 36 36 (+0) 6: 52 36 (-16) 7: 44 32 (-12) 8: 44 36 (-8) 9: 48 36 (-12) $ cc -W -Wall -m64 -O2 -march=native random.c -o random64 # ./perftest ../spooky/random64 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 132 104 (-28) 1: 40 40 (+0) 2: 36 44 (+8) 3: 32 40 (+8) 4: 40 36 (-4) 5: 32 40 (+8) 6: 36 44 (+8) 7: 40 40 (+0) 8: 36 44 (+8) 9: 40 36 (-4) $ cc -W -Wall -m32 -O3 -march=native random.c -o random32 # ./perftest ./random32 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 88 48 (-40) 1: 36 40 (+4) 2: 36 44 (+8) 3: 32 40 (+8) 4: 36 40 (+4) 5: 96 40 (-56) 6: 40 40 (+0) 7: 36 40 (+4) 8: 28 48 (+20) 9: 28 40 (+12) $ cc -W -Wall -m64 -O3 -march=native random.c -o random64 # ./perftest ./random64 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 72 80 (+8) 1: 36 52 (+16) 2: 32 36 (+4) 3: 32 36 (+4) 4: 28 40 (+12) 5: 32 40 (+8) 6: 32 40 (+8) 7: 32 36 (+4) 8: 28 44 (+16) 9: 36 36 (+0) $ cc -W -Wall -m32 -Os -march=native random.c -o random32 # ./perftest ./random32 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 108 132 (+24) 1: 44 44 (+0) 2: 76 40 (-36) 3: 44 48 (+4) 4: 36 40 (+4) 5: 32 44 (+12) 6: 40 56 (+16) 7: 44 36 (-8) 8: 44 40 (-4) 9: 32 40 (+8) $ $ cc -W -Wall -m64 -Os -march=native random.c -o random64 # ./perftest ./random64 pool 1 = 85670974 e96b1f8f 51244abf 5863283f pool 2 = 03564c6c eba81d03 55c77fa1 760374a7 0: 96 108 (+12) 1: 44 52 (+8) 2: 40 40 (+0) 3: 40 36 (-4) 4: 40 32 (-8) 5: 36 36 (+0) 6: 44 32 (-12) 7: 36 36 (+0) 8: 40 36 (-4) 9: 40 36 (-4) Yours looks much more careful 
about the timing. A few GCC warnings I ended up fixing: 1) "volatile" on rdtsc is meaningless and ignored (with a warning) 2) fast_mix2() needs a void return type; it defaults to int. 3) int main() needs a "return 0" Here's what I got running *your* program, unmodified except for the above (meaning 3 inner loop iterations). Compiled with GCC 4.9.0 (Debian 4.9.0-6), -O2. i7-4940K# ./perftest ./ted32 fast_mix: 430 fast_mix2: 431 fast_mix: 442 fast_mix2: 464 fast_mix: 442 fast_mix2: 465 fast_mix: 442 fast_mix2: 431 fast_mix: 442 fast_mix2: 465 fast_mix: 431 fast_mix2: 430 fast_mix: 442 fast_mix2: 431 fast_mix: 431 fast_mix2: 465 fast_mix: 431 fast_mix2: 465 fast_mix: 431 fast_mix2: 431 i7-4940K# ./perftest ./ted64 fast_mix: 454 fast_mix2: 465 fast_mix: 453 fast_mix2: 465 fast_mix: 442 fast_mix2: 464 fast_mix: 453 fast_mix2: 464 fast_mix: 454 fast_mix2: 465 fast_mix: 453 fast_mix2: 465 fast_mix: 442 fast_mix2: 464 fast_mix: 453 fast_mix2: 464 fast_mix: 453 fast_mix2: 464 fast_mix: 453 fast_mix2: 465 In other words, pretty damn near the same speed (with 3 loops). So we still have some discrepancy to track down. A few other machines. 
i5-3330$ /tmp/ted32 fast_mix: 226 fast_mix2: 277 fast_mix: 561 fast_mix2: 429 fast_mix: 156 fast_mix2: 406 fast_mix: 504 fast_mix2: 534 fast_mix: 579 fast_mix2: 270 fast_mix: 240 fast_mix2: 270 fast_mix: 494 fast_mix2: 270 fast_mix: 240 fast_mix2: 138 fast_mix: 750 fast_mix2: 277 fast_mix: 124 fast_mix2: 270 i5-3330$ /tmp/ted64 fast_mix: 224 fast_mix2: 277 fast_mix: 226 fast_mix2: 312 fast_mix: 646 fast_mix2: 276 fast_mix: 233 fast_mix2: 456 fast_mix: 591 fast_mix2: 570 fast_mix: 413 fast_mix2: 563 fast_mix: 584 fast_mix2: 270 fast_mix: 231 fast_mix2: 261 fast_mix: 233 fast_mix2: 459 fast_mix: 528 fast_mix2: 277 Pentium4$ /tmp/ted32 fast_mix: 912 fast_mix2: 396 fast_mix: 792 fast_mix2: 160 fast_mix: 524 fast_mix2: 160 fast_mix: 1460 fast_mix2: 440 fast_mix: 496 fast_mix2: 160 fast_mix: 672 fast_mix2: 160 fast_mix: 700 fast_mix2: 160 fast_mix: 336 fast_mix2: 540 fast_mix: 896 fast_mix2: 160 fast_mix: 1052 fast_mix2: 156 Phemom9850$ /tmp/ted32 fast_mix: 463 fast_mix2: 158 fast_mix: 276 fast_mix2: 174 fast_mix: 194 fast_mix2: 135 fast_mix: 620 fast_mix2: 424 fast_mix: 584 fast_mix2: 424 fast_mix: 610 fast_mix2: 418 fast_mix: 651 fast_mix2: 1107 fast_mix: 634 fast_mix2: 439 fast_mix: 632 fast_mix2: 456 fast_mix: 534 fast_mix2: 205 Phemom9850$ /tmp/ted64 fast_mix: 783 fast_mix2: 185 fast_mix: 903 fast_mix2: 144 fast_mix: 955 fast_mix2: 178 fast_mix: 515 fast_mix2: 437 fast_mix: 642 fast_mix2: 580 fast_mix: 610 fast_mix2: 525 fast_mix: 523 fast_mix2: 119 fast_mix: 180 fast_mix2: 315 fast_mix: 596 fast_mix2: 570 fast_mix: 598 fast_mix2: 775 AthlonXP$ /tmp/ted32 fast_mix: 119 fast_mix2: 113 fast_mix: 139 fast_mix2: 109 fast_mix: 155 fast_mix2: 123 fast_mix: 134 fast_mix2: 140 fast_mix: 126 fast_mix2: 154 fast_mix: 134 fast_mix2: 113 fast_mix: 176 fast_mix2: 140 fast_mix: 145 fast_mix2: 113 fast_mix: 134 fast_mix2: 144 fast_mix: 155 fast_mix2: 112 So I'm still a bit confused. Would any bystanders like to chip in? Ted, shall I send you some binaries? 
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-12 4:13 ` random: Benchamrking fast_mix2 George Spelvin @ 2014-06-12 11:18 ` George Spelvin 2014-06-12 20:17 ` Theodore Ts'o 2014-06-12 20:46 ` Theodore Ts'o 0 siblings, 2 replies; 57+ messages in thread From: George Spelvin @ 2014-06-12 11:18 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price Ted, just as an example of a possible tweak, the following change seems to slightly improve speed without reducing diffusion. (It now looks a bit more like the Skein mix() function.) I'll look for even more efficient kernels. It appears that the initial XOR is costing a lot; 8 memory fetches take a while. Spacing that out a bit would help, but having input arrive partway through the round would require major surgery on my avalanche-measuring code. These rotation constants aren't final, but tweaking them shouldn't affect the speed. IMHO *one* pass of this is as good as the current fast_mix, and two are considerably better. Three achieve almost complete avalanche, but aren't really necessary. Since the pass is made of two identical halves, half-passes are also possible. Either using only 2 rotate constants, or by unrolling the last half-pass. // (29/66353) score = 49/121/123: 6 27 16 14 a += b; c += d; + b = rol32(a, 6); d = rol32(c, 27); d ^= a; b ^= c; - a = rol32(a, 15); c = rol32(c, 21); a += b; c += d; + b = rol32(a, 16); d = rol32(c, 14); d ^= a; b ^= c; - a = rol32(a, 3); c = rol32(c, 7); Of course, using wider words works fantastically.
These constants give 76 bits of avalanche after 2 rounds, essentially full after 3: extern void fast_mix2(struct fast_pool *f, __u32 const input[4]) { uint64_t a = ((uint64_t *)f->pool)[0] ^ ((uint64_t const *)input)[0]; uint64_t b = ((uint64_t *)f->pool)[1] ^ ((uint64_t const *)input)[1]; int i; for (i = 0; i < 3; i++) { a += b; b = rol64(b, 52); b ^= a; a = rol64(a, 10); a += b; b = rol64(b, 47); b ^= a; a = rol64(a, 17); } ((uint64_t *)f->pool)[0] = a; ((uint64_t *)f->pool)[1] = b; f->count++; } And it measures as faster than fast_mix even with 3 rounds: # ./perftest ./ted64 fast_mix: 499 fast_mix2: 430 fast_mix: 431 fast_mix2: 430 fast_mix: 510 fast_mix2: 419 fast_mix: 430 fast_mix2: 419 fast_mix: 430 fast_mix2: 419 fast_mix: 431 fast_mix2: 419 fast_mix: 431 fast_mix2: 430 fast_mix: 510 fast_mix2: 419 fast_mix: 510 fast_mix2: 431 fast_mix: 510 fast_mix2: 430 And with 2 it's even faster. (But so is fast_mix; there appears to be significant timing instability here.) # ./perftest ./ted64 fast_mix: 430 fast_mix2: 306 fast_mix: 431 fast_mix2: 306 fast_mix: 442 fast_mix2: 306 fast_mix: 442 fast_mix2: 306 fast_mix: 442 fast_mix2: 306 fast_mix: 442 fast_mix2: 363 fast_mix: 442 fast_mix2: 306 fast_mix: 442 fast_mix2: 329 fast_mix: 408 fast_mix2: 306 fast_mix: 442 fast_mix2: 329 ^ permalink raw reply [flat|nested] 57+ messages in thread
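Since the 64-bit kernel above is self-contained, its avalanche claim is easy to sanity-check in userspace. The sketch below reproduces the quoted mix without the struct fast_pool plumbing and adds a crude bit-flip probe; the helper names (`mix64`, `avalanche_one_bit`) are mine, not from the thread:

```c
#include <stdint.h>

static inline uint64_t rol64(uint64_t w, unsigned s)
{
	return (w << s) | (w >> (64 - s));
}

/* The 64-bit kernel quoted above, operating on a 128-bit state,
 * with the round count made a parameter. */
static void mix64(uint64_t pool[2], const uint64_t in[2], int rounds)
{
	uint64_t a = pool[0] ^ in[0], b = pool[1] ^ in[1];

	while (rounds--) {
		a += b; b = rol64(b, 52); b ^= a; a = rol64(a, 10);
		a += b; b = rol64(b, 47); b ^= a; a = rol64(a, 17);
	}
	pool[0] = a; pool[1] = b;
}

/* Crude avalanche probe (test harness, not from the thread): flip one
 * input bit and count how many of the 128 state bits change. */
static int avalanche_one_bit(int rounds, int bit)
{
	uint64_t s0[2] = { 0, 0 }, s1[2] = { 0, 0 };
	uint64_t in0[2] = { 0, 0 }, in1[2] = { 0, 0 };

	in1[bit / 64] = 1ULL << (bit % 64);
	mix64(s0, in0, rounds);
	mix64(s1, in1, rounds);
	return __builtin_popcountll(s0[0] ^ s1[0]) +
	       __builtin_popcountll(s0[1] ^ s1[1]);
}
```

Once avalanche is essentially complete, flipping any single input bit should change roughly half of the 128 state bits; with zero rounds the probe trivially reports just the flipped bit.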
* Re: random: Benchamrking fast_mix2 2014-06-12 11:18 ` George Spelvin @ 2014-06-12 20:17 ` Theodore Ts'o 2014-06-12 20:46 ` Theodore Ts'o 1 sibling, 0 replies; 57+ messages in thread From: Theodore Ts'o @ 2014-06-12 20:17 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price One of the reasons for the timing instability is the usleep() calls. These allow some other process to schedule, and thus disrupt the I-cache and uop caches. With the usleep() calls: i7-4900MQ# schedtool -R -p 1 -e /tmp/fast_mix2_49 fast_mix: 213 fast_mix2: 280 fast_mix: 336 fast_mix2: 356 fast_mix: 174 fast_mix2: 392 fast_mix: 202 fast_mix2: 403 fast_mix: 152 fast_mix2: 280 fast_mix: 212 fast_mix2: 403 fast_mix: 213 fast_mix2: 403 fast_mix: 213 fast_mix2: 392 fast_mix: 202 fast_mix2: 403 fast_mix: 191 fast_mix2: 392 ... and without the usleep calls: i7-4900MQ# schedtool -R -p 1 -e /tmp/fast_mix2_49 fast_mix: 146 fast_mix2: 347 fast_mix: 157 fast_mix2: 90 fast_mix: 78 fast_mix2: 90 fast_mix: 78 fast_mix2: 89 fast_mix: 78 fast_mix2: 90 fast_mix: 78 fast_mix2: 90 fast_mix: 90 fast_mix2: 90 fast_mix: 79 fast_mix2: 90 fast_mix: 90 fast_mix2: 89 fast_mix: 79 fast_mix2: 90 I had originally added the usleep() calls in my test infrastructure to more accurately disable the uop cache effects, since we are going to be called from an interrupt handler, not in a loop. But anyway, this is one of the reasons for the differences that you were seeing with your benchmarking framework and mine, and why micro-benchmarking can be so hard to get right. :-) (BTW, this is your original mixer; I haven't tried playing with your modified Skein-like core round yet.) - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-12 11:18 ` George Spelvin 2014-06-12 20:17 ` Theodore Ts'o @ 2014-06-12 20:46 ` Theodore Ts'o 2014-06-13 0:23 ` George Spelvin 1 sibling, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-12 20:46 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price So I just tried your modified 32-bit mixing function where you moved the rotation to the middle step instead of the last step. With the usleep(), it doesn't make any difference: # schedtool -R -p 1 -e /tmp/fast_mix2_48 fast_mix: 212 fast_mix2: 400 fast_mix3: 400 fast_mix: 208 fast_mix2: 408 fast_mix3: 388 fast_mix: 208 fast_mix2: 396 fast_mix3: 404 fast_mix: 224 fast_mix2: 408 fast_mix3: 392 fast_mix: 200 fast_mix2: 404 fast_mix3: 404 fast_mix: 208 fast_mix2: 412 fast_mix3: 396 fast_mix: 208 fast_mix2: 392 fast_mix3: 392 fast_mix: 212 fast_mix2: 408 fast_mix3: 388 fast_mix: 200 fast_mix2: 716 fast_mix3: 773 fast_mix: 426 fast_mix2: 717 fast_mix3: 728 without the usleep() I get: 692# schedtool -R -p 1 -e /tmp/fast_mix2_48 fast_mix: 104 fast_mix2: 224 fast_mix3: 176 fast_mix: 56 fast_mix2: 112 fast_mix3: 56 fast_mix: 56 fast_mix2: 64 fast_mix3: 64 fast_mix: 64 fast_mix2: 64 fast_mix3: 48 fast_mix: 56 fast_mix2: 64 fast_mix3: 56 fast_mix: 56 fast_mix2: 64 fast_mix3: 64 fast_mix: 56 fast_mix2: 64 fast_mix3: 64 fast_mix: 56 fast_mix2: 72 fast_mix3: 56 fast_mix: 56 fast_mix2: 64 fast_mix3: 56 fast_mix: 64 fast_mix2: 64 fast_mix3: 56 I'm beginning to suspect that some of the differences between your measurements and mine might be that in addition to having a smaller cache (8M instead of 12M), I suspect there are some other caches, perhaps the uop cache, which are also smaller on the mobile processor, and that explains why you are seeing some different results. > > Of course, using wider words works fantastically. > These constants give 76 bits if avalanche after 2 rounds, > essentially full after 3....
And here is my testing using your 64-bit variant: # schedtool -R -p 1 -e /tmp/fast_mix2_49 fast_mix: 294 fast_mix2: 476 fast_mix4: 442 fast_mix: 286 fast_mix2: 1058 fast_mix4: 448 fast_mix: 958 fast_mix2: 460 fast_mix4: 1002 fast_mix: 940 fast_mix2: 1176 fast_mix4: 826 fast_mix: 476 fast_mix2: 840 fast_mix4: 826 fast_mix: 462 fast_mix2: 840 fast_mix4: 826 fast_mix: 462 fast_mix2: 826 fast_mix4: 826 fast_mix: 462 fast_mix2: 826 fast_mix4: 826 fast_mix: 462 fast_mix2: 826 fast_mix4: 826 fast_mix: 462 fast_mix2: 840 fast_mix4: 826 ... and without usleep() 690# schedtool -R -p 1 -e /tmp/fast_mix2_48 fast_mix: 52 fast_mix2: 116 fast_mix4: 96 fast_mix: 32 fast_mix2: 32 fast_mix4: 24 fast_mix: 28 fast_mix2: 36 fast_mix4: 24 fast_mix: 32 fast_mix2: 32 fast_mix4: 24 fast_mix: 32 fast_mix2: 36 fast_mix4: 24 fast_mix: 36 fast_mix2: 32 fast_mix4: 24 fast_mix: 32 fast_mix2: 36 fast_mix4: 28 fast_mix: 28 fast_mix2: 28 fast_mix4: 24 fast_mix: 32 fast_mix2: 36 fast_mix4: 28 fast_mix: 32 fast_mix2: 32 fast_mix4: 24 The bottom line is that what we are primarily measuring here is all different cache effects. And these are going to be quite different on different microarchitectures. That being said, I wouldn't be at all surprised if there are some CPUs where the extra memory dereference to the twist_table[] would definitely hurt, since Intel's amazing cache architecture(tm) is no doubt covering a lot of sins. I wouldn't be at all surprised if some of these new mixing functions would fare much better if we tried benchmarking them on a 32-bit ARM processor, for example.... - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-12 20:46 ` Theodore Ts'o @ 2014-06-13 0:23 ` George Spelvin 2014-06-13 15:52 ` Theodore Ts'o 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-13 0:23 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price > So I just tried your modified 32-bit mixing function where you the > rotation to the middle step instead of the last step. With the > usleep(), it doesn't make any difference: > > # schedtool -R -p 1 -e /tmp/fast_mix2_48 > fast_mix: 212 fast_mix2: 400 fast_mix3: 400 > fast_mix: 208 fast_mix2: 408 fast_mix3: 388 > fast_mix: 208 fast_mix2: 396 fast_mix3: 404 > fast_mix: 224 fast_mix2: 408 fast_mix3: 392 > fast_mix: 200 fast_mix2: 404 fast_mix3: 404 > fast_mix: 208 fast_mix2: 412 fast_mix3: 396 > fast_mix: 208 fast_mix2: 392 fast_mix3: 392 > fast_mix: 212 fast_mix2: 408 fast_mix3: 388 > fast_mix: 200 fast_mix2: 716 fast_mix3: 773 > fast_mix: 426 fast_mix2: 717 fast_mix3: 728 > And here is my testing using your 64-bit variant: > > # schedtool -R -p 1 -e /tmp/fast_mix2_49 > fast_mix: 294 fast_mix2: 476 fast_mix4: 442 > fast_mix: 286 fast_mix2: 1058 fast_mix4: 448 > fast_mix: 958 fast_mix2: 460 fast_mix4: 1002 > fast_mix: 940 fast_mix2: 1176 fast_mix4: 826 > fast_mix: 476 fast_mix2: 840 fast_mix4: 826 > fast_mix: 462 fast_mix2: 840 fast_mix4: 826 > fast_mix: 462 fast_mix2: 826 fast_mix4: 826 > fast_mix: 462 fast_mix2: 826 fast_mix4: 826 > fast_mix: 462 fast_mix2: 826 fast_mix4: 826 > fast_mix: 462 fast_mix2: 840 fast_mix4: 826 > The bottom line is that what we are primarily measuring here is all > different cache effects. And these are going to be quite different on > different microarchitectures. So adding fast_mix4 doubled the time taken by fast_mix. Yeah, that's trustworthy timing! :-) Still, you do seem to observe a pretty consistent factor of about 2x difference, which confuses me because I can't reproduce it. 
But it's hard to reach definite conclusions with this much measurement noise. Another cache we might be hitting is the branch predictor. Could you try unrolling fast_mix2 and fast_mix4 and see what difference that makes? (I'd send you a patch but you could probably do it by hand faster than applying one.) It only makes a slight difference on my high-end Intel box, but almost doubles the speed on the Phenom: Rolled (64-bit core, 2 rounds): fast_mix: 293 fast_mix2: 205 fast_mix: 257 fast_mix2: 162 fast_mix: 170 fast_mix2: 137 fast_mix: 283 fast_mix2: 218 fast_mix: 270 fast_mix2: 185 fast_mix: 288 fast_mix2: 199 fast_mix: 423 fast_mix2: 131 fast_mix: 286 fast_mix2: 218 fast_mix: 681 fast_mix2: 165 fast_mix: 268 fast_mix2: 190 Unrolled (64-bit core, 2 rounds): fast_mix: 394 fast_mix2: 108 fast_mix: 145 fast_mix2: 80 fast_mix: 270 fast_mix2: 112 fast_mix: 145 fast_mix2: 81 fast_mix: 145 fast_mix2: 79 fast_mix: 662 fast_mix2: 107 fast_mix: 145 fast_mix2: 78 fast_mix: 140 fast_mix2: 127 fast_mix: 164 fast_mix2: 182 fast_mix: 205 fast_mix2: 79 Since the original fast_mix is unrolled, a penalty there wouldn't hit it. > That being said, I wouldn't be at all surprised if there are some > CPU's where the extract memory dereference to the twist_table[] would > definitely hurt, since Intel's amazing cache architecture(tm) is no > doubt covering a lot of sins. I wouldn't be at all surprised if some > of these new mixing functions would fare much better if we tried > benchmarking them on an 32-bit ARM processor, for example.... Yes, Intel's D-caches are quite impressive. ^ permalink raw reply [flat|nested] 57+ messages in thread
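For concreteness, here is what the rolled-versus-unrolled comparison looks like for the 64-bit kernel quoted earlier in the thread (a userspace sketch; the function names are mine). The unrolled form is just the two-round loop written straight-line, so the conditional loop branch disappears while the arithmetic stays byte-for-byte identical:

```c
#include <stdint.h>

static inline uint64_t rol64(uint64_t w, unsigned s)
{
	return (w << s) | (w >> (64 - s));
}

/* Rolled form: the 64-bit kernel from earlier, with a round count. */
static void mix64_rolled(uint64_t p[2], const uint64_t in[2], int rounds)
{
	uint64_t a = p[0] ^ in[0], b = p[1] ^ in[1];
	int i;

	for (i = 0; i < rounds; i++) {
		a += b; b = rol64(b, 52); b ^= a; a = rol64(a, 10);
		a += b; b = rol64(b, 47); b ^= a; a = rol64(a, 17);
	}
	p[0] = a; p[1] = b;
}

/* Unrolled 2-round form: the same operations written straight-line,
 * so the loop's conditional branch disappears. */
static void mix64_unrolled2(uint64_t p[2], const uint64_t in[2])
{
	uint64_t a = p[0] ^ in[0], b = p[1] ^ in[1];

	a += b; b = rol64(b, 52); b ^= a; a = rol64(a, 10);
	a += b; b = rol64(b, 47); b ^= a; a = rol64(a, 17);
	a += b; b = rol64(b, 52); b ^= a; a = rol64(a, 10);
	a += b; b = rol64(b, 47); b ^= a; a = rol64(a, 17);
	p[0] = a; p[1] = b;
}

/* Helper (mine): confirm both forms compute the same state update. */
static int forms_agree(uint64_t s0, uint64_t s1, uint64_t i0, uint64_t i1)
{
	uint64_t p[2] = { s0, s1 }, q[2] = { s0, s1 };
	const uint64_t in[2] = { i0, i1 };

	mix64_rolled(p, in, 2);
	mix64_unrolled2(q, in);
	return p[0] == q[0] && p[1] == q[1];
}
```

Only the branch behaviour differs between the two, which is what makes the unrolled timing numbers above attributable to the predictor rather than to the mixing arithmetic.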
* Re: random: Benchamrking fast_mix2 2014-06-13 0:23 ` George Spelvin @ 2014-06-13 15:52 ` Theodore Ts'o 2014-06-14 2:10 ` George Spelvin 2014-06-14 4:55 ` [RFC] random: is the IRQF_TIMER test working as intended? George Spelvin 0 siblings, 2 replies; 57+ messages in thread From: Theodore Ts'o @ 2014-06-13 15:52 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price On Thu, Jun 12, 2014 at 08:23:04PM -0400, George Spelvin wrote: > Another cache we might be hitting is the branch predictor. Could you try > unrolling fast_mix2 and fast_mix4 and see what difference that makes? > (I'd send you a patch but you could probably do it by hand faster than > appying one.) Unrolling doesn't make much difference, which isn't surprising given that almost all of the differences go away when I commented out the udelay(). Basically, at this point what we're primarily measuring is how well various CPUs' caches work, especially across context switches where other code gets to run in between. If that's the case, all else being equal, removing the extra memory reference for twist_table[] does make sense, and something else I've considered doing is to remove the input[] array entirely, and have add_interrupt_randomness() xor values directly into the pool, and then let that fast_mix function stir the pool. It's harder to benchmark this, but at this point, I think we know enough to be confident that this will be a win on at least some platforms, and so long as it's not a massive loss from what we had before, I'll be fine with it. I also think that it's going to be worthwhile to do the RDTSC measurement in vitro, and calculate average and max latencies, since it's clear that there are real limitations to userspace benchmarking. Cheers, - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-13 15:52 ` Theodore Ts'o @ 2014-06-14 2:10 ` George Spelvin 2014-06-14 3:06 ` Theodore Ts'o 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-14 2:10 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price > Unrolling doesn't make much difference; which isn't surprising given > that almost all of the differences go away when I commented out the > udelay(). Basically, at this point what we're primarily measuring is > how good various CPU's caches work, especially across context switches > where other code gets to run in between. Huh. As I was referring to when I talked about the branch predictor, I was hoping that removing *conditional* branches would help. > If that's the case, all else being equal, removing the extra memory > reference for twist_table[] does make sense, and something else I've > considered doing is to remove the input[] array entirely, and have > add_interrupt_randomness[] xor values directly into the pool, and then > let that fast_mix function stir the pool. Are you trying for an XOR to memory, or is the idea to remain in registers for the entire operation? I'm not sure an XOR to memory is that much better; it's 2 pool loads and 1 pool store either way. Currently, the store is first (to input[]) and then both it and the fast_pool are fetched in fast_mix. With an XOR to memory, it's load-store-load, but is that really better? In case it's useful, below is a small patch I made to add_interrupt_randomness to take advantage of 64-bit processors and make it a bit clearer what it's doing. Not submitted officially because: 1) I haven't examined the consequences on 32-bit processors carefully yet. 2) It's more of a "code cleanup", meaning personal style preference, and you've expressed some pretty strong unhappiness with such churn.
It's also useful preparation for changing to a native 64-bit fast_mix. In general, I dislike "pre-compressing" the input; if the input hash isn't fast enough, fix that for all users rather than adding something ad-hoc. If you want a fast_mix function with different input and state widths, I can try to come up with one. (Given the iterated nature of the current fast_mix2, you can also just add additional seed material between the rounds.) > I also think that it's going to be worthwhile to do the RDTSC > measurement in vitro, and calculate average and max latencies, since > it's clear that there are real limitations to userspace benchmarking. I'm not sure you're not making a clever joke about the use of silicon dioxide in real chips, but don't you mean "in vivo"? (Also, if we're reading the TSC twice, and we know the delta is noisy as heck, seed with it!) commit d3c0a185991a45e420925d040f19e764808b354e Author: George Spelvin <linux@horizon.com> Date: Sat Jun 7 21:16:45 2014 -0400 random: Simplify add_interrupt_randomness using 64-bit math The same values (except word-swapped on big-endian machines) are passed to fast_mix, but the code is simpler, smaller, and uses 64-bit operations if available. Signed-off-by: George Spelvin <linux@horizon.com> diff --git a/drivers/char/random.c b/drivers/char/random.c index 868760e1..acc9bb1a 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -563,7 +563,7 @@ struct fast_pool { * collector. It's hardcoded for an 128 bit pool and assumes that any * locks that might be needed are taken by the caller.
*/ -static void fast_mix(struct fast_pool *f, __u32 input[4]) +static void fast_mix(struct fast_pool *f, __u32 const input[4]) { __u32 w; unsigned input_rotate = f->rotate; @@ -839,22 +839,16 @@ void add_interrupt_randomness(int irq, int irq_flags) struct entropy_store *r; struct fast_pool *fast_pool = &__get_cpu_var(irq_randomness); struct pt_regs *regs = get_irq_regs(); - unsigned long now = jiffies; - cycles_t cycles = random_get_entropy(); - __u32 input[4], c_high, j_high; - __u64 ip; unsigned long seed; int credit; + unsigned long now = jiffies; + cycles_t cycles = random_get_entropy(); + __u64 const input[2] = { + cycles ^ irq ^ rol64(now, 32), + regs ? instruction_pointer(regs) : _RET_IP_ + }; - c_high = (sizeof(cycles) > 4) ? cycles >> 32 : 0; - j_high = (sizeof(now) > 4) ? now >> 32 : 0; - input[0] = cycles ^ j_high ^ irq; - input[1] = now ^ c_high; - ip = regs ? instruction_pointer(regs) : _RET_IP_; - input[2] = ip; - input[3] = ip >> 32; - - fast_mix(fast_pool, input); + fast_mix(fast_pool, (__u32 const *)input); if ((fast_pool->count & 63) && !time_after(now, fast_pool->last + HZ)) return; ^ permalink raw reply related [flat|nested] 57+ messages in thread
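As a quick cross-check of the commit message's claim that the same values still reach fast_mix, the old four-word packing and the new two-word packing can be compared byte for byte. This is my reconstruction for illustration, not kernel code; the byte-identical result assumes a little-endian host, with the two 32-bit halves of each 64-bit word swapped on big-endian, exactly as the commit message says:

```c
#include <stdint.h>
#include <string.h>

static inline uint64_t rol64(uint64_t w, unsigned s)
{
	return (w << s) | (w >> (64 - s));
}

/* Old packing from add_interrupt_randomness(): four 32-bit words. */
static void pack32(uint32_t out[4], uint64_t cycles, uint64_t now,
		   int irq, uint64_t ip)
{
	out[0] = (uint32_t)cycles ^ (uint32_t)(now >> 32) ^ irq;
	out[1] = (uint32_t)now ^ (uint32_t)(cycles >> 32);
	out[2] = (uint32_t)ip;
	out[3] = (uint32_t)(ip >> 32);
}

/* New packing from the patch: two 64-bit words. */
static void pack64(uint64_t out[2], uint64_t cycles, uint64_t now,
		   int irq, uint64_t ip)
{
	out[0] = cycles ^ irq ^ rol64(now, 32);
	out[1] = ip;
}

/* Helper (mine): byte-identical on little-endian hosts. */
static int packings_agree(uint64_t cycles, uint64_t now, int irq, uint64_t ip)
{
	uint32_t old[4];
	uint64_t new64[2];

	pack32(old, cycles, now, irq, ip);
	pack64(new64, cycles, now, irq, ip);
	return memcmp(old, new64, sizeof(old)) == 0;
}
```

The rol64(now, 32) term is what lines the jiffies halves up against the low and high cycle-counter words, so no entropy-bearing bit moves relative to the old layout.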
* Re: random: Benchamrking fast_mix2 2014-06-14 2:10 ` George Spelvin @ 2014-06-14 3:06 ` Theodore Ts'o 2014-06-14 5:25 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-14 3:06 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price On Fri, Jun 13, 2014 at 10:10:14PM -0400, George Spelvin wrote: > > Unrolling doesn't make much difference; which isn't surprising given > > that almost all of the differences go away when I commented out the > > udelay(). Basically, at this point what we're primarily measuring is > > how good various CPU's caches work, especially across context switches > > where other code gets to run in between. > > Huh. As I was referring to when I talked about the branch > predictor, I was hoping the removing *conditional* branches would > help. At least for Intel, between its branch predictor and speculative execution engine, it doesn't make a difference. > Are you trying for an XOR to memory, or is the idea to remain in > registers for the entire operation? > > I'm not sure an XOR to memory is that much better; it's 2 pool loads > and 1 pool store either way. Currently, the store is first (to > input[]) and then both it and the fast_pool are fetched in fast_mix. > > With an XOR to memory, it's load-store-load, but is that really better? The second load can be optimized away. If the compiler isn't smart enough, the store means that the data is almost certainly still in the D-cache. But with a smart compiler (and gcc should be smart enough), if fast_mix is a static function, gcc will inline fast_mix, and then it should be able to optimize out the load. In fact, it might be smart enough to optimize out the first store, since it should be able to realize that first store to the pool[] array will get overwritten by the final store to the pool[] array. 
So hopefully, it will remain in registers for the entire operation, and the compilers will be smart enough to make the right thing happen without the code having to be really ugly. > In case it's useful, below is a small patch I made to > add_interrupt_randomness to take advantage of 64-bit processors and make > it a bit clearer what it's doing. Not submitted officially because: > 1) I haven't examined the consequences on 32-bit processors carefully yet. When I did a quick comparison of your 64-bit fast_mix2 variant, it's much slower than either the 32-bit fast_mix2 or the original fast_mix algorithm. So given that 32-bit processors tend to be slower, I'm pretty sure if we want to add a 64-bit optimization, we'll have to conditionalize it on BITS_PER_LONG == 64 and include both the original code and the 64-bit optimized code. - Ted ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-14 3:06 ` Theodore Ts'o @ 2014-06-14 5:25 ` George Spelvin 2014-06-14 6:24 ` Theodore Ts'o 2014-06-14 6:27 ` Theodore Ts'o 0 siblings, 2 replies; 57+ messages in thread From: George Spelvin @ 2014-06-14 5:25 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price > At least for Intel, between its branch predictor and speculative > execution engine, it doesn't make a difference. *Sigh*. We need live measurement. My testing (in your test harness!) showed a noticeable (~10%) speedup. > When I did a quick comparison of your 64-bit fast_mix2 variant, it's > much slower than either the 32-bit fast_mix2, or the original fast_mix > alrogithm. That is f***ing *bizarre*. For me, it's *significantly* faster. You *are* compiling -m64, right? Because I agree with you it'd be stupid to try to use it on 32-bit machines. Forcing max-speed CPU: # ./perftest ./ted64 fast_mix: 419 fast_mix2: 419 fast_mix4: 318 fast_mix: 386 fast_mix2: 419 fast_mix4: 112 fast_mix: 419 fast_mix2: 510 fast_mix4: 328 fast_mix: 420 fast_mix2: 510 fast_mix4: 306 fast_mix: 420 fast_mix2: 510 fast_mix4: 317 fast_mix: 419 fast_mix2: 510 fast_mix4: 318 fast_mix: 362 fast_mix2: 510 fast_mix4: 317 fast_mix: 420 fast_mix2: 510 fast_mix4: 306 fast_mix: 419 fast_mix2: 499 fast_mix4: 318 fast_mix: 420 fast_mix2: 510 fast_mix4: 328 And not: $ ./ted64 fast_mix: 328 fast_mix2: 430 fast_mix4: 272 fast_mix: 442 fast_mix2: 442 fast_mix4: 272 fast_mix: 442 fast_mix2: 430 fast_mix4: 272 fast_mix: 329 fast_mix2: 442 fast_mix4: 272 fast_mix: 329 fast_mix2: 430 fast_mix4: 272 fast_mix: 328 fast_mix2: 442 fast_mix4: 272 fast_mix: 329 fast_mix2: 431 fast_mix4: 272 fast_mix: 328 fast_mix2: 442 fast_mix4: 272 fast_mix: 328 fast_mix2: 431 fast_mix4: 272 fast_mix: 329 fast_mix2: 442 fast_mix4: 272 And on a Phenom: $ /tmp/ted64 fast_mix: 250 fast_mix2: 174 fast_mix4: 109 fast_mix: 258 fast_mix2: 170 fast_mix4: 114 fast_mix: 371 fast_mix2: 285 fast_mix4: 109 fast_mix: 
516 fast_mix2: 156 fast_mix4: 90 fast_mix: 140 fast_mix2: 184 fast_mix4: 170 fast_mix: 406 fast_mix2: 146 fast_mix4: 88 fast_mix: 185 fast_mix2: 114 fast_mix4: 94 fast_mix: 161 fast_mix2: 116 fast_mix4: 98 fast_mix: 152 fast_mix2: 104 fast_mix4: 94 fast_mix: 352 fast_mix2: 140 fast_mix4: 79 > So given that 32-bit processors tend to be slower, I'm pretty sure > if we want to add a 64-bit optimization, we'll have to conditionalize > it on BITS_PER_LONG == 64 and include both the original code and the > 64-bit optimized code. Sorry I neglected to say so earlier; that has *always* been my intention. The 32-bit version is primary; the 64-bit version is a conditional optimization. If I can make it faster *and* have more avalanche (and less register pressure, too), it seems worth the hassle of having two versions. ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-14 5:25 ` George Spelvin @ 2014-06-14 6:24 ` Theodore Ts'o 2014-06-14 8:03 ` George Spelvin 2014-06-14 6:27 ` Theodore Ts'o 1 sibling, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-14 6:24 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price OK, so I normally do my testing using 32-bit kernels under KVM. On an i386 kernel, running under kvm (again using a Lenovo T540 with an i7-4900MQ CPU), with the original fast_mix, I get a timestamp count of 166 cycles using a weighted smoothed average. Using your fast_mix2 I get around 450 cycles. I'll try a 64-bit kernel build later this weekend. Want to give this patch a try on your collection of machines? - Ted P.S. ADD_INTERRUPT_BENCH works only on x86 architectures because random_get_entropy() maps to get_cycles() which is RDTSC. It won't work on many other architectures, which don't have a timestamp counter. diff --git a/drivers/char/random.c b/drivers/char/random.c index 4bb6e37..90a1c9f 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -267,6 +267,8 @@ #define CREATE_TRACE_POINTS #include <trace/events/random.h> +#define ADD_INTERRUPT_BENCH + /* * Configuration information */ @@ -553,6 +555,7 @@ struct fast_pool { unsigned char last_timer_intr; }; +#ifdef ORIG_FAST_MIX /* * This is a fast mixing routine used by the interrupt randomness * collector. It's hardcoded for an 128 bit pool and assumes that any @@ -579,6 +582,34 @@ static void fast_mix(struct fast_pool *f, __u32 input[4]) f->rotate = input_rotate; f->count++; } +#else +extern void fast_mix(struct fast_pool *f, __u32 const input[4]) +{ + __u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1]; + __u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3]; + int i; + + + for (i = 0; i < 3; i++) { + /* + * Inspired by ChaCha's QuarterRound, but + * modified for much greater parallelism. + * Surprisingly, rotating a and c seems to work + * better than b and d.
And it runs faster. + */ + a += b; c += d; + d ^= a; b ^= c; + a = rol32(a, 15); c = rol32(c, 21); + + a += b; c += d; + d ^= a; b ^= c; + a = rol32(a, 3); c = rol32(c, 7); + } + f->pool[0] = a; f->pool[1] = b; + f->pool[2] = c; f->pool[3] = d; + f->count++; +} +#endif /* * Credit (or debit) the entropy store with n bits of entropy. @@ -829,6 +860,27 @@ EXPORT_SYMBOL_GPL(add_input_randomness); static DEFINE_PER_CPU(struct fast_pool, irq_randomness); +#ifdef ADD_INTERRUPT_BENCH +static unsigned long avg_cycles; + +#define FSHIFT 11 /* nr of bits of precision */ +#define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */ +#define EXP 2040 + +static void add_interrupt_bench(cycles_t start) +{ + cycles_t duration = random_get_entropy() - start; + + /* use a weighted moving average */ + avg_cycles *= EXP; + avg_cycles += duration * (FIXED_1 - EXP); + avg_cycles += 1UL << (FSHIFT - 1); + avg_cycles >>= FSHIFT; +} +#else +#define add_interrupt_bench(x) +#endif + void add_interrupt_randomness(int irq, int irq_flags) { struct entropy_store *r; @@ -850,6 +902,7 @@ void add_interrupt_randomness(int irq, int irq_flags) input[3] = ip >> 32; fast_mix(fast_pool, input); + add_interrupt_bench(cycles); if ((fast_pool->count & 63) && !time_after(now, fast_pool->last + HZ)) return; @@ -1644,6 +1697,15 @@ struct ctl_table random_table[] = { .mode = 0444, .proc_handler = proc_do_uuid, }, +#ifdef ADD_INTERRUPT_BENCH + { + .procname = "add_interrupt_avg_cycles", + .data = &avg_cycles, + .maxlen = sizeof(avg_cycles), + .mode = 0444, + .proc_handler = proc_doulongvec_minmax, + }, +#endif { } }; #endif /* CONFIG_SYSCTL */ ^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-14 6:24 ` Theodore Ts'o @ 2014-06-14 8:03 ` George Spelvin 2014-06-14 11:14 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-14 8:03 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price > Want to give this patch a try on your collection of machines? Not until I fix it to +#if ADD_INTERRUPT_BENCH +static unsigned long avg_cycles; + +#define AVG_SHIFT 8 /* Exponential average factor k=1/256 */ +#define FIXED_1_2 (1 << (AVG_SHIFT-1)) + +static void add_interrupt_bench(cycles_t start) +{ + cycles_t duration = random_get_entropy() - start; + + /* Use a weighted moving average */ + avg_cycles += ((avg_cycles + FIXED_1_2) >> AVG_SHIFT) - duration; +} +#else +#define add_interrupt_bench(x) +#endif + ... because the way you did it was just silly. See net/ipv4/tcp_input.c:tcp_rtt_estimator(). That also shows you how to add a mean deviation estimator, too. I'll go do that now... ^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-14 8:03 ` George Spelvin @ 2014-06-14 11:14 ` George Spelvin 2014-06-14 15:13 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-14 11:14 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price And I have of course embarrassed myself publicly by getting the sign wrong. That's what I get for posting *before* booting the result. You may now point and bray like a donkey. :-) Anyway. the following actually works: #if ADD_INTERRUPT_BENCH static unsigned long avg_cycles, avg_deviation; #define AVG_SHIFT 8 /* Exponential average factor k=1/256 */ #define FIXED_1_2 (1 << (AVG_SHIFT-1)) static void add_interrupt_bench(cycles_t start) { long delta = random_get_entropy() - start; /* Use a weighted moving average */ delta = delta - ((avg_cycles + FIXED_1_2) >> AVG_SHIFT); avg_cycles += delta; /* And average deviation */ delta = abs(delta) - ((avg_deviation + FIXED_1_2) >> AVG_SHIFT); avg_deviation += delta; } #else #define add_interrupt_bench(x) #endif And here are some measurements (uncorrected for *256 scaling) on my primary (Ivy Bridge E) test machine. I've included 10 samples of each value, takesn at 10s intervals. avg_cycles is first, followed by avg_deviation. The three conditions are idle (1.2 GHz), idle with performance governor enabled (3.9 GHz), and during a "make -j7" in the kernel tree (also all processors at maximum). Rather against my intuition, a busy system greatly *reduces* the time spent. Just to see what interrupt rate did, on the last kernel I also tested it while being ping flooded. They're sorted in increasing order of speed. Unrolling definitely makes a difference, but it's not faster than the old code until I drop to 2 iterations in the inner loop (which would be called 4 rounds by most people). The 64-bit mix is noticeably faster yet. 
Idle		performance	make -j7

ORIG_FAST_MIX=0
74761 22228	78799 20305	46527 24966
71984 23619	78466 20599	50044 25202
71949 23760	77262 21363	48295 25460
72966 23859	76188 21921	47393 25130
73974 23543	76040 22135	42979 24341
74601 23407	75294 22602	50502 26715
75359 23169	71267 24990	45649 25338
75450 22855	71065 25022	48792 25731
76338 22711	71569 25016	48564 26040
76546 22567	71143 24972	48414 27882

ORIG_FAST_MIX=0, unrolled:
54830 20312	60343 21814	29577 16699
55510 20787	60655 22504	40344 24749
56994 21080	60691 22497	41095 27184
57674 21566	60261 22713	39578 26717
57560 22221	60690 22709	41361 26138
58220 22593	59978 22924	36334 24249
58646 22422	58391 23466	37125 25089
59485 21927	58000 23968	24091 11892
60444 21959	58633 24486	28816 15585
60637 22133	58576 24593	25125 13174

ORIG_FAST_MIX=1
50554 13117	54732 13010	24043 12804
51294 13623	53269 14066	35671 25957
51063 13682	52531 14214	34391 22749
49833 13612	51833 14272	24097 13802
49458 13624	49288 15046	31378 18110
50383 13936	48720 15540	25088 17320
51167 14210	49058 15637	26478 13247
51356 14157	48206 15787	30542 19717
51077 14155	48587 15742	27461 15865
52199 14361	48710 15933	27608 14826

ORIG_FAST_MIX=0, unrolled, 2 (double) rounds:
43011 10685	44846 10523	21469 10994
42568 10261	43968 10834	19694  8501
42012 10304	43657 10619	19174  8557
42166 10063	43064 10598	20221  9398
41496 10353	42125 10843	19034  6685
42176 10826	41547 10984	19462  8002
41671 10947	40756 11242	21654 12140
41691 10643	40309 11312	20526  9188
41091 10817	40135 11318	20159  9267
41151 10553	39877 11484	19653  8393

64-bit hash, 2 (double) rounds (which is excellent avalanche):

Idle		performance	make -j7	ping -f (from outside)
36117 11269	39285 11171	16953  5664	35107 14735
35391 11035	36564 11600	18143  7216	35322 14176
34728 11280	35278 12085	16815  6245	35479 14453
35552 11606	35627 11863	16876  5841	34717 14505
35553 11633	35145 11892	17825  6166	35241 14555
35468 11406	35773 11857	16834  5094	34814 14719
35301 11390	35357 11771	16750  4987	35248 14566
34841 10821	35785 11531	19170  8296	35627 14103
34818 10942	35045 11592	17004  6814	34948 14399
35113 11158	35469 11343	19344  7969	33859 14035

(Again, all numbers must be divided by 256 to get cycles. You can probably divide by 1000 and multiply by 5 in your head, which is a pretty good approximation.)

^ permalink raw reply	[flat|nested] 57+ messages in thread
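The corrected estimator is easy to sanity-check in user space. The sketch below is a paraphrase, not kernel code: kernel types are replaced with stdlib ones, the measured duration is passed in directly, and the function names are mine. For a constant input, the scaled average settles on exactly that input and the scaled deviation decays to zero.

```c
#include <stdlib.h>

#define AVG_SHIFT 8	/* Exponential average factor k = 1/256 */
#define FIXED_1_2 (1 << (AVG_SHIFT - 1))

static unsigned long avg_cycles, avg_deviation;

/* The same update rule as the corrected add_interrupt_bench(),
 * with the measured duration supplied by the caller. */
static void update_bench(long duration)
{
	long delta = duration - ((long)(avg_cycles + FIXED_1_2) >> AVG_SHIFT);

	avg_cycles += delta;
	delta = labs(delta) - ((long)(avg_deviation + FIXED_1_2) >> AVG_SHIFT);
	avg_deviation += delta;
}

/* Feed n constant samples; return the average scaled back to cycles. */
static long settled_avg(long duration, int n)
{
	int i;

	avg_cycles = avg_deviation = 0;
	for (i = 0; i < n; i++)
		update_bench(duration);
	return (long)(avg_cycles + FIXED_1_2) >> AVG_SHIFT;
}

/* The deviation, scaled back to cycles. */
static long settled_dev(void)
{
	return (long)(avg_deviation + FIXED_1_2) >> AVG_SHIFT;
}
```

With a constant duration of 1000 and enough samples, the average converges to exactly 1000 and the deviation decays to 0, which is what a correct EWMA pair should do.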
* Re: random: Benchamrking fast_mix2 2014-06-14 11:14 ` George Spelvin @ 2014-06-14 15:13 ` George Spelvin 2014-06-14 16:33 ` Theodore Ts'o 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-14 15:13 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

Unfortunately, my test suite is not particularly lightweight. Here are some figures (idle at full speed, and doing a parallel kernel compile) for a 2.5 GHz Phenom. Quite different numbers compared to the Ivy Bridge. When the processor is busy, fast_mix takes longer, instead. And even the rolled-up fast_mix2 is (slightly) faster than the original fast_mix. The 64-bit code is basically the same speed as the 2x unrolled 32-bit code.

Presented in the same strange order as the Ivy Bridge, even though it's not fastest-to-slowest this time.

Idle		make -j7

Rolled up, 3x:
30748  5432	49066 24126
30583  6526	50271 23788
28621  4541	52966 25610
28017  3425	54181 25733
30425  5213	54441 25878
28556  3895	48969 26499
27853  3057	50813 25431
27471  2396	54065 27685
30328  5053	55931 27500
28702  3821	50811 24877

Unrolled, 3x:
26785  4278	40386 12695
25617  3302	41349 16593
25154  2357	41166 15904
24816  1886	40571 17449
27537  4927	41716 15723
28128  6657	42085 18098
25640  3410	41453 16486
25087  2199	39964 12193
27750  5384	38960 13672
26584  4699	38280 13697

Original:
35291 10362	82640 37663
32043  7848	81887 44853
32968  8994	80527 44847
31087  7207	84894 47535
29904  5602	81426 45401
34312  9824	54889 29060
36471 12948	65112 34232
31622  8539	63531 33871
31570  7745	57946 29070
30074  6116	62829 34393

Unrolled, 2x:
24491  2411	40335 16992
27644  6721	38981 16068
25473  4347	35260 10750
24784  3372	34211 10360
24326  2689	34178 11200
26404  4896	34498 10483
24905  3517	33793 10409
24107  2150	35321 11330
23968  1903	33818 11301
26254  4671	32728  9547

64-bit:
26877  7949	52258 25946
24862  5940	55765 27634
24528  5032	56568 28142
24844  5223	55007 27671
27115  8018	55754 27873
25504  6504	54286 27471
24772  5673	52717 26609
24427  4889	51362 27109
27219  8276	51431 22619
25640  6718	52835 26329

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-14 15:13 ` George Spelvin @ 2014-06-14 16:33 ` Theodore Ts'o 2014-06-15 0:23 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-14 16:33 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

OK, using your averaging scheme, on a 32-bit KVM kernel, running at idle, here are my quick results:

Original
48028 51419
46021 50065
44750 49231

Fastmix 2, 3 iterations
95956 58313
97295 57599
97242 56942

Fastmix 2, 2 iterations
68998 41496
68940 41471
68619 41576

Fastmix 2, 2 iterations, unrolled
48725 39281
48313 38424
47349 37847

So with two iterations we're at least no worse than before (and in fact the deviation is less, which makes sense since we don't have the twist_array memory accesses), and I can easily believe there will be architectures/microarchitectures where it will be better.

I'll need to do a bit more looking to convince myself that 2 iterations is better from a mixing perspective, but this looks like it might be a promising replacement for the 32-bit mixer.

- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-14 16:33 ` Theodore Ts'o @ 2014-06-15 0:23 ` George Spelvin 2014-06-15 1:17 ` Theodore Ts'o 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-15 0:23 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> I'll need to do a bit more looking to convince myself that 2
> iterations is better from a mixing perspective, but this looks like it
> might be a promising replacement for the 32-bit mixer.

This was tested with a hacked-up copy of Bob Jenkins' testing code. It tests a few random starting values, and finds the minimum number of output bits affected by an input delta for:
- each input bit or pair of bits,
- the function run forward or in reverse,
- XOR, subtraction and addition delta metrics (5x2 = 10 in total).

The example I posted:

// (29/66353) score = 49/121/123: 6 27 16 14

a += b; c += d;
a = rol32(a, 6); c = rol32(c, 27);
d ^= a; b ^= c;

a += b; c += d;
a = rol32(a, 16); c = rol32(c, 14);
d ^= a; b ^= c;

has, after 2 rounds, a minimum avalanche of 49 bits, taken over all of the variables just mentioned. The only thing maximized over is the different starting values. That seems adequate for something that's only being asked to preserve 1 bit of entropy. (BTW, the way the deltas are defined, the maximum possible score is 124.)

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-15 0:23 ` George Spelvin @ 2014-06-15 1:17 ` Theodore Ts'o 2014-06-15 6:58 ` George Spelvin 0 siblings, 1 reply; 57+ messages in thread From: Theodore Ts'o @ 2014-06-15 1:17 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Sat, Jun 14, 2014 at 08:23:33PM -0400, George Spelvin wrote:
> The example I posted:
>
> // (29/66353) score = 49/121/123: 6 27 16 14
>
> a += b; c += d;
> a = rol32(a, 6); c = rol32(c, 27);
> d ^= a; b ^= c;
>
> a += b; c += d;
> a = rol32(a, 16); c = rol32(c, 14);
> d ^= a; b ^= c;
>
> has, after 2 rounds, a minimum avalanche of 49 bits, taken over all of
> the variables just mentioned. The only thing maximized over is the
> different starting values.

I'm seeing a minimum delta of 40 bits, actually. Which makes it slightly better than your original fast_mix2 (which had a minimum delta of 39) when using 1024 random samples using random(3) to generate a starting pool and setting a single bit in each possible bit position in the input array.

So it's slightly better, and as I mentioned, on my CPU, I'm really not seeing that much difference between fast_mix2() and fast_mix3(). But I'm willing to go with this as being quite sufficient as a mixing function.

- Ted

(Compile the following with -DANALYZE to see the analysis I did.)
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

typedef unsigned int __u32;
typedef unsigned long long __u64;

struct fast_pool {
	__u32 pool[4];
	unsigned long last;
	unsigned short count;
	unsigned char rotate;
	unsigned char last_timer_intr;
};

/**
 * rol32 - rotate a 32-bit value left
 * @word: value to rotate
 * @shift: bits to roll
 */
static inline __u32 rol32(__u32 word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

static inline __u64 rol64(__u64 word, unsigned int shift)
{
	return (word << shift) | (word >> (64 - shift));
}

static __u32 const twist_table[8] = {
	0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
	0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };

extern void fast_mix(struct fast_pool *f, __u32 input[4])
{
	__u32 w;
	unsigned input_rotate = f->rotate;

	w = rol32(input[0], input_rotate) ^ f->pool[0] ^ f->pool[3];
	f->pool[0] = (w >> 3) ^ twist_table[w & 7];
	input_rotate = (input_rotate + 14) & 31;
	w = rol32(input[1], input_rotate) ^ f->pool[1] ^ f->pool[0];
	f->pool[1] = (w >> 3) ^ twist_table[w & 7];
	input_rotate = (input_rotate + 7) & 31;
	w = rol32(input[2], input_rotate) ^ f->pool[2] ^ f->pool[1];
	f->pool[2] = (w >> 3) ^ twist_table[w & 7];
	input_rotate = (input_rotate + 7) & 31;
	w = rol32(input[3], input_rotate) ^ f->pool[3] ^ f->pool[2];
	f->pool[3] = (w >> 3) ^ twist_table[w & 7];
	input_rotate = (input_rotate + 7) & 31;

	f->rotate = input_rotate;
	f->count++;
}

extern void fast_mix2(struct fast_pool *f, __u32 const input[4])
{
	__u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1];
	__u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3];
	int i;

	for (i = 0; i < 2; i++) {
		/*
		 * Inspired by ChaCha's QuarterRound, but
		 * modified for much greater parallelism.
		 */
		a += b; c += d;
		d ^= a; b ^= c;
		a = rol32(a, 15); c = rol32(c, 21);
		a += b; c += d;
		d ^= a; b ^= c;
		a = rol32(a, 3); c = rol32(c, 7);
	}
	f->pool[0] = a; f->pool[1] = b;
	f->pool[2] = c; f->pool[3] = d;
	f->count++;
}

extern void fast_mix3(struct fast_pool *f, __u32 const input[4])
{
	__u32 a = f->pool[0] ^ input[0], b = f->pool[1] ^ input[1];
	__u32 c = f->pool[2] ^ input[2], d = f->pool[3] ^ input[3];
	int i;

	for (i = 0; i < 2; i++) {
		a += b; c += d;
		a = rol32(a, 6); c = rol32(c, 27);
		d ^= a; b ^= c;
		a += b; c += d;
		a = rol32(a, 16); c = rol32(c, 14);
		d ^= a; b ^= c;
	}
	f->pool[0] = a; f->pool[1] = b;
	f->pool[2] = c; f->pool[3] = d;
	f->count++;
}

extern void fast_mix4(struct fast_pool *f, __u32 const input[4])
{
	__u64 a = ((__u64 *)f->pool)[0] ^ ((__u64 const *)input)[0];
	__u64 b = ((__u64 *)f->pool)[1] ^ ((__u64 const *)input)[1];
	int i;

	for (i = 0; i < 2; i++) {
		a += b; b = rol64(b, 52);
		b ^= a; a = rol64(a, 10);
		a += b; b = rol64(b, 47);
		b ^= a; a = rol64(a, 17);
	}
	((__u64 *)f->pool)[0] = a;
	((__u64 *)f->pool)[1] = b;
	f->count++;
}

static void rotate(__u32 a[4])
{
	int i;
	int carry = 0;
	__u32 tmp;

	for (i=0; i < 4; i++) {
		tmp = a[i];
		a[i] = (tmp << 1) + carry;
		carry = (tmp & 0x80000000) ? 1 : 0;
	}
	if (carry)
		a[0]++;
}

int global_min = 9999;

void analyze(void)
{
	struct fast_pool f;
	int i, pc;
	int sum = 0, max = 0, min=9999;
	__u32 input[4];
	__u32 start[4];

	start[0] = random();
	start[1] = random();
	start[2] = random();
	start[3] = random();
	memset(&f, 0, sizeof(f));
	memset(&input, 0, sizeof(input));
	input[0] = 1;
	for (i=0; i < 32; i++) {
		memcpy(f.pool, start, sizeof(start));
		fast_mix3(&f, input);
		pc = (__builtin_popcount(f.pool[0] ^ start[0]) +
		      __builtin_popcount(f.pool[1] ^ start[1]) +
		      __builtin_popcount(f.pool[2] ^ start[2]) +
		      __builtin_popcount(f.pool[3] ^ start[3]));
		sum += pc;
		if (pc > max)
			max = pc;
		if (pc < min)
			min = pc;
		if (pc < global_min)
			global_min = pc;
		rotate(input);
		// printf("%d ", pc);
	}
	// printf("\n");
	// printf("average popcount: %d, max: %d min %d\n", sum / 128, max, min);
}

static __inline__ unsigned long long rdtsc(void)
{
	unsigned long long int x;

	__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
	return x;
}

int main(int argc, char **argv)
{
	struct fast_pool f;
	int i;
	__u32 input[4];
	unsigned volatile long long start_time, end_time;

#ifdef ANALYZE
	for (i=0; i < 1024; i++)
		analyze();
	printf("Global minimum: %d\n", global_min);
	return 0;
#endif
#if !defined(BENCH_FASTMIX) && !defined(BENCH_FASTMIX2)
	for (i=0; i < 20; i++) {
		usleep(50000);
		start_time = rdtsc();
		fast_mix2(&f, input);
		end_time = rdtsc();
		printf("fast_mix2: %llu\t", end_time - start_time);
#if 0
		usleep(50000);
		start_time = rdtsc();
		fast_mix2(&f, input);
		end_time = rdtsc();
		printf("fast_mix2: %llu\t", end_time - start_time);
		usleep(50000);
		start_time = rdtsc();
		fast_mix3(&f, input);
		end_time = rdtsc();
		printf("fast_mix3: %llu\t", end_time - start_time);
#endif
		fputc('\n', stdout);
	}
#endif
#ifdef BENCH_FASTMIX
	for (i=0; i < 10240000; i++) {
		fast_mix(&f, input);
	}
#endif
#ifdef BENCH_FASTMIX2
	for (i=0; i < 10240000; i++) {
		fast_mix2(&f, input);
	}
#endif
	return 0;
}

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-15 1:17 ` Theodore Ts'o @ 2014-06-15 6:58 ` George Spelvin 2014-06-15 13:01 ` Theodore Ts'o 0 siblings, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-15 6:58 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

Actually, could you hold off merging that for a bit? Or if you really want to, add a comment that the rotate constants are preliminary and will be further optimized.

I posted that as WIP, just as evidence that something with better avalanche could be as fast or faster, and I'd like to go back and do a careful re-check of the analysis software, then do an exhaustive search for good rotate constants (there are only 1M possibilities, after all).

I believe the basic structure is solid, and you appear to be persuaded too, but this is an area littered with embarrassing failures by cryptographic amateurs and I'd like to be *very* careful before attaching my S-o-b to it.

Also, credit for inspiration and the analysis techniques: Bob Jenkins. I actually started with his analysis software for SpookyHash (http://www.burtleburtle.net/bob/hash/spooky.html) and then converted it back to "block function" form.

Appended is the software I'm using. It began life as http://www.burtleburtle.net/bob/c/screen.cpp but was translated into plain C and heavily hacked up. It's not pretty, but could you either look at it for mistakes or tell me to pretty it up so you can look at it?

The one thing I notice is that your analyze() function is measuring something totally different than what I'm measuring. Considering fast_mix as a 128->128 bit function (i.e. ignoring the input XOR), you appear to be considering popcount(input[] ^ output[]), and then *minimizing* over a variety of inputs.
I'm doing basically the following:

min_score = 128
for (each 1- or 2-bit delta[]) {
	avalanche[] = 0
	for (5 random arrays input[]) {
		output1[] = fast_mix3(input[])
		output2[] = fast_mix3(input[] ^ delta[])
		avalanche[] |= output1[] ^ output2[]
	}
	score = popcount(avalanche[])
	min_score = min(score, min_score)
}

For each input delta, consider several random inputs and find all of the output bits that that input delta can cause to toggle. The addition carries make the operation non-linear, and so the output delta depends on the input state. I'm interested in all the bits that can *possibly* be toggled.

Bob's code considers output deltas other than xor (minimizing over all of those), and both forward and reverse mixing functions. One of the deltas considered is ~(output1[] ^ output2[]), the set of bits sometimes *not* toggled by an input delta. If an input delta *always* changes most of the bits, that's not as interesting either. (Simple parity and replication can do that.) More discussion of his analysis ideas here:
http://www.burtleburtle.net/bob/hash/avalanche.html
http://www.burtleburtle.net/bob/hash/evahash.html

The following rotate constants achieve 57 bits of avalanche after 2 iterations:

// (91/238891) score(4/5/6) = 57/87/116: 26 14 19 9
// (113/269149) score(4/5/6) = 57/84/115: 28 29 20 8
// (118/277648) score(4/5/6) = 57/89/121: 4 27 9 24
// (179/462783) score(4/5/6) = 57/93/117: 11 27 14 22

This one does 58 bits after 2 iterations, but later scores (for 2.5 and 3 iterations) are a bit low:

// (191/488173) score(4/5/6) = 58/78/113: 9 23 8 15

This one has high scores for 2, 2.5 and 3 iterations:

// (22/48296) score(4/5/6) = 56/98/122: 8 10 22 15

I might go with that one pending further work (like a proper exhaustive search; the current code is random).
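For concreteness, the forward, XOR-delta, single-bit slice of that measurement can be written out as a compilable sketch. This is a paraphrase, not the actual harness: a fixed xorshift generator stands in for the random inputs, only single-bit deltas are tried, and the mix is the 6/27/16/14 candidate quoted earlier.

```c
#include <stdint.h>

#define ROL32(x, k) (((x) << (k)) | ((x) >> (32 - (k))))

/* The quoted candidate: 2 iterations of the 6/27/16/14 round. */
static void mix(uint32_t v[4])
{
	uint32_t a = v[0], b = v[1], c = v[2], d = v[3];
	int i;

	for (i = 0; i < 2; i++) {
		a += b; c += d;
		a = ROL32(a, 6); c = ROL32(c, 27);
		d ^= a; b ^= c;
		a += b; c += d;
		a = ROL32(a, 16); c = ROL32(c, 14);
		d ^= a; b ^= c;
	}
	v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}

/* Small deterministic PRNG so the measurement is repeatable. */
static uint32_t xorshift32(uint32_t *s)
{
	*s ^= *s << 13; *s ^= *s >> 17; *s ^= *s << 5;
	return *s;
}

/*
 * Minimum, over all 128 single-bit input deltas, of the number of
 * output bits the delta toggles in at least one of 8 random trials.
 */
static int min_avalanche(void)
{
	uint32_t seed = 0x243f6a88;	/* arbitrary fixed seed */
	int bit, t, i, w, min = 128;

	for (bit = 0; bit < 128; bit++) {
		uint32_t flipped[4] = { 0, 0, 0, 0 };

		for (t = 0; t < 8; t++) {
			uint32_t x[4], y[4];

			for (i = 0; i < 4; i++)
				x[i] = y[i] = xorshift32(&seed);
			y[bit / 32] ^= (uint32_t)1 << (bit % 32);
			mix(x);
			mix(y);
			for (i = 0; i < 4; i++)
				flipped[i] |= x[i] ^ y[i];
		}
		w = __builtin_popcount(flipped[0]) + __builtin_popcount(flipped[1])
		  + __builtin_popcount(flipped[2]) + __builtin_popcount(flipped[3]);
		if (w < min)
			min = w;
	}
	return min;
}
```

This is a weaker metric than the full 10-measure suite (no pairs, no reverse function, no sum/difference deltas), so its minimum should be at least as high as the 49 quoted for the full test.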
I'm also experimenting to see if swapping + for - helps, e.g.:

a ^= b; c ^= d;
b = Rotl(b, K1); d = Rotl(d, K2);
d += a; b -= c;

a ^= b; c ^= d;
b = Rotl(b, K3); d = Rotl(d, K4);
d += a; b -= c;

That has rotate sets like

// (13/484210) score(4/5/6) = 57/90/118: 6 25 16 21
// (14/490774) score(4/5/6) = 56/99/122: 25 9 19 7
// (23/804540) score(4/5/6) = 56/98/122: 17 20 10 4
// (35/1014495) score(4/5/6) = 57/94/119: 7 27 4 20
// (46/1254928) score(4/5/6) = 56/100/120: 20 18 5 10
// (47/1260723) score(4/5/6) = 56/97/120: 13 22 8 5
// (62/1627276) score(4/5/6) = 57/94/120: 6 5 10 20
// (74/1943366) score(4/5/6) = 58/92/121: 12 11 18 26
// (77/2097408) score(4/5/6) = 57/96/119: 19 18 5 11

(The following code was recently hacked up to operate in "half-iterations" where not every rotate constant is used an equal number of times. You'll still see residue of the old semantics in places.)

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#if 1
#define BITS 32
typedef uint32_t word_t;
#define Count(x) __builtin_popcount(x)
#else
#define BITS 64
typedef uint64_t word_t;
#define Count(x) __builtin_popcountll(x)
#endif

static word_t Rotl(word_t x, int k)
{
	return (x << k) | (x >> (BITS-k));
}

static word_t Rotr(word_t x, int k)
{
	return (x << (BITS-k)) | (x >> k);
}

static uint64_t Rotl64(uint64_t x, int k)
{
	return (x << k) | (x >> (64 - k));
}

//
// try to find an adequate long-message mixing function for SpookyHash
//
struct Random {
	uint64_t a, b, c, d;
};

static uint64_t RandValue(struct Random *r)
{
	uint64_t e = r->a - Rotl64(r->b, 23);
	r->a = r->b ^ Rotl64(r->c, 16);
	r->b = r->c + Rotl64(r->d, 11);
	r->c = r->d + e;
	return r->d = e + r->a;
}

static void RandInit(struct Random *r, uint64_t seed)
{
	int i;

	r->a = 0xdeadbeef;
	r->b = r->c = r->d = seed;
	for (i = 0; i < 20; ++i)
		(void)RandValue(r);
}

#define VARS 4
#define ROTS 4

struct Sieve {
	int sh[ROTS];
	struct Random r;
};

void SieveInit(struct Sieve *s, int seed)
{
	RandInit(&s->r, seed);
}

void SievePrint(struct Sieve const *s, FILE *f)
{
	int i;

	for (i = 0; i < ROTS; i++)
		fprintf(f, " %2d", s->sh[i]);
	putc('\n', f);
}

// generate a new function at random
static void SieveGenerate(struct Sieve *s)
{
	uint64_t v = RandValue(&s->r);
	int i;

	/* Fill in the shift amounts */
	for (i = 0; i < ROTS; i++) {
		s->sh[i] = v % BITS;
		v /= BITS;
	}
}

#if 1	/* Quick option */
#define DO_ROT(a, b, s) (a = Rotl(a, s))
#else
#define DO_ROT(a, b, s) (b = Rotl(b, s))
#endif

#if VARS == 4

// evaluate the function
static void Fun(int const sh[VARS], word_t state[VARS], int iter)
{
	int i;
	word_t a = state[0];
	word_t b = state[1];
	word_t c = state[2];
	word_t d = state[3];

	for (i = 0; i < 2*iter; i += 2) {
#if 1
		a ^= b; c ^= d;
		b = Rotl(b, sh[i%ROTS]);
		d = Rotl(d, sh[(i+1)%ROTS]);
		d += a; b += c;
#elif 1
		a += b; c += d;
		d = Rotl(d, sh[i%ROTS]);
		d ^= a; b ^= c;
		c = Rotl(c, sh[(i+1)%ROTS]);
#elif 1
		a += b; b ^= c; c -= d;
		d = Rotl(d, sh[i%ROTS]);
		c += d; d ^= a; a -= b;
		b = Rotl(b, sh[(i+1)%ROTS]);
#else
		a += b; c += d;
		d = Rotl(d, sh[i%ROTS]);
		d ^= a; b ^= c;
		c = Rotl(c, sh[(2*i+1)%ROTS1]);
#endif
	}
	state[0] = a;
	state[1] = b;
	state[2] = c;
	state[3] = d;
}

static void RFun(int const sh[VARS], word_t state[VARS], int iter)
{
	int i;
	word_t a = state[0];
	word_t b = state[1];
	word_t c = state[2];
	word_t d = state[3];

	while (iter) {
		i = 2 * --iter;
#if 1
		d -= a; b -= c;
		b = Rotr(b, sh[i%ROTS]);
		d = Rotr(d, sh[(i+1)%ROTS]);
		a ^= b; c ^= d;
#elif 1
		c -= d; d ^= a; a += b;
		b = Rotr(b, sh[(i+1) % ROTS]);
		a -= b; b ^= c; c += d;
		d = Rotr(d, sh[i % ROTS]);
#else
		d ^= a; b ^= c;
		a = Rotr(a, sh[(i+1)%ROTS]);
		a -= b; c -= d;
		b = Rotr(b, sh[i%ROTS]);
#endif
	}
	state[0] = a;
	state[1] = b;
	state[2] = c;
	state[3] = d;
}

#elif VARS == 2

// evaluate the function
static void Fun(int const sh[VARS], word_t state[VARS], int iter)
{
	int i;
	word_t a = state[0];
	word_t b = state[1];

	for (i = 0; i < iter; i++) {
		a += b; b = Rotl(b, sh[0]);
		b ^= a; a = Rotl(a, sh[1]);
#if ROTS > 2
		a += b; b = Rotl(b, sh[2]);
		b ^= a; a = Rotl(a, sh[3]);
#endif
	}
	state[0] = a;
	state[1] = b;
}

static void RFun(int const sh[VARS], word_t state[VARS], int iter)
{
	int i;
	word_t a = state[0];
	word_t b = state[1];

	for (i = 0; i < iter; i++) {
		a = Rotr(a, sh[3]); b ^= a;
		b = Rotr(b, sh[2]); a -= b;
#if ROTS > 2
		a = Rotr(a, sh[1]); b ^= a;
		b = Rotr(b, sh[0]); a -= b;
#endif
	}
	state[0] = a;
	state[1] = b;
}
#endif

#define TRIALS 8
#define MEASURES 10

static int OneTest(struct Sieve *s, bool forward, int limit, int iter)
{
	int minVal = VARS*BITS;
	int i, j, k, l;

	for (i = 0; i < VARS*BITS; i++) {
		for (j = i; j < VARS*BITS; j++) {
			word_t total[MEASURES][VARS] = { { 0 } };

			for (k = 0; k < TRIALS; k++) {
				word_t a[VARS], b[VARS];

				for (l = 0; l < VARS; l++)
					b[l] = a[l] = RandValue(&s->r);
				/* Alter the second one */
				b[i/BITS] ^= (word_t)1 << (i % BITS);
				if (i != j)
					b[j/BITS] ^= (word_t)1 << (j % BITS);
				/* Run the function */
				if (forward) {
					Fun(s->sh, a, iter);
					Fun(s->sh, b, iter);
				} else {
					RFun(s->sh, a, iter);
					RFun(s->sh, b, iter);
				}
				/* Now compute differences */
				for (l = 0; l < VARS; l++) {
					word_t t;

					total[0][l] |= a[l];
					total[5][l] |= ~a[l];
					total[1][l] |= b[l];
					total[6][l] |= ~b[l];
					t = a[l] ^ b[l];
					total[2][l] |= t;
					total[7][l] |= ~t;
					t = a[l] - b[l];
					t ^= t >> 1;	/* Gray-code it */
					total[3][l] |= t;
					total[8][l] |= ~t;
					t = a[l] + b[l];
					t ^= t >> 1;	/* Gray-code it */
					total[4][l] |= t;
					total[9][l] |= ~t;
				}
			}
			/* Find minimum weight by all measures */
			for (k = 0; k < MEASURES; k++) {
				int counter = 0;

				for (l = 0; l < VARS; l++)
					counter += Count(total[k][l]);
				if (counter < minVal) {
					minVal = counter;
					if (counter < limit)
						return counter;
				}
			}
		}	/* j = second bit */
	}	/* i = first bit */
	return minVal;
}

static int TwoTest(struct Sieve *s, int limit, int iter)
{
	int score1, score2;
	word_t a[VARS], b[VARS];
	int i;

	/* Quick sanity test */
	for (i = 0; i <= iter; i++) {
		int j;

		for (j = 0; j < VARS; j++)
			b[j] = a[j] = RandValue(&s->r);
		Fun(s->sh, a, i);
		RFun(s->sh, a, i);
		for (j = 0; j < VARS; j++)
			assert(a[j] == b[j]);
		RFun(s->sh, a, i);
		Fun(s->sh, a, i);
		for (j = 0; j < VARS; j++)
			assert(a[j] == b[j]);
	}

	score1 = OneTest(s, true, limit, iter);
	if (score1 < limit)
		return score1;
	score2 = OneTest(s, false, limit, iter);
	return score1 < score2 ? score1 : score2;
}

static void driver(int seed, FILE *f, unsigned n, int iter, int limit, int limit2)
{
	unsigned i, found = 0;
	struct Sieve s;
	int best = 0;

	SieveInit(&s, seed);
	for (i = 0; found < n; i++) {
		int score0, score, score2;

		if (i % 1000 == 0) {
			printf("%u\r", i);
			fflush(stdout);
		}
		SieveGenerate(&s);
		score0 = TwoTest(&s, limit, iter-1);
		if (score0 < limit)
			continue;
		score = TwoTest(&s, limit2, iter);
		if (score < limit2)
			continue;
		if (score > best)
			best = score;
		score2 = TwoTest(&s, 0, iter+1);
		fprintf(f, "// (%u/%u) score(%d/%d/%d) = %d/%d/%d:",
			found, i, iter-1, iter, iter+1, score0, score, score2);
		SievePrint(&s, f);
		found++;
	}
	printf("Best score: %d\n", best);
}

int main(void)
{
//	driver(21, stdout, 200, 6, 108);
//	driver(21, stdout, 200, 5, 70);
//	driver(21, stdout, 200, 2, 42);
	driver(21, stdout, 200, 5, 45, 70);
//	driver(21, stdout, 200, 4, 72, 90);
	return 0;
}

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-15 6:58 ` George Spelvin @ 2014-06-15 13:01 ` Theodore Ts'o 0 siblings, 0 replies; 57+ messages in thread From: Theodore Ts'o @ 2014-06-15 13:01 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Sun, Jun 15, 2014 at 02:58:03AM -0400, George Spelvin wrote:
> Actually, could you hold off merging that for a bit? Or if you really
> want to, add a comment that the rotate constants are preliminary and
> will be further optimized.
>
> I posted that as WIP, just as evidence that something with better
> avalanche could be as fast or faster, and I'd like to go back and do a
> careful re-check of the analysis software, then do an exhaustive search
> for good rotate constants (there are only 1M possibilities, after all).
>
> I believe the basic structure is solid, and you appear to be persuaded
> too, but this is an area littered with embarrassing failures by
> cryptographic amateurs and I'd like to be *very* careful before attaching
> my S-o-b to it.

Keep in mind that we're using this *just* as a mixing function. It's not like we have to worry about things like a chosen plaintext attack --- the attacker can't look at the output of /dev/random, force time to go backwards, and then force the CPU counter to be one bit different, run time forwards, and then compare the new output of /dev/random and try to do some kind of differential cryptanalysis, for example. Or if they could, the ability to run time backwards could be used in much more entertaining ways. (See also the recent movie, "Edge of Tomorrow" :-)

And given my quick tests, I'm convinced that it's better than what we had before, which is why I decided to commit the code. We can make further deltas from what we have, so long as each step forward is an improvement.
That's not to say that more analysis would be a bad thing, but I have some other things I need to attend to; if you come up with better rotational values, let me know, and when I do have more time, I can come back to trying to do a bit more work on this particular front.

> The one thing I notice is that your analyze() function is measuring
> something totally different than what I'm measuring.
>
> Considering fast_mix as a 128->128 bit function (i.e. ignoring
> the input XOR), you appear to be considering
> popcount(input[] ^ output[]), and then *minimizing* over a
> variety of inputs.

Yes, part of this comes from my bias from the previous mixing function. Since we were using a modified LFSR, it had the property that if you repeatedly mixed with the input array containing all zeroes, we would iterate over every possible state in the pool, with equal probability. So the XOR of the input array simply jumped us to a different point in the 2**N cycle of possible states, such that even if there was a bias where (for example) only the bottom 3 bits in the input array were significant, those three bits of uncertainty could potentially affect any bit in the entropy pool --- as opposed to only twiddling the bottom three bits in the pool in the absence of having the mixing function.

Anyway, for the purposes of the fast_mix, it's quite possible that simply using the input_rotate and discarding everything else would have been sufficient. There was a bit of overdesign even at the beginning.

Cheers,

- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: random: Benchamrking fast_mix2 2014-06-14 5:25 ` George Spelvin 2014-06-14 6:24 ` Theodore Ts'o @ 2014-06-14 6:27 ` Theodore Ts'o 1 sibling, 0 replies; 57+ messages in thread From: Theodore Ts'o @ 2014-06-14 6:27 UTC (permalink / raw) To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Sat, Jun 14, 2014 at 01:25:27AM -0400, George Spelvin wrote:
> > When I did a quick comparison of your 64-bit fast_mix2 variant, it's
> > much slower than either the 32-bit fast_mix2, or the original fast_mix
> > algorithm.
>
> That is f***ing *bizarre*. For me, it's *significantly* faster.
> You *are* compiling -m64, right? Because I agree with you it'd
> be stupid to try to use it on 32-bit machines.

Sorry for not being clear. My results are similar to yours with -m64. It was much slower when compiled with -m32. (Which may have been obvious, but one of the things I've learned very early on with benchmarking is you measure, and don't assume, whenever possible....)

- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread
* [RFC] random: is the IRQF_TIMER test working as intended? 2014-06-13 15:52 ` Theodore Ts'o 2014-06-14 2:10 ` George Spelvin @ 2014-06-14 4:55 ` George Spelvin 2014-06-14 6:43 ` Theodore Ts'o 1 sibling, 1 reply; 57+ messages in thread From: George Spelvin @ 2014-06-14 4:55 UTC (permalink / raw) To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

I'm trying to understand the entropy credit computation in add_interrupt_randomness. A few things confuse me, and I'm wondering if it's intended to be that way.

1) Since the number of samples between spills to the input pool is variable (with > 64 samples now possible due to the trylock), wouldn't it make more sense to accumulate an entropy estimate?

2) Why only deny entropy credit for back-to-back timer interrupts? If both t2 - x and x - t1 are worth credit, why not t2 - t1? It seems a lot better (not to mention simpler) to not credit any timer interrupt, so x - t1 will get credit but not t2 - x.

3) Why only consider the status of the interrupts when spills occur? This is the most confusing. The whole __IRQF_TIMER and last_timer_intr logic simply skips over the intermediate samples, so it actually detects timer interrupts 64 interrupts (or 1 second) apart. Shouldn't that sort of thing actually be looking at *consecutive* calls to add_interrupt_randomness?

4) If the above logic denies credit, why deny credit for arch_get_random_seed_long as well?

For discussion, here's an example of a change that fixes all of the above, in patch form. (The credit_entropy_frac function is omitted but hopefully obvious.)

The amount of entropy credit particularly needs thought. I'm currently using 1/8 of a bit per sample to keep the patch as simple as possible. This is 8x the current credit if interrupts are frequent, but less if they occur at less than 8 Hz. That actually seems on the conservative side of reasonable to me (1/8 of a bit is odds of 1 in 58.3817), particularly if there's a cycle timer.
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 03c385f5..c877cb65 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -548,9 +548,8 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,
 struct fast_pool {
 	__u32 pool[4];
 	unsigned long last;
-	unsigned short count;
+	unsigned short entropy;	/* Entropy, in fractional bits */
 	unsigned char rotate;
-	unsigned char last_timer_intr;
 };
 
 /*
@@ -577,7 +576,6 @@ static void fast_mix(struct fast_pool *f, __u32 input[4])
 	input_rotate = (input_rotate + 7) & 31;
 
 	f->rotate = input_rotate;
-	f->count++;
 }
 
 /*
@@ -851,15 +849,33 @@ void add_interrupt_randomness(int irq, int irq_flags)
 
 	fast_mix(fast_pool, input);
 
-	if ((fast_pool->count & 63) && !time_after(now, fast_pool->last + HZ))
+	/*
+	 * If we don't have a valid cycle counter, don't give credit for
+	 * timer interrupts.  Otherwise, credit 1/8 bit per interrupt.
+	 * (Should there be a difference if there's a cycle counter?)
+	 */
+	if (cycles || (irq_flags & IRQF_TIMER) == 0)
+		credit = 1;	/* 1/8 bit */
+	else
+		credit = 0;
+
+	credit += fast_pool->entropy;
+
+	if (credit < 8 << ENTROPY_SHIFT &&
+	    !time_after(now, fast_pool->last + HZ)) {
+		fast_pool->entropy = credit;
 		return;
+	}
+
+	credit = min_t(int, credit, 32 << ENTROPY_SHIFT);
 
 	r = nonblocking_pool.initialized ? &input_pool : &nonblocking_pool;
 	if (!spin_trylock(&r->lock)) {
-		fast_pool->count--;
+		fast_pool->entropy = credit;
 		return;
 	}
 	fast_pool->last = now;
+	fast_pool->entropy = 0;
 	__mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool));
 
 	/*
@@ -867,28 +883,13 @@ void add_interrupt_randomness(int irq, int irq_flags)
 	 * add it to the pool.  For the sake of paranoia count it as
 	 * 50% entropic.
 	 */
-	credit = 1;
 	if (arch_get_random_seed_long(&seed)) {
 		__mix_pool_bytes(r, &seed, sizeof(seed));
-		credit += sizeof(seed) * 4;
+		credit += sizeof(seed) * 4 << ENTROPY_SHIFT;
 	}
 	spin_unlock(&r->lock);
 
-	/*
-	 * If we don't have a valid cycle counter, and we see
-	 * back-to-back timer interrupts, then skip giving credit for
-	 * any entropy, otherwise credit 1 bit.
-	 */
-	if (cycles == 0) {
-		if (irq_flags & __IRQF_TIMER) {
-			if (fast_pool->last_timer_intr)
-				credit = 0;
-			fast_pool->last_timer_intr = 1;
-		} else
-			fast_pool->last_timer_intr = 0;
-	}
-
-	credit_entropy_bits(r, credit);
+	credit_entropy_frac(r, credit);
 }
 
 #ifdef CONFIG_BLOCK

^ permalink raw reply related	[flat|nested] 57+ messages in thread
* Re: [RFC] random: is the IRQF_TIMER test working as intended?
  2014-06-14  4:55 ` [RFC] random: is the IRQF_TIMER test working as intended? George Spelvin
@ 2014-06-14  6:43 ` Theodore Ts'o
  2014-06-14  7:23 ` George Spelvin
  0 siblings, 1 reply; 57+ messages in thread
From: Theodore Ts'o @ 2014-06-14 6:43 UTC (permalink / raw)
To: George Spelvin; +Cc: hpa, linux-kernel, mingo, price

On Sat, Jun 14, 2014 at 12:55:20AM -0400, George Spelvin wrote:
> I'm trying to understand the entropy credit computation in
> add_interrupt_randomness. A few things confuse me, and I'm
> wondering if it's intended to be that way.

In general, yes, it's intended this way. I'm trying to be extremely conservative with my entropy measurements, partly because there is generally a huge number of interrupts available, at least on desktop systems, and I'd much rather be very conservative than not. The only downside to being slow is that it takes longer to generate a GPG key, since that uses /dev/random, but I'm OK with that....

> 1) Since the number of samples between spills to the input pool is
> variable (with > 64 samples now possible due to the trylock), wouldn't
> it make more sense to accumulate an entropy estimate?

In general, we will probably only retry a few times, so it's not worth it.

> 2) Why deny entropy credit only for back-to-back timer interrupts?
> If both t2 - x and x - t1 are worth credit, why not t2 - t1?
> It seems a lot better (not to mention simpler) not to credit any
> timer interrupt, so x - t1 gets credit but t2 - x does not.

> 3) Why consider the status of the interrupts only when spills occur?
> This is the most confusing part. The whole __IRQF_TIMER and last_timer_intr
> logic simply skips over the intermediate samples, so it actually
> detects timer interrupts 64 interrupts (or 1 second) apart.
> Shouldn't that sort of thing be looking at *consecutive*
> calls to add_interrupt_randomness?
What I'd probably do instead is count the number of timer interrupts, and if more than 50% are timer interrupts, give 0 bits of credit, else give 1 bit of credit each time we push from the fast pool to the input pool. Yes, that's being super conservative.

Part of this is because on modern machines most of the oscillators are driven off of a single clock, and because not all architectures have a timestamp clock. We could probably be more aggressive on x86 systems, but I wouldn't be comfortable being more aggressive on ARM systems. And so, to keep things simple, I've only given a single credit per push.

The other reason why I haven't been in a hurry to be more aggressive about entropy credit is that even with the current super-conservative estimates, on my T540 laptop I get the "nonblocking pool is initialized" message 2.8 seconds into the boot, which is before all of my USB devices have been enumerated, and before the root file system is mounted (4 seconds into the boot). Since this is well before the SSH host keys get generated in the init scripts after the first boot, I figured it's quite good enough. :-)

> 4) If the above logic denies credit, why deny credit for
> arch_get_random_seed_long as well?

Whoops, that's a bug, introduced when I reordered the arch_get_random_seed_long call so it would be done while the spinlock was still taken. Thanks for pointing that out. I'll get that fixed on the random.git tree's dev branch.

- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: [RFC] random: is the IRQF_TIMER test working as intended?
  2014-06-14  6:43 ` Theodore Ts'o
@ 2014-06-14  7:23 ` George Spelvin
  0 siblings, 0 replies; 57+ messages in thread
From: George Spelvin @ 2014-06-14 7:23 UTC (permalink / raw)
To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

> In general, yes, it's intended this way. I'm trying to be extremely
> conservative with my entropy measurements, and part of it is because
> there is generally a huge number of interrupts available, at least on
> desktop systems, and I'd much rather be very conservative than not.

To be absolutely clear: being more aggressive is not the point. Using 1/8 of a bit per sample was simply for convenience, to keep the patch smaller. It can easily be adapted to be strictly more conservative.

Consider the changes that make it more conservative:
- Allow credit of less than 1 bit.
- If we get interrupts very rarely, credit *less* entropy.
- Only allow credit for one side of a timer interrupt, not both. If t2 - t1 is too predictable, then x - t1 has all of the entropy that's available; t2 - x provides no new information.

> What I'd probably do instead is count the number of timer
> interrupts, and if more than 50% are timer interrupts, give 0 bits of
> credit, else give 1 bit of credit each time we push from the fast pool
> to the input pool. Yes, that's being super conservative.

If we're down in the 0/1 range, I really like the idea of allowing fractional credit. How about crediting 1/64 of a bit per non-timer interrupt? Equivalent result, but more linear. (Sorry if my digression about the sanity of 1/8 bit per sample confused things. I was just trying to say "it's not totally crazy", not "you should do this".)

>> 1) Since the number of samples between spills to the input pool is
>> variable (with > 64 samples now possible due to the trylock), wouldn't
>> it make more sense to accumulate an entropy estimate?

> In general, we will probably only retry a few times, so it's not
> worth it.
I'm not actually worried about the "too many samples" case, but the "too few". The worrisome case is when someone on an energy-saving quest succeeds in tuning the kernel (or just this particular processor) so that it gets less than 1 interrupt per second. Then every interrupt credits 1 bit of entropy. Is *that* super-conservative?

I agree that longer delays have more jitter, so each sample is worth a little more, but shouldn't we try to get a curve the same shape as reality and *then* apply the safety factors? Surely the presence or absence of intermediate samples makes *some* difference?

^ permalink raw reply	[flat|nested] 57+ messages in thread
* Re: drivers/char/random.c: More futzing about
  2014-06-11 16:38 ` Theodore Ts'o
  2014-06-11 16:48 ` H. Peter Anvin
  2014-06-12  0:32 ` George Spelvin
@ 2014-06-12  3:43 ` George Spelvin
  2 siblings, 0 replies; 57+ messages in thread
From: George Spelvin @ 2014-06-12 3:43 UTC (permalink / raw)
To: linux, tytso; +Cc: hpa, linux-kernel, mingo, price

Just to add to my total confusion about the totally disparate performance numbers we're seeing, I did some benchmarks on other machines. The speedup isn't as good one-pass as it is iterated, and as I mentioned it's slower on a P4, but it's not 7 times slower by any stretch.

These are all 1-iteration numbers, run immediately after scp-ing the binary to the machine, so there's no possibility of anything being cached. (The "64" and "32" versions are compiled -m64 and -m32, of course.)

2.5 GHz Phenom 9850:

$ /tmp/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 199 142 (-57)  1: 104 95 (-9)  2: 104 110 (+6)  3: 103 109 (+6)  4: 105 89 (-16)
5: 103 88 (-15)  6: 104 89 (-15)  7: 104 95 (-9)  8: 105 85 (-20)  9: 105 85 (-20)

$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 324 147 (-177)  1: 100 86 (-14)  2: 100 99 (-1)  3: 100 88 (-12)  4: 100 86 (-14)
5: 100 86 (-14)  6: 100 89 (-11)  7: 100 111 (+11)  8: 100 111 (+11)  9: 100 88 (-12)

$ /tmp/random64 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 554788 220327 (-334461)  1: 554825 220176 (-334649)  2: 553505 220148 (-333357)
3: 554661 220064 (-334597)  4: 569559 220064 (-349495)  5: 612798 220065 (-392733)
6: 570287 220064 (-350223)  7: 554790 220064 (-334726)  8: 554715 220065 (-334650)
9: 569840 220064 (-349776)

$ /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 520117 280225 (-239892)  1: 520125 280154 (-239971)  2: 520104 280094 (-240010)
3: 520079 280060 (-240019)  4: 520069 280060 (-240009)  5: 520060 280060 (-240000)
6: 558971 280060 (-278911)  7: 520102 280060 (-240042)  8: 520082 280060 (-240022)
9: 520058 280060 (-239998)

3 GHz i5-3330:

$ /tmp/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 78 75 (-3)  1: 36 33 (-3)  2: 33 39 (+6)  3: 36 30 (-6)  4: 36 33 (-3)
5: 30 33 (+3)  6: 30 54 (+24)  7: 24 48 (+24)  8: 27 33 (+6)  9: 30 33 (+3)

$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 66 78 (+12)  1: 39 39 (+0)  2: 36 39 (+3)  3: 45 33 (-12)  4: 42 33 (-9)
5: 33 42 (+9)  6: 45 33 (-12)  7: 39 36 (-3)  8: 105 48 (-57)  9: 42 39 (-3)

$ /tmp/random64 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 406188 218104 (-188084)  1: 402620 246968 (-155652)  2: 402652 239840 (-162812)
3: 402720 200312 (-202408)  4: 402584 200080 (-202504)  5: 447488 200228 (-247260)
6: 402788 200312 (-202476)  7: 402688 200080 (-202608)  8: 427140 224320 (-202820)
9: 402576 200080 (-202496)

$ /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 406485 266670 (-139815)  1: 392694 266463 (-126231)  2: 392496 266763 (-125733)
3: 426003 266145 (-159858)  4: 392688 266667 (-126021)  5: 432231 266589 (-165642)
6: 392754 298734 (-94020)  7: 392883 284994 (-107889)  8: 392637 266694 (-125943)
9: 392985 267024 (-125961)

3.5 GHz i7-2700:

# /tmp/perftest /tmp/random64
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 82 90 (+8)  1: 38 41 (+3)  2: 46 38 (-8)  3: 35 41 (+6)  4: 46 41 (-5)
5: 38 38 (+0)  6: 41 55 (+14)  7: 41 35 (-6)  8: 46 24 (-22)  9: 35 38 (+3)

# /tmp/perftest /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 82 76 (-6)  1: 32 53 (+21)  2: 49 44 (-5)  3: 35 41 (+6)  4: 46 35 (-11)
5: 35 44 (+9)  6: 49 50 (+1)  7: 41 41 (+0)  8: 32 44 (+12)  9: 49 44 (-5)

# /tmp/perftest /tmp/random64 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 445486 227993 (-217493)  1: 445089 227602 (-217487)  2: 445381 227736 (-217645)
3: 445220 227661 (-217559)  4: 445159 227798 (-217361)  5: 445285 227608 (-217677)
6: 445005 227806 (-217199)  7: 445439 227608 (-217831)  8: 445328 227798 (-217530)
9: 445533 227625 (-217908)

# /tmp/perftest /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 437001 335959 (-101042)  1: 437004 336120 (-100884)  2: 436683 335936 (-100747)
3: 436806 336452 (-100354)  4: 436996 335988 (-101008)  5: 436916 336210 (-100706)
6: 437042 354722 (-82320)  7: 436896 336146 (-100750)  8: 436981 336741 (-100240)
9: 437016 331920 (-105096)

And on a 1.6 GHz P4:

$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 312 176 (-136)  1: 112 112 (+0)  2: 112 112 (+0)  3: 112 112 (+0)  4: 112 112 (+0)
5: 112 112 (+0)  6: 112 112 (+0)  7: 112 112 (+0)  8: 112 112 (+0)  9: 112 112 (+0)

$ /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 480344 550192 (+69848)  1: 480088 550084 (+69996)  2: 480140 550128 (+69988)
3: 480088 550084 (+69996)  4: 480088 550084 (+69996)  5: 480088 550084 (+69996)
6: 480088 550084 (+69996)  7: 480088 550084 (+69996)  8: 480088 550084 (+69996)
9: 480088 550084 (+69996)

And just for lulz, a 1.533 GHz Athlon XP:

$ /tmp/random32
pool 1 = 85670974 e96b1f8f 51244abf 5863283f
pool 2 = 03564c6c eba81d03 55c77fa1 760374a7
0: 91 83 (-8)  1: 58 57 (-1)  2: 58 57 (-1)  3: 58 57 (-1)  4: 58 45 (-13)
5: 58 45 (-13)  6: 58 45 (-13)  7: 58 45 (-13)  8: 58 45 (-13)  9: 58 45 (-13)

$ /tmp/random32 10000
pool 1 = 54b3ba06 4769d67a eb04bbf3 5e42df6e
pool 2 = 9d3c469e 6fecdb60 423af4ca 465173d1
0: 570094 380159 (-189935)  1: 550045 380088 (-169957)  2: 550028 380043 (-169985)
3: 568533 380087 (-188446)  4: 550028 380021 (-170007)  5: 550028 380021 (-170007)
6: 550028 380021 (-170007)  7: 550028 380021 (-170007)  8: 550028 382985 (-167043)
9: 550088 380021 (-170067)

^ permalink raw reply	[flat|nested] 57+ messages in thread
end of thread, other threads: [~2014-06-15 13:01 UTC | newest]

Thread overview: 57+ messages -- links below jump to the message on this page:

2014-06-09  0:05 [RFC PATCH] drivers/char/random.c: Is reducing locking range like this safe? George Spelvin
2014-06-09  1:35 ` Theodore Ts'o
2014-06-09  2:10   ` George Spelvin
2014-06-09  2:18     ` George Spelvin
2014-06-09  4:03     ` George Spelvin
2014-06-09  9:23       ` George Spelvin
2014-06-09 13:34     ` Theodore Ts'o
2014-06-09 15:04       ` George Spelvin
2014-06-09 15:50         ` Theodore Ts'o
2014-06-09 16:11           ` George Spelvin
2014-06-10  0:20             ` drivers/char/random.c: more ruminations George Spelvin
2014-06-10  1:20               ` Theodore Ts'o
2014-06-10  3:10                 ` George Spelvin
2014-06-10 15:25                   ` Theodore Ts'o
2014-06-10 20:40                     ` George Spelvin
2014-06-10 21:20                       ` Theodore Ts'o
2014-06-11  0:10                         ` George Spelvin
2014-06-11  2:08                           ` Theodore Ts'o
2014-06-11  3:58                             ` George Spelvin
2014-06-11 13:11                               ` Theodore Ts'o
2014-06-12  0:42                                 ` George Spelvin
2014-06-12  1:03                                   ` H. Peter Anvin
2014-06-11  4:34                             ` George Spelvin
2014-06-11 13:09                               ` Theodore Ts'o
2014-06-11  2:21                         ` Theodore Ts'o
2014-06-09 13:17 ` drivers/char/random.c: More futzing about George Spelvin
2014-06-11 16:38   ` Theodore Ts'o
2014-06-11 16:48     ` H. Peter Anvin
2014-06-11 19:25       ` Theodore Ts'o
2014-06-11 20:41         ` H. Peter Anvin
2014-06-12  0:44           ` H. Peter Anvin
2014-06-12  1:51             ` George Spelvin
2014-06-12  0:32     ` George Spelvin
2014-06-12  3:22       ` Theodore Ts'o
2014-06-12  4:13         ` random: Benchamrking fast_mix2 George Spelvin
2014-06-12 11:18           ` George Spelvin
2014-06-12 20:17           ` Theodore Ts'o
2014-06-12 20:46             ` Theodore Ts'o
2014-06-13  0:23             ` George Spelvin
2014-06-13 15:52               ` Theodore Ts'o
2014-06-14  2:10                 ` George Spelvin
2014-06-14  3:06                   ` Theodore Ts'o
2014-06-14  5:25                     ` George Spelvin
2014-06-14  6:24                       ` Theodore Ts'o
2014-06-14  8:03                         ` George Spelvin
2014-06-14 11:14                           ` George Spelvin
2014-06-14 15:13                           ` George Spelvin
2014-06-14 16:33                             ` Theodore Ts'o
2014-06-15  0:23                               ` George Spelvin
2014-06-15  1:17                                 ` Theodore Ts'o
2014-06-15  6:58                                   ` George Spelvin
2014-06-15 13:01                                     ` Theodore Ts'o
2014-06-14  6:27                       ` Theodore Ts'o
2014-06-14  4:55                 ` [RFC] random: is the IRQF_TIMER test working as intended? George Spelvin
2014-06-14  6:43                   ` Theodore Ts'o
2014-06-14  7:23                     ` George Spelvin
2014-06-12  3:43     ` drivers/char/random.c: More futzing about George Spelvin