x86/random: Speculation to the rescue

Message ID alpine.DEB.2.21.1909290010500.2636@nanos.tec.linutronix.de
State New

Commit Message

Thomas Gleixner Sept. 28, 2019, 10:24 p.m. UTC
Nicholas presented the idea to (ab)use speculative execution for random
number generation years ago at the Real-Time Linux Workshop:

   https://static.lwn.net/images/conf/rtlws11/random-hardware.pdf

The idea is to use the non-constant execution time of loops on speculative
CPUs. The trick is to "time" the execution with rdtsc() and to have enough
other instructions within the loop to make the uops scheduling
non-deterministic.

A trivial loop:

	for (i = 0; i < nbits; i++) {
		t = rdtsc();
		if (t & 0x01ULL)
			d |= (1ULL << i);
	}

does not provide enough entropy, but adding a pointless construct into the
loop makes it work:

	for (i = 0; i < nbits; i++) {
		t = __sw_hweight64(rdtsc() + rdtsc());
		if (t & 0x01ULL)
			d |= (1ULL << i);
	}

The interesting part is that rdtsc() can be speculated and the software
variant of hweight64() is adding enough 'useless' instructions to make the
uops scheduling and therefore the execution time random enough so that bit
zero of the result yields useful entropy.

To verify that this is true, there is a debugfs interface which provides
two files:

    /sys/kernel/debug/x86/rnd_fill
    /sys/kernel/debug/x86/rnd_data

Writing anything into 'rnd_fill' triggers the entropy collection via the
above algorithm and stores the result in a data array, 64 bits in one go,
after running each 64bit value through a lame folding algorithm (Fibonacci
LFSR).

When the write returns (which takes less than 10ms), the buffer is filled
with 4096 64bit entries, i.e. 262144 bits. The resulting data was read out
from rnd_data and tested against dieharder and the TestU01 Rabbit suite.

Most runs pass all tests, but both test suites occasionally find a few
tests which they consider weak, though the weak points are not the same
across the failing runs. The number of weak findings depends on the
microarchitecture; older CPUs expose it more than newer ones, which is not
surprising as the speculation gets more crazy on newer generations.

As the folding algorithm is lazy and primitive, it's not necessarily a
proof that the approach is wrong.  Feeding the entropy into the proper
kernel random generator should make that go away. Aside from that, as the
'weak' points are randomly moving around, there is no real way to attack
them in my opinion (but I might be wrong as usual).

What's interesting is that even with the trivial 64bit folding the result
is surprisingly good. That means that the speculative execution provides
usable entropy.

A visual comparison of 262144 bits retrieved from /dev/random and from the
rnd_data file is here:

  https://tglx.de/~tglx/random/index.html

The tests were successfully run on various generations of Intel and AMD
hardware, but of course that needs way more investigation than a few runs
on a few machines.

As this depends on the microarchitecture and the pipeline depth, this needs
at least some basic runtime verification mechanism before utilizing this.

But contrary to the last two years' experience, speculation seems to have
its useful sides as well :)

Suggested-by: Nicholas Mc Guire <hofrat@opentech.at>
Not-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/processor.h |    2 
 arch/x86/kernel/cpu/common.c     |    4 +
 arch/x86/kernel/tsc.c            |   95 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+)

Comments

Linus Torvalds Sept. 28, 2019, 11:53 p.m. UTC | #1
On Sat, Sep 28, 2019 at 3:24 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Nicholas presented the idea to (ab)use speculative execution for random
> number generation years ago at the Real-Time Linux Workshop:

What you describe is just a particularly simple version of the jitter
entropy. Not very reliable.

But hey, here's a made-up patch. It basically does jitter entropy, but
it uses a more complex load than the fibonacci LFSR folding: it calls
"schedule()" in a loop, and it sets up a timer to fire.

And then it mixes in the TSC in that loop.

And to be fairly conservative, it then credits one bit of entropy for
every timer tick. Not because the timer itself would be all that
unpredictable, but because the interaction between the timer and the
loop is going to be pretty damn unpredictable.

Ok, I'm handwaving. But I do claim it really is fairly conservative to
think that a cycle counter would give one bit of entropy when you time
over a timer actually happening. The way that loop is written, we do
guarantee that we'll mix in the TSC value both before and after the
timer actually happened. We never look at the difference of TSC
values, because the mixing makes that uninteresting, but the code does
start out with verifying that "yes, the TSC really is changing rapidly
enough to be meaningful".

So if we want to do jitter entropy, I'd much rather do something like
this that actually has a known fairly complex load with timers and
scheduling.

And even if absolutely no actual other process is running, the timer
itself is still going to cause perturbations. And the "schedule()"
call is more complicated than the LFSR is anyway.

It does wait for one second the old way before it starts doing this.

Whatever. I'm entirely convinced this won't make everybody happy
anyway, but it's _one_ approach to handle the issue.

Ahmed - would you be willing to test this on your problem case (with
the ext4 optimization re-enabled, of course)?

And Thomas - mind double-checking that I didn't do anything
questionable with the timer code..

And this goes without saying - this patch is ENTIRELY untested.  Apart
from making people upset for the lack of rigor, it might do
unspeakable crimes against your pets. You have been warned.

               Linus
 drivers/char/random.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d3beed084c0a..de434feb873a 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1732,6 +1732,56 @@ void get_random_bytes(void *buf, int nbytes)
 }
 EXPORT_SYMBOL(get_random_bytes);
 
+
+/*
+ * Each time the timer fires, we expect that we got an unpredictable
+ * jump in the cycle counter. Even if the timer is running on another
+ * CPU, the timer activity will be touching the stack of the CPU that is
+ * generating entropy..
+ *
+ * Note that we don't re-arm the timer in the timer itself - we are
+ * happy to be scheduled away, since that just makes the load more
+ * complex, but we do not want the timer to keep ticking unless the
+ * entropy loop is running.
+ *
+ * So the re-arming always happens in the entropy loop itself.
+ */
+static void entropy_timer(struct timer_list *t)
+{
+	credit_entropy_bits(&input_pool, 1);
+}
+
+/*
+ * If we have an actual cycle counter, see if we can
+ * generate enough entropy with timing noise
+ */
+static void try_to_generate_entropy(void)
+{
+	struct {
+		unsigned long now;
+		struct timer_list timer;
+	} stack;
+
+	stack.now = random_get_entropy();
+
+	/* Slow counter - or none. Don't even bother */
+	if (stack.now == random_get_entropy())
+		return;
+
+	timer_setup_on_stack(&stack.timer, entropy_timer, 0);
+	while (!crng_ready()) {
+		if (!timer_pending(&stack.timer))
+			mod_timer(&stack.timer, jiffies+1);
+		mix_pool_bytes(&input_pool, &stack.now, sizeof(stack.now));
+		schedule();
+		stack.now = random_get_entropy();
+	}
+
+	del_timer_sync(&stack.timer);
+	destroy_timer_on_stack(&stack.timer);
+	mix_pool_bytes(&input_pool, &stack.now, sizeof(stack.now));
+}
+
 /*
  * Wait for the urandom pool to be seeded and thus guaranteed to supply
  * cryptographically secure random numbers. This applies to: the /dev/urandom
@@ -1746,7 +1796,17 @@ int wait_for_random_bytes(void)
 {
 	if (likely(crng_ready()))
 		return 0;
-	return wait_event_interruptible(crng_init_wait, crng_ready());
+
+	do {
+		int ret;
+		ret = wait_event_interruptible_timeout(crng_init_wait, crng_ready(), HZ);
+		if (ret)
+			return ret > 0 ? 0 : ret;
+
+		try_to_generate_entropy();
+	} while (!crng_ready());
+
+	return 0;
 }
 EXPORT_SYMBOL(wait_for_random_bytes);
Thomas Gleixner Sept. 29, 2019, 7:40 a.m. UTC | #2
On Sat, 28 Sep 2019, Linus Torvalds wrote:

> On Sat, Sep 28, 2019 at 3:24 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Nicholas presented the idea to (ab)use speculative execution for random
> > number generation years ago at the Real-Time Linux Workshop:
> 
> What you describe is just a particularly simple version of the jitter
> entropy. Not very reliable.

I know that jitter entropy is not the most reliable source. Though for one
thing it's better than no entropy, and on the other hand the results are
interesting.

I just looked at the test I ran overnight on one of my machines which did
an aggregate test over 1024 runs both on the hacked up thing and the
veritable /dev/random. The overall difference is within the noise.

So I'm not trying to say that we should rely solely on that, but I think
it's something we should not dismiss upfront just because paranoid theory
says that jitter entropy is not reliable enough.

The point is that jitter entropy relies as any other entropy source on the
quality of the observed data. The non-deterministic behaviour of
speculative execution seems to yield quite good input.

There is another fun thing which I found while analyzing the crappy runtime
behaviour of a customer application:

static volatile unsigned int foo;

	t0 = rdtscp();
	for (i = 0; i < 10000; i++)
		foo++;
	/* rdtscp() or the fenced rdtsc() gives better results */
	t1 = rdtscp();

Even with interrupts disabled and no NMI disturbing the loop, the resulting
execution time varies depending on the microarchitecture:

       tmin < t < N * tmin

where N is >= 2 on all Intel machines which I tested. Feeding bit 0 of t1
mixed with t0 into the LFSR gives equally good results as the hacky loop I
used in the patch. Fun, isn't it?

> Whatever. I'm entirely convinced this won't make everybody happy
> anyway, but it's _one_ approach to handle the issue.
> 
> Ahmed - would you be willing to test this on your problem case (with
> the ext4 optimization re-enabled, of course)?
> 
> And Thomas - mind double-checking that I didn't do anything
> questionable with the timer code..

Looks about right.
 
> And this goes without saying - this patch is ENTIRELY untested.  Apart
> from making people upset for the lack of rigor, it might do
> unspeakable crimes against your pets. You have been warned.

As I'm about to vanish for a 2 weeks vacation, I'm going to shut up now and
hope that experiment gave at least food for thought.

Thanks,

	tglx
Alexander E. Patrakov Sept. 29, 2019, 8:05 a.m. UTC | #3
On 29.09.2019 04:53, Linus Torvalds wrote:
> On Sat, Sep 28, 2019 at 3:24 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Nicholas presented the idea to (ab)use speculative execution for random
>> number generation years ago at the Real-Time Linux Workshop:
> 
> What you describe is just a particularly simple version of the jitter
> entropy. Not very reliable.
> 
> But hey, here's a made-up patch. It basically does jitter entropy, but
> it uses a more complex load than the fibonacci LFSR folding: it calls
> "schedule()" in a loop, and it sets up a timer to fire.
> 
> And then it mixes in the TSC in that loop.
> 
> And to be fairly conservative, it then credits one bit of entropy for
> every timer tick. Not because the timer itself would be all that
> unpredictable, but because the interaction between the timer and the
> loop is going to be pretty damn unpredictable.

This looks quite similar to the refactoring proposed earlier by Stephan 
Müller in his paper: https://www.chronox.de/lrng/doc/lrng.pdf . Indeed, 
he makes a good argument that the timing of device interrupts is right 
now the main actual source of entropy in Linux, at the end of Section 1.1:

"""
The discussion shows that the noise sources of block devices and HIDs 
are a derivative of the interrupt noise source. All events used as 
entropy source recorded by the block device and HID noise source are 
delivered to the Linux kernel via interrupts.
"""

Now your patch adds the timer interrupt (while the schedule() loop is 
running) to the mix, essentially in the same setup as proposed.

> 
> Ok, I'm handwaving. But I do claim it really is fairly conservative to
> think that a cycle counter would give one bit of entropy when you time
> over a timer actually happening. The way that loop is written, we do
> guarantee that we'll mix in the TSC value both before and after the
> timer actually happened. We never look at the difference of TSC
> values, because the mixing makes that uninteresting, but the code does
> start out with verifying that "yes, the TSC really is changing rapidly
> enough to be meaningful".
> 
> So if we want to do jitter entropy, I'd much rather do something like
> this that actually has a known fairly complex load with timers and
> scheduling.
> 
> And even if absolutely no actual other process is running, the timer
> itself is still going to cause perturbations. And the "schedule()"
> call is more complicated than the LFSR is anyway.
> 
> It does wait for one second the old way before it starts doing this.
> 
> Whatever. I'm entirely convinced this won't make everybody happy
> anyway, but it's _one_ approach to handle the issue.
> 
> Ahmed - would you be willing to test this on your problem case (with
> the ext4 optimization re-enabled, of course)?
> 
> And Thomas - mind double-checking that I didn't do anything
> questionable with the timer code..
> 
> And this goes without saying - this patch is ENTIRELY untested.  Apart
> from making people upset for the lack of rigor, it might do
> unspeakable crimes against your pets. You have been warned.
> 
>                 Linus
>
Linus Torvalds Sept. 30, 2019, 1:16 a.m. UTC | #4
On Sat, Sep 28, 2019 at 4:53 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But hey, here's a made-up patch. It basically does jitter entropy, but
> it uses a more complex load than the fibonacci LFSR folding: it calls
> "schedule()" in a loop, and it sets up a timer to fire.

Ok, I'm sure a lot of people will end up finding this distasteful, and
I'll admit to just waffling about it myself.

But I am supposed to close the merge window today, and honestly, I
want _something_ to happen about the getrandom() issue during the 5.4
merge cycle.

So I had a few choices

 - just ignore things and hope some consensus happens

 - start the movement to a new getrandom() interface and encourage
user space to say "yeah, I don't need _secure_ random numbers"

 - or just say "hey, a lot of people find jitter entropy reasonable,
so let's just try this".

And I went with that last choice. If it works, it makes the
getrandom() interface changes a non-issue.

I'm not saying my patch is going to be the last word on the issue. I'm
_personally_ ok with it and believe it's not crazy, and if it then
makes serious people go "Eww" and send some improvements to it, then
it has served its purpose.

But I've committed that patch and the revert of the ext4 revert to a
local branch, I'll do some basic testing of it (which honestly on my
machines are kind of pointless, since all of them support rdrand), but
assuming it passes the basic smoke tests - and I expect it to - I'll
merge it for rc1.

I also have my old readdir branch that I decided I want to merge due
to the (completely independent) discussion about filldir issues, so
I'll probably end up delaying rc1 until tomorrow, but just a heads up.
I don't want to leave this until "some time later in the -rc series",
although I will be _more_ than happy to have people send me fixes to
my somewhat simplistic patch.

That said, my patch may be simplistic, but I suspect using a loop with
a real load like schedule() and arming a timer is a lot more realistic
than some of the jitter entropy papers with their _very_ trivial
LFSR's and some made-up pointer chasing.

But yeah, I think improvements to it are also not unexpected or unreasonable ;)

              Linus
Linus Torvalds Sept. 30, 2019, 2:59 a.m. UTC | #5
On Sun, Sep 29, 2019 at 6:16 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But I've committed that patch and the revert of the ext4 revert to a
> local branch, I'll do some basic testing of it (which honestly on my
> machines are kind of pointless, since all of them support rdrand), but
> assuming it passes the basic smoke tests - and I expect it to - I'll
> merge it for rc1.

All my smoke testing looked fine - I disabled trusting the CPU, I
increased the required entropy a lot, and to actually trigger the
lockup issue without the broken user space, I made /dev/urandom do
that "wait for entropy" thing too.

It all looked sane to me, and the urandom part also had the side
effect of then silencing all the "reading urandom without entropy"
warning cases as expected.

So it's merged.

Note that what I merged did _not_ contain the urandom changes, that
was purely for my testing. But it might well be a reasonable thing to
do at some point.

Of course, whether this jitter-entropy approach is reasonable in the
first place ends up likely being debated, but it does seem to be the
simplest way forward.

           Linus
Theodore Y. Ts'o Sept. 30, 2019, 3:37 a.m. UTC | #6
On Sun, Sep 29, 2019 at 06:16:33PM -0700, Linus Torvalds wrote:
> 
>  - or just say "hey, a lot of people find jitter entropy reasonable,
> so let's just try this".
> 

I'm OK with this as a starting point.  If a jitter entropy system
allows us to get past this logjam, let's do it.  At least for the x86
architecture, it will be security through obscurity.  And if the
alternative is potentially failing where the adversary can attack the
CRNG, it's my preference.  It's certainly better than nothing.

That being said, I'd very much like to see someone do an analysis of
how well these jitter schemes work on a RISC-V implementation (you
know, the ones that were so simplistic and didn't have any speculation
so they weren't vulnerable to Spectre/Meltdown).  If jitter
approaches turn out not to work that well on RISC-V, perhaps that will
be a goad for future RISC-V chips to include the crypto extension to
their ISA.

In the long term (not in time for the 5.4 merge window), I'm convinced
that we should be trying as many ways of getting entropy as possible.
If we're using UEFI, we should be trying to get it from UEFI's secure
random number generator; if there is a TPM, we should be trying to get
random numbers from the TPM, and mix them in, and so on.

After all, the reason why we lived with getrandom(0) blocking for five
years was because for the vast majority of x86 platforms, it simply
wasn't a problem in practice.  We need to get back to that place where
in practice, we've harvested as much uncertainty from hardware as
possible, so most folks are comfortable that attacking the CRNG is no
longer the simplest way to crack system security.

       	     	     	      	     - Ted
Borislav Petkov Sept. 30, 2019, 6:10 a.m. UTC | #7
On Sun, Sep 29, 2019 at 07:59:19PM -0700, Linus Torvalds wrote:
> All my smoke testing looked fine - I disabled trusting the CPU, I
> increased the required entropy a lot, and to actually trigger the
> lockup issue without the broken user space, I made /dev/urandom do
> that "wait for entropy" thing too.

Hohum, seems to get rid of the longish delay during boot on my test
boxes here:

$ grep random /var/log/messages

<--- that's before

Sep 30 07:46:07 cz vmunix: [    0.000000] random: get_random_bytes called from start_kernel+0x304/0x4ac with crng_init=0
Sep 30 07:46:07 cz vmunix: [    1.505641] random: fast init done
Sep 30 07:46:07 cz vmunix: [    7.124808] random: dd: uninitialized urandom read (512 bytes read)
Sep 30 07:46:07 cz vmunix: [    8.507672] random: dbus-daemon: uninitialized urandom read (12 bytes read)
Sep 30 07:46:07 cz vmunix: [    8.518621] random: dbus-daemon: uninitialized urandom read (12 bytes read)
Sep 30 07:46:07 cz vmunix: [    8.565073] random: avahi-daemon: uninitialized urandom read (4 bytes read)
Sep 30 07:46:21 cz vmunix: [   23.092795] random: crng init done
Sep 30 07:46:21 cz vmunix: [   23.096419] random: 3 urandom warning(s) missed due to ratelimiting

<--- that's after and we're 15 secs faster:

Sep 30 07:47:53 cz vmunix: [    0.329599] random: get_random_bytes called from start_kernel+0x304/0x4ac with crng_init=0
Sep 30 07:47:53 cz vmunix: [    1.949216] random: fast init done
Sep 30 07:47:53 cz vmunix: [    4.806132] random: dd: uninitialized urandom read (512 bytes read)
Sep 30 07:47:53 cz vmunix: [    5.954547] random: dbus-daemon: uninitialized urandom read (12 bytes read)
Sep 30 07:47:53 cz vmunix: [    5.965483] random: dbus-daemon: uninitialized urandom read (12 bytes read)
Sep 30 07:47:53 cz vmunix: [    6.014102] random: avahi-daemon: uninitialized urandom read (4 bytes read)
Sep 30 07:47:55 cz vmunix: [    8.238514] random: crng init done
Sep 30 07:47:55 cz vmunix: [    8.240205] random: 3 urandom warning(s) missed due to ratelimiting

Seeing how those uninitialized urandom read warns still happen, I added a
dump_stack() to see when we do wait for the random bytes first and I got this:

[    5.522348] random: dbus-daemon: uninitialized urandom read (12 bytes read)
[    5.532008] random: dbus-daemon: uninitialized urandom read (12 bytes read)
[    5.579922] random: avahi-daemon: uninitialized urandom read (4 bytes read)
[    5.751790] elogind-daemon[1730]: New seat seat0.
[    5.756376] elogind-daemon[1730]: Watching system buttons on /dev/input/event6 (Power Button)
[    5.777381] elogind-daemon[1730]: Watching system buttons on /dev/input/event3 (Power Button)
[    5.781485] elogind-daemon[1730]: Watching system buttons on /dev/input/event5 (Lid Switch)
[    5.783547] elogind-daemon[1730]: Watching system buttons on /dev/input/event4 (Sleep Button)
[    5.885300] elogind-daemon[1730]: Watching system buttons on /dev/input/event0 (AT Translated Set 2 keyboard)
[    5.911602] CPU: 2 PID: 1798 Comm: sshd Not tainted 5.3.0+ #1
[    5.914672] Hardware name: HP HP EliteBook 745 G3/807E, BIOS N73 Ver. 01.39 04/16/2019
[    5.917774] Call Trace:
[    5.920905]  dump_stack+0x46/0x60
[    5.924044]  wait_for_random_bytes.part.32+0x21/0x163
[    5.927256]  ? handle_mm_fault+0x50/0xc0
[    5.930425]  ? _raw_spin_unlock_irq+0x17/0x40
[    5.933604]  ? __do_page_fault+0x225/0x500
[    5.936763]  __x64_sys_getrandom+0x70/0xb0
[    5.939902]  do_syscall_64+0x4c/0x180
[    5.943003]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    5.946152] RIP: 0033:0x7f4417f4d495
[    5.949225] Code: 74 4c 8d 0c 37 41 ba 3e 01 00 00 66 2e 0f 1f 84 00 00 00 00 00 4d 39 c8 73 27 4c 89 ce 31 d2 4c 89 c7 44 89 d0 4c 29 c6 0f 05 <48> 3d 00 f0 ff ff 77 2b 48 85 c0 78 0e 74 3c 49 01 c0 4d 39 c8 72
[    5.952902] RSP: 002b:00007ffc69e6e328 EFLAGS: 00000202 ORIG_RAX: 000000000000013e
[    5.956227] RAX: ffffffffffffffda RBX: 0000000000000020 RCX: 00007f4417f4d495
[    5.959530] RDX: 0000000000000000 RSI: 0000000000000020 RDI: 0000559262c74780
[    5.962820] RBP: 0000559262c708b0 R08: 0000559262c74780 R09: 0000559262c747a0
[    5.966104] R10: 000000000000013e R11: 0000000000000202 R12: 00007ffc69e6e470
[    5.969373] R13: 0000000000000002 R14: 00007f4417f4d460 R15: 000000007fffffff
[    7.852837] random: crng init done
[    7.854637] random: 3 urandom warning(s) missed due to ratelimiting
[   17.767786] elogind-daemon[1730]: New session c1 of user root.

so sshd does getrandom(2) while those other userspace things don't. Oh
well.
Theodore Y. Ts'o Sept. 30, 2019, 1:16 p.m. UTC | #8
On Sun, Sep 29, 2019 at 11:37:06PM -0400, Theodore Y. Ts'o wrote:
> I'm OK with this as a starting point.  If a jitter entropy system
> allow us to get pass this logjam, let's do it.  At least for the x86
> architecture, it will be security through obscurity.  And if the
> alternative is potentially failing where the adversary can attack the
> CRNG, it's my preference.  It's certainly better than nothing.

Upon rereading this, this came out wrong.  What I was trying to say is
in the very worst case, it will be security through obscurity, and if
the alternative "don't block, because blocking is always worse than an
guessable value being returned through getrandom(0)", it's better than
nothing.

Which is to say, I'm still worried that people with deep access to the
implementation details of a CPU might be able to reverse engineer what
a jitter entropy scheme produces.  This is why I'd be curious to see
the results when someone tries to attack a jitter scheme on a fully
open, simple architecture such as RISC-V.

						- Ted
Linus Torvalds Sept. 30, 2019, 4:06 p.m. UTC | #9
On Sun, Sep 29, 2019 at 11:10 PM Borislav Petkov <bp@alien8.de> wrote:
>
> so sshd does getrandom(2) while those other userspace things don't. Oh
> well.

Well, that's actually what systems _should_ do. Presumably sshd
actually wants secure randomness, and nothing else waits for it.

Obviously, that can be a problem if you then need sshd in order to get
into a headless box, so my patch fixes things for you too, but at
least your box doesn't show the problem that Ahmed had, and the boot
completing presumably means that you got more entropy from other disk
IO being done by the rest of the boot.

If you want to test my hacky "do /dev/urandom too", it was this one-liner:

  --- a/drivers/char/random.c
  +++ b/drivers/char/random.c
  @@ -2027,6 +2027,7 @@ urandom_read(struct file *file, char __user
*buf, size_t nbytes, loff_t *ppos)
        static int maxwarn = 10;
        int ret;

  +     if (!crng_ready()) try_to_generate_entropy();
        if (!crng_ready() && maxwarn > 0) {
                maxwarn--;
                if (__ratelimit(&urandom_warning))

and that should get rid of the warnings.

It's not using the full "wait_for_random_bytes()", because in the
absence of a cycle counter, that would introduce the boot-time lockup
for /dev/urandom too.

Doing something like the above to /dev/urandom is likely the right
thing to do eventually, but I didn't want to mix up "we can perhaps
improve the urandom situation too" with the basic "let's fix the boot
problem". The urandom behavior change would be a separate thing.

Also, talking about "future changes". Right now
"try_to_generate_entropy()" is actually uninterruptible once it gets
started. I think we should add a test for signal_pending() too, but it
should generally complete really fairly quickly so I left it without
one just to see if anybody even notices.

                 Linus
Linus Torvalds Sept. 30, 2019, 4:15 p.m. UTC | #10
On Mon, Sep 30, 2019 at 6:16 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> Which is to say, I'm still worried that people with deep access to the
> implementation details of a CPU might be able to reverse engineer what
> a jitter entropy scheme produces.  This is why I'd be curious to see
> the results when someone tries to attack a jitter scheme on a fully
> open, simple architecture such as RISC-V.

Oh, I agree.

One of the reasons I didn't like some of the other jitter entropy
things was that they seemed to rely _entirely_ on just purely
low-level CPU unpredictability. I think that exists, but I think it
makes for problems for really simple cores.

Timing over a bigger thing and an actual interrupt (even if it's
"just" a timer interrupt, which is arguably much closer to the CPU and
has a much higher likelihood of having common frequency domains with
the cycle counter etc) means that I'm pretty damn convinced that a big
complex CPU will absolutely see issues, even if it has big caches.

But it _also_ means that if you have a small and excessively stupid
in-order CPU, I can almost guarantee that you will at least have cache
misses likely all the way out to memory. So a CPU-only loop like the
LFSR thing that Thomas reports generates entropy even on its own would
likely generate nothing at all on a simple in-order core - but I do
think that with timers and real cache misses etc, it's going to be
really really hard to try to figure out cycle counters even if you're
a CPU expert.

But the embedded market with small cores and 100% identical machines
and 100% identical system images is always going to be a potential
huge problem.

If somebody has connections to RISC-V hw people, maybe they could
bring this issue up with them?

                  Linus
Peter Zijlstra Sept. 30, 2019, 4:32 p.m. UTC | #11
On Mon, Sep 30, 2019 at 09:15:55AM -0700, Linus Torvalds wrote:
> On Mon, Sep 30, 2019 at 6:16 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:

> But it _also_ means that if you have a small and excessively stupid
> in-order CPU, I can almost guarantee that you will at least have cache
> misses likely all the way out to memory. So a CPU-only loop like the
> LFSR thing that Thomas reports generates entropy even on its own would
> likely generate nothing at all on a simple in-order core - but I do

In my experience LFSRs are good at defeating branch predictors, which
would make even in-order cores suffer lots of branch misses. And that
might be enough, maybe.
Linus Torvalds Sept. 30, 2019, 5:03 p.m. UTC | #12
On Mon, Sep 30, 2019 at 9:32 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> In my experience LFSRs are good at defeating branch predictors, which
> would make even in-order cores suffer lots of branch misses. And that
> might be enough, maybe.

Agreed, branch mis-prediction is likely fairly hard to take into
account ahead of time even in an in-order CPU.

But when you know the LFSR, and you know the architecture, you could
just re-create the timing, and have a fairly high chance of getting
the same complex pattern.

And in the simple enough (ie bad) case - the embedded world - you
don't need to "know" or do any deep analysis of anything or try to
predict it ahead of time. You just look at what another identical
machine does when given the identical starting point.

So I don't think an LFSR is all that great on its own. It's
complicated to predict, and it gives odd patterns, but on an in-order
core I'm not convinced it gives sufficiently _different_ odd patterns
across booting.

This, btw, is why you shouldn't trust the "I ran the thing a billion
times" on my PC, even if you were to have an old in-order Atom CPU
available to you. If you didn't restart the whole CPU state from an
identical starting point as you re-run them, the differences you see
may simply not be real. They may be an artificial effect of cumulative
changes to internal CPU branch prediction arrays and cache tag layout.

I don't think it's a huge issue if you have a real load, and you have
_any_ source of entropy at all, but I really do not think that an LFSR
is necessarily a good idea. It's just _too_ identical across reboots,
and will have very similar (but yes, complex due to branch prediction)
behavior across different runs.

Of course, in the "completely and utterly identical state and
absolutely no timing differences anywhere" situation, even my "take
timer interrupts and force at least cache misses on SMP" model doesn't
protect you from just re-running the 100% identical sequence.

But when it's a more complex load than an LFSR, I personally at least
feel better about it. An LFSR I can well imagine will give the exact
same (odd) timing patterns across boots even if there were earlier
minor changes. But hopefully a bigger load with just a more complex
footprint will have more of that. More cache misses, more DRAM
accesses, more branch mispredicts, more "pipeline was broken in a
_slightly_ different place due to timer".

It is also, btw, why I don't mix in TSC _differences_ when I mix
things in. I think it's better to actually mix in the TSC value
itself. Even if you re-run the LFSR, and it has the exact same branch
mis-predicts (because it's the same LFSR), if there were any timing
differences from _anything_ else before you ran that LFSR, then the
bits you'll be mixing in are different across boots. But if you mix in
the relative difference, you might be mixing in the identical bits.

The only real difference is only the initial TSC value, of course, so
the added entropy is small. But when we're talking about trying to get
to a total of 256 bits, a couple of bits here and there end up
mattering.

But no. Never any _guarantees_. There is no absolute security. Only best effort.

An OoO CPU will have a _lot_ more internal state, and a lot of things
that perturb that internal state, and that will make small changes in
timing cause more chaos in the end. Much less to worry about.

An in-order CPU will have less internal state, and so less
perturbations and sources of real entropy from small differences. We
can only hope there is _some_.

It's not like our existing "depend on external interrupt timing" is
any hard guarantee either, regardless of how long we wait or how many
external interrupts we'd get.

It's always a "at some point you have to make a judgement call".

And we all have different levels of comfort about where that point
ends up being.

               Linus
Kees Cook Sept. 30, 2019, 6:05 p.m. UTC | #13
On Mon, Sep 30, 2019 at 08:10:15AM +0200, Borislav Petkov wrote:
> On Sun, Sep 29, 2019 at 07:59:19PM -0700, Linus Torvalds wrote:
> > All my smoke testing looked fine - I disabled trusting the CPU, I
> > increased the required entropy a lot, and to actually trigger the
> > lockup issue without the broken user space, I made /dev/urandom do
> > that "wait for entropy" thing too.
> 
> Hohum, seems to get rid of the longish delay during boot on my test
> boxes here:

Yes; for me too. This makes a huge difference in my ARM emulation
environments (where I wasn't using virtio-rng-device). Those VMs were
very boot entropy starved -- I was waiting minutes for sshd to start.

I doubt running something like dieharder on urandom would actually show
any deficiencies, but I've started a test up anyway. I'll yell in a few
hours if it actually has something bad to say. :)
David Laight Oct. 1, 2019, 10:28 a.m. UTC | #14
From: Linus Torvalds
> Sent: 30 September 2019 17:16
> 
> On Mon, Sep 30, 2019 at 6:16 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
> >
> > Which is to say, I'm still worried that people with deep access to the
> > implementation details of a CPU might be able to reverse engineer what
> > a jitter entropy scheme produces.  This is why I'd be curious to see
> > the results when someone tries to attack a jitter scheme on a fully
> > open, simple architecture such as RISC-V.
> 
> Oh, I agree.
> 
> One of the reasons I didn't like some of the other jitter entropy
> things was that they seemed to rely _entirely_ on just purely
> low-level CPU unpredictability. I think that exists, but I think it
> makes for problems for really simple cores.
> 
> Timing over a bigger thing and an actual interrupt (even if it's
> "just" a timer interrupt, which is arguably much closer to the CPU and
> has a much higher likelihood of having common frequency domains with
> the cycle counter etc) means that I'm pretty damn convinced that a big
> complex CPU will absolutely see issues, even if it has big caches.

Agreed, you need something that is actually non-deterministic.
While 'speculation' is difficult to predict, it is actually fully deterministic.
Until you get some perturbation from an outside source the cpu state
(including caches and DRAM) is likely to be the same on every boot.
For a desktop (etc) PC booting from a disk (even SSD) you'll get some variation.
Boot an embedded system from onboard flash and every boot could
well be the same (or one of a small number of results).

Synchronising a signal between frequency domains might generate
some 'randomness', but maybe not if both come from the same PLL.

Even if there are variations, they may not be large enough to give
a lot of variations in the state.
The variations between systems could also be a lot larger than the
variations within a system.

If there are 'only' 2^32 variations, an exhaustive search for an ssh
key might be feasible.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Borislav Petkov Oct. 1, 2019, 1:51 p.m. UTC | #15
On Mon, Sep 30, 2019 at 09:06:36AM -0700, Linus Torvalds wrote:
> Obviously, that can be a problem if you then need sshd in order to get
> into a headless box, so my patch fixes things for you too, but at
> least your box doesn't show the problem that Ahmed had, and the boot
> completing presumably means that you got more entropy from other disk
> IO being done by the rest of the boot.

Right, another observation I made was that when it would wait for
entropy, if I pressed random keys, it would get done faster because
apparently it would collect entropy from the key presses too.

> If you want to test my hacky "do /dev/urandom too", it was this one-liner:
> 
>   --- a/drivers/char/random.c
>   +++ b/drivers/char/random.c
>   @@ -2027,6 +2027,7 @@ urandom_read(struct file *file, char __user
> *buf, size_t nbytes, loff_t *ppos)
>         static int maxwarn = 10;
>         int ret;
> 
>   +     if (!crng_ready()) try_to_generate_entropy();
>         if (!crng_ready() && maxwarn > 0) {
>                 maxwarn--;
>                 if (__ratelimit(&urandom_warning))
> 
> and that should get rid of the warnings.

So when I add this by hand and do git diff, it adds a second hunk:

---
diff --git a/drivers/char/random.c b/drivers/char/random.c
index c2f7de9dc543..93bad17bef98 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -2027,6 +2027,7 @@ urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 	static int maxwarn = 10;
 	int ret;
 
+	if (!crng_ready()) try_to_generate_entropy();
 	if (!crng_ready() && maxwarn > 0) {
 		maxwarn--;
 		if (__ratelimit(&urandom_warning))
@@ -2520,4 +2521,4 @@ void add_bootloader_randomness(const void *buf, unsigned int size)
 	else
 		add_device_randomness(buf, size);
 }
-EXPORT_SYMBOL_GPL(add_bootloader_randomness);
\ No newline at end of file
+EXPORT_SYMBOL_GPL(add_bootloader_randomness);
---

and I kinda get what it is trying to tell me but this is new. And when I
do

$ xxd drivers/char/random.c
..

000125e0: 646f 6d6e 6573 7329 3b0a                 domness);.

there's a 0xa at the end so what's git really trying to tell me?

Anyway, that does get rid of the warns too.

> Doing something like the above to /dev/urandom is likely the right
> thing to do eventually, but I didn't want to mix up "we can perhaps
> improve the urandom situation too" with the basic "let's fix the boot
> problem". The urandom behavior change would be a separate thing.

So make it a separate patch and let's hammer on it during the next weeks
and see what happens?

> Also, talking about "future changes". Right now
> "try_to_generate_entropy()" is actually uninterruptible once it gets
> started. I think we should add a test for signal_pending() too, but it

Wouldn't that even increase its entropy, which would be a good thing?

> should generally complete really fairly quickly so I left it without
> one just to see if anybody even notices.

Right.

Thx.
Ahmed S. Darwish Oct. 1, 2019, 4:15 p.m. UTC | #16
Hi,

Sorry for the late reply as I'm also on vacation this week :-)

On Sat, Sep 28, 2019 at 04:53:52PM -0700, Linus Torvalds wrote:
> On Sat, Sep 28, 2019 at 3:24 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Nicholas presented the idea to (ab)use speculative execution for random
> > number generation years ago at the Real-Time Linux Workshop:
>
> What you describe is just a particularly simple version of the jitter
> entropy. Not very reliable.
>
> But hey, here's a made-up patch. It basically does jitter entropy, but
> it uses a more complex load than the fibonacci LFSR folding: it calls
> "schedule()" in a loop, and it sets up a timer to fire.
>
> And then it mixes in the TSC in that loop.
>
> And to be fairly conservative, it then credits one bit of entropy for
> every timer tick. Not because the timer itself would be all that
> unpredictable, but because the interaction between the timer and the
> loop is going to be pretty damn unpredictable.
>
> Ok, I'm handwaving. But I do claim it really is fairly conservative to
> think that a cycle counter would give one bit of entropy when you time
> over a timer actually happening. The way that loop is written, we do
> guarantee that we'll mix in the TSC value both before and after the
> timer actually happened. We never look at the difference of TSC
> values, because the mixing makes that uninteresting, but the code does
> start out with verifying that "yes, the TSC really is changing rapidly
> enough to be meaningful".
>
> So if we want to do jitter entropy, I'd much rather do something like
> this that actually has a known fairly complex load with timers and
> scheduling.
>
> And even if absolutely no actual other process is running, the timer
> itself is still going to cause perturbations. And the "schedule()"
> call is more complicated than the LFSR is anyway.
>
> It does wait for one second the old way before it starts doing this.
>
> Whatever. I'm entirely convinced this won't make everybody happy
> anyway, but it's _one_ approach to handle the issue.
>
> Ahmed - would you be willing to test this on your problem case (with
> the ext4 optimization re-enabled, of course)?
>

So I pulled the patch and the revert of the ext4 revert as they're all
now merged in master. It of course made the problem go away...

To test the quality of the new jitter code, I added a small patch on
top to disable all other sources of randomness except the new jitter
entropy code, [1] and made quick tests on the quality of getrandom(0).

Using the "ent" tool, [2] also used to test randomness in the Stephan
Müller LRNG paper, on a 500000-byte file, produced the following
results:

    $ ent rand-file

    Entropy = 7.999625 bits per byte.

    Optimum compression would reduce the size of this 500000 byte file
    by 0 percent.

    Chi square distribution for 500000 samples is 259.43, and randomly
    would exceed this value 41.11 percent of the times.

    Arithmetic mean value of data bytes is 127.4085 (127.5 = random).

    Monte Carlo value for Pi is 3.148476594 (error 0.22 percent).

    Serial correlation coefficient is 0.001740 (totally uncorrelated = 0.0).

As can be seen above, everything looks random, and almost all of the
statistical randomness tests matched the same kernel without the
"jitter + schedule()" patch added (after getting it un-stuck).

Thanks!

[1] Nullified add_{device,timer,input,interrupt,disk,.*}_randomness()
[2] http://www.fourmilab.ch/random/

--
Ahmed Darwish
Kees Cook Oct. 1, 2019, 4:37 p.m. UTC | #17
On Tue, Oct 01, 2019 at 06:15:02PM +0200, Ahmed S. Darwish wrote:
> On Sat, Sep 28, 2019 at 04:53:52PM -0700, Linus Torvalds wrote:
> > Ahmed - would you be willing to test this on your problem case (with
> > the ext4 optimization re-enabled, of course)?
> >
> 
> So I pulled the patch and the revert of the ext4 revert as they're all
> now merged in master. It of course made the problem go away...
> 
> To test the quality of the new jitter code, I added a small patch on
> top to disable all other sources of randomness except the new jitter
> entropy code, [1] and made quick tests on the quality of getrandom(0).
> 
> Using the "ent" tool, [2] also used to test randomness in the Stephan
> Müller LRNG paper, on a 500000-byte file, produced the following
> results:
> 
>     $ ent rand-file
> 
>     Entropy = 7.999625 bits per byte.
> 
>     Optimum compression would reduce the size of this 500000 byte file
>     by 0 percent.
> 
>     Chi square distribution for 500000 samples is 259.43, and randomly
>     would exceed this value 41.11 percent of the times.
> 
>     Arithmetic mean value of data bytes is 127.4085 (127.5 = random).
> 
>     Monte Carlo value for Pi is 3.148476594 (error 0.22 percent).
> 
>     Serial correlation coefficient is 0.001740 (totally uncorrelated = 0.0).
> 
> As can be seen above, everything looks random, and almost all of the
> statistical randomness tests matched the same kernel without the
> "jitter + schedule()" patch added (after getting it un-stuck).

Can you post the patch for [1]? Another test we should do is the
multi-boot test. Testing the stream (with ent, or with my dieharder run)
is mainly testing the RNG algo. I'd like to see if the first 8 bytes
out of the kernel RNG change between multiple boots of the same system.
e.g. read the first 8 bytes, for each of 100000 boots, and feed THAT
byte "stream" into ent...
Linus Torvalds Oct. 1, 2019, 5:14 p.m. UTC | #18
On Tue, Oct 1, 2019 at 6:51 AM Borislav Petkov <bp@alien8.de> wrote:
>
> So when I add this by hand and do git diff, it adds a second hunk:

You and me both.

We both have editors that always add a trailing newline, and right now
that file doesn't have a newline at the end of the file thanks to commit
428826f5358c ("fdt: add support for rng-seed").


> and I kinda get what it is trying to tell me but this is new. And when I
> do
>
> $ xxd drivers/char/random.c
> ..
>
> 000125e0: 646f 6d6e 6573 7329 3b0a                 domness);.
>
> there's a 0xa at the end so what's git really trying to tell me?

The previous state of the file didn't have that 0xa at the end, so you get that


  -EXPORT_SYMBOL_GPL(add_bootloader_randomness);
  \ No newline at end of file
  +EXPORT_SYMBOL_GPL(add_bootloader_randomness);

which is "the '-' line doesn't have a newline, the '+' line does" marker.

> > Doing something like the above to /dev/urandom is likely the right
> > thing to do eventually, but I didn't want to mix up "we can perhaps
> > improve the urandom situation too" with the basic "let's fix the boot
> > problem". The urandom behavior change would be a separate thing.
>
> So make it a separate patch and let's hammer on it during the next weeks
> and see what happens?

Yeah, probably.

> > Also, talking about "future changes". Right now
> > "try_to_generate_entropy()" is actually uninterruptible once it gets
> > started. I think we should add a test for signal_pending() too, but it
>
> Wouldn't that even increase its entropy, which would be a good thing?

I'm not sure it increases it, but it certainly shouldn't make it any worse.

                  Linus
Ahmed S. Darwish Oct. 1, 2019, 5:18 p.m. UTC | #19
On Tue, Oct 01, 2019 at 09:37:39AM -0700, Kees Cook wrote:
> On Tue, Oct 01, 2019 at 06:15:02PM +0200, Ahmed S. Darwish wrote:
> > On Sat, Sep 28, 2019 at 04:53:52PM -0700, Linus Torvalds wrote:
> > > Ahmed - would you be willing to test this on your problem case (with
> > > the ext4 optimization re-enabled, of course)?
> > >
> > 
> > So I pulled the patch and the revert of the ext4 revert as they're all
> > now merged in master. It of course made the problem go away...
> > 
> > To test the quality of the new jitter code, I added a small patch on
> > top to disable all other sources of randomness except the new jitter
> > entropy code, [1] and made quick tests on the quality of getrandom(0).
> > 
> > Using the "ent" tool, [2] also used to test randomness in the Stephan
> > Müller LRNG paper, on a 500000-byte file, produced the following
> > results:
> > 
> >     $ ent rand-file
> > 
> >     Entropy = 7.999625 bits per byte.
> > 
> >     Optimum compression would reduce the size of this 500000 byte file
> >     by 0 percent.
> > 
> >     Chi square distribution for 500000 samples is 259.43, and randomly
> >     would exceed this value 41.11 percent of the times.
> > 
> >     Arithmetic mean value of data bytes is 127.4085 (127.5 = random).
> > 
> >     Monte Carlo value for Pi is 3.148476594 (error 0.22 percent).
> > 
> >     Serial correlation coefficient is 0.001740 (totally uncorrelated = 0.0).
> > 
> > As can be seen above, everything looks random, and almost all of the
> > statistical randomness tests matched the same kernel without the
> > "jitter + schedule()" patch added (after getting it un-stuck).
> 
> Can you post the patch for [1]?
>

Yup, it was the quick&dirty patch below:

(discussion continues after the patch)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index c2f7de9dc543..26d0d2bb3337 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1177,6 +1177,8 @@ struct timer_rand_state {
  */
 void add_device_randomness(const void *buf, unsigned int size)
 {
+	return;
+
 	unsigned long time = random_get_entropy() ^ jiffies;
 	unsigned long flags;
 
@@ -1205,6 +1207,8 @@ static struct timer_rand_state input_timer_state = INIT_TIMER_RAND_STATE;
  */
 static void add_timer_randomness(struct timer_rand_state *state, unsigned num)
 {
+	return;
+
 	struct entropy_store	*r;
 	struct {
 		long jiffies;
@@ -1255,6 +1259,8 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num)
 void add_input_randomness(unsigned int type, unsigned int code,
 				 unsigned int value)
 {
+	return;
+
 	static unsigned char last_value;
 
 	/* ignore autorepeat and the like */
@@ -1308,6 +1314,8 @@ static __u32 get_reg(struct fast_pool *f, struct pt_regs *regs)
 
 void add_interrupt_randomness(int irq, int irq_flags)
 {
+	return;
+
 	struct entropy_store	*r;
 	struct fast_pool	*fast_pool = this_cpu_ptr(&irq_randomness);
 	struct pt_regs		*regs = get_irq_regs();
@@ -1375,6 +1383,8 @@ EXPORT_SYMBOL_GPL(add_interrupt_randomness);
 #ifdef CONFIG_BLOCK
 void add_disk_randomness(struct gendisk *disk)
 {
+	return;
+
 	if (!disk || !disk->random)
 		return;
 	/* first major is 1, so we get >= 0x200 here */
@@ -2489,6 +2499,8 @@ randomize_page(unsigned long start, unsigned long range)
 void add_hwgenerator_randomness(const char *buffer, size_t count,
 				size_t entropy)
 {
+	return;
+
 	struct entropy_store *poolp = &input_pool;
 
 	if (unlikely(crng_init == 0)) {
@@ -2515,9 +2527,11 @@ EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
  */
 void add_bootloader_randomness(const void *buf, unsigned int size)
 {
+	return;
+
 	if (IS_ENABLED(CONFIG_RANDOM_TRUST_BOOTLOADER))
 		add_hwgenerator_randomness(buf, size, size * 8);
 	else

> Another test we should do is the
> multi-boot test. Testing the stream (with ent, or with my dieharder run)
> is mainly testing the RNG algo. I'd like to see if the first 8 bytes
> out of the kernel RNG change between multiple boots of the same system.
> e.g. read the first 8 bytes, for each of 100000 boots, and feed THAT
> byte "stream" into ent...
>

Oh, indeed, that's an excellent point... I'll prototype this and come
back.

thanks,

--
Ahmed Darwish
Linus Torvalds Oct. 1, 2019, 5:25 p.m. UTC | #20
On Tue, Oct 1, 2019 at 9:15 AM Ahmed S. Darwish <darwish.07@gmail.com> wrote:
>
> To test the quality of the new jitter code, I added a small patch on
> top to disable all other sources of randomness except the new jitter
> entropy code, [1] and made quick tests on the quality of getrandom(0).

You also need to make sure to disable rdrand. Even if we don't trust
it, we always mix it in.

> Using the "ent" tool, [2] also used to test randomness in the Stephan
> Müller LRNG paper, on a 500000-byte file, produced the following
> results:

Entropy is hard to estimate, for roughly the same reasons it's hard to generate.

The entropy estimation is entirely broken by the whitening we do:
first we do the LFSR to mix things into the pools, then we whiten it
when we mix it between the input pool and the final pool, and then we
whiten it once more when we extract it when reading.

So the end result of urandom will look random to all the entropy tools
regardless of what the starting point is. Because we use good hashes
for whitening, and do all the updating of the pools while extracting,
the end result had better look perfect.

The only way to even make an educated estimate of actual entropy would
be to print out the raw state of the input pool when we do that "crng
init done".

And then you would have to automate some "reboot machine thousands of
times" and start looking for patterns.

And even then you'd only have a few thousand starting points that we
_claim_ have at least 128 bits of entropy in, and you'd have a really
hard time to prove that is the case.

You might prove that we are doing something very very wrong and don't
have even remotely close to 128 bits of randomness, but just 5 bits of
actual entropy or whatever - _that_ kind of pattern is easy to see.

But even then /dev/urandom as a _stream_ should look fine. Only the
(multiple, repeated) initial states in the input pool would show the
lack of entropy.

And you'd really have to reboot things for real. And not in a VM
either. Just repeating the entropy initialization wouldn't show the
pattern (unless it's even more broken) because the base TSC values
would be changing.

Entropy really is hard. It's hard to generate, and it's hard to measure.

              Linus
Theodore Y. Ts'o Oct. 2, 2019, 12:01 p.m. UTC | #21
On Tue, Oct 01, 2019 at 06:15:02PM +0200, Ahmed S. Darwish wrote:
> 
> Using the "ent" tool, [2] also used to test randomness in the Stephan
> Müller LRNG paper, on a 500000-byte file, produced the following
> results:

The "ent" tool is really, really useless.  If you take any CRNG, even
intialized with a known seed, "ent" will say that it's *GREAT*!

If you don't believe me, disable all entropy inputs into the CRNG,
initialize it with "THE NSA IS OUR LORD AND MASTER", and then run it.
You'll get substantially the same results.  (And if we didn't, the
ChaCha20 encryption algorithm would be totally broken).

       		  	    	     	     - Ted
Pavel Machek Oct. 6, 2019, 11:41 a.m. UTC | #22
Hi!

On Sat 2019-09-28 16:53:52, Linus Torvalds wrote:
> On Sat, Sep 28, 2019 at 3:24 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Nicholas presented the idea to (ab)use speculative execution for random
> > number generation years ago at the Real-Time Linux Workshop:
> 
> What you describe is just a particularly simple version of the jitter
> entropy. Not very reliable.
> 
> But hey, here's a made-up patch. It basically does jitter entropy, but
> it uses a more complex load than the fibonacci LFSR folding: it calls
> "schedule()" in a loop, and it sets up a timer to fire.
> 
> And then it mixes in the TSC in that loop.
> 
> And to be fairly conservative, it then credits one bit of entropy for
> every timer tick. Not because the timer itself would be all that
> unpredictable, but because the interaction between the timer and the
> loop is going to be pretty damn unpredictable.
> 
> Ok, I'm handwaving. But I do claim it really is fairly conservative to
> think that a cycle counter would give one bit of entropy when you time
> over a timer actually happening. The way that loop is written, we do
> guarantee that we'll mix in the TSC value both before and after the
> timer actually happened. We never look at the difference of TSC
> values, because the mixing makes that uninteresting, but the code does
> start out with verifying that "yes, the TSC really is changing rapidly
> enough to be meaningful".
> 
> So if we want to do jitter entropy, I'd much rather do something like
> this that actually has a known fairly complex load with timers and
> scheduling.

> +/*
> + * If we have an actual cycle counter, see if we can
> + * generate enough entropy with timing noise
> + */
> +static void try_to_generate_entropy(void)
> +{
> +	struct {
> +		unsigned long now;
> +		struct timer_list timer;
> +	} stack;

Should we have some kind of notifier chain, so that we could utilize
better random sources (spinning rust) if we had them?

Best regards,
									Pavel
Pavel Machek Oct. 6, 2019, 12:07 p.m. UTC | #23
Hi!

> Entropy really is hard. It's hard to generate, and it's hard to measure.

It is not hard to generate... not on PC, not on most machines. "find
/" can generate plenty of entropy... certainly on any PC.

But it does not work everywhere, and we may need other methods of
generating entropy on some very embedded systems... So IMO we shold
have generic driver for PC-like machines and specific drivers for the
embedded stuff that needs it.

Best regards,
									Pavel
Linus Torvalds Oct. 6, 2019, 5:26 p.m. UTC | #24
On Sun, Oct 6, 2019 at 4:41 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> Should we have some kind of notifier chain, so that we could utilize
> better random sources (spinning rust) if we had them?

The spinning rust will get entropy on its own just thanks to the
regular interrupt stuff. And the kernel trying to do IO is a bad idea.

Plus I think it's kind of pointless to do anything at all for things
like spinning rust in this day and age. It's no longer relevant, and
never really was in the area where this was a problem.

Also, I don't really like the notion of random (sic) notifiers that
different drivers or things could attach to this thing. People will
disagree about how much entropy it has anyway, and I'd rather have
_one_ clear implementation that people can look at and comment on and
try to actually write an academic paper on and suggest improvements
to, than some generic "entropy notifier interface" that then gets
whatever input somebody decides is appropriate.

We already have interfaces for "I think I have interesting data":
add_interrupt_randomness(), add_device_randomness(),
add_hwgenerator_randomness() are all for different sources of entropy.

             Linus
Pavel Machek Oct. 6, 2019, 5:35 p.m. UTC | #25
On Sun 2019-10-06 10:26:18, Linus Torvalds wrote:
> On Sun, Oct 6, 2019 at 4:41 AM Pavel Machek <pavel@ucw.cz> wrote:
> >
> > Should we have some kind of notifier chain, so that we could utilize
> > better random sources (spinning rust) if we had them?
> 
> The spinning rust will get entropy on its own just thanks to the
> regular interrupt stuff. And the kernel tryin gto do IO is a bad
> idea.

It will not: boot is now halted because systemd wants some
entropy. Everything is idle and very few interrupts are
happening. We have spinning rust, but it is idle, and thus not
generating any interrupts.

> Plus I think it's kind of pointless to do anythign at all for things
> like spinning rust in this day and age. It's no longer relevant, and
> never really was in the area where this was a problem.
> 
> Also, I don't really like the notion of random (sic) notifiers that
> different drivers or things could attach to this thing. People will
> disagree about how much entropy it has anyway, and I'd rather have
> _one_ clear implementation that people can look at and comment on and
> try to actually write an academic paper on and suggest improvements
> to, than some generic "entropy notifier interface" that then gets
> whatever input somebody decides is appropriate.
> 
> We already have interfaces for "I think I have interesting data":
> add_interrupt_randomness(), add_device_randomness(),
> add_hwgenerator_randomness() are all for different sources of
> entropy.

I'm not suggesting the notifier would invent some entropy... I agree
that kernel doing IO is strange, but I'm suggesting just that: if
userspace is blocked waiting for entropy, do some I/O, and let
interrupt randomness do its job.

It will work great on spinning rust. It will also work somehow on SSDs
and SD cards etc, because they have separate CPUs these days. They'll
certainly generate some interrupts, and we already assign some
randomness to that... It will let the machine boot, and entropy
calculation rules do not need to change.

Best regards,
									Pavel
Linus Torvalds Oct. 6, 2019, 6:06 p.m. UTC | #26
On Sun, Oct 6, 2019 at 10:35 AM Pavel Machek <pavel@ucw.cz> wrote:
>
> It will not: boot is now halted because systemd wants some
> entropy. Everything is idle and very few interrupts are
> happening. We have spinning rust, but it is idle, and thus not
> generating any interrupts.

Yes, but we have that problem now solved.

Except on embedded platforms that have garbage CPU's without even a
cycle counter.

But those won't have spinning rust anyway.

Yes, bad SSD's and MMC disks (that they do have) will generate timing
noise too, but in the absence of a cycle counter, that noise won't be
much use.

                 Linus
Pavel Machek Oct. 6, 2019, 6:21 p.m. UTC | #27
On Sun 2019-10-06 11:06:38, Linus Torvalds wrote:
> On Sun, Oct 6, 2019 at 10:35 AM Pavel Machek <pavel@ucw.cz> wrote:
> >
> > It will not: boot is now halted because systemd wants some
> > entropy. Everything is idle and very few interrupts are
> > happening. We have spinning rust, but it is idle, and thus not
> > generating any interrupts.
> 
> Yes, but we have that problem now solved.
> 
> Except on embedded platforms that have garbage CPU's without even a
> cycle counter.
> 
> But those won't have spinning rust anyway.
> 
> Yes, bad SSD's and MMC disks (that they do have) will generate timing
> noise too, but in the absence of a cycle counter, that noise won't be
> much use.

Even without cycle counter... if we _know_ we are trying to generate
entropy and have MMC available, we don't care about power and
performance.

So we can just...

   issue read request on MMC
   while (!interrupt_done)
   	 i++
	 
...and then use i++ as poor man's version of cycle counter.

[We would not want to do that in normal operation, for obvious
reasons, just when userland is blocked and waiting for entropy.]

Hmm?

Best regards,
									Pavel
Linus Torvalds Oct. 6, 2019, 6:26 p.m. UTC | #28
On Sun, Oct 6, 2019 at 11:21 AM Pavel Machek <pavel@ucw.cz> wrote:
>
>
> Even without cycle counter... if we _know_ we are trying to generate
> entropy and have MMC available, we don't care about power and
> performance.
>
> So we can just...
>
>    issue read request on MMC
>    while (!interrupt_done)
>          i++
>
> ...and then use i++ as poor man's version of cycle counter.
>
> [We would not want to do that in normal operation, for obvious
> reasons, just when userland is blocked and waiting for entropy.]
>
> Hmm?

I hate it, but it might be worth it as an alternative to the existing
timer thing when we don't have a cycle counter.

Then we'd have _something_ for those bad embedded devices.

I still absolutely hate the idea of doing disk IO for this.

               Linus
Theodore Y. Ts'o Oct. 7, 2019, 11:47 a.m. UTC | #29
On Sun, Oct 06, 2019 at 08:21:03PM +0200, Pavel Machek wrote:
> Even without cycle counter... if we _know_ we are trying to generate
> entropy and have MMC available, we don't care about power and
> performance.
> 
> So we can just...
> 
>    issue read request on MMC
>    while (!interrupt_done)
>    	 i++
> 	 
> ...and then use i++ as poor man's version of cycle counter.

I suggest that you try that and see how much uncertainty you really
have before you assume that this is actually going to work.  Again, on
"System on a Chip" systems, there is very likely a single master
oscillator, and the eMMC is going to be on the same silicon die as
the CPU.  At least for spinning rust platters it's on a physically
separate peripheral, and there was a physical claim about how chaotic
air turbulence might add enough uncertainty that timing HDD reads
could be unpredictable (although note that the paper[1] which claimed
this was written 25 years ago, and HDD's have had to get more precise,
not less, in order to cram more bits into the same square millimeter).

[1] D. Davis, R. Ihaka, P.R. Fenstermacher, "Cryptographic Randomness
from Air Turbulence in Disk Drives", in Advances in Cryptology --
CRYPTO '94 Conference Proceedings, edited by Yvo G. Desmedt,
pp.114--120. Lecture Notes in Computer Science #839. Heidelberg:
Springer-Verlag, 1994

But for a flash device that is on the same silicon as the CPU, and
driven off of the same clock crystal, and where you are doing *reads*,
so you can't even depend on SSD GC and uncertainties in programming
the flash cell --- I'm going to be dubious.

If someone wants to use this as a last ditch hope, then we can let
systemd do this, where there's at least a bit more uncertainty in
userspace, and maybe you could be doing something useful, like running
fsck on the stateful partitions of the system.

Ultimately, though, we really want to try to get a hardware RNG, and
for that matter, a cryptographic secure element built into these
devices, and the sooner we can pound it into CPU manufacturers' heads
that not doing this should be professional malpractice, the better.

     	       	    	      		   - Ted

P.S.  Note that Don Davis's paper[1] claims that they were able
to extract 100 independent unpredictable bits per _minute_.  Given
that modern init scripts want us to be able to boot in under a few
seconds, and we need at least 128 independent bits to securely
initialize the CRNG, people who think that we can get secure random
bits suitable for long-term cryptographic keys (say, such as
initializing the host's SSH key) during early boot based only on HDD
timings may be a bit.... optimistic.

P.P.S.  I actually knew Don; he was at MIT Project Athena as a staff
member at the same time I was working on Kerberos as my day job, and
Linux's e2fsprogs, /dev/random, etc. as a hobby...
Pavel Machek Oct. 7, 2019, 10:18 p.m. UTC | #30
On Mon 2019-10-07 07:47:34, Theodore Y. Ts'o wrote:
> On Sun, Oct 06, 2019 at 08:21:03PM +0200, Pavel Machek wrote:
> > Even without cycle counter... if we _know_ we are trying to generate
> > entropy and have MMC available, we don't care about power and
> > performance.
> > 
> > So we can just...
> > 
> >    issue read request on MMC
> >    while (!interrupt_done)
> >    	 i++
> > 	 
> > ...and then use i++ as poor man's version of cycle counter.
> 
> I suggest that you try that and see how much uncertainty you really
> have before you assume that this is actually going to work.  Again, on
> "System on a Chip" systems, there is very likely a single master
> oscillator, and the eMMC is going to be on the same silicon die as
> the CPU.  At least for spinning rust platters it's on a physically

I have many systems, including SoCs, here, but the technology needed
for NAND flash is different from the technology for CPUs, so these
parts do _not_ share a silicon die. They do not even share the same
package. (Also, the RTC tends to be on a separate chip, connected via
I2C.)

Would you have an example of a Linux-capable system where the eMMC is
on the same chip as the CPU?

> P.S.  Note that this Don Davis's paper[1] claims that they were able
> to extract 100 independent unpredictable bits per _minute_.  Given
> that modern init scripts want us to be able to boot in under a few

Well, for now I'm arguing that it is possible to gather entropy, not
necessarily that it is going to be fast. Still, waiting a minute and a
half during boot is better than generating non-random keys.

Linux already credits interrupts with some entropy, so all I really
need to do is generate some interrupts. And "find /" generates lots of
those on embedded systems. (Even with nfsroot, as long as the network
card is not being polled...)

Best regards,
								Pavel
David Laight Oct. 8, 2019, 11:33 a.m. UTC | #31
From: Pavel Machek
> Sent: 07 October 2019 23:18
..
> I have many systems including SoC here, but technology needed for NAND
> flash is different from technology for CPU, so these parts do _not_
> share a silicon die. They do not even share same package. (Also RTC
> tends to be on separate chip, connected using i2c).

NAND flash requires ECC, so its timing is likely to be asynchronous.
But I2C is clocked from the CPU end, so its timing is fixed.

Also an embedded system could be booting off a large serial EEPROM.
These have fixed timings and are clocked from the cpu end.

So until you get an ethernet interface up and can look at receive packet
timings, there isn't likely to be any randomness at all.
A high-resolution voltage (or temperature) monitor might generate noise
in its LSB - but they don't often have a higher resolution than the stability
of the signal.

I can't remember (from the start of this thread) why 'speculation' is expected to
generate randomness.
I'd have thought the loop was deterministic - but you don't know the initial state.
More iterations may just be amplifying the initial small randomness - rather than
generating extra randomness.
So while it gets the system to boot, it hasn't actually done its job.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Pavel Machek Oct. 9, 2019, 8:02 a.m. UTC | #32
Hi!

> > I have many systems including SoC here, but technology needed for NAND
> > flash is different from technology for CPU, so these parts do _not_
> > share a silicon die. They do not even share same package. (Also RTC
> > tends to be on separate chip, connected using i2c).
> 
> NAND flash requires ECC so is likely to be async.
> But I2C is clocked from the cpu end - so is fixed.

The RTC's I2C bus may be clocked from the CPU end, but the time source
needs to work when the machine is off, so it has a separate crystal
for timekeeping.


> Also an embedded system could be booting off a large serial EEPROM.
> These have fixed timings and are clocked from the cpu end.

Have you seen such a system running Linux?

Best regards,
								Pavel
David Laight Oct. 9, 2019, 9:37 a.m. UTC | #33
From: Pavel Machek
> Sent: 09 October 2019 09:03
> > NAND flash requires ECC so is likely to be async.
> > But I2C is clocked from the cpu end - so is fixed.
>
> RTC i2c may be clocked from the CPU end, but the time source needs to
> work when machine is off, so that has a separate crystal for
> timekeeping.

That only helps if the RTC chip lets you read its internal counters.
You get one read of a few bits of 'randomness'.

> > Also an embedded system could be booting off a large serial EEPROM.
> > These have fixed timings and are clocked from the cpu end.
> 
> Have you seen such system running Linux?

You can run Linux on the Nios CPU on an Altera/Intel FPGA.
The kernel is likely to be loaded from the same serial EEPROM as the FPGA image.

I've not personally run such a setup, but there are examples for the dev boards
so I assume some people do.
I'm not sure I'd want to run Linux on a 100MHz CPU with a slow memory interface.
Better to find an FPGA with an ARM core in the corner!

(We do use the Nios cpu - but for standalone code that fits in small internal
memory blocks.)

	David

Thomas Gleixner Oct. 15, 2019, 9:50 p.m. UTC | #34
David,

On Tue, 1 Oct 2019, David Laight wrote:
> From: Linus Torvalds
> > Sent: 30 September 2019 17:16
> > 
> > On Mon, Sep 30, 2019 at 6:16 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
> > >
> > > Which is to say, I'm still worried that people with deep access to the
> > > implementation details of a CPU might be able to reverse engineer what
> > > a jitter entropy scheme produces.  This is why I'd be curious to see
> > > the results when someone tries to attack a jitter scheme on a fully
> > > open, simple architecture such as RISC-V.
> > 
> > Oh, I agree.
> > 
> > One of the reasons I didn't like some of the other jitter entropy
> > things was that they seemed to rely _entirely_ on just purely
> > low-level CPU unpredictability. I think that exists, but I think it
> > makes for problems for really simple cores.
> > 
> > Timing over a bigger thing and an actual interrupt (even if it's
> > "just" a timer interrupt, which is arguably much closer to the CPU and
> > has a much higher likelihood of having common frequency domains with
> > the cycle counter etc) means that I'm pretty damn convinced that a big
> > complex CPU will absolutely see issues, even if it has big caches.
> 
> Agreed, you need something that is actually non-deterministic.
> While 'speculation' is difficult to predict, it is actually fully deterministic.

I surely agree with Linus that simple architectures could have more or
less predictable, or at least searchable, behaviour. For complex x86
CPUs, I tend to disagree.

Even the Intel architects cannot explain why the following scenario is not
deterministic at all:

	Single CPU
	No NMIs
	No MCEs
	No DMAs in the background, nothing.
	CPU frequency is identical to TSC frequency

	volatile int foo;

	local_irq_disable();
	start = rdtscp();
	for (i = 0; i < 100000; i++)
		foo++;
	end = rdtscp();
	local_irq_enable();

Repeat that loop as often as you wish and observe the end - start
delta. You'll see

       min <= delta <= N * min

where N is something >= 2. The actual value of N depends on the
microarchitecture, but it is not identical on two systems and not even
identical on the same system after boot and 1e6 iterations of the above.

Aside from the fact that N is insanely big, there is absolutely no
pattern in the delta value, even over a large number of runs.

> Until you get some perturbation from an outside source the cpu state
> (including caches and DRAM) is likely to be the same on every boot.

See above and read Nicholas' paper. It's simply not that likely on
anything halfway modern.

> For a desktop (etc) PC booting from a disk (even SSD) you'll get some variation.
> Boot an embedded system from onboard flash and every boot could
> well be the same (or one of a small number of results).
>
> Synchronising a signal between frequency domains might generate
> some 'randomness', but maybe not if both come from the same PLL.
> 
> Even if there are variations, they may not be large enough to give
> a lot of variations in the state.
> The variations between systems could also be a lot larger than the
> variations within a system.

The variations between systems are going to be larger, as even minimal
tolerances in the components have an influence.

But even between two boots on a 'noiseless' embedded system, factors like
temperature, PLL lock times, settling times of voltage regulators and other
tiny details create non-deterministic behaviour. In my former life as a
hardware engineer I had to analyze such issues deeply, as they create
serious trouble for some application scenarios, but those systems were
based on very trivial and truly deterministic silicon parts. No commodity
hardware vendor will ever go the extra mile to address these things, as the
effort required to get them under control is exponential vs. the effect.

Whether that's enough to create true entropy is a different question, but I
do not agree with the upfront dismissal of a potentially useful entropy
source. 

I'm in no way saying that this should be used as the sole source of
entropy, but I definitely want to explore the inherent non-determinism of
modern OoO machines further.

The results are way too interesting to ignore them and amazingly fast at
least with the algorithm which I used in my initial post which started this
whole discussion.

I let that thing run on a testbox for the last two weeks while I was on
vacation and gathered 32768 random bits via that debugfs hack every 100ms,
i.e. a total of 1.2e7 samples amounting to ~4e11 bits. The analysis is
still running, but so far it holds up.

Thanks,

	tglx

Patch
diff mbox series

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -988,4 +988,6 @@  enum mds_mitigations {
 	MDS_MITIGATION_VMWERV,
 };
 
+extern bool cpu_has_speculation;
+
 #endif /* _ASM_X86_PROCESSOR_H */
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -77,6 +77,9 @@  EXPORT_SYMBOL(smp_num_siblings);
 /* Last level cache ID of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;
 
+/* Indicator that the CPU is speculating */
+bool cpu_has_speculation __ro_after_init;
+
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
@@ -1099,6 +1102,7 @@  static void __init cpu_set_bug_bits(stru
 	if (cpu_matches(NO_SPECULATION))
 		return;
 
+	cpu_has_speculation = true;
 	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
 	setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
 
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -14,6 +14,7 @@ 
 #include <linux/percpu.h>
 #include <linux/timex.h>
 #include <linux/static_key.h>
+#include <linux/debugfs.h>
 
 #include <asm/hpet.h>
 #include <asm/timer.h>
@@ -1531,3 +1532,97 @@  unsigned long calibrate_delay_is_known(v
 	return 0;
 }
 #endif
+
+unsigned int arch_get_speculative_entropy(u64 *buf, unsigned int nentries)
+{
+	unsigned int n, i;
+
+	if (!boot_cpu_has(X86_FEATURE_TSC) || !cpu_has_speculation)
+		return 0;
+
+	for (n = 0; n < nentries; n++) {
+		u64 d = 0;
+
+		/*
+		 * The loop below does not execute in constant time on
+		 * speculative CPUs. The trick is to do useless operations
+		 * between the TSC reads, i.e. calling __sw_hweight64().
+		 * This makes uops scheduling sufficiently random, resulting
+		 * in usable entropy.
+		 */
+		for (i = 0; i < 64; i++) {
+			u64 t = rdtsc() + __sw_hweight64(rdtsc());
+
+			if (t & 0x01ULL)
+				d |= (1ULL << i);
+		}
+		buf[n] = d;
+	}
+	return nentries;
+}
+
+#define BUFSIZE 4096
+static u64 rng_array[BUFSIZE];
+
+static struct debugfs_blob_wrapper rng_blob = {
+	.data = rng_array,
+	.size = sizeof(rng_array),
+};
+
+static u64 fold(u64 d)
+{
+	unsigned int i;
+	u64 res = d;
+
+	/*
+	 * Lazy and trivial folding just for testing purposes.
+	 *
+	 * Fibonacci LFSR with the primitive polynomial:
+	 * x^64 + x^61 + x^56 + x^31 + x^28 + x^23 + 1
+	 */
+	for (i = 1; i < 64; i++) {
+		u64 tmp = d << 63;
+
+		tmp = tmp >> 63;
+
+		tmp ^= ((res >> 63) & 1);
+		tmp ^= ((res >> 60) & 1);
+		tmp ^= ((res >> 55) & 1);
+		tmp ^= ((res >> 30) & 1);
+		tmp ^= ((res >> 27) & 1);
+		tmp ^= ((res >> 22) & 1);
+		res <<= 1;
+		res ^= tmp;
+	}
+	return res;
+}
+
+static ssize_t rngfill_write(struct file *file, const char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	unsigned int n;
+	u64 d;
+
+	for (n = 0; n < BUFSIZE; n++) {
+		if (!arch_get_speculative_entropy(&d, 1))
+			return -ENODEV;
+		rng_array[n] = fold(d);
+	}
+	return count;
+}
+
+static const struct file_operations fops_rngfill = {
+	.write = rngfill_write,
+	.llseek = default_llseek,
+};
+
+static int __init rng_debug_init(void)
+{
+	debugfs_create_file("rng_fill", S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_rngfill);
+
+	debugfs_create_blob("rng_data", 0444, arch_debugfs_dir, &rng_blob);
+
+	return 0;
+}
+late_initcall(rng_debug_init);