From: "George Spelvin" <linux@horizon.com>
To: andi@firstfloor.org
Cc: ahferroin7@gmail.com, jepler@unpythonic.net, linux@horizon.com,
	linux-kernel@vger.kernel.org, linux@rasmusvillemoes.dk,
	tytso@mit.edu
Subject: Re: Updated scalable urandom patchkit
Date: 10 Oct 2015 14:45:46 -0400
Message-ID: <20151010184546.759.qmail@ns.horizon.com>

I'm very very sorry to be so late to the party; I didn't see this thread
until the week-delayed LWN article on the subject drew my attention to it.

There's a fundamental design issue with this patch set that seems
unnecessary to me: multiple nonblocking pools.

I know, I know, that seems like the whole point of it, but hear me out.

Entropy is a holographic property of the pool, not located in any
particular bit position.

And the most basic operation, of reading from the pool, can easily be
done by multiple readers at the same time from the same bit pattern.
They just need to be salted with distinguishing nonces (CPU IDs will do
nicely) to ensure they all get different results.  So a reader doesn't
just get hash(pool[]), but hash(unique_nonce, pool[]).
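The idea can be sketched in a few lines of runnable Python (SHA-1 via hashlib standing in for the kernel's extraction; `POOL`, `salted_read`, and the nonce encoding are illustrative, not the actual kernel code):

```python
import hashlib

POOL = bytes(128)  # stand-in for the shared input pool contents

def salted_read(cpu_nonce: int, pool: bytes = POOL) -> bytes:
    """Each reader hashes a distinguishing nonce ahead of the shared
    pool bytes, so concurrent readers of the *same* pool state still
    get different outputs."""
    return hashlib.sha1(cpu_nonce.to_bytes(4, "little") + pool).digest()

# Two CPUs reading the identical pool state get unrelated outputs:
out0 = salted_read(0)
out1 = salted_read(1)
assert out0 != out1
```

Note that no lock is needed for this part: the pool bytes are only read, and even a torn read raced against a writer just produces some other non-invertible hash input.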

This in turn lets the spin_lock_irqsave() in extract_buf be moved to
after the sha_transform() calls.

I think the whole thing can be done, more elegantly, using a single pool
and judiciously relaxed locking rules.  You don't even need reader-writer
locks; it's actually beneficial rather than harmful if there's a race
between a pool reader and writer.

In general, fewer larger pools is better for entropy storage.  The only
reason for the current three-pool design is to achieve strict isolation
of the /dev/random pool.

The two operations that require locking are:

1. Entropy accounting.  However, that only has to be strict
   for /dev/random.  For /dev/urandom, it can be late in various ways.
   One possibility is to have separate entropy accounting per NUMA
   node; only the pool proper has to be shared.
  
2. Add-back of the output to the pool.  This involves writing to pool,
   but for non-invertibility, it only has to be done by one of a number of
   concurrent readers.  I think significant cleverness can be applied
   here.

There are a lot of things that can be done about this second point:

2a. The obvious one is to batch the add-back.  For backtracking
    protection, we only need to do one add-back per read, not one per
    10 bytes of output.  Combined with the above change that locks only
    around the add-back and not the pool hashing, this greatly reduces
    both the lock window and the number of times it's taken.

2b. Another idea would be to optimize for 16 bytes.  I like the basic
    concept of having a larger hash internal state than the output, but
    would it be possible to output 16 of the 20 bytes of SHA-1 output
    rather than the current 10?  This is obviously a Ted question (the
    current folding is his idea), but even 32 bits is enough to protect
    against internal collisions in the hash state.

2c. Another possibility would be to add small separate "salt piles"
    which are initialized to the CPU id, and stirred by reads.  They only
    need to be large enough to not experience birthday collisions even
    assuming a stupid number of reads between stirs of the main pool.
    (So 128 bits would be more than ample.)  A read would consist of:

	u32 hash[5] = { SHA-1 initial values };
	sha1(hash, my_salt);
	sha1(hash, main_pool);
	if (spin_trylock(&main_pool_lock)) {
		add_back(hash, main_pool);
		spin_unlock(&main_pool_lock);
	} else {
		add_back(hash, my_salt);
	}

2d. A more complete solution involves noting that, if there are multiple
    concurrent readers, only one has to add back its output to prevent
    backtracking, because all of the concurrent reads are equivalent
    in that sense.  (Even though, because they're salted with separate
    nonces like the CPU id, they're not identical.)
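The trylock fallback of 2c can be made concrete with a small runnable Python model (threading.Lock standing in for the pool spinlock; the XOR mixback, pool size, and salt-pile layout are illustrative only):

```python
import hashlib
import threading

main_pool = bytearray(64)          # shared pool (illustrative size)
main_pool_lock = threading.Lock()  # stands in for the pool spinlock
salt_piles = {0: bytearray(b"\x00" * 16), 1: bytearray(b"\x01" * 16)}

def add_back(target: bytearray, data: bytes) -> None:
    """Toy mixback: XOR the hash into the target buffer."""
    for i, b in enumerate(data):
        target[i % len(target)] ^= b

def read(cpu: int) -> bytes:
    salt = salt_piles[cpu]
    out = hashlib.sha1(bytes(salt) + bytes(main_pool)).digest()
    # Try to mix back into the main pool; if it's busy, stir the
    # per-CPU salt pile instead so this read still can't be replayed.
    if main_pool_lock.acquire(blocking=False):
        try:
            add_back(main_pool, out)
        finally:
            main_pool_lock.release()
    else:
        add_back(salt, out)
    return out
```

Either branch changes some input to the next hash, so a captured pool state never lets an attacker reproduce an earlier output.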

I'm thinking in terms of a "generation counter" for the pool.  (Perhaps
implemented as a seqlock.)  The readers of the generation-n pool must
ensure that *someone* is responsible for bumping the generation to n+1
to prevent backtracking of the generation-n state, but only one.

A key point is that the add-back that bumps the generation to n+2 may
actually be a hash of the generation-n pool state, the generation-n+1
state, or some transient intermediate state due to races with the
generation-n+1 add-back.  It is not necessary for readers to wait for
the pool to be stable, only that it's a non-invertible transformation
based on some earlier generation.

Now, to actually implement that, a "spin lock while checking extra
condition" would be ideal, but I do not want to add Yet Another Locking
Primitive to the kernel.

One possibility would be to accept the possibility of a race condition
breaking anti-backtracking.  Another reader will come along soon enough
and fix it.

But if that's not okay, consider the following:

We add a second lock, the "responsibility lock", the holder of which
is responsible for doing anti-backtracking.

After hashing the pool, each reader does the following:
1. (Optional) trylock() the pool lock.
   This is a low-load optimization.  If it succeeds, go
   directly to step 5.
2. Trylock() the responsibility lock.
   If this fails, we're done; return.
3. Wait for the pool lock.
4. Drop the responsibility lock.  This must be done before
   updating the pool proper.
5. Add our anti-backtracking bytes to the pool.
6. Release the pool lock.
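Those six steps can be sketched as runnable code (Python threading locks standing in for kernel spinlocks; the function and lock names are illustrative):

```python
import threading

pool_lock = threading.Lock()            # guards pool writes
responsibility_lock = threading.Lock()  # holder owes the next mixback

def mixback(do_add_back) -> bool:
    """Returns True if this reader performed the add-back.
    At most one reader waits on pool_lock at a time; everyone
    else returns immediately at step 2."""
    # 1. Low-load fast path: pool lock free, just take it.
    if pool_lock.acquire(blocking=False):
        try:
            do_add_back()               # 5. add anti-backtracking bytes
        finally:
            pool_lock.release()         # 6.
        return True
    # 2. Someone is writing; volunteer to be the *next* writer.
    if not responsibility_lock.acquire(blocking=False):
        return False                    # a next writer already exists
    pool_lock.acquire()                 # 3. wait for the current writer
    responsibility_lock.release()       # 4. drop before touching the pool
    try:
        do_add_back()                   # 5.
    finally:
        pool_lock.release()             # 6.
    return True
```

Dropping the responsibility lock before the pool update (step 4, not after step 5) is what lets the next volunteer queue up while this one is still writing.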

(I originally thought I'd need to ping-pong between two responsibility
locks based on the generation counter % 2, but now I don't think so.)

Under high load, you have one processor adding back, a second waiting
to add back, and everyone else just sails right through without
waiting at all.

Details to be filled in:
- Each reader needs some sort of nonce to distinguish multiple reads
  if there's no main pool add-back between them.  RDTSC, per-cpu
  counter, or whatever.
- Entropy accounting.  As mentioned, some sloppiness is okay, so I
  think it's solvable, but I haven't looked at it yet.
