Date: Wed, 27 Mar 2019 21:43:02 GMT
From: George Spelvin
Message-Id: <201903272143.x2RLh1AP002035@sdf.org>
To: daniel@iogearbox.net, hannes@stressinduktion.org, lkml@sdf.org
Subject: Re: Revising prandom_32 generator
Cc: netdev@vger.kernel.org
In-Reply-To: <0ce9983d-f257-4808-bd84-4385587e1b41@www.fastmail.com>
References: <201903261110.x2QBAFmp001613@sdf.org>,
	<201903261117.x2QBHTnl002697@sdf.org>,
	<109c67e3-7cda-40cf-80e1-a2d3500a2b5d@www.fastmail.com>,
	<201903261907.x2QJ7148017418@sdf.org>,
	<0ce9983d-f257-4808-bd84-4385587e1b41@www.fastmail.com>

On Wed, 27 Mar 2019 at 14:32:55 -0400, Hannes Frederic Sowa wrote:
> On Tue, Mar 26, 2019, at 20:07, George Spelvin wrote:
>> On Tue, 26 Mar 2019 at 14:03:55 -0400, Hannes Frederic Sowa wrote:
>>> Are you trying to fix the modulo biases? I think that prandom_u32_max
>>> also has bias, would that be worth fixing as well?
>>
>> I could if you like.  I submitted a patch (appended) to do that
>> for /dev/urandom (using a very clever new algorithm that does it
>> all with a single multiply in the common case!), but didn't think
>> the prandom users were critical enough to care.
>
> I am not sure. A quick look shows that it might get used for
> selecting timeouts in the networking stack. Should not matter here
> at all. If used to select an index in pre-allocated arrays, then
> effects might be visible.

I looked around the code and couldn't find anything that looked
critical.  Remember that the bias is very small for small ranges.
You're choosing between floor(2^32/range) and ceil(2^32/range),
which is a relative difference of 1/(2^32/range) = range/2^32.
If range = 100, that's 23.3 ppb.  It reaches 1 ppm at range=4295.
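To put concrete numbers on that, here is a quick userspace sketch
(illustrative only, not part of the patch below) that prints the number
of over-represented outputs and the worst-case relative bias of the
truncated-multiply reduction for a few ranges:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		static const uint32_t ranges[] = { 100, 4295, 65536, 1000000 };
		size_t i;

		for (i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
			uint32_t r = ranges[i];
			/* lim = 2^32 mod r: how many outputs occur one extra time */
			uint32_t lim = -r % r;
			/* floor vs. ceil of 2^32/r differ by 1, i.e. ~r/2^32 relative */
			double bias_ppb = lim ? r / 4294967296.0 * 1e9 : 0.0;

			printf("range %7u: %6u over-represented outputs, bias ~%.3g ppb\n",
			       r, lim, bias_ppb);
		}
		return 0;
	}

(Power-of-two ranges such as 65536 have no bias at all, which is why the
power-of-two case can skip the correction entirely.)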
> Additionally, by adding all the branches it might make sense to
> downgrade the function from 'static inline' to regular function.

If the range is a compile-time constant, the code is

	u32 lim = -range % range;	/* compile-time constant */
	u64 prod;

	do
		prod = (u64)range * prandom_u32();
	while (unlikely((u32)prod < lim));
	return prod >> 32;

This is almost the same as the biased code, except that one compare
and one branch is added.  (And a multiply-low on processors with
separate multiply-high instructions.)

But if the range is variable, it's worth taking out of line.

> All in all, I don't think it is required and I would just add it
> if you think it is worth it as well. The additional comments you
> proposed are definitely worth to add.

I also judged it unnecessary, so until someone speaks up I won't do it.

> I wasn't sure if the non-consecutive extraction of output
> would blow up the Gaussian Elimination at some point (as in taking
> a few hours to reduce), which I would not consider trivial anymore.

No, it doesn't matter where the bits are as long as you know the
spacing.  What might happen with non-consecutive bits is that a few
are linearly dependent and so don't count toward the required 113.
This is easily solved by getting a few more bits.

Also, linear equations are just as good as bits.  For example, given
the output of prandom_u32_max(3), we can always derive an equation:

	Output = 0 -> bit 31 = 0
	Output = 1 -> bit 31 ^ bit 30 = 1
	Output = 2 -> bit 31 = 1

> Anyway, using those functions in those situations would already
> be a bug, as Stephen correctly said, but some safety net would be
> useful (e.g. observing timeouts from a box right after boot up to
> determine the initial state of the fragmentation IDs would be bad
> - but early reseeding etc. makes that already difficult).

If it's attackable, you're not supposed to use prandom().  Dithered
timeouts, for example, are a way for *cooperating* devices to avoid
fighting over a shared bus (or other resource).  A malicious device
can just monopolize the bus and fight all comers.

>> All of the proposed replacements are considerably less linear and
>> harder to invert.  The xoshiro family are the most linear,
>> with an LFSR core and only a very simple non-linear state masking.
>
> Then all of the proposed prngs are an improvement?

Definitely go for it.

>> Here's the patch mentioned above for unbiased range reduction.
>>
>> I just thought of a great improvement!  The slow computation
>> is "-range % range".  I could use __builtin_constant_p(range)
>> to decide between the following implementation and one that
>> has -range % range pre-computed and passed in.  I.e.
>
> Heh, if you already made the work and maybe benchmarked it, it
> might make sense to kill the bias in prandom_u32_max anyway? :)
> The proposal below looks great!

I already wrote and posted the code for get_random_u32().  The
question is code size and time, not implementation.  (It's appended
for reference.)

>> 	if ((u32)prod < range - 1) {
>
> Maybe 'unlikely' could help here?

That's in the final version; see below.
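As a sanity check of the rejection rule before the patch itself, here is
an exhaustive 8-bit analogue (userspace, illustrative only; the kernel
code below uses 32- and 64-bit products).  Every input is either rejected
or mapped to an output, and after rejection each output occurs exactly
floor(256/range) times:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const uint32_t range = 6;		/* any non-power-of-two */
		const uint32_t lim = 256 % range;	/* = (1 << 8) % range = 4 */
		uint32_t counts[6] = { 0 };		/* one counter per output */
		uint32_t rejected = 0;
		uint32_t x, i;

		for (x = 0; x < 256; x++) {
			uint32_t prod = x * range;

			if ((uint8_t)prod < lim)	/* low byte < lim: would retry */
				rejected++;
			else
				counts[prod >> 8]++;	/* high byte is the output */
		}
		for (i = 0; i < range; i++)
			printf("output %u: %u times\n", i, counts[i]);
		printf("rejected %u of 256 inputs\n", rejected);
		return 0;
	}

Each output comes out exactly 42 = floor(256/6) times, and exactly
lim = 4 inputs are rejected.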
In-Reply-To: <201903241244.x2OCiL8P011277@sdf.org>
References: <201903241244.x2OCiL8P011277@sdf.org>
From: George Spelvin
Subject: [RFC PATCH v2] random: add get_random_max() function
To: linux-kernel@vger.kernel.org, Theodore Ts'o
Cc: Jason A. Donenfeld, George Spelvin

RFC because:
- Only lightly tested so far, and
- Expecting feedback on the function names

v1->v2: I realized that if the range is a compile-time constant, the
code can be simplified considerably.  This version has two
implementations depending on __builtin_constant_p(range).
(Specifically, the expensive "lim = -range % range" is also a
compile-time constant, and the "fast path" which tries to avoid
computing it is unnecessary.)

I got annoyed that randomize_page() wasn't exactly uniform, and used
the slow modulo-based range reduction rather than the fast
multiplicative one, so I fixed it.  I figure, if you're going to the
trouble of using cryptographic random numbers, you don't want biased
outputs, even if it's slight.

This includes optimized code for compile-time constant ranges, for
the benefit of future callers, even though that doesn't help
randomize_page().

Signed-off-by: George Spelvin
Cc: Theodore Ts'o
Cc: Jason A. Donenfeld
---
 drivers/char/random.c  | 113 ++++++++++++++++++++++++++++++++++++++---
 include/linux/random.h | 109 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 214 insertions(+), 8 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 9f4a348c7eb5..ffe6af0e165b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -2353,6 +2353,99 @@ static void invalidate_batched_entropy(void)
 	write_unlock_irqrestore(&batched_entropy_reset_lock, flags);
 }
 
+/**
+ * __get_random_max32 - Generate a uniform random number 0 <= x < range < 2^32
+ * @range: Modulus for random number; must not be zero!
+ *
+ * This generates an exactly uniform distribution over [0, range), where
+ * range is *not* a compile-time constant.
+ *
+ * This uses Lemire's multiplicative algorithm, from
+ *	Fast Random Integer Generation in an Interval
+ *	https://arxiv.org/abs/1805.10941
+ *
+ * The standard multiplicative range technique is (rand32() * range) >> 32.
+ * However, there are either floor((1<<32) / range) or ceil((1<<32) / range)
+ * random values that will result in each output value.  More specifically,
+ * lim = (1 << 32) % range values will be generated one extra time.
+ *
+ * Lemire proves (Lemma 4.1) that those extra values can be identified
+ * by the lsbits of the product.  If you discard and retry whenever the
+ * lsbits are less than lim, you get the floor option always.
+ */
+u32 __get_random_max32(u32 range)
+{
+	u64 prod = mul_u32_u32(get_random_u32(), range);
+
+	/*
+	 * Fast-path check: lim < range, so lsbits < lim
+	 * implies lsbits < range - 1.
+	 */
+	if (unlikely((u32)prod < range - 1)) {
+		/* Slow path; we need to divide to check exactly */
+		u32 lim = -range % range;	/* = (1<<32) % range */
+
+		while (unlikely((u32)prod < lim))
+			prod = mul_u32_u32(get_random_u32(), range);
+	}
+	return prod >> 32;
+}
+EXPORT_SYMBOL(__get_random_max32);
+
+#ifdef CONFIG_64BIT
+/**
+ * __get_random_max64 - Generate a uniform random number 0 <= x < range < 2^64
+ * @range: Modulus for random number; must not be zero!
+ *
+ * Like get_random_max32, but for larger ranges.  There are two
+ * implementations, depending on CONFIG_ARCH_SUPPORTS_INT128.
+ */
+u64 __get_random_max64(u64 range)
+{
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+	/* Same as get_random_max32(), with larger data types. */
+	unsigned __int128 prod;
+
+	if (range >> 32 == 0)
+		return __get_random_max32(range);
+
+	prod = (unsigned __int128)get_random_u64() * range;
+
+	if (unlikely((u64)prod < range - 1)) {
+		/* Slow path; we need to divide to check exactly */
+		u64 lim = -range % range;	/* = (1<<64) % range */
+
+		while ((u64)prod < lim)
+			prod = (unsigned __int128)get_random_u64() * range;
+	}
+	return prod >> 64;
+#else
+	/*
+	 * We don't have efficient 128-bit products (e.g. SPARCv9), so
+	 * break it into two halves: choose the high bits, then
+	 * choose the low bits depending on them.
+	 */
+	u32 lsbits, msbits;
+
+	if (range >> 32 == 0)
+		return __get_random_max32(range);
+
+	msbits = (range + 0xffffffff) >> 32;
+	/* If range >= 0xffffffff00000001, msbits will wrap to 0 */
+	if (likely(msbits))
+		msbits = __get_random_max32(msbits);
+	else
+		msbits = get_random_u32();
+	if (unlikely(msbits == (u32)(range >> 32)))
+		lsbits = __get_random_max32((u32)range);
+	else
+		lsbits = get_random_u32();
+	return (u64)msbits << 32 | lsbits;
+#endif
+}
+EXPORT_SYMBOL(__get_random_max64);
+#endif /* CONFIG_64BIT */
+
 /**
  * randomize_page - Generate a random, page aligned address
  * @start:	The smallest acceptable address the caller will take.
@@ -2370,20 +2463,24 @@ static void invalidate_batched_entropy(void)
 unsigned long randomize_page(unsigned long start, unsigned long range)
 {
-	if (!PAGE_ALIGNED(start)) {
-		range -= PAGE_ALIGN(start) - start;
-		start = PAGE_ALIGN(start);
-	}
+	/* How much to round up to make start page-aligned */
+	unsigned long align = -start & (PAGE_SIZE - 1);
 
-	if (start > ULONG_MAX - range)
-		range = ULONG_MAX - start;
+	if (unlikely(range < align))
+		return start;
+
+	start += align;
+	range -= align;
+
+	if (unlikely(range > -start))
+		range = -start;
 
 	range >>= PAGE_SHIFT;
 
-	if (range == 0)
+	if (unlikely(range == 0))
 		return start;
 
-	return start + (get_random_long() % range << PAGE_SHIFT);
+	return start + (get_random_max(range) << PAGE_SHIFT);
 }
 
 /* Interface for in-kernel drivers of true hardware RNGs.
diff --git a/include/linux/random.h b/include/linux/random.h
index 25c0e3f0152c..fc929369471f 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -60,6 +60,115 @@ static inline unsigned long get_random_long(void)
 #endif
 }
 
+/**
+ * get_random_max32 - return a 32-bit random number in interval [0, range)
+ * @range: Size of random interval [0, range); the first value not included.
+ *
+ * This uses Lemire's algorithm; see drivers/char/random.c for details.
+ *
+ * The expensive part of this is the "(1ull<<32) % range" computation.
+ * If that's a compile-time constant, things get simple enough to inline.
+ */
+/* When the range is a compile-time constant, so is lim, and it's simple. */
+static __always_inline u32 _get_random_max32(u32 range)
+{
+	u32 lim = -range % range;
+	u64 prod;
+
+	do
+		prod = mul_u32_u32(get_random_u32(), range);
+	while (unlikely((u32)prod < lim));
+	return prod >> 32;
+}
+/* Non-constant ranges are done out of line. */
+u32 __get_random_max32(u32 range);
+
+/* Decide between the two.  (Case #3 is power of two.) */
+static __always_inline u32 get_random_max32(u32 range)
+{
+	if (!__builtin_constant_p(range))
+		return __get_random_max32(range);
+	else if (range & (range - 1))
+		return _get_random_max32(range);
+	else
+		return get_random_u32() / (u32)(0x100000000 / range);
+}
+
+#ifdef CONFIG_64BIT
+/*
+ * The 64-bit equivalent is trickier.  If we have 64x64->128-bit multiply,
+ * the same algorithm works.  If we don't (*cough* SPARCv9 *cough*),
+ * we can do it in two halves.
+ */
+static __always_inline u64 _get_random_max64(u64 range)
+{
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+	u64 lim = -range % range;
+	unsigned __int128 prod;
+
+	do
+		prod = (unsigned __int128)get_random_u64() * range;
+	while (unlikely((u64)prod < lim));
+	return prod >> 64;
+
+#else	/* No 128-bit product - this is messy */
+	u32 lsbits, msbits = (range + 0xffffffff) >> 32;
+
+	/* If range >= 0xffffffff00000001, msbits will wrap to 0 */
+	if (msbits)
+		msbits = get_random_max32(msbits);
+	else
+		msbits = get_random_u32();
+	if ((u32)range && unlikely(msbits == (u32)(range >> 32)))
+		lsbits = get_random_max32((u32)range);
+	else
+		lsbits = get_random_u32();
+	return (u64)msbits << 32 | lsbits;
+#endif
+}
+
+u64 __get_random_max64(u64 range);
+
+static __always_inline u64 get_random_max64(u64 range)
+{
+	/*
+	 * We may know at compile time that range is 32 bits,
+	 * even if we don't know its exact value.
+	 */
+	if (__builtin_constant_p(range <= 0xffffffff) && range <= 0xffffffff)
+		return get_random_max32(range);
+	else if (!__builtin_constant_p(range))
+		return __get_random_max64(range);
+	else if (range & (range - 1))
+		return _get_random_max64(range);
+	else
+		return get_random_u64() / (0x100000000 / (range >> 32));
+}
+#endif
+
+/**
+ * get_random_max - Generate a uniform random number 0 <= x < range
+ * @range: Modulus for random number; must not be zero!
+ *
+ * This uses a cryptographic random number generator, for safety
+ * against malicious attackers trying to guess the output.
+ *
+ * This generates exactly uniform random numbers in the range, retrying
+ * occasionally if range is not a power of two.  If we're going to
+ * go to the expense of a cryptographic RNG, we may as well get the
+ * distribution right.
+ *
+ * For less critical applications (self-tests, error injection, dithered
+ * timers), use prandom_u32_max().
+ */
+static __always_inline unsigned long get_random_max(unsigned long range)
+{
+#if BITS_PER_LONG == 64
+	return get_random_max64(range);
+#else
+	return get_random_max32(range);
+#endif
+}
+
 /*
  * On 64-bit architectures, protect against non-terminated C string overflows
  * by zeroing out the first byte of the canary; this leaves 56 bits of entropy.
-- 
2.20.1