Date: Wed, 27 Mar 2019 21:43:02 GMT
From: George Spelvin
Message-Id: <201903272143.x2RLh1AP002035@sdf.org>
To: daniel@iogearbox.net, hannes@stressinduktion.org, lkml@sdf.org
Subject: Re: Revising prandom_32 generator
Cc: netdev@vger.kernel.org
In-Reply-To: <0ce9983d-f257-4808-bd84-4385587e1b41@www.fastmail.com>
References: <201903261110.x2QBAFmp001613@sdf.org>,
	<201903261117.x2QBHTnl002697@sdf.org>,
	<109c67e3-7cda-40cf-80e1-a2d3500a2b5d@www.fastmail.com>,
	<201903261907.x2QJ7148017418@sdf.org>,
	<0ce9983d-f257-4808-bd84-4385587e1b41@www.fastmail.com>

On Wed, 27 Mar 2019 at 14:32:55 -0400, Hannes Frederic Sowa wrote:
> On Tue, Mar 26, 2019, at 20:07, George Spelvin wrote:
>> On Tue, 26 Mar 2019 at 14:03:55 -0400, Hannes Frederic Sowa wrote:
>>> Are you trying to fix the modulo biases? I think that prandom_u32_max
>>> also has bias, would that be worth fixing as well?
>>
>> I could if you like.  I submitted a patch (appended) to do that
>> for /dev/urandom (using a very clever new algorithm that does it
>> all with a single multiply in the common case!), but didn't think
>> the prandom users were critical enough to care.
>
> I am not sure. A quick look shows that it might get used for
> selecting timeouts in the networking stack. Should not matter here
> at all. If used to select an index in pre-allocated arrays, then
> effects might be visible.

I looked around the code and couldn't find anything that looked
critical.  Remember that the bias is very small for small ranges.
You're choosing between floor(2^32/range) and ceil(2^32/range),
which is a relative difference of 1/(2^32/range) = range/2^32.
If range = 100, that's 23.3 ppb.  It reaches 1 ppm at range=4295.
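To put concrete numbers on that, here is a quick userspace sketch
(illustrative only, not part of the patch below) that prints the number
of over-represented outputs and the worst-case relative bias of the
truncated-multiply reduction for a few ranges:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		static const uint32_t ranges[] = { 100, 4295, 65536, 1000000 };
		size_t i;

		for (i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
			uint32_t r = ranges[i];
			/* lim = 2^32 mod r: how many outputs occur one extra time */
			uint32_t lim = -r % r;
			/* floor vs. ceil of 2^32/r differ by 1, i.e. ~r/2^32 relative */
			double bias_ppb = lim ? r / 4294967296.0 * 1e9 : 0.0;

			printf("range %7u: %6u over-represented outputs, bias ~%.3g ppb\n",
			       r, lim, bias_ppb);
		}
		return 0;
	}

(Power-of-two ranges such as 65536 have no bias at all, which is why the
power-of-two case can skip the correction entirely.)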
> Additionally, by adding all the branches it might make sense to
> downgrade the function from 'static inline' to regular function.

If the range is a compile-time constant, the code is

	u32 lim = -range % range;	/* compile-time constant */
	u64 prod;

	do
		prod = (u64)range * prandom_u32();
	while (unlikely((u32)prod < lim));
	return prod >> 32;

This is almost the same as the biased code, except that one compare
and one branch is added.  (And a multiply-low on processors with
separate multiply-high instructions.)

But if the range is variable, it's worth taking out of line.

> All in all, I don't think it is required and I would just add it
> if you think it is worth it as well. The additional comments you
> proposed are definitely worth to add.

I also judged it unnecessary, so until someone speaks up I won't do it.

> I wasn't sure if the non-consecutive extraction of output
> would blow up the Gaussian Elimination at some point (as in taking
> a few hours to reduce), which I would not consider trivial anymore.

No, it doesn't matter where the bits are as long as you know the
spacing.  What might happen with non-consecutive bits is that a few
are linearly dependent and so don't count toward the required 113.
This is easily solved by getting a few more bits.

Also, linear equations are just as good as bits.  For example, given
the output of prandom_u32_max(3), we can always derive an equation:

	Output = 0 -> bit 31 = 0
	Output = 1 -> bit 31 ^ bit 30 = 1
	Output = 2 -> bit 31 = 1

> Anyway, using those functions in those situations would already
> be a bug, as Stephen correctly said, but some safety net would be
> useful (e.g. observing timeouts from a box right after boot up to
> determine the initial state of the fragmentation IDs would be bad
> - but early reseeding etc. makes that already difficult).

If it's attackable, you're not supposed to use prandom().  Dithered
timeouts, for example, are a way for *cooperating* devices to avoid
fighting over a shared bus (or other resource).  A malicious device
can just monopolize the bus and fight all comers.

>> All of the proposed replacements are considerably less linear and
>> harder to invert.  The xoshiro family are the most linear,
>> with an LFSR core and only a very simple non-linear state masking.
>
> Then all of the proposed prngs are an improvement?

Definitely go for it.

>> Here's the patch mentioned above for unbiased range reduction.
>>
>> I just thought of a great improvement!  The slow computation
>> is "-range % range".  I could use __builtin_constant_p(range)
>> to decide between the following implementation and one that
>> has -range % range pre-computed and passed in.  I.e.
>
> Heh, if you already made the work and maybe benchmarked it, it
> might make sense to kill the bias in prandom_u32_max anyway? :)
> The proposal below looks great!

I already wrote and posted the code for get_random_u32().  The
question is code size and time, not implementation.  (It's appended
for reference.)

>> 	if ((u32)prod < range - 1) {
>
> Maybe 'unlikely' could help here?

That's in the final version; see below.
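As a sanity check of the rejection rule before the patch itself, here is
an exhaustive 8-bit analogue (userspace, illustrative only; the kernel
code below uses 32- and 64-bit products).  Every input is either rejected
or mapped to an output, and after rejection each output occurs exactly
floor(256/range) times:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const uint32_t range = 6;		/* any non-power-of-two */
		const uint32_t lim = 256 % range;	/* = (1 << 8) % range = 4 */
		uint32_t counts[6] = { 0 };		/* one counter per output */
		uint32_t rejected = 0;
		uint32_t x, i;

		for (x = 0; x < 256; x++) {
			uint32_t prod = x * range;

			if ((uint8_t)prod < lim)	/* low byte < lim: would retry */
				rejected++;
			else
				counts[prod >> 8]++;	/* high byte is the output */
		}
		for (i = 0; i < range; i++)
			printf("output %u: %u times\n", i, counts[i]);
		printf("rejected %u of 256 inputs\n", rejected);
		return 0;
	}

Each output comes out exactly 42 = floor(256/6) times, and exactly
lim = 4 inputs are rejected.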
In-Reply-To: <201903241244.x2OCiL8P011277@sdf.org>
References: <201903241244.x2OCiL8P011277@sdf.org>
From: George Spelvin
Subject: [RFC PATCH v2] random: add get_random_max() function
To: linux-kernel@vger.kernel.org, Theodore Ts'o
Cc: Jason A. Donenfeld, George Spelvin

RFC because:
- Only lightly tested so far, and
- Expecting feedback on the function names

v1->v2: I realized that if the range is a compile-time constant, the
code can be simplified considerably.  This version has two
implementations depending on __builtin_constant_p(range).
(Specifically, the expensive "lim = -range % range" is also a
compile-time constant, and the "fast path" which tries to avoid
computing it is unnecessary.)

I got annoyed that randomize_page() wasn't exactly uniform, and used
the slow modulo-based range reduction rather than the fast
multiplicative one, so I fixed it.  I figure, if you're going to the
trouble of using cryptographic random numbers, you don't want biased
outputs, even if it's slight.

This includes optimized code for compile-time constant ranges, for
the benefit of future callers, even though that doesn't help
randomize_page().

Signed-off-by: George Spelvin
Cc: Theodore Ts'o
Cc: Jason A. Donenfeld
---
 drivers/char/random.c  | 113 ++++++++++++++++++++++++++++++++++++++---
 include/linux/random.h | 109 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 214 insertions(+), 8 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 9f4a348c7eb5..ffe6af0e165b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -2353,6 +2353,99 @@ static void invalidate_batched_entropy(void)
 	write_unlock_irqrestore(&batched_entropy_reset_lock, flags);
 }
 
+/**
+ * __get_random_max32 - Generate a uniform random number 0 <= x < range < 2^32
+ * @range: Modulus for random number; must not be zero!
+ *
+ * This generates an exactly uniform distribution over [0, range), where
+ * range is *not* a compile-time constant.
+ *
+ * This uses Lemire's multiplicative algorithm, from
+ *	Fast Random Integer Generation in an Interval
+ *	https://arxiv.org/abs/1805.10941
+ *
+ * The standard multiplicative range technique is (rand32() * range) >> 32.
+ * However, there are either floor((1<<32) / range) or ceil((1<<32) / range)
+ * random values that will result in each output value.  More specifically,
+ * lim = (1 << 32) % range values will be generated one extra time.
+ *
+ * Lemire proves (Lemma 4.1) that those extra values can be identified
+ * by the lsbits of the product.  If you discard and retry whenever the
+ * lsbits are less than lim, you get the floor option always.
+ */
+u32 __get_random_max32(u32 range)
+{
+	u64 prod = mul_u32_u32(get_random_u32(), range);
+
+	/*
+	 * Fast-path check: lim < range, so lsbits < lim
+	 * implies lsbits < range - 1.
+	 */
+	if (unlikely((u32)prod < range - 1)) {
+		/* Slow path; we need to divide to check exactly */
+		u32 lim = -range % range;	/* = (1<<32) % range */
+
+		while (unlikely((u32)prod < lim))
+			prod = mul_u32_u32(get_random_u32(), range);
+	}
+	return prod >> 32;
+}
+EXPORT_SYMBOL(__get_random_max32);
+
+#ifdef CONFIG_64BIT
+/**
+ * __get_random_max64 - Generate a uniform random number 0 <= x < range < 2^64
+ * @range: Modulus for random number; must not be zero!
+ *
+ * Like get_random_max32, but for larger ranges.  There are two
+ * implementations, depending on CONFIG_ARCH_SUPPORTS_INT128.
+ */
+u64 __get_random_max64(u64 range)
+{
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+	/* Same as get_random_max32(), with larger data types. */
+	unsigned __int128 prod;
+
+	if (range >> 32 == 0)
+		return __get_random_max32(range);
+
+	prod = (unsigned __int128)get_random_u64() * range;
+
+	if (unlikely((u64)prod < range - 1)) {
+		/* Slow path; we need to divide to check exactly */
+		u64 lim = -range % range;	/* = (1<<64) % range */
+
+		while ((u64)prod < lim)
+			prod = (unsigned __int128)get_random_u64() * range;
+	}
+	return prod >> 64;
+#else
+	/*
+	 * We don't have efficient 128-bit products (e.g. SPARCv9), so
+	 * break it into two halves: choose the high bits, then
+	 * choose the low bits depending on them.
+	 */
+	u32 lsbits, msbits;
+
+	if (range >> 32 == 0)
+		return __get_random_max32(range);
+
+	msbits = (range + 0xffffffff) >> 32;
+	/* If range >= 0xffffffff00000001, msbits will wrap to 0 */
+	if (likely(msbits))
+		msbits = __get_random_max32(msbits);
+	else
+		msbits = get_random_u32();
+	if (unlikely(msbits == (u32)(range >> 32)))
+		lsbits = __get_random_max32((u32)range);
+	else
+		lsbits = get_random_u32();
+	return (u64)msbits << 32 | lsbits;
+#endif
+}
+EXPORT_SYMBOL(__get_random_max64);
+#endif /* CONFIG_64BIT */
+
 /**
  * randomize_page - Generate a random, page aligned address
  * @start:	The smallest acceptable address the caller will take.
@@ -2370,20 +2463,24 @@ static void invalidate_batched_entropy(void)
 unsigned long randomize_page(unsigned long start, unsigned long range)
 {
-	if (!PAGE_ALIGNED(start)) {
-		range -= PAGE_ALIGN(start) - start;
-		start = PAGE_ALIGN(start);
-	}
+	/* How much to round up to make start page-aligned */
+	unsigned long align = -start & (PAGE_SIZE - 1);
 
-	if (start > ULONG_MAX - range)
-		range = ULONG_MAX - start;
+	if (unlikely(range < align))
+		return start;
+
+	start += align;
+	range -= align;
+
+	if (unlikely(range > -start))
+		range = -start;
 
 	range >>= PAGE_SHIFT;
 
-	if (range == 0)
+	if (unlikely(range == 0))
 		return start;
 
-	return start + (get_random_long() % range << PAGE_SHIFT);
+	return start + (get_random_max(range) << PAGE_SHIFT);
 }
 
 /* Interface for in-kernel drivers of true hardware RNGs.
diff --git a/include/linux/random.h b/include/linux/random.h
index 25c0e3f0152c..fc929369471f 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -60,6 +60,115 @@ static inline unsigned long get_random_long(void)
 #endif
 }
 
+/**
+ * get_random_max32 - return a 32-bit random number in interval [0, range)
+ * @range: Size of random interval [0, range); the first value not included.
+ *
+ * This uses Lemire's algorithm; see drivers/char/random.c for details.
+ *
+ * The expensive part of this is the "(1ull<<32) % range" computation.
+ * If that's a compile-time constant, things get simple enough to inline.
+ */
+/* When the range is a compile-time constant, so is lim, and it's simple. */
+static __always_inline u32 _get_random_max32(u32 range)
+{
+	u32 lim = -range % range;
+	u64 prod;
+
+	do
+		prod = mul_u32_u32(get_random_u32(), range);
+	while (unlikely((u32)prod < lim));
+	return prod >> 32;
+}
+/* Non-constant ranges are done out of line. */
+u32 __get_random_max32(u32 range);
+
+/* Decide between the two.  (Case #3 is power of two.) */
+static __always_inline u32 get_random_max32(u32 range)
+{
+	if (!__builtin_constant_p(range))
+		return __get_random_max32(range);
+	else if (range & (range - 1))
+		return _get_random_max32(range);
+	else
+		return get_random_u32() / (u32)(0x100000000 / range);
+}
+
+#ifdef CONFIG_64BIT
+/*
+ * The 64-bit equivalent is trickier.  If we have 64x64->128-bit multiply,
+ * the same algorithm works.  If we don't (*cough* SPARCv9 *cough*),
+ * we can do it in two halves.
+ */
+static __always_inline u64 _get_random_max64(u64 range)
+{
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+	u64 lim = -range % range;
+	unsigned __int128 prod;
+
+	do
+		prod = (unsigned __int128)get_random_u64() * range;
+	while (unlikely((u64)prod < lim));
+	return prod >> 64;
+
+#else	/* No 128-bit product - this is messy */
+	u32 lsbits, msbits = (range + 0xffffffff) >> 32;
+
+	/* If range >= 0xffffffff00000001, msbits will wrap to 0 */
+	if (msbits)
+		msbits = get_random_max32(msbits);
+	else
+		msbits = get_random_u32();
+	if ((u32)range && unlikely(msbits == (u32)(range >> 32)))
+		lsbits = get_random_max32((u32)range);
+	else
+		lsbits = get_random_u32();
+	return (u64)msbits << 32 | lsbits;
+#endif
+}
+
+u64 __get_random_max64(u64 range);
+
+static __always_inline u64 get_random_max64(u64 range)
+{
+	/*
+	 * We may know at compile time that range is 32 bits,
+	 * even if we don't know its exact value.
+	 */
+	if (__builtin_constant_p(range <= 0xffffffff) && range <= 0xffffffff)
+		return get_random_max32(range);
+	else if (!__builtin_constant_p(range))
+		return __get_random_max64(range);
+	else if (range & (range - 1))
+		return _get_random_max64(range);
+	else
+		return get_random_u64() / (0x100000000 / (range >> 32));
+}
+#endif
+
+/**
+ * get_random_max - Generate a uniform random number 0 <= x < range
+ * @range: Modulus for random number; must not be zero!
+ *
+ * This uses a cryptographic random number generator, for safety
+ * against malicious attackers trying to guess the output.
+ *
+ * This generates exactly uniform random numbers in the range, retrying
+ * occasionally if range is not a power of two.  If we're going to
+ * go to the expense of a cryptographic RNG, we may as well get the
+ * distribution right.
+ *
+ * For less critical applications (self-tests, error injection, dithered
+ * timers), use prandom_u32_max().
+ */
+static __always_inline unsigned long get_random_max(unsigned long range)
+{
+#if BITS_PER_LONG == 64
+	return get_random_max64(range);
+#else
+	return get_random_max32(range);
+#endif
+}
+
 /*
  * On 64-bit architectures, protect against non-terminated C string overflows
  * by zeroing out the first byte of the canary; this leaves 56 bits of entropy.
-- 
2.20.1