Re: [PATCH 1/1] arm64: syscall: Direct PRNG kstack randomization

From: Jeremy Linton <jeremy.linton@arm.com>
To: Arnd Bergmann <arnd@arndb.de>, Kees Cook <keescook@chromium.org>
Cc: linux-arm-kernel@lists.infradead.org,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	"Jason A . Donenfeld" <Jason@zx2c4.com>,
	"Gustavo A. R. Silva" <gustavoars@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mark Brown <broonie@kernel.org>, Guo Hui <guohui@uniontech.com>,
	Manoj.Iyer@arm.com, linux-kernel@vger.kernel.org,
	linux-hardening@vger.kernel.org, James Yang <james.yang@arm.com>,
	Shiyou Huang <shiyou.huang@arm.com>
Subject: Re: [PATCH 1/1] arm64: syscall: Direct PRNG kstack randomization
Date: Fri, 8 Mar 2024 10:49:46 -0600	[thread overview]
Message-ID: <db7dfa2d-c7c6-4b10-981a-a7ecc87c8541@arm.com> (raw)
In-Reply-To: <f1dd15ce-69af-4315-8d7c-b7a480e541aa@app.fastmail.com>

Hi,

On 3/7/24 05:10, Arnd Bergmann wrote:
> On Wed, Mar 6, 2024, at 22:54, Jeremy Linton wrote:
>> On 3/6/24 14:46, Arnd Bergmann wrote:
>>> On Wed, Mar 6, 2024, at 00:33, Kees Cook wrote:
>>>> On Tue, Mar 05, 2024 at 04:18:24PM -0600, Jeremy Linton wrote:
>>>>> The existing arm64 stack randomization uses the kernel rng to acquire
>>>>> 5 bits of address space randomization. This is problematic because it
>>>>> creates non determinism in the syscall path when the rng needs to be
>>>>> generated or reseeded. This shows up as large tail latencies in some
>>>>> benchmarks and directly affects the minimum RT latencies as seen by
>>>>> cyclictest.
>>>>>
>>>>> Other architectures are using timers/cycle counters for this function,
>>>>> which is sketchy from a randomization perspective because it should be
>>>>> possible to estimate this value from knowledge of the syscall return
>>>>> time, and from reading the current value of the timer/counters.
>>>
>>> As I commented on the previous version, I don't want to see
>>> a change that only addresses one architecture like this. If you
>>> are convinced that using a cycle counter is a mistake, then we
>>> should do the same thing on the other architectures as well
>>> that currently use a cycle counter.
>>
>> I personally tend to agree as long as we aren't creating a similar set
>> of problems for those architectures as we are seeing on arm. Currently
>> the kstack rng on/off choice is basically zero overhead for them.
> 
> I think we have two separate problems to solve here: how
> strong a kernel stack randomization should be on a given system,
> and who gets to make the decision.
> 
> For the strength, we have at least four options:
> 
> - strong rng, most expensive
> - your new prng, less strong but somewhat cheaper and/or more
>    predictable overhead
> - cycle counter, cheap but probably even less strong,
>    needs architecture code.
> - no rng, no overhead and no protection.
> 
> On architectures that have a cheap hardware rng instruction, you
> can count that one as well.
> 
> For picking between the options, we have
> 
> - architecture maintainer
> - defconfig (CONFIG_RANDOMIZE_KSTACK_OFFSET)
> - boot time (randomize_kstack_offset=on)
> 
> For both of these lists, I would like to see as few options
> as possible, and in particular I think the architecture
> maintainer should not make that decision for the users.
> 
> If we want to improve this, I would start by changing
> the binary CONFIG_RANDOMIZE_KSTACK_OFFSET option into
> a three-way choice between cycle counter (where available),
> strong rng and off, possibly adding the cycle counter
> option to the two architectures that currently don't use
> it, while moving the strong rng into common code.
> 
> After that, we can debate adding a fourth option, or
> replacing one of the existing two that is too slow or
> not random enough.
> 
>>> Won't that defeat the purpose of the patch that was intended
>>> to make the syscall latency more predictable? At least the
>>> simpler approaches of reseeding from the kstack_rng()
>>> function itself would have this problem, deferring it to
>>> another context comes with a separate set of problems.
>>
>> And that describes why I've not come up with an inline reseeding
>> solution. Which of course isn't a problem on !arm if one just pushes a
>> few bits of a cycle counter into the rnd_state every few dozen syscalls,
>> or whatever. Mark R, mentioned offline the idea of just picking a few
>> bits off CNTVCT as a seed, but its so slow it basically has to be used
>> to fuzz a bit or two of rnd_state on some fairly long interval. Long
>> enough that if someone has a solution for extracting rnd_state it might
>> not add any additional security. Or that is my take, since i'm not a big
>> fan of any independent counter/clock based RNG seeding (AFAIK, entropy
>> from clocks requires multiple _independent_ sources).
> 
> I'm not sure I understand the logic. Do you mean that accessing
> CNTVCT itself is slow, or that reseeding based on CNTVCT is slow
> because of the overhead of reseeding?

Slow, as in, its running at a much lower frequency than a cycle counter.

> 
> On powerpc/s390/x86, the low bits of the respective cycle counter
> is simply used without any kind of bit shuffling, and we already
> rely on the cycle counter to be fast to access since it's used
> for timekeeping anywhere.
> 
> There is not even any attempt to use the most random bits of
> the cycle counter, as both the high 22 to 24 bits get masked
> out (to keep the wasted stack space small) and the low 3 or 4
> bits get ignored because of stack alignment. If there was
> any desire to make it more random, a trivial improvement
> would be:
> 
> +++ b/include/linux/randomize_kstack.h
> @@ -80,7 +80,7 @@ DECLARE_PER_CPU(u32, kstack_offset);
>          if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
>                                  &randomize_kstack_offset)) {            \
>                  u32 offset = raw_cpu_read(kstack_offset);               \
> -               offset ^= (rand);                                       \
> +               offset = ror32(offset, 5) & (rand);                     \
>                  raw_cpu_write(kstack_offset, offset);                   \
>          }                                                               \
>   } while (0)
> 
> My impression is that is is already bordering on becoming
> a "bespoke rng" implementation that Jason was objecting to,
> so the current version is intentionally left weak in order
> to not even give the appearance of being a security relevant
> feature.
> 
>> This is a bit out of my wheelhouse, so I defer to anyone with a better
>> feel or some actual data.
>>
>> The best plan I have at the moment is just some deferred work to call
>> kstack_rng_setup on some call or time based interval, which AFAIK isn't
>> ideal for RT workloads which expect ~100% CPU isolation. Plus, that
>> solution assumes we have some handle on how fast an attacker can extract
>> kstackrng sufficiently to make predictions.
> 
> I think you fundamentally can't have both. If you rely on the
> reseeding to happen for a particular number of syscalls, you
> eventually end up blocking on it, regardless of the context
> it runs in. Deferring it to another process means blocking
> for longer, and deferring it to a low-priority task would mean
> that real-time tasks get less randomness.
> 
>         Arnd