* [PATCH] kfence: buffer random bools in bitmask
@ 2022-10-26 20:40 Jason A. Donenfeld
  2022-10-26 20:45 ` Jason A. Donenfeld
  2022-10-27  0:04 ` Marco Elver
  0 siblings, 2 replies; 5+ messages in thread
From: Jason A. Donenfeld @ 2022-10-26 20:40 UTC (permalink / raw)
  To: kasan-dev, elver, patches; +Cc: Jason A. Donenfeld, Sebastian Andrzej Siewior

Recently kfence got a 4x speed up in calls to the RNG, due to internally
using get_random_u8() instead of get_random_u32() for its random boolean
values. We can extend that speed up by another 8x, to 32x total, by
buffering a long at a time and reading bits from it.

I'd looked into introducing a get_random_bool(), along with the
complexities required for that kind of function to work in the general
case. But kfence is the only high-speed user of random booleans in a hot
path, so we're better off open coding this to take advantage of kfence's
particularities.

In particular, we take advantage of the fact that kfence_guarded_alloc()
already disables interrupts for its raw spinlocks, so that we can keep
track of a per-cpu buffered boolean bitmask, without needing to add more
interrupt disabling.

This is slightly complicated by PREEMPT_RT, where we actually need to
take a local_lock instead. But the resulting code in both cases compiles
down to something very compact, and is basically zero cost.
Specifically, on !PREEMPT_RT, this amounts to:

    local_irq_save(flags);
    random boolean stuff;
    raw_spin_lock(&other_thing);
    do the existing stuff;
    raw_spin_unlock_irqrestore(&other_thing, flags);

By using a local_lock in the way this patch does, we now also get this
code on PREEMPT_RT:

    spin_lock(this_cpu_ptr(&local_lock));
    random boolean stuff;
    spin_unlock(this_cpu_ptr(&local_lock));
    raw_spin_lock_irqsave(&other_thing, flags);
    do the existing stuff;
    raw_spin_unlock_irqrestore(&other_thing, flags);

This is also optimal for RT systems. So all in all, this is pretty
good. But there are some compile-time conditionals in order to
accomplish this.

Cc: Marco Elver <elver@google.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 mm/kfence/core.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 6cbd93f2007b..c212ae0cecba 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -356,21 +356,47 @@ static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t g
 				  unsigned long *stack_entries, size_t num_stack_entries,
 				  u32 alloc_stack_hash)
 {
+	struct random_bools {
+		unsigned long bits;
+		unsigned int len;
+		local_lock_t lock;
+	};
+	static DEFINE_PER_CPU(struct random_bools, pcpu_bools) = {
+		.lock = INIT_LOCAL_LOCK(pcpu_bools.lock)
+	};
+	struct random_bools *bools;
 	struct kfence_metadata *meta = NULL;
 	unsigned long flags;
 	struct slab *slab;
 	void *addr;
-	const bool random_right_allocate = get_random_u32_below(2);
+	bool random_right_allocate;
 	const bool random_fault = CONFIG_KFENCE_STRESS_TEST_FAULTS &&
 				  !get_random_u32_below(CONFIG_KFENCE_STRESS_TEST_FAULTS);
 
+	local_lock_irqsave(&pcpu_bools.lock, flags);
+	bools = raw_cpu_ptr(&pcpu_bools);
+	if (unlikely(!bools->len)) {
+		bools->bits = get_random_long();
+		bools->len = BITS_PER_LONG;
+	}
+	random_right_allocate = bools->bits & 1;
+	bools->bits >>= 1;
+	bools->len--;
+
 	/* Try to obtain a free object. */
-	raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
+	else
+		raw_spin_lock(&kfence_freelist_lock);
 	if (!list_empty(&kfence_freelist)) {
 		meta = list_entry(kfence_freelist.next, struct kfence_metadata, list);
 		list_del_init(&meta->list);
 	}
-	raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
+	else
+		raw_spin_unlock(&kfence_freelist_lock);
+	local_unlock_irqrestore(&pcpu_bools.lock, flags);
 	if (!meta) {
 		atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_CAPACITY]);
 		return NULL;
-- 
2.38.1



* Re: [PATCH] kfence: buffer random bools in bitmask
  2022-10-26 20:40 [PATCH] kfence: buffer random bools in bitmask Jason A. Donenfeld
@ 2022-10-26 20:45 ` Jason A. Donenfeld
  2022-10-27  0:04 ` Marco Elver
  1 sibling, 0 replies; 5+ messages in thread
From: Jason A. Donenfeld @ 2022-10-26 20:45 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: kasan-dev, elver, patches

Hi Sebastian,

On Wed, Oct 26, 2022 at 10:40:31PM +0200, Jason A. Donenfeld wrote:
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
> +	else
> +		raw_spin_lock(&kfence_freelist_lock);
[...]
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
> +	else
> +		raw_spin_unlock(&kfence_freelist_lock);
> +	local_unlock_irqrestore(&pcpu_bools.lock, flags);

I assume those conditionals are really upsetting to you. If you don't
want those, can you propose something that has the same good codegen but
doesn't need the conditionals?

Jason


* Re: [PATCH] kfence: buffer random bools in bitmask
  2022-10-26 20:40 [PATCH] kfence: buffer random bools in bitmask Jason A. Donenfeld
  2022-10-26 20:45 ` Jason A. Donenfeld
@ 2022-10-27  0:04 ` Marco Elver
  2022-10-27  0:26   ` Jason A. Donenfeld
  1 sibling, 1 reply; 5+ messages in thread
From: Marco Elver @ 2022-10-27  0:04 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: kasan-dev, patches, Sebastian Andrzej Siewior

On Wed, 26 Oct 2022 at 13:40, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Recently kfence got a 4x speed up in calls to the RNG, due to internally
> using get_random_u8() instead of get_random_u32() for its random boolean
> values. We can extend that speed up by another 8x, to 32x total, by
> buffering a long at a time and reading bits from it.
>
> I'd looked into introducing a get_random_bool(), along with the
> complexities required for that kind of function to work in the general
> case. But kfence is the only high-speed user of random booleans in a hot
> path, so we're better off open coding this to take advantage of kfence's
> particularities.

kfence_guarded_alloc() is supposed to be a slow-path. And if it were a
hot-path, I currently see no evidence that a call into the RNG
dominates the time spent there.

Do you have profiles?

What are the real benefits of this change?
Is it to avoid depleting the entropy pool?

> In particular, we take advantage of the fact that kfence_guarded_alloc()
> already disables interrupts for its raw spinlocks, so that we can keep
> track of a per-cpu buffered boolean bitmask, without needing to add more
> interrupt disabling.
>
> This is slightly complicated by PREEMPT_RT, where we actually need to
> take a local_lock instead. But the resulting code in both cases compiles
> down to something very compact, and is basically zero cost.
> Specifically, on !PREEMPT_RT, this amounts to:
>
>     local_irq_save(flags);
>     random boolean stuff;
>     raw_spin_lock(&other_thing);
>     do the existing stuff;
>     raw_spin_unlock_irqrestore(&other_thing, flags);
>
> By using a local_lock in the way this patch does, we now also get this
> code on PREEMPT_RT:
>
>     spin_lock(this_cpu_ptr(&local_lock));
>     random boolean stuff;
>     spin_unlock(this_cpu_ptr(&local_lock));
>     raw_spin_lock_irqsave(&other_thing, flags);
>     do the existing stuff;
>     raw_spin_unlock_irqrestore(&other_thing, flags);
>
> This is also optimal for RT systems. So all in all, this is pretty
> good. But there are some compile-time conditionals in order to
> accomplish this.
>
> Cc: Marco Elver <elver@google.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  mm/kfence/core.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/mm/kfence/core.c b/mm/kfence/core.c
> index 6cbd93f2007b..c212ae0cecba 100644
> --- a/mm/kfence/core.c
> +++ b/mm/kfence/core.c
> @@ -356,21 +356,47 @@ static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t g
>                                   unsigned long *stack_entries, size_t num_stack_entries,
>                                   u32 alloc_stack_hash)
>  {
> +       struct random_bools {
> +               unsigned long bits;
> +               unsigned int len;
> +               local_lock_t lock;
> +       };
> +       static DEFINE_PER_CPU(struct random_bools, pcpu_bools) = {
> +               .lock = INIT_LOCAL_LOCK(pcpu_bools.lock)
> +       };

If I remember right, function-scoped static DEFINE_PER_CPU declarations
were disallowed (but I now cannot recall why or where it said that :-/).

> +       struct random_bools *bools;
>         struct kfence_metadata *meta = NULL;
>         unsigned long flags;
>         struct slab *slab;
>         void *addr;
> -       const bool random_right_allocate = get_random_u32_below(2);
> +       bool random_right_allocate;
>         const bool random_fault = CONFIG_KFENCE_STRESS_TEST_FAULTS &&
>                                   !get_random_u32_below(CONFIG_KFENCE_STRESS_TEST_FAULTS);
>
> +       local_lock_irqsave(&pcpu_bools.lock, flags);
> +       bools = raw_cpu_ptr(&pcpu_bools);
> +       if (unlikely(!bools->len)) {
> +               bools->bits = get_random_long();
> +               bools->len = BITS_PER_LONG;
> +       }
> +       random_right_allocate = bools->bits & 1;
> +       bools->bits >>= 1;
> +       bools->len--;

This should be factored into its own function that returns a result
for random_right_allocate.
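
Something like the following, perhaps; a rough sketch with a hypothetical
helper name, just pulling the bit-popping logic out of the patch as-is,
to be called with the local lock already held:

static bool pcpu_random_bool(struct random_bools *bools)
{
	bool bit;

	/* Refill the per-CPU buffer once all bits have been consumed. */
	if (unlikely(!bools->len)) {
		bools->bits = get_random_long();
		bools->len = BITS_PER_LONG;
	}

	/* Pop the lowest bit and account for it. */
	bit = bools->bits & 1;
	bools->bits >>= 1;
	bools->len--;

	return bit;
}

The call site would then just be "random_right_allocate =
pcpu_random_bool(bools);" under local_lock_irqsave().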

>         /* Try to obtain a free object. */
> -       raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
> +       else
> +               raw_spin_lock(&kfence_freelist_lock);
>         if (!list_empty(&kfence_freelist)) {
>                 meta = list_entry(kfence_freelist.next, struct kfence_metadata, list);
>                 list_del_init(&meta->list);
>         }
> -       raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
> +       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +               raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
> +       else
> +               raw_spin_unlock(&kfence_freelist_lock);
> +       local_unlock_irqrestore(&pcpu_bools.lock, flags);

Overall this introduces complexities that should be hidden behind some
new abstractions.

But besides that, I'd first want to understand what the real benefit
is given all this is supposed to be a slow-path.

Thanks,
-- Marco


* Re: [PATCH] kfence: buffer random bools in bitmask
  2022-10-27  0:04 ` Marco Elver
@ 2022-10-27  0:26   ` Jason A. Donenfeld
  2022-10-27  1:40     ` Marco Elver
  0 siblings, 1 reply; 5+ messages in thread
From: Jason A. Donenfeld @ 2022-10-27  0:26 UTC (permalink / raw)
  To: Marco Elver; +Cc: kasan-dev, patches, Sebastian Andrzej Siewior

Hi Marco,

On Wed, Oct 26, 2022 at 05:04:27PM -0700, Marco Elver wrote:
> Is it to avoid depleting the entropy pool?

The entropy pool never depletes, so no.

> kfence_guarded_alloc() is supposed to be a slow-path. And if it were a

Ahh, my huge misunderstanding, then. For some reason, I was under the
general assumption that this got called for every allocation. Given that
this apparently isn't the case, let's indeed just forget I posted this.

This then means, by the way, that there are in fact no fast-path
users of random booleans, which means get_random_bool() is totally
unnecessary. Before I thought this was the one case, hence open coding
it, but luckily that even isn't necessary.

Anyway, sorry for the noise.

Jason


* Re: [PATCH] kfence: buffer random bools in bitmask
  2022-10-27  0:26   ` Jason A. Donenfeld
@ 2022-10-27  1:40     ` Marco Elver
  0 siblings, 0 replies; 5+ messages in thread
From: Marco Elver @ 2022-10-27  1:40 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: kasan-dev, patches, Sebastian Andrzej Siewior

On Wed, 26 Oct 2022 at 17:26, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Marco,
>
> On Wed, Oct 26, 2022 at 05:04:27PM -0700, Marco Elver wrote:
> > Is it to avoid depleting the entropy pool?
>
> The entropy pool never depletes, so no.
>
> > kfence_guarded_alloc() is supposed to be a slow-path. And if it were a
>
> Ahh, my huge misunderstanding, then. For some reason, I was under the
> general assumption that this got called for every allocation. Given that
> this apparently isn't the case, let's indeed just forget I posted this.

No worries. Allocations via sl[au]b are "sampled", i.e. a fixed sample
interval decides if an allocation is redirected through KFENCE. The
lowest sample interval is 1ms, but that's generally not recommended
anyway; the default is 100ms. So if the system is busy doing
allocations, there's a call into kfence_guarded_alloc() roughly every
100ms.
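
For illustration, the gating is conceptually something like this;
hypothetical names and heavily simplified, the real logic lives in
mm/kfence/core.c:

/* Closed (non-zero) until the sample-interval timer resets it. */
static atomic_t allocation_gate = ATOMIC_INIT(1);

static bool sample_this_allocation(void)
{
	/* Only the first allocation after a reset is redirected to KFENCE. */
	return atomic_inc_return(&allocation_gate) == 1;
}

/* Run from a timer/delayed work once per sample interval (default 100ms). */
static void reset_allocation_gate(void)
{
	atomic_set(&allocation_gate, 0);
}

So the RNG call in kfence_guarded_alloc() happens at most once per sample
interval, not once per allocation.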

> This then means, by the way, that there are in fact no fast-path
> users of random booleans, which means get_random_bool() is totally
> unnecessary. Before I thought this was the one case, hence open coding
> it, but luckily that even isn't necessary.
>
> Anyway, sorry for the noise.

Thanks for clarifying - and good we sorted it out. ;-)

Thanks,
-- Marco

