linux-mm.kvack.org archive mirror
* [PATCH] mm/shuffle.c: optimize add_to_free_area_random()
@ 2020-03-17 13:50 George Spelvin
  2020-03-17 21:44 ` Kees Cook
  0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-17 13:50 UTC (permalink / raw)
  To: Dan Williams, linux-mm; +Cc: Kees Cook, Andrew Morton, lkml

First, use long rather than u64 for the bit buffer type, which
is significantly more efficient on 32-bit processors.

Second, avoid the need for a separate rand_bits counter.
rand_bits is never more than 63, so there's always room in rand
for a bit to mark the end of the available bits.  This makes the
shared state atomically updatable, which makes it a lot easier
to reason about race conditions.

Third, use READ_ONCE and WRITE_ONCE.  Without them, the compiler
may spill to the shared static in arbitrarily perverse ways,
and combined with the fact that the code eschews locking, that
is a recipe for hard-to-find bugs.  Now, a race might cause a bit
to be used twice, or get_random_long() to be called redundantly,
but it can't summon nasal daemons.

I've tried a few variants.  Keeping random lsbits with a most-
significant end marker, and using an explicit bool variable rather
than testing r, both increase code size slightly.

		x86_64	i386
This code		 94	 95
Explicit bool		103	 99
Lsbits		 99	101
Both		 96	100

Signed-off-by: George Spelvin <lkml@sdf.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
 mm/shuffle.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/shuffle.c b/mm/shuffle.c
index b3fe97fd6654..0e4bf6a8da52 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,25 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 void add_to_free_area_random(struct page *page, struct free_area *area,
 		int migratetype)
 {
-	static u64 rand;
-	static u8 rand_bits;
+	static long rand;	/* 0..BITS_PER_LONG-1 buffered random bits */
+	long r = READ_ONCE(rand), rshift = r << 1;;
 
 	/*
-	 * The lack of locking is deliberate. If 2 threads race to
+	 * rand holds some random msbits, with the end marked by a 1 bit.
+	 * This allows us to maintain the pre-generated bits and the
+	 * count of bits in a single, atomically updatable, variable.
+	 *
+	 * The lack of locking is deliberate. If two threads race to
 	 * update the rand state it just adds to the entropy.
 	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
+	if (unlikely(rshift == 0)) {
+		r = get_random_long();
+		rshift = r << 1 | 1;
 	}
+	WRITE_ONCE(rand, rshift);
 
-	if (rand & 1)
+	if (r < 0)
 		add_to_free_area(page, area, migratetype);
 	else
 		add_to_free_area_tail(page, area, migratetype);
-	rand_bits--;
-	rand >>= 1;
 }
-- 
2.25.1



* Re: [PATCH] mm/shuffle.c: optimize add_to_free_area_random()
  2020-03-17 13:50 [PATCH] mm/shuffle.c: optimize add_to_free_area_random() George Spelvin
@ 2020-03-17 21:44 ` Kees Cook
  2020-03-17 23:06   ` George Spelvin
  0 siblings, 1 reply; 25+ messages in thread
From: Kees Cook @ 2020-03-17 21:44 UTC (permalink / raw)
  To: George Spelvin; +Cc: Dan Williams, linux-mm, Andrew Morton

On Tue, Mar 17, 2020 at 01:50:35PM +0000, George Spelvin wrote:
> First, use long rather than u64 for the bit buffer type, which
> is significantly more efficient on 32-bit processors.

Why? This is twice as many get_random_*() calls for 32-bit: it'll save
whatever bits it doesn't use. (Speaking to word-sized stores, yes, that
makes sense, but see below...)

> Second, avoid the need for a separate rand_bits counter.
> rand_bits is never more than 63, so there's always room in rand
> for a bit to mark the end of the available bits.  This makes the
> shared state atomically updatable, which makes it a lot easier
> to reason about race conditions.

What are the race conditions? I had to go look, and I see that
add_to_free_area_random() is called under __free_one_page(), which is
under &zone->lock. Would it make more sense to store the randomness
per-zone and entirely side-step the issues?

> Third, use READ_ONCE and WRITE_ONCE.  Without them, the compiler
> may spill to the shared static in arbitrarily perverse ways,
> and combined with the fact that the code eschews locking, that
> is a recipe for hard-to-find bugs.  Now, a race might cause a bit
> to be used twice, or get_random_long() to be called redundantly,
> but it can't summon nasal daemons.

All these things might make the results less predictable, too. :) But,
yes, if there was deterministic sharing of random values, that'd be
bad. But this patch doesn't really solve the "use the same random bits"
problem, which is the very thing that might get us into a predictable
state. If two zones both call READ_ONCE(), use the same value (eek),
and then both call WRITE_ONCE(), one of them wins the write (no big deal).

-- 
Kees Cook



* Re: [PATCH] mm/shuffle.c: optimize add_to_free_area_random()
  2020-03-17 21:44 ` Kees Cook
@ 2020-03-17 23:06   ` George Spelvin
  2020-03-17 23:38     ` Kees Cook
  0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-17 23:06 UTC (permalink / raw)
  To: Kees Cook; +Cc: Dan Williams, linux-mm, Andrew Morton, lkml

On Tue, Mar 17, 2020 at 02:44:33PM -0700, Kees Cook wrote:
> On Tue, Mar 17, 2020 at 01:50:35PM +0000, George Spelvin wrote:
> > First, use long rather than u64 for the bit buffer type, which
> > is significantly more efficient on 32-bit processors.
> 
> Why? This is twice as many get_random_*() calls for 32-bit: it'll save
> whatever bits it doesn't use. (Speaking to word-sized stores, yes, that
> makes sense, but see below...)

Yes, but the per-call overhead is fairly low until you run out of the
64-byte buffer; then there's an expensive crypto operation to
refill the buffer.

So two get_random_u32 calls are not a lot more than one get_random_u64.
(The buffers are per-CPU, so they are pretty cheap to access.)  (I have
other patches I'm working on to speed up the locking they have, but this
one is independent, so I'm sending it separately.)

And if you want faster, prandom_u32() is probably good enough for this
application.

I did it to speed up and shrink the fast path.  It saves at least
3 instructions on 32-bit machines (one load, one rotate through carry, 
and one store), at the cost of one extra get_random_*() call per 64
bits.  If get_random_u32()'s fast path is less than 192 instructions,
it's a win.

It's also relevant to the race condition reasoning mentioned below;
a one-word value can be atomically updated.  On a 32-bit machine,
64-bit operations get tearing and it's a PITA to reason about.
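
For instance, with the current static u64, even the simple update is a
multi-word read-modify-write on a 32-bit machine.  A sketch of the
effect (not the exact code GCC emits):

	static u64 rand;
	...
	rand >>= 1;	/* on i386: load two 32-bit halves, shift across
			 * them, store two halves.  A reader running
			 * between the two stores sees a torn value.
			 * A long is one word, so it can't tear. */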

>> Second, avoid the need for a separate rand_bits counter.
>> rand_bits is never more than 63, so there's always room in rand
>> for a bit to mark the end of the available bits.  This makes the
>> shared state atomically updatable, which makes it a lot easier
>> to reason about race conditions.
> 
> What are the race conditions? I had to go look at I see that
> add_to_free_area_random() is called under __free_one_page() which is
> under &zone->lock. Would it make more sense to store the randomness
> per-zone and entirely side-step the issues?

The most serious is that if two threads simultaneously observe
rand_bits == 1, but do their decrements separately, you end up
with rand_bits = 255 and you generate 255 consecutive 0 bits
before refilling the buffer.
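
Spelled out against the old code, the bad interleaving is roughly as
follows (a sketch of the sequence of events, not the exact generated
code):

	/* both CPUs enter with rand_bits == 1 */
	if (rand_bits == 0) {	/* both read 1, so neither refills */
		...
	}
	if (rand & 1)		/* both consume the one remaining bit */
		...
	rand_bits--;	/* CPU A stores 0; CPU B then loads 0 and stores
			 * 255 (the u8 wraps around) */
	rand >>= 1;	/* rand is now 0, so the next 255 calls all see a 0 bit */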

Since we're only generating random bits, a screwed-up answer occasionally 
doesn't really matter (like the comment says "lack of locking is 
deliberate"), but 255 screwed up bits is a bit much.

I avoided changing the underlying locking model because I didn't
feel up to such an invasive change; I managed to fix the problems
I saw without going there.  And shrink the code; that seemed like
enough of a win to justify it to me.

If you want to change the locking, the other alternative to per-zone is 
per-CPU, which also avoids locking.  We could even add a generic 
get_random_bit() function.
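
Just to make that concrete, a sketch of what such a helper could look
like (get_random_bit() is purely hypothetical, and the read/modify/write
is not atomic against preemption or migration, which is fine here
because all we want is entropy):

	/* would need <linux/percpu.h> and <linux/random.h> */
	static DEFINE_PER_CPU(unsigned long, rand_bit_buf); /* msbits + end marker */

	static bool get_random_bit(void)
	{
		unsigned long r = this_cpu_read(rand_bit_buf);
		unsigned long next = r << 1;

		if (unlikely(next == 0)) {	/* only the marker (or nothing) left */
			r = get_random_long();
			next = r << 1 | 1;	/* append a fresh end marker */
		}
		this_cpu_write(rand_bit_buf, next);

		return (long)r < 0;		/* consume the msbit */
	}

Then add_to_free_area_random() would reduce to calling get_random_bit()
and picking head or tail.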

>> Third, use READ_ONCE and WRITE_ONCE.  Without them, the compiler
>> may spill to the shared static in arbitrarily perverse ways,
>> and combined with the fact that the code eschews locking, that
>> is a recipe for hard-to-find bugs.  Now, a race might cause a bit
>> to be used twice, or get_random_long() to be called redundantly,
>> but it can't summon nasal daemons.
> 
> All these things might make the results less predictable, too. :) But,
> yes, if there was deterministic sharing of random values, that'd be
> bad. But this patch doesn't really solve the "use the same random bits"
> problem, which is the very thing that might get us into a predictable
> state. If two zones both call READ_ONCE(), use the same value (eek),
> and then both call WRITE_ONCE(), one of them wins the write (no big deal).

Absolutely, but in case of a tight race, only *one* bit is used twice,
and the bit used is uniformly distributed, so it's a really small
correlation.  Unlike the 255x zero race the existing code can have.

The main reason I did this is to, as I already wrote, keep the nasal 
demons at bay.  The existing code does things (having another thread 
change a non-volatile variable while this code is executing) that are 
explicitly undefined by the ANSI C standard.
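
A concrete, if tame, example of what the compiler is free to do without
READ_ONCE/WRITE_ONCE: nothing stops it from loading the shared static
twice in the existing code:

	if (rand & 1)			/* load #1 of rand */
		add_to_free_area(page, area, migratetype);
	else
		add_to_free_area_tail(page, area, migratetype);
	rand >>= 1;			/* load #2 + store; if another CPU
					 * changed rand in between, the bit
					 * just tested and the bit now
					 * discarded are different bits */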

The compiler is allowed to (in John Woods' memorable explanation)
produce code that makes demons fly out of your nose.  (More plausibly,
it may simply crash.)

It probably won't, but I prefer to stay on the safe side of the
language specification contract.  Linus has ranted on more than one 
occasion about the perverse ways that GCC has used its permission
to assume that people won't violate the contract.



* Re: [PATCH] mm/shuffle.c: optimize add_to_free_area_random()
  2020-03-17 23:06   ` George Spelvin
@ 2020-03-17 23:38     ` Kees Cook
  2020-03-18  1:44       ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
  0 siblings, 1 reply; 25+ messages in thread
From: Kees Cook @ 2020-03-17 23:38 UTC (permalink / raw)
  To: George Spelvin; +Cc: Dan Williams, linux-mm, Andrew Morton

On Tue, Mar 17, 2020 at 11:06:12PM +0000, George Spelvin wrote:
> The most serious is that if two threads simultaneously observe
> rand_bits == 1, but do their decrements separately, you end up
> with rand_bits = 255 and you generate 255 consecutive 0 bits
> before refilling the buffer.
> 
> Since we're only generating random bits, a screwed-up answer occasionally 
> doesn't really matter (like the comment says "lack of locking is 
> deliberate"), but 255 screwed up bits is a bit much.

Okay, I'm on board! :) Thanks for spelling this race out; I hadn't seen
quite how nasty it could get. (Perhaps mention in the commit log for v2?)

> I avoided changing the underlying locking model because I didn't
> feel up to such an invasive change; I managed to fix the problems
> I saw without going there.  And shrink the code; tht seemed like
> enough of a win to justify it to me.

Fair enough.

> The compiler is allowed to (in John Woods' memorable explanation)
> produce code that makes demons fly out of your nose.  (More plausibly,
> it may simply crash.)

So one thing that I see here that is still in the nasal demon realm is
that the left-shift of a signed value, which is technically undefined
behavior in C. (See the comment on check_shl_overflow().)

Doing a signedness check is very cheap in the resulting machine code,
but is sticking to unsigned and reversing direction for a bottom-bit
test too bad?

i.e.:

	static unsigned long rand_bits;
	unsigned long r = READ_ONCE(rand_bits), rshift = r >> 1;

	if (unlikely(rshift == 0)) {
		r = get_random_long();
		rshift = (r >> 1) | (0x1UL << (BITS_PER_LONG - 1));
	}
	WRITE_ONCE(rand_bits, rshift);

	if (r & 1)
		add_to...
	else
		add_to...tail



-- 
Kees Cook



* [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-17 23:38     ` Kees Cook
@ 2020-03-18  1:44       ` George Spelvin
  2020-03-18  1:49         ` Randy Dunlap
                           ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18  1:44 UTC (permalink / raw)
  To: Kees Cook; +Cc: Dan Williams, linux-mm, Andrew Morton, lkml

The old code had separate "rand" and "rand_count" variables,
which could get out of sync with bad results.

In the worst case, two threads would see rand_count == 1 and
both decrement it, resultint in rand_count = 255 and rand being
filled with zeros for the next 255 calls.

Instead, pack them both into a single, atomically updatable,
variable.  This makes it a lot easier to reason about race
conditions.  They are still there - the code deliberately eschews
locking - but basically harmless on the rare occasions that
they happen.

Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
in the land of nasal demons.  The compiler would be free to spill
temporaries to the static variables in arbitrary perverse ways
and create hard-to-find bugs.

(Alternatively, we could declare the static variable "volatile",
one of the few places in the Linux kernel that would be correct,
but it would probably annoy Linus.)

Third, use long rather than u64.  This not only keeps the
state atomically updatable, it also speeds up the fast path
on 32-bit machines.  Saving at least three instructions on
the fast path (one load, one add-with-carry, and one store)
is worth exchanging one call to get_random_u64 for two
calls to get_random_u32.  The fast path of get_random_* is
less than the 3*64 = 192 instructions saved, and the slow
path happens every 64 bytes so isn't affectrd by the change.

I've tried a few variants.  Keeping random lsbits with
a most-significant end marker, and using an explicit bool
flag rather than testing r both increase code size slightly.

		x86_64	i386
This code		 94	 95
Explicit bool		103	 99
Lsbits		 99	101
Both		 96	100

Signed-off-by: George Spelvin <lkml@sdf.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
 mm/shuffle.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/shuffle.c b/mm/shuffle.c
index e0ed247f8d90..4ba3ba84764d 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 void add_to_free_area_random(struct page *page, struct free_area *area,
 		int migratetype)
 {
-	static u64 rand;
-	static u8 rand_bits;
+	static long rand;	/* 0..BITS_PER_LONG-1 buffered random bits */
+	unsigned long r = READ_ONCE(rand), rshift = r << 1;;
 
 	/*
-	 * The lack of locking is deliberate. If 2 threads race to
-	 * update the rand state it just adds to the entropy.
+	 * rand holds some random msbits, with a 1 bit appended, followed
+	 * by zero-padding in the lsbits.  This allows us to maintain
+	 * the pre-generated bits and the count of bits in a single,
+	 * atomically updatable, variable.
+	 *
+	 * The lack of locking is deliberate. If two threads race to
+	 * update the rand state it just adds to the entropy.  The
+	 * worst that can happen is a random bit is used twice, or
+	 * get_random_long is called redundantly.
 	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
+	if (unlikely(rshift == 0)) {
+		r = get_random_long();
+		rshift = r << 1 | 1;
 	}
+	WRITE_ONCE(rand, rshift);
 
-	if (rand & 1)
+	if ((long)r < 0)
 		add_to_free_area(page, area, migratetype);
 	else
 		add_to_free_area_tail(page, area, migratetype);
-	rand_bits--;
-	rand >>= 1;
 }
-- 
2.26.0.rc2



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18  1:44       ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
@ 2020-03-18  1:49         ` Randy Dunlap
  2020-03-18  3:53         ` Dan Williams
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Randy Dunlap @ 2020-03-18  1:49 UTC (permalink / raw)
  To: George Spelvin, Kees Cook; +Cc: Dan Williams, linux-mm, Andrew Morton

Hi,

On 3/17/20 6:44 PM, George Spelvin wrote:
> +	unsigned long r = READ_ONCE(rand), rshift = r << 1;;

don't need the double ending ;;

-- 
~Randy




* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18  1:44       ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
  2020-03-18  1:49         ` Randy Dunlap
@ 2020-03-18  3:53         ` Dan Williams
  2020-03-18  8:20           ` George Spelvin
  2020-03-18  3:58         ` Kees Cook
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-03-18  3:53 UTC (permalink / raw)
  To: George Spelvin; +Cc: Kees Cook, Linux MM, Andrew Morton

On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
>
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being
> filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updatable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
> in the land of nasal demons.  The compiler would be free to spill
> temporaries to the static variables in arbitrary perverse ways
> and create hard-to-find bugs.
>
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
>
> Third, use long rather than u64.  This not only keeps the
> state atomically updatable, it also speeds up the fast path
> on 32-bit machines.  Saving at least three instructions on
> the fast path (one load, one add-with-carry, and one store)
> is worth exchanging one call to get_random_u64 for two
> calls to get_random_u32.  The fast path of get_random_* is
> less than the 3*64 = 192 instructions saved, and the slow
> path happens every 64 bytes so isn't affectrd by the change.
>
> I've tried a few variants.  Keeping random lsbits with
> a most-significant end marker, and using an explicit bool
> flag rather than testing r both increase code size slightly.
>
>                 x86_64  i386
> This code                94      95
> Explicit bool           103      99
> Lsbits           99     101
> Both             96     100
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
>  mm/shuffle.c | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..4ba3ba84764d 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
>  void add_to_free_area_random(struct page *page, struct free_area *area,
>                 int migratetype)
>  {
> -       static u64 rand;
> -       static u8 rand_bits;
> +       static long rand;       /* 0..BITS_PER_LONG-1 buffered random bits */
> +       unsigned long r = READ_ONCE(rand), rshift = r << 1;;
>
>         /*
> -        * The lack of locking is deliberate. If 2 threads race to
> -        * update the rand state it just adds to the entropy.
> +        * rand holds some random msbits, with a 1 bit appended, followed
> +        * by zero-padding in the lsbits.  This allows us to maintain
> +        * the pre-generated bits and the count of bits in a single,
> +        * atomically updatable, variable.
> +        *
> +        * The lack of locking is deliberate. If two threads race to
> +        * update the rand state it just adds to the entropy.  The
> +        * worst that can happen is a random bit is used twice, or
> +        * get_random_long is called redundantly.
>          */
> -       if (rand_bits == 0) {
> -               rand_bits = 64;
> -               rand = get_random_u64();
> +       if (unlikely(rshift == 0)) {

I had the impression that unless unlikely is "mostly never" then it
can do more harm than good. Is a branch guaranteed to be taken every
BITS_PER_LONG'th occurrence really a candidate for unlikely()
annotation?

Otherwise this change looks good to me.

> +               r = get_random_long();
> +               rshift = r << 1 | 1;
>         }
> +       WRITE_ONCE(rand, rshift);
>
> -       if (rand & 1)
> +       if ((long)r < 0)
>                 add_to_free_area(page, area, migratetype);
>         else
>                 add_to_free_area_tail(page, area, migratetype);
> -       rand_bits--;
> -       rand >>= 1;
>  }
> --
> 2.26.0.rc2



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18  1:44       ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
  2020-03-18  1:49         ` Randy Dunlap
  2020-03-18  3:53         ` Dan Williams
@ 2020-03-18  3:58         ` Kees Cook
  2020-03-18 15:26         ` Alexander Duyck
  2020-03-18 20:39         ` [PATCH v3] " George Spelvin
  4 siblings, 0 replies; 25+ messages in thread
From: Kees Cook @ 2020-03-18  3:58 UTC (permalink / raw)
  To: George Spelvin; +Cc: Dan Williams, linux-mm, Andrew Morton

On Wed, Mar 18, 2020 at 01:44:10AM +0000, George Spelvin wrote:
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
> 
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being

typo: resultint -> resulting

> filled with zeros for the next 255 calls.
> 
> Instead, pack them both into a single, atomically updatable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
> 
> Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
> in the land of nasal demons.  The compiler would be free to spill
> temporaries to the static variables in arbitrary perverse ways
> and create hard-to-find bugs.
> 
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
> 
> Third, use long rather than u64.  This not only keeps the
> state atomically updatable, it also speeds up the fast path
> on 32-bit machines.  Saving at least three instructions on
> the fast path (one load, one add-with-carry, and one store)
> is worth exchanging one call to get_random_u64 for two
> calls to get_random_u32.  The fast path of get_random_* is
> less than the 3*64 = 192 instructions saved, and the slow
> path happens every 64 bytes so isn't affectrd by the change.
> 
> I've tried a few variants.  Keeping random lsbits with
> a most-significant end marker, and using an explicit bool
> flag rather than testing r both increase code size slightly.
> 
> 		x86_64	i386
> This code		 94	 95
> Explicit bool		103	 99
> Lsbits		 99	101
> Both		 96	100
> 
> Signed-off-by: George Spelvin <lkml@sdf.org>

And with Randy's other fix, please consider this:

Acked-by: Kees Cook <keescook@chromium.org>

-Kees

> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
>  mm/shuffle.c | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..4ba3ba84764d 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
>  void add_to_free_area_random(struct page *page, struct free_area *area,
>  		int migratetype)
>  {
> -	static u64 rand;
> -	static u8 rand_bits;
> +	static long rand;	/* 0..BITS_PER_LONG-1 buffered random bits */
> +	unsigned long r = READ_ONCE(rand), rshift = r << 1;;
>  
>  	/*
> -	 * The lack of locking is deliberate. If 2 threads race to
> -	 * update the rand state it just adds to the entropy.
> +	 * rand holds some random msbits, with a 1 bit appended, followed
> +	 * by zero-padding in the lsbits.  This allows us to maintain
> +	 * the pre-generated bits and the count of bits in a single,
> +	 * atomically updatable, variable.
> +	 *
> +	 * The lack of locking is deliberate. If two threads race to
> +	 * update the rand state it just adds to the entropy.  The
> +	 * worst that can happen is a random bit is used twice, or
> +	 * get_random_long is called redundantly.
>  	 */
> -	if (rand_bits == 0) {
> -		rand_bits = 64;
> -		rand = get_random_u64();
> +	if (unlikely(rshift == 0)) {
> +		r = get_random_long();
> +		rshift = r << 1 | 1;
>  	}
> +	WRITE_ONCE(rand, rshift);
>  
> -	if (rand & 1)
> +	if ((long)r < 0)
>  		add_to_free_area(page, area, migratetype);
>  	else
>  		add_to_free_area_tail(page, area, migratetype);
> -	rand_bits--;
> -	rand >>= 1;
>  }
> -- 
> 2.26.0.rc2

-- 
Kees Cook



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18  3:53         ` Dan Williams
@ 2020-03-18  8:20           ` George Spelvin
  2020-03-18 17:36             ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-18  8:20 UTC (permalink / raw)
  To: Dan Williams; +Cc: Kees Cook, Linux MM, Andrew Morton, lkml

On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>> -       if (rand_bits == 0) {
>> -               rand_bits = 64;
>> -               rand = get_random_u64();
>> +       if (unlikely(rshift == 0)) {
> 
> I had the impression that unless unlikely is "mostly never" then it
> can do more harm than good. Is a branch guaranteed to be taken every
> BITS_PER_LONG'th occurrence really a candidate for unlikely()
> annotation?

I had to look this up.  GCC manual:

For the purposes of branch prediction optimizations, the probability
that a '__builtin_expect' expression is 'true' is controlled by GCC's
'builtin-expect-probability' parameter, which defaults to 90%.  You can
also use '__builtin_expect_with_probability' to explicitly assign a
probability value to individual expressions.

So I think that <= 10% is good enough, which is true in this case.
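
For reference, the kernel's hint macros boil down to roughly the
following (the real definitions in include/linux/compiler.h grow
branch-profiling hooks in some configs), and GCC 9 and later can also
be told the probability directly:

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)

	/* GCC 9+: state the 1/BITS_PER_LONG probability explicitly */
	if (__builtin_expect_with_probability(rshift == 0, 1, 1.0 / BITS_PER_LONG))
		...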

I was trying to encourage the compiler to:
* Place this code path out of line, and
* Not do the stack manipulations (build a frame, spill registers)
  needed for a non-leaf function if this path isn't taken.



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18  1:44       ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
                           ` (2 preceding siblings ...)
  2020-03-18  3:58         ` Kees Cook
@ 2020-03-18 15:26         ` Alexander Duyck
  2020-03-18 18:35           ` George Spelvin
  2020-03-18 20:39         ` [PATCH v3] " George Spelvin
  4 siblings, 1 reply; 25+ messages in thread
From: Alexander Duyck @ 2020-03-18 15:26 UTC (permalink / raw)
  To: George Spelvin; +Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton

On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
>
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being
> filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updatable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
> in the land of nasal demons.  The compiler would be free to spill
> temporaries to the static variables in arbitrary perverse ways
> and create hard-to-find bugs.
>
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
>
> Third, use long rather than u64.  This not only keeps the
> state atomically updatable, it also speeds up the fast path
> on 32-bit machines.  Saving at least three instructions on
> the fast path (one load, one add-with-carry, and one store)
> is worth exchanging one call to get_random_u64 for two
> calls to get_random_u32.  The fast path of get_random_* is
> less than the 3*64 = 192 instructions saved, and the slow
> path happens every 64 bytes so isn't affectrd by the change.
>
> I've tried a few variants.  Keeping random lsbits with
> a most-significant end marker, and using an explicit bool
> flag rather than testing r both increase code size slightly.
>
>                 x86_64  i386
> This code                94      95
> Explicit bool           103      99
> Lsbits           99     101
> Both             96     100
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
>  mm/shuffle.c | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..4ba3ba84764d 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
>  void add_to_free_area_random(struct page *page, struct free_area *area,
>                 int migratetype)
>  {
> -       static u64 rand;
> -       static u8 rand_bits;
> +       static long rand;       /* 0..BITS_PER_LONG-1 buffered random bits */
> +       unsigned long r = READ_ONCE(rand), rshift = r << 1;;
>
>         /*
> -        * The lack of locking is deliberate. If 2 threads race to
> -        * update the rand state it just adds to the entropy.
> +        * rand holds some random msbits, with a 1 bit appended, followed
> +        * by zero-padding in the lsbits.  This allows us to maintain
> +        * the pre-generated bits and the count of bits in a single,
> +        * atomically updatable, variable.
> +        *
> +        * The lack of locking is deliberate. If two threads race to
> +        * update the rand state it just adds to the entropy.  The
> +        * worst that can happen is a random bit is used twice, or
> +        * get_random_long is called redundantly.
>          */
> -       if (rand_bits == 0) {
> -               rand_bits = 64;
> -               rand = get_random_u64();
> +       if (unlikely(rshift == 0)) {
> +               r = get_random_long();
> +               rshift = r << 1 | 1;

You might want to wrap the "r << 1" in parenthesis. Also you could
probably use a + 1 instead of an | 1.

>         }
> +       WRITE_ONCE(rand, rshift);
>
> -       if (rand & 1)
> +       if ((long)r < 0)

One trick you might be able to get away with here is to actually
compare r to rshift. "If (rshift <= r)" should give you the same
result. This works since what you are essentially doing is just adding
r to itself so if you overflow rshift will be equal to at most r - 1.
However with the addition of the single bit in the rshift == 0 case it
could potentially be equal in the unlikely case of r being all 1's.

>                 add_to_free_area(page, area, migratetype);
>         else
>                 add_to_free_area_tail(page, area, migratetype);
> -       rand_bits--;
> -       rand >>= 1;
>  }
> --
> 2.26.0.rc2
>



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18  8:20           ` George Spelvin
@ 2020-03-18 17:36             ` Dan Williams
  2020-03-18 19:29               ` George Spelvin
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-03-18 17:36 UTC (permalink / raw)
  To: George Spelvin; +Cc: Kees Cook, Linux MM, Andrew Morton

On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
>
> On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> > On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
> >> -       if (rand_bits == 0) {
> >> -               rand_bits = 64;
> >> -               rand = get_random_u64();
> >> +       if (unlikely(rshift == 0)) {
> >
> > I had the impression that unless unlikely is "mostly never" then it
> > can do more harm than good. Is a branch guaranteed to be taken every
> > BITS_PER_LONG'th occurrence really a candidate for unlikely()
> > annotation?
>
> I had to look this up.  GCC manual:
>
> For the purposes of branch prediction optimizations, the probability
> that a '__builtin_expect' expression is 'true' is controlled by GCC's
> 'builtin-expect-probability' parameter, which defaults to 90%.  You can
> also use '__builtin_expect_with_probability' to explicitly assign a
> probability value to individual expressions.
>
> So I think that <= 10% is good enough, which is true in this case.
>
> I was tring to encourage the compiler to:
> * Place this code path out of line, and
> * Not do the stack manipulations (build a frame, spill registers)
>   needed for a non-leaf function if this path isn't taken.

Understood, I think it's ok in this case because the shuffling only
happens for order-10 page free events by default so it will be
difficult to measure the perf impact either way.  But in other kernel
contexts I think unlikely() annotation should come with numbers; 90%
not taken is not sufficient in and of itself.

You can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 15:26         ` Alexander Duyck
@ 2020-03-18 18:35           ` George Spelvin
  2020-03-18 19:17             ` Alexander Duyck
  0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-18 18:35 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, lkml

On Wed, Mar 18, 2020 at 08:26:06AM -0700, Alexander Duyck wrote:
> On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>> +       if (unlikely(rshift == 0)) {
>> +               r = get_random_long();
>> +               rshift = r << 1 | 1;
> 
> You might want to wrap the "r << 1" in parenthesis. Also you could
> probably use a + 1 instead of an | 1.

I could, but what would it matter?  I have just confirmed that all of:
	x << 1 | 1;
	(x << 1) + 1;
	x + x + 1;
	x + x | 1;
	2*x + 1;
	2*x | 1;
compile to
	leal	1(%rdi,%rdi), %eax

on x86, and two instructions on every other processor I can think of. 

Since this is conceptually a bit-manipulation operation where carry
propagation is undesirable, the logical operation form seems the most
natural way to write it.

As for the parens, all C programmers are forced to remember that the
bitwise operators have weirdly low precedence (below < <= == >= >),
so there's no risk of confusion.

>>         }
>> +       WRITE_ONCE(rand, rshift);
>>
>> -       if (rand & 1)
>> +       if ((long)r < 0)
> 
> One trick you might be able to get away with here is to actually
> compare r to rshift. "If (rshift <= r)" should give you the same
> result. This works since what you are essentially doing is just adding
> r to itself so if you overflow rshift will be equal to at most r - 1.
> However with the addition of the single bit in the rshift == 0 case it
> could potentially be equal in the unlikely case of r being all 1's.

Er... but why would I want to?  On most processors, "branch on sign bit"
is a single instruction, and that's the instruction I'm hoping the 
compiler will generate.

That's why I changed the shift direction from the original right (testing
the lsbit) to left (testing the msbit): slight code size reduction.

Anything else produces larger and slower object code, for no benefit.



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 18:35           ` George Spelvin
@ 2020-03-18 19:17             ` Alexander Duyck
  2020-03-18 20:06               ` George Spelvin
  0 siblings, 1 reply; 25+ messages in thread
From: Alexander Duyck @ 2020-03-18 19:17 UTC (permalink / raw)
  To: George Spelvin; +Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton

On Wed, Mar 18, 2020 at 11:35 AM George Spelvin <lkml@sdf.org> wrote:
>
> On Wed, Mar 18, 2020 at 08:26:06AM -0700, Alexander Duyck wrote:
> > On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
> >> +       if (unlikely(rshift == 0)) {
> >> +               r = get_random_long();
> >> +               rshift = r << 1 | 1;
> >
> > You might want to wrap the "r << 1" in parenthesis. Also you could
> > probably use a + 1 instead of an | 1.
>
> I could, but what would it matter?  I have just confirmed that all of:
>         x << 1 | 1;
>         (x << 1) + 1;
>         x + x + 1;
>         x + x | 1;
>         2*x + 1;
>         2*x | 1;
> compile to
>         leal    1(%rdi,%rdi), %eax
>
> on x86, and two instructions on every other processor I can think of.
>
> Since this is concpetually a bit-manipulation operation where carry
> propagation is undesirable, the logical operation form seems the most
> natural way to write it.
>
> As for the parens, all C programmers are forced to remember that the
> boolean operators have weirdly low precedence (below < <= == >= >),
> so there's no risk of confusion.

Okay, if this is coming out the same regardless I suppose it doesn't
matter. I am still not a huge fan of skipping the parentheses, as I
feel it is a bit uglier to read but that isn't anything that has to be
fixed.

> >>         }
> >> +       WRITE_ONCE(rand, rshift);
> >>
> >> -       if (rand & 1)
> >> +       if ((long)r < 0)
> >
> > One trick you might be able to get away with here is to actually
> > compare r to rshift. "If (rshift <= r)" should give you the same
> > result. This works since what you are essentially doing is just adding
> > r to itself so if you overflow rshift will be equal to at most r - 1.
> > However with the addition of the single bit in the rshift == 0 case it
> > could potentially be equal in the unlikely case of r being all 1's.
>
> Er... but why would I want to?  On most processors, "branch on sign bit"
> is a single instruction, and that's the instruction I'm hoping the
> compiler will generate.
>
> That's why I changed the shift direction from the original right (testing
> the lsbit) to left (testing the msbit): slight code size reduction.
>
> Anything else produces larger and slower object code, for no benefit.

I was just putting it out there as a possibility. What I have seen in
the past is that under some circumstances gcc can be smart enough to
interpret that as a "branch on carry". My thought was you are likely
having to test the value against itself and then you might be able to
make use of shift and carry flag to avoid that. In addition you could
get away from having to recast an unsigned value as a signed one in
order to perform the bit test.



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 17:36             ` Dan Williams
@ 2020-03-18 19:29               ` George Spelvin
  2020-03-18 19:40                 ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-18 19:29 UTC (permalink / raw)
  To: Dan Williams; +Cc: Kees Cook, Linux MM, Andrew Morton, lkml

On Wed, Mar 18, 2020 at 10:36:10AM -0700, Dan Williams wrote:
> On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
>> On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
>>> I had the impression that unless unlikely is "mostly never" then it
>>> can do more harm than good. Is a branch guaranteed to be taken every
>>> BITS_PER_LONG'th occurrence really a candidate for unlikely()
>>> annotation?
>>
>> I had to look this up.  GCC manual:
>>
>> For the purposes of branch prediction optimizations, the probability
>> that a '__builtin_expect' expression is 'true' is controlled by GCC's
>> 'builtin-expect-probability' parameter, which defaults to 90%.  You can
>> also use '__builtin_expect_with_probability' to explicitly assign a
>> probability value to individual expressions.
>>
>> So I think that <= 10% is good enough, which is true in this case.
>>
>> I was tring to encourage the compiler to:
>> * Place this code path out of line, and
>> * Not do the stack manipulations (build a frame, spill registers)
>>   needed for a non-leaf function if this path isn't taken.
> 
> Understood, I think it's ok in this case because the shuffling only
> happens for order-10 page free events by default so it will be
> difficult to measure the perf impact either way.  But in other kernel
> contexts I think unlikely() annotation should come with numbers, 90%
> not taken is not sufficient in and of itself.

I'm not sure I fully understand your point.  I *think* you're
editorializing on unlikely() in general and not this specific code,
but it's a little hard to follow.

Your mention of "order-10 page free events" is confusing.  Do you mean 
"(order-10 page) free events", i.e. freeing of 1024 consecutive pages?  
Or are you using "order" as a synonym for "approximately" and you mean 
"approximately 10 (page free event)s"?

We both agree (I hope) that the number here is obvious on brief 
inspection: 1/BITS_PER_LONG.  

GCC's heuristics are tuned to value cycles on the fast path 9x as much as 
cycles on the slow path, so it will spend up to 9 cycles on the slow path 
to save a cycle on the fast path.

I've found one comment (https://pastebin.com/S8Y8tqZy) saying that
GCC < 9.x was a lot sloppier on the cost ratio and could pessimize
the code if the branch was more than ~ 1% taken.  Perhaps that's what 
you're remembering?

Fortunately, 1/64 = 1.56% is fairly close to 1%. so I'm not too
worried.

> You can add:
> 
> Acked-by: Dan Williams <dan.j.williams@intel.com>

Thank you!



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 19:29               ` George Spelvin
@ 2020-03-18 19:40                 ` Dan Williams
  2020-03-18 21:02                   ` George Spelvin
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-03-18 19:40 UTC (permalink / raw)
  To: George Spelvin; +Cc: Kees Cook, Linux MM, Andrew Morton

On Wed, Mar 18, 2020 at 12:29 PM George Spelvin <lkml@sdf.org> wrote:
>
> On Wed, Mar 18, 2020 at 10:36:10AM -0700, Dan Williams wrote:
> > On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
> >> On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> >>> I had the impression that unless unlikely is "mostly never" then it
> >>> can do more harm than good. Is a branch guaranteed to be taken every
> >>> BITS_PER_LONG'th occurrence really a candidate for unlikely()
> >>> annotation?
> >>
> >> I had to look this up.  GCC manual:
> >>
> >> For the purposes of branch prediction optimizations, the probability
> >> that a '__builtin_expect' expression is 'true' is controlled by GCC's
> >> 'builtin-expect-probability' parameter, which defaults to 90%.  You can
> >> also use '__builtin_expect_with_probability' to explicitly assign a
> >> probability value to individual expressions.
> >>
> >> So I think that <= 10% is good enough, which is true in this case.
> >>
> >> I was tring to encourage the compiler to:
> >> * Place this code path out of line, and
> >> * Not do the stack manipulations (build a frame, spill registers)
> >>   needed for a non-leaf function if this path isn't taken.
> >
> > Understood, I think it's ok in this case because the shuffling only
> > happens for order-10 page free events by default so it will be
> > difficult to measure the perf impact either way.  But in other kernel
> > contexts I think unlikely() annotation should come with numbers, 90%
> > not taken is not sufficient in and of itself.
>
> I'm not sure I fully understand your point.  I *think* you're
> editorializing on unlikely() in general and not this specific code,
> but it's a little hard to follow.

Yes, editorializing on unlikely(). Specifically I would normally ask
for perf numbers to show that the hint is worth it, but I talked
myself out of asking for that in this case.

> Your mention of "order-10 page free events" is confusing.  Do you mean
> "(order-10 page) free events", i.e. freeing of 1024 consecutive pages?
> Or are you using "order" as a synonym for "approximately" and you mean
> "approximately 10 (page free event)s"?

I'm referring to this:

      if (is_shuffle_order(order))
                add_to_free_area_random(page, &zone->free_area[order],

Where shuffle order is MAX_ORDER-1. I.e. this code is only triggered
when we might be releasing a 4MB buddy-page.

> We both agree (I hope) that the number here is obvious on brief
> inspection: 1/BITS_PER_LONG.
>
> GCC's heuristics are tuned to value cycles on the fast path 9x as much as
> cycles on the slow path, so it will spend up to 9 cycles on the slow path
> to save a cycle on the fast path.
>
> I've found one comment (https://pastebin.com/S8Y8tqZy) saying that
> GCC < 9.x was a lot sloppier on the cost ratio and could pessimize
> the code if the branch was more than ~ 1% taken.  Perhaps that's what
> you're remembering?

Yes, thanks for that digging!

> Fortunately, 1/64 = 1.56% is fairly close to 1%. so I'm not too
> worried.

Makes sense.



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 19:17             ` Alexander Duyck
@ 2020-03-18 20:06               ` George Spelvin
  0 siblings, 0 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18 20:06 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, lkml

On Wed, Mar 18, 2020 at 12:17:14PM -0700, Alexander Duyck wrote:
> I was just putting it out there as a possibility. What I have seen in
> the past is that under some circumstances gcc can be smart enough to
> interpret that as a "branch on carry". My thought was you are likely
> having to test the value against itself and then you might be able to
> make use of shift and carry flag to avoid that. In addition you could
> get away from having to recast a unsigned value as a signed one in
> order to perform the bit test.

Ah, yes, it would be nice if gcc could use the carry bit for r
rather than having to devote a whole register to it.  But it has
to do two unrelated flag tests (zero and carry), and it's generally
pretty bad at taking advantage of preserved flag bits like that.

My ideal x86-64 object code would be:
	shlq	rand(%rip)
	jz	fixup
fixed:
	jnc	tail
	jmp	add_to_free_area
tail:
	jmp	add_to_free_area_tail
fixup:
	pushq	%rdx
	pushq	%rsi
	pushq	%rdi
	call	get_random_u64
	popq	%rdi
	popq	%rsi
	popq	%rdx
	stc
	adcq	%rax,%rax
	movq	%rax, rand(%rip)
	jmp	fixed

... but I don't know how to induce GCC to generate that, and
the function doesn't seem worthy of platform-specific asm.

(Note that I have to use an add on the slow path because lea doesn't
set the carry bit.)



* [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18  1:44       ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
                           ` (3 preceding siblings ...)
  2020-03-18 15:26         ` Alexander Duyck
@ 2020-03-18 20:39         ` George Spelvin
  2020-03-18 21:34           ` Alexander Duyck
  2020-03-19 12:05           ` [PATCH v4] " George Spelvin
  4 siblings, 2 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18 20:39 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dan Williams, linux-mm, Andrew Morton, Alexander Duyck,
	Randy Dunlap, lkml

The old code had separate "rand" and "rand_count" variables,
which could get out of sync with bad results.

In the worst case, two threads would see rand_count = 1 and
both decrement it, resulting in rand_count = 255 and rand being
filled with zeros for the next 255 calls.

Instead, pack them both into a single, atomically updateable,
variable.  This makes it a lot easier to reason about race
conditions.  They are still there - the code deliberately eschews
locking - but basically harmless on the rare occasions that
they happen.

Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
in the land of nasal demons.  The compiler would be free to spill
temporaries to the static variables in arbitrarily perverse ways
and create hard-to-find bugs.

(Alternatively, we could declare the static variable "volatile",
one of the few places in the Linux kernel that would be correct,
but it would probably annoy Linus.)

Third, use long rather than u64.  This not only keeps the state
atomically updateable, it also speeds up the fast path on 32-bit
machines.  Saving at least three instructions on the fast path (one
load, one add-with-carry, and one store) is worth a second call
to get_random_u*() per 64 bits.  The fast path of get_random_u*
is less than the 3*64 = 192 instructions saved, and the slow path
happens every 64 bytes so isn't affected by the change.

Fourth, use left-shift rather than right.  Testing the sign bit
produces slightly smaller/faster code than testing the lsbit.

I've tried shifting both ways, and copying the desired bit to
a boolean before shifting rather than keeping separate full-width
r and rshift variables, but both produce larger code:

		x86_64	i386
This code	 94	 95
Explicit bool	103	 99
Lsbits		 99	101
Both		 96	100

In a perfect world, the x86-64 object code would be tiny,
and dominated by the unavoidable unpredictable branch, but
this function doesn't warrant arch-specific optimization.

add_to_free_area_random:
	shlq	rand(%rip)
	jz	3f
1:
	jnc	2f
	jmp	add_to_free_area	# tail call
2:
	jmp	add_to_free_area_tail
3:
	pushq	%rdx
	pushq	%rsi
	pushq	%rdi
	call	get_random_u64
	popq	%rdi
	popq	%rsi
	popq	%rdx
	stc
	adcq	%rax,%rax	# not lea because we need carry out
	movq	%rax, rand(%rip)
	jmp	1b

Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
v2: Rewrote commit message to explain existing races better.
    Made local variables unsigned to avoid (technically undefined)
    signed overflow.
v3: Typos fixed, Acked-by, expanded commit message.

 mm/shuffle.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/shuffle.c b/mm/shuffle.c
index e0ed247f8d90..16c5fddc292f 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 void add_to_free_area_random(struct page *page, struct free_area *area,
 		int migratetype)
 {
-	static u64 rand;
-	static u8 rand_bits;
+	static unsigned long rand;	/* buffered random bits */
+	unsigned long r = READ_ONCE(rand), rshift = r << 1;
 
 	/*
-	 * The lack of locking is deliberate. If 2 threads race to
-	 * update the rand state it just adds to the entropy.
+	 * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
+	 * 1 bit, then zero-padding in the lsbits.  This allows us to
+	 * maintain the pre-generated bits and the count of bits in a
+	 * single, atomically updatable, variable.
+	 *
+	 * The lack of locking is deliberate. If two threads race to
+	 * update the rand state it just adds to the entropy.  The
+	 * worst that can happen is a random bit is used twice, or
+	 * get_random_long is called redundantly.
 	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
+	if (unlikely(rshift == 0)) {
+		r = get_random_long();
+		rshift = r << 1 | 1;
 	}
+	WRITE_ONCE(rand, rshift);
 
-	if (rand & 1)
+	if ((long)r < 0)
 		add_to_free_area(page, area, migratetype);
 	else
 		add_to_free_area_tail(page, area, migratetype);
-	rand_bits--;
-	rand >>= 1;
 }
-- 
2.26.0.rc2



* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 19:40                 ` Dan Williams
@ 2020-03-18 21:02                   ` George Spelvin
  0 siblings, 0 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18 21:02 UTC (permalink / raw)
  To: Dan Williams; +Cc: Kees Cook, Linux MM, Andrew Morton, lkml

On Wed, Mar 18, 2020 at 12:40:33PM -0700, Dan Williams wrote:
> Yes, editorializing on unlikely(). Specifically I would normally ask
> for perf numbers to show that the hint is worth it, but I talked
> myself out of asking for that in this case.

Ah, now I understand!  Yes, it's just gilding the lily, but as long as I 
was messing about in the area it seemed worth adding.

You're saying that if the code were a hotter code path, you'd want 
benchmarks, but given its actual infrequent usage, you're satisfied with 
the argument that it's probably right; it's not like we've crippled the 
kernel even if wrong.

> I'm referring to this:
> 
>       if (is_shuffle_order(order))
>                 add_to_free_area_random(page, &zone->free_area[order],
> 
> Where shuffle order is MAX_ORDER-1. I.e. this code is only triggered
> when we might be releasing a 4MB buddy-page.

Ding!  Okay, now it makes sense.  I didn't look that far up the call
stack.  I was auditing each and every call to a get_random function in
the kernel, when I came across this call site, struggled to understand
what it was doing, and rewrote it to be less wrong.

I only vaguely understand where it fits into the larger mm/ system
so I never noticed that condition on its use.

Yes, given that infrequent use, I'm even happier that I focused on code
size rather than performance.

Thank you for taking the time to explain.



* Re: [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 20:39         ` [PATCH v3] " George Spelvin
@ 2020-03-18 21:34           ` Alexander Duyck
  2020-03-18 22:49             ` George Spelvin
  2020-03-19 12:05           ` [PATCH v4] " George Spelvin
  1 sibling, 1 reply; 25+ messages in thread
From: Alexander Duyck @ 2020-03-18 21:34 UTC (permalink / raw)
  To: George Spelvin
  Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, Randy Dunlap

On Wed, Mar 18, 2020 at 1:39 PM George Spelvin <lkml@sdf.org> wrote:
>
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
>
> In the worst case, two threads would see rand_count = 1 and
> both decrement it, resulting in rand_count = 255 and rand being
> filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updateable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
> in the land of nasal demons.  The compiler would be free to spill
> temporaries to the static variables in arbitrarily perverse ways
> and create hard-to-find bugs.
>
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
>
> Third, use long rather than u64.  This not only keeps the state
> atomically updateable, it also speeds up the fast path on 32-bit
> machines.  Saving at least three instructions on the fast path (one
> load, one add-with-carry, and one store) is worth a second call
> to get_random_u*() per 64 bits.  The fast path of get_random_u*
> is less than the 3*64 = 192 instructions saved, and the slow path
> happens every 64 bytes so isn't affected by the change.
>
> Fourth, use left-shift rather than right.  Testing the sign bit
> produces slightly smaller/faster code than testing the lsbit.
>
> I've tried shifting both ways, and copying the desired bit to
> a boolean before shifting rather than keeping separate full-width
> r and rshift variables, but both produce larger code:
>
>                 x86_64  i386
> This code        94      95
> Explicit bool   103      99
> Lsbits           99     101
> Both             96     100
>
> In a perfect world, the x86-64 object code would be tiny,
> and dominated by the unavoidable unpredictable branch, but
> this function doesn't warrant arch-specific optimization.
>
> add_to_free_area_random:
>         shlq    rand(%rip)
>         jz      3f
> 1:
>         jnc     2f
>         jmp     add_to_free_area        # tail call
> 2:
>         jmp     add_to_free_area_tail
> 3:
>         pushq   %rdx
>         pushq   %rsi
>         pushq   %rdi
>         call    get_random_u64
>         popq    %rdi
>         popq    %rsi
>         popq    %rdx
>         stc
>         adcq    %rax,%rax       # not lea because we need carry out
>         movq    %rax, rand(%rip)
>         jmp     1b
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Acked-by: Kees Cook <keescook@chromium.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org

What kernel is this based on? You might want to rebase on the latest
linux-next as it occurs to me that this function was renamed to
shuffle_pick_tail as I had incorporated a few bits of it into the
logic for placing buddy pages and reported pages on the tail of the
list.

> ---
> v2: Rewrote commit message to explain existing races better.
>     Made local variables unsigned to avoid (technically undefined)
>     signed overflow.
> v3: Typos fixed, Acked-by, expanded commit message.
>
>  mm/shuffle.c | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..16c5fddc292f 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
>  void add_to_free_area_random(struct page *page, struct free_area *area,
>                 int migratetype)
>  {
> -       static u64 rand;
> -       static u8 rand_bits;
> +       static unsigned long rand;      /* buffered random bits */
> +       unsigned long r = READ_ONCE(rand), rshift = r << 1;
>
>         /*
> -        * The lack of locking is deliberate. If 2 threads race to
> -        * update the rand state it just adds to the entropy.
> +        * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
> +        * 1 bit, then zero-padding in the lsbits.  This allows us to
> +        * maintain the pre-generated bits and the count of bits in a
> +        * single, atomically updatable, variable.
> +        *
> +        * The lack of locking is deliberate. If two threads race to
> +        * update the rand state it just adds to the entropy.  The
> +        * worst that can happen is a random bit is used twice, or
> +        * get_random_long is called redundantly.
>          */
> -       if (rand_bits == 0) {
> -               rand_bits = 64;
> -               rand = get_random_u64();
> +       if (unlikely(rshift == 0)) {
> +               r = get_random_long();
> +               rshift = r << 1 | 1;
>         }
> +       WRITE_ONCE(rand, rshift);
>
> -       if (rand & 1)
> +       if ((long)r < 0)
>                 add_to_free_area(page, area, migratetype);
>         else
>                 add_to_free_area_tail(page, area, migratetype);
> -       rand_bits--;
> -       rand >>= 1;
>  }
> --
> 2.26.0.rc2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 21:34           ` Alexander Duyck
@ 2020-03-18 22:49             ` George Spelvin
  2020-03-18 22:57               ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-18 22:49 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, Randy Dunlap, lkml

On Wed, Mar 18, 2020 at 02:34:04PM -0700, Alexander Duyck wrote:
> What kernel is this based on? You might want to rebase on the latest
> linux-next as it occurs to me that this function was renamed to
> shuffle_pick_tail as I had incorporated a few bits of it into the
> logic for placing buddy pages and reported pages on the tail of the
> list.

5.5.8.  I didn't realize it made much difference, but I see it does. Due 
to the different return type, the best variant has probably changed and 
I'll have to re-check.

Since there's only one call site, should the function be moved to 
page_alloc.c and inlined?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 22:49             ` George Spelvin
@ 2020-03-18 22:57               ` Dan Williams
  2020-03-18 23:18                 ` George Spelvin
  0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-03-18 22:57 UTC (permalink / raw)
  To: George Spelvin
  Cc: Alexander Duyck, Kees Cook, linux-mm, Andrew Morton, Randy Dunlap

On Wed, Mar 18, 2020 at 3:49 PM George Spelvin <lkml@sdf.org> wrote:
>
> On Wed, Mar 18, 2020 at 02:34:04PM -0700, Alexander Duyck wrote:
> > What kernel is this based on? You might want to rebase on the latest
> > linux-next as it occurs to me that this function was renamed to
> > shuffle_pick_tail as I had incorporated a few bits of it into the
> > logic for placing buddy pages and reported pages on the tail of the
> > list.
>
> 5.5.8.  I didn't realize it made much difference, but I see it does. Due
> to the different return type, the best variant has probably changed and
> I'll have to re-check.
>
> Since there's only one call site, should the function be moved to
> page_alloc.c and inlined?

The intent was to make it compile-away on
CONFIG_SHUFFLE_PAGE_ALLOCATOR=n builds. However,
CONFIG_SHUFFLE_PAGE_ALLOCATOR=y is now the common case for distros.
So, it's not serving much purpose being in mm/shuffle.c. I'd support
moving it.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 22:57               ` Dan Williams
@ 2020-03-18 23:18                 ` George Spelvin
  0 siblings, 0 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18 23:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Alexander Duyck, Kees Cook, linux-mm, Andrew Morton, Randy Dunlap, lkml

On Wed, Mar 18, 2020 at 03:57:01PM -0700, Dan Williams wrote:
> On Wed, Mar 18, 2020 at 3:49 PM George Spelvin <lkml@sdf.org> wrote:
>> Since there's only one call site, should the function be moved to
>> page_alloc.c and inlined?
> 
> The intent was to make it compile-away on
> CONFIG_SHUFFLE_PAGE_ALLOCATOR=n builds. However,
> CONFIG_SHUFFLE_PAGE_ALLOCATOR=y is now the common case for distros.
> So, it's not serving much purpose being in mm/shuffle.c. I'd support
> moving it.

The options are to define it as an inline function in shuffle.h,
which keeps the #ifdef localized, or to drop it into page_alloc.c,
which shrinks the shuffle.h interface but requires an #ifdef in
page_alloc.h.

Um... or does it?  If is_shuffle_order always returns false,
shuffle_pick_tail is dead code and will be elided by the compiler.

But will the static variable be elided, too?  Need to check.
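
A standalone toy (userspace, not the real mm/shuffle.h) is enough to
check with nm or objdump; sketch below, with the RNG and the config
guard replaced by stand-ins:

/*
 * Models an inline helper whose only caller is guarded by a
 * compile-time-false condition, like is_shuffle_order() on
 * CONFIG_SHUFFLE_PAGE_ALLOCATOR=n builds.
 */
static inline int pick_tail(void)
{
	static unsigned long rand;			/* would-be bit buffer */

	rand = rand * 6364136223846793005UL + 1;	/* RNG stand-in */
	return (long)rand < 0;
}

#define SHUFFLE_ENABLED 0

int free_one_page_model(void)
{
	if (SHUFFLE_ENABLED)
		return pick_tail();
	return 0;
}

With gcc -O2 -c, neither pick_tail() nor its static rand should appear
in the object file: once the constant-false branch is folded away, the
static inline function is unreferenced and never emitted.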


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-18 20:39         ` [PATCH v3] " George Spelvin
  2020-03-18 21:34           ` Alexander Duyck
@ 2020-03-19 12:05           ` George Spelvin
  2020-03-19 17:49             ` Alexander Duyck
  2020-03-20 17:58             ` Kees Cook
  1 sibling, 2 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-19 12:05 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dan Williams, linux-mm, Andrew Morton, Alexander Duyck,
	Randy Dunlap, lkml

The separate "rand" and "rand_count" variables could get out of
sync with bad results.

In the worst case, two threads would see rand_count=1 and both
decrement it, resulting in rand_count=255 and rand being filled
with zeros for the next 255 calls.
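
One possible interleaving, spelled out (rand_bits is a u8, so 0 - 1
wraps to 255):

    CPU A                            CPU B
    sees rand_bits == 1, no refill   sees rand_bits == 1, no refill
    rand_bits--  (1 -> 0)
                                     rand_bits--  (0 -> 255)
    rand >>= 1   (rand now 0)        rand >>= 1   (still 0)

After that, the next 255 calls see rand_bits != 0, skip the refill, and
shift bits out of an all-zero rand.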

Instead, pack them both into a single, atomically updateable,
variable.  This makes it a lot easier to reason about race
conditions.  They are still there - the code deliberately eschews
locking - but basically harmless on the rare occasions that
they happen.
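
For concreteness, here is a userspace model of the packed word, with
get_random_long() stubbed to a constant so the trace is reproducible
(a sketch only, assuming a 64-bit long; the real hunk is in the diff
below):

#include <stdio.h>

static unsigned long rand_buf;	/* msbits = unused random bits, then a 1 marker */

static unsigned long stub_get_random_long(void)
{
	return 0xdeadbeefcafef00dUL;	/* arbitrary stand-in value */
}

static int pick_tail(void)
{
	unsigned long r = rand_buf, rshift = r << 1;

	if (rshift == 0) {		/* marker shifted out: buffer empty */
		r = stub_get_random_long();
		rshift = r << 1 | 1;	/* fresh marker in the lsbit */
	}
	rand_buf = rshift;

	return (long)r < 0;		/* consume the msbit */
}

int main(void)
{
	int i;

	for (i = 0; i < 70; i++)	/* more than 64 calls forces a refill */
		putchar('0' + pick_tail());
	putchar('\n');
	return 0;
}

Each refill yields exactly 64 decisions: one from the fresh value and 63
from the buffered msbits, after which the lone marker bit shifts out and
rshift reads as zero again.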

Second, use READ_ONCE and WRITE_ONCE.  Because the random bit
buffer is accessed by multiple threads concurrently without
locking, omitting those puts us deep in the land of nasal demons.
The compiler would be free to spill to the static variable in
arbitrarily perverse ways and create hard-to-find bugs.

(I'm torn between this and just declaring the buffer "volatile".
Linux tends to prefer marking accesses rather than variables,
but in this case, every access to the buffer is volatile.
It makes no difference to the generated code.)
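
To check that outside the kernel tree, a throwaway file like the one
below (READ_ONCE/WRITE_ONCE here are simplified stand-ins for the real
macros, just a volatile cast) can be built with "gcc -O2 -S" and the
two pairs of functions compared:

#define READ_ONCE(x)	 (*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

static volatile unsigned long rand_vol;		/* volatile variable */
static unsigned long rand_marked;		/* marked accesses */

unsigned long load_vol(void)    { return rand_vol; }
unsigned long load_marked(void) { return READ_ONCE(rand_marked); }

void store_vol(unsigned long v)    { rand_vol = v; }
void store_marked(unsigned long v) { WRITE_ONCE(rand_marked, v); }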

Third, use long rather than u64.  This not only keeps the state
atomically updateable, it also speeds up the fast path on 32-bit
machines.  Saving at least three instructions on the fast path (one
load, one add-with-carry, and one store) is worth a second call
to get_random_u*() per 64 bits.  The fast path of get_random_u*
is less than the 3*64 = 192 instructions saved, and the slow path
happens every 64 bytes so isn't affected by the change.

Fourth, make the function inline.  It's small, and there's only
one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
function call overhead.

Fifth, use the msbits of the buffer first (left shift) rather
than the lsbits (right shift).  Testing the sign bit produces
slightly smaller/faster code than testing the lsbit.

I've tried shifting both ways, and copying the desired bit to a
boolean before shifting rather than keeping separate full-width
r and rshift variables, but both produce larger code:

	x86-64 text size
Msbit   	42236
Lsbit   	42242 (+6)
Lsbit+bool	42258 (+22)
Msbit+bool	42284 (+52)

(Since this is straight-line code, size is a good proxy for number
of instructions and execution time.  Using READ/WRITE_ONCE instead of
volatile makes no difference.)

In a perfect world, on x86-64 the fast path would be:
	shlq	rand(%rip)
	jz	refill
refill_complete:
	jc	add_to_tail

but I don't see how to get gcc to generate that, and this
function isn't worth arch-specific implementation.

Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
v2: Rewrote commit message to explain existing races better.
    Made local variables unsigned to avoid (technically undefined)
    signed overflow.
v3: Typos fixed, Acked-by, expanded commit message.
v4: Rebase against -next; function has changed from 
    add_to_free_area_random() to shuffle_pick_tail.  Move to
    inline function in shuffle.h.
    Not sure if it's okay to keep Acked-by: after such a
    significant change.

 mm/shuffle.c | 23 -----------------------
 mm/shuffle.h | 26 +++++++++++++++++++++++++-
 2 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/mm/shuffle.c b/mm/shuffle.c
index 44406d9977c7..ea281d5e1f23 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -182,26 +182,3 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
 		shuffle_zone(z);
 }
-
-bool shuffle_pick_tail(void)
-{
-	static u64 rand;
-	static u8 rand_bits;
-	bool ret;
-
-	/*
-	 * The lack of locking is deliberate. If 2 threads race to
-	 * update the rand state it just adds to the entropy.
-	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
-	}
-
-	ret = rand & 1;
-
-	rand_bits--;
-	rand >>= 1;
-
-	return ret;
-}
diff --git a/mm/shuffle.h b/mm/shuffle.h
index 4d79f03b6658..fb79e05cd86d 100644
--- a/mm/shuffle.h
+++ b/mm/shuffle.h
@@ -22,7 +22,31 @@ enum mm_shuffle_ctl {
 DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
 extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
 extern void __shuffle_free_memory(pg_data_t *pgdat);
-extern bool shuffle_pick_tail(void);
+static inline bool shuffle_pick_tail(void)
+{
+	static unsigned long rand;	/* buffered random bits */
+	unsigned long r = READ_ONCE(rand), rshift = r << 1;
+
+	/*
+	 * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
+	 * 1 bit, then zero-padding in the lsbits.  This allows us to
+	 * maintain the pre-generated bits and the count of bits in a
+	 * single, atomically updatable, variable.
+	 *
+	 * The lack of locking is deliberate. If two threads race to
+	 * update the rand state it just adds to the entropy.  The
+	 * worst that can happen is a random bit is used twice, or
+	 * get_random_long is called redundantly.
+	 */
+	if (unlikely(rshift == 0)) {
+		r = get_random_long();
+		rshift = r << 1 | 1;
+	}
+	WRITE_ONCE(rand, rshift);
+
+	return (long)r < 0;
+}
+
 static inline void shuffle_free_memory(pg_data_t *pgdat)
 {
 	if (!static_branch_unlikely(&page_alloc_shuffle_key))

base-commit: 47780d7892b77e922bbe19b5dea99cde06b2f0e5
-- 
2.26.0.rc2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-19 12:05           ` [PATCH v4] " George Spelvin
@ 2020-03-19 17:49             ` Alexander Duyck
  2020-03-20 17:58             ` Kees Cook
  1 sibling, 0 replies; 25+ messages in thread
From: Alexander Duyck @ 2020-03-19 17:49 UTC (permalink / raw)
  To: George Spelvin
  Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, Randy Dunlap

On Thu, Mar 19, 2020 at 5:05 AM George Spelvin <lkml@sdf.org> wrote:
>
> The separate "rand" and "rand_count" variables could get out of
> sync with bad results.
>
> In the worst case, two threads would see rand_count=1 and both
> decrement it, resulting in rand_count=255 and rand being filled
> with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updateable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE.  Because the random bit
> buffer is accessed by multiple threads concurrently without
> locking, omitting those puts us deep in the land of nasal demons.
> The compiler would be free to spill to the static variable in
> arbitrarily perverse ways and create hard-to-find bugs.
>
> (I'm torn between this and just declaring the buffer "volatile".
> Linux tends to prefer marking accesses rather than variables,
> but in this case, every access to the buffer is volatile.
> It makes no difference to the generated code.)
>
> Third, use long rather than u64.  This not only keeps the state
> atomically updateable, it also speeds up the fast path on 32-bit
> machines.  Saving at least three instructions on the fast path (one
> load, one add-with-carry, and one store) is worth a second call
> to get_random_u*() per 64 bits.  The fast path of get_random_u*
> is less than the 3*64 = 192 instructions saved, and the slow path
> happens every 64 bytes so isn't affected by the change.
>
> Fourth, make the function inline.  It's small, and there's only
> one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
> function call overhead.
>
> Fifth, use the msbits of the buffer first (left shift) rather
> than the lsbits (right shift).  Testing the sign bit produces
> slightly smaller/faster code than testing the lsbit.
>
> I've tried shifting both ways, and copying the desired bit to a
> boolean before shifting rather than keeping separate full-width
> r and rshift variables, but both produce larger code:
>
>         x86-64 text size
> Msbit           42236
> Lsbit           42242 (+6)
> Lsbit+bool      42258 (+22)
> Msbit+bool      42284 (+52)
>
> (Since this is straight-line code, size is a good proxy for number
> of instructions and execution time.  Using READ/WRITE_ONCE instead of
> volatile makes no difference.)
>
> In a perfect world, on x86-64 the fast path would be:
>         shlq    rand(%rip)
>         jz      refill
> refill_complete:
>         jc      add_to_tail
>
> but I don't see how to get gcc to generate that, and this
> function isn't worth arch-specific implementation.
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Acked-by: Kees Cook <keescook@chromium.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org

Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

> ---
> v2: Rewrote commit message to explain existing races better.
>     Made local variables unsigned to avoid (technically undefined)
>     signed overflow.
> v3: Typos fixed, Acked-by, expanded commit message.
> v4: Rebase against -next; function has changed from
>     add_to_free_area_random() to shuffle_pick_tail.  Move to
>     inline function in shuffle.h.
>     Not sure if it's okay to keep Acked-by: after such a
>     significant change.
>
>  mm/shuffle.c | 23 -----------------------
>  mm/shuffle.h | 26 +++++++++++++++++++++++++-
>  2 files changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index 44406d9977c7..ea281d5e1f23 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -182,26 +182,3 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
>         for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
>                 shuffle_zone(z);
>  }
> -
> -bool shuffle_pick_tail(void)
> -{
> -       static u64 rand;
> -       static u8 rand_bits;
> -       bool ret;
> -
> -       /*
> -        * The lack of locking is deliberate. If 2 threads race to
> -        * update the rand state it just adds to the entropy.
> -        */
> -       if (rand_bits == 0) {
> -               rand_bits = 64;
> -               rand = get_random_u64();
> -       }
> -
> -       ret = rand & 1;
> -
> -       rand_bits--;
> -       rand >>= 1;
> -
> -       return ret;
> -}
> diff --git a/mm/shuffle.h b/mm/shuffle.h
> index 4d79f03b6658..fb79e05cd86d 100644
> --- a/mm/shuffle.h
> +++ b/mm/shuffle.h
> @@ -22,7 +22,31 @@ enum mm_shuffle_ctl {
>  DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
>  extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
>  extern void __shuffle_free_memory(pg_data_t *pgdat);
> -extern bool shuffle_pick_tail(void);
> +static inline bool shuffle_pick_tail(void)
> +{
> +       static unsigned long rand;      /* buffered random bits */
> +       unsigned long r = READ_ONCE(rand), rshift = r << 1;
> +
> +       /*
> +        * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
> +        * 1 bit, then zero-padding in the lsbits.  This allows us to
> +        * maintain the pre-generated bits and the count of bits in a
> +        * single, atomically updatable, variable.
> +        *
> +        * The lack of locking is deliberate. If two threads race to
> +        * update the rand state it just adds to the entropy.  The
> +        * worst that can happen is a random bit is used twice, or
> +        * get_random_long is called redundantly.
> +        */
> +       if (unlikely(rshift == 0)) {
> +               r = get_random_long();
> +               rshift = r << 1 | 1;
> +       }
> +       WRITE_ONCE(rand, rshift);
> +
> +       return (long)r < 0;
> +}
> +
>  static inline void shuffle_free_memory(pg_data_t *pgdat)
>  {
>         if (!static_branch_unlikely(&page_alloc_shuffle_key))
>
> base-commit: 47780d7892b77e922bbe19b5dea99cde06b2f0e5
> --
> 2.26.0.rc2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4] mm/shuffle.c: Fix races in add_to_free_area_random()
  2020-03-19 12:05           ` [PATCH v4] " George Spelvin
  2020-03-19 17:49             ` Alexander Duyck
@ 2020-03-20 17:58             ` Kees Cook
  1 sibling, 0 replies; 25+ messages in thread
From: Kees Cook @ 2020-03-20 17:58 UTC (permalink / raw)
  To: George Spelvin
  Cc: Dan Williams, linux-mm, Andrew Morton, Alexander Duyck, Randy Dunlap

On Thu, Mar 19, 2020 at 12:05:22PM +0000, George Spelvin wrote:
> The separate "rand" and "rand_count" variables could get out of
> sync with bad results.
> 
> In the worst case, two threads would see rand_count=1 and both
> decrement it, resulting in rand_count=255 and rand being filled
> with zeros for the next 255 calls.
> 
> Instead, pack them both into a single, atomically updateable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
> 
> Second, use READ_ONCE and WRITE_ONCE.  Because the random bit
> buffer is accessed by multiple threads concurrently without
> locking, omitting those puts us deep in the land of nasal demons.
> The compiler would be free to spill to the static variable in
> arbitrarily perverse ways and create hard-to-find bugs.
> 
> (I'm torn between this and just declaring the buffer "volatile".
> Linux tends to prefer marking accesses rather than variables,
> but in this case, every access to the buffer is volatile.
> It makes no difference to the generated code.)
> 
> Third, use long rather than u64.  This not only keeps the state
> atomically updateable, it also speeds up the fast path on 32-bit
> machines.  Saving at least three instructions on the fast path (one
> load, one add-with-carry, and one store) is worth a second call
> to get_random_u*() per 64 bits.  The fast path of get_random_u*
> is less than the 3*64 = 192 instructions saved, and the slow path
> happens every 64 bytes so isn't affected by the change.
> 
> Fourth, make the function inline.  It's small, and there's only
> one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
> function call overhead.
> 
> Fifth, use the msbits of the buffer first (left shift) rather
> than the lsbits (right shift).  Testing the sign bit produces
> slightly smaller/faster code than testing the lsbit.
> 
> I've tried shifting both ways, and copying the desired bit to a
> boolean before shifting rather than keeping separate full-width
> r and rshift variables, but both produce larger code:
> 
> 	x86-64 text size
> Msbit   	42236
> Lsbit   	42242 (+6)
> Lsbit+bool	42258 (+22)
> Msbit+bool	42284 (+52)
> 
> (Since this is straight-line code, size is a good proxy for number
> of instructions and execution time.  Using READ/WRITE_ONCE instead of
> volatile makes no difference.)
> 
> In a perfect world, on x86-64 the fast path would be:
> 	shlq	rand(%rip)
> 	jz	refill
> refill_complete:
> 	jc	add_to_tail
> 
> but I don't see how to get gcc to generate that, and this
> function isn't worth arch-specific implementation.
> 
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Acked-by: Kees Cook <keescook@chromium.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
> v2: Rewrote commit message to explain existing races better.
>     Made local variables unsigned to avoid (technically undefined)
>     signed overflow.
> v3: Typos fixed, Acked-by, expanded commit message.
> v4: Rebase against -next; function has changed from 
>     add_to_free_area_random() to shuffle_pick_tail.  Move to
>     inline function in shuffle.h.
>     Not sure if it's okay to keep Acked-by: after such a
>     significant change.

Good by me. Thanks!

-Kees

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread

Thread overview: 25+ messages
2020-03-17 13:50 [PATCH] mm/shuffle.c: optimize add_to_free_area_random() George Spelvin
2020-03-17 21:44 ` Kees Cook
2020-03-17 23:06   ` George Spelvin
2020-03-17 23:38     ` Kees Cook
2020-03-18  1:44       ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
2020-03-18  1:49         ` Randy Dunlap
2020-03-18  3:53         ` Dan Williams
2020-03-18  8:20           ` George Spelvin
2020-03-18 17:36             ` Dan Williams
2020-03-18 19:29               ` George Spelvin
2020-03-18 19:40                 ` Dan Williams
2020-03-18 21:02                   ` George Spelvin
2020-03-18  3:58         ` Kees Cook
2020-03-18 15:26         ` Alexander Duyck
2020-03-18 18:35           ` George Spelvin
2020-03-18 19:17             ` Alexander Duyck
2020-03-18 20:06               ` George Spelvin
2020-03-18 20:39         ` [PATCH v3] " George Spelvin
2020-03-18 21:34           ` Alexander Duyck
2020-03-18 22:49             ` George Spelvin
2020-03-18 22:57               ` Dan Williams
2020-03-18 23:18                 ` George Spelvin
2020-03-19 12:05           ` [PATCH v4] " George Spelvin
2020-03-19 17:49             ` Alexander Duyck
2020-03-20 17:58             ` Kees Cook
