* [PATCH] mm/shuffle.c: optimize add_to_free_area_random()
@ 2020-03-17 13:50 George Spelvin
2020-03-17 21:44 ` Kees Cook
0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-17 13:50 UTC (permalink / raw)
To: Dan Williams, linux-mm; +Cc: Kees Cook, Andrew Morton, lkml
First, use long rather than u64 for the bit buffer type, which
is significantly more efficient on 32-bit processors.
Second, avoid the need for a separate rand_bits counter.
rand_bits is never more than 63, so there's always room in rand
for a bit to mark the end of the available bits. This makes the
shared state atomically updatable, which makes it a lot easier
to reason about race conditions.
Third, use READ_ONCE and WRITE_ONCE. Without them, the compiler
may spill to the shared static in arbitrarily perverse ways,
and combined with the fact that the code eschews locking, that
is a recipe for hard-to-find bugs. Now, a race might cause a bit
to be used twice, or get_random_long() to be called redundantly,
but it can't summon nasal daemons.
I've tried a few variants. Keeping random lsbits with a most-significant
end marker, and using an explicit bool variable rather than testing r,
both increase code size slightly:
                x86_64   i386
This code           94     95
Explicit bool      103     99
Lsbits              99    101
Both                96    100
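The msbits-plus-end-marker scheme described above can be modelled in
standalone userspace C (a sketch, not the kernel code; `rand_state`,
`stub_entropy`, `next_random_bit` and `bits_match` are illustrative
names, and a fixed value stands in for get_random_long()):

```c
#include <stdbool.h>

/* Model of the msbits-plus-end-marker buffer; names are illustrative. */
static unsigned long rand_state;   /* stands in for the static "rand" */
static unsigned long stub_entropy; /* deterministic stand-in for get_random_long() */

static bool next_random_bit(void)
{
	unsigned long r = rand_state, rshift = r << 1;

	if (rshift == 0) {              /* marker shifted out: buffer empty */
		r = stub_entropy;       /* would be get_random_long() */
		rshift = r << 1 | 1;    /* refill, appending a fresh end marker */
	}
	rand_state = rshift;
	return (long)r < 0;             /* consume the most significant bit */
}

/* Check: the bits come out msb-first, one full word per refill. */
static bool bits_match(unsigned long seed)
{
	unsigned i, w = sizeof(unsigned long) * 8;

	stub_entropy = seed;
	rand_state = 0;
	for (i = 0; i < w; i++)
		if (next_random_bit() != (((seed >> (w - 1 - i)) & 1) != 0))
			return false;
	return true;
}
```

The marker bit doubles as the counter: when it is shifted out the word
becomes zero, which is exactly the "buffer empty" condition.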
Signed-off-by: George Spelvin <lkml@sdf.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
mm/shuffle.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/mm/shuffle.c b/mm/shuffle.c
index b3fe97fd6654..0e4bf6a8da52 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,25 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
void add_to_free_area_random(struct page *page, struct free_area *area,
int migratetype)
{
- static u64 rand;
- static u8 rand_bits;
+ static long rand; /* 0..BITS_PER_LONG-1 buffered random bits */
+ long r = READ_ONCE(rand), rshift = r << 1;;
/*
- * The lack of locking is deliberate. If 2 threads race to
+ * rand holds some random msbits, with the end marked by a 1 bit.
+ * This allows us to maintain the pre-generated bits and the
+ * count of bits in a single, atomically updatable, variable.
+ *
+ * The lack of locking is deliberate. If two threads race to
* update the rand state it just adds to the entropy.
*/
- if (rand_bits == 0) {
- rand_bits = 64;
- rand = get_random_u64();
+ if (unlikely(rshift == 0)) {
+ r = get_random_long();
+ rshift = r << 1 | 1;
}
+ WRITE_ONCE(rand, rshift);
- if (rand & 1)
+ if (r < 0)
add_to_free_area(page, area, migratetype);
else
add_to_free_area_tail(page, area, migratetype);
- rand_bits--;
- rand >>= 1;
}
--
2.25.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH] mm/shuffle.c: optimize add_to_free_area_random()
2020-03-17 13:50 [PATCH] mm/shuffle.c: optimize add_to_free_area_random() George Spelvin
@ 2020-03-17 21:44 ` Kees Cook
2020-03-17 23:06 ` George Spelvin
0 siblings, 1 reply; 25+ messages in thread
From: Kees Cook @ 2020-03-17 21:44 UTC (permalink / raw)
To: George Spelvin; +Cc: Dan Williams, linux-mm, Andrew Morton
On Tue, Mar 17, 2020 at 01:50:35PM +0000, George Spelvin wrote:
> First, use long rather than u64 for the bit buffer type, which
> is significantly more efficient on 32-bit processors.
Why? This is twice as many get_random_*() calls for 32-bit: it'll save
whatever bits it doesn't use. (Speaking to word-sized stores, yes, that
makes sense, but see below...)
> Second, avoid the need for a separate rand_bits counter.
> rand_bits is never more than 63, so there's always room in rand
> for a bit to mark the end of the available bits. This makes the
> shared state atomically updatable, which makes it a lot easier
> to reason about race conditions.
What are the race conditions? I had to go look, and I see that
add_to_free_area_random() is called under __free_one_page(), which is
under &zone->lock. Would it make more sense to store the randomness
per-zone and entirely side-step the issues?
> Third, use READ_ONCE and WRITE_ONCE. Without them, the compiler
> may spill to the shared static in arbitrarily perverse ways,
> and combined with the fact that the code eschews locking, that
> is a recipe for hard-to-find bugs. Now, a race might cause a bit
> to be used twice, or get_random_long() to be called redundantly,
> but it can't summon nasal daemons.
All these things might make the results less predictable, too. :) But,
yes, if there was deterministic sharing of random values, that'd be
bad. But this patch doesn't really solve the "use the same random bits"
problem, which is the very thing that might get us into a predictable
state. If two zones both call READ_ONCE(), use the same value (eek),
and then both call WRITE_ONCE(), one of them wins the write (no big deal).
--
Kees Cook
* Re: [PATCH] mm/shuffle.c: optimize add_to_free_area_random()
2020-03-17 21:44 ` Kees Cook
@ 2020-03-17 23:06 ` George Spelvin
2020-03-17 23:38 ` Kees Cook
0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-17 23:06 UTC (permalink / raw)
To: Kees Cook; +Cc: Dan Williams, linux-mm, Andrew Morton, lkml
On Tue, Mar 17, 2020 at 02:44:33PM -0700, Kees Cook wrote:
> On Tue, Mar 17, 2020 at 01:50:35PM +0000, George Spelvin wrote:
> > First, use long rather than u64 for the bit buffer type, which
> > is significantly more efficient on 32-bit processors.
>
> Why? This is twice as many get_random_*() calls for 32-bit: it'll save
> whatever bits it doesn't use. (Speaking to word-sized stores, yes, that
> makes sense, but see below...)
Yes, but the per-call overhead is fairly low until you run out of the
64-byte buffer; then there's an expensive crypto operation to
refill the buffer.
So two get_random_u32() calls are not a lot more than one
get_random_u64(). (The buffers are per-CPU, so they are pretty cheap to
access.) (I have other patches I'm working on to speed up the locking
they have, but this one is independent, so I'm sending it separately.)
And if you want faster, prandom_u32() is probably good enough for this
application.
I did it to speed up and shrink the fast path. It saves at least
3 instructions on 32-bit machines (one load, one rotate through carry,
and one store), at the cost of one extra get_random_*() call per 64
bits. If get_random_u32()'s fast path is less than 192 instructions,
it's a win.
It's also relevant to the race condition reasoning mentioned below;
a one-word value can be atomically updated. On a 32-bit machine,
64-bit operations get tearing and it's a PITA to reason about.
>> Second, avoid the need for a separate rand_bits counter.
>> rand_bits is never more than 63, so there's always room in rand
>> for a bit to mark the end of the available bits. This makes the
>> shared state atomically updatable, which makes it a lot easier
>> to reason about race conditions.
>
> What are the race conditions? I had to go look at I see that
> add_to_free_area_random() is called under __free_one_page() which is
> under &zone->lock. Would it make more sense to store the randomness
> per-zone and entirely side-step the issues?
The most serious is that if two threads simultaneously observe
rand_bits == 1, but do their decrements separately, you end up
with rand_bits = 255 and you generate 255 consecutive 0 bits
before refilling the buffer.
Since we're only generating random bits, a screwed-up answer occasionally
doesn't really matter (like the comment says "lack of locking is
deliberate"), but 255 screwed up bits is a bit much.
I avoided changing the underlying locking model because I didn't
feel up to such an invasive change; I managed to fix the problems
I saw without going there. And it shrinks the code; that seemed like
enough of a win to justify it to me.
If you want to change the locking, the other alternative to per-zone is
per-CPU, which also avoids locking. We could even add a generic
get_random_bit() function.
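A per-CPU get_random_bit() might look like the following userspace
model; this is only a sketch of the idea, with _Thread_local standing in
for DEFINE_PER_CPU and a weak stub PRNG standing in for
get_random_long() (all names here are hypothetical):

```c
#include <stdbool.h>
#include <stdlib.h>

/* _Thread_local models per-CPU data: the buffer is never shared between
 * threads, so the races on the packed word disappear entirely. */
static _Thread_local unsigned long bit_buf;

static bool get_random_bit(void)
{
	unsigned long r = bit_buf, rshift = r << 1;

	if (rshift == 0) {              /* buffer exhausted: refill */
		/* Weak stub entropy; the kernel would use get_random_long(). */
		r = ((unsigned long)rand() << 16) ^ (unsigned long)rand();
		rshift = r << 1 | 1;
	}
	bit_buf = rshift;
	return (long)r < 0;             /* hand out the msbit */
}
```

In the kernel the buffer would live in per-CPU storage and callers would
need preemption disabled around the access, which the zone->lock context
already provides.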
>> Third, use READ_ONCE and WRITE_ONCE. Without them, the compiler
>> may spill to the shared static in arbitrarily perverse ways,
>> and combined with the fact that the code eschews locking, that
>> is a recipe for hard-to-find bugs. Now, a race might cause a bit
>> to be used twice, or get_random_long() to be called redundantly,
>> but it can't summon nasal daemons.
>
> All these things might make the results less predictable, too. :) But,
> yes, if there was deterministic sharing of random values, that'd be
> bad. But this patch doesn't really solve the "use the same random bits"
> problem, which is the very thing that might get us into a predictable
> state. If two zones both call READ_ONCE(), use the same value (eek),
> and then both call WRITE_ONCE(), one of them wins the write (no big deal).
Absolutely, but in case of a tight race, only *one* bit is used twice,
and the bit used is uniformly distributed, so it's a really small
correlation. Unlike the 255x zero race the existing code can have.
The main reason I did this is to, as I already wrote, keep the nasal
demons at bay. The existing code does things (having another thread
change a non-volatile variable while this code is executing) that are
explicitly undefined by the ANSI C standard.
The compiler is allowed to (in John Woods' memorable explanation)
produce code that makes demons fly out of your nose. (More plausibly,
it may simply crash.)
It probably won't, but I prefer to stay on the safe side of the
language specification contract. Linus has ranted on more than one
occasion about the perverse ways that GCC has used its permission
to assume that people won't violate the contract.
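For reference, READ_ONCE/WRITE_ONCE boil down to forcing a single
volatile access; a simplified userspace rendering is below (the real
kernel macros handle more cases and add checks, so treat this as a
sketch, and `sample_once` is just a hypothetical caller):

```c
/* Simplified rendering of the kernel macros: force exactly one volatile
 * load or store, so the compiler cannot re-read, fuse, or spill the
 * shared variable behind our back. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static long rand_state;

static long sample_once(void)
{
	long r = READ_ONCE(rand_state); /* one load, no re-fetching */

	WRITE_ONCE(rand_state, r << 1); /* one store, no partial spills */
	return r;
}
```

Without the volatile accesses, nothing stops the compiler from reading
rand_state twice and acting on two different values.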
* Re: [PATCH] mm/shuffle.c: optimize add_to_free_area_random()
2020-03-17 23:06 ` George Spelvin
@ 2020-03-17 23:38 ` Kees Cook
2020-03-18 1:44 ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
0 siblings, 1 reply; 25+ messages in thread
From: Kees Cook @ 2020-03-17 23:38 UTC (permalink / raw)
To: George Spelvin; +Cc: Dan Williams, linux-mm, Andrew Morton
On Tue, Mar 17, 2020 at 11:06:12PM +0000, George Spelvin wrote:
> The most serious is that if two threads simultaneously observe
> rand_bits == 1, but do their decrements separately, you end up
> with rand_bits = 255 and you generate 255 consecutive 0 bits
> before refilling the buffer.
>
> Since we're only generating random bits, a screwed-up answer occasionally
> doesn't really matter (like the comment says "lack of locking is
> deliberate"), but 255 screwed up bits is a bit much.
Okay, I'm on board! :) Thanks for spelling this race out; I hadn't seen
quite how nasty it could get. (Perhaps mention in the commit log for v2?)
> I avoided changing the underlying locking model because I didn't
> feel up to such an invasive change; I managed to fix the problems
> I saw without going there. And shrink the code; tht seemed like
> enough of a win to justify it to me.
Fair enough.
> The compiler is allowed to (in John Woods' memorable explanation)
> produce code that makes demons fly out of your nose. (More plausibly,
> it may simply crash.)
So one thing that I see here that is still in the nasal demon realm is
the left-shift of a signed value, which is technically undefined
behavior in C. (See the comment on check_shl_overflow().)
Doing a signedness check is very cheap in the resulting machine code;
but is sticking to unsigned and reversing direction for a
bottom-bit test too bad?
i.e.:
	static unsigned long rand_bits;
	unsigned long r = READ_ONCE(rand_bits), rshift = r >> 1;

	if (unlikely(rshift == 0)) {
		r = get_random_long();
		rshift = (r >> 1) | (0x1UL << (BITS_PER_LONG - 1));
	}
	WRITE_ONCE(rand_bits, rshift);

	if (r & 1)
		add_to...
	else
		add_to...tail
--
Kees Cook
* [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-17 23:38 ` Kees Cook
@ 2020-03-18 1:44 ` George Spelvin
2020-03-18 1:49 ` Randy Dunlap
` (4 more replies)
0 siblings, 5 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18 1:44 UTC (permalink / raw)
To: Kees Cook; +Cc: Dan Williams, linux-mm, Andrew Morton, lkml
The old code had separate "rand" and "rand_count" variables,
which could get out of sync with bad results.
In the worst case, two threads would see rand_count == 1 and
both decrement it, resultint in rand_count = 255 and rand being
filled with zeros for the next 255 calls.
Instead, pack them both into a single, atomically updatable,
variable. This makes it a lot easier to reason about race
conditions. They are still there - the code deliberately eschews
locking - but basically harmless on the rare occasions that
they happen.
Second, use READ_ONCE and WRITE_ONCE. Without them, we are deep
in the land of nasal demons. The compiler would be free to spill
temporaries to the static variables in arbitrary perverse ways
and create hard-to-find bugs.
(Alternatively, we could declare the static variable "volatile",
one of the few places in the Linux kernel that would be correct,
but it would probably annoy Linus.)
Third, use long rather than u64. This not only keeps the
state atomically updatable, it also speeds up the fast path
on 32-bit machines. Saving at least three instructions on
the fast path (one load, one add-with-carry, and one store)
is worth exchanging one call to get_random_u64 for two
calls to get_random_u32. The fast path of get_random_* is
less than the 3*64 = 192 instructions saved, and the slow
path happens every 64 bytes so isn't affected by the change.
I've tried a few variants. Keeping random lsbits with
a most-significant end marker, and using an explicit bool
flag rather than testing r both increase code size slightly.
                x86_64   i386
This code           94     95
Explicit bool      103     99
Lsbits              99    101
Both                96    100
Signed-off-by: George Spelvin <lkml@sdf.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
mm/shuffle.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/shuffle.c b/mm/shuffle.c
index e0ed247f8d90..4ba3ba84764d 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
void add_to_free_area_random(struct page *page, struct free_area *area,
int migratetype)
{
- static u64 rand;
- static u8 rand_bits;
+ static long rand; /* 0..BITS_PER_LONG-1 buffered random bits */
+ unsigned long r = READ_ONCE(rand), rshift = r << 1;;
/*
- * The lack of locking is deliberate. If 2 threads race to
- * update the rand state it just adds to the entropy.
+ * rand holds some random msbits, with a 1 bit appended, followed
+ * by zero-padding in the lsbits. This allows us to maintain
+ * the pre-generated bits and the count of bits in a single,
+ * atomically updatable, variable.
+ *
+ * The lack of locking is deliberate. If two threads race to
+ * update the rand state it just adds to the entropy. The
+ * worst that can happen is a random bit is used twice, or
+ * get_random_long is called redundantly.
*/
- if (rand_bits == 0) {
- rand_bits = 64;
- rand = get_random_u64();
+ if (unlikely(rshift == 0)) {
+ r = get_random_long();
+ rshift = r << 1 | 1;
}
+ WRITE_ONCE(rand, rshift);
- if (rand & 1)
+ if ((long)r < 0)
add_to_free_area(page, area, migratetype);
else
add_to_free_area_tail(page, area, migratetype);
- rand_bits--;
- rand >>= 1;
}
--
2.26.0.rc2
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 1:44 ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
@ 2020-03-18 1:49 ` Randy Dunlap
2020-03-18 3:53 ` Dan Williams
` (3 subsequent siblings)
4 siblings, 0 replies; 25+ messages in thread
From: Randy Dunlap @ 2020-03-18 1:49 UTC (permalink / raw)
To: George Spelvin, Kees Cook; +Cc: Dan Williams, linux-mm, Andrew Morton
Hi,
On 3/17/20 6:44 PM, George Spelvin wrote:
> + unsigned long r = READ_ONCE(rand), rshift = r << 1;;
don't need the double ending ;;
--
~Randy
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 1:44 ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
2020-03-18 1:49 ` Randy Dunlap
@ 2020-03-18 3:53 ` Dan Williams
2020-03-18 8:20 ` George Spelvin
2020-03-18 3:58 ` Kees Cook
` (2 subsequent siblings)
4 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-03-18 3:53 UTC (permalink / raw)
To: George Spelvin; +Cc: Kees Cook, Linux MM, Andrew Morton
On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
>
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being
> filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updatable,
> variable. This makes it a lot easier to reason about race
> conditions. They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE. Without them, we are deep
> in the land of nasal demons. The compiler would be free to spill
> temporaries to the static variables in arbitrary perverse ways
> and create hard-to-find bugs.
>
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
>
> Third, use long rather than u64. This not only keeps the
> state atomically updatable, it also speeds up the fast path
> on 32-bit machines. Saving at least three instructions on
> the fast path (one load, one add-with-carry, and one store)
> is worth exchanging one call to get_random_u64 for two
> calls to get_random_u32. The fast path of get_random_* is
> less than the 3*64 = 192 instructions saved, and the slow
> path happens every 64 bytes so isn't affected by the change.
>
> I've tried a few variants. Keeping random lsbits with
> a most-significant end marker, and using an explicit bool
> flag rather than testing r both increase code size slightly.
>
> x86_64 i386
> This code 94 95
> Explicit bool 103 99
> Lsbits 99 101
> Both 96 100
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
> mm/shuffle.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..4ba3ba84764d 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> void add_to_free_area_random(struct page *page, struct free_area *area,
> int migratetype)
> {
> - static u64 rand;
> - static u8 rand_bits;
> + static long rand; /* 0..BITS_PER_LONG-1 buffered random bits */
> + unsigned long r = READ_ONCE(rand), rshift = r << 1;;
>
> /*
> - * The lack of locking is deliberate. If 2 threads race to
> - * update the rand state it just adds to the entropy.
> + * rand holds some random msbits, with a 1 bit appended, followed
> + * by zero-padding in the lsbits. This allows us to maintain
> + * the pre-generated bits and the count of bits in a single,
> + * atomically updatable, variable.
> + *
> + * The lack of locking is deliberate. If two threads race to
> + * update the rand state it just adds to the entropy. The
> + * worst that can happen is a random bit is used twice, or
> + * get_random_long is called redundantly.
> */
> - if (rand_bits == 0) {
> - rand_bits = 64;
> - rand = get_random_u64();
> + if (unlikely(rshift == 0)) {
I had the impression that unless unlikely is "mostly never" then it
can do more harm than good. Is a branch guaranteed to be taken every
BITS_PER_LONG'th occurrence really a candidate for unlikely()
annotation?
Otherwise this change looks good to me.
> + r = get_random_long();
> + rshift = r << 1 | 1;
> }
> + WRITE_ONCE(rand, rshift);
>
> - if (rand & 1)
> + if ((long)r < 0)
> add_to_free_area(page, area, migratetype);
> else
> add_to_free_area_tail(page, area, migratetype);
> - rand_bits--;
> - rand >>= 1;
> }
> --
> 2.26.0.rc2
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 1:44 ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
2020-03-18 1:49 ` Randy Dunlap
2020-03-18 3:53 ` Dan Williams
@ 2020-03-18 3:58 ` Kees Cook
2020-03-18 15:26 ` Alexander Duyck
2020-03-18 20:39 ` [PATCH v3] " George Spelvin
4 siblings, 0 replies; 25+ messages in thread
From: Kees Cook @ 2020-03-18 3:58 UTC (permalink / raw)
To: George Spelvin; +Cc: Dan Williams, linux-mm, Andrew Morton
On Wed, Mar 18, 2020 at 01:44:10AM +0000, George Spelvin wrote:
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
>
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being
typo: resultint -> resulting
> filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updatable,
> variable. This makes it a lot easier to reason about race
> conditions. They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE. Without them, we are deep
> in the land of nasal demons. The compiler would be free to spill
> temporaries to the static variables in arbitrary perverse ways
> and create hard-to-find bugs.
>
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
>
> Third, use long rather than u64. This not only keeps the
> state atomically updatable, it also speeds up the fast path
> on 32-bit machines. Saving at least three instructions on
> the fast path (one load, one add-with-carry, and one store)
> is worth exchanging one call to get_random_u64 for two
> calls to get_random_u32. The fast path of get_random_* is
> less than the 3*64 = 192 instructions saved, and the slow
> path happens every 64 bytes so isn't affected by the change.
>
> I've tried a few variants. Keeping random lsbits with
> a most-significant end marker, and using an explicit bool
> flag rather than testing r both increase code size slightly.
>
> x86_64 i386
> This code 94 95
> Explicit bool 103 99
> Lsbits 99 101
> Both 96 100
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
And with Randy's other fix, please consider this:
Acked-by: Kees Cook <keescook@chromium.org>
-Kees
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
> mm/shuffle.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..4ba3ba84764d 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> void add_to_free_area_random(struct page *page, struct free_area *area,
> int migratetype)
> {
> - static u64 rand;
> - static u8 rand_bits;
> + static long rand; /* 0..BITS_PER_LONG-1 buffered random bits */
> + unsigned long r = READ_ONCE(rand), rshift = r << 1;;
>
> /*
> - * The lack of locking is deliberate. If 2 threads race to
> - * update the rand state it just adds to the entropy.
> + * rand holds some random msbits, with a 1 bit appended, followed
> + * by zero-padding in the lsbits. This allows us to maintain
> + * the pre-generated bits and the count of bits in a single,
> + * atomically updatable, variable.
> + *
> + * The lack of locking is deliberate. If two threads race to
> + * update the rand state it just adds to the entropy. The
> + * worst that can happen is a random bit is used twice, or
> + * get_random_long is called redundantly.
> */
> - if (rand_bits == 0) {
> - rand_bits = 64;
> - rand = get_random_u64();
> + if (unlikely(rshift == 0)) {
> + r = get_random_long();
> + rshift = r << 1 | 1;
> }
> + WRITE_ONCE(rand, rshift);
>
> - if (rand & 1)
> + if ((long)r < 0)
> add_to_free_area(page, area, migratetype);
> else
> add_to_free_area_tail(page, area, migratetype);
> - rand_bits--;
> - rand >>= 1;
> }
> --
> 2.26.0.rc2
--
Kees Cook
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 3:53 ` Dan Williams
@ 2020-03-18 8:20 ` George Spelvin
2020-03-18 17:36 ` Dan Williams
0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-18 8:20 UTC (permalink / raw)
To: Dan Williams; +Cc: Kees Cook, Linux MM, Andrew Morton, lkml
On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>> - if (rand_bits == 0) {
>> - rand_bits = 64;
>> - rand = get_random_u64();
>> + if (unlikely(rshift == 0)) {
>
> I had the impression that unless unlikely is "mostly never" then it
> can do more harm than good. Is a branch guaranteed to be taken every
> BITS_PER_LONG'th occurrence really a candidate for unlikely()
> annotation?
I had to look this up. GCC manual:
For the purposes of branch prediction optimizations, the probability
that a '__builtin_expect' expression is 'true' is controlled by GCC's
'builtin-expect-probability' parameter, which defaults to 90%. You can
also use '__builtin_expect_with_probability' to explicitly assign a
probability value to individual expressions.
So I think that <= 10% is good enough, which is true in this case.
I was trying to encourage the compiler to:
* Place this code path out of line, and
* Not do the stack manipulations (build a frame, spill registers)
needed for a non-leaf function if this path isn't taken.
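unlikely() is just __builtin_expect under the hood; a minimal userspace
rendering of the pattern is below (the kernel's real macro is
essentially this, but `consume_bit` and the counter are hypothetical,
here only to show that the hint changes code placement, never the
tested value):

```c
#define unlikely(x) __builtin_expect(!!(x), 0)

static int refills;

/* The hint steers block layout toward the common path; the branch is
 * still taken exactly once every 64 calls, i.e. under the 10% default
 * probability GCC assumes for __builtin_expect. */
static int consume_bit(int remaining)
{
	if (unlikely(remaining == 0)) {
		refills++;              /* slow path, laid out out of line */
		remaining = 64;
	}
	return remaining - 1;
}
```

With the slow path out of line, the fast path can stay a leaf-like
straight run with no frame setup until a refill is actually needed.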
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 1:44 ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
` (2 preceding siblings ...)
2020-03-18 3:58 ` Kees Cook
@ 2020-03-18 15:26 ` Alexander Duyck
2020-03-18 18:35 ` George Spelvin
2020-03-18 20:39 ` [PATCH v3] " George Spelvin
4 siblings, 1 reply; 25+ messages in thread
From: Alexander Duyck @ 2020-03-18 15:26 UTC (permalink / raw)
To: George Spelvin; +Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton
On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
>
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being
> filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updatable,
> variable. This makes it a lot easier to reason about race
> conditions. They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE. Without them, we are deep
> in the land of nasal demons. The compiler would be free to spill
> temporaries to the static variables in arbitrary perverse ways
> and create hard-to-find bugs.
>
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
>
> Third, use long rather than u64. This not only keeps the
> state atomically updatable, it also speeds up the fast path
> on 32-bit machines. Saving at least three instructions on
> the fast path (one load, one add-with-carry, and one store)
> is worth exchanging one call to get_random_u64 for two
> calls to get_random_u32. The fast path of get_random_* is
> less than the 3*64 = 192 instructions saved, and the slow
> path happens every 64 bytes so isn't affectrd by the change.
>
> I've tried a few variants. Keeping random lsbits with
> a most-significant end marker, and using an explicit bool
> flag rather than testing r both increase code size slightly.
>
> x86_64 i386
> This code 94 95
> Explicit bool 103 99
> Lsbits 99 101
> Both 96 100
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
> mm/shuffle.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..4ba3ba84764d 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> void add_to_free_area_random(struct page *page, struct free_area *area,
> int migratetype)
> {
> - static u64 rand;
> - static u8 rand_bits;
> + static long rand; /* 0..BITS_PER_LONG-1 buffered random bits */
> + unsigned long r = READ_ONCE(rand), rshift = r << 1;;
>
> /*
> - * The lack of locking is deliberate. If 2 threads race to
> - * update the rand state it just adds to the entropy.
> + * rand holds some random msbits, with a 1 bit appended, followed
> + * by zero-padding in the lsbits. This allows us to maintain
> + * the pre-generated bits and the count of bits in a single,
> + * atomically updatable, variable.
> + *
> + * The lack of locking is deliberate. If two threads race to
> + * update the rand state it just adds to the entropy. The
> + * worst that can happen is a random bit is used twice, or
> + * get_random_long is called redundantly.
> */
> - if (rand_bits == 0) {
> - rand_bits = 64;
> - rand = get_random_u64();
> + if (unlikely(rshift == 0)) {
> + r = get_random_long();
> + rshift = r << 1 | 1;
You might want to wrap the "r << 1" in parenthesis. Also you could
probably use a + 1 instead of an | 1.
> }
> + WRITE_ONCE(rand, rshift);
>
> - if (rand & 1)
> + if ((long)r < 0)
One trick you might be able to get away with here is to actually
compare r to rshift. "If (rshift <= r)" should give you the same
result. This works since what you are essentially doing is just adding
r to itself so if you overflow rshift will be equal to at most r - 1.
However with the addition of the single bit in the rshift == 0 case it
could potentially be equal in the unlikely case of r being all 1's.
> add_to_free_area(page, area, migratetype);
> else
> add_to_free_area_tail(page, area, migratetype);
> - rand_bits--;
> - rand >>= 1;
> }
> --
> 2.26.0.rc2
>
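Alexander's suggested `rshift <= r` comparison above can be checked
exhaustively at a narrower word size; this standalone sketch (userspace,
16-bit model, names invented here) verifies it against the sign test for
both the refill path and the steady-state path:

```c
#include <stdbool.h>
#include <stdint.h>

/* Exhaustively compare the (long)r < 0 sign test against rshift <= r
 * for every 16-bit value of r. */
static bool trick_holds(void)
{
	uint32_t v;

	for (v = 0; v <= 0xFFFF; v++) {
		uint16_t r = (uint16_t)v;
		uint16_t common = (uint16_t)(r << 1);      /* rshift, steady state */
		uint16_t refill = (uint16_t)(common | 1);  /* rshift after refill */
		int msb = r >> 15;                         /* the sign-bit test */

		if ((refill <= r) != msb)
			return false;
		/* common == 0 takes the refill path, so is never compared. */
		if (common != 0 && (common <= r) != msb)
			return false;
	}
	return true;
}
```

The only equality case is r all 1s on the refill path, where
rshift == r, and there the msbit is set anyway, so <= still gives the
right answer.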
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 8:20 ` George Spelvin
@ 2020-03-18 17:36 ` Dan Williams
2020-03-18 19:29 ` George Spelvin
0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-03-18 17:36 UTC (permalink / raw)
To: George Spelvin; +Cc: Kees Cook, Linux MM, Andrew Morton
On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
>
> On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> > On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
> >> - if (rand_bits == 0) {
> >> - rand_bits = 64;
> >> - rand = get_random_u64();
> >> + if (unlikely(rshift == 0)) {
> >
> > I had the impression that unless unlikely is "mostly never" then it
> > can do more harm than good. Is a branch guaranteed to be taken every
> > BITS_PER_LONG'th occurrence really a candidate for unlikely()
> > annotation?
>
> I had to look this up. GCC manual:
>
> For the purposes of branch prediction optimizations, the probability
> that a '__builtin_expect' expression is 'true' is controlled by GCC's
> 'builtin-expect-probability' parameter, which defaults to 90%. You can
> also use '__builtin_expect_with_probability' to explicitly assign a
> probability value to individual expressions.
>
> So I think that <= 10% is good enough, which is true in this case.
>
> I was trying to encourage the compiler to:
> * Place this code path out of line, and
> * Not do the stack manipulations (build a frame, spill registers)
> needed for a non-leaf function if this path isn't taken.
Understood, I think it's ok in this case because the shuffling only
happens for order-10 page free events by default so it will be
difficult to measure the perf impact either way. But in other kernel
contexts I think unlikely() annotation should come with numbers, 90%
not taken is not sufficient in and of itself.
You can add:
Acked-by: Dan Williams <dan.j.williams@intel.com>
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 15:26 ` Alexander Duyck
@ 2020-03-18 18:35 ` George Spelvin
2020-03-18 19:17 ` Alexander Duyck
0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-18 18:35 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, lkml
On Wed, Mar 18, 2020 at 08:26:06AM -0700, Alexander Duyck wrote:
> On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>> + if (unlikely(rshift == 0)) {
>> + r = get_random_long();
>> + rshift = r << 1 | 1;
>
> You might want to wrap the "r << 1" in parentheses. Also you could
> probably use a + 1 instead of an | 1.
I could, but what would it matter? I have just confirmed that all of:
x << 1 | 1;
(x << 1) + 1;
x + x + 1;
x + x | 1;
2*x + 1;
2*x | 1;
compile to
leal 1(%rdi,%rdi), %eax
on x86, and two instructions on every other processor I can think of.
Since this is conceptually a bit-manipulation operation where carry
propagation is undesirable, the logical operation form seems the most
natural way to write it.
As for the parens, all C programmers are forced to remember that the
bitwise operators have weirdly low precedence (below < <= == >= >),
so there's no risk of confusion.
>> }
>> + WRITE_ONCE(rand, rshift);
>>
>> - if (rand & 1)
>> + if ((long)r < 0)
>
> One trick you might be able to get away with here is to actually
> compare r to rshift. "If (rshift <= r)" should give you the same
> result. This works since what you are essentially doing is just adding
> r to itself so if you overflow rshift will be equal to at most r - 1.
> However with the addition of the single bit in the rshift == 0 case it
> could potentially be equal in the unlikely case of r being all 1's.
Er... but why would I want to? On most processors, "branch on sign bit"
is a single instruction, and that's the instruction I'm hoping the
compiler will generate.
That's why I changed the shift direction from the original right (testing
the lsbit) to left (testing the msbit): slight code size reduction.
Anything else produces larger and slower object code, for no benefit.
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 18:35 ` George Spelvin
@ 2020-03-18 19:17 ` Alexander Duyck
2020-03-18 20:06 ` George Spelvin
0 siblings, 1 reply; 25+ messages in thread
From: Alexander Duyck @ 2020-03-18 19:17 UTC (permalink / raw)
To: George Spelvin; +Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton
On Wed, Mar 18, 2020 at 11:35 AM George Spelvin <lkml@sdf.org> wrote:
>
> On Wed, Mar 18, 2020 at 08:26:06AM -0700, Alexander Duyck wrote:
> > On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
> >> + if (unlikely(rshift == 0)) {
> >> + r = get_random_long();
> >> + rshift = r << 1 | 1;
> >
> > You might want to wrap the "r << 1" in parentheses. Also you could
> > probably use a + 1 instead of an | 1.
>
> I could, but what would it matter? I have just confirmed that all of:
> x << 1 | 1;
> (x << 1) + 1;
> x + x + 1;
> x + x | 1;
> 2*x + 1;
> 2*x | 1;
> compile to
> leal 1(%rdi,%rdi), %eax
>
> on x86, and two instructions on every other processor I can think of.
>
> Since this is conceptually a bit-manipulation operation where carry
> propagation is undesirable, the logical operation form seems the most
> natural way to write it.
>
> As for the parens, all C programmers are forced to remember that the
> bitwise operators have weirdly low precedence (below < <= == >= >),
> so there's no risk of confusion.
Okay, if this is coming out the same regardless, I suppose it doesn't
matter. I am still not a huge fan of skipping the parentheses, as I
feel it is a bit uglier to read, but that isn't anything that has to be
fixed.
> >> }
> >> + WRITE_ONCE(rand, rshift);
> >>
> >> - if (rand & 1)
> >> + if ((long)r < 0)
> >
> > One trick you might be able to get away with here is to actually
> > compare r to rshift. "If (rshift <= r)" should give you the same
> > result. This works since what you are essentially doing is just adding
> > r to itself so if you overflow rshift will be equal to at most r - 1.
> > However with the addition of the single bit in the rshift == 0 case it
> > could potentially be equal in the unlikely case of r being all 1's.
>
> Er... but why would I want to? On most processors, "branch on sign bit"
> is a single instruction, and that's the instruction I'm hoping the
> compiler will generate.
>
> That's why I changed the shift direction from the original right (testing
> the lsbit) to left (testing the msbit): slight code size reduction.
>
> Anything else produces larger and slower object code, for no benefit.
I was just putting it out there as a possibility. What I have seen in
the past is that under some circumstances gcc can be smart enough to
interpret that as a "branch on carry". My thought was you are likely
having to test the value against itself and then you might be able to
make use of shift and carry flag to avoid that. In addition you could
get away from having to recast an unsigned value as a signed one in
order to perform the bit test.
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 17:36 ` Dan Williams
@ 2020-03-18 19:29 ` George Spelvin
2020-03-18 19:40 ` Dan Williams
0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-18 19:29 UTC (permalink / raw)
To: Dan Williams; +Cc: Kees Cook, Linux MM, Andrew Morton, lkml
On Wed, Mar 18, 2020 at 10:36:10AM -0700, Dan Williams wrote:
> On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
>> On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
>>> I had the impression that unless unlikely is "mostly never" then it
>>> can do more harm than good. Is a branch guaranteed to be taken every
>>> BITS_PER_LONG'th occurrence really a candidate for unlikely()
>>> annotation?
>>
>> I had to look this up. GCC manual:
>>
>> For the purposes of branch prediction optimizations, the probability
>> that a '__builtin_expect' expression is 'true' is controlled by GCC's
>> 'builtin-expect-probability' parameter, which defaults to 90%. You can
>> also use '__builtin_expect_with_probability' to explicitly assign a
>> probability value to individual expressions.
>>
>> So I think that <= 10% is good enough, which is true in this case.
>>
>> I was trying to encourage the compiler to:
>> * Place this code path out of line, and
>> * Not do the stack manipulations (build a frame, spill registers)
>> needed for a non-leaf function if this path isn't taken.
>
> Understood, I think it's ok in this case because the shuffling only
> happens for order-10 page free events by default so it will be
> difficult to measure the perf impact either way. But in other kernel
> contexts I think unlikely() annotation should come with numbers, 90%
> not taken is not sufficient in and of itself.
I'm not sure I fully understand your point. I *think* you're
editorializing on unlikely() in general and not this specific code,
but it's a little hard to follow.
Your mention of "order-10 page free events" is confusing. Do you mean
"(order-10 page) free events", i.e. freeing of 1024 consecutive pages?
Or are you using "order" as a synonym for "approximately" and you mean
"approximately 10 (page free event)s"?
We both agree (I hope) that the number here is obvious on brief
inspection: 1/BITS_PER_LONG.
GCC's heuristics are tuned to value cycles on the fast path 9x as much as
cycles on the slow path, so it will spend up to 9 cycles on the slow path
to save a cycle on the fast path.
I've found one comment (https://pastebin.com/S8Y8tqZy) saying that
GCC < 9.x was a lot sloppier on the cost ratio and could pessimize
the code if the branch was more than ~ 1% taken. Perhaps that's what
you're remembering?
Fortunately, 1/64 = 1.56% is fairly close to 1%, so I'm not too
worried.
> You can add:
>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
Thank you!
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 19:29 ` George Spelvin
@ 2020-03-18 19:40 ` Dan Williams
2020-03-18 21:02 ` George Spelvin
0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-03-18 19:40 UTC (permalink / raw)
To: George Spelvin; +Cc: Kees Cook, Linux MM, Andrew Morton
On Wed, Mar 18, 2020 at 12:29 PM George Spelvin <lkml@sdf.org> wrote:
>
> On Wed, Mar 18, 2020 at 10:36:10AM -0700, Dan Williams wrote:
> > On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
> >> On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> >>> I had the impression that unless unlikely is "mostly never" then it
> >>> can do more harm than good. Is a branch guaranteed to be taken every
> >>> BITS_PER_LONG'th occurrence really a candidate for unlikely()
> >>> annotation?
> >>
> >> I had to look this up. GCC manual:
> >>
> >> For the purposes of branch prediction optimizations, the probability
> >> that a '__builtin_expect' expression is 'true' is controlled by GCC's
> >> 'builtin-expect-probability' parameter, which defaults to 90%. You can
> >> also use '__builtin_expect_with_probability' to explicitly assign a
> >> probability value to individual expressions.
> >>
> >> So I think that <= 10% is good enough, which is true in this case.
> >>
> >> I was trying to encourage the compiler to:
> >> * Place this code path out of line, and
> >> * Not do the stack manipulations (build a frame, spill registers)
> >> needed for a non-leaf function if this path isn't taken.
> >
> > Understood, I think it's ok in this case because the shuffling only
> > happens for order-10 page free events by default so it will be
> > difficult to measure the perf impact either way. But in other kernel
> > contexts I think unlikely() annotation should come with numbers, 90%
> > not taken is not sufficient in and of itself.
>
> I'm not sure I fully understand your point. I *think* you're
> editorializing on unlikely() in general and not this specific code,
> but it's a little hard to follow.
Yes, editorializing on unlikely(). Specifically I would normally ask
for perf numbers to show that the hint is worth it, but I talked
myself out of asking for that in this case.
> Your mention of "order-10 page free events" is confusing. Do you mean
> "(order-10 page) free events", i.e. freeing of 1024 consecutive pages?
> Or are you using "order" as a synonym for "approximately" and you mean
> "approximately 10 (page free event)s"?
I'm referring to this:
if (is_shuffle_order(order))
add_to_free_area_random(page, &zone->free_area[order],
Where shuffle order is MAX_ORDER-1. I.e. this code is only triggered
when we might be releasing a 4MB buddy-page.
> We both agree (I hope) that the number here is obvious on brief
> inspection: 1/BITS_PER_LONG.
>
> GCC's heuristics are tuned to value cycles on the fast path 9x as much as
> cycles on the slow path, so it will spend up to 9 cycles on the slow path
> to save a cycle on the fast path.
>
> I've found one comment (https://pastebin.com/S8Y8tqZy) saying that
> GCC < 9.x was a lot sloppier on the cost ratio and could pessimize
> the code if the branch was more than ~ 1% taken. Perhaps that's what
> you're remembering?
Yes, thanks for that digging!
> Fortunately, 1/64 = 1.56% is fairly close to 1%, so I'm not too
> worried.
Makes sense.
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 19:17 ` Alexander Duyck
@ 2020-03-18 20:06 ` George Spelvin
0 siblings, 0 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18 20:06 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, lkml
On Wed, Mar 18, 2020 at 12:17:14PM -0700, Alexander Duyck wrote:
> I was just putting it out there as a possibility. What I have seen in
> the past is that under some circumstances gcc can be smart enough to
> interpret that as a "branch on carry". My thought was you are likely
> having to test the value against itself and then you might be able to
> make use of shift and carry flag to avoid that. In addition you could
> get away from having to recast an unsigned value as a signed one in
> order to perform the bit test.
Ah, yes, it would be nice if gcc could use the carry bit for r
rather than having to devote a whole register to it. But it has
to do two unrelated flag tests (zero and carry), and it's generally
pretty bad at taking advantage of preserved flag bits like that.
My ideal x86-64 object code would be:
shlq rand(%rip)
jz fixup
fixed:
jnc tail
jmp add_to_free_area
tail:
jmp add_to_free_area_tail
fixup:
pushq %rdx
pushq %rsi
pushq %rdi
call get_random_u64
popq %rdi
popq %rsi
popq %rdx
stc
adcq %rax,%rax
movq %rax, rand(%rip)
jmp fixed
... but I don't know how to induce GCC to generate that, and
the function doesn't seem worthy of platform-specific asm.
(Note that I have to use stc/adc on the slow path because lea doesn't
set the carry bit.)
* [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 1:44 ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
` (3 preceding siblings ...)
2020-03-18 15:26 ` Alexander Duyck
@ 2020-03-18 20:39 ` George Spelvin
2020-03-18 21:34 ` Alexander Duyck
2020-03-19 12:05 ` [PATCH v4] " George Spelvin
4 siblings, 2 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18 20:39 UTC (permalink / raw)
To: Kees Cook
Cc: Dan Williams, linux-mm, Andrew Morton, Alexander Duyck,
Randy Dunlap, lkml
The old code had separate "rand" and "rand_count" variables,
which could get out of sync with bad results.
In the worst case, two threads would see rand_count = 1 and
both decrement it, resulting in rand_count = 255 and rand being
filled with zeros for the next 255 calls.
Instead, pack them both into a single, atomically updateable,
variable. This makes it a lot easier to reason about race
conditions. They are still there - the code deliberately eschews
locking - but basically harmless on the rare occasions that
they happen.
Second, use READ_ONCE and WRITE_ONCE. Without them, we are deep
in the land of nasal demons. The compiler would be free to spill
temporaries to the static variables in arbitrarily perverse ways
and create hard-to-find bugs.
(Alternatively, we could declare the static variable "volatile",
one of the few places in the Linux kernel that would be correct,
but it would probably annoy Linus.)
Third, use long rather than u64. This not only keeps the state
atomically updateable, it also speeds up the fast path on 32-bit
machines. Saving at least three instructions on the fast path (one
load, one add-with-carry, and one store) is worth a second call
to get_random_u*() per 64 bits. The fast path of get_random_u*
costs less than the 3*64 = 192 instructions saved, and its slow path
happens only every 64 bytes of output, so it isn't affected by the change.
Fourth, use left-shift rather than right. Testing the sign bit
produces slightly smaller/faster code than testing the lsbit.
I've tried shifting both ways, and copying the desired bit to
a boolean before shifting rather than keeping separate full-width
r and rshift variables, but both produce larger code:
x86_64 i386
This code 94 95
Explicit bool 103 99
Lsbits 99 101
Both 96 100
In a perfect world, the x86-64 object code would be tiny,
and dominated by the unavoidable unpredictable branch, but
this function doesn't warrant arch-specific optimization.
add_to_free_area_random:
shlq rand(%rip)
jz 3f
1:
jnc 2f
jmp add_to_free_area # tail call
2:
jmp add_to_free_area_tail
3:
pushq %rdx
pushq %rsi
pushq %rdi
call get_random_u64
popq %rdi
popq %rsi
popq %rdx
stc
adcq %rax,%rax # not lea because we need carry out
movq %rax, rand(%rip)
jmp 1b
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
v2: Rewrote commit message to explain existing races better.
Made local variables unsigned to avoid (technically undefined)
signed overflow.
v3: Typos fixed, Acked-by, expanded commit message.
mm/shuffle.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/mm/shuffle.c b/mm/shuffle.c
index e0ed247f8d90..16c5fddc292f 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
void add_to_free_area_random(struct page *page, struct free_area *area,
int migratetype)
{
- static u64 rand;
- static u8 rand_bits;
+ static unsigned long rand; /* buffered random bits */
+ unsigned long r = READ_ONCE(rand), rshift = r << 1;
/*
- * The lack of locking is deliberate. If 2 threads race to
- * update the rand state it just adds to the entropy.
+ * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
+ * 1 bit, then zero-padding in the lsbits. This allows us to
+ * maintain the pre-generated bits and the count of bits in a
+ * single, atomically updatable, variable.
+ *
+ * The lack of locking is deliberate. If two threads race to
+ * update the rand state it just adds to the entropy. The
+ * worst that can happen is a random bit is used twice, or
+ * get_random_long is called redundantly.
*/
- if (rand_bits == 0) {
- rand_bits = 64;
- rand = get_random_u64();
+ if (unlikely(rshift == 0)) {
+ r = get_random_long();
+ rshift = r << 1 | 1;
}
+ WRITE_ONCE(rand, rshift);
- if (rand & 1)
+ if ((long)r < 0)
add_to_free_area(page, area, migratetype);
else
add_to_free_area_tail(page, area, migratetype);
- rand_bits--;
- rand >>= 1;
}
--
2.26.0.rc2
* Re: [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 19:40 ` Dan Williams
@ 2020-03-18 21:02 ` George Spelvin
0 siblings, 0 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18 21:02 UTC (permalink / raw)
To: Dan Williams; +Cc: Kees Cook, Linux MM, Andrew Morton, lkml
On Wed, Mar 18, 2020 at 12:40:33PM -0700, Dan Williams wrote:
> Yes, editorializing on unlikely(). Specifically I would normally ask
> for perf numbers to show that the hint is worth it, but I talked
> myself out of asking for that in this case.
Ah, now I understand! Yes, it's just gilding the lily, but as long as I
was messing about in the area it seemed worth adding.
You're saying that if the code were a hotter code path, you'd want
benchmarks, but given its actual infrequent usage, you're satisfied with
the argument that it's probably right; it's not like we've crippled the
kernel even if wrong.
> I'm referring to this:
>
> if (is_shuffle_order(order))
> add_to_free_area_random(page, &zone->free_area[order],
>
> Where shuffle order is MAX_ORDER-1. I.e. this code is only triggered
> when we might be releasing a 4MB buddy-page.
Ding! Okay, now it makes sense. I didn't look that far up the call
stack. I was auditing each and every call to a get_random function in
the kernel, when I came across this call site, struggled to understand
what it was doing, and rewrote it to be less wrong.
I only vaguely understand where it fits into the larger mm/ system
so I never noticed that condition on its use.
Yes, given that infrequent use, I'm even happier that I focused on code
size rather than performance.
Thank you for taking the time to explain.
* Re: [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 20:39 ` [PATCH v3] " George Spelvin
@ 2020-03-18 21:34 ` Alexander Duyck
2020-03-18 22:49 ` George Spelvin
2020-03-19 12:05 ` [PATCH v4] " George Spelvin
1 sibling, 1 reply; 25+ messages in thread
From: Alexander Duyck @ 2020-03-18 21:34 UTC (permalink / raw)
To: George Spelvin
Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, Randy Dunlap
On Wed, Mar 18, 2020 at 1:39 PM George Spelvin <lkml@sdf.org> wrote:
>
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
>
> In the worst case, two threads would see rand_count = 1 and
> both decrement it, resulting in rand_count = 255 and rand being
> filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updateable,
> variable. This makes it a lot easier to reason about race
> conditions. They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE. Without them, we are deep
> in the land of nasal demons. The compiler would be free to spill
> temporaries to the static variables in arbitrarily perverse ways
> and create hard-to-find bugs.
>
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
>
> Third, use long rather than u64. This not only keeps the state
> atomically updateable, it also speeds up the fast path on 32-bit
> machines. Saving at least three instructions on the fast path (one
> load, one add-with-carry, and one store) is worth a second call
> to get_random_u*() per 64 bits. The fast path of get_random_u*
> costs less than the 3*64 = 192 instructions saved, and its slow path
> happens only every 64 bytes of output, so it isn't affected by the change.
>
> Fourth, use left-shift rather than right. Testing the sign bit
> produces slightly smaller/faster code than testing the lsbit.
>
> I've tried shifting both ways, and copying the desired bit to
> a boolean before shifting rather than keeping separate full-width
> r and rshift variables, but both produce larger code:
>
> x86_64 i386
> This code 94 95
> Explicit bool 103 99
> Lsbits 99 101
> Both 96 100
>
> In a perfect world, the x86-64 object code would be tiny,
> and dominated by the unavoidable unpredictable branch, but
> this function doesn't warrant arch-specific optimization.
>
> add_to_free_area_random:
> shlq rand(%rip)
> jz 3f
> 1:
> jnc 2f
> jmp add_to_free_area # tail call
> 2:
> jmp add_to_free_area_tail
> 3:
> pushq %rdx
> pushq %rsi
> pushq %rdi
> call get_random_u64
> popq %rdi
> popq %rsi
> popq %rdx
> stc
> adcq %rax,%rax # not lea because we need carry out
> movq %rax, rand(%rip)
> jmp 1b
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Acked-by: Kees Cook <keescook@chromium.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
What kernel is this based on? You might want to rebase on the latest
linux-next as it occurs to me that this function was renamed to
shuffle_pick_tail as I had incorporated a few bits of it into the
logic for placing buddy pages and reported pages on the tail of the
list.
> ---
> v2: Rewrote commit message to explain existing races better.
> Made local variables unsigned to avoid (technically undefined)
> signed overflow.
> v3: Typos fixed, Acked-by, expanded commit message.
>
> mm/shuffle.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..16c5fddc292f 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> void add_to_free_area_random(struct page *page, struct free_area *area,
> int migratetype)
> {
> - static u64 rand;
> - static u8 rand_bits;
> + static unsigned long rand; /* buffered random bits */
> + unsigned long r = READ_ONCE(rand), rshift = r << 1;
>
> /*
> - * The lack of locking is deliberate. If 2 threads race to
> - * update the rand state it just adds to the entropy.
> + * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
> + * 1 bit, then zero-padding in the lsbits. This allows us to
> + * maintain the pre-generated bits and the count of bits in a
> + * single, atomically updatable, variable.
> + *
> + * The lack of locking is deliberate. If two threads race to
> + * update the rand state it just adds to the entropy. The
> + * worst that can happen is a random bit is used twice, or
> + * get_random_long is called redundantly.
> */
> - if (rand_bits == 0) {
> - rand_bits = 64;
> - rand = get_random_u64();
> + if (unlikely(rshift == 0)) {
> + r = get_random_long();
> + rshift = r << 1 | 1;
> }
> + WRITE_ONCE(rand, rshift);
>
> - if (rand & 1)
> + if ((long)r < 0)
> add_to_free_area(page, area, migratetype);
> else
> add_to_free_area_tail(page, area, migratetype);
> - rand_bits--;
> - rand >>= 1;
> }
> --
> 2.26.0.rc2
* Re: [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 21:34 ` Alexander Duyck
@ 2020-03-18 22:49 ` George Spelvin
2020-03-18 22:57 ` Dan Williams
0 siblings, 1 reply; 25+ messages in thread
From: George Spelvin @ 2020-03-18 22:49 UTC (permalink / raw)
To: Alexander Duyck
Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, Randy Dunlap, lkml
On Wed, Mar 18, 2020 at 02:34:04PM -0700, Alexander Duyck wrote:
> What kernel is this based on? You might want to rebase on the latest
> linux-next as it occurs to me that this function was renamed to
> shuffle_pick_tail as I had incorporated a few bits of it into the
> logic for placing buddy pages and reported pages on the tail of the
> list.
5.5.8. I didn't realize it made much difference, but I see it does. Due
to the different return type, the best variant has probably changed and
I'll have to re-check.
Since there's only one call site, should the function be moved to
page_alloc.c and inlined?
* Re: [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 22:49 ` George Spelvin
@ 2020-03-18 22:57 ` Dan Williams
2020-03-18 23:18 ` George Spelvin
0 siblings, 1 reply; 25+ messages in thread
From: Dan Williams @ 2020-03-18 22:57 UTC (permalink / raw)
To: George Spelvin
Cc: Alexander Duyck, Kees Cook, linux-mm, Andrew Morton, Randy Dunlap
On Wed, Mar 18, 2020 at 3:49 PM George Spelvin <lkml@sdf.org> wrote:
>
> On Wed, Mar 18, 2020 at 02:34:04PM -0700, Alexander Duyck wrote:
> > What kernel is this based on? You might want to rebase on the latest
> > linux-next as it occurs to me that this function was renamed to
> > shuffle_pick_tail as I had incorporated a few bits of it into the
> > logic for placing buddy pages and reported pages on the tail of the
> > list.
>
> 5.5.8. I didn't realize it made much difference, but I see it does. Due
> to the different return type, the best variant has probably changed and
> I'll have to re-check.
>
> Since there's only one call site, should the function be moved to
> page_alloc.c and inlined?
The intent was to make it compile-away on
CONFIG_SHUFFLE_PAGE_ALLOCATOR=n builds. However,
CONFIG_SHUFFLE_PAGE_ALLOCATOR=y is now the common case for distros.
So, it's not serving much purpose being in mm/shuffle.c. I'd support
moving it.
* Re: [PATCH v3] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 22:57 ` Dan Williams
@ 2020-03-18 23:18 ` George Spelvin
0 siblings, 0 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-18 23:18 UTC (permalink / raw)
To: Dan Williams
Cc: Alexander Duyck, Kees Cook, linux-mm, Andrew Morton, Randy Dunlap, lkml
On Wed, Mar 18, 2020 at 03:57:01PM -0700, Dan Williams wrote:
> On Wed, Mar 18, 2020 at 3:49 PM George Spelvin <lkml@sdf.org> wrote:
>> Since there's only one call site, should the function be moved to
>> page_alloc.c and inlined?
>
> The intent was to make it compile-away on
> CONFIG_SHUFFLE_PAGE_ALLOCATOR=n builds. However,
> CONFIG_SHUFFLE_PAGE_ALLOCATOR=y is now the common case for distros.
> So, it's not serving much purpose being in mm/shuffle.c. I'd support
> moving it.
The options are to define it as an inline function in shuffle.h,
which keeps the #ifdef localized, or to drop it into page_alloc.c,
which shrinks the shuffle.h interface but requires an #ifdef in
page_alloc.h.
Um... or does it? If is_shuffle_order always returns false,
shuffle_pick_tail is dead code and will be elided by the compiler.
But will the static variable be elided, too? Need to check.
* [PATCH v4] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-18 20:39 ` [PATCH v3] " George Spelvin
2020-03-18 21:34 ` Alexander Duyck
@ 2020-03-19 12:05 ` George Spelvin
2020-03-19 17:49 ` Alexander Duyck
2020-03-20 17:58 ` Kees Cook
1 sibling, 2 replies; 25+ messages in thread
From: George Spelvin @ 2020-03-19 12:05 UTC (permalink / raw)
To: Kees Cook
Cc: Dan Williams, linux-mm, Andrew Morton, Alexander Duyck,
Randy Dunlap, lkml
The separate "rand" and "rand_count" variables could get out of
sync with bad results.
In the worst case, two threads would see rand_count=1 and both
decrement it, resulting in rand_count=255 and rand being filled
with zeros for the next 255 calls.
Instead, pack them both into a single, atomically updateable,
variable. This makes it a lot easier to reason about race
conditions. They are still there - the code deliberately eschews
locking - but basically harmless on the rare occasions that
they happen.
Second, use READ_ONCE and WRITE_ONCE. Because the random bit
buffer is accessed by multiple threads concurrently without
locking, omitting those puts us deep in the land of nasal demons.
The compiler would be free to spill to the static variable in
arbitrarily perverse ways and create hard-to-find bugs.
(I'm torn between this and just declaring the buffer "volatile".
Linux tends to prefer marking accesses rather than variables,
but in this case, every access to the buffer is volatile.
It makes no difference to the generated code.)
Third, use long rather than u64. This not only keeps the state
atomically updateable, it also speeds up the fast path on 32-bit
machines. Saving at least three instructions on the fast path (one
load, one add-with-carry, and one store) is worth a second call
to get_random_u*() per 64 bits. The fast path of get_random_u*
costs less than the 3*64 = 192 instructions saved, and its slow path
happens only every 64 bytes of output, so it isn't affected by the change.
Fourth, make the function inline. It's small, and there's only
one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
function call overhead.
Fifth, use the msbits of the buffer first (left shift) rather
than the lsbits (right shift). Testing the sign bit produces
slightly smaller/faster code than testing the lsbit.
I've tried shifting both ways, and copying the desired bit to a
boolean before shifting rather than keeping separate full-width
r and rshift variables, but both produce larger code:
x86-64 text size
Msbit 42236
Lsbit 42242 (+6)
Lsbit+bool 42258 (+22)
Msbit+bool 42284 (+52)
(Since this is straight-line code, size is a good proxy for number
of instructions and execution time. Using READ/WRITE_ONCE instead of
volatile makes no difference.)
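(For comparison, a user-space model of the hypothetical lsbit-first
variant measured above; again random() stands in for
get_random_long() and a 64-bit long is assumed. Here the end marker
sits above the random bits and shifts down toward bit 0.)

```c
#include <stdlib.h>
#include <assert.h>

/* Hypothetical lsbit-first variant, for size comparison only. */
static unsigned long lsb_buf;

static int pick_tail_lsb(void)
{
	unsigned long r = lsb_buf, rshift = r >> 1;

	if (rshift == 0) {	/* only the end marker was left */
		/* stand-in for get_random_long() */
		r = (unsigned long)random() << 33 ^ (unsigned long)random();
		rshift = r >> 1 | 1UL << 63;	/* marker above the bits */
	}
	lsb_buf = rshift;

	return r & 1;	/* consume the least-significant bit */
}
```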
In a perfect world, on x86-64 the fast path would be:
shlq rand(%rip)
jz refill
refill_complete:
jc add_to_tail
but I don't see how to get gcc to generate that, and this
function isn't worth arch-specific implementation.
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
v2: Rewrote commit message to explain existing races better.
Made local variables unsigned to avoid (technically undefined)
signed overflow.
v3: Typos fixed, Acked-by, expanded commit message.
v4: Rebase against -next; function has changed from
add_to_free_area_random() to shuffle_pick_tail. Move to
inline function in shuffle.h.
Not sure if it's okay to keep Acked-by: after such a
significant change.
mm/shuffle.c | 23 -----------------------
mm/shuffle.h | 26 +++++++++++++++++++++++++-
2 files changed, 25 insertions(+), 24 deletions(-)
diff --git a/mm/shuffle.c b/mm/shuffle.c
index 44406d9977c7..ea281d5e1f23 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -182,26 +182,3 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
shuffle_zone(z);
}
-
-bool shuffle_pick_tail(void)
-{
- static u64 rand;
- static u8 rand_bits;
- bool ret;
-
- /*
- * The lack of locking is deliberate. If 2 threads race to
- * update the rand state it just adds to the entropy.
- */
- if (rand_bits == 0) {
- rand_bits = 64;
- rand = get_random_u64();
- }
-
- ret = rand & 1;
-
- rand_bits--;
- rand >>= 1;
-
- return ret;
-}
diff --git a/mm/shuffle.h b/mm/shuffle.h
index 4d79f03b6658..fb79e05cd86d 100644
--- a/mm/shuffle.h
+++ b/mm/shuffle.h
@@ -22,7 +22,31 @@ enum mm_shuffle_ctl {
DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
extern void __shuffle_free_memory(pg_data_t *pgdat);
-extern bool shuffle_pick_tail(void);
+static inline bool shuffle_pick_tail(void)
+{
+ static unsigned long rand; /* buffered random bits */
+ unsigned long r = READ_ONCE(rand), rshift = r << 1;
+
+ /*
+ * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
+ * 1 bit, then zero-padding in the lsbits. This allows us to
+ * maintain the pre-generated bits and the count of bits in a
+ * single, atomically updatable, variable.
+ *
+ * The lack of locking is deliberate. If two threads race to
+ * update the rand state it just adds to the entropy. The
+ * worst that can happen is a random bit is used twice, or
+ * get_random_long is called redundantly.
+ */
+ if (unlikely(rshift == 0)) {
+ r = get_random_long();
+ rshift = r << 1 | 1;
+ }
+ WRITE_ONCE(rand, rshift);
+
+ return (long)r < 0;
+}
+
static inline void shuffle_free_memory(pg_data_t *pgdat)
{
if (!static_branch_unlikely(&page_alloc_shuffle_key))
base-commit: 47780d7892b77e922bbe19b5dea99cde06b2f0e5
--
2.26.0.rc2
* Re: [PATCH v4] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-19 12:05 ` [PATCH v4] " George Spelvin
@ 2020-03-19 17:49 ` Alexander Duyck
2020-03-20 17:58 ` Kees Cook
1 sibling, 0 replies; 25+ messages in thread
From: Alexander Duyck @ 2020-03-19 17:49 UTC
To: George Spelvin
Cc: Kees Cook, Dan Williams, linux-mm, Andrew Morton, Randy Dunlap
On Thu, Mar 19, 2020 at 5:05 AM George Spelvin <lkml@sdf.org> wrote:
>
> The separate "rand" and "rand_bits" variables could get out of
> sync, with bad results.
>
> In the worst case, two threads would see rand_bits=1 and both
> decrement it, wrapping the u8 counter around to rand_bits=255 and
> leaving rand filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updateable,
> variable. This makes it a lot easier to reason about race
> conditions. They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE. Because the random bit
> buffer is accessed by multiple threads concurrently without
> locking, omitting those puts us deep in the land of nasal demons.
> The compiler would be free to spill to the static variable in
> arbitrarily perverse ways and create hard-to-find bugs.
>
> (I'm torn between this and just declaring the buffer "volatile".
> Linux tends to prefer marking accesses rather than variables,
> but in this case, every access to the buffer is volatile.
> It makes no difference to the generated code.)
>
> Third, use long rather than u64. This not only keeps the state
> atomically updateable, it also speeds up the fast path on 32-bit
> machines. Saving at least three instructions on the fast path (one
> load, one add-with-carry, and one store) is worth a second call
> to get_random_u*() per 64 bits. The fast path of get_random_u*()
> costs less than the 3*64 = 192 instructions saved, and its slow
> path runs only once per 64 bytes of output, so it isn't affected
> by the change.
>
> Fourth, make the function inline. It's small, and there's only
> one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
> function call overhead.
>
> Fifth, use the msbits of the buffer first (left shift) rather
> than the lsbits (right shift). Testing the sign bit produces
> slightly smaller/faster code than testing the lsbit.
>
> I've tried shifting both ways, and copying the desired bit to a
> boolean before shifting rather than keeping separate full-width
> r and rshift variables, but both produce larger code:
>
> x86-64 text size
> Msbit 42236
> Lsbit 42242 (+6)
> Lsbit+bool 42258 (+22)
> Msbit+bool 42284 (+52)
>
> (Since this is straight-line code, size is a good proxy for number
> of instructions and execution time. Using READ/WRITE_ONCE instead of
> volatile makes no difference.)
>
> In a perfect world, on x86-64 the fast path would be:
> shlq rand(%rip)
> jz refill
> refill_complete:
> jc add_to_tail
>
> but I don't see how to get gcc to generate that, and this
> function isn't worth arch-specific implementation.
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Acked-by: Kees Cook <keescook@chromium.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
> v2: Rewrote commit message to explain existing races better.
> Made local variables unsigned to avoid (technically undefined)
> signed overflow.
> v3: Typos fixed, Acked-by, expanded commit message.
> v4: Rebase against -next; function has changed from
> add_to_free_area_random() to shuffle_pick_tail. Move to
> inline function in shuffle.h.
> Not sure if it's okay to keep Acked-by: after such a
> significant change.
>
> mm/shuffle.c | 23 -----------------------
> mm/shuffle.h | 26 +++++++++++++++++++++++++-
> 2 files changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index 44406d9977c7..ea281d5e1f23 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -182,26 +182,3 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> shuffle_zone(z);
> }
> -
> -bool shuffle_pick_tail(void)
> -{
> - static u64 rand;
> - static u8 rand_bits;
> - bool ret;
> -
> - /*
> - * The lack of locking is deliberate. If 2 threads race to
> - * update the rand state it just adds to the entropy.
> - */
> - if (rand_bits == 0) {
> - rand_bits = 64;
> - rand = get_random_u64();
> - }
> -
> - ret = rand & 1;
> -
> - rand_bits--;
> - rand >>= 1;
> -
> - return ret;
> -}
> diff --git a/mm/shuffle.h b/mm/shuffle.h
> index 4d79f03b6658..fb79e05cd86d 100644
> --- a/mm/shuffle.h
> +++ b/mm/shuffle.h
> @@ -22,7 +22,31 @@ enum mm_shuffle_ctl {
> DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
> extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
> extern void __shuffle_free_memory(pg_data_t *pgdat);
> -extern bool shuffle_pick_tail(void);
> +static inline bool shuffle_pick_tail(void)
> +{
> + static unsigned long rand; /* buffered random bits */
> + unsigned long r = READ_ONCE(rand), rshift = r << 1;
> +
> + /*
> + * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
> + * 1 bit, then zero-padding in the lsbits. This allows us to
> + * maintain the pre-generated bits and the count of bits in a
> + * single, atomically updatable, variable.
> + *
> + * The lack of locking is deliberate. If two threads race to
> + * update the rand state it just adds to the entropy. The
> + * worst that can happen is a random bit is used twice, or
> + * get_random_long is called redundantly.
> + */
> + if (unlikely(rshift == 0)) {
> + r = get_random_long();
> + rshift = r << 1 | 1;
> + }
> + WRITE_ONCE(rand, rshift);
> +
> + return (long)r < 0;
> +}
> +
> static inline void shuffle_free_memory(pg_data_t *pgdat)
> {
> if (!static_branch_unlikely(&page_alloc_shuffle_key))
>
> base-commit: 47780d7892b77e922bbe19b5dea99cde06b2f0e5
> --
> 2.26.0.rc2
* Re: [PATCH v4] mm/shuffle.c: Fix races in add_to_free_area_random()
2020-03-19 12:05 ` [PATCH v4] " George Spelvin
2020-03-19 17:49 ` Alexander Duyck
@ 2020-03-20 17:58 ` Kees Cook
1 sibling, 0 replies; 25+ messages in thread
From: Kees Cook @ 2020-03-20 17:58 UTC
To: George Spelvin
Cc: Dan Williams, linux-mm, Andrew Morton, Alexander Duyck, Randy Dunlap
On Thu, Mar 19, 2020 at 12:05:22PM +0000, George Spelvin wrote:
> The separate "rand" and "rand_bits" variables could get out of
> sync, with bad results.
>
> In the worst case, two threads would see rand_bits=1 and both
> decrement it, wrapping the u8 counter around to rand_bits=255 and
> leaving rand filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updateable,
> variable. This makes it a lot easier to reason about race
> conditions. They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE. Because the random bit
> buffer is accessed by multiple threads concurrently without
> locking, omitting those puts us deep in the land of nasal demons.
> The compiler would be free to spill to the static variable in
> arbitrarily perverse ways and create hard-to-find bugs.
>
> (I'm torn between this and just declaring the buffer "volatile".
> Linux tends to prefer marking accesses rather than variables,
> but in this case, every access to the buffer is volatile.
> It makes no difference to the generated code.)
>
> Third, use long rather than u64. This not only keeps the state
> atomically updateable, it also speeds up the fast path on 32-bit
> machines. Saving at least three instructions on the fast path (one
> load, one add-with-carry, and one store) is worth a second call
> to get_random_u*() per 64 bits. The fast path of get_random_u*()
> costs less than the 3*64 = 192 instructions saved, and its slow
> path runs only once per 64 bytes of output, so it isn't affected
> by the change.
>
> Fourth, make the function inline. It's small, and there's only
> one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
> function call overhead.
>
> Fifth, use the msbits of the buffer first (left shift) rather
> than the lsbits (right shift). Testing the sign bit produces
> slightly smaller/faster code than testing the lsbit.
>
> I've tried shifting both ways, and copying the desired bit to a
> boolean before shifting rather than keeping separate full-width
> r and rshift variables, but both produce larger code:
>
> x86-64 text size
> Msbit 42236
> Lsbit 42242 (+6)
> Lsbit+bool 42258 (+22)
> Msbit+bool 42284 (+52)
>
> (Since this is straight-line code, size is a good proxy for number
> of instructions and execution time. Using READ/WRITE_ONCE instead of
> volatile makes no difference.)
>
> In a perfect world, on x86-64 the fast path would be:
> shlq rand(%rip)
> jz refill
> refill_complete:
> jc add_to_tail
>
> but I don't see how to get gcc to generate that, and this
> function isn't worth arch-specific implementation.
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Acked-by: Kees Cook <keescook@chromium.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
> v2: Rewrote commit message to explain existing races better.
> Made local variables unsigned to avoid (technically undefined)
> signed overflow.
> v3: Typos fixed, Acked-by, expanded commit message.
> v4: Rebase against -next; function has changed from
> add_to_free_area_random() to shuffle_pick_tail. Move to
> inline function in shuffle.h.
> Not sure if it's okay to keep Acked-by: after such a
> significant change.
Good by me. Thanks!
-Kees
--
Kees Cook
Thread overview: 25+ messages
2020-03-17 13:50 [PATCH] mm/shuffle.c: optimize add_to_free_area_random() George Spelvin
2020-03-17 21:44 ` Kees Cook
2020-03-17 23:06 ` George Spelvin
2020-03-17 23:38 ` Kees Cook
2020-03-18 1:44 ` [PATCH v2] mm/shuffle.c: Fix races in add_to_free_area_random() George Spelvin
2020-03-18 1:49 ` Randy Dunlap
2020-03-18 3:53 ` Dan Williams
2020-03-18 8:20 ` George Spelvin
2020-03-18 17:36 ` Dan Williams
2020-03-18 19:29 ` George Spelvin
2020-03-18 19:40 ` Dan Williams
2020-03-18 21:02 ` George Spelvin
2020-03-18 3:58 ` Kees Cook
2020-03-18 15:26 ` Alexander Duyck
2020-03-18 18:35 ` George Spelvin
2020-03-18 19:17 ` Alexander Duyck
2020-03-18 20:06 ` George Spelvin
2020-03-18 20:39 ` [PATCH v3] " George Spelvin
2020-03-18 21:34 ` Alexander Duyck
2020-03-18 22:49 ` George Spelvin
2020-03-18 22:57 ` Dan Williams
2020-03-18 23:18 ` George Spelvin
2020-03-19 12:05 ` [PATCH v4] " George Spelvin
2020-03-19 17:49 ` Alexander Duyck
2020-03-20 17:58 ` Kees Cook