linux-kernel.vger.kernel.org archive mirror
* [RFC][PATCH] x86, pkeys: scalable pkey_set()/pkey_get()
@ 2016-07-07 23:09 Dave Hansen
  2016-07-08 10:35 ` Mel Gorman
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Hansen @ 2016-07-07 23:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, linux-api, linux-mm, x86, torvalds, akpm, mingo,
	mgorman, dave.hansen


This improves on the code I posted earlier, which you can find
here:

	http://git.kernel.org/cgit/linux/kernel/git/daveh/x86-pkeys.git/log/?h=pkeys-v039

More details on pkey_get/set() can be found here:

	https://www.sr71.net/~dave/intel/manpages/pkey_get.2.html

This is in response to some of Mel's comments regarding how
expensive it is to hold mmap_sem for write.

--

The pkey_set()/pkey_get() system calls are intended to be safer
replacements for the RDPKRU and WRPKRU instructions.  But, in
addition to what those instructions do, the syscalls also ensure
that the pkey being acted upon is allocated.

But, the "allocated" check is fundamentally racy.  By the time
the call returns, another thread could have pkey_free()'d the
key.  A valid return here means only that the key *was*
allocated/valid at some point during the syscall.

We do some pretty hard-core synchronization of
pkey_set()/pkey_get() with down_write(mm->mmap_sem) because it
also protects the allocation map that we need to query.  But,
mmap_sem doesn't really buy us anything other than a consistent
snapshot of the pkey allocation map.

So, get our snapshot another way.  Using WRITE_ONCE()/READ_ONCE()
we can ensure that we get a consistent view of this 2-byte value
for pkey_set()/pkey_get().  It might be stale by the time we act
on it, but that's OK because of the previously mentioned raciness
of pkey_set()/pkey_get(), which exists even when mmap_sem is helping out.
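In userspace terms, the idea is roughly this (a hypothetical sketch,
not the kernel code; the names here are illustrative, and a plain
volatile access stands in for READ_ONCE()/WRITE_ONCE()):

```c
#include <stdint.h>
#include <assert.h>

/*
 * Sketch of the approach: the 16-bit allocation map is only ever
 * published with a single aligned store and queried with a single
 * aligned load, so a reader can never observe a torn, half-updated
 * bitmap.  Concurrent *writers* are assumed to be excluded elsewhere
 * (in the kernel, by down_write(mmap_sem)).
 */
static inline void map_write_once(volatile uint16_t *map, uint16_t val)
{
	*map = val;		/* single aligned 16-bit store */
}

static inline uint16_t map_read_once(const volatile uint16_t *map)
{
	return *map;		/* single aligned 16-bit load */
}

static inline void map_set_pkey(volatile uint16_t *map, int pkey)
{
	map_write_once(map, map_read_once(map) | (uint16_t)(1U << pkey));
}

static inline void map_clear_pkey(volatile uint16_t *map, int pkey)
{
	map_write_once(map, map_read_once(map) & (uint16_t)~(1U << pkey));
}

static inline int map_pkey_is_allocated(const volatile uint16_t *map, int pkey)
{
	return !!(map_read_once(map) & (1U << pkey));
}
```

The snapshot a reader gets may be stale an instant later, but it is
always a value the map actually held at some point.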

Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
Cc: torvalds@linux-foundation.org
Cc: akpm@linux-foundation.org
Cc: mingo@kernel.org
Cc: mgorman@techsingularity.net
Cc: Dave Hansen (Intel) <dave.hansen@intel.com>

---

 b/arch/x86/include/asm/pkeys.h |   39 ++++++++++++++++++++++++++++++++++-----
 b/mm/mprotect.c                |    4 ----
 2 files changed, 34 insertions(+), 9 deletions(-)

diff -puN mm/mprotect.c~pkeys-119-fast-set-get mm/mprotect.c
--- a/mm/mprotect.c~pkeys-119-fast-set-get	2016-07-07 12:25:49.582075153 -0700
+++ b/mm/mprotect.c	2016-07-07 12:42:50.516384977 -0700
@@ -542,10 +542,8 @@ SYSCALL_DEFINE2(pkey_get, int, pkey, uns
 	if (flags)
 		return -EINVAL;
 
-	down_write(&current->mm->mmap_sem);
 	if (!mm_pkey_is_allocated(current->mm, pkey))
 		ret = -EBADF;
-	up_write(&current->mm->mmap_sem);
 
 	if (ret)
 		return ret;
@@ -563,10 +561,8 @@ SYSCALL_DEFINE3(pkey_set, int, pkey, uns
 	if (flags)
 		return -EINVAL;
 
-	down_write(&current->mm->mmap_sem);
 	if (!mm_pkey_is_allocated(current->mm, pkey))
 		ret = -EBADF;
-	up_write(&current->mm->mmap_sem);
 
 	if (ret)
 		return ret;
diff -puN arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get arch/x86/include/asm/pkeys.h
--- a/arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get	2016-07-07 12:26:19.265421712 -0700
+++ b/arch/x86/include/asm/pkeys.h	2016-07-07 15:18:15.391642423 -0700
@@ -35,18 +35,47 @@ extern int __arch_set_user_pkey_access(s
 
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
 
+#define PKEY_MAP_SET	1
+#define PKEY_MAP_CLEAR	2
 #define mm_pkey_allocation_map(mm)	(mm->context.pkey_allocation_map)
-#define mm_set_pkey_allocated(mm, pkey) do {		\
-	mm_pkey_allocation_map(mm) |= (1U << pkey);	\
+static inline
+void mm_modify_pkey_alloc_map(struct mm_struct *mm, int pkey, int setclear)
+{
+	u16 new_map = mm_pkey_allocation_map(mm);
+	if (setclear == PKEY_MAP_SET)
+		new_map |= (1U << pkey);
+	else if (setclear == PKEY_MAP_CLEAR)
+		new_map &= ~(1U << pkey);
+	else
+		BUILD_BUG_ON(1);
+	/*
+	 * Make sure that mm_pkey_is_allocated() callers never
+	 * see intermediate states by using WRITE_ONCE().
+	 * Concurrent calls to this function are excluded by
+	 * down_write(mm->mmap_sem) so we only need to protect
+	 * against readers.
+	 */
+	WRITE_ONCE(mm_pkey_allocation_map(mm), new_map);
+}
+#define mm_set_pkey_allocated(mm, pkey) do {			\
+	mm_modify_pkey_alloc_map(mm, pkey, PKEY_MAP_SET);	\
 } while (0)
-#define mm_set_pkey_free(mm, pkey) do {			\
-	mm_pkey_allocation_map(mm) &= ~(1U << pkey);	\
+#define mm_set_pkey_free(mm, pkey) do {				\
+	mm_modify_pkey_alloc_map(mm, pkey, PKEY_MAP_CLEAR);	\
 } while (0)
 
 static inline
 bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
-	return mm_pkey_allocation_map(mm) & (1U << pkey);
+	/*
+	 * Make sure we get a single, consistent view of the
+	 * allocation map.  This is racy, but that's OK since the
+	 * interfaces that depend on this are either already
+	 * fundamentally racy with respect to the allocation map
+	 * (pkey_get/set()) or called under
+	 * down_write(mm->mmap_sem).
+	 */
+	return READ_ONCE(mm_pkey_allocation_map(mm)) & (1U << pkey);
 }
 
 static inline
_


* Re: [RFC][PATCH] x86, pkeys: scalable pkey_set()/pkey_get()
  2016-07-07 23:09 [RFC][PATCH] x86, pkeys: scalable pkey_set()/pkey_get() Dave Hansen
@ 2016-07-08 10:35 ` Mel Gorman
  2016-07-08 14:54   ` Dave Hansen
  0 siblings, 1 reply; 4+ messages in thread
From: Mel Gorman @ 2016-07-08 10:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-api, linux-mm, x86, torvalds, akpm, mingo,
	dave.hansen

On Thu, Jul 07, 2016 at 04:09:22PM -0700, Dave Hansen wrote:
>  b/arch/x86/include/asm/pkeys.h |   39 ++++++++++++++++++++++++++++++++++-----
>  b/mm/mprotect.c                |    4 ----
>  2 files changed, 34 insertions(+), 9 deletions(-)
> 
> diff -puN mm/mprotect.c~pkeys-119-fast-set-get mm/mprotect.c
> --- a/mm/mprotect.c~pkeys-119-fast-set-get	2016-07-07 12:25:49.582075153 -0700
> +++ b/mm/mprotect.c	2016-07-07 12:42:50.516384977 -0700
> @@ -542,10 +542,8 @@ SYSCALL_DEFINE2(pkey_get, int, pkey, uns
>  	if (flags)
>  		return -EINVAL;
>  
> -	down_write(&current->mm->mmap_sem);
>  	if (!mm_pkey_is_allocated(current->mm, pkey))
>  		ret = -EBADF;
> -	up_write(&current->mm->mmap_sem);
>  
>  	if (ret)
>  		return ret;

This does allow the possibility of

thread a	thread b
pkey_get enter
		pkey_free
		pkey_alloc
pkey_get leave

The kernel can tell if the key is allocated but not if it is the same
allocation userspace expected or not. That's why I thought this may need
to be a sequence counter. Unfortunately, now I realise that even that is
insufficient: the seqcounter would only detect that something
changed; it would have no idea whether the pkey of interest was affected
or not. It gets rapidly messy after that.

Userspace may have no choice other than to serialise itself but the
documentation needs to be clear that the above race is possible.
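The race can be modelled in a few lines of plain C (a toy model with
made-up names, not the kernel implementation): a freed pkey index can
be immediately re-allocated, so "pkey is allocated" at pkey_get() time
does not prove it is the *same* allocation the caller obtained.

```c
#include <stdint.h>
#include <assert.h>

/* Toy 16-slot allocation map, mimicking pkey_alloc()/pkey_free(). */
static uint16_t alloc_map;

static int toy_pkey_alloc(void)
{
	int pkey;

	/* Hand out the lowest free index, like the real allocator. */
	for (pkey = 0; pkey < 16; pkey++) {
		if (!(alloc_map & (1U << pkey))) {
			alloc_map |= (1U << pkey);
			return pkey;
		}
	}
	return -1;
}

static void toy_pkey_free(int pkey)
{
	alloc_map &= ~(1U << pkey);
}

static int toy_pkey_is_allocated(int pkey)
{
	return !!(alloc_map & (1U << pkey));
}
```

A free-then-alloc pair returns the same index, so a concurrent
"is it allocated?" check passes even though it is a different
allocation; no kernel-side check can distinguish the two.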

> diff -puN arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get arch/x86/include/asm/pkeys.h
> --- a/arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get	2016-07-07 12:26:19.265421712 -0700
> +++ b/arch/x86/include/asm/pkeys.h	2016-07-07 15:18:15.391642423 -0700
> @@ -35,18 +35,47 @@ extern int __arch_set_user_pkey_access(s
>  
>  #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
>  
> +#define PKEY_MAP_SET	1
> +#define PKEY_MAP_CLEAR	2
>  #define mm_pkey_allocation_map(mm)	(mm->context.pkey_allocation_map)
> -#define mm_set_pkey_allocated(mm, pkey) do {		\
> -	mm_pkey_allocation_map(mm) |= (1U << pkey);	\
> +static inline
> +void mm_modify_pkey_alloc_map(struct mm_struct *mm, int pkey, int setclear)
> +{
> +	u16 new_map = mm_pkey_allocation_map(mm);
> +	if (setclear == PKEY_MAP_SET)
> +		new_map |= (1U << pkey);
> +	else if (setclear == PKEY_MAP_CLEAR)
> +		new_map &= ~(1U << pkey);
> +	else
> +		BUILD_BUG_ON(1);
> +	/*
> +	 * Make sure that mm_pkey_is_allocated() callers never
> +	 * see intermediate states by using WRITE_ONCE().
> +	 * Concurrent calls to this function are excluded by
> +	 * down_write(mm->mmap_sem) so we only need to protect
> +	 * against readers.
> +	 */
> +	WRITE_ONCE(mm_pkey_allocation_map(mm), new_map);
> +}

What prevents two pkey_set operations overwriting each others change with
WRITE_ONCE? Does this not need to be a cmpxchg read-modify-write loops?


-- 
Mel Gorman
SUSE Labs


* Re: [RFC][PATCH] x86, pkeys: scalable pkey_set()/pkey_get()
  2016-07-08 10:35 ` Mel Gorman
@ 2016-07-08 14:54   ` Dave Hansen
  2016-07-08 15:15     ` Mel Gorman
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Hansen @ 2016-07-08 14:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-api, linux-mm, x86, torvalds, akpm, mingo,
	dave.hansen

On 07/08/2016 03:35 AM, Mel Gorman wrote:
> On Thu, Jul 07, 2016 at 04:09:22PM -0700, Dave Hansen wrote:
>>  b/arch/x86/include/asm/pkeys.h |   39 ++++++++++++++++++++++++++++++++++-----
>>  b/mm/mprotect.c                |    4 ----
>>  2 files changed, 34 insertions(+), 9 deletions(-)
>>
>> diff -puN mm/mprotect.c~pkeys-119-fast-set-get mm/mprotect.c
>> --- a/mm/mprotect.c~pkeys-119-fast-set-get	2016-07-07 12:25:49.582075153 -0700
>> +++ b/mm/mprotect.c	2016-07-07 12:42:50.516384977 -0700
>> @@ -542,10 +542,8 @@ SYSCALL_DEFINE2(pkey_get, int, pkey, uns
>>  	if (flags)
>>  		return -EINVAL;
>>  
>> -	down_write(&current->mm->mmap_sem);
>>  	if (!mm_pkey_is_allocated(current->mm, pkey))
>>  		ret = -EBADF;
>> -	up_write(&current->mm->mmap_sem);
>>  
>>  	if (ret)
>>  		return ret;
> 
> This does allow the possibility of
> 
> thread a	thread b
> pkey_get enter
> 		pkey_free
> 		pkey_alloc
> pkey_get leave
> 
> The kernel can tell if the key is allocated but not if it is the same
> allocation userspace expected or not. That's why I thought this may need
> to be a sequence counter. Unfortunately, now I realise that even that is
> insufficient because the seqcounter would only detect that something
> changed, it would have no idea if the pkey of interest was affected or
> not. It gets rapidly messy after that.
> 
> Userspace may have no choice other than to serialise itself but the
> documentation needs to be clear that the above race is possible.

Yeah, I'll clarify the documentation.  But, I do think this is one of
those races like a stat().  A stat() tells you that a file was once
there with such-and-such properties, but it does not mean that it is there
any more or that what _is_ there is the same thing you stat()'d.
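The stat() analogy in code (an illustrative userspace sketch; the file
path and helper name are made up): a successful check only tells you
the file existed at check time, not that it still exists afterwards.

```c
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>

/*
 * Returns 1 if the path could be stat()'d at the moment of the call.
 * Like pkey_get()'s "is allocated" check, a 1 here is only a statement
 * about the past: the file may be gone (or replaced) by the time the
 * caller acts on the answer.
 */
static int file_existed_at_check(const char *path)
{
	struct stat sb;

	return stat(path, &sb) == 0;
}
```

This is the classic check-then-use window; the only cure is for the
application to serialise its own check and use, which is exactly the
burden the pkey documentation needs to spell out.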

>> diff -puN arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get arch/x86/include/asm/pkeys.h
>> --- a/arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get	2016-07-07 12:26:19.265421712 -0700
>> +++ b/arch/x86/include/asm/pkeys.h	2016-07-07 15:18:15.391642423 -0700
>> @@ -35,18 +35,47 @@ extern int __arch_set_user_pkey_access(s
>>  
>>  #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
>>  
>> +#define PKEY_MAP_SET	1
>> +#define PKEY_MAP_CLEAR	2
>>  #define mm_pkey_allocation_map(mm)	(mm->context.pkey_allocation_map)
>> -#define mm_set_pkey_allocated(mm, pkey) do {		\
>> -	mm_pkey_allocation_map(mm) |= (1U << pkey);	\
>> +static inline
>> +void mm_modify_pkey_alloc_map(struct mm_struct *mm, int pkey, int setclear)
>> +{
>> +	u16 new_map = mm_pkey_allocation_map(mm);
>> +	if (setclear == PKEY_MAP_SET)
>> +		new_map |= (1U << pkey);
>> +	else if (setclear == PKEY_MAP_CLEAR)
>> +		new_map &= ~(1U << pkey);
>> +	else
>> +		BUILD_BUG_ON(1);
>> +	/*
>> +	 * Make sure that mm_pkey_is_allocated() callers never
>> +	 * see intermediate states by using WRITE_ONCE().
>> +	 * Concurrent calls to this function are excluded by
>> +	 * down_write(mm->mmap_sem) so we only need to protect
>> +	 * against readers.
>> +	 */
>> +	WRITE_ONCE(mm_pkey_allocation_map(mm), new_map);
>> +}
> 
> What prevents two pkey_set operations overwriting each others change with
> WRITE_ONCE? Does this not need to be a cmpxchg read-modify-write loops?

pkey_set() only reads the allocation map and only writes to PKRU which
is thread-local.

The writers to the allocation map are pkey_alloc()/free() and those are
still mmap_sem-protected.


* Re: [RFC][PATCH] x86, pkeys: scalable pkey_set()/pkey_get()
  2016-07-08 14:54   ` Dave Hansen
@ 2016-07-08 15:15     ` Mel Gorman
  0 siblings, 0 replies; 4+ messages in thread
From: Mel Gorman @ 2016-07-08 15:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-api, linux-mm, x86, torvalds, akpm, mingo,
	dave.hansen

On Fri, Jul 08, 2016 at 07:54:36AM -0700, Dave Hansen wrote:
> > Userspace may have no choice other than to serialise itself but the
> > documentation needs to be clear that the above race is possible.
> 
> Yeah, I'll clarify the documentation.  But, I do think this is one of
> those races like a stat().  A stat() tells you that a file was once
> there with such-and-such properties, but it does not mean that it is there
> any more or that what _is_ there is the same thing you stat()'d.
> 

Thanks, that would be a perfect example to put in. My initial impression
was that pkeys would give all sorts of guarantees, when the reality is
a bit more relaxed and still depends on application correctness. While
that is admittedly due to my lack of familiarity with the specifics,
it's reasonable to assume that application developers may make the same
mistake unless the documentation is explicit.

> >> diff -puN arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get arch/x86/include/asm/pkeys.h
> >> --- a/arch/x86/include/asm/pkeys.h~pkeys-119-fast-set-get	2016-07-07 12:26:19.265421712 -0700
> >> +++ b/arch/x86/include/asm/pkeys.h	2016-07-07 15:18:15.391642423 -0700
> >> @@ -35,18 +35,47 @@ extern int __arch_set_user_pkey_access(s
> >>  
> >>  #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | VM_PKEY_BIT3)
> >>  
> >> +#define PKEY_MAP_SET	1
> >> +#define PKEY_MAP_CLEAR	2
> >>  #define mm_pkey_allocation_map(mm)	(mm->context.pkey_allocation_map)
> >> -#define mm_set_pkey_allocated(mm, pkey) do {		\
> >> -	mm_pkey_allocation_map(mm) |= (1U << pkey);	\
> >> +static inline
> >> +void mm_modify_pkey_alloc_map(struct mm_struct *mm, int pkey, int setclear)
> >> +{
> >> +	u16 new_map = mm_pkey_allocation_map(mm);
> >> +	if (setclear == PKEY_MAP_SET)
> >> +		new_map |= (1U << pkey);
> >> +	else if (setclear == PKEY_MAP_CLEAR)
> >> +		new_map &= ~(1U << pkey);
> >> +	else
> >> +		BUILD_BUG_ON(1);
> >> +	/*
> >> +	 * Make sure that mm_pkey_is_allocated() callers never
> >> +	 * see intermediate states by using WRITE_ONCE().
> >> +	 * Concurrent calls to this function are excluded by
> >> +	 * down_write(mm->mmap_sem) so we only need to protect
> >> +	 * against readers.
> >> +	 */
> >> +	WRITE_ONCE(mm_pkey_allocation_map(mm), new_map);
> >> +}
> > 
> > What prevents two pkey_set operations overwriting each others change with
> > WRITE_ONCE? Does this not need to be a cmpxchg read-modify-write loops?
> 
> pkey_set() only reads the allocation map and only writes to PKRU which
> is thread-local.
> 

My bad, thanks for the clarification.

It's ok to put my ack on the patches up to but not including the
pkey_[set|get] one, even with the fix on top. It's not a reviewed-by because
reviewed-by's for me are rare and only happen if I've actually tested them.

For pkey_[set|get], I'm still a little cagey until I know more about how
glibc intends to use them and I've still not 100% convinced myself they
are necessary even though I like the additional protection they give,
races or otherwise. I didn't read the documentation patch closely (I just
read your HTML version) and I didn't read the selftests at all so an ack
would be inappropriate. I'll read the documentation patch and probably ack
it if I see more details there about the potential races and why userspace
has to be careful.

-- 
Mel Gorman
SUSE Labs

