All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC]: Possible race condition in kernel futex code
@ 2015-10-05 15:11 Jaccon Bastiaansen
  2015-10-09  9:06 ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Jaccon Bastiaansen @ 2015-10-05 15:11 UTC (permalink / raw)
  To: Thomas Gleixner, x86, mingo, hpa, Peter Zijlstra; +Cc: linux-kernel, h.zuidam

Hello all,

For a while now, we see our application crashing occasionally (due to
an assert) in the middle of a pthread_mutex_lock() function call.

To be more precise, our application is threaded (50 threads) where all
threads share a custom memory allocator. This memory allocator is
guarded with a recursive, priority inheritance pthread mutex. The
application is running on a 8 core, 64 bit x86 system using the
3.12.42-rt58 kernel and glibc 2.13.

The assert occurs when the pthread_mutex_lock() function checks the
return value of the futex system call (see line 307 of the
pthread_mutex_lock() source code on
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
In case of the assert, the kernel has returned -EDEADLK which results
in a SIGABRT.

The strange thing is that the kernel returns -EDEADLK, because it
finds out that the calling thread is the owner of the mutex. But glibc
has already checked this before executing the futex system call (see
line 194 of http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
The case where a recursive mutex is locked again is fully handled in
glibc. No futex system call is required.

The kernel futex code finds out that the calling thread is the owner
of the mutex by reading the value of the __data.__lock field of the
pthread_mutex_t. This field is read by the get_futex_value_locked()
function (see kernel/futex.c). The __data.__lock field in a
pthread_mutex_t is written by an atomic 32bit compare exchange when
locking the mutex (see line 288 of
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
Reading the value by the kernel should be an atomic action.

Much to our surprise, the disassembly of the get_futex_value_locked()
function is:

ffffffff81098890 <get_futex_value_locked>:
ffffffff81098890:    55                       push   %rbp
ffffffff81098891:    48 89 e5                 mov    %rsp,%rbp
ffffffff81098894:    48 83 ec 10              sub    $0x10,%rsp
ffffffff81098898:    48 89 5d f0              mov    %rbx,-0x10(%rbp)
ffffffff8109889c:    48 89 fb                 mov    %rdi,%rbx
ffffffff8109889f:    4c 89 65 f8              mov    %r12,-0x8(%rbp)
ffffffff810988a3:    49 89 f4                 mov    %rsi,%r12
ffffffff810988a6:    e8 35 e9 06 00           callq  ffffffff811071e0
<pagefault_disable>
ffffffff810988ab:    48 89 df                 mov    %rbx,%rdi
ffffffff810988ae:    4c 89 e6                 mov    %r12,%rsi
ffffffff810988b1:    ba 04 00 00 00           mov    $0x4,%edx
ffffffff810988b6:    e8 85 f0 2b 00           callq  ffffffff81357940
<copy_user_generic_unrolled>
ffffffff810988bb:    89 c3                    mov    %eax,%ebx
ffffffff810988bd:    e8 fe e8 06 00           callq  ffffffff811071c0
<pagefault_enable>
ffffffff810988c2:    83 fb 01                 cmp    $0x1,%ebx
ffffffff810988c5:    4c 8b 65 f8              mov    -0x8(%rbp),%r12
ffffffff810988c9:    19 c0                    sbb    %eax,%eax
ffffffff810988cb:    48 8b 5d f0              mov    -0x10(%rbp),%rbx
ffffffff810988cf:    c9                       leaveq
ffffffff810988d0:    f7 d0                    not    %eax
ffffffff810988d2:    83 e0 f2                 and    $0xfffffff2,%eax
ffffffff810988d5:    c3                       retq
ffffffff810988d6:    66 2e 0f 1f 84 00 00     nopw   %cs:0x0(%rax,%rax,1)
ffffffff810988dd:    00 00 00

So the copy_user_generic_unrolled() function is called to read the
32bit __data.__lock field. The disassembly of the
copy_user_generic_unrolled()  function:


ffffffff81357940 <copy_user_generic_unrolled>:
ffffffff81357940:    83 fa 08                 cmp    $0x8,%edx
ffffffff81357943:    0f 82 8c 00 00 00        jb     ffffffff813579d5
<copy_user_generic_unrolled+0x95>
ffffffff81357949:    89 f9                    mov    %edi,%ecx
ffffffff8135794b:    83 e1 07                 and    $0x7,%ecx
ffffffff8135794e:    74 15                    je     ffffffff81357965
<copy_user_generic_unrolled+0x25>
ffffffff81357950:    83 e9 08                 sub    $0x8,%ecx
ffffffff81357953:    f7 d9                    neg    %ecx
ffffffff81357955:    29 ca                    sub    %ecx,%edx
ffffffff81357957:    8a 06                    mov    (%rsi),%al
ffffffff81357959:    88 07                    mov    %al,(%rdi)
ffffffff8135795b:    48 ff c6                 inc    %rsi
ffffffff8135795e:    48 ff c7                 inc    %rdi
ffffffff81357961:    ff c9                    dec    %ecx
ffffffff81357963:    75 f2                    jne    ffffffff81357957
<copy_user_generic_unrolled+0x17>
ffffffff81357965:    89 d1                    mov    %edx,%ecx
ffffffff81357967:    83 e2 3f                 and    $0x3f,%edx
ffffffff8135796a:    c1 e9 06                 shr    $0x6,%ecx
ffffffff8135796d:    74 4a                    je     ffffffff813579b9
<copy_user_generic_unrolled+0x79>
ffffffff8135796f:    4c 8b 06                 mov    (%rsi),%r8
ffffffff81357972:    4c 8b 4e 08              mov    0x8(%rsi),%r9
ffffffff81357976:    4c 8b 56 10              mov    0x10(%rsi),%r10
ffffffff8135797a:    4c 8b 5e 18              mov    0x18(%rsi),%r11
ffffffff8135797e:    4c 89 07                 mov    %r8,(%rdi)
ffffffff81357981:    4c 89 4f 08              mov    %r9,0x8(%rdi)
ffffffff81357985:    4c 89 57 10              mov    %r10,0x10(%rdi)
ffffffff81357989:    4c 89 5f 18              mov    %r11,0x18(%rdi)
ffffffff8135798d:    4c 8b 46 20              mov    0x20(%rsi),%r8
ffffffff81357991:    4c 8b 4e 28              mov    0x28(%rsi),%r9
ffffffff81357995:    4c 8b 56 30              mov    0x30(%rsi),%r10
ffffffff81357999:    4c 8b 5e 38              mov    0x38(%rsi),%r11
ffffffff8135799d:    4c 89 47 20              mov    %r8,0x20(%rdi)
ffffffff813579a1:    4c 89 4f 28              mov    %r9,0x28(%rdi)
ffffffff813579a5:    4c 89 57 30              mov    %r10,0x30(%rdi)
ffffffff813579a9:    4c 89 5f 38              mov    %r11,0x38(%rdi)
ffffffff813579ad:    48 8d 76 40              lea    0x40(%rsi),%rsi
ffffffff813579b1:    48 8d 7f 40              lea    0x40(%rdi),%rdi
ffffffff813579b5:    ff c9                    dec    %ecx
ffffffff813579b7:    75 b6                    jne    ffffffff8135796f
<copy_user_generic_unrolled+0x2f>
ffffffff813579b9:    89 d1                    mov    %edx,%ecx
ffffffff813579bb:    83 e2 07                 and    $0x7,%edx
ffffffff813579be:    c1 e9 03                 shr    $0x3,%ecx
ffffffff813579c1:    74 12                    je     ffffffff813579d5
<copy_user_generic_unrolled+0x95>
ffffffff813579c3:    4c 8b 06                 mov    (%rsi),%r8
ffffffff813579c6:    4c 89 07                 mov    %r8,(%rdi)
ffffffff813579c9:    48 8d 76 08              lea    0x8(%rsi),%rsi
ffffffff813579cd:    48 8d 7f 08              lea    0x8(%rdi),%rdi
ffffffff813579d1:    ff c9                    dec    %ecx
ffffffff813579d3:    75 ee                    jne    ffffffff813579c3
<copy_user_generic_unrolled+0x83>
ffffffff813579d5:    21 d2                    and    %edx,%edx
ffffffff813579d7:    74 10                    je     ffffffff813579e9
<copy_user_generic_unrolled+0xa9>
ffffffff813579d9:    89 d1                    mov    %edx,%ecx
ffffffff813579db:    8a 06                    mov    (%rsi),%al
ffffffff813579dd:    88 07                    mov    %al,(%rdi)
ffffffff813579df:    48 ff c6                 inc    %rsi
ffffffff813579e2:    48 ff c7                 inc    %rdi
ffffffff813579e5:    ff c9                    dec    %ecx
ffffffff813579e7:    75 f2                    jne    ffffffff813579db
<copy_user_generic_unrolled+0x9b>
ffffffff813579e9:    31 c0                    xor    %eax,%eax
ffffffff813579eb:    c3                       retq
ffffffff813579ec:    0f 1f 40 00              nopl   0x0(%rax)

shows that that for byte count values less than 8 (the check done at
address ffffffff81357940) , a byte copy loop is used (starting at
address ffffffff813579d5).


This means that we have 2 problems here:
1) The __data.__lock field is written by glibc with an atomic 32 bit
compare exchange, but read by the kernel with 4 separate byte
accesses. This violates the following statement on page 8-4 of the
Intel Software Developer's Manual V3
(http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html):

"Software should access semaphores (shared memory used for signalling
between multiple processors) using identical addresses and operand
lengths. For example, if one processor accesses a semaphore using a
word access, other processors should not access the semaphore using a
byte access."

2) The following race condition is possible:
- Thread x has locked the mutex guarding the memory allocator.
- Another thread y (possibly running on another CPU) attempts to lock
this mutex. Glibc first checks whether thread y has already locked
this mutex, which isn't the case.
- Thread y attempts to lock the mutex by the atomic 32 bit compare
exchange (on line 288 of
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
This compare exchange fails, because thread x still has the mutex
locked.
- Thread y now starts the futex system call. The kernel starts reading
the __data.__lock field of the mutex with 4 separate byte reads.
- After thread y has read the first byte and before it has read the
second byte, thread x releases the mutex with a 32bit atomic compare
exchange (see line 217 of
http://osxr.org/glibc/source/nptl/pthread_mutex_unlock.c?v=glibc-2.13).
- Thread y now continues reading the __data.__lock field and ends up
with an inconsistent __data.__lock field value (the least significant
byte corresponds to the __data.__lock value before thread y released
the mutex, while the other bytes correspond to the __data.__lock value
after thread y has released the mutex).
- This inconsistent value can result in an incorrect decision to
return -EDEADLK and maybe other problems.

This problem is caused by the fact that, for some reason, the compiler
decided to use the copy_user_generic_unrolled() to read a 32bit value
instead of using a 32bit read instruction which is atomic on an x86.
The kernel __copy_from_user_inatomic code is written in such a way the
compiler can figure out it should use a 32bit read, but in this case
it fails to figure this out.

We did some tests with different compilers, kernel versions and kernel
configs, with the following results:

Linux 3.12.48, x86_64_defconfig, GCC 4.6.1 :
copy_user_generic_unrolled being used, so race condition possible
Linux 3.12.48, x86_64_defconfig, GCC 4.9.1 :
copy_user_generic_unrolled being used, so race condition possible
Linux 4.2.3, x86_64_defconfig, GCC 4.6.1 : 32 bit read being used, no
race condition
Linux 4.2.3, x86_64_defconfig, GCC 4.9.1 : 32 bit read being used, no
race condition


Our idea to fix this problem is use an explicit 32 bit read in
get_futex_value_locked() instead of using the generic function
copy_from_user_inatomic() and hoping the compiler uses an atomic
access and the right access size.

What is your opinion on this?

Regards,
  Jaccon Bastiaansen

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Possible race condition in kernel futex code
  2015-10-05 15:11 [RFC]: Possible race condition in kernel futex code Jaccon Bastiaansen
@ 2015-10-09  9:06 ` Thomas Gleixner
  2015-10-09  9:49   ` Hans Zuidam
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Thomas Gleixner @ 2015-10-09  9:06 UTC (permalink / raw)
  To: Jaccon Bastiaansen
  Cc: x86, mingo, H. Peter Anvin, Peter Zijlstra, linux-kernel,
	h.zuidam, stable

On Mon, 5 Oct 2015, Jaccon Bastiaansen wrote:
> We did some tests with different compilers, kernel versions and kernel
> configs, with the following results:
> 
> Linux 3.12.48, x86_64_defconfig, GCC 4.6.1 :
> copy_user_generic_unrolled being used, so race condition possible
> Linux 3.12.48, x86_64_defconfig, GCC 4.9.1 :
> copy_user_generic_unrolled being used, so race condition possible
> Linux 4.2.3, x86_64_defconfig, GCC 4.6.1 : 32 bit read being used, no
> race condition
> Linux 4.2.3, x86_64_defconfig, GCC 4.9.1 : 32 bit read being used, no
> race condition
> 
> 
> Our idea to fix this problem is use an explicit 32 bit read in
> get_futex_value_locked() instead of using the generic function
> copy_from_user_inatomic() and hoping the compiler uses an atomic
> access and the right access size.

You cannot use an explicit 32bit read. We need an access which handles
the fault gracefully.

In current mainline this is done proper:

ret = __copy_from_user_inatomic(dst, src, size = sizeof(u32))

        __copy_from_user_nocheck(dst, src, size)

    	       if (!__builtin_constant_p(size))
                     return copy_user_generic(dst, (__force void *)src, size);
	
	       size is constant so we end up in the switch case

	       switch(size) {
	       
	       case 4:
	       	    __get_user_asm(*(u32 *)dst, (u32 __user *)src,
		     		   ret, "l", "k", "=r", 4);
		    return ret;
....

In 3.12 this is different:

   __copy_from_user_inatomic()
	copy_user_generic()
	    copy_user_generic_unrolled()

So this is only an issue for kernel versions < 3.13. It was fixed with

ff47ab4ff3cd: Add 1/2/4/8 byte optimization to 64bit __copy_{from,to}_user_inatomic

but nobody noticed that the race you described can happen, so it was
never backported to the stable kernels.

@stable: Can you please pick up ff47ab4ff3cd plus 

df90ca969035d x86, sparse: Do not force removal of __user when calling copy_to/from_user_nocheck()

for stable kernels <= 3.12?

If that's too much of churn, then I can come up with an explicit fix
for this. Let me know.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Possible race condition in kernel futex code
  2015-10-09  9:06 ` Thomas Gleixner
@ 2015-10-09  9:49   ` Hans Zuidam
  2015-10-09 10:25     ` Thomas Gleixner
  2015-10-18  0:23   ` Greg KH
  2016-05-15 22:34   ` Ben Hutchings
  2 siblings, 1 reply; 7+ messages in thread
From: Hans Zuidam @ 2015-10-09  9:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jaccon Bastiaansen, x86, mingo, H. Peter Anvin, Peter Zijlstra,
	linux-kernel, stable

Hi Thomas,

On 9 okt. 2015, at 11:06, Thomas Gleixner <tglx@linutronix.de> wrote:
On Mon, 5 Oct 2015, Jaccon Bastiaansen wrote:
>> We did some tests with different compilers, kernel versions and kernel
>> configs, with the following results:

> You cannot use an explicit 32bit read. We need an access which handles the fault gracefully.

The reason for the explicit read suggestion is to avoid the _builtin_constant_p() in __copy_from_user_nocheck().  The GCC manual says that there may be situations where it returns 0 even though the argument is a constant.  Although none of the compiler/kernel combinations we have tried showed this happening, we think it is probably better to be safe than sorry.

With kind regards,
Hans Zuidam


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Possible race condition in kernel futex code
  2015-10-09  9:49   ` Hans Zuidam
@ 2015-10-09 10:25     ` Thomas Gleixner
  2015-10-09 11:35       ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2015-10-09 10:25 UTC (permalink / raw)
  To: Hans Zuidam
  Cc: Jaccon Bastiaansen, x86, mingo, H. Peter Anvin, Peter Zijlstra,
	linux-kernel, stable, Linus Torvalds

Hans,

On Fri, 9 Oct 2015, Hans Zuidam wrote:
> On 9 okt. 2015, at 11:06, Thomas Gleixner <tglx@linutronix.de> wrote:
> > You cannot use an explicit 32bit read. We need an access which
> > handles the fault gracefully.
>
> The reason for the explicit read suggestion is to avoid the
> _builtin_constant_p() in __copy_from_user_nocheck().  The GCC manual
> says that there may be situations where it returns 0 even though the
> argument is a constant.

That's insane at best.

> Although none of the compiler/kernel combinations we have tried
> showed this happening, we think it is probably better to be safe
> than sorry.

So we would need something like:

   futex_copy_from_user()

which can be mapped to __copy_from_user_inatomic() first. Then go
through all architectures and the asm-generic stuff and provide the
specific variants which are guaranteed to use a 32bit access.

I really prefer that we don't have to do that and the compiler people
get their act together and fix that _builtin_constant_p() thingy.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Possible race condition in kernel futex code
  2015-10-09 10:25     ` Thomas Gleixner
@ 2015-10-09 11:35       ` Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2015-10-09 11:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Hans Zuidam, Jaccon Bastiaansen, x86, mingo, H. Peter Anvin,
	linux-kernel, stable, Linus Torvalds

On Fri, Oct 09, 2015 at 11:25:09AM +0100, Thomas Gleixner wrote:
> Hans,
> 
> On Fri, 9 Oct 2015, Hans Zuidam wrote:
> > On 9 okt. 2015, at 11:06, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > You cannot use an explicit 32bit read. We need an access which
> > > handles the fault gracefully.
> >
> > The reason for the explicit read suggestion is to avoid the
> > _builtin_constant_p() in __copy_from_user_nocheck().  The GCC manual
> > says that there may be situations where it returns 0 even though the
> > argument is a constant.
> 
> That's insane at best.

Right, but I bet that is for cases where constant propagation completely
fails, and this is a trivial case, I have no problem relying on it.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Possible race condition in kernel futex code
  2015-10-09  9:06 ` Thomas Gleixner
  2015-10-09  9:49   ` Hans Zuidam
@ 2015-10-18  0:23   ` Greg KH
  2016-05-15 22:34   ` Ben Hutchings
  2 siblings, 0 replies; 7+ messages in thread
From: Greg KH @ 2015-10-18  0:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jaccon Bastiaansen, x86, mingo, H. Peter Anvin, Peter Zijlstra,
	linux-kernel, h.zuidam, stable

On Fri, Oct 09, 2015 at 10:06:41AM +0100, Thomas Gleixner wrote:
> On Mon, 5 Oct 2015, Jaccon Bastiaansen wrote:
> > We did some tests with different compilers, kernel versions and kernel
> > configs, with the following results:
> > 
> > Linux 3.12.48, x86_64_defconfig, GCC 4.6.1 :
> > copy_user_generic_unrolled being used, so race condition possible
> > Linux 3.12.48, x86_64_defconfig, GCC 4.9.1 :
> > copy_user_generic_unrolled being used, so race condition possible
> > Linux 4.2.3, x86_64_defconfig, GCC 4.6.1 : 32 bit read being used, no
> > race condition
> > Linux 4.2.3, x86_64_defconfig, GCC 4.9.1 : 32 bit read being used, no
> > race condition
> > 
> > 
> > Our idea to fix this problem is use an explicit 32 bit read in
> > get_futex_value_locked() instead of using the generic function
> > copy_from_user_inatomic() and hoping the compiler uses an atomic
> > access and the right access size.
> 
> You cannot use an explicit 32bit read. We need an access which handles
> the fault gracefully.
> 
> In current mainline this is done proper:
> 
> ret = __copy_from_user_inatomic(dst, src, size = sizeof(u32))
> 
>         __copy_from_user_nocheck(dst, src, size)
> 
>     	       if (!__builtin_constant_p(size))
>                      return copy_user_generic(dst, (__force void *)src, size);
> 	
> 	       size is constant so we end up in the switch case
> 
> 	       switch(size) {
> 	       
> 	       case 4:
> 	       	    __get_user_asm(*(u32 *)dst, (u32 __user *)src,
> 		     		   ret, "l", "k", "=r", 4);
> 		    return ret;
> ....
> 
> In 3.12 this is different:
> 
>    __copy_from_user_inatomic()
> 	copy_user_generic()
> 	    copy_user_generic_unrolled()
> 
> So this is only an issue for kernel versions < 3.13. It was fixed with
> 
> ff47ab4ff3cd: Add 1/2/4/8 byte optimization to 64bit __copy_{from,to}_user_inatomic
> 
> but nobody noticed that the race you described can happen, so it was
> never backported to the stable kernels.
> 
> @stable: Can you please pick up ff47ab4ff3cd plus 
> 
> df90ca969035d x86, sparse: Do not force removal of __user when calling copy_to/from_user_nocheck()
> 
> for stable kernels <= 3.12?
> 
> If that's too much of churn, then I can come up with an explicit fix
> for this. Let me know.

Now applied to 3.10-stable, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Possible race condition in kernel futex code
  2015-10-09  9:06 ` Thomas Gleixner
  2015-10-09  9:49   ` Hans Zuidam
  2015-10-18  0:23   ` Greg KH
@ 2016-05-15 22:34   ` Ben Hutchings
  2 siblings, 0 replies; 7+ messages in thread
From: Ben Hutchings @ 2016-05-15 22:34 UTC (permalink / raw)
  To: Thomas Gleixner, Jaccon Bastiaansen
  Cc: x86, mingo, H. Peter Anvin, Peter Zijlstra, linux-kernel,
	h.zuidam, stable

[-- Attachment #1: Type: text/plain, Size: 419 bytes --]

On Fri, 2015-10-09 at 10:06 +0100, Thomas Gleixner wrote:
[...]
> @stable: Can you please pick up ff47ab4ff3cd plus 
> 
> df90ca969035d x86, sparse: Do not force removal of __user when calling copy_to/from_user_nocheck()
> 
> for stable kernels <= 3.12?
[...]

I've finally queued these up for 3.2, thanks.

Ben.

-- 
Ben Hutchings
For every action, there is an equal and opposite criticism. - Harrison

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-05-15 22:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-05 15:11 [RFC]: Possible race condition in kernel futex code Jaccon Bastiaansen
2015-10-09  9:06 ` Thomas Gleixner
2015-10-09  9:49   ` Hans Zuidam
2015-10-09 10:25     ` Thomas Gleixner
2015-10-09 11:35       ` Peter Zijlstra
2015-10-18  0:23   ` Greg KH
2016-05-15 22:34   ` Ben Hutchings

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.