From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752293AbbJEPZ7 (ORCPT ); Mon, 5 Oct 2015 11:25:59 -0400 Received: from mail-vk0-f67.google.com ([209.85.213.67]:34353 "EHLO mail-vk0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752190AbbJEPZ4 (ORCPT ); Mon, 5 Oct 2015 11:25:56 -0400 MIME-Version: 1.0 Date: Mon, 5 Oct 2015 17:11:40 +0200 Message-ID: Subject: [RFC]: Possible race condition in kernel futex code From: Jaccon Bastiaansen To: Thomas Gleixner , x86@kernel.org, mingo@redhat.com, hpa@zytor.com, Peter Zijlstra Cc: "linux-kernel@vger.kernel.org" , h.zuidam@computer.org Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello all, For a while now, we see our application crashing occasionally (due to an assert) in the middle of a pthread_mutex_lock() function call. To be more precise, our application is threaded (50 threads) where all threads share a custom memory allocator. This memory allocator is guarded with a recursive, priority inheritance pthread mutex. The application is running on a 8 core, 64 bit x86 system using the 3.12.42-rt58 kernel and glibc 2.13. The assert occurs when the pthread_mutex_lock() function checks the return value of the futex system call (see line 307 of the pthread_mutex_lock() source code on http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13). In case of the assert, the kernel has returned -EDEADLK which results in a SIGABRT. The strange thing is that the kernel returns -EDEADLK, because it finds out that the calling thread is the owner of the mutex. But glibc has already checked this before executing the futex system call (see line 194 of http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13). The case where a recursive mutex is locked again is fully handled in glibc. No futex system call is required. The kernel futex code finds out that the calling thread is the owner of the mutex by reading the value of the __data.__lock field of the pthread_mutex_t. This field is read by the get_futex_value_locked() function (see kernel/futex.c). The __data.__lock field in a pthread_mutex_t is written by an atomic 32bit compare exchange when locking the mutex (see line 288 of http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13). Reading the value by the kernel should be an atomic action. Much to our surprise, the disassembly of the get_futex_value_locked() function is: ffffffff81098890 : ffffffff81098890: 55 push %rbp ffffffff81098891: 48 89 e5 mov %rsp,%rbp ffffffff81098894: 48 83 ec 10 sub $0x10,%rsp ffffffff81098898: 48 89 5d f0 mov %rbx,-0x10(%rbp) ffffffff8109889c: 48 89 fb mov %rdi,%rbx ffffffff8109889f: 4c 89 65 f8 mov %r12,-0x8(%rbp) ffffffff810988a3: 49 89 f4 mov %rsi,%r12 ffffffff810988a6: e8 35 e9 06 00 callq ffffffff811071e0 ffffffff810988ab: 48 89 df mov %rbx,%rdi ffffffff810988ae: 4c 89 e6 mov %r12,%rsi ffffffff810988b1: ba 04 00 00 00 mov $0x4,%edx ffffffff810988b6: e8 85 f0 2b 00 callq ffffffff81357940 ffffffff810988bb: 89 c3 mov %eax,%ebx ffffffff810988bd: e8 fe e8 06 00 callq ffffffff811071c0 ffffffff810988c2: 83 fb 01 cmp $0x1,%ebx ffffffff810988c5: 4c 8b 65 f8 mov -0x8(%rbp),%r12 ffffffff810988c9: 19 c0 sbb %eax,%eax ffffffff810988cb: 48 8b 5d f0 mov -0x10(%rbp),%rbx ffffffff810988cf: c9 leaveq ffffffff810988d0: f7 d0 not %eax ffffffff810988d2: 83 e0 f2 and $0xfffffff2,%eax ffffffff810988d5: c3 retq ffffffff810988d6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1) ffffffff810988dd: 00 00 00 So the copy_user_generic_unrolled() function is called to read the 32bit __data.__lock field. The disassembly of the copy_user_generic_unrolled() function: ffffffff81357940 : ffffffff81357940: 83 fa 08 cmp $0x8,%edx ffffffff81357943: 0f 82 8c 00 00 00 jb ffffffff813579d5 ffffffff81357949: 89 f9 mov %edi,%ecx ffffffff8135794b: 83 e1 07 and $0x7,%ecx ffffffff8135794e: 74 15 je ffffffff81357965 ffffffff81357950: 83 e9 08 sub $0x8,%ecx ffffffff81357953: f7 d9 neg %ecx ffffffff81357955: 29 ca sub %ecx,%edx ffffffff81357957: 8a 06 mov (%rsi),%al ffffffff81357959: 88 07 mov %al,(%rdi) ffffffff8135795b: 48 ff c6 inc %rsi ffffffff8135795e: 48 ff c7 inc %rdi ffffffff81357961: ff c9 dec %ecx ffffffff81357963: 75 f2 jne ffffffff81357957 ffffffff81357965: 89 d1 mov %edx,%ecx ffffffff81357967: 83 e2 3f and $0x3f,%edx ffffffff8135796a: c1 e9 06 shr $0x6,%ecx ffffffff8135796d: 74 4a je ffffffff813579b9 ffffffff8135796f: 4c 8b 06 mov (%rsi),%r8 ffffffff81357972: 4c 8b 4e 08 mov 0x8(%rsi),%r9 ffffffff81357976: 4c 8b 56 10 mov 0x10(%rsi),%r10 ffffffff8135797a: 4c 8b 5e 18 mov 0x18(%rsi),%r11 ffffffff8135797e: 4c 89 07 mov %r8,(%rdi) ffffffff81357981: 4c 89 4f 08 mov %r9,0x8(%rdi) ffffffff81357985: 4c 89 57 10 mov %r10,0x10(%rdi) ffffffff81357989: 4c 89 5f 18 mov %r11,0x18(%rdi) ffffffff8135798d: 4c 8b 46 20 mov 0x20(%rsi),%r8 ffffffff81357991: 4c 8b 4e 28 mov 0x28(%rsi),%r9 ffffffff81357995: 4c 8b 56 30 mov 0x30(%rsi),%r10 ffffffff81357999: 4c 8b 5e 38 mov 0x38(%rsi),%r11 ffffffff8135799d: 4c 89 47 20 mov %r8,0x20(%rdi) ffffffff813579a1: 4c 89 4f 28 mov %r9,0x28(%rdi) ffffffff813579a5: 4c 89 57 30 mov %r10,0x30(%rdi) ffffffff813579a9: 4c 89 5f 38 mov %r11,0x38(%rdi) ffffffff813579ad: 48 8d 76 40 lea 0x40(%rsi),%rsi ffffffff813579b1: 48 8d 7f 40 lea 0x40(%rdi),%rdi ffffffff813579b5: ff c9 dec %ecx ffffffff813579b7: 75 b6 jne ffffffff8135796f ffffffff813579b9: 89 d1 mov %edx,%ecx ffffffff813579bb: 83 e2 07 and $0x7,%edx ffffffff813579be: c1 e9 03 shr $0x3,%ecx ffffffff813579c1: 74 12 je ffffffff813579d5 ffffffff813579c3: 4c 8b 06 mov (%rsi),%r8 ffffffff813579c6: 4c 89 07 mov %r8,(%rdi) ffffffff813579c9: 48 8d 76 08 lea 0x8(%rsi),%rsi ffffffff813579cd: 48 8d 7f 08 lea 0x8(%rdi),%rdi ffffffff813579d1: ff c9 dec %ecx ffffffff813579d3: 75 ee jne ffffffff813579c3 ffffffff813579d5: 21 d2 and %edx,%edx ffffffff813579d7: 74 10 je ffffffff813579e9 ffffffff813579d9: 89 d1 mov %edx,%ecx ffffffff813579db: 8a 06 mov (%rsi),%al ffffffff813579dd: 88 07 mov %al,(%rdi) ffffffff813579df: 48 ff c6 inc %rsi ffffffff813579e2: 48 ff c7 inc %rdi ffffffff813579e5: ff c9 dec %ecx ffffffff813579e7: 75 f2 jne ffffffff813579db ffffffff813579e9: 31 c0 xor %eax,%eax ffffffff813579eb: c3 retq ffffffff813579ec: 0f 1f 40 00 nopl 0x0(%rax) shows that that for byte count values less than 8 (the check done at address ffffffff81357940) , a byte copy loop is used (starting at address ffffffff813579d5). This means that we have 2 problems here: 1) The __data.__lock field is written by glibc with an atomic 32 bit compare exchange, but read by the kernel with 4 separate byte accesses. This violates the following statement on page 8-4 of the Intel Software Developer's Manual V3 (http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html): "Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access." 2) The following race condition is possible: - Thread x has locked the mutex guarding the memory allocator. - Another thread y (possibly running on another CPU) attempts to lock this mutex. Glibc first checks whether thread y has already locked this mutex, which isn't the case. - Thread y attempts to lock the mutex by the atomic 32 bit compare exchange (on line 288 of http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13). This compare exchange fails, because thread x still has the mutex locked. - Thread y now starts the futex system call. The kernel starts reading the __data.__lock field of the mutex with 4 separate byte reads. - After thread y has read the first byte and before it has read the second byte, thread x releases the mutex with a 32bit atomic compare exchange (see line 217 of http://osxr.org/glibc/source/nptl/pthread_mutex_unlock.c?v=glibc-2.13). - Thread y now continues reading the __data.__lock field and ends up with an inconsistent __data.__lock field value (the least significant byte corresponds to the __data.__lock value before thread y released the mutex, while the other bytes correspond to the __data.__lock value after thread y has released the mutex). - This inconsistent value can result in an incorrect decision to return -EDEADLK and maybe other problems. This problem is caused by the fact that, for some reason, the compiler decided to use the copy_user_generic_unrolled() to read a 32bit value instead of using a 32bit read instruction which is atomic on an x86. The kernel __copy_from_user_inatomic code is written in such a way the compiler can figure out it should use a 32bit read, but in this case it fails to figure this out. We did some tests with different compilers, kernel versions and kernel configs, with the following results: Linux 3.12.48, x86_64_defconfig, GCC 4.6.1 : copy_user_generic_unrolled being used, so race condition possible Linux 3.12.48, x86_64_defconfig, GCC 4.9.1 : copy_user_generic_unrolled being used, so race condition possible Linux 4.2.3, x86_64_defconfig, GCC 4.6.1 : 32 bit read being used, no race condition Linux 4.2.3, x86_64_defconfig, GCC 4.9.1 : 32 bit read being used, no race condition Our idea to fix this problem is use an explicit 32 bit read in get_futex_value_locked() instead of using the generic function copy_from_user_inatomic() and hoping the compiler uses an atomic access and the right access size. What is your opinion on this? Regards, Jaccon Bastiaansen