Date: Fri, 10 Apr 2015 11:00:51 +0200
From: Ingo Molnar
To: "Paul E. McKenney"
Cc: Linus Torvalds, Jason Low, Peter Zijlstra, Davidlohr Bueso,
    Tim Chen, Aswin Chandramouleeswaran, LKML
Subject: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock
Message-ID: <20150410090051.GA28549@gmail.com>
In-Reply-To: <20150409183926.GM6464@linux.vnet.ibm.com>
References: <1428521960-5268-1-git-send-email-jason.low2@hp.com>
 <1428521960-5268-3-git-send-email-jason.low2@hp.com>
 <20150409053725.GB13871@gmail.com>
 <1428561611.3506.78.camel@j-VirtualBox>
 <20150409075311.GA4645@gmail.com>
 <20150409175652.GI6464@linux.vnet.ibm.com>
 <20150409183926.GM6464@linux.vnet.ibm.com>

* Paul E. McKenney wrote:

> > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about
> > performance *at*all*. We will have worse performance problems than
> > doing some RCU read-locking inside the loop.
> >
> > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care
> > about locking, since at worst we just access stale memory for one
> > iteration.
>
> But if we are running on a hypervisor, mightn't our VCPU be
> preempted just before accessing ->on_cpu, the task exit and its
> structures be freed and unmapped?  Or is the task structure in
> memory that is never unmapped?  (If the latter, clearly not a
> problem.)

kmalloc()able kernel memory is never unmapped in that fashion [*].
Even hotplug memory is based on limiting what gets allocated in that
area and never putting critical kernel data structures there.

Personally I'd be more comfortable with having a special primitive
for this that is DEBUG_PAGEALLOC aware (Linus's first suggestion), so
that we don't use different RCU primitives in the rare case someone
tests CONFIG_DEBUG_PAGEALLOC=y ...

We even have such a primitive: __copy_from_user_inatomic(). It
compiles to a single instruction for integer types on x86.
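( Side note: the kernel already wraps this primitive for exactly this
  kind of non-faulting read of kernel memory: probe_kernel_read() in
  mm/maccess.c. Sketched from memory, so details may differ a bit
  between versions:

	long __probe_kernel_read(void *dst, const void *src, size_t size)
	{
		long ret;
		mm_segment_t old_fs = get_fs();

		/* Allow the 'user' copy primitive to touch kernel addresses: */
		set_fs(KERNEL_DS);

		/*
		 * Make sure a fault on an unmapped page - e.g. one that
		 * DEBUG_PAGEALLOC unmapped at free time - goes through
		 * the exception fixup path and yields -EFAULT instead
		 * of an oops:
		 */
		pagefault_disable();
		ret = __copy_from_user_inatomic(dst,
				(__force const void __user *)src, size);
		pagefault_enable();

		set_fs(old_fs);

		return ret ? -EFAULT : 0;
	}

  The patch below uses the bare primitive directly in the spin loop
  and simply ignores failed copies. )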
I've attached a patch below that implements this for regular mutexes
(the xadd-based rwsem code can be converted the same way - see the
sketch after the patch), and it all compiles into a rather sweet,
compact routine:

0000000000000030 <mutex_spin_on_owner>:
  30:	48 3b 37             	cmp    (%rdi),%rsi
  33:	48 8d 4e 28          	lea    0x28(%rsi),%rcx
  37:	75 4e                	jne    87 <mutex_spin_on_owner+0x57>
  39:	55                   	push   %rbp
  3a:	45 31 c0             	xor    %r8d,%r8d
  3d:	65 4c 8b 0c 25 00 00 	mov    %gs:0x0,%r9
  44:	00 00
  46:	48 89 e5             	mov    %rsp,%rbp
  49:	48 83 ec 10          	sub    $0x10,%rsp
  4d:	eb 08                	jmp    57 <mutex_spin_on_owner+0x27>
  4f:	90                   	nop
  50:	f3 90                	pause
  52:	48 3b 37             	cmp    (%rdi),%rsi
  55:	75 29                	jne    80 <mutex_spin_on_owner+0x50>
  57:	44 89 c0             	mov    %r8d,%eax
  5a:	90                   	nop
  5b:	90                   	nop
  5c:	90                   	nop
  5d:	8b 11                	mov    (%rcx),%edx
  5f:	90                   	nop
  60:	90                   	nop
  61:	90                   	nop
  62:	85 d2                	test   %edx,%edx
  64:	89 55 fc             	mov    %edx,-0x4(%rbp)
  67:	74 0b                	je     74 <mutex_spin_on_owner+0x44>
  69:	49 8b 81 10 c0 ff ff 	mov    -0x3ff0(%r9),%rax
  70:	a8 08                	test   $0x8,%al
  72:	74 dc                	je     50 <mutex_spin_on_owner+0x20>
  74:	31 c0                	xor    %eax,%eax
  76:	c9                   	leaveq
  77:	c3                   	retq
  78:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  7f:	00
  80:	b8 01 00 00 00       	mov    $0x1,%eax
  85:	c9                   	leaveq
  86:	c3                   	retq
  87:	b8 01 00 00 00       	mov    $0x1,%eax
  8c:	c3                   	retq
  8d:	0f 1f 00             	nopl   (%rax)

No RCU overhead. The owner->on_cpu value is fetched with this single
instruction (%rcx points at &owner->on_cpu, set up by the lea at
offset 33):

  5d:	8b 11                	mov    (%rcx),%edx

while the %gs-relative load at offset 69 plus the 'test $0x8,%al'
right after it is the need_resched() check (TIF_NEED_RESCHED is bit 3
on x86).

Totally untested and all that, I only built mutex.o. What do you
think? Am I missing anything?

Thanks,

	Ingo

[*] with the exception of CONFIG_DEBUG_PAGEALLOC and other debug
    mechanisms like CONFIG_KMEMCHECK (which is on the way out) that
    are based on provoking page faults and fixing up page tables to
    catch unexpected memory accesses.

=================================>
From ef3e5e763747d47a43a32f846ee94706089222cf Mon Sep 17 00:00:00 2001
From: Ingo Molnar
Date: Fri, 10 Apr 2015 10:49:11 +0200
Subject: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU
 lock

Linus suggested getting rid of the RCU read-lock that is held while
spinning in mutex_spin_on_owner(). The only real complication is that
the 'owner' task might be freed from under us and we might then
dereference possibly freed kernel memory. As long as the kernel
pointer itself is valid, this approach is fine in that case (see the
analysis below) - with the exception of CONFIG_DEBUG_PAGEALLOC=y and
similarly instrumented kernels, which might fault when referencing
freed kernel memory.

So use the non-faulting copy-from-user primitive,
__copy_from_user_inatomic(), which we also use in NMI handlers and
which works even on CONFIG_DEBUG_PAGEALLOC=y instrumented kernels, to
fetch the owner->on_cpu value. This compiles to a single instruction
on most platforms.

This approach might briefly copy in junk data from an already freed
(previous) owner task, which might trigger two scenarios:

 1) The junk data causes us to loop once more. This is not a problem,
    as we'll check the owner on the next iteration and break out of
    the loop.

 2) If the junk value causes us to break out of the loop, that's fine
    too: it's what we'd have done anyway on the next iteration, as
    the lock owner changed.

The inatomic context copy primitives are compiler barriers too - this
matters to make sure the owner check above is emitted before the copy
attempt. We also ignore failed copies, as the next iteration will
clean up after us. This saves an extra branch in the common case.
Suggested-by: Linus Torvalds
Not-Yet-Signed-off-by: Ingo Molnar
---
 kernel/locking/mutex.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4cccea6b8934..fcc7db45d62e 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -224,28 +224,42 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock,
 static noinline
 bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 {
-	bool ret = true;
+	int on_cpu;
+	int ret;
 
-	rcu_read_lock();
 	while (lock->owner == owner) {
 		/*
-		 * Ensure we emit the owner->on_cpu, dereference _after_
-		 * checking lock->owner still matches owner. If that fails,
-		 * owner might point to freed memory. If it still matches,
-		 * the rcu_read_lock() ensures the memory stays valid.
+		 * Use the non-faulting copy-user primitive to get the owner->on_cpu
+		 * value that works even on CONFIG_DEBUG_PAGEALLOC=y instrumented
+		 * kernels. This compiles to a single instruction on most platforms.
+		 *
+		 * This might briefly copy in junk data from an already freed
+		 * (previous) owner task, which might trigger two scenarios:
+		 *
+		 * 1) The junk data causes us to loop once more. This is not
+		 *    a problem as we'll check the owner on the next loop and
+		 *    break out of the loop.
+		 *
+		 * 2) If the junk value causes us to break out of the loop
+		 *    that's fine too: it's what we'd have done anyway on
+		 *    the next iteration, as the lock owner changed.
+		 *
+		 * NOTE: the inatomic context copy primitives are compiler barriers
+		 *       too - this matters to make sure the above owner check is
+		 *       emitted before the copy attempt.
+		 *
+		 * NOTE2: We ignore failed copies, as the next iteration will clean
+		 *        up after us. This saves an extra branch in the common case.
 		 */
-		barrier();
+		ret = __copy_from_user_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu));
 
-		if (!owner->on_cpu || need_resched()) {
-			ret = false;
-			break;
-		}
+		if (!on_cpu || need_resched())
+			return false;
 
 		cpu_relax_lowlatency();
 	}
-	rcu_read_unlock();
 
-	return ret;
+	return true;
 }
 
 /*
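( As for the "xadd can be done too" remark above: the analogous
  conversion of rwsem_spin_on_owner() in kernel/locking/rwsem-xadd.c
  would presumably look something like the sketch below - written
  from memory against roughly the current field names, untested and
  not even built:

	static noinline
	bool rwsem_spin_on_owner(struct rw_semaphore *sem,
				 struct task_struct *owner)
	{
		int on_cpu;
		int ret;

		while (sem->owner == owner) {
			/*
			 * Same trick as in mutex_spin_on_owner(): fetch
			 * owner->on_cpu with the non-faulting primitive
			 * instead of dereferencing the owner task under
			 * the RCU read lock. A junk value at worst costs
			 * us one extra loop iteration or an early
			 * bail-out; failed copies are ignored, as above.
			 */
			ret = __copy_from_user_inatomic(&on_cpu,
					&owner->on_cpu, sizeof(on_cpu));

			if (!on_cpu || need_resched())
				return false;

			cpu_relax_lowlatency();
		}

		return true;
	}
 )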