linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Jason Low <jason.low2@hp.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aswin Chandramouleeswaran <aswin@hp.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock
Date: Fri, 10 Apr 2015 07:20:24 -0700	[thread overview]
Message-ID: <20150410142024.GY6464@linux.vnet.ibm.com> (raw)
In-Reply-To: <20150410090051.GA28549@gmail.com>

On Fri, Apr 10, 2015 at 11:00:51AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> > > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about 
> > > performance *at*all*. We will have worse performance problems than 
> > > doing some RCU read-locking inside the loop.
> > > 
> > > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care 
> > > about locking, since at worst we just access stale memory for one 
> > > iteration.
> > 
> > But if we are running on a hypervisor, mightn't our VCPU be 
> > preempted just before accessing ->on_cpu, the task exit and its 
> > structures be freed and unmapped?  Or is the task structure in 
> > memory that is never unmapped?  (If the latter, clearly not a 
> > problem.)
> 
> kmalloc()able kernel memory is never unmapped in that fashion [*]. 
> Even hotplug memory is based on limiting what gets allocated in that 
> area and never putting critical kernel data structures there.
> 
> Personally I'd be more comfortable with having a special primitive for 
> this that is DEBUG_PAGEALLOC aware (Linus's first suggestion), so that 
> we don't use different RCU primitives in the rare case someone tests 
> CONFIG_DEBUG_PAGEALLOC=y ...
> 
> We even have such a primitive: __copy_from_user_inatomic(). It 
> compiles to a single instruction for integer types on x86. I've 
> attached a patch that implements it for the regular mutexes (xadd can 
> be done too), and it all compiles to a rather sweet, compact routine:
> 
> 0000000000000030 <mutex_spin_on_owner.isra.4>:
>   30:	48 3b 37             	cmp    (%rdi),%rsi
>   33:	48 8d 4e 28          	lea    0x28(%rsi),%rcx
>   37:	75 4e                	jne    87 <mutex_spin_on_owner.isra.4+0x57>
>   39:	55                   	push   %rbp
>   3a:	45 31 c0             	xor    %r8d,%r8d
>   3d:	65 4c 8b 0c 25 00 00 	mov    %gs:0x0,%r9
>   44:	00 00 
>   46:	48 89 e5             	mov    %rsp,%rbp
>   49:	48 83 ec 10          	sub    $0x10,%rsp
>   4d:	eb 08                	jmp    57 <mutex_spin_on_owner.isra.4+0x27>
>   4f:	90                   	nop
>   50:	f3 90                	pause  
>   52:	48 3b 37             	cmp    (%rdi),%rsi
>   55:	75 29                	jne    80 <mutex_spin_on_owner.isra.4+0x50>
>   57:	44 89 c0             	mov    %r8d,%eax
>   5a:	90                   	nop
>   5b:	90                   	nop
>   5c:	90                   	nop
>   5d:	8b 11                	mov    (%rcx),%edx
>   5f:	90                   	nop
>   60:	90                   	nop
>   61:	90                   	nop
>   62:	85 d2                	test   %edx,%edx
>   64:	89 55 fc             	mov    %edx,-0x4(%rbp)
>   67:	74 0b                	je     74 <mutex_spin_on_owner.isra.4+0x44>
>   69:	49 8b 81 10 c0 ff ff 	mov    -0x3ff0(%r9),%rax
>   70:	a8 08                	test   $0x8,%al
>   72:	74 dc                	je     50 <mutex_spin_on_owner.isra.4+0x20>
>   74:	31 c0                	xor    %eax,%eax
>   76:	c9                   	leaveq 
>   77:	c3                   	retq   
>   78:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
>   7f:	00 
>   80:	b8 01 00 00 00       	mov    $0x1,%eax
>   85:	c9                   	leaveq 
>   86:	c3                   	retq   
>   87:	b8 01 00 00 00       	mov    $0x1,%eax
>   8c:	c3                   	retq   
>   8d:	0f 1f 00             	nopl   (%rax)
> 
> No RCU overhead, and this is the access to owner->on_cpu:
> 
>   69:	49 8b 81 10 c0 ff ff 	mov    -0x3ff0(%r9),%rax
> 
> Totally untested and all that, I only built the mutex.o.
> 
> What do you think? Am I missing anything?

I suspect it is good, but let's take a look at Linus' summary of the code:

        rcu_read_lock();
        while (sem->owner == owner) {
                if (!owner->on_cpu || need_resched())
                        break;
                cpu_relax_lowlatency();
        }
        rcu_read_unlock();

The cpu_relax_lowlatency() looks to have barrier() semantics, so  the
sem->owner should get reloaded every time through the loop.  This is
needed, because otherwise the task structure could get freed and
reallocated as something else that happened to have the field at the
->on_cpu offset always zero, resulting in an infinite loop.

Of course, the implicit barrier() could well be forcing unnecessary
register spilling.  If so, it might be worth trying a variant of
cpu_relax_lowlatency() that doesn't have barrier() semantics and
placing READ_ONCE() around the sem->owner in the "while" condition.

So, in short, given the big fat comment you called for, I don't see
any problem with this approach.

Which means that any bugs will be a surprise, which will at least have
the advantage of avoiding boredom.  ;-)

							Thanx, Paul

> Thanks,
> 
> 	Ingo
> 
> [*] with the exception of CONFIG_DEBUG_PAGEALLOC and other debug 
>     mechanisms like CONFIG_KMEMCHECK (which is on the way out) that 
>     are based on provoking page faults and fixing up page tables to 
>     catch unexpected memory accesses.
> 
> =================================>
> >From ef3e5e763747d47a43a32f846ee94706089222cf Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <mingo@kernel.org>
> Date: Fri, 10 Apr 2015 10:49:11 +0200
> Subject: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock
> 
> Linus suggested to get rid of the held RCU read-lock in 
> mutex_spin_on_owner(). The only real complication is that the 'owner' 
> task might be freed from under us and we might dereference into 
> possibly freed kernel memory.
> 
> As long as the kernel pointer itself is valid this approach is fine in 
> this case (see the analysis below) - with the exception of 
> CONFIG_DEBUG_PAGEALLOC=y and similarly instrumented kernels which 
> might fault on referencing freed kernel memory.
> 
> Use the non-faulting copy-from-user primitive to get the owner->on_cpu 
> value that we use in NMI handlers and which works even on 
> CONFIG_DEBUG_PAGEALLOC=y instrumented kernels. This compiles to a 
> single instruction on most platforms.
> 
> This approach might briefly copy in junk data from an already freed 
> (previous) owner task, which might trigger two scenarios:
> 
> 1) The junk data causes us to loop once more. This is not
>    a problem as we'll check the owner on the next loop and
>    break out of the loop.
> 
> 2) If the junk value causes us to break out of the loop
>    that's fine too: it's what we'd have done anyway on
>    the next iteration, as the lock owner changed.
> 
> The inatomic context copy primitives are compiler barriers
> too - this matters to make sure the above owner check is
> emitted to before the copy attempt.
> 
> We also ignore failed copies, as the next iteration will clean
> up after us. This saves an extra branch in the common case.
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  kernel/locking/mutex.c | 40 +++++++++++++++++++++++++++-------------
>  1 file changed, 27 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 4cccea6b8934..fcc7db45d62e 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -224,28 +224,42 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock,
>  static noinline
>  bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
>  {
> -	bool ret = true;
> +	int on_cpu;
> +	int ret;
> 
> -	rcu_read_lock();
>  	while (lock->owner == owner) {
>  		/*
> -		 * Ensure we emit the owner->on_cpu, dereference _after_
> -		 * checking lock->owner still matches owner. If that fails,
> -		 * owner might point to freed memory. If it still matches,
> -		 * the rcu_read_lock() ensures the memory stays valid.
> +		 * Use the non-faulting copy-user primitive to get the owner->on_cpu
> +		 * value that works even on CONFIG_DEBUG_PAGEALLOC=y instrumented
> +		 * kernels. This compiles to a single instruction on most platforms.
> +		 *
> +		 * This might briefly copy in junk data from an already freed
> +		 * (previous) owner task, which might trigger two scenarios:
> +		 *
> +		 * 1) The junk data causes us to loop once more. This is not
> +		 *    a problem as we'll check the owner on the next loop and
> +		 *    break out of the loop.
> +		 *
> +		 * 2) If the junk value causes us to break out of the loop
> +		 *    that's fine too: it's what we'd have done anyway on
> +		 *    the next iteration, as the lock owner changed.
> +		 *
> +		 * NOTE: the inatomic context copy primitives are compiler barriers
> +		 *       too - this matters to make sure the above owner check is
> +		 *       emitted to before the copy attempt.
> +		 *
> +		 * NOTE2: We ignore failed copies, as the next iteration will clean
> +		 *        up after us. This saves an extra branch in the common case.
>  		 */
> -		barrier();
> +		ret = __copy_from_user_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu));
> 
> -		if (!owner->on_cpu || need_resched()) {
> -			ret = false;
> -			break;
> -		}
> +		if (!on_cpu || need_resched())
> +			return false;
> 
>  		cpu_relax_lowlatency();
>  	}
> -	rcu_read_unlock();
> 
> -	return ret;
> +	return true;
>  }
> 
>  /*
> 


  parent reply	other threads:[~2015-04-10 14:20 UTC|newest]

Thread overview: 108+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-08 19:39 [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Jason Low
2015-04-08 19:39 ` [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner() Jason Low
2015-04-09  9:00   ` [tip:locking/core] locking/mutex: Further simplify mutex_spin_on_owner() tip-bot for Jason Low
2015-04-08 19:39 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09  5:37   ` Ingo Molnar
2015-04-09  6:40     ` Jason Low
2015-04-09  7:53       ` Ingo Molnar
2015-04-09 16:47         ` Linus Torvalds
2015-04-09 17:56           ` Paul E. McKenney
2015-04-09 18:08             ` Linus Torvalds
2015-04-09 18:16               ` Linus Torvalds
2015-04-09 18:39                 ` Paul E. McKenney
2015-04-10  9:00                   ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar
2015-04-10  9:12                     ` Ingo Molnar
2015-04-10  9:21                       ` [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive Ingo Molnar
2015-04-10 11:14                         ` [PATCH] x86/uaccess: Implement get_kernel() Ingo Molnar
2015-04-10 11:27                           ` [PATCH] mutex: Improve mutex_spin_on_owner() code generation Ingo Molnar
2015-04-10 12:08                             ` [PATCH] x86: Align jump targets to 1 byte boundaries Ingo Molnar
2015-04-10 12:18                               ` [PATCH] x86: Pack function addresses tightly as well Ingo Molnar
2015-04-10 12:30                                 ` [PATCH] x86: Pack loops " Ingo Molnar
2015-04-10 13:46                                   ` Borislav Petkov
2015-05-15  9:40                                   ` [tip:x86/asm] " tip-bot for Ingo Molnar
2015-05-17  6:03                                   ` [tip:x86/apic] " tip-bot for Ingo Molnar
2015-05-15  9:39                                 ` [tip:x86/asm] x86: Pack function addresses " tip-bot for Ingo Molnar
2015-05-15 18:36                                   ` Linus Torvalds
2015-05-15 20:52                                     ` Denys Vlasenko
2015-05-17  5:58                                     ` Ingo Molnar
2015-05-17  7:09                                       ` Ingo Molnar
2015-05-17  7:30                                         ` Ingo Molnar
2015-05-18  9:28                                       ` Denys Vlasenko
2015-05-19 21:38                                       ` [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions Ingo Molnar
2015-05-20  0:47                                         ` Linus Torvalds
2015-05-20 12:21                                           ` Denys Vlasenko
2015-05-21 11:36                                             ` Ingo Molnar
2015-05-21 11:38                                             ` Denys Vlasenko
2016-04-16 21:08                                               ` Denys Vlasenko
2015-05-20 13:09                                           ` Ingo Molnar
2015-05-20 11:29                                         ` Denys Vlasenko
2015-05-21 13:28                                           ` Ingo Molnar
2015-05-21 14:03                                           ` Ingo Molnar
2015-04-10 12:50                               ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-04-10 13:18                                 ` H. Peter Anvin
2015-04-10 17:54                                   ` Ingo Molnar
2015-04-10 18:32                                     ` H. Peter Anvin
2015-04-11 14:41                                   ` Markus Trippelsdorf
2015-04-12 10:14                                     ` Ingo Molnar
2015-04-13 16:23                                       ` Markus Trippelsdorf
2015-04-13 17:26                                         ` Markus Trippelsdorf
2015-04-13 18:31                                           ` Linus Torvalds
2015-04-13 19:09                                             ` Markus Trippelsdorf
2015-04-14  5:38                                               ` Ingo Molnar
2015-04-14  8:23                                                 ` Markus Trippelsdorf
2015-04-14  9:16                                                   ` Ingo Molnar
2015-04-14 11:17                                                     ` Markus Trippelsdorf
2015-04-14 12:09                                                       ` Ingo Molnar
2015-04-10 18:48                                 ` Linus Torvalds
2015-04-12 23:44                                   ` Maciej W. Rozycki
2015-04-10 19:23                                 ` Daniel Borkmann
2015-04-11 13:48                                 ` Markus Trippelsdorf
2015-04-10 13:19                               ` Borislav Petkov
2015-04-10 13:54                                 ` Denys Vlasenko
2015-04-10 14:01                                   ` Borislav Petkov
2015-04-10 14:53                                     ` Denys Vlasenko
2015-04-10 15:25                                       ` Borislav Petkov
2015-04-10 15:48                                         ` Denys Vlasenko
2015-04-10 15:54                                           ` Borislav Petkov
2015-04-10 21:44                                             ` Borislav Petkov
2015-04-10 18:54                                       ` Linus Torvalds
2015-04-10 14:10                               ` Paul E. McKenney
2015-04-11 14:28                                 ` Josh Triplett
2015-04-11  9:20                               ` [PATCH] x86: Turn off GCC branch probability heuristics Ingo Molnar
2015-04-11 17:41                                 ` Linus Torvalds
2015-04-11 18:57                                   ` Thomas Gleixner
2015-04-11 19:35                                     ` Linus Torvalds
2015-04-12  5:47                                       ` Ingo Molnar
2015-04-12  6:20                                         ` Markus Trippelsdorf
2015-04-12 10:15                                           ` Ingo Molnar
2015-04-12  7:56                                         ` Mike Galbraith
2015-04-12  7:41                                       ` Ingo Molnar
2015-04-12  8:07                                     ` Ingo Molnar
2015-04-12 21:11                                     ` Jan Hubicka
2015-05-14 11:59                               ` [PATCH] x86: Align jump targets to 1 byte boundaries Denys Vlasenko
2015-05-14 18:17                                 ` Ingo Molnar
2015-05-14 19:04                                   ` Denys Vlasenko
2015-05-14 19:44                                     ` Ingo Molnar
2015-05-15 15:45                                   ` Josh Triplett
2015-05-17  5:34                                     ` Ingo Molnar
2015-05-17 19:18                                       ` Josh Triplett
2015-05-18  6:48                                         ` Ingo Molnar
2015-05-15  9:39                               ` [tip:x86/asm] x86: Align jump targets to 1-byte boundaries tip-bot for Ingo Molnar
2015-04-10 11:34                           ` [PATCH] x86/uaccess: Implement get_kernel() Peter Zijlstra
2015-04-10 18:04                             ` Ingo Molnar
2015-04-10 17:49                           ` Linus Torvalds
2015-04-10 18:04                             ` Ingo Molnar
2015-04-10 18:09                               ` Linus Torvalds
2015-04-10 14:20                     ` Paul E. McKenney [this message]
2015-04-10 17:44                       ` [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock Ingo Molnar
2015-04-10 18:05                         ` Paul E. McKenney
2015-04-09 19:43                 ` [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner() Jason Low
2015-04-09 19:58                   ` Paul E. McKenney
2015-04-09 20:58                     ` Jason Low
2015-04-09 21:07                       ` Paul E. McKenney
2015-04-09 19:59                   ` Davidlohr Bueso
2015-04-09 20:36                 ` Jason Low
2015-04-10  2:43                   ` Andev
2015-04-10  9:04                   ` Ingo Molnar
2015-04-08 19:49 ` [PATCH 0/2] locking: Simplify mutex and rwsem spinning code Davidlohr Bueso
2015-04-08 20:10   ` Jason Low

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150410142024.GY6464@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=aswin@hp.com \
    --cc=dave@stgolabs.net \
    --cc=jason.low2@hp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).