linux-kernel.vger.kernel.org archive mirror
* [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
@ 2017-02-17 20:43 Waiman Long
  2017-02-20  4:20 ` Andrea Parri
  0 siblings, 1 reply; 7+ messages in thread
From: Waiman Long @ 2017-02-17 20:43 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Pan Xinhui, Boqun Feng, Waiman Long

All the locking related cmpxchg's in the following functions are
replaced with the _acquire variants:
 - pv_queued_spin_steal_lock()
 - trylock_clear_pending()

This change should help performance on architectures that use LL/SC.
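
Roughly speaking, in terms of the generic fallbacks in include/linux/atomic.h
(this is only a sketch of the documented semantics; architectures provide
their own, often lighter, implementations of the _acquire form), the
difference is:

  /* cmpxchg(): fully ordered, barriers on both sides of the operation */
  smp_mb__before_atomic();              /* a heavyweight barrier on LL/SC archs */
  ret = cmpxchg_relaxed(ptr, old, new);
  smp_mb__after_atomic();

  /* cmpxchg_acquire(): no barrier at all before the operation */
  ret = cmpxchg_relaxed(ptr, old, new);
  smp_mb__after_atomic();               /* or a lighter acquire barrier */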

On a 2-core 16-thread Power8 system with pvqspinlock explicitly
enabled, the performance of a locking microbenchmark with and without
this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
were as follows:

  # of thread     w/o patch    with patch      % Change
  -----------     ---------    ----------      --------
       4         4053.3 Mop/s  4223.7 Mop/s     +4.2%
       8         3310.4 Mop/s  3406.0 Mop/s     +2.9%
      12         2576.4 Mop/s  2674.6 Mop/s     +3.8%

Signed-off-by: Waiman Long <longman@redhat.com>
---

 v2->v3:
  - Reduce scope by relaxing cmpxchg's in fast path only.

 v1->v2:
  - Add comments in changelog and code for the rationale of the change.

 kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e6b2f7a..a59dc05 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
 	struct __qspinlock *l = (void *)lock;
 
 	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
-	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
+	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
 		qstat_inc(qstat_pv_lock_stealing, true);
 		return true;
 	}
@@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
 
 /*
  * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
- * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
- * just to be sure that it will get it.
+ * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
+ * lock just to be sure that it will get it.
  */
 static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 {
 	struct __qspinlock *l = (void *)lock;
 
 	return !READ_ONCE(l->locked) &&
-	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
-			== _Q_PENDING_VAL);
+	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
+				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
 }
 #else /* _Q_PENDING_BITS == 8 */
 static __always_inline void set_pending(struct qspinlock *lock)
@@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
 		 */
 		old = val;
 		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-		val = atomic_cmpxchg(&lock->val, old, new);
+		val = atomic_cmpxchg_acquire(&lock->val, old, new);
 
 		if (val == old)
 			return 1;
@@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	 * observe its next->locked value and advance itself.
 	 *
 	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+	 *
+	 * We can't use the relaxed form of cmpxchg here as the loading of
+	 * pn->state can happen before setting next->locked on some archs.
 	 */
 	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
 		return;
-- 
1.8.3.1

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-17 20:43 [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs Waiman Long
@ 2017-02-20  4:20 ` Andrea Parri
  2017-02-20  4:53   ` Boqun Feng
  2017-02-20 11:00   ` Peter Zijlstra
  0 siblings, 2 replies; 7+ messages in thread
From: Andrea Parri @ 2017-02-20  4:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Pan Xinhui, Boqun Feng, linux-kernel

On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> All the locking related cmpxchg's in the following functions are
> replaced with the _acquire variants:
>  - pv_queued_spin_steal_lock()
>  - trylock_clear_pending()
> 
> This change should help performance on architectures that use LL/SC.
> 
> On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> enabled, the performance of a locking microbenchmark with and without
> this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> were as follows:
> 
>   # of thread     w/o patch    with patch      % Change
>   -----------     ---------    ----------      --------
>        4         4053.3 Mop/s  4223.7 Mop/s     +4.2%
>        8         3310.4 Mop/s  3406.0 Mop/s     +2.9%
>       12         2576.4 Mop/s  2674.6 Mop/s     +3.8%
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> 
>  v2->v3:
>   - Reduce scope by relaxing cmpxchg's in fast path only.
> 
>  v1->v2:
>   - Add comments in changelog and code for the rationale of the change.
> 
>  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index e6b2f7a..a59dc05 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
>  	struct __qspinlock *l = (void *)lock;
>  
>  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>  		qstat_inc(qstat_pv_lock_stealing, true);
>  		return true;
>  	}
> @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
>  
>  /*
>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> - * just to be sure that it will get it.
> + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> + * lock just to be sure that it will get it.
>   */
>  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  {
>  	struct __qspinlock *l = (void *)lock;
>  
>  	return !READ_ONCE(l->locked) &&
> -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> -			== _Q_PENDING_VAL);
> +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
>  }
>  #else /* _Q_PENDING_BITS == 8 */
>  static __always_inline void set_pending(struct qspinlock *lock)
> @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  		 */
>  		old = val;
>  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> -		val = atomic_cmpxchg(&lock->val, old, new);
> +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
>  
>  		if (val == old)
>  			return 1;
> @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>  	 * observe its next->locked value and advance itself.
>  	 *
>  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +	 *
> +	 * We can't use the relaxed form of cmpxchg here as the loading of
> +	 * pn->state can happen before setting next->locked on some archs.
>  	 */
>  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)

Hi Waiman.

cmpxchg() does not guarantee the (here implied) smp_mb(), in general; c.f.,
e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets "compiled"
to something like:

    _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:

  Andrea


>  		return;
> -- 
> 1.8.3.1
> 

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20  4:20 ` Andrea Parri
@ 2017-02-20  4:53   ` Boqun Feng
  2017-02-20  4:58     ` Boqun Feng
  2017-02-20 15:58     ` Waiman Long
  2017-02-20 11:00   ` Peter Zijlstra
  1 sibling, 2 replies; 7+ messages in thread
From: Boqun Feng @ 2017-02-20  4:53 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Pan Xinhui, linux-kernel

On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > All the locking related cmpxchg's in the following functions are
> > replaced with the _acquire variants:
> >  - pv_queued_spin_steal_lock()
> >  - trylock_clear_pending()
> > 
> > This change should help performance on architectures that use LL/SC.
> > 
> > On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> > enabled, the performance of a locking microbenchmark with and without
> > this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> > were as follows:
> > 
> >   # of thread     w/o patch    with patch      % Change
> >   -----------     ---------    ----------      --------
> >        4         4053.3 Mop/s  4223.7 Mop/s     +4.2%
> >        8         3310.4 Mop/s  3406.0 Mop/s     +2.9%
> >       12         2576.4 Mop/s  2674.6 Mop/s     +3.8%
> > 
> > Signed-off-by: Waiman Long <longman@redhat.com>
> > ---
> > 
> >  v2->v3:
> >   - Reduce scope by relaxing cmpxchg's in fast path only.
> > 
> >  v1->v2:
> >   - Add comments in changelog and code for the rationale of the change.
> > 
> >  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
> >  1 file changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > index e6b2f7a..a59dc05 100644
> > --- a/kernel/locking/qspinlock_paravirt.h
> > +++ b/kernel/locking/qspinlock_paravirt.h
> > @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> >  	struct __qspinlock *l = (void *)lock;
> >  
> >  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> > -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> >  		qstat_inc(qstat_pv_lock_stealing, true);
> >  		return true;
> >  	}
> > @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
> >  
> >  /*
> >   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> > - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> > - * just to be sure that it will get it.
> > + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> > + * lock just to be sure that it will get it.
> >   */
> >  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> >  {
> >  	struct __qspinlock *l = (void *)lock;
> >  
> >  	return !READ_ONCE(l->locked) &&
> > -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> > -			== _Q_PENDING_VAL);
> > +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> > +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
> >  }
> >  #else /* _Q_PENDING_BITS == 8 */
> >  static __always_inline void set_pending(struct qspinlock *lock)
> > @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> >  		 */
> >  		old = val;
> >  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> > -		val = atomic_cmpxchg(&lock->val, old, new);
> > +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
> >  
> >  		if (val == old)
> >  			return 1;
> > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> >  	 * observe its next->locked value and advance itself.
> >  	 *
> >  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > +	 *
> > +	 * We can't use the relaxed form of cmpxchg here as the loading of
> > +	 * pn->state can happen before setting next->locked on some archs.
> >  	 */
> >  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> 
> Hi Waiman.
> 
> cmpxchg() does not guarantee the (here implied) smp_mb(), in general; c.f.,
> e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() get "compiled"
> to something like:
> 
>     _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
> 

Yes, sorry for being late on this one.

So Waiman, the fact is that in this case, we want the following code
sequence:

	CPU 0					CPU 1
	=================			====================
	{pn->state = vcpu_running, node->locked = 0}

	smp_store_mb(&pn->state, vcpu_halted):
	  WRITE_ONCE(pn->state, vcpu_halted);
	  smp_mb();
	r1 = READ_ONCE(node->locked);
						arch_mcs_spin_unlock_contended();
						  WRITE_ONCE(node->locked, 1)

						cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);

never ends up in:

	r1 == 0 && cmpxchg fails (i.e. the read part of cmpxchg reads the
	value vcpu_running).

We can have such a guarantee if cmpxchg has a smp_mb() before its load
part, which is true for PPC. But semantically, cmpxchg() doesn't provide
any order guarantee if it fails, which is true on ARM64, IIUC. (Add Will
in Cc for his insight ;-)).

So a possible "fix" (in case ARM64 will use qspinlock some day) would be to
replace cmpxchg() with smp_mb() + cmpxchg_relaxed().
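
Something like the following in pv_kick_node(), making the barrier explicit
(just a sketch of that idea, not a tested patch):

	/*
	 * Order the write to next->locked in
	 * arch_mcs_spin_unlock_contended() before the load part of the
	 * cmpxchg below, even if the cmpxchg fails.
	 */
	smp_mb();
	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
		return;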

Regards,
Boqun

>   Andrea
> 
> 
> >  		return;
> > -- 
> > 1.8.3.1
> > 

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20  4:53   ` Boqun Feng
@ 2017-02-20  4:58     ` Boqun Feng
  2017-02-21 13:04       ` Will Deacon
  2017-02-20 15:58     ` Waiman Long
  1 sibling, 1 reply; 7+ messages in thread
From: Boqun Feng @ 2017-02-20  4:58 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, Pan Xinhui,
	linux-kernel, Will Deacon

(Really add Will this time ...)

On Mon, Feb 20, 2017 at 12:53:58PM +0800, Boqun Feng wrote:
> On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> > On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > > All the locking related cmpxchg's in the following functions are
> > > replaced with the _acquire variants:
> > >  - pv_queued_spin_steal_lock()
> > >  - trylock_clear_pending()
> > > 
> > > This change should help performance on architectures that use LL/SC.
> > > 
> > > On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> > > enabled, the performance of a locking microbenchmark with and without
> > > this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> > > were as follows:
> > > 
> > >   # of thread     w/o patch    with patch      % Change
> > >   -----------     ---------    ----------      --------
> > >        4         4053.3 Mop/s  4223.7 Mop/s     +4.2%
> > >        8         3310.4 Mop/s  3406.0 Mop/s     +2.9%
> > >       12         2576.4 Mop/s  2674.6 Mop/s     +3.8%
> > > 
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > ---
> > > 
> > >  v2->v3:
> > >   - Reduce scope by relaxing cmpxchg's in fast path only.
> > > 
> > >  v1->v2:
> > >   - Add comments in changelog and code for the rationale of the change.
> > > 
> > >  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > > index e6b2f7a..a59dc05 100644
> > > --- a/kernel/locking/qspinlock_paravirt.h
> > > +++ b/kernel/locking/qspinlock_paravirt.h
> > > @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> > >  	struct __qspinlock *l = (void *)lock;
> > >  
> > >  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> > > -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > > +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > >  		qstat_inc(qstat_pv_lock_stealing, true);
> > >  		return true;
> > >  	}
> > > @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
> > >  
> > >  /*
> > >   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> > > - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> > > - * just to be sure that it will get it.
> > > + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> > > + * lock just to be sure that it will get it.
> > >   */
> > >  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  {
> > >  	struct __qspinlock *l = (void *)lock;
> > >  
> > >  	return !READ_ONCE(l->locked) &&
> > > -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> > > -			== _Q_PENDING_VAL);
> > > +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> > > +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
> > >  }
> > >  #else /* _Q_PENDING_BITS == 8 */
> > >  static __always_inline void set_pending(struct qspinlock *lock)
> > > @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  		 */
> > >  		old = val;
> > >  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> > > -		val = atomic_cmpxchg(&lock->val, old, new);
> > > +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
> > >  
> > >  		if (val == old)
> > >  			return 1;
> > > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> > >  	 * observe its next->locked value and advance itself.
> > >  	 *
> > >  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > > +	 *
> > > +	 * We can't use the relaxed form of cmpxchg here as the loading of
> > > +	 * pn->state can happen before setting next->locked on some archs.
> > >  	 */
> > >  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> > 
> > Hi Waiman.
> > 
> > cmpxchg() does not guarantee the (here implied) smp_mb(), in general; c.f.,
> > e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() get "compiled"
> > to something like:
> > 
> >     _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
> > 
> 
> Yes, sorry for be late for this one.
> 
> So Waiman, the fact is that in this case, we want the following code
> sequence:
> 
> 	CPU 0					CPU 1
> 	=================			====================
> 	{pn->state = vcpu_running, node->locked = 0}
> 
> 	smp_store_mb(&pn->state, vcpu_halted):
> 	  WRITE_ONCE(pn->state, vcpu_halted);
> 	  smp_mb();
> 	r1 = READ_ONCE(node->locked);
> 						arch_mcs_spin_unlock_contended();
> 						  WRITE_ONCE(node->locked, 1)
> 
> 						cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);
> 
> never ends up in:
> 
> 	r1 == 0 && cmpxchg fail(i.e. the read part of cmpxchg reads the
> 	value vcpu_running).
> 
> We can have such a guarantee if cmpxchg has a smp_mb() before its load
> part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> any order guarantee if it fails, which is true on ARM64, IIUC. (Add Will
> in Cc for his insight ;-)).
> 
> So a possible "fix"(in case ARM64 will use qspinlock some day), would be
> replace cmpxchg() with smp_mb() + cmpxchg_relaxed().
> 
> Regards,
> Boqun
> 
> >   Andrea
> > 
> > 
> > >  		return;
> > > -- 
> > > 1.8.3.1
> > > 



* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20  4:20 ` Andrea Parri
  2017-02-20  4:53   ` Boqun Feng
@ 2017-02-20 11:00   ` Peter Zijlstra
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2017-02-20 11:00 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Waiman Long, Ingo Molnar, Pan Xinhui, Boqun Feng, linux-kernel,
	Will Deacon

On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:

> > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> >  	 * observe its next->locked value and advance itself.
> >  	 *
> >  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > +	 *
> > +	 * We can't use the relaxed form of cmpxchg here as the loading of
> > +	 * pn->state can happen before setting next->locked on some archs.
> >  	 */
> >  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> 
> Hi Waiman.
> 
> cmpxchg() does not guarantee the (here implied) smp_mb(), in general; c.f.,
> e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() get "compiled"
> to something like:
> 
>     _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:


So cmpxchg() should have an effective smp_mb() before and after the
operation, _except_ in the case of failure, in which case it need not
imply any barrier.

And since a successful cmpxchg does the store, which is a store-release
in that arm64 sequence, that provides the ordering against all prior
state. The dmb ish provides the required barrier after the operation.
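
Annotated, that sequence reads (just a sketch):

    _loop:  ldxr                // load part: no implied barrier before it
            eor
            cbnz _exit          // failure: falls straight out, no barrier at all
            stlxr               // success: a store-release, orders all prior accesses
            cbnz _loop
            dmb ish             // success: the barrier after the operation
    _exit: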

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20  4:53   ` Boqun Feng
  2017-02-20  4:58     ` Boqun Feng
@ 2017-02-20 15:58     ` Waiman Long
  1 sibling, 0 replies; 7+ messages in thread
From: Waiman Long @ 2017-02-20 15:58 UTC (permalink / raw)
  To: Boqun Feng, Andrea Parri
  Cc: Peter Zijlstra, Ingo Molnar, Pan Xinhui, linux-kernel

On 02/19/2017 11:53 PM, Boqun Feng wrote:
> On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
>> On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
>>> All the locking related cmpxchg's in the following functions are
>>> replaced with the _acquire variants:
>>>  - pv_queued_spin_steal_lock()
>>>  - trylock_clear_pending()
>>>
>>> This change should help performance on architectures that use LL/SC.
>>>
>>> On a 2-core 16-thread Power8 system with pvqspinlock explicitly
>>> enabled, the performance of a locking microbenchmark with and without
>>> this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
>>> were as follows:
>>>
>>>   # of thread     w/o patch    with patch      % Change
>>>   -----------     ---------    ----------      --------
>>>        4         4053.3 Mop/s  4223.7 Mop/s     +4.2%
>>>        8         3310.4 Mop/s  3406.0 Mop/s     +2.9%
>>>       12         2576.4 Mop/s  2674.6 Mop/s     +3.8%
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>
>>>  v2->v3:
>>>   - Reduce scope by relaxing cmpxchg's in fast path only.
>>>
>>>  v1->v2:
>>>   - Add comments in changelog and code for the rationale of the change.
>>>
>>>  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
>>>  1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
>>> index e6b2f7a..a59dc05 100644
>>> --- a/kernel/locking/qspinlock_paravirt.h
>>> +++ b/kernel/locking/qspinlock_paravirt.h
>>> @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
>>>  	struct __qspinlock *l = (void *)lock;
>>>  
>>>  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
>>> -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>>> +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>>>  		qstat_inc(qstat_pv_lock_stealing, true);
>>>  		return true;
>>>  	}
>>> @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
>>>  
>>>  /*
>>>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
>>> - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
>>> - * just to be sure that it will get it.
>>> + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
>>> + * lock just to be sure that it will get it.
>>>   */
>>>  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>>>  {
>>>  	struct __qspinlock *l = (void *)lock;
>>>  
>>>  	return !READ_ONCE(l->locked) &&
>>> -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
>>> -			== _Q_PENDING_VAL);
>>> +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
>>> +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
>>>  }
>>>  #else /* _Q_PENDING_BITS == 8 */
>>>  static __always_inline void set_pending(struct qspinlock *lock)
>>> @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>>>  		 */
>>>  		old = val;
>>>  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
>>> -		val = atomic_cmpxchg(&lock->val, old, new);
>>> +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
>>>  
>>>  		if (val == old)
>>>  			return 1;
>>> @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>  	 * observe its next->locked value and advance itself.
>>>  	 *
>>>  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>> +	 *
>>> +	 * We can't use the relaxed form of cmpxchg here as the loading of
>>> +	 * pn->state can happen before setting next->locked on some archs.
>>>  	 */
>>>  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>> Hi Waiman.
>>
>> cmpxchg() does not guarantee the (here implied) smp_mb(), in general; c.f.,
>> e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() get "compiled"
>> to something like:
>>
>>     _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
>>
> Yes, sorry for be late for this one.
>
> So Waiman, the fact is that in this case, we want the following code
> sequence:
>
> 	CPU 0					CPU 1
> 	=================			====================
> 	{pn->state = vcpu_running, node->locked = 0}
>
> 	smp_store_mb(&pn->state, vcpu_halted):
> 	  WRITE_ONCE(pn->state, vcpu_halted);
> 	  smp_mb();
> 	r1 = READ_ONCE(node->locked);
> 						arch_mcs_spin_unlock_contended();
> 						  WRITE_ONCE(node->locked, 1)
>
> 						cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);
>
> never ends up in:
>
> 	r1 == 0 && cmpxchg fail(i.e. the read part of cmpxchg reads the
> 	value vcpu_running).
>
> We can have such a guarantee if cmpxchg has a smp_mb() before its load
> part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> any order guarantee if it fails, which is true on ARM64, IIUC. (Add Will
> in Cc for his insight ;-)).
>
> So a possible "fix"(in case ARM64 will use qspinlock some day), would be
> replace cmpxchg() with smp_mb() + cmpxchg_relaxed().
>

This is the pvqspinlock code; the native qspinlock will not have this
problem. So we will need to make sure that the write to node->locked
always precedes the read of pn->state whether the cmpxchg is successful
or not. We are not going to replace cmpxchg() with smp_mb() +
cmpxchg_relaxed(), as that would impact x86 performance. Perhaps we could
provide another cmpxchg variant that provides that memory barrier
guarantee for all archs.
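
For example (hypothetical helper, the name is made up here just to
illustrate the idea; x86 could map it straight to cmpxchg() since its
cmpxchg is already fully ordered, while LL/SC archs would add the
explicit barrier):

	/*
	 * Like cmpxchg_relaxed(), but with prior accesses ordered before
	 * the load part even when the cmpxchg fails.
	 */
	#ifndef cmpxchg_order_before
	#define cmpxchg_order_before(ptr, old, new)		\
	({							\
		smp_mb();					\
		cmpxchg_relaxed((ptr), (old), (new));		\
	})
	#endif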

In the meantime, I am going to update the patch to document this
limitation so that we can do something about it when archs like ARM64
want pvqspinlock support.

Cheers,
Longman

* Re: [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs
  2017-02-20  4:58     ` Boqun Feng
@ 2017-02-21 13:04       ` Will Deacon
  0 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2017-02-21 13:04 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Andrea Parri, Waiman Long, Peter Zijlstra, Ingo Molnar,
	Pan Xinhui, linux-kernel

On Mon, Feb 20, 2017 at 12:58:39PM +0800, Boqun Feng wrote:
> > So Waiman, the fact is that in this case, we want the following code
> > sequence:
> > 
> > 	CPU 0					CPU 1
> > 	=================			====================
> > 	{pn->state = vcpu_running, node->locked = 0}
> > 
> > 	smp_store_mb(&pn->state, vcpu_halted):
> > 	  WRITE_ONCE(pn->state, vcpu_halted);
> > 	  smp_mb();
> > 	r1 = READ_ONCE(node->locked);
> > 						arch_mcs_spin_unlock_contended();
> > 						  WRITE_ONCE(node->locked, 1)
> > 
> > 						cmpxchg(&pn->state, vcpu_halted, vcpu_hashed);
> > 
> > never ends up in:
> > 
> > 	r1 == 0 && cmpxchg fail(i.e. the read part of cmpxchg reads the
> > 	value vcpu_running).
> > 
> > We can have such a guarantee if cmpxchg has a smp_mb() before its load
> > part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> > any order guarantee if it fails, which is true on ARM64, IIUC. (Add Will
> > in Cc for his insight ;-)).

I think you're right. The write to node->locked on CPU1 is not required
to be ordered before the load part of the failing cmpxchg.

> > So a possible "fix"(in case ARM64 will use qspinlock some day), would be
> > replace cmpxchg() with smp_mb() + cmpxchg_relaxed().

Perversely, we could actually get away with cmpxchg_acquire on arm64, because
arch_mcs_spin_unlock_contended is an smp_store_release and we order release ->
acquire in the architecture. But that just brings up the age-old unlock/lock
discussion again...
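
Roughly (a sketch; both statements run on the same CPU in program order,
with names as in the pv_kick_node() path):

	smp_store_release(&next->locked, 1);	/* arch_mcs_spin_unlock_contended() */
	/*
	 * arm64 keeps a store-release ordered before a later load-acquire,
	 * so the load part of the cmpxchg_acquire() below cannot be
	 * reordered before the store above even when the cmpxchg fails,
	 * which is the guarantee needed here.
	 */
	cmpxchg_acquire(&pn->state, vcpu_halted, vcpu_hashed);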

Will

end of thread, other threads:[~2017-02-21 13:04 UTC | newest]

Thread overview: 7+ messages
2017-02-17 20:43 [PATCH v3] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs Waiman Long
2017-02-20  4:20 ` Andrea Parri
2017-02-20  4:53   ` Boqun Feng
2017-02-20  4:58     ` Boqun Feng
2017-02-21 13:04       ` Will Deacon
2017-02-20 15:58     ` Waiman Long
2017-02-20 11:00   ` Peter Zijlstra
