linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] locking/qspinlock: Add stat tracking for pending vs slowpath
@ 2018-04-09 18:08 Waiman Long
  2018-04-09 18:08 ` [PATCH 2/2] locking/qspinlock: Limit # of spins in _Q_PENDING_VAL wait loop Waiman Long
  0 siblings, 1 reply; 6+ messages in thread
From: Waiman Long @ 2018-04-09 18:08 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Will Deacon, boqun.feng, catalin.marinas,
	Paul E. McKenney, Waiman Long

Currently, the qspinlock_stat code tracks only statistical counts in the
PV qspinlock code. However, it may also be useful to track the number
of locking operations done via the pending code vs. the MCS lock queue
slowpath for the non-PV case.

The qspinlock stat code is modified to do that. The stat counter
pv_lock_slowpath is renamed to lock_slowpath so that it can be used by
both the PV and non-PV cases.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c          | 14 +++++++++++---
 kernel/locking/qspinlock_paravirt.h |  7 +------
 kernel/locking/qspinlock_stat.h     |  9 ++++++---
 3 files changed, 18 insertions(+), 12 deletions(-)
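
For reference (not part of the patch itself): with CONFIG_QUEUED_LOCK_STAT=y
the counters are exposed through debugfs, so the new lock_pending and
lock_slowpath counts can be sampled from user space. A minimal sketch,
assuming debugfs is mounted at /sys/kernel/debug and the existing
qlockstat directory layout:

	#include <stdio.h>

	/* Read one qspinlock stat counter; returns -1 on error. */
	static long read_qstat(const char *name)
	{
		char path[128];
		long val = -1;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/debug/qlockstat/%s", name);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
		return val;
	}

	int main(void)
	{
		printf("lock_pending:  %ld\n", read_qstat("lock_pending"));
		printf("lock_slowpath: %ld\n", read_qstat("lock_slowpath"));
		return 0;
	}

The ratio of the two counts gives a rough idea of how often the pending
path is used instead of the MCS queue.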

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d880296..634a49b 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -12,11 +12,11 @@
  * GNU General Public License for more details.
  *
  * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
- * (C) Copyright 2013-2014 Red Hat, Inc.
+ * (C) Copyright 2013-2014,2018 Red Hat, Inc.
  * (C) Copyright 2015 Intel Corp.
  * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
  *
- * Authors: Waiman Long <waiman.long@hpe.com>
+ * Authors: Waiman Long <longman@redhat.com>
  *          Peter Zijlstra <peterz@infradead.org>
  */
 
@@ -33,6 +33,11 @@
 #include <asm/qspinlock.h>
 
 /*
+ * Include queued spinlock statistics code
+ */
+#include "qspinlock_stat.h"
+
+/*
  * The basic principle of a queue-based spinlock can best be understood
  * by studying a classic queue-based spinlock implementation called the
  * MCS lock. The paper below provides a good description for this kind
@@ -300,7 +305,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
 	if (pv_enabled())
-		goto queue;
+		goto pv_queue;
 
 	if (virt_spin_lock(lock))
 		return;
@@ -342,6 +347,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 
 		val = old;
 	}
+	qstat_inc(qstat_lock_pending, true);
 
 	/*
 	 * we won the trylock
@@ -374,6 +380,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * queuing.
 	 */
 queue:
+	qstat_inc(qstat_lock_slowpath, true);
+pv_queue:
 	node = this_cpu_ptr(&mcs_nodes[0]);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 6ee4777..39ca670 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -56,11 +56,6 @@ struct pv_node {
 };
 
 /*
- * Include queued spinlock statistics code
- */
-#include "qspinlock_stat.h"
-
-/*
  * Hybrid PV queued/unfair lock
  *
  * By replacing the regular queued_spin_trylock() with the function below,
@@ -443,7 +438,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
 	/*
 	 * Tracking # of slowpath locking operations
 	 */
-	qstat_inc(qstat_pv_lock_slowpath, true);
+	qstat_inc(qstat_lock_slowpath, true);
 
 	for (;; waitcnt++) {
 		/*
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 4a30ef6..6bd78c0 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -22,13 +22,14 @@
  *   pv_kick_wake	- # of vCPU kicks used for computing pv_latency_wake
  *   pv_latency_kick	- average latency (ns) of vCPU kick operation
  *   pv_latency_wake	- average latency (ns) from vCPU kick to wakeup
- *   pv_lock_slowpath	- # of locking operations via the slowpath
  *   pv_lock_stealing	- # of lock stealing operations
  *   pv_spurious_wakeup	- # of spurious wakeups in non-head vCPUs
  *   pv_wait_again	- # of wait's after a queue head vCPU kick
  *   pv_wait_early	- # of early vCPU wait's
  *   pv_wait_head	- # of vCPU wait's at the queue head
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
+ *   lock_pending	- # of locking operations via pending code
+ *   lock_slowpath	- # of locking operations via MCS lock queue
  *
  * Writing to the "reset_counters" file will reset all the above counter
  * values.
@@ -46,13 +47,14 @@ enum qlock_stats {
 	qstat_pv_kick_wake,
 	qstat_pv_latency_kick,
 	qstat_pv_latency_wake,
-	qstat_pv_lock_slowpath,
 	qstat_pv_lock_stealing,
 	qstat_pv_spurious_wakeup,
 	qstat_pv_wait_again,
 	qstat_pv_wait_early,
 	qstat_pv_wait_head,
 	qstat_pv_wait_node,
+	qstat_lock_pending,
+	qstat_lock_slowpath,
 	qstat_num,	/* Total number of statistical counters */
 	qstat_reset_cnts = qstat_num,
 };
@@ -73,12 +75,13 @@ enum qlock_stats {
 	[qstat_pv_spurious_wakeup] = "pv_spurious_wakeup",
 	[qstat_pv_latency_kick]	   = "pv_latency_kick",
 	[qstat_pv_latency_wake]    = "pv_latency_wake",
-	[qstat_pv_lock_slowpath]   = "pv_lock_slowpath",
 	[qstat_pv_lock_stealing]   = "pv_lock_stealing",
 	[qstat_pv_wait_again]      = "pv_wait_again",
 	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
+	[qstat_lock_pending]       = "lock_pending",
+	[qstat_lock_slowpath]      = "lock_slowpath",
 	[qstat_reset_cnts]         = "reset_counters",
 };
 
-- 
1.8.3.1


* [PATCH 2/2] locking/qspinlock: Limit # of spins in _Q_PENDING_VAL wait loop
  2018-04-09 18:08 [PATCH 1/2] locking/qspinlock: Add stat tracking for pending vs slowpath Waiman Long
@ 2018-04-09 18:08 ` Waiman Long
  2018-04-10 18:26   ` Will Deacon
  2018-04-11 15:22   ` Catalin Marinas
  0 siblings, 2 replies; 6+ messages in thread
From: Waiman Long @ 2018-04-09 18:08 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Will Deacon, boqun.feng, catalin.marinas,
	Paul E. McKenney, Waiman Long

A locker in the pending code path spins an unbounded number of times
while waiting for the _Q_PENDING_VAL to _Q_LOCKED_VAL transition. There
is a concern that lock starvation can happen when concurrent lockers
are able to take the lock in the cmpxchg loop without queuing and pass
it around amongst themselves.

To ensure forward progress while still taking advantage of the pending
code path without queuing, the code is now modified to do a limited
number of spins before aborting the effort and queuing instead.

Ideally, the spinning time should be at least a few times the typical
cacheline load time from memory, which I think can be as low as 100ns
or so per cacheline load on the newest systems, or up to several
hundred ns on older ones.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 634a49b..35367cc 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -82,6 +82,15 @@
 #endif
 
 /*
+ * The pending bit spinning loop count.
+ * This parameter can be overridden by another architecture specific
+ * constant. Default is 512.
+ */
+#ifndef _Q_PENDING_LOOP
+#define _Q_PENDING_LOOP	(1 << 9)
+#endif
+
+/*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
@@ -311,13 +320,19 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 		return;
 
 	/*
-	 * wait for in-progress pending->locked hand-overs
+	 * wait for in-progress pending->locked hand-overs with a
+	 * limited number of spins.
 	 *
 	 * 0,1,0 -> 0,0,1
 	 */
 	if (val == _Q_PENDING_VAL) {
-		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
+		int cnt = _Q_PENDING_LOOP;
+
+		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) {
+			if (!--cnt)
+				goto queue;
 			cpu_relax();
+		}
 	}
 
 	/*
-- 
1.8.3.1


* Re: [PATCH 2/2] locking/qspinlock: Limit # of spins in _Q_PENDING_VAL wait loop
  2018-04-09 18:08 ` [PATCH 2/2] locking/qspinlock: Limit # of spins in _Q_PENDING_VAL wait loop Waiman Long
@ 2018-04-10 18:26   ` Will Deacon
  2018-04-10 18:53     ` Waiman Long
  2018-04-11 15:22   ` Catalin Marinas
  1 sibling, 1 reply; 6+ messages in thread
From: Will Deacon @ 2018-04-10 18:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, boqun.feng,
	catalin.marinas, Paul E. McKenney

Hi Waiman,

On Mon, Apr 09, 2018 at 02:08:52PM -0400, Waiman Long wrote:
> A locker in the pending code path spins an unbounded number of times
> while waiting for the _Q_PENDING_VAL to _Q_LOCKED_VAL transition. There
> is a concern that lock starvation can happen when concurrent lockers
> are able to take the lock in the cmpxchg loop without queuing and pass
> it around amongst themselves.
> 
> To ensure forward progress while still taking advantage of the pending
> code path without queuing, the code is now modified to do a limited
> number of spins before aborting the effort and queuing instead.
> 
> Ideally, the spinning time should be at least a few times the typical
> cacheline load time from memory, which I think can be as low as 100ns
> or so per cacheline load on the newest systems, or up to several
> hundred ns on older ones.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/qspinlock.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 634a49b..35367cc 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -82,6 +82,15 @@
>  #endif
>  
>  /*
> + * The pending bit spinning loop count.
> + * This parameter can be overridden by another architecture specific
> + * constant. Default is 512.
> + */
> +#ifndef _Q_PENDING_LOOP
> +#define _Q_PENDING_LOOP	(1 << 9)
> +#endif

I really dislike heuristics like this because there's never a good number
to choose and it almost certainly varies between systems and workloads
rather than just by architecture. However, I've also not managed to come
up with something better.

If I rewrite your code slightly to look like:

	if (val == _Q_PENDING_VAL) {
		int cnt = _Q_PENDING_LOOP;
		val = atomic_cond_read_relaxed(&lock->val, (VAL != _Q_PENDING_VAL) || !cnt--);
	}

then architectures that implement atomic_cond_read_relaxed as something
more interesting than a spinning loop will probably be happy with
_Q_PENDING_LOOP == 1.
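
For architectures without a specialised implementation the behaviour
should be much the same as the open-coded loop above, since the generic
cond-load fallback is (roughly, paraphrasing the asm-generic pattern
rather than quoting it) a READ_ONCE()/cpu_relax() spin:

	#define smp_cond_load_relaxed(ptr, cond_expr) ({	\
		typeof(ptr) __PTR = (ptr);			\
		typeof(*ptr) VAL;				\
		for (;;) {					\
			VAL = READ_ONCE(*__PTR);		\
			if (cond_expr)				\
				break;				\
			cpu_relax();				\
		}						\
		VAL;						\
	})

atomic_cond_read_relaxed(v, c) then maps onto essentially
smp_cond_load_relaxed(&(v)->counter, (c)), while an implementation like
arm64's can wait for the cacheline to change instead of spinning.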

I'll post a v2 tomorrow with that change, and I'll add your stat patch to
the series too so that everything is kept together.

Will


* Re: [PATCH 2/2] locking/qspinlock: Limit # of spins in _Q_PENDING_VAL wait loop
  2018-04-10 18:26   ` Will Deacon
@ 2018-04-10 18:53     ` Waiman Long
  0 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2018-04-10 18:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, boqun.feng,
	catalin.marinas, Paul E. McKenney

On 04/10/2018 02:26 PM, Will Deacon wrote:
> Hi Waiman,
>
> On Mon, Apr 09, 2018 at 02:08:52PM -0400, Waiman Long wrote:
>> A locker in the pending code path spins an unbounded number of times
>> while waiting for the _Q_PENDING_VAL to _Q_LOCKED_VAL transition. There
>> is a concern that lock starvation can happen when concurrent lockers
>> are able to take the lock in the cmpxchg loop without queuing and pass
>> it around amongst themselves.
>>
>> To ensure forward progress while still taking advantage of the pending
>> code path without queuing, the code is now modified to do a limited
>> number of spins before aborting the effort and queuing instead.
>>
>> Ideally, the spinning time should be at least a few times the typical
>> cacheline load time from memory, which I think can be as low as 100ns
>> or so per cacheline load on the newest systems, or up to several
>> hundred ns on older ones.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/locking/qspinlock.c | 19 +++++++++++++++++--
>>  1 file changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 634a49b..35367cc 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -82,6 +82,15 @@
>>  #endif
>>  
>>  /*
>> + * The pending bit spinning loop count.
>> + * This parameter can be overridden by another architecture specific
>> + * constant. Default is 512.
>> + */
>> +#ifndef _Q_PENDING_LOOP
>> +#define _Q_PENDING_LOOP	(1 << 9)
>> +#endif
> I really dislike heuristics like this because there's never a good number
> to choose and it almost certainly varies between systems and workloads
> rather than just by architecture. However, I've also not managed to come
> up with something better.

I share your concern about heuristics like this, but I can't think of
another easy way out.

> If I rewrite your code slightly to look like:
>
> 	if (val == _Q_PENDING_VAL) {
> 		int cnt = _Q_PENDING_LOOP;
> 		val = atomic_cond_read_relaxed(&lock->val, (VAL != _Q_PENDING_VAL) || !cnt--);
> 	}
>
> then architectures that implement atomic_cond_read_relaxed as something
> more interesting than a spinning loop will probably be happy with
> _Q_PENDING_LOOP == 1.

Right. That is why I stated that _Q_PENDING_LOOP is an
architecture-specific constant.
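
(So, hypothetically, an architecture whose atomic_cond_read_relaxed()
can wait for the cacheline to change rather than spin could just
override the default in its asm/qspinlock.h, e.g.:)

	/* one pass through the cond-read is enough here */
	#define _Q_PENDING_LOOP	1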

Cheers,
Longman


* Re: [PATCH 2/2] locking/qspinlock: Limit # of spins in _Q_PENDING_VAL wait loop
  2018-04-09 18:08 ` [PATCH 2/2] locking/qspinlock: Limit # of spins in _Q_PENDING_VAL wait loop Waiman Long
  2018-04-10 18:26   ` Will Deacon
@ 2018-04-11 15:22   ` Catalin Marinas
  2018-04-11 18:06     ` Waiman Long
  1 sibling, 1 reply; 6+ messages in thread
From: Catalin Marinas @ 2018-04-11 15:22 UTC (permalink / raw)
  To: Waiman Long
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Will Deacon,
	boqun.feng, Paul E. McKenney

Hi Waiman,

On Mon, Apr 09, 2018 at 02:08:52PM -0400, Waiman Long wrote:
> @@ -311,13 +320,19 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  		return;
>  
>  	/*
> -	 * wait for in-progress pending->locked hand-overs
> +	 * wait for in-progress pending->locked hand-overs with a
> +	 * limited number of spins.
>  	 *
>  	 * 0,1,0 -> 0,0,1
>  	 */
>  	if (val == _Q_PENDING_VAL) {
> -		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
> +		int cnt = _Q_PENDING_LOOP;
> +
> +		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) {
> +			if (!--cnt)
> +				goto queue;
>  			cpu_relax();
> +		}
>  	}

In my model, the pathological case is not this loop but the following
one (trylock || pending):

P0:					P1:
queued_spin_lock() fails		queued_spin_lock() succeeds
queued_spin_lock_slowpath()
	val == _Q_LOCKED_VAL
	new = _Q_LOCKED_VAL |
		_Q_PENDING_VAL
					queued_spin_unlock()
						lock->val == 0
	cmpxchg(&lock->val, val, new)
		fails
	val = old (0)
	repeat for (;;) loop:
	new = _Q_LOCKED_VAL
					queued_spin_lock() succeeds
						lock->val == _Q_LOCKED_VAL
	cmpxchg(&lock->val, val, new)
		fails
	val = old (_Q_LOCKED_VAL)
	repeat for (;;) loop:

	... and we are back to the P0 state above when it first entered
	the loop, hence no progress. P1 never enters slowpath.
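
(For reference, the (trylock || pending) loop being discussed is roughly
the following; a simplified sketch of the slowpath at the time, not a
verbatim quote:)

	for (;;) {
		/*
		 * If we observe any contention; queue.
		 */
		if (val & ~_Q_LOCKED_MASK)
			goto queue;

		/*
		 * trylock || pending
		 *
		 * 0,0,0 -> 0,0,1 ; trylock
		 * 0,0,1 -> 0,1,1 ; pending
		 */
		new = _Q_LOCKED_VAL;
		if (val == new)
			new |= _Q_PENDING_VAL;

		old = atomic_cmpxchg_acquire(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}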

I think the pending bounded loop in your patch is needed for a
three-CPU scenario where two of them can hand over _Q_PENDING_VAL while
the third doesn't make progress. I tried modeling 3 CPUs to see, but
the tool still hits the for (;;) loop case rather than the pending wait
loop. Maybe a combination of Will's changes to the (trylock || pending)
loop with your bounded pending hand-over?

-- 
Catalin


* Re: [PATCH 2/2] locking/qspinlock: Limit # of spins in _Q_PENDING_VAL wait loop
  2018-04-11 15:22   ` Catalin Marinas
@ 2018-04-11 18:06     ` Waiman Long
  0 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2018-04-11 18:06 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Will Deacon,
	boqun.feng, Paul E. McKenney

On 04/11/2018 11:22 AM, Catalin Marinas wrote:
> Hi Waiman,
>
> On Mon, Apr 09, 2018 at 02:08:52PM -0400, Waiman Long wrote:
>> @@ -311,13 +320,19 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>  		return;
>>  
>>  	/*
>> -	 * wait for in-progress pending->locked hand-overs
>> +	 * wait for in-progress pending->locked hand-overs with a
>> +	 * limited number of spins.
>>  	 *
>>  	 * 0,1,0 -> 0,0,1
>>  	 */
>>  	if (val == _Q_PENDING_VAL) {
>> -		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
>> +		int cnt = _Q_PENDING_LOOP;
>> +
>> +		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL) {
>> +			if (!--cnt)
>> +				goto queue;
>>  			cpu_relax();
>> +		}
>>  	}
> In my model, the pathological case is not this loop but the following
> one (trylock || pending):
>
> P0:					P1:
> queued_spin_lock() fails		queued_spin_lock() succeeds
> queued_spin_lock_slowpath()
> 	val == _Q_LOCKED_VAL
> 	new = _Q_LOCKED_VAL |
> 		_Q_PENDING_VAL
> 					queued_spin_unlock()
> 						lock->val == 0
> 	cmpxchg(&lock->val, val, new)
> 		fails
> 	val = old (0)
> 	repeat for (;;) loop:
> 	new = _Q_LOCKED_VAL
> 					queued_spin_lock() succeeds
> 						lock->val == _Q_LOCKED_VAL
> 	cmpxchg(&lock->val, val, new)
> 		fails
> 	val = old (_Q_LOCKED_VAL)
> 	repeat for (;;) loop:
>
> 	... and we are back to the P0 state above when it first entered
> 	the loop, hence no progress. P1 never enters slowpath.

I don't see any problem in removing this second loop. Thanks for
running the tool to check for problems.

-Longman


Thread overview: 6+ messages
2018-04-09 18:08 [PATCH 1/2] locking/qspinlock: Add stat tracking for pending vs slowpath Waiman Long
2018-04-09 18:08 ` [PATCH 2/2] locking/qspinlock: Limit # of spins in _Q_PENDING_VAL wait loop Waiman Long
2018-04-10 18:26   ` Will Deacon
2018-04-10 18:53     ` Waiman Long
2018-04-11 15:22   ` Catalin Marinas
2018-04-11 18:06     ` Waiman Long
