linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] locking/qspinlock: Handle > 4 nesting levels
@ 2019-01-23  3:49 Waiman Long
  2019-01-23  3:49 ` [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath " Waiman Long
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Waiman Long @ 2019-01-23  3:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

 v2:
  - Use the simple trylock loop as suggested by PeterZ.

The current code allows up to 4 levels of nested slowpath spinlock calls.
That should be enough for process, soft IRQ, hard IRQ, and NMI contexts.
In the unfortunate event of nested NMIs, each arriving with a slowpath
spinlock call pending at the previous level, we will run out of usable
MCS nodes for queuing.

In this case, we fall back to a simple TAS lock and spin on the lock
cacheline until the lock is free. This is not the most elegant solution,
but it is simple enough.
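
In essence, the fallback is just a trylock spin loop, roughly like the
sketch below (this mirrors the hunk added by patch 1, with an extra
comment for clarity):

	if (unlikely(idx >= MAX_NODES)) {
		/* No MCS node available: spin on the lock word directly */
		while (!queued_spin_trylock(lock))
			cpu_relax();
		goto release;
	}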

Patch 1 implements the TAS loop when all the existing MCS nodes are
occupied.

Patches 2-4 enhance the locking statistics code to track the new code
path and allow the statistics to be enabled on other architectures such
as arm64.

By setting MAX_NODES to 1, the new code path gets exercised during the
boot process, as demonstrated by the stat counter values shown below on
a 1-socket 22-core 44-thread x86-64 system after booting up the new
kernel.

  lock_no_node=20
  lock_pending=29660
  lock_slowpath=172714
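
For scale, that is about 20 out of 172714 slowpath acquisitions (~0.01%)
hitting the no-node fallback during boot, and only because MAX_NODES was
artificially lowered to 1 for this test.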

Waiman Long (4):
  locking/qspinlock: Handle > 4 slowpath nesting levels
  locking/qspinlock_stat: Track the no MCS node available case
  locking/qspinlock_stat: Separate out the PV specific stat counts
  locking/qspinlock_stat: Allow QUEUED_LOCK_STAT for all archs

 arch/Kconfig                    |   7 ++
 arch/x86/Kconfig                |   8 ---
 kernel/locking/qspinlock.c      |  18 ++++-
 kernel/locking/qspinlock_stat.h | 150 +++++++++++++++++++++++++---------------
 4 files changed, 120 insertions(+), 63 deletions(-)

-- 
1.8.3.1



* [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath nesting levels
  2019-01-23  3:49 [PATCH v2 0/4] locking/qspinlock: Handle > 4 nesting levels Waiman Long
@ 2019-01-23  3:49 ` Waiman Long
  2019-01-23  9:34   ` Will Deacon
  2019-01-23  3:49 ` [PATCH v2 2/4] locking/qspinlock_stat: Track the no MCS node available case Waiman Long
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Waiman Long @ 2019-01-23  3:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

Four queue nodes per cpu are allocated to enable up to 4 nesting levels
using the per-cpu nodes. Nested NMIs are possible on some architectures.
Still, it is very unlikely that we will ever hit more than 4 nested
levels with contention in the slowpath.

When that rare condition happens, however, it is likely that the system
will hang or crash shortly after that. It is not good and we need to
handle this exception case.

This is done by spinning directly on the lock using repeated trylock
calls. This alternative code path should only be used when there are
nested NMIs. Assuming that the locks used by those NMI handlers will not
be heavily contended, simple TAS locking should work out.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 8a8c3c2..0875053 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -412,6 +412,21 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
 
+	/*
+	 * 4 nodes are allocated based on the assumption that there will
+	 * not be nested NMIs taking spinlocks. That may not be true in
+	 * some architectures even though the chance of needing more than
+	 * 4 nodes will still be extremely unlikely. When that happens,
+	 * we fall back to spinning on the lock directly without using
+	 * any MCS node. This is not the most elegant solution, but is
+	 * simple enough.
+	 */
+	if (unlikely(idx >= MAX_NODES)) {
+		while (!queued_spin_trylock(lock))
+			cpu_relax();
+		goto release;
+	}
+
 	node = grab_mcs_node(node, idx);
 
 	/*
-- 
1.8.3.1



* [PATCH v2 2/4] locking/qspinlock_stat: Track the no MCS node available case
  2019-01-23  3:49 [PATCH v2 0/4] locking/qspinlock: Handle > 4 nesting levels Waiman Long
  2019-01-23  3:49 ` [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath " Waiman Long
@ 2019-01-23  3:49 ` Waiman Long
  2019-01-23  9:23   ` Will Deacon
  2019-01-23  3:49 ` [PATCH v2 3/4] locking/qspinlock_stat: Separate out the PV specific stat counts Waiman Long
  2019-01-23  3:49 ` [PATCH v2 4/4] locking/qspinlock_stat: Allow QUEUED_LOCK_STAT for all archs Waiman Long
  3 siblings, 1 reply; 11+ messages in thread
From: Waiman Long @ 2019-01-23  3:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

Track the number of slowpath locking operations that are being done
without any MCS node available, as well as renaming lock_index[123] to
make them more descriptive.

Using these stat counters is one way to find out if a code path is
being exercised.
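
Once enabled, each counter is exposed as a file in the qspinlock
statistics directory in debugfs, so a quick check of the new code path
is something like (assuming the usual debugfs mount point and the
qlockstat directory name used by this file):

  cat /sys/kernel/debug/qlockstat/lock_no_node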

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c      |  3 ++-
 kernel/locking/qspinlock_stat.h | 21 +++++++++++++++------
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0875053..21ee51b 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -422,6 +422,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 * simple enough.
 	 */
 	if (unlikely(idx >= MAX_NODES)) {
+		qstat_inc(qstat_lock_no_node, true);
 		while (!queued_spin_trylock(lock))
 			cpu_relax();
 		goto release;
@@ -432,7 +433,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/*
 	 * Keep counts of non-zero index values:
 	 */
-	qstat_inc(qstat_lock_idx1 + idx - 1, idx);
+	qstat_inc(qstat_lock_use_node2 + idx - 1, idx);
 
 	/*
 	 * Ensure that we increment the head node->count before initialising
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 42d3d8d..31728f6 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -30,6 +30,13 @@
  *   pv_wait_node	- # of vCPU wait's at a non-head queue node
  *   lock_pending	- # of locking operations via pending code
  *   lock_slowpath	- # of locking operations via MCS lock queue
+ *   lock_use_node2	- # of locking operations that use 2nd percpu node
+ *   lock_use_node3	- # of locking operations that use 3rd percpu node
+ *   lock_use_node4	- # of locking operations that use 4th percpu node
+ *   lock_no_node	- # of locking operations without using percpu node
+ *
+ * Subtraccting lock_use_node[234] from lock_slowpath will give you
+ * lock_use_node1.
  *
  * Writing to the "reset_counters" file will reset all the above counter
  * values.
@@ -55,9 +62,10 @@ enum qlock_stats {
 	qstat_pv_wait_node,
 	qstat_lock_pending,
 	qstat_lock_slowpath,
-	qstat_lock_idx1,
-	qstat_lock_idx2,
-	qstat_lock_idx3,
+	qstat_lock_use_node2,
+	qstat_lock_use_node3,
+	qstat_lock_use_node4,
+	qstat_lock_no_node,
 	qstat_num,	/* Total number of statistical counters */
 	qstat_reset_cnts = qstat_num,
 };
@@ -85,9 +93,10 @@ enum qlock_stats {
 	[qstat_pv_wait_node]       = "pv_wait_node",
 	[qstat_lock_pending]       = "lock_pending",
 	[qstat_lock_slowpath]      = "lock_slowpath",
-	[qstat_lock_idx1]	   = "lock_index1",
-	[qstat_lock_idx2]	   = "lock_index2",
-	[qstat_lock_idx3]	   = "lock_index3",
+	[qstat_lock_use_node2]	   = "lock_use_node2",
+	[qstat_lock_use_node3]	   = "lock_use_node3",
+	[qstat_lock_use_node4]	   = "lock_use_node4",
+	[qstat_lock_no_node]	   = "lock_no_node",
 	[qstat_reset_cnts]         = "reset_counters",
 };
 
-- 
1.8.3.1



* [PATCH v2 3/4] locking/qspinlock_stat: Separate out the PV specific stat counts
  2019-01-23  3:49 [PATCH v2 0/4] locking/qspinlock: Handle > 4 nesting levels Waiman Long
  2019-01-23  3:49 ` [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath " Waiman Long
  2019-01-23  3:49 ` [PATCH v2 2/4] locking/qspinlock_stat: Track the no MCS node available case Waiman Long
@ 2019-01-23  3:49 ` Waiman Long
  2019-01-23  3:49 ` [PATCH v2 4/4] locking/qspinlock_stat: Allow QUEUED_LOCK_STAT for all archs Waiman Long
  3 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2019-01-23  3:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

Some of the statistics counts are for PV qspinlocks only and are not
applicable when PARAVIRT_SPINLOCKS isn't configured. So make those
counts dependent on the PARAVIRT_SPINLOCKS config option.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock_stat.h | 129 +++++++++++++++++++++++++---------------
 1 file changed, 81 insertions(+), 48 deletions(-)

diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 31728f6..7a0a848 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -49,6 +49,7 @@
  * There may be slight difference between pv_kick_wake and pv_kick_unlock.
  */
 enum qlock_stats {
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 	qstat_pv_hash_hops,
 	qstat_pv_kick_unlock,
 	qstat_pv_kick_wake,
@@ -60,6 +61,7 @@ enum qlock_stats {
 	qstat_pv_wait_early,
 	qstat_pv_wait_head,
 	qstat_pv_wait_node,
+#endif
 	qstat_lock_pending,
 	qstat_lock_slowpath,
 	qstat_lock_use_node2,
@@ -80,6 +82,7 @@ enum qlock_stats {
 #include <linux/fs.h>
 
 static const char * const qstat_names[qstat_num + 1] = {
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
 	[qstat_pv_hash_hops]	   = "pv_hash_hops",
 	[qstat_pv_kick_unlock]     = "pv_kick_unlock",
 	[qstat_pv_kick_wake]       = "pv_kick_wake",
@@ -91,6 +94,7 @@ enum qlock_stats {
 	[qstat_pv_wait_early]      = "pv_wait_early",
 	[qstat_pv_wait_head]       = "pv_wait_head",
 	[qstat_pv_wait_node]       = "pv_wait_node",
+#endif
 	[qstat_lock_pending]       = "lock_pending",
 	[qstat_lock_slowpath]      = "lock_slowpath",
 	[qstat_lock_use_node2]	   = "lock_use_node2",
@@ -104,6 +108,20 @@ enum qlock_stats {
  * Per-cpu counters
  */
 static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);
+
+/*
+ * Increment the PV qspinlock statistical counters
+ */
+static inline void qstat_inc(enum qlock_stats stat, bool cond)
+{
+	if (cond)
+		this_cpu_inc(qstats[stat]);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/*
+ * PV specific per-cpu counters
+ */
 static DEFINE_PER_CPU(u64, pv_kick_time);
 
 /*
@@ -178,6 +196,69 @@ static ssize_t qstat_read(struct file *file, char __user *user_buf,
 }
 
 /*
+ * PV hash hop count
+ */
+static inline void qstat_hop(int hopcnt)
+{
+	this_cpu_add(qstats[qstat_pv_hash_hops], hopcnt);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+	u64 start = sched_clock();
+
+	per_cpu(pv_kick_time, cpu) = start;
+	pv_kick(cpu);
+	this_cpu_add(qstats[qstat_pv_latency_kick], sched_clock() - start);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+	*pkick_time = 0;
+	pv_wait(ptr, val);
+	if (*pkick_time) {
+		this_cpu_add(qstats[qstat_pv_latency_wake],
+			     sched_clock() - *pkick_time);
+		qstat_inc(qstat_pv_kick_wake, true);
+	}
+}
+
+#define pv_kick(c)	__pv_kick(c)
+#define pv_wait(p, v)	__pv_wait(p, v)
+
+#else /* CONFIG_PARAVIRT_SPINLOCKS */
+static ssize_t qstat_read(struct file *file, char __user *user_buf,
+			  size_t count, loff_t *ppos)
+{
+	char buf[64];
+	int cpu, counter, len;
+	u64 stat = 0;
+
+	/*
+	 * Get the counter ID stored in file->f_inode->i_private
+	 */
+	counter = (long)file_inode(file)->i_private;
+
+	if (counter >= qstat_num)
+		return -EBADF;
+
+	for_each_possible_cpu(cpu)
+		stat += per_cpu(qstats[counter], cpu);
+	len = snprintf(buf, sizeof(buf) - 1, "%llu\n", stat);
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+/*
  * Function to handle write request
  *
  * When counter = reset_cnts, reset all the counter values.
@@ -250,54 +331,6 @@ static int __init init_qspinlock_stat(void)
 }
 fs_initcall(init_qspinlock_stat);
 
-/*
- * Increment the PV qspinlock statistical counters
- */
-static inline void qstat_inc(enum qlock_stats stat, bool cond)
-{
-	if (cond)
-		this_cpu_inc(qstats[stat]);
-}
-
-/*
- * PV hash hop count
- */
-static inline void qstat_hop(int hopcnt)
-{
-	this_cpu_add(qstats[qstat_pv_hash_hops], hopcnt);
-}
-
-/*
- * Replacement function for pv_kick()
- */
-static inline void __pv_kick(int cpu)
-{
-	u64 start = sched_clock();
-
-	per_cpu(pv_kick_time, cpu) = start;
-	pv_kick(cpu);
-	this_cpu_add(qstats[qstat_pv_latency_kick], sched_clock() - start);
-}
-
-/*
- * Replacement function for pv_wait()
- */
-static inline void __pv_wait(u8 *ptr, u8 val)
-{
-	u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
-
-	*pkick_time = 0;
-	pv_wait(ptr, val);
-	if (*pkick_time) {
-		this_cpu_add(qstats[qstat_pv_latency_wake],
-			     sched_clock() - *pkick_time);
-		qstat_inc(qstat_pv_kick_wake, true);
-	}
-}
-
-#define pv_kick(c)	__pv_kick(c)
-#define pv_wait(p, v)	__pv_wait(p, v)
-
 #else /* CONFIG_QUEUED_LOCK_STAT */
 
 static inline void qstat_inc(enum qlock_stats stat, bool cond)	{ }
-- 
1.8.3.1



* [PATCH v2 4/4] locking/qspinlock_stat: Allow QUEUED_LOCK_STAT for all archs
  2019-01-23  3:49 [PATCH v2 0/4] locking/qspinlock: Handle > 4 nesting levels Waiman Long
                   ` (2 preceding siblings ...)
  2019-01-23  3:49 ` [PATCH v2 3/4] locking/qspinlock_stat: Separate out the PV specific stat counts Waiman Long
@ 2019-01-23  3:49 ` Waiman Long
  3 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2019-01-23  3:49 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Thomas Gleixner,
	Borislav Petkov, H. Peter Anvin
  Cc: linux-kernel, linux-arch, x86, Zhenzhong Duan, James Morse,
	SRINIVAS, Waiman Long

The QUEUED_LOCK_STAT option to report queued spinlock statistics was
previously allowed only on the x86 architecture. Now that queued
spinlocks are used on multiple architectures, allow QUEUED_LOCK_STAT to
be enabled on any architecture that uses them. The option is now listed
as part of the general architecture-dependent options.
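
On an architecture that already uses queued spinlocks, turning the
statistics on then comes down to a config fragment like (illustrative):

  CONFIG_DEBUG_FS=y
  CONFIG_QUEUED_LOCK_STAT=y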

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/Kconfig     | 7 +++++++
 arch/x86/Kconfig | 8 --------
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de..c82e32f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -885,6 +885,13 @@ config HAVE_ARCH_PREL32_RELOCATIONS
 	  architectures, and don't require runtime relocation on relocatable
 	  kernels.
 
+config QUEUED_LOCK_STAT
+	bool "Queued spinlock statistics"
+	depends on QUEUED_SPINLOCKS && DEBUG_FS
+	---help---
+	  Enable the collection of statistical data on the slowpath
+	  behavior of queued spinlocks and report them on debugfs.
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4b4a7f3..872e681 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -784,14 +784,6 @@ config PARAVIRT_SPINLOCKS
 
 	  If you are unsure how to answer this question, answer Y.
 
-config QUEUED_LOCK_STAT
-	bool "Paravirt queued spinlock statistics"
-	depends on PARAVIRT_SPINLOCKS && DEBUG_FS
-	---help---
-	  Enable the collection of statistical data on the slowpath
-	  behavior of paravirtualized queued spinlocks and report
-	  them on debugfs.
-
 source "arch/x86/xen/Kconfig"
 
 config KVM_GUEST
-- 
1.8.3.1



* Re: [PATCH v2 2/4] locking/qspinlock_stat: Track the no MCS node available case
  2019-01-23  3:49 ` [PATCH v2 2/4] locking/qspinlock_stat: Track the no MCS node available case Waiman Long
@ 2019-01-23  9:23   ` Will Deacon
  2019-01-23 20:04     ` Waiman Long
  0 siblings, 1 reply; 11+ messages in thread
From: Will Deacon @ 2019-01-23  9:23 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, linux-arch, x86, Zhenzhong Duan,
	James Morse, SRINIVAS

On Tue, Jan 22, 2019 at 10:49:09PM -0500, Waiman Long wrote:
> Track the number of slowpath locking operations that are being done
> without any MCS node available, as well as renaming lock_index[123] to
> make them more descriptive.
> 
> Using these stat counters is one way to find out if a code path is
> being exercised.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/qspinlock.c      |  3 ++-
>  kernel/locking/qspinlock_stat.h | 21 +++++++++++++++------
>  2 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 0875053..21ee51b 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -422,6 +422,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	 * simple enough.
>  	 */
>  	if (unlikely(idx >= MAX_NODES)) {
> +		qstat_inc(qstat_lock_no_node, true);
>  		while (!queued_spin_trylock(lock))
>  			cpu_relax();
>  		goto release;
> @@ -432,7 +433,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	/*
>  	 * Keep counts of non-zero index values:
>  	 */
> -	qstat_inc(qstat_lock_idx1 + idx - 1, idx);
> +	qstat_inc(qstat_lock_use_node2 + idx - 1, idx);
>  
>  	/*
>  	 * Ensure that we increment the head node->count before initialising
> diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
> index 42d3d8d..31728f6 100644
> --- a/kernel/locking/qspinlock_stat.h
> +++ b/kernel/locking/qspinlock_stat.h
> @@ -30,6 +30,13 @@
>   *   pv_wait_node	- # of vCPU wait's at a non-head queue node
>   *   lock_pending	- # of locking operations via pending code
>   *   lock_slowpath	- # of locking operations via MCS lock queue
> + *   lock_use_node2	- # of locking operations that use 2nd percpu node
> + *   lock_use_node3	- # of locking operations that use 3rd percpu node
> + *   lock_use_node4	- # of locking operations that use 4th percpu node
> + *   lock_no_node	- # of locking operations without using percpu node
> + *
> + * Subtraccting lock_use_node[234] from lock_slowpath will give you
> + * lock_use_node1.

Typo: "subtraccting"

Will


* Re: [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath nesting levels
  2019-01-23  3:49 ` [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath " Waiman Long
@ 2019-01-23  9:34   ` Will Deacon
  2019-01-23 20:11     ` Waiman Long
  0 siblings, 1 reply; 11+ messages in thread
From: Will Deacon @ 2019-01-23  9:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, linux-arch, x86, Zhenzhong Duan,
	James Morse, SRINIVAS

On Tue, Jan 22, 2019 at 10:49:08PM -0500, Waiman Long wrote:
> Four queue nodes per cpu are allocated to enable up to 4 nesting levels
> using the per-cpu nodes. Nested NMIs are possible on some architectures.
> Still, it is very unlikely that we will ever hit more than 4 nested
> levels with contention in the slowpath.
> 
> When that rare condition happens, however, it is likely that the system
> will hang or crash shortly after that. It is not good and we need to
> handle this exception case.
> 
> This is done by spinning directly on the lock using repeated trylock
> calls. This alternative code path should only be used when there are
> nested NMIs. Assuming that the locks used by those NMI handlers will not
> be heavily contended, simple TAS locking should work out.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  kernel/locking/qspinlock.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 8a8c3c2..0875053 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -412,6 +412,21 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>  	idx = node->count++;
>  	tail = encode_tail(smp_processor_id(), idx);

Does the compiler generate better code if we move the tail assignment
further down, closer to the xchg_tail() call?
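
For reference, the rearrangement being asked about would be roughly
(untested sketch, not part of the patch):

	idx = node->count++;

	if (unlikely(idx >= MAX_NODES)) {
		while (!queued_spin_trylock(lock))
			cpu_relax();
		goto release;
	}

	node = grab_mcs_node(node, idx);
	...
	/* tail is first consumed by xchg_tail(), so compute it here instead */
	tail = encode_tail(smp_processor_id(), idx);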

> +	/*
> +	 * 4 nodes are allocated based on the assumption that there will
> +	 * not be nested NMIs taking spinlocks. That may not be true in
> +	 * some architectures even though the chance of needing more than
> +	 * 4 nodes will still be extremely unlikely. When that happens,
> +	 * we fall back to spinning on the lock directly without using
> +	 * any MCS node. This is not the most elegant solution, but is
> +	 * simple enough.
> +	 */
> +	if (unlikely(idx >= MAX_NODES)) {
> +		while (!queued_spin_trylock(lock))
> +			cpu_relax();
> +		goto release;
> +	}

Acked-by: Will Deacon <will.deacon@arm.com>

Will


* Re: [PATCH v2 2/4] locking/qspinlock_stat: Track the no MCS node available case
  2019-01-23  9:23   ` Will Deacon
@ 2019-01-23 20:04     ` Waiman Long
  0 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2019-01-23 20:04 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, linux-arch, x86, Zhenzhong Duan,
	James Morse, SRINIVAS

On 01/23/2019 04:23 AM, Will Deacon wrote:
> On Tue, Jan 22, 2019 at 10:49:09PM -0500, Waiman Long wrote:
>> Track the number of slowpath locking operations that are being done
>> without any MCS node available, as well as renaming lock_index[123] to
>> make them more descriptive.
>>
>> Using these stat counters is one way to find out if a code path is
>> being exercised.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/locking/qspinlock.c      |  3 ++-
>>  kernel/locking/qspinlock_stat.h | 21 +++++++++++++++------
>>  2 files changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 0875053..21ee51b 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -422,6 +422,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>  	 * simple enough.
>>  	 */
>>  	if (unlikely(idx >= MAX_NODES)) {
>> +		qstat_inc(qstat_lock_no_node, true);
>>  		while (!queued_spin_trylock(lock))
>>  			cpu_relax();
>>  		goto release;
>> @@ -432,7 +433,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>  	/*
>>  	 * Keep counts of non-zero index values:
>>  	 */
>> -	qstat_inc(qstat_lock_idx1 + idx - 1, idx);
>> +	qstat_inc(qstat_lock_use_node2 + idx - 1, idx);
>>  
>>  	/*
>>  	 * Ensure that we increment the head node->count before initialising
>> diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
>> index 42d3d8d..31728f6 100644
>> --- a/kernel/locking/qspinlock_stat.h
>> +++ b/kernel/locking/qspinlock_stat.h
>> @@ -30,6 +30,13 @@
>>   *   pv_wait_node	- # of vCPU wait's at a non-head queue node
>>   *   lock_pending	- # of locking operations via pending code
>>   *   lock_slowpath	- # of locking operations via MCS lock queue
>> + *   lock_use_node2	- # of locking operations that use 2nd percpu node
>> + *   lock_use_node3	- # of locking operations that use 3rd percpu node
>> + *   lock_use_node4	- # of locking operations that use 4th percpu node
>> + *   lock_no_node	- # of locking operations without using percpu node
>> + *
>> + * Subtraccting lock_use_node[234] from lock_slowpath will give you
>> + * lock_use_node1.
> Typo: "subtraccting"
>
> Will

Thanks for catching that.

Cheers,
Longman



* Re: [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath nesting levels
  2019-01-23  9:34   ` Will Deacon
@ 2019-01-23 20:11     ` Waiman Long
  2019-01-23 20:40       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Waiman Long @ 2019-01-23 20:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, linux-arch, x86, Zhenzhong Duan,
	James Morse, SRINIVAS

On 01/23/2019 04:34 AM, Will Deacon wrote:
> On Tue, Jan 22, 2019 at 10:49:08PM -0500, Waiman Long wrote:
>> Four queue nodes per cpu are allocated to enable up to 4 nesting levels
>> using the per-cpu nodes. Nested NMIs are possible on some architectures.
>> Still, it is very unlikely that we will ever hit more than 4 nested
>> levels with contention in the slowpath.
>>
>> When that rare condition happens, however, it is likely that the system
>> will hang or crash shortly after that. It is not good and we need to
>> handle this exception case.
>>
>> This is done by spinning directly on the lock using repeated trylock
>> calls. This alternative code path should only be used when there are
>> nested NMIs. Assuming that the locks used by those NMI handlers will not
>> be heavily contended, simple TAS locking should work out.
>>
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  kernel/locking/qspinlock.c | 15 +++++++++++++++
>>  1 file changed, 15 insertions(+)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 8a8c3c2..0875053 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -412,6 +412,21 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>  	idx = node->count++;
>>  	tail = encode_tail(smp_processor_id(), idx);
> Does the compiler generate better code if we move the tail assignment
> further down, closer to the xchg_tail() call?
>
>> +	/*
>> +	 * 4 nodes are allocated based on the assumption that there will
>> +	 * not be nested NMIs taking spinlocks. That may not be true in
>> +	 * some architectures even though the chance of needing more than
>> +	 * 4 nodes will still be extremely unlikely. When that happens,
>> +	 * we fall back to spinning on the lock directly without using
>> +	 * any MCS node. This is not the most elegant solution, but is
>> +	 * simple enough.
>> +	 */
>> +	if (unlikely(idx >= MAX_NODES)) {
>> +		while (!queued_spin_trylock(lock))
>> +			cpu_relax();
>> +		goto release;
>> +	}
> Acked-by: Will Deacon <will.deacon@arm.com>
>
> Will

Looking at the generated x86 code:

424        if (unlikely(idx >= MAX_NODES)) {
   0x00000000000003ce <+206>:    test   %ecx,%ecx
   0x00000000000003d0 <+208>:    jg     0x4c6 <native_queued_spin_lock_slowpath+454>

425            qstat_inc(qstat_lock_no_node, true);
426            while (!queued_spin_trylock(lock))

   0x00000000000004c2 <+450>:    jne    0x482 <native_queued_spin_lock_slowpath+386>
   0x00000000000004c4 <+452>:    jmp    0x491 <native_queued_spin_lock_slowpath+401>
   0x00000000000004c6 <+454>:    incq   %gs:0x0(%rip)        # 0x4ce <native_queued_spin_lock_slowpath+462>
   0x00000000000004ce <+462>:    mov    $0x1,%edx
   0x00000000000004d3 <+467>:    jmp    0x4d7 <native_queued_spin_lock_slowpath+471>
   0x00000000000004d5 <+469>:    pause
   0x00000000000004d7 <+471>:    mov    (%rdi),%eax
   0x00000000000004d9 <+473>:    test   %eax,%eax
   0x00000000000004db <+475>:    jne    0x4d5 <native_queued_spin_lock_slowpath+469>
   0x00000000000004dd <+477>:    lock cmpxchg %edx,(%rdi)
   0x00000000000004e1 <+481>:    jne    0x4d5 <native_queued_spin_lock_slowpath+469>
   0x00000000000004e3 <+483>:    jmp    0x491 <native_queued_spin_lock_slowpath+4

MAX_NODES was modified to 1 in the test kernel.

So the additional code checks the idx value and branches to the end of the
function when the condition is true. There isn't too much overhead here.

Cheers,
Longman



* Re: [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath nesting levels
  2019-01-23 20:11     ` Waiman Long
@ 2019-01-23 20:40       ` Peter Zijlstra
  2019-01-23 22:36         ` Waiman Long
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2019-01-23 20:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Will Deacon, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, linux-arch, x86, Zhenzhong Duan,
	James Morse, SRINIVAS

On Wed, Jan 23, 2019 at 03:11:19PM -0500, Waiman Long wrote:
> On 01/23/2019 04:34 AM, Will Deacon wrote:
> > On Tue, Jan 22, 2019 at 10:49:08PM -0500, Waiman Long wrote:

> >> @@ -412,6 +412,21 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >>  	idx = node->count++;
> >>  	tail = encode_tail(smp_processor_id(), idx);

> >> +	if (unlikely(idx >= MAX_NODES)) {
> >> +		while (!queued_spin_trylock(lock))
> >> +			cpu_relax();
> >> +		goto release;
> >> +	}

> So the additional code checks the idx value and branches to the end of the
> function when the condition is true. There isn't too much overhead here.

So something horrible we could do (and I'm not at all advocating we do
this), is invert node->count. That is, start at 3 and decrement and
detect sign flips.

That avoids the additional compare. It would require we change the
structure layout though, otherwise we keep hitting that second line by
default, which would suck.
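
Roughly, the idea would look something like this (purely illustrative,
not proposed code; release would then have to increment the count back
up instead of decrementing it):

	idx = node->count--;		/* count starts at MAX_NODES - 1 */
	if (unlikely(idx < 0)) {	/* went negative: no MCS node free */
		while (!queued_spin_trylock(lock))
			cpu_relax();
		goto release;
	}
	...
	node = grab_mcs_node(node, idx);

The decrement already sets the sign flag, so the out-of-nodes test needs
no separate compare instruction; the cost is that the node array layout
would have to change so that node[MAX_NODES - 1], now the first one
used, sits on the first cache line.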


* Re: [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath nesting levels
  2019-01-23 20:40       ` Peter Zijlstra
@ 2019-01-23 22:36         ` Waiman Long
  0 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2019-01-23 22:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	H. Peter Anvin, linux-kernel, linux-arch, x86, Zhenzhong Duan,
	James Morse, SRINIVAS

On 01/23/2019 03:40 PM, Peter Zijlstra wrote:
> On Wed, Jan 23, 2019 at 03:11:19PM -0500, Waiman Long wrote:
>> On 01/23/2019 04:34 AM, Will Deacon wrote:
>>> On Tue, Jan 22, 2019 at 10:49:08PM -0500, Waiman Long wrote:
>>>> @@ -412,6 +412,21 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>>>  	idx = node->count++;
>>>>  	tail = encode_tail(smp_processor_id(), idx);
>>>> +	if (unlikely(idx >= MAX_NODES)) {
>>>> +		while (!queued_spin_trylock(lock))
>>>> +			cpu_relax();
>>>> +		goto release;
>>>> +	}
>> So the additional code checks the idx value and branches to the end of the
>> function when the condition is true. There isn't too much overhead here.
> So something horrible we could do (and I'm not at all advocating we do
> this), is invert node->count. That is, start at 3 and decrement and
> detect sign flips.
>
> That avoids the additional compare. It would require we change the
> structure layout though, otherwise we keep hitting that second line by
> default, which would suck.

The cost of the additional compare will not be noticeable if the branch
prediction logic is working properly. Inverting the loop logic, however,
would be a much bigger change, with no guarantee that it will be faster
anyway. So I don't think we should go down this route :-)

Cheers,
Longman



Thread overview: 11+ messages
2019-01-23  3:49 [PATCH v2 0/4] locking/qspinlock: Handle > 4 nesting levels Waiman Long
2019-01-23  3:49 ` [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath " Waiman Long
2019-01-23  9:34   ` Will Deacon
2019-01-23 20:11     ` Waiman Long
2019-01-23 20:40       ` Peter Zijlstra
2019-01-23 22:36         ` Waiman Long
2019-01-23  3:49 ` [PATCH v2 2/4] locking/qspinlock_stat: Track the no MCS node available case Waiman Long
2019-01-23  9:23   ` Will Deacon
2019-01-23 20:04     ` Waiman Long
2019-01-23  3:49 ` [PATCH v2 3/4] locking/qspinlock_stat: Separate out the PV specific stat counts Waiman Long
2019-01-23  3:49 ` [PATCH v2 4/4] locking/qspinlock_stat: Allow QUEUED_LOCK_STAT for all archs Waiman Long
