* [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths
@ 2018-10-16 13:45 Waiman Long
  2018-10-16 13:45 ` [PATCH 2/2] locking/pvqspinlock: Extend node size when pvqspinlock is configured Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Waiman Long @ 2018-10-16 13:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon; +Cc: linux-kernel, Waiman Long

The queued spinlock code supports up to 4 levels of lock slowpath nesting -
user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
often such nesting actually happens. So 3 more per-cpu stat counters are
added to track the number of instances where the nesting index goes to
1, 2 and 3 respectively.
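
To illustrate the indexing scheme, here is a minimal user-space sketch
(the array and helper below are made up for illustration and stand in
for the kernel's per-cpu qstat counters): a non-zero per-cpu node index
simply selects one of the three new counters.

static unsigned long lock_index_count[3];	/* lock_index1 .. lock_index3 */

static inline void count_nesting(int idx)
{
	/* idx == 0 is the non-nested case and is not counted */
	if (idx)
		lock_index_count[idx - 1]++;
}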

On a dual-socket 64-core 128-thread Zen server, the following were the
stat counter values under different circumstances:

         State                      slowpath   index1   index2   index3
         -----                      --------   ------   ------   -------
  After bootup                      1,012,150    82       0        0
  After parallel build + perf-top 125,195,009    82       0        0

So the chance of needing more than 2 levels of nesting is extremely low.
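
The counters can be read back from the per-counter debugfs files
exported by the qspinlock stat code. A rough user-space sketch for
collecting them follows; the /sys/kernel/debug/qlockstat path and the
dependency on CONFIG_QUEUED_LOCK_STAT are assumptions about the usual
setup rather than something added by this patch.

#include <stdio.h>

int main(void)
{
	const char *files[] = { "lock_slowpath", "lock_index1",
				"lock_index2", "lock_index3" };
	char path[128];
	unsigned long long val;

	for (int i = 0; i < 4; i++) {
		snprintf(path, sizeof(path),
			 "/sys/kernel/debug/qlockstat/%s", files[i]);
		FILE *f = fopen(path, "r");

		if (!f)
			continue;
		if (fscanf(f, "%llu", &val) == 1)
			printf("%-14s %llu\n", files[i], val);
		fclose(f);
	}
	return 0;
}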

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c      | 5 +++++
 kernel/locking/qspinlock_stat.h | 6 ++++++
 2 files changed, 11 insertions(+)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index bfaeb05..7099bbe 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -378,6 +378,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	node += idx;
 
 	/*
+	 * Keep counts of non-zero index values
+	 */
+	qstat_inc(qstat_lock_idx1 + idx - 1, idx);
+
+	/*
 	 * Ensure that we increment the head node->count before initialising
 	 * the actual node. If the compiler is kind enough to reorder these
 	 * stores, then an IRQ could overwrite our assignments.
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 6bd78c0..42d3d8d 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -55,6 +55,9 @@ enum qlock_stats {
 	qstat_pv_wait_node,
 	qstat_lock_pending,
 	qstat_lock_slowpath,
+	qstat_lock_idx1,
+	qstat_lock_idx2,
+	qstat_lock_idx3,
 	qstat_num,	/* Total number of statistical counters */
 	qstat_reset_cnts = qstat_num,
 };
@@ -82,6 +85,9 @@ enum qlock_stats {
 	[qstat_pv_wait_node]       = "pv_wait_node",
 	[qstat_lock_pending]       = "lock_pending",
 	[qstat_lock_slowpath]      = "lock_slowpath",
+	[qstat_lock_idx1]	   = "lock_index1",
+	[qstat_lock_idx2]	   = "lock_index2",
+	[qstat_lock_idx3]	   = "lock_index3",
 	[qstat_reset_cnts]         = "reset_counters",
 };
 
-- 
1.8.3.1



* [PATCH 2/2] locking/pvqspinlock: Extend node size when pvqspinlock is configured
  2018-10-16 13:45 [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths Waiman Long
@ 2018-10-16 13:45 ` Waiman Long
  2018-10-17  9:12   ` [tip:locking/core] " tip-bot for Waiman Long
  2018-10-17  7:38 ` [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths Peter Zijlstra
  2018-10-17  9:11 ` [tip:locking/core] " tip-bot for Waiman Long
  2 siblings, 1 reply; 7+ messages in thread
From: Waiman Long @ 2018-10-16 13:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon; +Cc: linux-kernel, Waiman Long

The qspinlock code supports up to 4 levels of slowpath nesting using
four per-cpu mcs_spinlock structures. For 64-bit architectures, they
fit nicely in one 64-byte cacheline.

The para-virtualized (PV) qspinlock code, however, needs to store more
information in the per-cpu node structure than there is space for. It
works around this by using a second cacheline to hold the extra
information that it needs. So PV qspinlock needs to access two extra
cachelines for its information whereas the native qspinlock code only
needs one extra cacheline.

Counter profiling of the qspinlock code, however, revealed that it
was very rare to use more than two levels of slowpath nesting. So it
doesn't make sense to penalize PV qspinlock code in order to have four
mcs_spinlock structures in the same cacheline to optimize for a case
in the native qspinlock code that rarely happens.

The per-cpu node structure is now extended to have two more long words
to hold the extra data that it needs when PV qspinlock is configured.
As a result, the PV qspinlock code will enjoy the same benefit as the
native counterpart of using just one extra cacheline in most cases.
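
As a size sketch, the structures below use an assumed typical 64-bit
field layout (not copied from the kernel headers) just to make the
cacheline arithmetic concrete: four 16-byte native nodes fill one
64-byte cacheline, while the padded 32-byte PV variant fits two per
cacheline.

struct mcs_spinlock {			/* 8 + 4 + 4 = 16 bytes on 64-bit */
	struct mcs_spinlock *next;
	int locked;
	int count;
};

struct qnode {				/* PV variant: 16 + 16 = 32 bytes */
	struct mcs_spinlock mcs;
	long reserved[2];		/* extra room for the PV node data */
};

_Static_assert(sizeof(struct mcs_spinlock) == 16,
	       "four native nodes per 64-byte cacheline");
_Static_assert(sizeof(struct qnode) == 32,
	       "two PV nodes per 64-byte cacheline");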

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/qspinlock.c          | 34 ++++++++++++++++++++++++++--------
 kernel/locking/qspinlock_paravirt.h |  4 +---
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7099bbe..389bdb7 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -74,12 +74,24 @@
  */
 
 #include "mcs_spinlock.h"
+#define MAX_NODES	4
 
+/*
+ * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
+ * size and four of them will fit nicely in one 64-byte cacheline. For
+ * pvqspinlock, however, we need more space for extra data. To accommodate
+ * that, we insert two more long words to pad it up to 32 bytes. IOW, only
+ * two of them can fit in a cacheline in this case. That is OK as it is rare
+ * to have more than 2 levels of slowpath nesting in actual use. We don't
+ * want to penalize pvqspinlock to optimize for a rare case in native
+ * qspinlock.
+ */
+struct qnode {
+	struct mcs_spinlock mcs;
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define MAX_NODES	8
-#else
-#define MAX_NODES	4
+	long reserved[2];
 #endif
+};
 
 /*
  * The pending bit spinning loop count.
@@ -101,7 +113,7 @@
  *
  * PV doubles the storage and uses the second cacheline for PV state.
  */
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
+static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
@@ -126,7 +138,13 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
 	int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
 	int idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
 
-	return per_cpu_ptr(&mcs_nodes[idx], cpu);
+	return per_cpu_ptr(&qnodes[idx].mcs, cpu);
+}
+
+static inline __pure
+struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
+{
+	return &((struct qnode *)base + idx)->mcs;
 }
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
@@ -371,11 +389,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 queue:
 	qstat_inc(qstat_lock_slowpath, true);
 pv_queue:
-	node = this_cpu_ptr(&mcs_nodes[0]);
+	node = this_cpu_ptr(&qnodes[0].mcs);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
 
-	node += idx;
+	node = grab_mcs_node(node, idx);
 
 	/*
 	 * Keep counts of non-zero index values
@@ -506,7 +524,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	/*
 	 * release the node
 	 */
-	__this_cpu_dec(mcs_nodes[0].count);
+	__this_cpu_dec(qnodes[0].mcs.count);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 5a0cf5f..0130e48 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -49,8 +49,6 @@ enum vcpu_state {
 
 struct pv_node {
 	struct mcs_spinlock	mcs;
-	struct mcs_spinlock	__res[3];
-
 	int			cpu;
 	u8			state;
 };
@@ -281,7 +279,7 @@ static void pv_init_node(struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 
-	BUILD_BUG_ON(sizeof(struct pv_node) > 5*sizeof(struct mcs_spinlock));
+	BUILD_BUG_ON(sizeof(struct pv_node) > sizeof(struct qnode));
 
 	pn->cpu = smp_processor_id();
 	pn->state = vcpu_running;
-- 
1.8.3.1



* Re: [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths
  2018-10-16 13:45 [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths Waiman Long
  2018-10-16 13:45 ` [PATCH 2/2] locking/pvqspinlock: Extend node size when pvqspinlock is configured Waiman Long
@ 2018-10-17  7:38 ` Peter Zijlstra
  2018-10-17 21:06   ` Waiman Long
  2018-10-17  9:11 ` [tip:locking/core] " tip-bot for Waiman Long
  2 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2018-10-17  7:38 UTC (permalink / raw)
  To: Waiman Long; +Cc: Ingo Molnar, Will Deacon, linux-kernel

On Tue, Oct 16, 2018 at 09:45:06AM -0400, Waiman Long wrote:
> Queued spinlock supports up to 4 levels of lock slowpath nesting -
> user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
> often the nesting happens. So 3 more per-cpu stat counters are added
> to track the number of instances where nesting index goes to 1, 2 and
> 3 respectively.
> 
> On a dual-socket 64-core 128-thread Zen server, the following were stat
> counter values under different circumstances.
> 
>          State                      slowpath   index1   index2   index3
>          -----                      --------   ------   ------   -------
>   After bootup                      1,012,150    82       0        0
>   After parallel build + perf-top 125,195,009    82       0        0

Would probably be good to check a network workload.


* [tip:locking/core] locking/qspinlock_stat: Count instances of nested lock slowpaths
  2018-10-16 13:45 [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths Waiman Long
  2018-10-16 13:45 ` [PATCH 2/2] locking/pvqspinlock: Extend node size when pvqspinlock is configured Waiman Long
  2018-10-17  7:38 ` [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths Peter Zijlstra
@ 2018-10-17  9:11 ` tip-bot for Waiman Long
  2 siblings, 0 replies; 7+ messages in thread
From: tip-bot for Waiman Long @ 2018-10-17  9:11 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: akpm, longman, paulmck, tglx, will.deacon, linux-kernel,
	torvalds, hpa, mingo, peterz

Commit-ID:  1222109a53637f96c581224198b86856d503f892
Gitweb:     https://git.kernel.org/tip/1222109a53637f96c581224198b86856d503f892
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Tue, 16 Oct 2018 09:45:06 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 17 Oct 2018 08:37:31 +0200

locking/qspinlock_stat: Count instances of nested lock slowpaths

Queued spinlock supports up to 4 levels of lock slowpath nesting -
user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
often the nesting happens.

So add 3 more per-CPU stat counters to track the number of instances where
nesting index goes to 1, 2 and 3 respectively.

On a dual-socket 64-core 128-thread Zen server, the following were the
new stat counter values under different circumstances:

         State                         slowpath   index1   index2   index3
         -----                         --------   ------   ------   -------
  After bootup                         1,012,150    82       0        0
  After parallel build + perf-top    125,195,009    82       0        0

So the chance of having more than 2 levels of nesting is extremely low.

[ mingo: Minor changelog edits. ]

Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1539697507-28084-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c      | 5 +++++
 kernel/locking/qspinlock_stat.h | 6 ++++++
 2 files changed, 11 insertions(+)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 341ca666bc60..ce6af1ee2cac 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -396,6 +396,11 @@ pv_queue:
 
 	node += idx;
 
+	/*
+	 * Keep counts of non-zero index values:
+	 */
+	qstat_inc(qstat_lock_idx1 + idx - 1, idx);
+
 	/*
 	 * Ensure that we increment the head node->count before initialising
 	 * the actual node. If the compiler is kind enough to reorder these
diff --git a/kernel/locking/qspinlock_stat.h b/kernel/locking/qspinlock_stat.h
index 6bd78c0740fc..42d3d8dc8f49 100644
--- a/kernel/locking/qspinlock_stat.h
+++ b/kernel/locking/qspinlock_stat.h
@@ -55,6 +55,9 @@ enum qlock_stats {
 	qstat_pv_wait_node,
 	qstat_lock_pending,
 	qstat_lock_slowpath,
+	qstat_lock_idx1,
+	qstat_lock_idx2,
+	qstat_lock_idx3,
 	qstat_num,	/* Total number of statistical counters */
 	qstat_reset_cnts = qstat_num,
 };
@@ -82,6 +85,9 @@ static const char * const qstat_names[qstat_num + 1] = {
 	[qstat_pv_wait_node]       = "pv_wait_node",
 	[qstat_lock_pending]       = "lock_pending",
 	[qstat_lock_slowpath]      = "lock_slowpath",
+	[qstat_lock_idx1]	   = "lock_index1",
+	[qstat_lock_idx2]	   = "lock_index2",
+	[qstat_lock_idx3]	   = "lock_index3",
 	[qstat_reset_cnts]         = "reset_counters",
 };
 


* [tip:locking/core] locking/pvqspinlock: Extend node size when pvqspinlock is configured
  2018-10-16 13:45 ` [PATCH 2/2] locking/pvqspinlock: Extend node size when pvqspinlock is configured Waiman Long
@ 2018-10-17  9:12   ` tip-bot for Waiman Long
  0 siblings, 0 replies; 7+ messages in thread
From: tip-bot for Waiman Long @ 2018-10-17  9:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, longman, mingo, akpm, torvalds, linux-kernel, will.deacon,
	peterz, paulmck, hpa

Commit-ID:  0fa809ca7f81c47bea6706bc689e941eb25d7e89
Gitweb:     https://git.kernel.org/tip/0fa809ca7f81c47bea6706bc689e941eb25d7e89
Author:     Waiman Long <longman@redhat.com>
AuthorDate: Tue, 16 Oct 2018 09:45:07 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 17 Oct 2018 08:37:32 +0200

locking/pvqspinlock: Extend node size when pvqspinlock is configured

The qspinlock code supports up to 4 levels of slowpath nesting using
four per-CPU mcs_spinlock structures. For 64-bit architectures, they
fit nicely in one 64-byte cacheline.

For para-virtualized (PV) qspinlocks it needs to store more information
in the per-CPU node structure than there is space for. It uses a trick
to use a second cacheline to hold the extra information that it needs.
So PV qspinlock needs to access two extra cachelines for its information
whereas the native qspinlock code only needs one extra cacheline.

Freshly added counter profiling of the qspinlock code, however, revealed
that it was very rare to use more than two levels of slowpath nesting.
So it doesn't make sense to penalize PV qspinlock code in order to have
four mcs_spinlock structures in the same cacheline to optimize for a case
in the native qspinlock code that rarely happens.

Extend the per-CPU node structure to have two more long words when PV
qspinlock locks are configured to hold the extra data that it needs.

As a result, the PV qspinlock code will enjoy the same benefit of using
just one extra cacheline like the native counterpart, for most cases.

[ mingo: Minor changelog edits. ]

Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1539697507-28084-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/locking/qspinlock.c          | 34 ++++++++++++++++++++++++++--------
 kernel/locking/qspinlock_paravirt.h |  4 +---
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ce6af1ee2cac..8a8c3c208c5e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -74,12 +74,24 @@
  */
 
 #include "mcs_spinlock.h"
+#define MAX_NODES	4
 
+/*
+ * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
+ * size and four of them will fit nicely in one 64-byte cacheline. For
+ * pvqspinlock, however, we need more space for extra data. To accommodate
+ * that, we insert two more long words to pad it up to 32 bytes. IOW, only
+ * two of them can fit in a cacheline in this case. That is OK as it is rare
+ * to have more than 2 levels of slowpath nesting in actual use. We don't
+ * want to penalize pvqspinlocks to optimize for a rare case in native
+ * qspinlocks.
+ */
+struct qnode {
+	struct mcs_spinlock mcs;
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define MAX_NODES	8
-#else
-#define MAX_NODES	4
+	long reserved[2];
 #endif
+};
 
 /*
  * The pending bit spinning loop count.
@@ -101,7 +113,7 @@
  *
  * PV doubles the storage and uses the second cacheline for PV state.
  */
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
+static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
 
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
@@ -126,7 +138,13 @@ static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
 	int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
 	int idx = (tail &  _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
 
-	return per_cpu_ptr(&mcs_nodes[idx], cpu);
+	return per_cpu_ptr(&qnodes[idx].mcs, cpu);
+}
+
+static inline __pure
+struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
+{
+	return &((struct qnode *)base + idx)->mcs;
 }
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
@@ -390,11 +408,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 queue:
 	qstat_inc(qstat_lock_slowpath, true);
 pv_queue:
-	node = this_cpu_ptr(&mcs_nodes[0]);
+	node = this_cpu_ptr(&qnodes[0].mcs);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
 
-	node += idx;
+	node = grab_mcs_node(node, idx);
 
 	/*
 	 * Keep counts of non-zero index values:
@@ -534,7 +552,7 @@ release:
 	/*
 	 * release the node
 	 */
-	__this_cpu_dec(mcs_nodes[0].count);
+	__this_cpu_dec(qnodes[0].mcs.count);
 }
 EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 5a0cf5f9008c..0130e488ebfe 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -49,8 +49,6 @@ enum vcpu_state {
 
 struct pv_node {
 	struct mcs_spinlock	mcs;
-	struct mcs_spinlock	__res[3];
-
 	int			cpu;
 	u8			state;
 };
@@ -281,7 +279,7 @@ static void pv_init_node(struct mcs_spinlock *node)
 {
 	struct pv_node *pn = (struct pv_node *)node;
 
-	BUILD_BUG_ON(sizeof(struct pv_node) > 5*sizeof(struct mcs_spinlock));
+	BUILD_BUG_ON(sizeof(struct pv_node) > sizeof(struct qnode));
 
 	pn->cpu = smp_processor_id();
 	pn->state = vcpu_running;


* Re: [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths
  2018-10-17  7:38 ` [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths Peter Zijlstra
@ 2018-10-17 21:06   ` Waiman Long
  2018-10-18  9:05     ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Waiman Long @ 2018-10-17 21:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Will Deacon, linux-kernel

On 10/17/2018 03:38 AM, Peter Zijlstra wrote:
> On Tue, Oct 16, 2018 at 09:45:06AM -0400, Waiman Long wrote:
>> Queued spinlock supports up to 4 levels of lock slowpath nesting -
>> user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
>> often the nesting happens. So 3 more per-cpu stat counters are added
>> to track the number of instances where nesting index goes to 1, 2 and
>> 3 respectively.
>>
>> On a dual-socket 64-core 128-thread Zen server, the following were stat
>> counter values under different circumstances.
>>
>>          State                      slowpath   index1   index2   index3
>>          -----                      --------   ------   ------   -------
>>   After bootup                      1,012,150    82       0        0
>>   After parallel build + perf-top 125,195,009    82       0        0
> Would probably be good to check a network workload.

Do you have any suggestions for a network workload? I usually don't
test with network-related workloads.

The only way to have a nesting level of 3 is when there is lock
contention in user context, soft IRQ and hard IRQ simultaneously, which
is very rare, I think. The native qspinlock has a pending path, so even
a slightly contended lock will not cause entry into the slowpath. PV
qspinlock doesn't have a pending path, so there may be a slightly
higher chance of hitting that.

Cheers,
Longman



* Re: [PATCH 1/2] locking/qspinlock_stat: Count instances of nested lock slowpaths
  2018-10-17 21:06   ` Waiman Long
@ 2018-10-18  9:05     ` Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2018-10-18  9:05 UTC (permalink / raw)
  To: Waiman Long; +Cc: Ingo Molnar, Will Deacon, linux-kernel

On Wed, Oct 17, 2018 at 05:06:46PM -0400, Waiman Long wrote:
> On 10/17/2018 03:38 AM, Peter Zijlstra wrote:
> > On Tue, Oct 16, 2018 at 09:45:06AM -0400, Waiman Long wrote:
> >> Queued spinlock supports up to 4 levels of lock slowpath nesting -
> >> user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
> >> often the nesting happens. So 3 more per-cpu stat counters are added
> >> to track the number of instances where nesting index goes to 1, 2 and
> >> 3 respectively.
> >>
> >> On a dual-socket 64-core 128-thread Zen server, the following were stat
> >> counter values under different circumstances.
> >>
> >>          State                      slowpath   index1   index2   index3
> >>          -----                      --------   ------   ------   -------
> >>   After bootup                      1,012,150    82       0        0
> >>   After parallel build + perf-top 125,195,009    82       0        0
> > Would probably be good to check a network workload.
> 
> Do you have any suggestion of what network workload? I usually don't
> test on network related workload.

Some netperf runs? Be sure to ask the network folks for 2 systems with
at least 10GbE or something.

> The only way to have a nesting level of 3 is when there is lock
> contention in user context, soft IRQ and hard IRQ simultaneously which
> is very rare,

Networking should be able to trigger a lot of softirqs and hardirqs,
which is why I wondered if it could hit 3 levels of nesting
significantly often.

But yeah, I played around a bit, and couldn't get my index* numbers to
move at all. I suppose we've gotten really good at avoiding lock
contention for all the benchmarks we care about :-)


