* [PATCH v3 0/4] kernel/smp.c: add more CSD lock debugging
@ 2021-03-02  6:28 Juergen Gross
  2021-03-02  6:28 ` [PATCH v3 1/4] kernel/smp: add boot parameter for controlling " Juergen Gross
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Juergen Gross @ 2021-03-02  6:28 UTC (permalink / raw)
  To: linux-kernel, linux-doc
  Cc: paulmck, mhocko, peterz, Juergen Gross, Jonathan Corbet

This patch series was created to help catch a rather long-standing
problem with smp_call_function_any() and friends.

Very rarely a remote cpu seems not to execute a queued function, and
the cpu that queued the function request will then wait forever for
the CSD lock to be released by the remote cpu.

This problem has been observed primarily when running as a guest on
top of KVM or Xen, but there are reports of the same pattern on bare
metal, too. It seems to have existed for about 2 years now, and there
is not much data available.

What is known so far is that resending an IPI to the remote cpu
helps.

The patches add more debug data to be printed in a hang situation when
using a kernel with CONFIG_CSD_LOCK_WAIT_DEBUG configured.
Additionally, the debug code can be controlled via a new parameter in
order to make it easier to use such a kernel in a production
environment without too much negative performance impact. By default
the debugging additions are switched off; they can be activated via
the new boot parameter:

csdlock_debug=1 switches on the basic debugging and the IPI resend.
csdlock_debug=ext additionally prints extended data in a hang
  situation, but this option has a larger impact on performance.

I hope that the "ext" setting will help find the root cause of the
problem.
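
For reference, enabling the extended debugging on a test machine just
means appending the parameter to the kernel command line, e.g. via a
GRUB-based setup like the following (illustrative only; the exact file
and the command for regenerating the boot configuration depend on the
distribution):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX="... csdlock_debug=ext"
  # regenerate grub.cfg (e.g. grub2-mkconfig -o /boot/grub2/grub.cfg), then reboot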

Juergen Gross (4):
  kernel/smp: add boot parameter for controlling CSD lock debugging
  kernel/smp: prepare more CSD lock debugging
  kernel/smp: add more data to CSD lock debugging
  kernel/smp: fix flush_smp_call_function_queue() cpu offline detection

 .../admin-guide/kernel-parameters.txt         |  10 +
 kernel/smp.c                                  | 280 +++++++++++++++++-
 2 files changed, 277 insertions(+), 13 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 1/4] kernel/smp: add boot parameter for controlling CSD lock debugging
  2021-03-02  6:28 [PATCH v3 0/4] kernel/smp.c: add more CSD lock debugging Juergen Gross
@ 2021-03-02  6:28 ` Juergen Gross
  2021-03-02  6:28 ` [PATCH v3 2/4] kernel/smp: prepare more " Juergen Gross
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Juergen Gross @ 2021-03-02  6:28 UTC (permalink / raw)
  To: linux-kernel, linux-doc
  Cc: paulmck, mhocko, peterz, Juergen Gross, Jonathan Corbet

Currently CSD lock debugging can be switched on and off via a kernel
config option only. Unfortunately there is at least one problem with
CSD lock handling that has been pending for about 2 years now and has
been seen in different environments (mostly when running virtualized
under KVM or Xen, at least once on bare metal). Multiple attempts to
catch this issue have finally led to the introduction of the CSD lock
debug code, but this code is not in use in most distros as it has some
impact on performance.

In order to be able to ship kernels with CONFIG_CSD_LOCK_WAIT_DEBUG
enabled even for production use, add a boot parameter for switching
the debug functionality on. This reduces the performance impact of
the debug code to a bare minimum when it is not being used.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 .../admin-guide/kernel-parameters.txt         |  6 +++
 kernel/smp.c                                  | 38 +++++++++++++++++--
 2 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..31dbf7b2f0e8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -784,6 +784,12 @@
 	cs89x0_media=	[HW,NET]
 			Format: { rj45 | aui | bnc }
 
+	csdlock_debug=	[KNL] Enable debug add-ons of cross-cpu function call
+			handling. When switched on additional debug data is
+			printed to the console in case a hanging cpu is
+			detected and that cpu is pinged again in order to try
+			to resolve the hang situation.
+
 	dasd=		[HW,NET]
 			See header of drivers/s390/block/dasd_devmap.c.
 
diff --git a/kernel/smp.c b/kernel/smp.c
index aeb0adfa0606..d5f0b21ab55e 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -24,6 +24,7 @@
 #include <linux/sched/clock.h>
 #include <linux/nmi.h>
 #include <linux/sched/debug.h>
+#include <linux/jump_label.h>
 
 #include "smpboot.h"
 #include "sched/smp.h"
@@ -102,6 +103,20 @@ void __init call_function_init(void)
 
 #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
 
+static DEFINE_STATIC_KEY_FALSE(csdlock_debug_enabled);
+
+static int __init csdlock_debug(char *str)
+{
+	unsigned int val = 0;
+
+	get_option(&str, &val);
+	if (val)
+		static_branch_enable(&csdlock_debug_enabled);
+
+	return 0;
+}
+early_param("csdlock_debug", csdlock_debug);
+
 static DEFINE_PER_CPU(call_single_data_t *, cur_csd);
 static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func);
 static DEFINE_PER_CPU(void *, cur_csd_info);
@@ -110,7 +125,7 @@ static DEFINE_PER_CPU(void *, cur_csd_info);
 static atomic_t csd_bug_count = ATOMIC_INIT(0);
 
 /* Record current CSD work for current CPU, NULL to erase. */
-static void csd_lock_record(call_single_data_t *csd)
+static void __csd_lock_record(call_single_data_t *csd)
 {
 	if (!csd) {
 		smp_mb(); /* NULL cur_csd after unlock. */
@@ -125,7 +140,13 @@ static void csd_lock_record(call_single_data_t *csd)
 		  /* Or before unlock, as the case may be. */
 }
 
-static __always_inline int csd_lock_wait_getcpu(call_single_data_t *csd)
+static __always_inline void csd_lock_record(call_single_data_t *csd)
+{
+	if (static_branch_unlikely(&csdlock_debug_enabled))
+		__csd_lock_record(csd);
+}
+
+static int csd_lock_wait_getcpu(call_single_data_t *csd)
 {
 	unsigned int csd_type;
 
@@ -140,7 +161,7 @@ static __always_inline int csd_lock_wait_getcpu(call_single_data_t *csd)
  * the CSD_TYPE_SYNC/ASYNC types provide the destination CPU,
  * so waiting on other types gets much less information.
  */
-static __always_inline bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)
+static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)
 {
 	int cpu = -1;
 	int cpux;
@@ -204,7 +225,7 @@ static __always_inline bool csd_lock_wait_toolong(call_single_data_t *csd, u64 t
  * previous function call. For multi-cpu calls its even more interesting
  * as we'll have to ensure no other cpu is observing our csd.
  */
-static __always_inline void csd_lock_wait(call_single_data_t *csd)
+static void __csd_lock_wait(call_single_data_t *csd)
 {
 	int bug_id = 0;
 	u64 ts0, ts1;
@@ -218,6 +239,15 @@ static __always_inline void csd_lock_wait(call_single_data_t *csd)
 	smp_acquire__after_ctrl_dep();
 }
 
+static __always_inline void csd_lock_wait(call_single_data_t *csd)
+{
+	if (static_branch_unlikely(&csdlock_debug_enabled)) {
+		__csd_lock_wait(csd);
+		return;
+	}
+
+	smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));
+}
 #else
 static void csd_lock_record(call_single_data_t *csd)
 {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 2/4] kernel/smp: prepare more CSD lock debugging
  2021-03-02  6:28 [PATCH v3 0/4] kernel/smp.c: add more CSD lock debugging Juergen Gross
  2021-03-02  6:28 ` [PATCH v3 1/4] kernel/smp: add boot parameter for controlling " Juergen Gross
@ 2021-03-02  6:28 ` Juergen Gross
  2021-03-02  6:28 ` [PATCH v3 3/4] kernel/smp: add more data to " Juergen Gross
  2021-03-02  6:28 ` [PATCH v3 4/4] kernel/smp: fix flush_smp_call_function_queue() cpu offline detection Juergen Gross
  3 siblings, 0 replies; 13+ messages in thread
From: Juergen Gross @ 2021-03-02  6:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: paulmck, mhocko, peterz, Juergen Gross

In order to be able to easily add more CSD lock debugging data to
struct call_function_data->csd, move the call_single_data_t element
into a sub-structure.
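
The resulting layout is sketched below (the full change is in the diff
that follows); later patches can then add per-cpu debug fields next to
the csd without touching struct call_function_data again:

	struct cfd_percpu {
		call_single_data_t	csd;
		/* per-cpu debug fields get appended here by a later patch */
	};

	struct call_function_data {
		struct cfd_percpu	__percpu *pcpu;
		cpumask_var_t		cpumask;
		cpumask_var_t		cpumask_ipi;
	};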

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 kernel/smp.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index d5f0b21ab55e..6d7e6dbe33dc 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -31,8 +31,12 @@
 
 #define CSD_TYPE(_csd)	((_csd)->node.u_flags & CSD_FLAG_TYPE_MASK)
 
+struct cfd_percpu {
+	call_single_data_t	csd;
+};
+
 struct call_function_data {
-	call_single_data_t	__percpu *csd;
+	struct cfd_percpu	__percpu *pcpu;
 	cpumask_var_t		cpumask;
 	cpumask_var_t		cpumask_ipi;
 };
@@ -55,8 +59,8 @@ int smpcfd_prepare_cpu(unsigned int cpu)
 		free_cpumask_var(cfd->cpumask);
 		return -ENOMEM;
 	}
-	cfd->csd = alloc_percpu(call_single_data_t);
-	if (!cfd->csd) {
+	cfd->pcpu = alloc_percpu(struct cfd_percpu);
+	if (!cfd->pcpu) {
 		free_cpumask_var(cfd->cpumask);
 		free_cpumask_var(cfd->cpumask_ipi);
 		return -ENOMEM;
@@ -71,7 +75,7 @@ int smpcfd_dead_cpu(unsigned int cpu)
 
 	free_cpumask_var(cfd->cpumask);
 	free_cpumask_var(cfd->cpumask_ipi);
-	free_percpu(cfd->csd);
+	free_percpu(cfd->pcpu);
 	return 0;
 }
 
@@ -694,7 +698,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 
 	cpumask_clear(cfd->cpumask_ipi);
 	for_each_cpu(cpu, cfd->cpumask) {
-		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+		call_single_data_t *csd = &per_cpu_ptr(cfd->pcpu, cpu)->csd;
 
 		if (cond_func && !cond_func(cpu, info))
 			continue;
@@ -719,7 +723,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		for_each_cpu(cpu, cfd->cpumask) {
 			call_single_data_t *csd;
 
-			csd = per_cpu_ptr(cfd->csd, cpu);
+			csd = &per_cpu_ptr(cfd->pcpu, cpu)->csd;
 			csd_lock_wait(csd);
 		}
 	}
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging
  2021-03-02  6:28 [PATCH v3 0/4] kernel/smp.c: add more CSD lock debugging Juergen Gross
  2021-03-02  6:28 ` [PATCH v3 1/4] kernel/smp: add boot parameter for controlling " Juergen Gross
  2021-03-02  6:28 ` [PATCH v3 2/4] kernel/smp: prepare more " Juergen Gross
@ 2021-03-02  6:28 ` Juergen Gross
  2021-03-24 10:18   ` Jürgen Groß
  2021-03-02  6:28 ` [PATCH v3 4/4] kernel/smp: fix flush_smp_call_function_queue() cpu offline detection Juergen Gross
  3 siblings, 1 reply; 13+ messages in thread
From: Juergen Gross @ 2021-03-02  6:28 UTC (permalink / raw)
  To: linux-kernel, linux-doc
  Cc: paulmck, mhocko, peterz, Juergen Gross, Jonathan Corbet

In order to help identify problems with IPI handling and remote
function execution, add some more data to the IPI debugging code.

There have been multiple reports of cpus looping long times (many
seconds) in smp_call_function_many() waiting for another cpu executing
a function like tlb flushing. Most of these reports have been for
cases where the kernel was running as a guest on top of KVM or Xen
(there are rumours of that happening under VMWare, too, and even on
bare metal).

Finding the root cause hasn't been successful yet, even after more than
2 years of chasing this bug by different developers.

Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
diagnostics") tried to address this by adding some debug code and by
issuing another IPI when a hang was detected. This helped mitigate
the problem (the repeated IPI unlocks the hang), but the root cause is
still unknown.

Currently available data suggests that either an IPI wasn't sent when it
should have been, or that the IPI didn't result in the target cpu
executing the queued function (due to the IPI not reaching the cpu,
the IPI handler not being called, or the handler not seeing the queued
request).

Try to add more diagnostic data by introducing a global atomic counter
which is incremented when doing critical operations (before and
after queueing a new request, when sending an IPI, and when dequeueing
a request). The counter value is stored in percpu variables which can
be printed out when a hang is detected.

The data of the last event (consisting of sequence counter, source
cpu, target cpu, and event type) is stored in a global variable. When
a new event is to be traced, the data of the last event is stored in
the event-related percpu location and the global data is updated with
the new event's data. This allows tracking two events in one data
location: one by the value of the event data (the event before the
current one), and one by the location itself (the current event).
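
A minimal sketch of this bookkeeping, with the field encoding and the
counter increment left out for brevity (the real implementation is
cfd_seq_inc() and cfd_seq_store() in the diff below; cfd_seq_record()
and encode() are illustrative names only):

	static u64 cfd_seq;	/* most recent event, globally */

	/* Publish a new event, hand the previous one back to the caller. */
	static u64 cfd_seq_record(u64 new_event)
	{
		u64 old;

		do {
			old = READ_ONCE(cfd_seq);
		} while (cmpxchg(&cfd_seq, old, new_event) != old);

		return old;
	}

	/*
	 * Typical use: the per-cpu slot names the current event by its
	 * location, while its value is the event seen just before it.
	 */
	pcpu->seq_ipi = cfd_seq_record(encode(this_cpu, cpu, CFD_SEQ_IPI));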

A typical printout with a detected hang will look like this:

csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
	csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
        csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
        csd: cnt(00008cd): ffff->0006 idle
        csd: cnt(0003668): 0001->0006 queue
        csd: cnt(0003669): 0001->0006 ipi
        csd: cnt(0003e0f): 0007->000a queue
        csd: cnt(0003e10): 0001->ffff ping
        csd: cnt(0003e71): 0003->0000 ping
        csd: cnt(0003e72): ffff->0006 gotipi
        csd: cnt(0003e73): ffff->0006 handle
        csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
        csd: cnt(0003e7f): 0004->0006 ping
        csd: cnt(0003e80): 0001->ffff pinged
        csd: cnt(0003eb2): 0005->0001 noipi
        csd: cnt(0003eb3): 0001->0006 queue
        csd: cnt(0003eb4): 0001->0006 noipi
        csd: cnt now: 0003f00

This example (an artificial one, produced with a previous version of
this patch without the "hdlend" event) shows that cpu#6 started to
handle an IPI (cnt 3e72-3e74), but didn't start to handle another IPI
(sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
queued (3eb3), but no IPI was needed (cnt 3eb4, there was the event
from cpu#4 in the queue already).

The idea is to print only relevant entries. Those are all events which
are associated with the hang (so sender-side events for the source cpu
of the hanging request, and receiver-side events for the target cpu),
and the related events just before those (for adding data needed to
identify a possible race). Printing all available data would be
possible, but this would result in large amounts of data being printed
on larger configurations.

Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
---
V2:
- add automatic data deciphering and sorting of entries
- add new trace point for leaving flush_smp_call_function_queue()
- add information when finding an empty call_single_queue
V3:
- move new code to generic_exec_single() (Peter Zijlstra)
---
 .../admin-guide/kernel-parameters.txt         |   4 +
 kernel/smp.c                                  | 228 +++++++++++++++++-
 2 files changed, 228 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 31dbf7b2f0e8..80c72f8e780d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -789,6 +789,10 @@
 			printed to the console in case a hanging cpu is
 			detected and that cpu is pinged again in order to try
 			to resolve the hang situation.
+			0: disable csdlock debugging (default)
+			1: enable basic csdlock debugging (minor impact)
+			ext: enable extended csdlock debugging (more impact,
+			     but more data)
 
 	dasd=		[HW,NET]
 			See header of drivers/s390/block/dasd_devmap.c.
diff --git a/kernel/smp.c b/kernel/smp.c
index 6d7e6dbe33dc..1a96691dbf7f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -31,8 +31,59 @@
 
 #define CSD_TYPE(_csd)	((_csd)->node.u_flags & CSD_FLAG_TYPE_MASK)
 
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+union cfd_seq_cnt {
+	u64		val;
+	struct {
+		u64	src:16;
+		u64	dst:16;
+#define CFD_SEQ_NOCPU	0xffff
+		u64	type:4;
+#define CFD_SEQ_QUEUE	0
+#define CFD_SEQ_IPI	1
+#define CFD_SEQ_NOIPI	2
+#define CFD_SEQ_PING	3
+#define CFD_SEQ_PINGED	4
+#define CFD_SEQ_HANDLE	5
+#define CFD_SEQ_DEQUEUE	6
+#define CFD_SEQ_IDLE	7
+#define CFD_SEQ_GOTIPI	8
+#define CFD_SEQ_HDLEND	9
+		u64	cnt:28;
+	}		u;
+};
+
+static char *seq_type[] = {
+	[CFD_SEQ_QUEUE]		= "queue",
+	[CFD_SEQ_IPI]		= "ipi",
+	[CFD_SEQ_NOIPI]		= "noipi",
+	[CFD_SEQ_PING]		= "ping",
+	[CFD_SEQ_PINGED]	= "pinged",
+	[CFD_SEQ_HANDLE]	= "handle",
+	[CFD_SEQ_DEQUEUE]	= "dequeue (src cpu 0 == empty)",
+	[CFD_SEQ_IDLE]		= "idle",
+	[CFD_SEQ_GOTIPI]	= "gotipi",
+	[CFD_SEQ_HDLEND]	= "hdlend (src cpu 0 == early)",
+};
+
+struct cfd_seq_local {
+	u64	ping;
+	u64	pinged;
+	u64	handle;
+	u64	dequeue;
+	u64	idle;
+	u64	gotipi;
+	u64	hdlend;
+};
+#endif
+
 struct cfd_percpu {
 	call_single_data_t	csd;
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+	u64	seq_queue;
+	u64	seq_ipi;
+	u64	seq_noipi;
+#endif
 };
 
 struct call_function_data {
@@ -108,12 +159,18 @@ void __init call_function_init(void)
 #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
 
 static DEFINE_STATIC_KEY_FALSE(csdlock_debug_enabled);
+static DEFINE_STATIC_KEY_FALSE(csdlock_debug_extended);
 
 static int __init csdlock_debug(char *str)
 {
 	unsigned int val = 0;
 
-	get_option(&str, &val);
+	if (str && !strcmp(str, "ext")) {
+		val = 1;
+		static_branch_enable(&csdlock_debug_extended);
+	} else
+		get_option(&str, &val);
+
 	if (val)
 		static_branch_enable(&csdlock_debug_enabled);
 
@@ -124,9 +181,34 @@ early_param("csdlock_debug", csdlock_debug);
 static DEFINE_PER_CPU(call_single_data_t *, cur_csd);
 static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func);
 static DEFINE_PER_CPU(void *, cur_csd_info);
+static DEFINE_PER_CPU(struct cfd_seq_local, cfd_seq_local);
 
 #define CSD_LOCK_TIMEOUT (5ULL * NSEC_PER_SEC)
 static atomic_t csd_bug_count = ATOMIC_INIT(0);
+static u64 cfd_seq;
+
+#define CFD_SEQ(s, d, t, c)	\
+	(union cfd_seq_cnt){ .u.src = s, .u.dst = d, .u.type = t, .u.cnt = c }
+
+static u64 cfd_seq_inc(unsigned int src, unsigned int dst, unsigned int type)
+{
+	union cfd_seq_cnt new, old;
+
+	new = CFD_SEQ(src, dst, type, 0);
+
+	do {
+		old.val = READ_ONCE(cfd_seq);
+		new.u.cnt = old.u.cnt + 1;
+	} while (cmpxchg(&cfd_seq, old.val, new.val) != old.val);
+
+	return old.val;
+}
+
+#define cfd_seq_store(var, src, dst, type)				\
+	do {								\
+		if (static_branch_unlikely(&csdlock_debug_extended))	\
+			var = cfd_seq_inc(src, dst, type);		\
+	} while (0)
 
 /* Record current CSD work for current CPU, NULL to erase. */
 static void __csd_lock_record(call_single_data_t *csd)
@@ -160,6 +242,88 @@ static int csd_lock_wait_getcpu(call_single_data_t *csd)
 	return -1;
 }
 
+static void cfd_seq_data_add(u64 val, unsigned int src, unsigned int dst,
+			     unsigned int type, union cfd_seq_cnt *data,
+			     unsigned int *n_data, unsigned int now)
+{
+	union cfd_seq_cnt new[2];
+	unsigned int i, j, k;
+
+	new[0].val = val;
+	new[1] = CFD_SEQ(src, dst, type, new[0].u.cnt + 1);
+
+	for (i = 0; i < 2; i++) {
+		if (new[i].u.cnt <= now)
+			new[i].u.cnt |= 0x80000000U;
+		for (j = 0; j < *n_data; j++) {
+			if (new[i].u.cnt == data[j].u.cnt) {
+				/* Direct read value trumps generated one. */
+				if (i == 0)
+					data[j].val = new[i].val;
+				break;
+			}
+			if (new[i].u.cnt < data[j].u.cnt) {
+				for (k = *n_data; k > j; k--)
+					data[k].val = data[k - 1].val;
+				data[j].val = new[i].val;
+				(*n_data)++;
+				break;
+			}
+		}
+		if (j == *n_data) {
+			data[j].val = new[i].val;
+			(*n_data)++;
+		}
+	}
+}
+
+static const char *csd_lock_get_type(unsigned int type)
+{
+	return (type >= ARRAY_SIZE(seq_type)) ? "?" : seq_type[type];
+}
+
+static void csd_lock_print_extended(call_single_data_t *csd, int cpu)
+{
+	struct cfd_seq_local *seq = &per_cpu(cfd_seq_local, cpu);
+	unsigned int srccpu = csd->node.src;
+	struct call_function_data *cfd = per_cpu_ptr(&cfd_data, srccpu);
+	struct cfd_percpu *pcpu = per_cpu_ptr(cfd->pcpu, cpu);
+	unsigned int now;
+	union cfd_seq_cnt data[2 * ARRAY_SIZE(seq_type)];
+	unsigned int n_data = 0, i;
+
+	data[0].val = READ_ONCE(cfd_seq);
+	now = data[0].u.cnt;
+
+	cfd_seq_data_add(pcpu->seq_queue, srccpu, cpu,
+			 CFD_SEQ_QUEUE, data, &n_data, now);
+	cfd_seq_data_add(pcpu->seq_ipi, srccpu, cpu,
+			 CFD_SEQ_IPI, data, &n_data, now);
+	cfd_seq_data_add(pcpu->seq_noipi, srccpu, cpu,
+			 CFD_SEQ_NOIPI, data, &n_data, now);
+	cfd_seq_data_add(per_cpu(cfd_seq_local.ping, srccpu), srccpu,
+			 CFD_SEQ_NOCPU, CFD_SEQ_PING, data, &n_data, now);
+	cfd_seq_data_add(per_cpu(cfd_seq_local.pinged, srccpu), srccpu,
+			 CFD_SEQ_NOCPU, CFD_SEQ_PINGED, data, &n_data, now);
+	cfd_seq_data_add(seq->idle, CFD_SEQ_NOCPU, cpu,
+			 CFD_SEQ_IDLE, data, &n_data, now);
+	cfd_seq_data_add(seq->gotipi, CFD_SEQ_NOCPU, cpu,
+			 CFD_SEQ_GOTIPI, data, &n_data, now);
+	cfd_seq_data_add(seq->handle, CFD_SEQ_NOCPU, cpu,
+			 CFD_SEQ_HANDLE, data, &n_data, now);
+	cfd_seq_data_add(seq->dequeue, CFD_SEQ_NOCPU, cpu,
+			 CFD_SEQ_DEQUEUE, data, &n_data, now);
+	cfd_seq_data_add(seq->hdlend, CFD_SEQ_NOCPU, cpu,
+			 CFD_SEQ_HDLEND, data, &n_data, now);
+
+	for (i = 0; i < n_data; i++) {
+		pr_alert("\tcsd: cnt(%07x): %04x->%04x %s\n",
+			 data[i].u.cnt & ~0x80000000U, data[i].u.src,
+			 data[i].u.dst, csd_lock_get_type(data[i].u.type));
+	}
+	pr_alert("\tcsd: cnt now: %07x\n", now);
+}
+
 /*
  * Complain if too much time spent waiting.  Note that only
  * the CSD_TYPE_SYNC/ASYNC types provide the destination CPU,
@@ -209,6 +373,8 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
 			 *bug_id, !cpu_cur_csd ? "unresponsive" : "handling this request");
 	}
 	if (cpu >= 0) {
+		if (static_branch_unlikely(&csdlock_debug_extended))
+			csd_lock_print_extended(csd, cpu);
 		if (!trigger_single_cpu_backtrace(cpu))
 			dump_cpu_task(cpu);
 		if (!cpu_cur_csd) {
@@ -252,7 +418,27 @@ static __always_inline void csd_lock_wait(call_single_data_t *csd)
 
 	smp_cond_load_acquire(&csd->node.u_flags, !(VAL & CSD_FLAG_LOCK));
 }
+
+static void __smp_call_single_queue_debug(int cpu, struct llist_node *node)
+{
+	unsigned int this_cpu = smp_processor_id();
+	struct cfd_seq_local *seq = this_cpu_ptr(&cfd_seq_local);
+	struct call_function_data *cfd = this_cpu_ptr(&cfd_data);
+	struct cfd_percpu *pcpu = per_cpu_ptr(cfd->pcpu, cpu);
+
+	cfd_seq_store(pcpu->seq_queue, this_cpu, cpu, CFD_SEQ_QUEUE);
+	if (llist_add(node, &per_cpu(call_single_queue, cpu))) {
+		cfd_seq_store(pcpu->seq_ipi, this_cpu, cpu, CFD_SEQ_IPI);
+		cfd_seq_store(seq->ping, this_cpu, cpu, CFD_SEQ_PING);
+		send_call_function_single_ipi(cpu);
+		cfd_seq_store(seq->pinged, this_cpu, cpu, CFD_SEQ_PINGED);
+	} else {
+		cfd_seq_store(pcpu->seq_noipi, this_cpu, cpu, CFD_SEQ_NOIPI);
+	}
+}
 #else
+#define cfd_seq_store(var, src, dst, type)
+
 static void csd_lock_record(call_single_data_t *csd)
 {
 }
@@ -335,6 +521,13 @@ static int generic_exec_single(int cpu, call_single_data_t *csd)
 		return -ENXIO;
 	}
 
+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+	if (static_branch_unlikely(&csdlock_debug_extended)) {
+		__smp_call_single_queue_debug(cpu, &csd->node.llist);
+		return 0;
+	}
+#endif
+
 	__smp_call_single_queue(cpu, &csd->node.llist);
 
 	return 0;
@@ -348,6 +541,8 @@ static int generic_exec_single(int cpu, call_single_data_t *csd)
  */
 void generic_smp_call_function_single_interrupt(void)
 {
+	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->gotipi, CFD_SEQ_NOCPU,
+		      smp_processor_id(), CFD_SEQ_GOTIPI);
 	flush_smp_call_function_queue(true);
 }
 
@@ -375,7 +570,13 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	lockdep_assert_irqs_disabled();
 
 	head = this_cpu_ptr(&call_single_queue);
+	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->handle, CFD_SEQ_NOCPU,
+		      smp_processor_id(), CFD_SEQ_HANDLE);
 	entry = llist_del_all(head);
+	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->dequeue,
+		      /* Special meaning of source cpu: 0 == queue empty */
+		      entry ? CFD_SEQ_NOCPU : 0,
+		      smp_processor_id(), CFD_SEQ_DEQUEUE);
 	entry = llist_reverse_order(entry);
 
 	/* There shouldn't be any pending callbacks on an offline CPU. */
@@ -434,8 +635,12 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 		}
 	}
 
-	if (!entry)
+	if (!entry) {
+		cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->hdlend,
+			      0, smp_processor_id(),
+			      CFD_SEQ_HDLEND);
 		return;
+	}
 
 	/*
 	 * Second; run all !SYNC callbacks.
@@ -473,6 +678,9 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	 */
 	if (entry)
 		sched_ttwu_pending(entry);
+
+	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->hdlend, CFD_SEQ_NOCPU,
+		      smp_processor_id(), CFD_SEQ_HDLEND);
 }
 
 void flush_smp_call_function_from_idle(void)
@@ -482,6 +690,8 @@ void flush_smp_call_function_from_idle(void)
 	if (llist_empty(this_cpu_ptr(&call_single_queue)))
 		return;
 
+	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->idle, CFD_SEQ_NOCPU,
+		      smp_processor_id(), CFD_SEQ_IDLE);
 	local_irq_save(flags);
 	flush_smp_call_function_queue(true);
 	if (local_softirq_pending())
@@ -698,7 +908,8 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 
 	cpumask_clear(cfd->cpumask_ipi);
 	for_each_cpu(cpu, cfd->cpumask) {
-		call_single_data_t *csd = &per_cpu_ptr(cfd->pcpu, cpu)->csd;
+		struct cfd_percpu *pcpu = per_cpu_ptr(cfd->pcpu, cpu);
+		call_single_data_t *csd = &pcpu->csd;
 
 		if (cond_func && !cond_func(cpu, info))
 			continue;
@@ -712,12 +923,21 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		csd->node.src = smp_processor_id();
 		csd->node.dst = cpu;
 #endif
-		if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu)))
+		cfd_seq_store(pcpu->seq_queue, this_cpu, cpu, CFD_SEQ_QUEUE);
+		if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
 			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+			cfd_seq_store(pcpu->seq_ipi, this_cpu, cpu, CFD_SEQ_IPI);
+		} else {
+			cfd_seq_store(pcpu->seq_noipi, this_cpu, cpu, CFD_SEQ_NOIPI);
+		}
 	}
 
 	/* Send a message to all CPUs in the map */
+	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->ping, this_cpu,
+		      CFD_SEQ_NOCPU, CFD_SEQ_PING);
 	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
+	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->pinged, this_cpu,
+		      CFD_SEQ_NOCPU, CFD_SEQ_PINGED);
 
 	if (wait) {
 		for_each_cpu(cpu, cfd->cpumask) {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 4/4] kernel/smp: fix flush_smp_call_function_queue() cpu offline detection
  2021-03-02  6:28 [PATCH v3 0/4] kernel/smp.c: add more CSD lock debugging Juergen Gross
                   ` (2 preceding siblings ...)
  2021-03-02  6:28 ` [PATCH v3 3/4] kernel/smp: add more data to " Juergen Gross
@ 2021-03-02  6:28 ` Juergen Gross
  3 siblings, 0 replies; 13+ messages in thread
From: Juergen Gross @ 2021-03-02  6:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: paulmck, mhocko, peterz, Juergen Gross

The warning for flushing a logically offline cpu's call_single_queue
is gated by a wrong if statement: it should trigger when there were
requests pending before dequeueing them, not afterwards. After
llist_del_all() the queue head is normally empty again, so testing
llist_empty(head) at that point misses exactly the requests that were
just dequeued; test the dequeued list (entry) instead.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V3:
- new patch
---
 kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 1a96691dbf7f..b3077c327b0a 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -581,7 +581,7 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 
 	/* There shouldn't be any pending callbacks on an offline CPU. */
 	if (unlikely(warn_cpu_offline && !cpu_online(smp_processor_id()) &&
-		     !warned && !llist_empty(head))) {
+		     !warned && entry)) {
 		warned = true;
 		WARN(1, "IPI on offline CPU %d\n", smp_processor_id());
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging
  2021-03-02  6:28 ` [PATCH v3 3/4] kernel/smp: add more data to " Juergen Gross
@ 2021-03-24 10:18   ` Jürgen Groß
  2021-03-30 17:33     ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Jürgen Groß @ 2021-03-24 10:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: paulmck, mhocko, peterz


On 02.03.21 07:28, Juergen Gross wrote:
> In order to help identifying problems with IPI handling and remote
> function execution add some more data to IPI debugging code.
> 
> There have been multiple reports of cpus looping long times (many
> seconds) in smp_call_function_many() waiting for another cpu executing
> a function like tlb flushing. Most of these reports have been for
> cases where the kernel was running as a guest on top of KVM or Xen
> (there are rumours of that happening under VMWare, too, and even on
> bare metal).
> 
> Finding the root cause hasn't been successful yet, even after more than
> 2 years of chasing this bug by different developers.
> 
> Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
> diagnostics") tried to address this by adding some debug code and by
> issuing another IPI when a hang was detected. This helped mitigating
> the problem (the repeated IPI unlocks the hang), but the root cause is
> still unknown.
> 
> Current available data suggests that either an IPI wasn't sent when it
> should have been, or that the IPI didn't result in the target cpu
> executing the queued function (due to the IPI not reaching the cpu,
> the IPI handler not being called, or the handler not seeing the queued
> request).
> 
> Try to add more diagnostic data by introducing a global atomic counter
> which is being incremented when doing critical operations (before and
> after queueing a new request, when sending an IPI, and when dequeueing
> a request). The counter value is stored in percpu variables which can
> be printed out when a hang is detected.
> 
> The data of the last event (consisting of sequence counter, source
> cpu, target cpu, and event type) is stored in a global variable. When
> a new event is to be traced, the data of the last event is stored in
> the event related percpu location and the global data is updated with
> the new event's data. This allows to track two events in one data
> location: one by the value of the event data (the event before the
> current one), and one by the location itself (the current event).
> 
> A typical printout with a detected hang will look like this:
> 
> csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
> 	csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
>          csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
>          csd: cnt(00008cd): ffff->0006 idle
>          csd: cnt(0003668): 0001->0006 queue
>          csd: cnt(0003669): 0001->0006 ipi
>          csd: cnt(0003e0f): 0007->000a queue
>          csd: cnt(0003e10): 0001->ffff ping
>          csd: cnt(0003e71): 0003->0000 ping
>          csd: cnt(0003e72): ffff->0006 gotipi
>          csd: cnt(0003e73): ffff->0006 handle
>          csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
>          csd: cnt(0003e7f): 0004->0006 ping
>          csd: cnt(0003e80): 0001->ffff pinged
>          csd: cnt(0003eb2): 0005->0001 noipi
>          csd: cnt(0003eb3): 0001->0006 queue
>          csd: cnt(0003eb4): 0001->0006 noipi
>          csd: cnt now: 0003f00
> 
> This example (being an artificial one, produced with a previous version
> of this patch without the "hdlend" event), shows that cpu#6 started to
> handle an IPI (cnt 3e72-3e74), bit didn't start to handle another IPI
> (sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
> queued (3eb3), but no IPI was needed (cnt 3eb4, there was the event
> from cpu#4 in the queue already).
> 
> The idea is to print only relevant entries. Those are all events which
> are associated with the hang (so sender side events for the source cpu
> of the hanging request, and receiver side events for the target cpu),
> and the related events just before those (for adding data needed to
> identify a possible race). Printing all available data would be
> possible, but this would add large amounts of data printed on larger
> configurations.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Tested-by: Paul E. McKenney <paulmck@kernel.org>

Just an update regarding current status with debugging the underlying
issue:

On a customer's machine with a backport of this patch applied we've
seen another case of the hang. In the logs we've found:

smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting 
5000000046 ns for CPU#06 do_flush_tlb_all+0x0/0x30(          (null)).
smp: 	csd: CSD lock (#1) unresponsive.
smp: 	csd: cnt(0000000): 0000->0000 queue
smp: 	csd: cnt(0000001): ffff->0006 idle
smp: 	csd: cnt(0025dba): 0012->0006 queue
smp: 	csd: cnt(0025dbb): 0012->0006 noipi
smp: 	csd: cnt(01d1333): 001a->0006 pinged
smp: 	csd: cnt(01d1334): ffff->0006 gotipi
smp: 	csd: cnt(01d1335): ffff->0006 handle
smp: 	csd: cnt(01d1336): ffff->0006 dequeue (src cpu 0 == empty)
smp: 	csd: cnt(01d1337): ffff->0006 hdlend (src cpu 0 == early)
smp: 	csd: cnt(01d16cb): 0012->0005 ipi
smp: 	csd: cnt(01d16cc): 0012->0006 queue
smp: 	csd: cnt(01d16cd): 0012->0006 ipi
smp: 	csd: cnt(01d16f3): 0012->001a ipi
smp: 	csd: cnt(01d16f4): 0012->ffff ping
smp: 	csd: cnt(01d1750): ffff->0018 hdlend (src cpu 0 == early)
smp: 	csd: cnt(01d1751): 0012->ffff pinged
smp: 	csd: cnt now: 01d1769

So we see that cpu#18 (0012 hex) is waiting for cpu#06 (first line of 
the data).

The next 4 lines of the csd actions are not really interesting, as they 
are rather old.

Then we see that cpu 0006 did handle a request rather recently (cnt 
01d1333 - 01d1337): cpu 001a pinged it via an IPI and it got the IPI, 
entered the handler, dequeued a request, and handled it.

Nearly all of the rest shows the critical request: cpu 0012 did a loop 
over probably all other cpus and queued the requests and marked them to 
be IPI-ed (including cpu 0006, cnt 01d16cd). Then the cpus marked to 
receive an IPI were pinged (cnt 01d16f4 and cnt 01d1751).

The entry cnt 01d1750 is not of interest here.

This data confirms that on the sending side everything seems to be okay 
at the level above the actual IPI sending. On the receiver side no IPI 
seems to have been seen, but there is no visible reason for a race either.

It seems as if we need more debugging in the deeper layers: is the IPI
really sent out, and is something being received on the destination cpu?
I'll have another try with even more debugging code, probably in private
on the customer machine first.


Juergen


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging
  2021-03-24 10:18   ` Jürgen Groß
@ 2021-03-30 17:33     ` Paul E. McKenney
  2021-04-02 15:46       ` Juergen Gross
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2021-03-30 17:33 UTC (permalink / raw)
  To: Jürgen Groß; +Cc: linux-kernel, mhocko, peterz

On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> On 02.03.21 07:28, Juergen Gross wrote:
> > In order to help identifying problems with IPI handling and remote
> > function execution add some more data to IPI debugging code.
> > 
> > There have been multiple reports of cpus looping long times (many
> > seconds) in smp_call_function_many() waiting for another cpu executing
> > a function like tlb flushing. Most of these reports have been for
> > cases where the kernel was running as a guest on top of KVM or Xen
> > (there are rumours of that happening under VMWare, too, and even on
> > bare metal).
> > 
> > Finding the root cause hasn't been successful yet, even after more than
> > 2 years of chasing this bug by different developers.
> > 
> > Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
> > diagnostics") tried to address this by adding some debug code and by
> > issuing another IPI when a hang was detected. This helped mitigating
> > the problem (the repeated IPI unlocks the hang), but the root cause is
> > still unknown.
> > 
> > Current available data suggests that either an IPI wasn't sent when it
> > should have been, or that the IPI didn't result in the target cpu
> > executing the queued function (due to the IPI not reaching the cpu,
> > the IPI handler not being called, or the handler not seeing the queued
> > request).
> > 
> > Try to add more diagnostic data by introducing a global atomic counter
> > which is being incremented when doing critical operations (before and
> > after queueing a new request, when sending an IPI, and when dequeueing
> > a request). The counter value is stored in percpu variables which can
> > be printed out when a hang is detected.
> > 
> > The data of the last event (consisting of sequence counter, source
> > cpu, target cpu, and event type) is stored in a global variable. When
> > a new event is to be traced, the data of the last event is stored in
> > the event related percpu location and the global data is updated with
> > the new event's data. This allows to track two events in one data
> > location: one by the value of the event data (the event before the
> > current one), and one by the location itself (the current event).
> > 
> > A typical printout with a detected hang will look like this:
> > 
> > csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
> > 	csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
> >          csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
> >          csd: cnt(00008cd): ffff->0006 idle
> >          csd: cnt(0003668): 0001->0006 queue
> >          csd: cnt(0003669): 0001->0006 ipi
> >          csd: cnt(0003e0f): 0007->000a queue
> >          csd: cnt(0003e10): 0001->ffff ping
> >          csd: cnt(0003e71): 0003->0000 ping
> >          csd: cnt(0003e72): ffff->0006 gotipi
> >          csd: cnt(0003e73): ffff->0006 handle
> >          csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
> >          csd: cnt(0003e7f): 0004->0006 ping
> >          csd: cnt(0003e80): 0001->ffff pinged
> >          csd: cnt(0003eb2): 0005->0001 noipi
> >          csd: cnt(0003eb3): 0001->0006 queue
> >          csd: cnt(0003eb4): 0001->0006 noipi
> >          csd: cnt now: 0003f00
> > 
> > This example (being an artificial one, produced with a previous version
> > of this patch without the "hdlend" event), shows that cpu#6 started to
> > handle an IPI (cnt 3e72-3e74), bit didn't start to handle another IPI
> > (sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
> > queued (3eb3), but no IPI was needed (cnt 3eb4, there was the event
> > from cpu#4 in the queue already).
> > 
> > The idea is to print only relevant entries. Those are all events which
> > are associated with the hang (so sender side events for the source cpu
> > of the hanging request, and receiver side events for the target cpu),
> > and the related events just before those (for adding data needed to
> > identify a possible race). Printing all available data would be
> > possible, but this would add large amounts of data printed on larger
> > configurations.
> > 
> > Signed-off-by: Juergen Gross <jgross@suse.com>
> > Tested-by: Paul E. McKenney <paulmck@kernel.org>
> 
> Just an update regarding current status with debugging the underlying
> issue:
> 
> On a customer's machine with a backport of this patch applied we've
> seen another case of the hang. In the logs we've found:
> 
> smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting
> 5000000046 ns for CPU#06 do_flush_tlb_all+0x0/0x30(          (null)).
> smp: 	csd: CSD lock (#1) unresponsive.
> smp: 	csd: cnt(0000000): 0000->0000 queue
> smp: 	csd: cnt(0000001): ffff->0006 idle
> smp: 	csd: cnt(0025dba): 0012->0006 queue
> smp: 	csd: cnt(0025dbb): 0012->0006 noipi
> smp: 	csd: cnt(01d1333): 001a->0006 pinged
> smp: 	csd: cnt(01d1334): ffff->0006 gotipi
> smp: 	csd: cnt(01d1335): ffff->0006 handle
> smp: 	csd: cnt(01d1336): ffff->0006 dequeue (src cpu 0 == empty)
> smp: 	csd: cnt(01d1337): ffff->0006 hdlend (src cpu 0 == early)
> smp: 	csd: cnt(01d16cb): 0012->0005 ipi
> smp: 	csd: cnt(01d16cc): 0012->0006 queue
> smp: 	csd: cnt(01d16cd): 0012->0006 ipi
> smp: 	csd: cnt(01d16f3): 0012->001a ipi
> smp: 	csd: cnt(01d16f4): 0012->ffff ping
> smp: 	csd: cnt(01d1750): ffff->0018 hdlend (src cpu 0 == early)
> smp: 	csd: cnt(01d1751): 0012->ffff pinged
> smp: 	csd: cnt now: 01d1769
> 
> So we see that cpu#18 (0012 hex) is waiting for cpu#06 (first line of the
> data).
> 
> The next 4 lines of the csd actions are not really interesting, as they are
> rather old.
> 
> Then we see that cpu 0006 did handle a request rather recently (cnt 01d1333
> - 01d1337): cpu 001a pinged it via an IPI and it got the IPI, entered the
> handler, dequeued a request, and handled it.
> 
> Nearly all of the rest shows the critical request: cpu 0012 did a loop over
> probably all other cpus and queued the requests and marked them to be IPI-ed
> (including cpu 0006, cnt 01d16cd). Then the cpus marked to receive an IPI
> were pinged (cnt 01d16f4 and cnt 01d1751).
> 
> The entry cnt 01d1750 is not of interest here.
> 
> This data confirms that on sending side everything seems to be okay at the
> level above the actual IPI sending. On receiver side there seems no IPI to
> be seen, but there is no visible reason for a race either.
> 
> It seems as if we need more debugging in the deeper layers: is the IPI
> really sent out, and is something being received on the destination cpu?
> I'll have another try with even more debugging code, probably in private
> on the customer machine first.

Apologies for the late reply, was out last week.

Excellent news, and thank you!

For my part, I have put together a rough prototype script that allows
me to run scftorture on larger groups of systems and started running it,
though I am hoping that 1,000 is far more than will be required.
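
(For anyone wanting to try to reproduce this locally: a single-system
scftorture run can be started with the in-tree torture harness roughly
as sketched below; the flags and paths are illustrative and should be
checked against the kernel tree being used.)

  tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --duration 120
  # or, on a running kernel with scftorture built as a module, simply:
  modprobe scftorture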

Your diagnosis of a lost IPI matches what we have been able to glean
from the occasional occurrences in the wild on our systems, for whatever
that might be worth.  The hope is to get something that reproduces more
quickly, which would allow deeper debugging at this end as well.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging
  2021-03-30 17:33     ` Paul E. McKenney
@ 2021-04-02 15:46       ` Juergen Gross
  2021-04-02 16:11         ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Juergen Gross @ 2021-04-02 15:46 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, mhocko, peterz


On 30.03.21 19:33, Paul E. McKenney wrote:
> On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
>> On 02.03.21 07:28, Juergen Gross wrote:
>>> In order to help identifying problems with IPI handling and remote
>>> function execution add some more data to IPI debugging code.
>>>
>>> There have been multiple reports of cpus looping long times (many
>>> seconds) in smp_call_function_many() waiting for another cpu executing
>>> a function like tlb flushing. Most of these reports have been for
>>> cases where the kernel was running as a guest on top of KVM or Xen
>>> (there are rumours of that happening under VMWare, too, and even on
>>> bare metal).
>>>
>>> Finding the root cause hasn't been successful yet, even after more than
>>> 2 years of chasing this bug by different developers.
>>>
>>> Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
>>> diagnostics") tried to address this by adding some debug code and by
>>> issuing another IPI when a hang was detected. This helped mitigating
>>> the problem (the repeated IPI unlocks the hang), but the root cause is
>>> still unknown.
>>>
>>> Current available data suggests that either an IPI wasn't sent when it
>>> should have been, or that the IPI didn't result in the target cpu
>>> executing the queued function (due to the IPI not reaching the cpu,
>>> the IPI handler not being called, or the handler not seeing the queued
>>> request).
>>>
>>> Try to add more diagnostic data by introducing a global atomic counter
>>> which is being incremented when doing critical operations (before and
>>> after queueing a new request, when sending an IPI, and when dequeueing
>>> a request). The counter value is stored in percpu variables which can
>>> be printed out when a hang is detected.
>>>
>>> The data of the last event (consisting of sequence counter, source
>>> cpu, target cpu, and event type) is stored in a global variable. When
>>> a new event is to be traced, the data of the last event is stored in
>>> the event related percpu location and the global data is updated with
>>> the new event's data. This allows to track two events in one data
>>> location: one by the value of the event data (the event before the
>>> current one), and one by the location itself (the current event).
>>>
>>> A typical printout with a detected hang will look like this:
>>>
>>> csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
>>> 	csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
>>>           csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
>>>           csd: cnt(00008cd): ffff->0006 idle
>>>           csd: cnt(0003668): 0001->0006 queue
>>>           csd: cnt(0003669): 0001->0006 ipi
>>>           csd: cnt(0003e0f): 0007->000a queue
>>>           csd: cnt(0003e10): 0001->ffff ping
>>>           csd: cnt(0003e71): 0003->0000 ping
>>>           csd: cnt(0003e72): ffff->0006 gotipi
>>>           csd: cnt(0003e73): ffff->0006 handle
>>>           csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
>>>           csd: cnt(0003e7f): 0004->0006 ping
>>>           csd: cnt(0003e80): 0001->ffff pinged
>>>           csd: cnt(0003eb2): 0005->0001 noipi
>>>           csd: cnt(0003eb3): 0001->0006 queue
>>>           csd: cnt(0003eb4): 0001->0006 noipi
>>>           csd: cnt now: 0003f00
>>>
>>> This example (being an artificial one, produced with a previous version
>>> of this patch without the "hdlend" event), shows that cpu#6 started to
>>> handle an IPI (cnt 3e72-3e74), bit didn't start to handle another IPI
>>> (sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
>>> queued (3eb3), but no IPI was needed (cnt 3eb4, there was the event
>>> from cpu#4 in the queue already).
>>>
>>> The idea is to print only relevant entries. Those are all events which
>>> are associated with the hang (so sender side events for the source cpu
>>> of the hanging request, and receiver side events for the target cpu),
>>> and the related events just before those (for adding data needed to
>>> identify a possible race). Printing all available data would be
>>> possible, but this would add large amounts of data printed on larger
>>> configurations.
>>>
>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>> Tested-by: Paul E. McKenney <paulmck@kernel.org>
>>
>> Just an update regarding current status with debugging the underlying
>> issue:
>>
>> On a customer's machine with a backport of this patch applied we've
>> seen another case of the hang. In the logs we've found:
>>
>> smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting
>> 5000000046 ns for CPU#06 do_flush_tlb_all+0x0/0x30(          (null)).
>> smp: 	csd: CSD lock (#1) unresponsive.
>> smp: 	csd: cnt(0000000): 0000->0000 queue
>> smp: 	csd: cnt(0000001): ffff->0006 idle
>> smp: 	csd: cnt(0025dba): 0012->0006 queue
>> smp: 	csd: cnt(0025dbb): 0012->0006 noipi
>> smp: 	csd: cnt(01d1333): 001a->0006 pinged
>> smp: 	csd: cnt(01d1334): ffff->0006 gotipi
>> smp: 	csd: cnt(01d1335): ffff->0006 handle
>> smp: 	csd: cnt(01d1336): ffff->0006 dequeue (src cpu 0 == empty)
>> smp: 	csd: cnt(01d1337): ffff->0006 hdlend (src cpu 0 == early)
>> smp: 	csd: cnt(01d16cb): 0012->0005 ipi
>> smp: 	csd: cnt(01d16cc): 0012->0006 queue
>> smp: 	csd: cnt(01d16cd): 0012->0006 ipi
>> smp: 	csd: cnt(01d16f3): 0012->001a ipi
>> smp: 	csd: cnt(01d16f4): 0012->ffff ping
>> smp: 	csd: cnt(01d1750): ffff->0018 hdlend (src cpu 0 == early)
>> smp: 	csd: cnt(01d1751): 0012->ffff pinged
>> smp: 	csd: cnt now: 01d1769
>>
>> So we see that cpu#18 (0012 hex) is waiting for cpu#06 (first line of the
>> data).
>>
>> The next 4 lines of the csd actions are not really interesting, as they are
>> rather old.
>>
>> Then we see that cpu 0006 did handle a request rather recently (cnt 01d1333
>> - 01d1337): cpu 001a pinged it via an IPI and it got the IPI, entered the
>> handler, dequeued a request, and handled it.
>>
>> Nearly all of the rest shows the critical request: cpu 0012 did a loop over
>> probably all other cpus and queued the requests and marked them to be IPI-ed
>> (including cpu 0006, cnt 01d16cd). Then the cpus marked to receive an IPI
>> were pinged (cnt 01d16f4 and cnt 01d1751).
>>
>> The entry cnt 01d1750 is not of interest here.
>>
>> This data confirms that on sending side everything seems to be okay at the
>> level above the actual IPI sending. On receiver side there seems no IPI to
>> be seen, but there is no visible reason for a race either.
>>
>> It seems as if we need more debugging in the deeper layers: is the IPI
>> really sent out, and is something being received on the destination cpu?
>> I'll have another try with even more debugging code, probably in private
>> on the customer machine first.
> 
> Apologies for the late reply, was out last week.
> 
> Excellent news, and thank you!
> 
> For my part, I have put together a rough prototype script that allows
> me to run scftorture on larger groups of systems and started running it,
> though I am hoping that 1,000 is far more than will be required.
> 
> Your diagnosis of a lost IPI matches what we have been able to glean
> from the occasional occurrences in the wild on our systems, for whatever
> that might be worth.  The hope is to get something that reproduces more
> quickly, which would allow deeper debugging at this end as well.

Sometimes one is lucky.

I've found a reproducer while hunting another bug. The test on that
machine will trigger the csd_lock timeout about once a day.

I've used my new debug kernel and found that the IPI is really sent
out (more precise: the hypervisor has been requested to do so, and
it didn't report an error). On the target cpu there was no interrupt
received after that, so the IPI has not been swallowed on the target
cpu by the Linux kernel.

Will now try to instrument the hypervisor to get more data.


Juergen


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging
  2021-04-02 15:46       ` Juergen Gross
@ 2021-04-02 16:11         ` Paul E. McKenney
  2021-04-05  7:37           ` Juergen Gross
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2021-04-02 16:11 UTC (permalink / raw)
  To: Juergen Gross; +Cc: linux-kernel, mhocko, peterz

On Fri, Apr 02, 2021 at 05:46:52PM +0200, Juergen Gross wrote:
> On 30.03.21 19:33, Paul E. McKenney wrote:
> > On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> > > On 02.03.21 07:28, Juergen Gross wrote:
> > > > In order to help identifying problems with IPI handling and remote
> > > > function execution add some more data to IPI debugging code.
> > > > 
> > > > There have been multiple reports of cpus looping long times (many
> > > > seconds) in smp_call_function_many() waiting for another cpu executing
> > > > a function like tlb flushing. Most of these reports have been for
> > > > cases where the kernel was running as a guest on top of KVM or Xen
> > > > (there are rumours of that happening under VMWare, too, and even on
> > > > bare metal).
> > > > 
> > > > Finding the root cause hasn't been successful yet, even after more than
> > > > 2 years of chasing this bug by different developers.
> > > > 
> > > > Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
> > > > diagnostics") tried to address this by adding some debug code and by
> > > > issuing another IPI when a hang was detected. This helped mitigating
> > > > the problem (the repeated IPI unlocks the hang), but the root cause is
> > > > still unknown.
> > > > 
> > > > Current available data suggests that either an IPI wasn't sent when it
> > > > should have been, or that the IPI didn't result in the target cpu
> > > > executing the queued function (due to the IPI not reaching the cpu,
> > > > the IPI handler not being called, or the handler not seeing the queued
> > > > request).
> > > > 
> > > > Try to add more diagnostic data by introducing a global atomic counter
> > > > which is being incremented when doing critical operations (before and
> > > > after queueing a new request, when sending an IPI, and when dequeueing
> > > > a request). The counter value is stored in percpu variables which can
> > > > be printed out when a hang is detected.
> > > > 
> > > > The data of the last event (consisting of sequence counter, source
> > > > cpu, target cpu, and event type) is stored in a global variable. When
> > > > a new event is to be traced, the data of the last event is stored in
> > > > the event related percpu location and the global data is updated with
> > > > the new event's data. This allows to track two events in one data
> > > > location: one by the value of the event data (the event before the
> > > > current one), and one by the location itself (the current event).
> > > > 
> > > > A typical printout with a detected hang will look like this:
> > > > 
> > > > csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
> > > > 	csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
> > > >           csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
> > > >           csd: cnt(00008cd): ffff->0006 idle
> > > >           csd: cnt(0003668): 0001->0006 queue
> > > >           csd: cnt(0003669): 0001->0006 ipi
> > > >           csd: cnt(0003e0f): 0007->000a queue
> > > >           csd: cnt(0003e10): 0001->ffff ping
> > > >           csd: cnt(0003e71): 0003->0000 ping
> > > >           csd: cnt(0003e72): ffff->0006 gotipi
> > > >           csd: cnt(0003e73): ffff->0006 handle
> > > >           csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
> > > >           csd: cnt(0003e7f): 0004->0006 ping
> > > >           csd: cnt(0003e80): 0001->ffff pinged
> > > >           csd: cnt(0003eb2): 0005->0001 noipi
> > > >           csd: cnt(0003eb3): 0001->0006 queue
> > > >           csd: cnt(0003eb4): 0001->0006 noipi
> > > >           csd: cnt now: 0003f00
> > > > 
> > > > This example (being an artificial one, produced with a previous version
> > > > of this patch without the "hdlend" event), shows that cpu#6 started to
> > > > handle an IPI (cnt 3e72-3e74), bit didn't start to handle another IPI
> > > > (sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
> > > > queued (3eb3), but no IPI was needed (cnt 3eb4, there was the event
> > > > from cpu#4 in the queue already).
> > > > 
> > > > The idea is to print only relevant entries. Those are all events which
> > > > are associated with the hang (so sender side events for the source cpu
> > > > of the hanging request, and receiver side events for the target cpu),
> > > > and the related events just before those (for adding data needed to
> > > > identify a possible race). Printing all available data would be
> > > > possible, but this would add large amounts of data printed on larger
> > > > configurations.
> > > > 
> > > > Signed-off-by: Juergen Gross <jgross@suse.com>
> > > > Tested-by: Paul E. McKenney <paulmck@kernel.org>
> > > 
> > > Just an update regarding current status with debugging the underlying
> > > issue:
> > > 
> > > On a customer's machine with a backport of this patch applied we've
> > > seen another case of the hang. In the logs we've found:
> > > 
> > > smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting
> > > 5000000046 ns for CPU#06 do_flush_tlb_all+0x0/0x30(          (null)).
> > > smp: 	csd: CSD lock (#1) unresponsive.
> > > smp: 	csd: cnt(0000000): 0000->0000 queue
> > > smp: 	csd: cnt(0000001): ffff->0006 idle
> > > smp: 	csd: cnt(0025dba): 0012->0006 queue
> > > smp: 	csd: cnt(0025dbb): 0012->0006 noipi
> > > smp: 	csd: cnt(01d1333): 001a->0006 pinged
> > > smp: 	csd: cnt(01d1334): ffff->0006 gotipi
> > > smp: 	csd: cnt(01d1335): ffff->0006 handle
> > > smp: 	csd: cnt(01d1336): ffff->0006 dequeue (src cpu 0 == empty)
> > > smp: 	csd: cnt(01d1337): ffff->0006 hdlend (src cpu 0 == early)
> > > smp: 	csd: cnt(01d16cb): 0012->0005 ipi
> > > smp: 	csd: cnt(01d16cc): 0012->0006 queue
> > > smp: 	csd: cnt(01d16cd): 0012->0006 ipi
> > > smp: 	csd: cnt(01d16f3): 0012->001a ipi
> > > smp: 	csd: cnt(01d16f4): 0012->ffff ping
> > > smp: 	csd: cnt(01d1750): ffff->0018 hdlend (src cpu 0 == early)
> > > smp: 	csd: cnt(01d1751): 0012->ffff pinged
> > > smp: 	csd: cnt now: 01d1769
> > > 
> > > So we see that cpu#18 (0012 hex) is waiting for cpu#06 (first line of the
> > > data).
> > > 
> > > The next 4 lines of the csd actions are not really interesting, as they are
> > > rather old.
> > > 
> > > Then we see that cpu 0006 did handle a request rather recently (cnt 01d1333
> > > - 01d1337): cpu 001a pinged it via an IPI and it got the IPI, entered the
> > > handler, dequeued a request, and handled it.
> > > 
> > > Nearly all of the rest shows the critical request: cpu 0012 did a loop over
> > > probably all other cpus and queued the requests and marked them to be IPI-ed
> > > (including cpu 0006, cnt 01d16cd). Then the cpus marked to receive an IPI
> > > were pinged (cnt 01d16f4 and cnt 01d1751).
> > > 
> > > The entry cnt 01d1750 is not of interest here.
> > > 
> > > This data confirms that on the sending side everything seems to be okay at
> > > the level above the actual IPI sending. On the receiver side no IPI appears
> > > to have been received, but there is no visible reason for a race either.
> > > 
> > > It seems as if we need more debugging in the deeper layers: is the IPI
> > > really sent out, and is something being received on the destination cpu?
> > > I'll have another try with even more debugging code, probably in private
> > > on the customer machine first.
> > 
> > Apologies for the late reply, was out last week.
> > 
> > Excellent news, and thank you!
> > 
> > For my part, I have put together a rough prototype script that allows
> > me to run scftorture on larger groups of systems and started running it,
> > though I am hoping that 1,000 is far more than will be required.
> > 
> > Your diagnosis of a lost IPI matches what we have been able to glean
> > from the occasional occurrences in the wild on our systems, for whatever
> > that might be worth.  The hope is to get something that reproduces more
> > quickly, which would allow deeper debugging at this end as well.
> 
> Sometimes one is lucky.
> 
> I've found a reproducer while hunting another bug. The test on that
> machine will trigger the csd_lock timeout about once a day.

Nice!!!  You are way ahead of me!

> I've used my new debug kernel and found that the IPI is really sent
> out (more precisely: the hypervisor has been requested to do so, and
> it didn't report an error). On the target cpu there was no interrupt
> received after that, so the IPI has not been swallowed on the target
> cpu by the Linux kernel.
> 
> Will now try to instrument the hypervisor to get more data.

I am increasing the number and types of systems and the test duration.
I just started running three different systems with IPI workloads in both
guests and on the host over the weekend.
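
(For reference, these scftorture runs use the in-tree rcutorture scripting.
A typical invocation looks roughly like the line below; it is meant only as
an illustration, and the exact flags depend on the tree in use.)

    tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture scf --duration 600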

							Thanx, Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging
  2021-04-02 16:11         ` Paul E. McKenney
@ 2021-04-05  7:37           ` Juergen Gross
  2021-04-05 15:29             ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Juergen Gross @ 2021-04-05  7:37 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel, mhocko, peterz


[-- Attachment #1.1.1: Type: text/plain, Size: 9393 bytes --]

On 02.04.21 18:11, Paul E. McKenney wrote:
> On Fri, Apr 02, 2021 at 05:46:52PM +0200, Juergen Gross wrote:
>> On 30.03.21 19:33, Paul E. McKenney wrote:
>>> On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
>>>> On 02.03.21 07:28, Juergen Gross wrote:
>>>>> In order to help identify problems with IPI handling and remote
>>>>> function execution add some more data to IPI debugging code.
>>>>>
>>>>> There have been multiple reports of cpus looping long times (many
>>>>> seconds) in smp_call_function_many() waiting for another cpu executing
>>>>> a function like tlb flushing. Most of these reports have been for
>>>>> cases where the kernel was running as a guest on top of KVM or Xen
>>>>> (there are rumours of that happening under VMWare, too, and even on
>>>>> bare metal).
>>>>>
>>>>> Finding the root cause hasn't been successful yet, even after more than
>>>>> 2 years of chasing this bug by different developers.
>>>>>
>>>>> Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
>>>>> diagnostics") tried to address this by adding some debug code and by
>>>>> issuing another IPI when a hang was detected. This helped mitigate
>>>>> the problem (the repeated IPI unlocks the hang), but the root cause is
>>>>> still unknown.
>>>>>
>>>>> Current available data suggests that either an IPI wasn't sent when it
>>>>> should have been, or that the IPI didn't result in the target cpu
>>>>> executing the queued function (due to the IPI not reaching the cpu,
>>>>> the IPI handler not being called, or the handler not seeing the queued
>>>>> request).
>>>>>
>>>>> Try to add more diagnostic data by introducing a global atomic counter
>>>>> which is being incremented when doing critical operations (before and
>>>>> after queueing a new request, when sending an IPI, and when dequeueing
>>>>> a request). The counter value is stored in percpu variables which can
>>>>> be printed out when a hang is detected.
>>>>>
>>>>> The data of the last event (consisting of sequence counter, source
>>>>> cpu, target cpu, and event type) is stored in a global variable. When
>>>>> a new event is to be traced, the data of the last event is stored in
>>>>> the event related percpu location and the global data is updated with
>>>>> the new event's data. This allows tracking two events in one data
>>>>> location: one by the value of the event data (the event before the
>>>>> current one), and one by the location itself (the current event).
>>>>>
>>>>> A typical printout with a detected hang will look like this:
>>>>>
>>>>> csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
>>>>> 	csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
>>>>>            csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
>>>>>            csd: cnt(00008cd): ffff->0006 idle
>>>>>            csd: cnt(0003668): 0001->0006 queue
>>>>>            csd: cnt(0003669): 0001->0006 ipi
>>>>>            csd: cnt(0003e0f): 0007->000a queue
>>>>>            csd: cnt(0003e10): 0001->ffff ping
>>>>>            csd: cnt(0003e71): 0003->0000 ping
>>>>>            csd: cnt(0003e72): ffff->0006 gotipi
>>>>>            csd: cnt(0003e73): ffff->0006 handle
>>>>>            csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
>>>>>            csd: cnt(0003e7f): 0004->0006 ping
>>>>>            csd: cnt(0003e80): 0001->ffff pinged
>>>>>            csd: cnt(0003eb2): 0005->0001 noipi
>>>>>            csd: cnt(0003eb3): 0001->0006 queue
>>>>>            csd: cnt(0003eb4): 0001->0006 noipi
>>>>>            csd: cnt now: 0003f00
>>>>>
>>>>> This example (being an artificial one, produced with a previous version
>>>>> of this patch without the "hdlend" event), shows that cpu#6 started to
>>>>> handle an IPI (cnt 3e72-3e74), but didn't start to handle another IPI
>>>>> (sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
>>>>> queued (3eb3), but no IPI was needed (cnt 3eb4, there was the event
>>>>> from cpu#4 in the queue already).
>>>>>
>>>>> The idea is to print only relevant entries. Those are all events which
>>>>> are associated with the hang (so sender side events for the source cpu
>>>>> of the hanging request, and receiver side events for the target cpu),
>>>>> and the related events just before those (for adding data needed to
>>>>> identify a possible race). Printing all available data would be
>>>>> possible, but this would add large amounts of data printed on larger
>>>>> configurations.
>>>>>
>>>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>>>> Tested-by: Paul E. McKenney <paulmck@kernel.org>
>>>>
>>>> Just an update regarding current status with debugging the underlying
>>>> issue:
>>>>
>>>> On a customer's machine with a backport of this patch applied we've
>>>> seen another case of the hang. In the logs we've found:
>>>>
>>>> smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting
>>>> 5000000046 ns for CPU#06 do_flush_tlb_all+0x0/0x30(          (null)).
>>>> smp: 	csd: CSD lock (#1) unresponsive.
>>>> smp: 	csd: cnt(0000000): 0000->0000 queue
>>>> smp: 	csd: cnt(0000001): ffff->0006 idle
>>>> smp: 	csd: cnt(0025dba): 0012->0006 queue
>>>> smp: 	csd: cnt(0025dbb): 0012->0006 noipi
>>>> smp: 	csd: cnt(01d1333): 001a->0006 pinged
>>>> smp: 	csd: cnt(01d1334): ffff->0006 gotipi
>>>> smp: 	csd: cnt(01d1335): ffff->0006 handle
>>>> smp: 	csd: cnt(01d1336): ffff->0006 dequeue (src cpu 0 == empty)
>>>> smp: 	csd: cnt(01d1337): ffff->0006 hdlend (src cpu 0 == early)
>>>> smp: 	csd: cnt(01d16cb): 0012->0005 ipi
>>>> smp: 	csd: cnt(01d16cc): 0012->0006 queue
>>>> smp: 	csd: cnt(01d16cd): 0012->0006 ipi
>>>> smp: 	csd: cnt(01d16f3): 0012->001a ipi
>>>> smp: 	csd: cnt(01d16f4): 0012->ffff ping
>>>> smp: 	csd: cnt(01d1750): ffff->0018 hdlend (src cpu 0 == early)
>>>> smp: 	csd: cnt(01d1751): 0012->ffff pinged
>>>> smp: 	csd: cnt now: 01d1769
>>>>
>>>> So we see that cpu#18 (0012 hex) is waiting for cpu#06 (first line of the
>>>> data).
>>>>
>>>> The next 4 lines of the csd actions are not really interesting, as they are
>>>> rather old.
>>>>
>>>> Then we see that cpu 0006 did handle a request rather recently (cnt 01d1333
>>>> - 01d1337): cpu 001a pinged it via an IPI and it got the IPI, entered the
>>>> handler, dequeued a request, and handled it.
>>>>
>>>> Nearly all of the rest shows the critical request: cpu 0012 did a loop over
>>>> probably all other cpus and queued the requests and marked them to be IPI-ed
>>>> (including cpu 0006, cnt 01d16cd). Then the cpus marked to receive an IPI
>>>> were pinged (cnt 01d16f4 and cnt 01d1751).
>>>>
>>>> The entry cnt 01d1750 is not of interest here.
>>>>
>>>> This data confirms that on the sending side everything seems to be okay at
>>>> the level above the actual IPI sending. On the receiver side no IPI appears
>>>> to have been received, but there is no visible reason for a race either.
>>>>
>>>> It seems as if we need more debugging in the deeper layers: is the IPI
>>>> really sent out, and is something being received on the destination cpu?
>>>> I'll have another try with even more debugging code, probably in private
>>>> on the customer machine first.
>>>
>>> Apologies for the late reply, was out last week.
>>>
>>> Excellent news, and thank you!
>>>
>>> For my part, I have put together a rough prototype script that allows
>>> me to run scftorture on larger groups of systems and started running it,
>>> though I am hoping that 1,000 is far more than will be required.
>>>
>>> Your diagnosis of a lost IPI matches what we have been able to glean
>>> from the occasional occurrences in the wild on our systems, for whatever
>>> that might be worth.  The hope is to get something that reproduces more
>>> quickly, which would allow deeper debugging at this end as well.
>>
>> Sometimes one is lucky.
>>
>> I've found a reproducer while hunting another bug. The test on that
>> machine will trigger the csd_lock timeout about once a day.
> 
> Nice!!!  You are way ahead of me!
> 
>> I've used my new debug kernel and found that the IPI is really sent
>> out (more precisely: the hypervisor has been requested to do so, and
>> it didn't report an error). On the target cpu there was no interrupt
>> received after that, so the IPI has not been swallowed on the target
>> cpu by the Linux kernel.
>>
>> Will now try to instrument the hypervisor to get more data.
> 
> I am increasing the number and types of systems and the test duration.
> I just started running three different systems with IPI workloads in both
> guests and on the host over the weekend.

Maybe you can try my kind of workload:

I have a guest with 16 vcpus and 8 GB of memory running 8 instances of

sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 
--max-time=300 --max-requests=0 run

on disjoint NFS mounts. Those have been created with:

mount -t nfs -o 
rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport 
server:/share[1-8] /mount[1-8]

with the server running on the host system of the guest and the shares
located in a ramdisk.

The host has 72 cpus and 48 GB of RAM.

A csd lock timeout happens about once per day on the host.
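
In case it is useful, here is a rough, untested sketch of driving that setup
from a single script; the server name, share paths, and mount points are
placeholders for the real environment:

#!/bin/sh
# Mount the 8 NFS shares exported by the host (ramdisk-backed on the server).
for i in $(seq 1 8); do
    mount -t nfs -o rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport \
        server:/share$i /mount$i
done

# Run one sysbench fileio instance per mount point, all in parallel.
for i in $(seq 1 8); do
    (
        cd /mount$i || exit 1
        # fileio needs a prepare step to create its test files first
        sysbench --test=fileio prepare > /dev/null
        sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 \
            --max-time=300 --max-requests=0 run
    ) &
done
wait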


Juergen

[-- Attachment #1.1.2: OpenPGP_0xB0DE9DD628BF132F.asc --]
[-- Type: application/pgp-keys, Size: 3135 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging
  2021-04-05  7:37           ` Juergen Gross
@ 2021-04-05 15:29             ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2021-04-05 15:29 UTC (permalink / raw)
  To: Juergen Gross; +Cc: linux-kernel, mhocko, peterz

On Mon, Apr 05, 2021 at 09:37:40AM +0200, Juergen Gross wrote:
> On 02.04.21 18:11, Paul E. McKenney wrote:
> > On Fri, Apr 02, 2021 at 05:46:52PM +0200, Juergen Gross wrote:
> > > On 30.03.21 19:33, Paul E. McKenney wrote:
> > > > On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> > > > > On 02.03.21 07:28, Juergen Gross wrote:
> > > > > > In order to help identify problems with IPI handling and remote
> > > > > > function execution add some more data to IPI debugging code.
> > > > > > 
> > > > > > There have been multiple reports of cpus looping long times (many
> > > > > > seconds) in smp_call_function_many() waiting for another cpu executing
> > > > > > a function like tlb flushing. Most of these reports have been for
> > > > > > cases where the kernel was running as a guest on top of KVM or Xen
> > > > > > (there are rumours of that happening under VMWare, too, and even on
> > > > > > bare metal).
> > > > > > 
> > > > > > Finding the root cause hasn't been successful yet, even after more than
> > > > > > 2 years of chasing this bug by different developers.
> > > > > > 
> > > > > > Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
> > > > > > diagnostics") tried to address this by adding some debug code and by
> > > > > > issuing another IPI when a hang was detected. This helped mitigate
> > > > > > the problem (the repeated IPI unlocks the hang), but the root cause is
> > > > > > still unknown.
> > > > > > 
> > > > > > Current available data suggests that either an IPI wasn't sent when it
> > > > > > should have been, or that the IPI didn't result in the target cpu
> > > > > > executing the queued function (due to the IPI not reaching the cpu,
> > > > > > the IPI handler not being called, or the handler not seeing the queued
> > > > > > request).
> > > > > > 
> > > > > > Try to add more diagnostic data by introducing a global atomic counter
> > > > > > which is being incremented when doing critical operations (before and
> > > > > > after queueing a new request, when sending an IPI, and when dequeueing
> > > > > > a request). The counter value is stored in percpu variables which can
> > > > > > be printed out when a hang is detected.
> > > > > > 
> > > > > > The data of the last event (consisting of sequence counter, source
> > > > > > cpu, target cpu, and event type) is stored in a global variable. When
> > > > > > a new event is to be traced, the data of the last event is stored in
> > > > > > the event related percpu location and the global data is updated with
> > > > > > the new event's data. This allows tracking two events in one data
> > > > > > location: one by the value of the event data (the event before the
> > > > > > current one), and one by the location itself (the current event).
> > > > > > 
> > > > > > A typical printout with a detected hang will look like this:
> > > > > > 
> > > > > > csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
> > > > > > 	csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
> > > > > >            csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
> > > > > >            csd: cnt(00008cd): ffff->0006 idle
> > > > > >            csd: cnt(0003668): 0001->0006 queue
> > > > > >            csd: cnt(0003669): 0001->0006 ipi
> > > > > >            csd: cnt(0003e0f): 0007->000a queue
> > > > > >            csd: cnt(0003e10): 0001->ffff ping
> > > > > >            csd: cnt(0003e71): 0003->0000 ping
> > > > > >            csd: cnt(0003e72): ffff->0006 gotipi
> > > > > >            csd: cnt(0003e73): ffff->0006 handle
> > > > > >            csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
> > > > > >            csd: cnt(0003e7f): 0004->0006 ping
> > > > > >            csd: cnt(0003e80): 0001->ffff pinged
> > > > > >            csd: cnt(0003eb2): 0005->0001 noipi
> > > > > >            csd: cnt(0003eb3): 0001->0006 queue
> > > > > >            csd: cnt(0003eb4): 0001->0006 noipi
> > > > > >            csd: cnt now: 0003f00
> > > > > > 
> > > > > > This example (being an artificial one, produced with a previous version
> > > > > > of this patch without the "hdlend" event), shows that cpu#6 started to
> > > > > > handle an IPI (cnt 3e72-3e74), but didn't start to handle another IPI
> > > > > > (sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
> > > > > > queued (3eb3), but no IPI was needed (cnt 3eb4, there was the event
> > > > > > from cpu#4 in the queue already).
> > > > > > 
> > > > > > The idea is to print only relevant entries. Those are all events which
> > > > > > are associated with the hang (so sender side events for the source cpu
> > > > > > of the hanging request, and receiver side events for the target cpu),
> > > > > > and the related events just before those (for adding data needed to
> > > > > > identify a possible race). Printing all available data would be
> > > > > > possible, but this would add large amounts of data printed on larger
> > > > > > configurations.
> > > > > > 
> > > > > > Signed-off-by: Juergen Gross <jgross@suse.com>
> > > > > > Tested-by: Paul E. McKenney <paulmck@kernel.org>
> > > > > 
> > > > > Just an update regarding current status with debugging the underlying
> > > > > issue:
> > > > > 
> > > > > On a customer's machine with a backport of this patch applied we've
> > > > > seen another case of the hang. In the logs we've found:
> > > > > 
> > > > > smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting
> > > > > 5000000046 ns for CPU#06 do_flush_tlb_all+0x0/0x30(          (null)).
> > > > > smp: 	csd: CSD lock (#1) unresponsive.
> > > > > smp: 	csd: cnt(0000000): 0000->0000 queue
> > > > > smp: 	csd: cnt(0000001): ffff->0006 idle
> > > > > smp: 	csd: cnt(0025dba): 0012->0006 queue
> > > > > smp: 	csd: cnt(0025dbb): 0012->0006 noipi
> > > > > smp: 	csd: cnt(01d1333): 001a->0006 pinged
> > > > > smp: 	csd: cnt(01d1334): ffff->0006 gotipi
> > > > > smp: 	csd: cnt(01d1335): ffff->0006 handle
> > > > > smp: 	csd: cnt(01d1336): ffff->0006 dequeue (src cpu 0 == empty)
> > > > > smp: 	csd: cnt(01d1337): ffff->0006 hdlend (src cpu 0 == early)
> > > > > smp: 	csd: cnt(01d16cb): 0012->0005 ipi
> > > > > smp: 	csd: cnt(01d16cc): 0012->0006 queue
> > > > > smp: 	csd: cnt(01d16cd): 0012->0006 ipi
> > > > > smp: 	csd: cnt(01d16f3): 0012->001a ipi
> > > > > smp: 	csd: cnt(01d16f4): 0012->ffff ping
> > > > > smp: 	csd: cnt(01d1750): ffff->0018 hdlend (src cpu 0 == early)
> > > > > smp: 	csd: cnt(01d1751): 0012->ffff pinged
> > > > > smp: 	csd: cnt now: 01d1769
> > > > > 
> > > > > So we see that cpu#18 (0012 hex) is waiting for cpu#06 (first line of the
> > > > > data).
> > > > > 
> > > > > The next 4 lines of the csd actions are not really interesting, as they are
> > > > > rather old.
> > > > > 
> > > > > Then we see that cpu 0006 did handle a request rather recently (cnt 01d1333
> > > > > - 01d1337): cpu 001a pinged it via an IPI and it got the IPI, entered the
> > > > > handler, dequeued a request, and handled it.
> > > > > 
> > > > > Nearly all of the rest shows the critical request: cpu 0012 did a loop over
> > > > > probably all other cpus and queued the requests and marked them to be IPI-ed
> > > > > (including cpu 0006, cnt 01d16cd). Then the cpus marked to receive an IPI
> > > > > were pinged (cnt 01d16f4 and cnt 01d1751).
> > > > > 
> > > > > The entry cnt 01d1750 is not of interest here.
> > > > > 
> > > > > This data confirms that on the sending side everything seems to be okay at
> > > > > the level above the actual IPI sending. On the receiver side no IPI appears
> > > > > to have been received, but there is no visible reason for a race either.
> > > > > 
> > > > > It seems as if we need more debugging in the deeper layers: is the IPI
> > > > > really sent out, and is something being received on the destination cpu?
> > > > > I'll have another try with even more debugging code, probably in private
> > > > > on the customer machine first.
> > > > 
> > > > Apologies for the late reply, was out last week.
> > > > 
> > > > Excellent news, and thank you!
> > > > 
> > > > For my part, I have put together a rough prototype script that allows
> > > > me to run scftorture on larger groups of systems and started running it,
> > > > though I am hoping that 1,000 is far more than will be required.
> > > > 
> > > > Your diagnosis of a lost IPI matches what we have been able to glean
> > > > from the occasional occurrences in the wild on our systems, for whatever
> > > > that might be worth.  The hope is to get something that reproduces more
> > > > quickly, which would allow deeper debugging at this end as well.
> > > 
> > > Sometimes one is lucky.
> > > 
> > > I've found a reproducer while hunting another bug. The test on that
> > > machine will trigger the csd_lock timeout about once a day.
> > 
> > Nice!!!  You are way ahead of me!
> > 
> > > I've used my new debug kernel and found that the IPI is really sent
> > > out (more precisely: the hypervisor has been requested to do so, and
> > > it didn't report an error). On the target cpu there was no interrupt
> > > received after that, so the IPI has not been swallowed on the target
> > > cpu by the Linux kernel.
> > > 
> > > Will now try to instrument the hypervisor to get more data.
> > 
> > I am increasing the number and types of systems and the test duration.
> > I just started running three different systems with IPI workloads in both
> > guests and on the host over the weekend.
> 
> Maybe you can try my kind of workload:
> 
> I have a guest with 16 vcpus and 8 GB of memory running 8 instances of
> 
> sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 --max-time=300
> --max-requests=0 run
> 
> on disjoint NFS mounts. Those have been created with:
> 
> mount -t nfs -o
> rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport
> server:/share[1-8] /mount[1-8]
> 
> with the server running on the host system of the guest and the shares
> located in a ramdisk.
> 
> The host has 72 cpus and 48 GB of RAM.
> 
> A csd lock timeout happens about once per day on the host.

Thank you!  I will give it a try.  It has been quite some time since I
have done anything with NFS, so it should be entertaining.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging
  2022-08-16 10:53 [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging Chang-Ho Cho
@ 2022-08-16 15:58 ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2022-08-16 15:58 UTC (permalink / raw)
  To: Chang-Ho Cho; +Cc: jgross, linux-kernel, mhocko, peterz

On Tue, Aug 16, 2022 at 07:53:49PM +0900, Chang-Ho Cho wrote:
> Hello, 
> 
> A Google search brought me to this thread.  May I have an update on the progress of this issue? Has anyone found the root cause, a solution, or a workaround?
> How can I find out more about the problem?
> A Windows 2008 R2 VM is experiencing hangs on QEMU-KVM due to an IPI handling issue.

Here is a blog entry describing the bug that was causing me trouble:

https://paulmck.livejournal.com/62071.html

But although this appeared to be an IPI-loss problem, it was in fact
an interrupt storm.

Juergen, who you CCed, located a race-condition issue with a non-KVM
hypervisor, if I recall correctly.

Of course, much depends on the exact versions of your various OSes
and firmware.  One useful trick is to look for fixes since whichever
version of Linux you are using, and to backport those.  Also, for issues
with Windows, you are likely better served asking elsewhere.  Since you
are asking here, I am assuming that you have somehow proven that the
interrupt is being lost in KVM or QEMU rather than in the Windows guest.
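
For example, something along the following lines lists changes to the
CSD-lock code since a given release (the v5.10 tag is only a placeholder;
substitute whichever version you are actually running):

    git log --oneline v5.10.. -- kernel/smp.c
    git log --oneline -i --grep="csd" v5.10..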

Please note that observing the hang on QEMU/KVM but not on bare metal
is insufficient, given that the difference in timing can dramatically
change the probability of bugs occurring.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging
@ 2022-08-16 10:53 Chang-Ho Cho
  2022-08-16 15:58 ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Chang-Ho Cho @ 2022-08-16 10:53 UTC (permalink / raw)
  To: paulmck; +Cc: jgross, linux-kernel, mhocko, peterz

Hello, 

A Google search brought me to this thread.  May I have an update on the progress of this issue? Has anyone found the root cause, a solution, or a workaround?
How can I find out more about the problem?
A Windows 2008 R2 VM is experiencing hangs on QEMU-KVM due to an IPI handling issue.

Regards,
Changho

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-08-16 16:00 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-02  6:28 [PATCH v3 0/4] kernel/smp.c: add more CSD lock debugging Juergen Gross
2021-03-02  6:28 ` [PATCH v3 1/4] kernel/smp: add boot parameter for controlling " Juergen Gross
2021-03-02  6:28 ` [PATCH v3 2/4] kernel/smp: prepare more " Juergen Gross
2021-03-02  6:28 ` [PATCH v3 3/4] kernel/smp: add more data to " Juergen Gross
2021-03-24 10:18   ` Jürgen Groß
2021-03-30 17:33     ` Paul E. McKenney
2021-04-02 15:46       ` Juergen Gross
2021-04-02 16:11         ` Paul E. McKenney
2021-04-05  7:37           ` Juergen Gross
2021-04-05 15:29             ` Paul E. McKenney
2021-03-02  6:28 ` [PATCH v3 4/4] kernel/smp: fix flush_smp_call_function_queue() cpu offline detection Juergen Gross
2022-08-16 10:53 [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging Chang-Ho Cho
2022-08-16 15:58 ` Paul E. McKenney

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.