[PATCH v6 0/2] rcu: Add RCU stall diagnosis information

* [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
@ 2022-11-09  9:37 Zhen Lei
  2022-11-09  9:37 ` [PATCH v6 1/2] " Zhen Lei
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Zhen Lei @ 2022-11-09  9:37 UTC (permalink / raw)
  To: Paul E . McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	Joel Fernandes, rcu, linux-kernel
  Cc: Zhen Lei, Robert Elliott

v5 --> v6:
1. When there are more than two continuous RCU stalls, correctly handle the
   value of the second and subsequent sampling periods. Update comments and
   documentation.
   Thanks to Elliott, Robert for the test.
2. Change "rcu stall" to "RCU stall".

v4 --> v5:
1. Resolve a git am conflict. No code change.

v3 --> v4:
1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.

v2 --> v3:
1. Fix the return type of kstat_cpu_irqs_sum()
2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
   rcupdate.rcu_cpu_stall_deep_debug.
3. Add comments and normalize local variable name

v1 --> v2:
1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
   kcpustat_this_cpu cannot be used.
@@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
        if (r->gp_seq != rdp->gp_seq)
                return;

-       cpustat = kcpustat_this_cpu->cpustat;
+       cpustat = kcpustat_cpu(cpu).cpustat;
2. Move the start point of statistics from rcu_stall_kick_kthreads() to
   rcu_implicit_dynticks_qs(), removing the dependency on irq_work.

v1:
In some extreme cases, such as an I/O pressure test, the CPU usage may
be 100%, causing an RCU stall. In this case, the printed information about
current is not useful. Displaying the number and usage of hard interrupts,
soft interrupts, and context switches that are generated within half of
the CPU stall timeout can help us make a general judgment. In other
cases, we can preliminarily determine whether an infinite loop occurs
when local_irq, local_bh or preempt is disabled.
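
The preliminary judgment described above can be sketched as a small
decision function. This is only an illustration in userspace C, not code
from the series: struct stall_delta, enum stall_hint and classify_stall()
are hypothetical names, and the rules simply follow the four scenarios
described in the stallwarn.rst text of patch 2/2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: per-CPU deltas observed over one sampling period. */
struct stall_delta {
	uint64_t nr_hardirqs;	/* hard interrupts handled */
	uint64_t nr_softirqs;	/* soft interrupts handled */
	uint64_t nr_csw;	/* context switches */
	uint64_t system_ms;	/* in-kernel task CPU time, in ms */
};

/* Hypothetical: what the counters hint at. A hint, not a diagnosis. */
enum stall_hint {
	HINT_IRQS_DISABLED,	/* looping with interrupts disabled */
	HINT_BH_DISABLED,	/* looping with bottom halves disabled */
	HINT_PREEMPT_DISABLED,	/* looping with preemption disabled */
	HINT_IRQ_STORM,		/* no looping, massive interrupt load */
	HINT_UNKNOWN,		/* context switches happened; look elsewhere */
};

static enum stall_hint classify_stall(const struct stall_delta *d)
{
	assert(d != NULL);

	if (d->nr_csw != 0)
		return HINT_UNKNOWN;
	if (d->nr_hardirqs == 0)
		return HINT_IRQS_DISABLED;
	if (d->system_ms == 0)
		return HINT_IRQ_STORM;
	if (d->nr_softirqs == 0)
		return HINT_BH_DISABLED;
	return HINT_PREEMPT_DISABLED;
}
```

As the documentation notes, zero softirqs can also mean that no softirq
was raised on that CPU, so HINT_BH_DISABLED is the weakest of these hints.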

Zhen Lei (2):
  rcu: Add RCU stall diagnosis information
  doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information

 Documentation/RCU/stallwarn.rst               | 88 +++++++++++++++++++
 .../admin-guide/kernel-parameters.txt         |  6 ++
 kernel/rcu/Kconfig.debug                      | 11 +++
 kernel/rcu/rcu.h                              |  1 +
 kernel/rcu/tree.c                             | 17 ++++
 kernel/rcu/tree.h                             | 19 ++++
 kernel/rcu/tree_stall.h                       | 29 ++++++
 kernel/rcu/update.c                           |  2 +
 8 files changed, 173 insertions(+)

-- 
2.25.1


* [PATCH v6 1/2] rcu: Add RCU stall diagnosis information
  2022-11-09  9:37 [PATCH v6 0/2] rcu: Add RCU stall diagnosis information Zhen Lei
@ 2022-11-09  9:37 ` Zhen Lei
  2022-11-09 15:20   ` Frederic Weisbecker
  2022-11-09 16:55   ` Elliott, Robert (Servers)
  2022-11-09  9:37 ` [PATCH v6 2/2] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information Zhen Lei
  2022-11-09 15:26 ` [PATCH v6 0/2] rcu: Add RCU stall diagnosis information Frederic Weisbecker
  2 siblings, 2 replies; 19+ messages in thread
From: Zhen Lei @ 2022-11-09  9:37 UTC (permalink / raw)
  To: Paul E . McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	Joel Fernandes, rcu, linux-kernel
  Cc: Zhen Lei, Robert Elliott

Because RCU CPU stall warnings are driven from the scheduling-clock
interrupt handler, a workload consisting of a very large number of
short-duration hardware interrupts can result in misleading stall-warning
messages.  On systems supporting only a single level of interrupts,
that is, where interrupt handlers cannot be interrupted, this can
produce misleading diagnostics.  The stack traces will show the
innocent-bystander interrupted task, not the interrupts that are
at the very least exacerbating the stall.

This situation can be improved by displaying the number of interrupts
and the CPU time that they have consumed.  Diagnosing other types
of stalls can be eased by also providing the count of softirqs and
the CPU time that they consumed as well as the number of context
switches and the task-level CPU time consumed.

Consider the following output given this change:

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu:     0-....: (1250 ticks this GP) <omitted>
rcu:          hardirqs   softirqs   csw/system
rcu:  number:      624         45            0
rcu: cputime:       69          1         2425   ==> 2500(ms)

This output shows that the number of hard and soft interrupts is small,
there are no context switches, and the system takes up a lot of time. This
indicates that the current task is looping with preemption disabled.
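
How the two rows are derived can be sketched in userspace C. This is a
minimal illustration, not the kernel code: struct cpu_counters,
ns_delta_to_ms() and format_stall_stats() are hypothetical names, the
counters are assumed to be cumulative, and plain 64-bit division stands
in for div_u64(). Each printed value is "current counter minus snapshot",
with the cputime deltas converted from nanoseconds to milliseconds.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical userspace stand-in for the per-CPU counters being sampled. */
struct cpu_counters {
	uint64_t irq_ns;	/* cumulative hardirq CPU time */
	uint64_t softirq_ns;	/* cumulative softirq CPU time */
	uint64_t system_ns;	/* cumulative in-kernel task CPU time */
	uint64_t nr_hardirqs;	/* cumulative hard interrupt count */
	uint64_t nr_softirqs;	/* cumulative soft interrupt count */
	uint64_t nr_csw;	/* cumulative context switch count */
	uint64_t clock_ms;	/* time at which the sample was taken */
};

/* Cumulative counters never go backwards, so the delta is non-negative. */
static uint64_t ns_delta_to_ms(uint64_t now_ns, uint64_t then_ns)
{
	assert(now_ns >= then_ns);
	return (now_ns - then_ns) / NSEC_PER_MSEC;
}

/* Format the deltas between the snapshot and the current counters. */
static int format_stall_stats(char *buf, size_t len,
			      const struct cpu_counters *snap,
			      const struct cpu_counters *now)
{
	return snprintf(buf, len,
		"         hardirqs   softirqs   csw/system\n"
		" number: %8" PRIu64 " %10" PRIu64 " %12" PRIu64 "\n"
		"cputime: %8" PRIu64 " %10" PRIu64 " %12" PRIu64
		"   ==> %" PRIu64 "(ms)\n",
		now->nr_hardirqs - snap->nr_hardirqs,
		now->nr_softirqs - snap->nr_softirqs,
		now->nr_csw - snap->nr_csw,
		ns_delta_to_ms(now->irq_ns, snap->irq_ns),
		ns_delta_to_ms(now->softirq_ns, snap->softirq_ns),
		ns_delta_to_ms(now->system_ns, snap->system_ns),
		now->clock_ms - snap->clock_ms);
}
```

The integer division truncates, so the three cputime columns need not add
up exactly to the interval shown after "==>".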

The impact on system performance is negligible because the snapshot is
recorded only once for all continuous RCU stalls.
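
The "only once" behaviour relies on comparing the grace-period sequence
number stored in the snapshot with the current one. The following
userspace C sketch models that guard. It is an illustration under
assumptions: struct snap_record here is a trimmed-down, hypothetical
stand-in for struct rcu_snap_record, and snapshot_once() and
snapshot_usable() are made-up helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct rcu_snap_record. */
struct snap_record {
	unsigned long gp_seq;	/* grace period the snapshot belongs to */
	uint64_t system_ns;	/* cumulative system cputime at snapshot time */
	unsigned long taken_at;	/* timestamp of the snapshot */
};

/* Take a snapshot unless one already exists for this grace period. */
static bool snapshot_once(struct snap_record *snap, unsigned long cur_gp_seq,
			  uint64_t system_ns, unsigned long now)
{
	assert(snap != NULL);

	if (snap->gp_seq == cur_gp_seq)
		return false;	/* already sampled for this grace period */

	snap->system_ns = system_ns;
	snap->taken_at = now;
	snap->gp_seq = cur_gp_seq;
	return true;
}

/* A report is only meaningful if the snapshot matches the stalled GP. */
static bool snapshot_usable(const struct snap_record *snap,
			    unsigned long cur_gp_seq)
{
	assert(snap != NULL);
	return snap->gp_seq == cur_gp_seq;
}
```

Because later calls within the same grace period leave the record
untouched, every subsequent stall report for that grace period measures
from the same starting point.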

This added debugging information is suppressed by default and can be
enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or
by booting with rcupdate.rcu_cpu_stall_cputime=1.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 .../admin-guide/kernel-parameters.txt         |  6 ++++
 kernel/rcu/Kconfig.debug                      | 11 +++++++
 kernel/rcu/rcu.h                              |  1 +
 kernel/rcu/tree.c                             | 17 +++++++++++
 kernel/rcu/tree.h                             | 19 ++++++++++++
 kernel/rcu/tree_stall.h                       | 29 +++++++++++++++++++
 kernel/rcu/update.c                           |  2 ++
 7 files changed, 85 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a465d5242774af8..2729f3ad11d108b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5082,6 +5082,12 @@
 			rcupdate.rcu_cpu_stall_timeout to be used (after
 			conversion from seconds to milliseconds).
 
+	rcupdate.rcu_cpu_stall_cputime= [KNL]
+			Provide statistics on the cputime and count of
+			interrupts and tasks during the sampling period. For
+			multiple continuous RCU stalls, all sampling periods
+			begin at half of the first RCU stall timeout.
+
 	rcupdate.rcu_expedited= [KNL]
 			Use expedited grace-period primitives, for
 			example, synchronize_rcu_expedited() instead
diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 1b0c41d490f0588..025566a9ba44667 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -95,6 +95,17 @@ config RCU_EXP_CPU_STALL_TIMEOUT
 	  says to use the RCU_CPU_STALL_TIMEOUT value converted from
 	  seconds to milliseconds.
 
+config RCU_CPU_STALL_CPUTIME
+	bool "Provide additional RCU stall debug information"
+	depends on RCU_STALL_COMMON
+	default n
+	help
+	  Statistics collected during the sampling period, such as the number of
+	  (hard interrupts, soft interrupts, task switches) and the cputime of
+	  (hard interrupts, soft interrupts, kernel tasks), are added to the
+	  RCU stall report. For multiple continuous RCU stalls, all sampling
+	  periods begin at half of the first RCU stall timeout.
+
 config RCU_TRACE
 	bool "Enable tracing for RCU"
 	depends on DEBUG_KERNEL
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 96122f203187f39..4844dec36bddb48 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -231,6 +231,7 @@ extern int rcu_cpu_stall_ftrace_dump;
 extern int rcu_cpu_stall_suppress;
 extern int rcu_cpu_stall_timeout;
 extern int rcu_exp_cpu_stall_timeout;
+extern int rcu_cpu_stall_cputime;
 int rcu_jiffies_till_stall_check(void);
 int rcu_exp_jiffies_till_stall_check(void);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ed93ddb8203d42c..e1ff23b2a14d71d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -866,6 +866,23 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 			rdp->rcu_iw_gp_seq = rnp->gp_seq;
 			irq_work_queue_on(&rdp->rcu_iw, rdp->cpu);
 		}
+
+		if (rcu_cpu_stall_cputime && rdp->snap_record.gp_seq != rdp->gp_seq) {
+			u64 *cpustat;
+			struct rcu_snap_record *rsrp;
+
+			cpustat = kcpustat_cpu(rdp->cpu).cpustat;
+
+			rsrp = &rdp->snap_record;
+			rsrp->cputime_irq     = cpustat[CPUTIME_IRQ];
+			rsrp->cputime_softirq = cpustat[CPUTIME_SOFTIRQ];
+			rsrp->cputime_system  = cpustat[CPUTIME_SYSTEM];
+			rsrp->nr_hardirqs = kstat_cpu_irqs_sum(rdp->cpu);
+			rsrp->nr_softirqs = kstat_cpu_softirqs_sum(rdp->cpu);
+			rsrp->nr_csw = nr_context_switches_cpu(rdp->cpu);
+			rsrp->jiffies = jiffies;
+			rsrp->gp_seq = rdp->gp_seq;
+		}
 	}
 
 	return 0;
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index fcb5d696eb1700d..192536916f9a607 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -158,6 +158,23 @@ union rcu_noqs {
 	u16 s; /* Set of bits, aggregate OR here. */
 };
 
+/*
+ * Record the snapshot of the core stats at half of the first RCU stall timeout.
+ * The member gp_seq is used to ensure that all members are updated only once
+ * during the sampling period. The snapshot is taken only if this gp_seq is not
+ * equal to rdp->gp_seq.
+ */
+struct rcu_snap_record {
+	unsigned long	gp_seq;		/* Track rdp->gp_seq counter */
+	u64		cputime_irq;	/* Accumulated cputime of hard irqs */
+	u64		cputime_softirq;/* Accumulated cputime of soft irqs */
+	u64		cputime_system; /* Accumulated cputime of kernel tasks */
+	unsigned long	nr_hardirqs;	/* Accumulated number of hard irqs */
+	unsigned int	nr_softirqs;	/* Accumulated number of soft irqs */
+	unsigned long long nr_csw;	/* Accumulated number of task switches */
+	unsigned long   jiffies;	/* Track jiffies value */
+};
+
 /* Per-CPU data for read-copy update. */
 struct rcu_data {
 	/* 1) quiescent-state and grace-period handling : */
@@ -262,6 +279,8 @@ struct rcu_data {
 	short rcu_onl_gp_flags;		/* ->gp_flags at last online. */
 	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
 	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
+	struct rcu_snap_record snap_record; /* Snapshot of core stats at half of */
+					    /* the first RCU stall timeout */
 
 	long lazy_len;			/* Length of buffered lazy callbacks. */
 	int cpu;
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 5653560573e22d6..7b6afb9c7b96dbe 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -428,6 +428,33 @@ static bool rcu_is_rcuc_kthread_starving(struct rcu_data *rdp, unsigned long *jp
 	return j > 2 * HZ;
 }
 
+static void print_cpu_stat_info(int cpu)
+{
+	u64 *cpustat;
+	struct rcu_snap_record *rsrp;
+	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+
+	if (!rcu_cpu_stall_cputime)
+		return;
+
+	rsrp = &rdp->snap_record;
+	if (rsrp->gp_seq != rdp->gp_seq)
+		return;
+
+	cpustat = kcpustat_cpu(cpu).cpustat;
+
+	pr_err("         hardirqs   softirqs   csw/system\n");
+	pr_err(" number: %8ld %10d %12lld\n",
+		kstat_cpu_irqs_sum(cpu) - rsrp->nr_hardirqs,
+		kstat_cpu_softirqs_sum(cpu) - rsrp->nr_softirqs,
+		nr_context_switches_cpu(cpu) - rsrp->nr_csw);
+	pr_err("cputime: %8lld %10lld %12lld   ==> %lld(ms)\n",
+		div_u64(cpustat[CPUTIME_IRQ] - rsrp->cputime_irq, NSEC_PER_MSEC),
+		div_u64(cpustat[CPUTIME_SOFTIRQ] - rsrp->cputime_softirq, NSEC_PER_MSEC),
+		div_u64(cpustat[CPUTIME_SYSTEM] - rsrp->cputime_system, NSEC_PER_MSEC),
+		jiffies64_to_msecs(jiffies - rsrp->jiffies));
+}
+
 /*
  * Print out diagnostic information for the specified stalled CPU.
  *
@@ -484,6 +511,8 @@ static void print_cpu_stall_info(int cpu)
 	       data_race(rcu_state.n_force_qs) - rcu_state.n_force_qs_gpstart,
 	       rcuc_starved ? buf : "",
 	       falsepositive ? " (false positive?)" : "");
+
+	print_cpu_stat_info(cpu);
 }
 
 /* Complain about starvation of grace-period kthread.  */
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 738842c4886b235..aec76ccbe1e343b 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -508,6 +508,8 @@ int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
 module_param(rcu_cpu_stall_timeout, int, 0644);
 int rcu_exp_cpu_stall_timeout __read_mostly = CONFIG_RCU_EXP_CPU_STALL_TIMEOUT;
 module_param(rcu_exp_cpu_stall_timeout, int, 0644);
+int rcu_cpu_stall_cputime __read_mostly = IS_ENABLED(CONFIG_RCU_CPU_STALL_CPUTIME);
+module_param(rcu_cpu_stall_cputime, int, 0644);
 #endif /* #ifdef CONFIG_RCU_STALL_COMMON */
 
 // Suppress boot-time RCU CPU stall warnings and rcutorture writer stall
-- 
2.25.1



* [PATCH v6 2/2] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
  2022-11-09  9:37 [PATCH v6 0/2] rcu: Add RCU stall diagnosis information Zhen Lei
  2022-11-09  9:37 ` [PATCH v6 1/2] " Zhen Lei
@ 2022-11-09  9:37 ` Zhen Lei
  2022-11-09 15:08   ` Frederic Weisbecker
  2022-11-09 15:26 ` [PATCH v6 0/2] rcu: Add RCU stall diagnosis information Frederic Weisbecker
  2 siblings, 1 reply; 19+ messages in thread
From: Zhen Lei @ 2022-11-09  9:37 UTC (permalink / raw)
  To: Paul E . McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	Joel Fernandes, rcu, linux-kernel
  Cc: Zhen Lei, Robert Elliott

This commit documents how to quickly determine the bug causing a given
RCU CPU stall fault warning based on the output information provided
by CONFIG_RCU_CPU_STALL_CPUTIME=y.

[ paulmck: Apply wordsmithing. ]

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 Documentation/RCU/stallwarn.rst | 88 +++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst
index dfa4db8c0931eaf..5e24e849290a286 100644
--- a/Documentation/RCU/stallwarn.rst
+++ b/Documentation/RCU/stallwarn.rst
@@ -390,3 +390,91 @@ for example, "P3421".
 
 It is entirely possible to see stall warnings from normal and from
 expedited grace periods at about the same time during the same run.
+
+RCU_CPU_STALL_CPUTIME
+=====================
+
+In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
+rcupdate.rcu_cpu_stall_cputime=1, the following additional information
+is supplied with each RCU CPU stall warning::
+
+rcu:          hardirqs   softirqs   csw/system
+rcu:  number:      624         45            0
+rcu: cputime:       69          1         2425   ==> 2500(ms)
+
+These statistics are collected during the sampling period. The values
+in row "number:" are the number of hard interrupts, number of soft
+interrupts, and number of context switches on the stalled CPU. The
+first three values in row "cputime:" indicate the CPU time in
+milliseconds consumed by hard interrupts, soft interrupts, and tasks
+on the stalled CPU.  The last number is the measurement interval, again
+in milliseconds.  Because user-mode tasks normally do not cause RCU CPU
+stalls, these tasks are typically kernel tasks, which is why only the
+system CPU time is considered.
+
+The sampling period is shown as follows:
+|<------------first timeout---------->|<-----second timeout----->|
+|<--half timeout-->|<--half timeout-->|                          |
+|                  |<--first period-->|                          |
+|                  |<-----------second sampling period---------->|
+|                  |                  |                          |
+|          sampling time point    1st-stall                  2nd-stall
+
+
+The following describes four typical scenarios:
+
+1. A CPU looping with interrupts disabled.::
+
+   rcu:          hardirqs   softirqs   csw/system
+   rcu:  number:        0          0            0
+   rcu: cputime:        0          0            0   ==> 2500(ms)
+
+   Because interrupts have been disabled throughout the measurement
+   interval, there are no interrupts and no context switches.
+   Furthermore, because CPU time consumption was measured using interrupt
+   handlers, the system CPU consumption is misleadingly measured as zero.
+   This scenario will normally also have "(0 ticks this GP)" printed on
+   this CPU's summary line.
+
+2. A CPU looping with bottom halves disabled.
+
+   This is similar to the previous example, but with non-zero number of
+   and CPU time consumed by hard interrupts, along with non-zero CPU
+   time consumed by in-kernel execution.::
+
+   rcu:          hardirqs   softirqs   csw/system
+   rcu:  number:      624          0            0
+   rcu: cputime:       49          0         2446   ==> 2500(ms)
+
+   The fact that there are zero softirqs gives a hint that these were
+   disabled, perhaps via local_bh_disable().  It is of course possible
+   that there were no softirqs, perhaps because all events that would
+   result in softirq execution are confined to other CPUs.  In this case,
+   the diagnosis should continue as shown in the next example.
+
+3. A CPU looping with preemption disabled.
+
+   Here, only the number of context switches is zero.::
+
+   rcu:          hardirqs   softirqs   csw/system
+   rcu:  number:      624         45            0
+   rcu: cputime:       69          1         2425   ==> 2500(ms)
+
+   This situation hints that the stalled CPU was looping with preemption
+   disabled.
+
+4. No looping, but massive hard and soft interrupts.::
+
+   rcu:          hardirqs   softirqs   csw/system
+   rcu:  number:       xx         xx            0
+   rcu: cputime:       xx         xx            0   ==> 2500(ms)
+
+   Here, the number and CPU time of hard interrupts are all non-zero,
+   but the number of context switches and the in-kernel CPU time consumed
+   are zero. The number and cputime of soft interrupts will usually be
+   non-zero, but could be zero, for example, if the CPU was spinning
+   within a single hard interrupt handler.
+
+   If this type of RCU CPU stall warning can be reproduced, you can
+   narrow it down by looking at /proc/interrupts or by writing code to
+   trace each interrupt, for example, by referring to show_interrupts().
-- 
2.25.1



* Re: [PATCH v6 2/2] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
  2022-11-09  9:37 ` [PATCH v6 2/2] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information Zhen Lei
@ 2022-11-09 15:08   ` Frederic Weisbecker
  2022-11-10  2:54     ` Leizhen (ThunderTown)
  0 siblings, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2022-11-09 15:08 UTC (permalink / raw)
  To: Zhen Lei
  Cc: Paul E . McKenney, Neeraj Upadhyay, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	rcu, linux-kernel, Robert Elliott

On Wed, Nov 09, 2022 at 05:37:38PM +0800, Zhen Lei wrote:
> This commit documents how to quickly determine the bug causing a given
> RCU CPU stall fault warning based on the output information provided
> by CONFIG_RCU_CPU_STALL_CPUTIME=y.
> 
> [ paulmck: Apply wordsmithing. ]
> 
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  Documentation/RCU/stallwarn.rst | 88 +++++++++++++++++++++++++++++++++
>  1 file changed, 88 insertions(+)
> 
> diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst
> index dfa4db8c0931eaf..5e24e849290a286 100644
> --- a/Documentation/RCU/stallwarn.rst
> +++ b/Documentation/RCU/stallwarn.rst
> @@ -390,3 +390,91 @@ for example, "P3421".
>  
>  It is entirely possible to see stall warnings from normal and from
>  expedited grace periods at about the same time during the same run.
> +
> +RCU_CPU_STALL_CPUTIME
> +=====================
> +
> +In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
> +rcupdate.rcu_cpu_stall_cputime=1, the following additional information
> +is supplied with each RCU CPU stall warning::
> +
> +rcu:          hardirqs   softirqs   csw/system
> +rcu:  number:      624         45            0
> +rcu: cputime:       69          1         2425   ==> 2500(ms)
> +
> +These statistics are collected during the sampling period. The values
> +in row "number:" are the number of hard interrupts, number of soft
> +interrupts, and number of context switches on the stalled CPU. The
> +first three values in row "cputime:" indicate the CPU time in
> +milliseconds consumed by hard interrupts, soft interrupts, and tasks
> +on the stalled CPU.

Is that since the boot or since the last snapshot?

> The last number is the measurement interval, again
> +in milliseconds.  Because user-mode tasks normally do not cause RCU CPU
> +stalls, these tasks are typically kernel tasks, which is why only the
> +system CPU time is considered.
> +
> +The sampling period is shown as follows:
> +|<------------first timeout---------->|<-----second timeout----->|
> +|<--half timeout-->|<--half timeout-->|                          |
> +|                  |<--first period-->|                          |
> +|                  |<-----------second sampling period---------->|
> +|                  |                  |                          |
> +|          sampling time point    1st-stall                  2nd-stall
> +
> +
> +The following describes four typical scenarios:
> +
> +1. A CPU looping with interrupts disabled.::
> +
> +   rcu:          hardirqs   softirqs   csw/system
> +   rcu:  number:        0          0            0
> +   rcu: cputime:        0          0            0   ==> 2500(ms)
> +
> +   Because interrupts have been disabled throughout the measurement
> +   interval, there are no interrupts and no context switches.
> +   Furthermore, because CPU time consumption was measured using interrupt
> +   handlers, the system CPU consumption is misleadingly measured as zero.
> +   This scenario will normally also have "(0 ticks this GP)" printed on
> +   this CPU's summary line.

Right, unless you're running with CONFIG_NO_HZ_FULL=y and the target CPU
is nohz_full=, in that case you should see a delta in stime because the
cputime is measured with the CPU clock.

Thanks.


* Re: [PATCH v6 1/2] rcu: Add RCU stall diagnosis information
  2022-11-09  9:37 ` [PATCH v6 1/2] " Zhen Lei
@ 2022-11-09 15:20   ` Frederic Weisbecker
  2022-11-10  6:55     ` Leizhen (ThunderTown)
  2022-11-09 16:55   ` Elliott, Robert (Servers)
  1 sibling, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2022-11-09 15:20 UTC (permalink / raw)
  To: Zhen Lei
  Cc: Paul E . McKenney, Neeraj Upadhyay, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	rcu, linux-kernel, Robert Elliott

On Wed, Nov 09, 2022 at 05:37:37PM +0800, Zhen Lei wrote:
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index ed93ddb8203d42c..e1ff23b2a14d71d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -866,6 +866,23 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
>  			rdp->rcu_iw_gp_seq = rnp->gp_seq;
>  			irq_work_queue_on(&rdp->rcu_iw, rdp->cpu);
>  		}
> +
> +		if (rcu_cpu_stall_cputime && rdp->snap_record.gp_seq != rdp->gp_seq) {
> +			u64 *cpustat;
> +			struct rcu_snap_record *rsrp;
> +
> +			cpustat = kcpustat_cpu(rdp->cpu).cpustat;
> +
> +			rsrp = &rdp->snap_record;
> +			rsrp->cputime_irq     = cpustat[CPUTIME_IRQ];
> +			rsrp->cputime_softirq = cpustat[CPUTIME_SOFTIRQ];
> +			rsrp->cputime_system  = cpustat[CPUTIME_SYSTEM];

You need to use kcpustat_field(), otherwise you'll get stalled values on nohz_full CPUs.

> +			rsrp->nr_hardirqs = kstat_cpu_irqs_sum(rdp->cpu);
> +			rsrp->nr_softirqs = kstat_cpu_softirqs_sum(rdp->cpu);
> +			rsrp->nr_csw = nr_context_switches_cpu(rdp->cpu);
> +			rsrp->jiffies = jiffies;
> +			rsrp->gp_seq = rdp->gp_seq;
> +		}
>  	}
>  
>  	return 0;
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index 5653560573e22d6..7b6afb9c7b96dbe 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -428,6 +428,33 @@ static bool rcu_is_rcuc_kthread_starving(struct rcu_data *rdp, unsigned long *jp
>  	return j > 2 * HZ;
>  }
>  
> +static void print_cpu_stat_info(int cpu)
> +{
> +	u64 *cpustat;
> +	struct rcu_snap_record *rsrp;
> +	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> +
> +	if (!rcu_cpu_stall_cputime)
> +		return;
> +
> +	rsrp = &rdp->snap_record;
> +	if (rsrp->gp_seq != rdp->gp_seq)
> +		return;
> +
> +	cpustat = kcpustat_cpu(cpu).cpustat;
> +
> +	pr_err("         hardirqs   softirqs   csw/system\n");
> +	pr_err(" number: %8ld %10d %12lld\n",
> +		kstat_cpu_irqs_sum(cpu) - rsrp->nr_hardirqs,
> +		kstat_cpu_softirqs_sum(cpu) - rsrp->nr_softirqs,
> +		nr_context_switches_cpu(cpu) - rsrp->nr_csw);
> +	pr_err("cputime: %8lld %10lld %12lld   ==> %lld(ms)\n",
> +		div_u64(cpustat[CPUTIME_IRQ] - rsrp->cputime_irq, NSEC_PER_MSEC),
> +		div_u64(cpustat[CPUTIME_SOFTIRQ] - rsrp->cputime_softirq, NSEC_PER_MSEC),
> +		div_u64(cpustat[CPUTIME_SYSTEM] - rsrp->cputime_system, NSEC_PER_MSEC),

Same here.

Thanks.
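
The stale-value concern raised above can be illustrated with a simplified
userspace model. This is not the kernel's kcpustat_field() implementation,
and every name below is hypothetical. The model assumes that on a tickless
CPU, consumed time is accumulated separately and folded into the stored
counter only at certain transitions, so a raw read of the stored counter
can lag behind while an accessor that adds the pending time does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one per-CPU cputime counter. */
struct cputime_model {
	uint64_t stored_ns;	/* value visible in the per-CPU array */
	uint64_t pending_ns;	/* time accrued since the last fold, not yet stored */
};

/* The CPU runs without a tick: time accrues but is not stored yet. */
static void run_tickless(struct cputime_model *c, uint64_t ns)
{
	assert(c != NULL);
	c->pending_ns += ns;
}

/* A transition (for example a context switch) folds pending time in. */
static void fold_pending(struct cputime_model *c)
{
	assert(c != NULL);
	c->stored_ns += c->pending_ns;
	c->pending_ns = 0;
}

/* Raw array read: misses whatever is still pending. */
static uint64_t read_raw(const struct cputime_model *c)
{
	assert(c != NULL);
	return c->stored_ns;
}

/* Accessor-style read: stored value plus the time still pending. */
static uint64_t read_field(const struct cputime_model *c)
{
	assert(c != NULL);
	return c->stored_ns + c->pending_ns;
}
```

In this model, a CPU that loops for 2.5 seconds without any fold shows a
raw delta of zero, while the accessor-style read reports the full 2.5
seconds.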


* Re: [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
  2022-11-09  9:37 [PATCH v6 0/2] rcu: Add RCU stall diagnosis information Zhen Lei
  2022-11-09  9:37 ` [PATCH v6 1/2] " Zhen Lei
  2022-11-09  9:37 ` [PATCH v6 2/2] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information Zhen Lei
@ 2022-11-09 15:26 ` Frederic Weisbecker
  2022-11-09 15:59   ` Paul E. McKenney
  2 siblings, 1 reply; 19+ messages in thread
From: Frederic Weisbecker @ 2022-11-09 15:26 UTC (permalink / raw)
  To: Zhen Lei
  Cc: Paul E . McKenney, Neeraj Upadhyay, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	rcu, linux-kernel, Robert Elliott

Hi Zhen Lei,

On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
> v5 --> v6:
> 1. When there are more than two continuous RCU stallings, correctly handle the
>    value of the second and subsequent sampling periods. Update comments and
>    document.
>    Thanks to Elliott, Robert for the test.
> 2. Change "rcu stall" to "RCU stall".
> 
> v4 --> v5:
> 1. Resolve a git am conflict. No code change.
> 
> v3 --> v4:
> 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
> 
> v2 --> v3:
> 1. Fix the return type of kstat_cpu_irqs_sum()
> 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
>    rcupdate.rcu_cpu_stall_deep_debug.
> 3. Add comments and normalize local variable name
> 
> 
> v1 --> v2:
> 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
>    kcpustat_this_cpu cannot be used.
> @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
>         if (r->gp_seq != rdp->gp_seq)
>                 return;
> 
> -       cpustat = kcpustat_this_cpu->cpustat;
> +       cpustat = kcpustat_cpu(cpu).cpustat;
> 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
>    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
> 
> v1:
> In some extreme cases, such as the I/O pressure test, the CPU usage may
> be 100%, causing RCU stall. In this case, the printed information about
> current is not useful. Displays the number and usage of hard interrupts,
> soft interrupts, and context switches that are generated within half of
> the CPU stall timeout, can help us make a general judgment. In other
> cases, we can preliminarily determine whether an infinite loop occurs
> when local_irq, local_bh or preempt is disabled.

That looks useful but I have to ask: what does it bring that the softlockup
and hardlockup watchdog can not already solve?

Thanks.

> 
> Zhen Lei (2):
>   rcu: Add RCU stall diagnosis information
>   doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
> 
>  Documentation/RCU/stallwarn.rst               | 88 +++++++++++++++++++
>  .../admin-guide/kernel-parameters.txt         |  6 ++
>  kernel/rcu/Kconfig.debug                      | 11 +++
>  kernel/rcu/rcu.h                              |  1 +
>  kernel/rcu/tree.c                             | 17 ++++
>  kernel/rcu/tree.h                             | 19 ++++
>  kernel/rcu/tree_stall.h                       | 29 ++++++
>  kernel/rcu/update.c                           |  2 +
>  8 files changed, 173 insertions(+)
> 
> -- 
> 2.25.1
> 


* Re: [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
  2022-11-09 15:26 ` [PATCH v6 0/2] rcu: Add RCU stall diagnosis information Frederic Weisbecker
@ 2022-11-09 15:59   ` Paul E. McKenney
  2022-11-09 17:03     ` Frederic Weisbecker
  0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2022-11-09 15:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Zhen Lei, Neeraj Upadhyay, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu,
	linux-kernel, Robert Elliott

On Wed, Nov 09, 2022 at 04:26:21PM +0100, Frederic Weisbecker wrote:
> Hi Zhen Lei,
> 
> On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
> > v5 --> v6:
> > 1. When there are more than two continuous RCU stallings, correctly handle the
> >    value of the second and subsequent sampling periods. Update comments and
> >    document.
> >    Thanks to Elliott, Robert for the test.
> > 2. Change "rcu stall" to "RCU stall".
> > 
> > v4 --> v5:
> > 1. Resolve a git am conflict. No code change.
> > 
> > v3 --> v4:
> > 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
> > 
> > v2 --> v3:
> > 1. Fix the return type of kstat_cpu_irqs_sum()
> > 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
> >    rcupdate.rcu_cpu_stall_deep_debug.
> > 3. Add comments and normalize local variable name
> > 
> > 
> > v1 --> v2:
> > 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
> >    kcpustat_this_cpu cannot be used.
> > @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
> >         if (r->gp_seq != rdp->gp_seq)
> >                 return;
> > 
> > -       cpustat = kcpustat_this_cpu->cpustat;
> > +       cpustat = kcpustat_cpu(cpu).cpustat;
> > 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
> >    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
> > 
> > v1:
> > In some extreme cases, such as the I/O pressure test, the CPU usage may
> > be 100%, causing RCU stall. In this case, the printed information about
> > current is not useful. Displays the number and usage of hard interrupts,
> > soft interrupts, and context switches that are generated within half of
> > the CPU stall timeout, can help us make a general judgment. In other
> > cases, we can preliminarily determine whether an infinite loop occurs
> > when local_irq, local_bh or preempt is disabled.
> 
> That looks useful but I have to ask: what does it bring that the softlockup
> and hardlockup watchdog can not already solve?

This is a good point.  One possible benefit is putting the needed information
in one spot, for example, in cases where the soft/hard lockup timeouts are
significantly different than the RCU CPU stall warning timeout.

Thoughts?

							Thanx, Paul



* RE: [PATCH v6 1/2] rcu: Add RCU stall diagnosis information
  2022-11-09  9:37 ` [PATCH v6 1/2] " Zhen Lei
  2022-11-09 15:20   ` Frederic Weisbecker
@ 2022-11-09 16:55   ` Elliott, Robert (Servers)
  2022-11-09 17:03     ` Elliott, Robert (Servers)
  2022-11-10  8:27     ` Leizhen (ThunderTown)
  1 sibling, 2 replies; 19+ messages in thread
From: Elliott, Robert (Servers) @ 2022-11-09 16:55 UTC (permalink / raw)
  To: Zhen Lei, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu,
	linux-kernel



> b/Documentation/admin-guide/kernel-parameters.txt
> index a465d5242774af8..2729f3ad11d108b 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5082,6 +5082,12 @@
>  			rcupdate.rcu_cpu_stall_timeout to be used (after
>  			conversion from seconds to milliseconds).
> 
> +	rcupdate.rcu_cpu_stall_cputime= [KNL]
> +			Provide statistics on the cputime and count of
> +			interrupts and tasks during the sampling period. For
> +			multiple continuous RCU stalls, all sampling periods
> +			begin at half of the first RCU stall timeout.

This description should start with:
    "In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, "

Also, that parameter name seems like it contains a time value, but
it's really just treated as zero vs. anything else. Consider renaming
it to rcu_cpu_stall_cputime_en or describing the values in the
description ("0 disables, all other values enable").

> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> +struct rcu_snap_record {
> +	unsigned long	gp_seq;		/* Track rdp->gp_seq counter */
> +	u64		cputime_irq;	/* Accumulated cputime of hard irqs */
> +	u64		cputime_softirq;/* Accumulated cputime of soft irqs */
> +	u64		cputime_system; /* Accumulated cputime of kernel tasks */
> +	unsigned long	nr_hardirqs;	/* Accumulated number of hard irqs */
> +	unsigned int	nr_softirqs;	/* Accumulated number of soft irqs */

That should be "unsigned long" to match the other patch


> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> +static void print_cpu_stat_info(int cpu)
> +{
...
> +	pr_err("         hardirqs   softirqs   csw/system\n");
> +	pr_err(" number: %8ld %10d %12lld\n",
> +		kstat_cpu_irqs_sum(cpu) - rsrp->nr_hardirqs,
> +		kstat_cpu_softirqs_sum(cpu) - rsrp->nr_softirqs,
> +		nr_context_switches_cpu(cpu) - rsrp->nr_csw);
> +	pr_err("cputime: %8lld %10lld %12lld   ==> %lld(ms)\n",

Those should all start with "\t" to match other related prints.




* Re: [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
  2022-11-09 15:59   ` Paul E. McKenney
@ 2022-11-09 17:03     ` Frederic Weisbecker
  2022-11-09 17:22       ` Paul E. McKenney
  2022-11-10  7:29       ` Leizhen (ThunderTown)
  0 siblings, 2 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2022-11-09 17:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Zhen Lei, Neeraj Upadhyay, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu,
	linux-kernel, Robert Elliott

On Wed, Nov 09, 2022 at 07:59:01AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 09, 2022 at 04:26:21PM +0100, Frederic Weisbecker wrote:
> > Hi Zhen Lei,
> > 
> > On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
> > > v5 --> v6:
> > > 1. When there are more than two continuous RCU stallings, correctly handle the
> > >    value of the second and subsequent sampling periods. Update comments and
> > >    document.
> > >    Thanks to Elliott, Robert for the test.
> > > 2. Change "rcu stall" to "RCU stall".
> > > 
> > > v4 --> v5:
> > > 1. Resolve a git am conflict. No code change.
> > > 
> > > v3 --> v4:
> > > 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
> > > 
> > > v2 --> v3:
> > > 1. Fix the return type of kstat_cpu_irqs_sum()
> > > 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
> > >    rcupdate.rcu_cpu_stall_deep_debug.
> > > 3. Add comments and normalize local variable name
> > > 
> > > 
> > > v1 --> v2:
> > > 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
> > >    kcpustat_this_cpu cannot be used.
> > > @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
> > >         if (r->gp_seq != rdp->gp_seq)
> > >                 return;
> > > 
> > > -       cpustat = kcpustat_this_cpu->cpustat;
> > > +       cpustat = kcpustat_cpu(cpu).cpustat;
> > > 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
> > >    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
> > > 
> > > v1:
> > > In some extreme cases, such as the I/O pressure test, the CPU usage may
> > > be 100%, causing RCU stall. In this case, the printed information about
> > > current is not useful. Displays the number and usage of hard interrupts,
> > > soft interrupts, and context switches that are generated within half of
> > > the CPU stall timeout, can help us make a general judgment. In other
> > > cases, we can preliminarily determine whether an infinite loop occurs
> > > when local_irq, local_bh or preempt is disabled.
> > 
> > That looks useful but I have to ask: what does it bring that the softlockup
> > and hardlockup watchdog can not already solve?
> 
> This is a good point.  One possible benefit is putting the needed information
> in one spot, for example, in cases where the soft/hard lockup timeouts are
> significantly different than the RCU CPU stall warning timeout.

Arguably, the hardlockup/softlockup detectors usually trigger after the RCU
stall warning, unless all CPUs are caught in a hardlockup, in which case only
the hardlockup detector has a chance.

Anyway, I would say that in this case you can just lower the delay after which
the lockup detectors consider the situation a lockup?

Thanks.




* RE: [PATCH v6 1/2] rcu: Add RCU stall diagnosis information
  2022-11-09 16:55   ` Elliott, Robert (Servers)
@ 2022-11-09 17:03     ` Elliott, Robert (Servers)
  2022-11-10  8:27     ` Leizhen (ThunderTown)
  1 sibling, 0 replies; 19+ messages in thread
From: Elliott, Robert (Servers) @ 2022-11-09 17:03 UTC (permalink / raw)
  To: Zhen Lei, Paul E . McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu,
	linux-kernel



> -----Original Message-----
> From: Elliott, Robert (Servers)
> Sent: Wednesday, November 9, 2022 10:56 AM
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index a465d5242774af8..2729f3ad11d108b 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -5082,6 +5082,12 @@
> >  			rcupdate.rcu_cpu_stall_timeout to be used (after
> >  			conversion from seconds to milliseconds).
> >
> > +	rcupdate.rcu_cpu_stall_cputime= [KNL]
> > +			Provide statistics on the cputime and count of
> > +			interrupts and tasks during the sampling period. For
> > +			multiple continuous RCU stalls, all sampling periods
> > +			begin at half of the first RCU stall timeout.
> 
> This description should start with:
>     "In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, "

Please ignore that comment - the module parameter is always
present (only subject to CONFIG_RCU_STALL_COMMON like the others).
The config option is just selecting the default value for
that module parameter.




* Re: [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
  2022-11-09 17:03     ` Frederic Weisbecker
@ 2022-11-09 17:22       ` Paul E. McKenney
  2022-11-10  2:27         ` Leizhen (ThunderTown)
  2022-11-10  7:29       ` Leizhen (ThunderTown)
  1 sibling, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2022-11-09 17:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Zhen Lei, Neeraj Upadhyay, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu,
	linux-kernel, Robert Elliott

On Wed, Nov 09, 2022 at 06:03:17PM +0100, Frederic Weisbecker wrote:
> On Wed, Nov 09, 2022 at 07:59:01AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 09, 2022 at 04:26:21PM +0100, Frederic Weisbecker wrote:
> > > Hi Zhen Lei,
> > > 
> > > On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
> > > > v5 --> v6:
> > > > 1. When there are more than two continuous RCU stallings, correctly handle the
> > > >    value of the second and subsequent sampling periods. Update comments and
> > > >    document.
> > > >    Thanks to Elliott, Robert for the test.
> > > > 2. Change "rcu stall" to "RCU stall".
> > > > 
> > > > v4 --> v5:
> > > > 1. Resolve a git am conflict. No code change.
> > > > 
> > > > v3 --> v4:
> > > > 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
> > > > 
> > > > v2 --> v3:
> > > > 1. Fix the return type of kstat_cpu_irqs_sum()
> > > > 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
> > > >    rcupdate.rcu_cpu_stall_deep_debug.
> > > > 3. Add comments and normalize local variable name
> > > > 
> > > > 
> > > > v1 --> v2:
> > > > 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
> > > >    kcpustat_this_cpu cannot be used.
> > > > @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
> > > >         if (r->gp_seq != rdp->gp_seq)
> > > >                 return;
> > > > 
> > > > -       cpustat = kcpustat_this_cpu->cpustat;
> > > > +       cpustat = kcpustat_cpu(cpu).cpustat;
> > > > 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
> > > >    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
> > > > 
> > > > v1:
> > > > In some extreme cases, such as the I/O pressure test, the CPU usage may
> > > > be 100%, causing RCU stall. In this case, the printed information about
> > > > current is not useful. Displays the number and usage of hard interrupts,
> > > > soft interrupts, and context switches that are generated within half of
> > > > the CPU stall timeout, can help us make a general judgment. In other
> > > > cases, we can preliminarily determine whether an infinite loop occurs
> > > > when local_irq, local_bh or preempt is disabled.
> > > 
> > > That looks useful but I have to ask: what does it bring that the softlockup
> > > and hardlockup watchdog can not already solve?
> > 
> > This is a good point.  One possible benefit is putting the needed information
> > in one spot, for example, in cases where the soft/hard lockup timeouts are
> > significantly different than the RCU CPU stall warning timeout.
> 
> Arguably, the hardlockup/softlockup detectors usually trigger after RCU stall,
> unless all CPUs are caught into a hardlockup, in which case only the hardlockup
> detector has a chance.
> 
> Anyway I would say that in this case just lower the delay for the lockup
> detectors to consider the situation is a lockup?

Try it both ways and see how it works?  The rcutorture module parameters
stall_cpu and stall_cpu_irqsoff are easy ways to generate these sorts
of scenarios.

Actually, that does remind me of something.  Back when I was chasing
that interrupt storm, would this patch have helped me?  In that case, the
half-way point would have been reached while all online CPUs were spinning
with interrupts disabled and the incoming CPU was getting hammered with
continual scheduling-clock interrupts.  So I suspect that the answer is
"no" because the incoming CPU was not blocking the grace period.

Instead of being snapshot halfway to the RCU CPU stall warning, should
the values be snapshot when the CPU notices the beginning or end of an
RCU grace period and when a CPU goes offline?

But that would not suffice, because detailed information would not have
been dumped for the incoming CPU.

However, the lack of context switches and interrupts on the rest of the
CPUs would likely have been a big cluebat, so there is that.  It might
be better to rework the warning at the beginning of rcu_sched_clock_irq()
to complain if more than (say) 10 scheduling-clock interrupts occur on
a given CPU during a single jiffy.

That would be independent of Zhen Lei's patch.
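
To make the "more than (say) 10 scheduling-clock interrupts during a single
jiffy" check concrete, here is a minimal userspace C sketch. It is only an
illustration and not code from rcu_sched_clock_irq(): struct tick_storm_state,
tick_storm_check() and TICK_STORM_LIMIT are hypothetical names, and the jiffies
value is passed in by the caller instead of being read from the kernel.

```c
#include <stdbool.h>

/* Assumed threshold, taken from the "(say) 10" in the message above. */
#define TICK_STORM_LIMIT 10

/* Hypothetical per-CPU state; zero-initialize before first use. */
struct tick_storm_state {
	unsigned long last_jiffies;	/* jiffy in which ticks are being counted */
	unsigned int ticks_this_jiffy;	/* ticks seen so far in that jiffy */
	bool warned;			/* already complained during this jiffy? */
};

/*
 * Call once per simulated scheduling-clock interrupt.
 * Returns true exactly once per jiffy, on the first tick that exceeds
 * TICK_STORM_LIMIT, so that a storm produces one complaint per jiffy
 * rather than one per interrupt.
 */
static bool tick_storm_check(struct tick_storm_state *s, unsigned long now_jiffies)
{
	if (now_jiffies != s->last_jiffies) {
		/* A new jiffy has started: restart the count. */
		s->last_jiffies = now_jiffies;
		s->ticks_this_jiffy = 0;
		s->warned = false;
	}

	s->ticks_this_jiffy++;

	if (s->ticks_this_jiffy > TICK_STORM_LIMIT && !s->warned) {
		s->warned = true;
		return true;
	}
	return false;
}
```

Counting per jiffy rather than per grace period keeps the check independent
of whether the hammered CPU is blocking the grace period, which is the case
described above where the half-way snapshot would not have helped.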

Thoughts?

							Thanx, Paul



* Re: [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
  2022-11-09 17:22       ` Paul E. McKenney
@ 2022-11-10  2:27         ` Leizhen (ThunderTown)
  2022-11-12 18:59           ` Paul E. McKenney
  0 siblings, 1 reply; 19+ messages in thread
From: Leizhen (ThunderTown) @ 2022-11-10  2:27 UTC (permalink / raw)
  To: paulmck, Frederic Weisbecker
  Cc: Neeraj Upadhyay, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu,
	linux-kernel, Robert Elliott



On 2022/11/10 1:22, Paul E. McKenney wrote:
> On Wed, Nov 09, 2022 at 06:03:17PM +0100, Frederic Weisbecker wrote:
>> On Wed, Nov 09, 2022 at 07:59:01AM -0800, Paul E. McKenney wrote:
>>> On Wed, Nov 09, 2022 at 04:26:21PM +0100, Frederic Weisbecker wrote:
>>>> Hi Zhen Lei,
>>>>
>>>> On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
>>>>> v5 --> v6:
>>>>> 1. When there are more than two continuous RCU stallings, correctly handle the
>>>>>    value of the second and subsequent sampling periods. Update comments and
>>>>>    document.
>>>>>    Thanks to Elliott, Robert for the test.
>>>>> 2. Change "rcu stall" to "RCU stall".
>>>>>
>>>>> v4 --> v5:
>>>>> 1. Resolve a git am conflict. No code change.
>>>>>
>>>>> v3 --> v4:
>>>>> 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
>>>>>
>>>>> v2 --> v3:
>>>>> 1. Fix the return type of kstat_cpu_irqs_sum()
>>>>> 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
>>>>>    rcupdate.rcu_cpu_stall_deep_debug.
>>>>> 3. Add comments and normalize local variable name
>>>>>
>>>>>
>>>>> v1 --> v2:
>>>>> 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
>>>>>    kcpustat_this_cpu cannot be used.
>>>>> @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
>>>>>         if (r->gp_seq != rdp->gp_seq)
>>>>>                 return;
>>>>>
>>>>> -       cpustat = kcpustat_this_cpu->cpustat;
>>>>> +       cpustat = kcpustat_cpu(cpu).cpustat;
>>>>> 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
>>>>>    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
>>>>>
>>>>> v1:
>>>>> In some extreme cases, such as the I/O pressure test, the CPU usage may
>>>>> be 100%, causing RCU stall. In this case, the printed information about
>>>>> current is not useful. Displays the number and usage of hard interrupts,
>>>>> soft interrupts, and context switches that are generated within half of
>>>>> the CPU stall timeout, can help us make a general judgment. In other
>>>>> cases, we can preliminarily determine whether an infinite loop occurs
>>>>> when local_irq, local_bh or preempt is disabled.
>>>>
>>>> That looks useful but I have to ask: what does it bring that the softlockup
>>>> and hardlockup watchdog can not already solve?
>>>
>>> This is a good point.  One possible benefit is putting the needed information
>>> in one spot, for example, in cases where the soft/hard lockup timeouts are
>>> significantly different than the RCU CPU stall warning timeout.
>>
>> Arguably, the hardlockup/softlockup detectors usually trigger after RCU stall,
>> unless all CPUs are caught into a hardlockup, in which case only the hardlockup
>> detector has a chance.
>>
>> Anyway I would say that in this case just lower the delay for the lockup
>> detectors to consider the situation is a lockup?
> 
> Try it both ways and see how it works?  The rcutorture module parameters
> stall_cpu and stall_cpu_irqsoff are easy ways to generate these sorts
> of scenarios.
> 
> Actually, that does remind me of something.  Back when I was chasing
> that interrupt storm, would this patch have helped me?  In that case, the

Yes, this patch series originally addressed an RCU stall issue caused by an
interrupt storm. The serial port driver written by another project team
failed to write the register under a specific condition. As a result,
interrupts were raised repeatedly.

> half-way point would have been reached while all online CPUs were spinning
> with interrupts disabled and the incoming CPU was getting hammered with
> continual scheduling-clock interrupts.  So I suspect that the answer is
> "no" because the incoming CPU was not blocking the grace period.
> 
> Instead of being snapshot halfway to the RCU CPU stall warning, should
> the values be snapshot when the CPU notices the beginning or end of an
> RCU grace period and when a CPU goes offline?

This won't work. The normal counts accumulated before the failure would
distort our analysis. For example, some software interrupts may have been
generated before local_bh_disable() was called.

> 
> But that would not suffice, because detailed information would not have
> been dumped for the incoming CPU.
> 
> However, the lack of context switches and interrupts on the rest of the
> CPUs would likely have been a big cluebat, so there is that.  It might
> be better to rework the warning at the beginning of rcu_sched_clock_irq()
> to complain if more than (say) 10 scheduling-clock interrupts occur on
> a given CPU during a single jiffy.
> 
> Independent of Zhen Lei patch.
> 
> Thoughts?
> 
> 							Thanx, Paul
> 

-- 
Regards,
  Zhen Lei


* Re: [PATCH v6 2/2] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
  2022-11-09 15:08   ` Frederic Weisbecker
@ 2022-11-10  2:54     ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 19+ messages in thread
From: Leizhen (ThunderTown) @ 2022-11-10  2:54 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E . McKenney, Neeraj Upadhyay, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	rcu, linux-kernel, Robert Elliott



On 2022/11/9 23:08, Frederic Weisbecker wrote:
> On Wed, Nov 09, 2022 at 05:37:38PM +0800, Zhen Lei wrote:
>> This commit documents how to quickly determine the bug causing a given
>> RCU CPU stall fault warning based on the output information provided
>> by CONFIG_RCU_CPU_STALL_CPUTIME=y.
>>
>> [ paulmck: Apply wordsmithing. ]
>>
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>> ---
>>  Documentation/RCU/stallwarn.rst | 88 +++++++++++++++++++++++++++++++++
>>  1 file changed, 88 insertions(+)
>>
>> diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst
>> index dfa4db8c0931eaf..5e24e849290a286 100644
>> --- a/Documentation/RCU/stallwarn.rst
>> +++ b/Documentation/RCU/stallwarn.rst
>> @@ -390,3 +390,91 @@ for example, "P3421".
>>  
>>  It is entirely possible to see stall warnings from normal and from
>>  expedited grace periods at about the same time during the same run.
>> +
>> +RCU_CPU_STALL_CPUTIME
>> +=====================
>> +
>> +In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
>> +rcupdate.rcu_cpu_stall_cputime=1, the following additional information
>> +is supplied with each RCU CPU stall warning::
>> +
>> +rcu:          hardirqs   softirqs   csw/system
>> +rcu:  number:      624         45            0
>> +rcu: cputime:       69          1         2425   ==> 2500(ms)
>> +
>> +These statistics are collected during the sampling period. The values
>> +in row "number:" are the number of hard interrupts, number of soft
>> +interrupts, and number of context switches on the stalled CPU. The
>> +first three values in row "cputime:" indicate the CPU time in
>> +milliseconds consumed by hard interrupts, soft interrupts, and tasks
>> +on the stalled CPU.
> 
> Is that since the boot or since the last snapshot?

Since the last snapshot. See the diagram below:

+The sampling period is shown as follows:
+|<------------first timeout---------->|<-----second timeout----->|
+|<--half timeout-->|<--half timeout-->|                          |
+|                  |<--first period-->|                          |
+|                  |<-----------second sampling period---------->|
+|                  |                  |                          |
+|          sampling time point    1st-stall                  2nd-stall
                    |
                    |
                    Take the snapshot at this time
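
The snapshot-then-delta scheme in the diagram can be sketched in userspace C
as follows. This is only an illustration under stated assumptions, not the
patch code: struct cpu_counters, take_snapshot(), stall_delta() and
print_stall_info() are hypothetical names, and the counters are plain struct
fields instead of the kernel's per-CPU statistics.

```c
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for the per-CPU counters that are sampled. */
struct cpu_counters {
	uint64_t cputime_irq_ns;	/* accumulated hard irq time */
	uint64_t cputime_softirq_ns;	/* accumulated soft irq time */
	uint64_t cputime_system_ns;	/* accumulated kernel task time */
	unsigned long nr_hardirqs;
	unsigned long nr_softirqs;
	unsigned long long nr_csw;	/* context switches */
};

/* One snapshot per grace period; zero-initialize before first use. */
struct snap_record {
	unsigned long gp_seq;
	int valid;
	struct cpu_counters c;
};

/*
 * Record the counters at the sampling time point. A snapshot that already
 * belongs to this grace period is kept, so a second stall in the same grace
 * period is still measured from the first sampling point.
 * Returns 1 if a new snapshot was taken, 0 if the old one was kept.
 */
static int take_snapshot(struct snap_record *snap, unsigned long gp_seq,
			 const struct cpu_counters *now)
{
	if (snap->valid && snap->gp_seq == gp_seq)
		return 0;
	snap->gp_seq = gp_seq;
	snap->valid = 1;
	snap->c = *now;
	return 1;
}

/*
 * Compute "now minus snapshot". Returns -1 if no snapshot exists for this
 * grace period, in which case nothing should be printed.
 */
static int stall_delta(const struct snap_record *snap, unsigned long gp_seq,
		       const struct cpu_counters *now, struct cpu_counters *d)
{
	if (!snap->valid || snap->gp_seq != gp_seq)
		return -1;
	d->cputime_irq_ns = now->cputime_irq_ns - snap->c.cputime_irq_ns;
	d->cputime_softirq_ns = now->cputime_softirq_ns - snap->c.cputime_softirq_ns;
	d->cputime_system_ns = now->cputime_system_ns - snap->c.cputime_system_ns;
	d->nr_hardirqs = now->nr_hardirqs - snap->c.nr_hardirqs;
	d->nr_softirqs = now->nr_softirqs - snap->c.nr_softirqs;
	d->nr_csw = now->nr_csw - snap->c.nr_csw;
	return 0;
}

/* Print the deltas in a layout modelled on the documented stall output. */
static void print_stall_info(const struct cpu_counters *d, uint64_t interval_ms)
{
	printf("         hardirqs   softirqs   csw/system\n");
	printf(" number: %8lu %10lu %12llu\n",
	       d->nr_hardirqs, d->nr_softirqs, d->nr_csw);
	printf("cputime: %8llu %10llu %12llu   ==> %llu(ms)\n",
	       (unsigned long long)(d->cputime_irq_ns / NSEC_PER_MSEC),
	       (unsigned long long)(d->cputime_softirq_ns / NSEC_PER_MSEC),
	       (unsigned long long)(d->cputime_system_ns / NSEC_PER_MSEC),
	       (unsigned long long)interval_ms);
}
```

Because the snapshot is keyed by the grace-period sequence number, the
numbers printed at the second stall cover the whole "second sampling period"
in the diagram, not only the time since the first stall.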

> 
>> The last number is the measurement interval, again
>> +in milliseconds.  Because user-mode tasks normally do not cause RCU CPU
>> +stalls, these tasks are typically kernel tasks, which is why only the
>> +system CPU time is considered.
>> +
>> +The sampling period is shown as follows:
>> +|<------------first timeout---------->|<-----second timeout----->|
>> +|<--half timeout-->|<--half timeout-->|                          |
>> +|                  |<--first period-->|                          |
>> +|                  |<-----------second sampling period---------->|
>> +|                  |                  |                          |
>> +|          sampling time point    1st-stall                  2nd-stall
>> +
>> +
>> +The following describes four typical scenarios:
>> +
>> +1. A CPU looping with interrupts disabled.::
>> +
>> +   rcu:          hardirqs   softirqs   csw/system
>> +   rcu:  number:        0          0            0
>> +   rcu: cputime:        0          0            0   ==> 2500(ms)
>> +
>> +   Because interrupts have been disabled throughout the measurement
>> +   interval, there are no interrupts and no context switches.
>> +   Furthermore, because CPU time consumption was measured using interrupt
>> +   handlers, the system CPU consumption is misleadingly measured as zero.
>> +   This scenario will normally also have "(0 ticks this GP)" printed on
>> +   this CPU's summary line.
> 
> Right, unless you're running with CONFIG_NO_HZ_FULL=y and the target CPU
> is nohz_full=, in that case you should see a delta in stime because the
> cputime is measured with the CPU clock.
> 
> Thanks.
> .
> 

-- 
Regards,
  Zhen Lei


* Re: [PATCH v6 1/2] rcu: Add RCU stall diagnosis information
  2022-11-09 15:20   ` Frederic Weisbecker
@ 2022-11-10  6:55     ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 19+ messages in thread
From: Leizhen (ThunderTown) @ 2022-11-10  6:55 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E . McKenney, Neeraj Upadhyay, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	rcu, linux-kernel, Robert Elliott



On 2022/11/9 23:20, Frederic Weisbecker wrote:
> On Wed, Nov 09, 2022 at 05:37:37PM +0800, Zhen Lei wrote:
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index ed93ddb8203d42c..e1ff23b2a14d71d 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -866,6 +866,23 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
>>  			rdp->rcu_iw_gp_seq = rnp->gp_seq;
>>  			irq_work_queue_on(&rdp->rcu_iw, rdp->cpu);
>>  		}
>> +
>> +		if (rcu_cpu_stall_cputime && rdp->snap_record.gp_seq != rdp->gp_seq) {
>> +			u64 *cpustat;
>> +			struct rcu_snap_record *rsrp;
>> +
>> +			cpustat = kcpustat_cpu(rdp->cpu).cpustat;
>> +
>> +			rsrp = &rdp->snap_record;
>> +			rsrp->cputime_irq     = cpustat[CPUTIME_IRQ];
>> +			rsrp->cputime_softirq = cpustat[CPUTIME_SOFTIRQ];
>> +			rsrp->cputime_system  = cpustat[CPUTIME_SYSTEM];
> 
> You need to use kcpustat_field(), otherwise you'll get stale values on nohz_full CPUs.

OK, I'll update it. Thanks.

> 
>> +			rsrp->nr_hardirqs = kstat_cpu_irqs_sum(rdp->cpu);
>> +			rsrp->nr_softirqs = kstat_cpu_softirqs_sum(rdp->cpu);
>> +			rsrp->nr_csw = nr_context_switches_cpu(rdp->cpu);
>> +			rsrp->jiffies = jiffies;
>> +			rsrp->gp_seq = rdp->gp_seq;
>> +		}
>>  	}
>>  
>>  	return 0;
>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
>> index 5653560573e22d6..7b6afb9c7b96dbe 100644
>> --- a/kernel/rcu/tree_stall.h
>> +++ b/kernel/rcu/tree_stall.h
>> @@ -428,6 +428,33 @@ static bool rcu_is_rcuc_kthread_starving(struct rcu_data *rdp, unsigned long *jp
>>  	return j > 2 * HZ;
>>  }
>>  
>> +static void print_cpu_stat_info(int cpu)
>> +{
>> +	u64 *cpustat;
>> +	struct rcu_snap_record *rsrp;
>> +	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
>> +
>> +	if (!rcu_cpu_stall_cputime)
>> +		return;
>> +
>> +	rsrp = &rdp->snap_record;
>> +	if (rsrp->gp_seq != rdp->gp_seq)
>> +		return;
>> +
>> +	cpustat = kcpustat_cpu(cpu).cpustat;
>> +
>> +	pr_err("         hardirqs   softirqs   csw/system\n");
>> +	pr_err(" number: %8ld %10d %12lld\n",
>> +		kstat_cpu_irqs_sum(cpu) - rsrp->nr_hardirqs,
>> +		kstat_cpu_softirqs_sum(cpu) - rsrp->nr_softirqs,
>> +		nr_context_switches_cpu(cpu) - rsrp->nr_csw);
>> +	pr_err("cputime: %8lld %10lld %12lld   ==> %lld(ms)\n",
>> +		div_u64(cpustat[CPUTIME_IRQ] - rsrp->cputime_irq, NSEC_PER_MSEC),
>> +		div_u64(cpustat[CPUTIME_SOFTIRQ] - rsrp->cputime_softirq, NSEC_PER_MSEC),
>> +		div_u64(cpustat[CPUTIME_SYSTEM] - rsrp->cputime_system, NSEC_PER_MSEC),
> 
> Same here.

OK

> 
> Thanks.
> .
> 

-- 
Regards,
  Zhen Lei


* Re: [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
  2022-11-09 17:03     ` Frederic Weisbecker
  2022-11-09 17:22       ` Paul E. McKenney
@ 2022-11-10  7:29       ` Leizhen (ThunderTown)
  2022-11-10 11:35         ` Frederic Weisbecker
  1 sibling, 1 reply; 19+ messages in thread
From: Leizhen (ThunderTown) @ 2022-11-10  7:29 UTC (permalink / raw)
  To: Frederic Weisbecker, Paul E. McKenney
  Cc: Neeraj Upadhyay, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu,
	linux-kernel, Robert Elliott



On 2022/11/10 1:03, Frederic Weisbecker wrote:
> On Wed, Nov 09, 2022 at 07:59:01AM -0800, Paul E. McKenney wrote:
>> On Wed, Nov 09, 2022 at 04:26:21PM +0100, Frederic Weisbecker wrote:
>>> Hi Zhen Lei,
>>>
>>> On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
>>>> v5 --> v6:
>>>> 1. When there are more than two continuous RCU stallings, correctly handle the
>>>>    value of the second and subsequent sampling periods. Update comments and
>>>>    document.
>>>>    Thanks to Elliott, Robert for the test.
>>>> 2. Change "rcu stall" to "RCU stall".
>>>>
>>>> v4 --> v5:
>>>> 1. Resolve a git am conflict. No code change.
>>>>
>>>> v3 --> v4:
>>>> 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
>>>>
>>>> v2 --> v3:
>>>> 1. Fix the return type of kstat_cpu_irqs_sum()
>>>> 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
>>>>    rcupdate.rcu_cpu_stall_deep_debug.
>>>> 3. Add comments and normalize local variable name
>>>>
>>>>
>>>> v1 --> v2:
>>>> 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
>>>>    kcpustat_this_cpu cannot be used.
>>>> @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
>>>>         if (r->gp_seq != rdp->gp_seq)
>>>>                 return;
>>>>
>>>> -       cpustat = kcpustat_this_cpu->cpustat;
>>>> +       cpustat = kcpustat_cpu(cpu).cpustat;
>>>> 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
>>>>    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
>>>>
>>>> v1:
>>>> In some extreme cases, such as the I/O pressure test, the CPU usage may
>>>> be 100%, causing RCU stall. In this case, the printed information about
>>>> current is not useful. Displays the number and usage of hard interrupts,
>>>> soft interrupts, and context switches that are generated within half of
>>>> the CPU stall timeout, can help us make a general judgment. In other
>>>> cases, we can preliminarily determine whether an infinite loop occurs
>>>> when local_irq, local_bh or preempt is disabled.
>>>
>>> That looks useful but I have to ask: what does it bring that the softlockup
>>> and hardlockup watchdog can not already solve?
>>
>> This is a good point.  One possible benefit is putting the needed information
>> in one spot, for example, in cases where the soft/hard lockup timeouts are
>> significantly different than the RCU CPU stall warning timeout.
> 
> Arguably, the hardlockup/softlockup detectors usually trigger after RCU stall,
> unless all CPUs are caught into a hardlockup, in which case only the hardlockup
> detector has a chance.

But not all architectures support the hardlockup detector: s390 does not, and
arm64 may not either.

config HARDLOCKUP_DETECTOR
        bool "Detect Hard Lockups"
        depends on DEBUG_KERNEL && !S390
        depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_ARCH

> 
> Anyway I would say that in this case just lower the delay for the lockup
> detectors to consider the situation is a lockup?

In most architectures, CONFIG_SOFTLOCKUP_DETECTOR is not set by default.
When it is set, the softlockup threshold (20 seconds) is already less than the
default RCU CPU stall timeout (21 seconds):

Softlockups are bugs that cause the kernel to loop in kernel
mode for more than 20 seconds, without giving other tasks a
chance to run.

config RCU_CPU_STALL_TIMEOUT
	default 21


In short, the hardlockup and softlockup detectors are entirely outside the
control of the RCU stall code.
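
The timing relationship above can be written down as a small userspace C
sketch. It only restates the numbers quoted in this message (20 seconds for
softlockup, 21 seconds for the RCU CPU stall timeout) and the half-timeout
sampling point; struct detector_cfg and both helper functions are
hypothetical names, not kernel interfaces.

```c
#include <stdbool.h>

/* Hypothetical description of one configuration, all times in milliseconds. */
struct detector_cfg {
	bool softlockup_enabled;	/* is the softlockup detector built in? */
	unsigned int softlockup_ms;	/* e.g. 20000, per the help text above */
	unsigned int rcu_stall_ms;	/* e.g. 21000, the default quoted above */
};

/* The statistics snapshot is taken at half of the first RCU stall timeout. */
static unsigned int rcu_snapshot_point_ms(const struct detector_cfg *c)
{
	return c->rcu_stall_ms / 2;
}

/*
 * True only when the softlockup detector exists in this configuration and
 * its threshold expires before the RCU CPU stall warning would be printed.
 */
static bool softlockup_reports_first(const struct detector_cfg *c)
{
	return c->softlockup_enabled && c->softlockup_ms < c->rcu_stall_ms;
}
```

With the values quoted above, the snapshot lands at 10.5 seconds and an
enabled softlockup detector reports one second before the RCU stall warning.
With the detector disabled, the RCU stall output is the only report.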


-- 
Regards,
  Zhen Lei


* Re: [PATCH v6 1/2] rcu: Add RCU stall diagnosis information
  2022-11-09 16:55   ` Elliott, Robert (Servers)
  2022-11-09 17:03     ` Elliott, Robert (Servers)
@ 2022-11-10  8:27     ` Leizhen (ThunderTown)
  1 sibling, 0 replies; 19+ messages in thread
From: Leizhen (ThunderTown) @ 2022-11-10  8:27 UTC (permalink / raw)
  To: Elliott, Robert (Servers),
	Paul E . McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	Joel Fernandes, rcu, linux-kernel



On 2022/11/10 0:55, Elliott, Robert (Servers) wrote:
> 
> 
>> b/Documentation/admin-guide/kernel-parameters.txt
>> index a465d5242774af8..2729f3ad11d108b 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -5082,6 +5082,12 @@
>>  			rcupdate.rcu_cpu_stall_timeout to be used (after
>>  			conversion from seconds to milliseconds).
>>
>> +	rcupdate.rcu_cpu_stall_cputime= [KNL]
>> +			Provide statistics on the cputime and count of
>> +			interrupts and tasks during the sampling period. For
>> +			multiple continuous RCU stalls, all sampling periods
>> +			begin at half of the first RCU stall timeout.
> 
> This description should start with:
>     "In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y, "
> 
> Also, that parameter name seems like it contains a time value, but
> it's really just treated as zero vs. anything else. Consider renaming
> it to rcu_cpu_stall_cputime_en or describing the values in the
> description ("0 disables, all other values enable").
> 
>> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
>> +struct rcu_snap_record {
>> +	unsigned long	gp_seq;		/* Track rdp->gp_seq counter */
>> +	u64		cputime_irq;	/* Accumulated cputime of hard irqs */
>> +	u64		cputime_softirq;/* Accumulated cputime of soft irqs */
>> +	u64		cputime_system; /* Accumulated cputime of kernel tasks */
>> +	unsigned long	nr_hardirqs;	/* Accumulated number of hard irqs */
>> +	unsigned int	nr_softirqs;	/* Accumulated number of soft irqs */
> 
> That should be "unsigned long" to match the other patch

We have discussed this before. And you mentioned:

struct kernel_stat {
        unsigned long irqs_sum;
        unsigned int softirqs[NR_SOFTIRQS];
};

The softirqs field is an unsigned int, so the new function doesn't have
this inconsistency.

> 
> 
>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
>> +static void print_cpu_stat_info(int cpu)
>> +{
> ...
>> +	pr_err("         hardirqs   softirqs   csw/system\n");
>> +	pr_err(" number: %8ld %10d %12lld\n",
>> +		kstat_cpu_irqs_sum(cpu) - rsrp->nr_hardirqs,
>> +		kstat_cpu_softirqs_sum(cpu) - rsrp->nr_softirqs,
>> +		nr_context_switches_cpu(cpu) - rsrp->nr_csw);
>> +	pr_err("cputime: %8lld %10lld %12lld   ==> %lld(ms)\n",
> 
> Those should all start with "\t" to match other related prints.

Right, thanks.

> 
> 
> .
> 

-- 
Regards,
  Zhen Lei


* Re: [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
  2022-11-10  7:29       ` Leizhen (ThunderTown)
@ 2022-11-10 11:35         ` Frederic Weisbecker
  0 siblings, 0 replies; 19+ messages in thread
From: Frederic Weisbecker @ 2022-11-10 11:35 UTC (permalink / raw)
  To: Leizhen (ThunderTown)
  Cc: Paul E. McKenney, Neeraj Upadhyay, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu,
	linux-kernel, Robert Elliott

On Thu, Nov 10, 2022 at 03:29:04PM +0800, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/11/10 1:03, Frederic Weisbecker wrote:
> > On Wed, Nov 09, 2022 at 07:59:01AM -0800, Paul E. McKenney wrote:
> >> On Wed, Nov 09, 2022 at 04:26:21PM +0100, Frederic Weisbecker wrote:
> >>> Hi Zhen Lei,
> >>>
> >>> On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
> >>>> v5 --> v6:
> >>>> 1. When there are more than two continuous RCU stallings, correctly handle the
> >>>>    value of the second and subsequent sampling periods. Update comments and
> >>>>    document.
> >>>>    Thanks to Elliott, Robert for the test.
> >>>> 2. Change "rcu stall" to "RCU stall".
> >>>>
> >>>> v4 --> v5:
> >>>> 1. Resolve a git am conflict. No code change.
> >>>>
> >>>> v3 --> v4:
> >>>> 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
> >>>>
> >>>> v2 --> v3:
> >>>> 1. Fix the return type of kstat_cpu_irqs_sum()
> >>>> 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
> >>>>    rcupdate.rcu_cpu_stall_deep_debug.
> >>>> 3. Add comments and normalize local variable name
> >>>>
> >>>>
> >>>> v1 --> v2:
> >>>> 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
> >>>>    kcpustat_this_cpu cannot be used.
> >>>> @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
> >>>>         if (r->gp_seq != rdp->gp_seq)
> >>>>                 return;
> >>>>
> >>>> -       cpustat = kcpustat_this_cpu->cpustat;
> >>>> +       cpustat = kcpustat_cpu(cpu).cpustat;
> >>>> 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
> >>>>    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
> >>>>
> >>>> v1:
> >>>> In some extreme cases, such as the I/O pressure test, the CPU usage may
> >>>> be 100%, causing RCU stall. In this case, the printed information about
> >>>> current is not useful. Displays the number and usage of hard interrupts,
> >>>> soft interrupts, and context switches that are generated within half of
> >>>> the CPU stall timeout, can help us make a general judgment. In other
> >>>> cases, we can preliminarily determine whether an infinite loop occurs
> >>>> when local_irq, local_bh or preempt is disabled.
> >>>
> >>> That looks useful but I have to ask: what does it bring that the softlockup
> >>> and hardlockup watchdog can not already solve?
> >>
> >> This is a good point.  One possible benefit is putting the needed information
> >> in one spot, for example, in cases where the soft/hard lockup timeouts are
> >> significantly different than the RCU CPU stall warning timeout.
> > 
> > Arguably, the hardlockup/softlockup detectors usually trigger after RCU stall,
> > unless all CPUs are caught into a hardlockup, in which case only the hardlockup
> > detector has a chance.
> 
> But not all ARCHs support hardlockup, such as s390. Maybe arm64.
> 
> config HARDLOCKUP_DETECTOR
>         bool "Detect Hard Lockups"
>         depends on DEBUG_KERNEL && !S390
>         depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_ARCH

Ah fair point indeed.

Thanks!


* Re: [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
  2022-11-10  2:27         ` Leizhen (ThunderTown)
@ 2022-11-12 18:59           ` Paul E. McKenney
  2022-11-15  9:11             ` Leizhen (ThunderTown)
  0 siblings, 1 reply; 19+ messages in thread
From: Paul E. McKenney @ 2022-11-12 18:59 UTC (permalink / raw)
  To: Leizhen (ThunderTown)
  Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	rcu, linux-kernel, Robert Elliott

On Thu, Nov 10, 2022 at 10:27:44AM +0800, Leizhen (ThunderTown) wrote:
> On 2022/11/10 1:22, Paul E. McKenney wrote:
> > On Wed, Nov 09, 2022 at 06:03:17PM +0100, Frederic Weisbecker wrote:
> >> On Wed, Nov 09, 2022 at 07:59:01AM -0800, Paul E. McKenney wrote:
> >>> On Wed, Nov 09, 2022 at 04:26:21PM +0100, Frederic Weisbecker wrote:
> >>>> Hi Zhen Lei,
> >>>>
> >>>> On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
> >>>>> v5 --> v6:
> >>>>> 1. When there are more than two continuous RCU stallings, correctly handle the
> >>>>>    value of the second and subsequent sampling periods. Update comments and
> >>>>>    document.
> >>>>>    Thanks to Elliott, Robert for the test.
> >>>>> 2. Change "rcu stall" to "RCU stall".
> >>>>>
> >>>>> v4 --> v5:
> >>>>> 1. Resolve a git am conflict. No code change.
> >>>>>
> >>>>> v3 --> v4:
> >>>>> 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
> >>>>>
> >>>>> v2 --> v3:
> >>>>> 1. Fix the return type of kstat_cpu_irqs_sum()
> >>>>> 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
> >>>>>    rcupdate.rcu_cpu_stall_deep_debug.
> >>>>> 3. Add comments and normalize local variable name
> >>>>>
> >>>>>
> >>>>> v1 --> v2:
> >>>>> 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
> >>>>>    kcpustat_this_cpu cannot be used.
> >>>>> @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
> >>>>>         if (r->gp_seq != rdp->gp_seq)
> >>>>>                 return;
> >>>>>
> >>>>> -       cpustat = kcpustat_this_cpu->cpustat;
> >>>>> +       cpustat = kcpustat_cpu(cpu).cpustat;
> >>>>> 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
> >>>>>    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
> >>>>>
> >>>>> v1:
> >>>>> In some extreme cases, such as the I/O pressure test, the CPU usage may
> >>>>> be 100%, causing RCU stall. In this case, the printed information about
> >>>>> current is not useful. Displays the number and usage of hard interrupts,
> >>>>> soft interrupts, and context switches that are generated within half of
> >>>>> the CPU stall timeout, can help us make a general judgment. In other
> >>>>> cases, we can preliminarily determine whether an infinite loop occurs
> >>>>> when local_irq, local_bh or preempt is disabled.
> >>>>
> >>>> That looks useful but I have to ask: what does it bring that the softlockup
> >>>> and hardlockup watchdog can not already solve?
> >>>
> >>> This is a good point.  One possible benefit is putting the needed information
> >>> in one spot, for example, in cases where the soft/hard lockup timeouts are
> >>> significantly different than the RCU CPU stall warning timeout.
> >>
> >> Arguably, the hardlockup/softlockup detectors usually trigger after RCU stall,
> >> unless all CPUs are caught into a hardlockup, in which case only the hardlockup
> >> detector has a chance.
> >>
> >> Anyway I would say that in this case just lower the delay for the lockup
> >> detectors to consider the situation is a lockup?
> > 
> > Try it both ways and see how it works?  The rcutorture module parameters
> > stall_cpu and stall_cpu_irqsoff are easy ways to generate these sorts
> > of scenarios.
> > 
> > Actually, that does remind me of something.  Back when I was chasing
> > that interrupt storm, would this patch have helped me?  In that case, the
> 
> Yes, this patch series originally addressed an RCU stall issue caused by an
> interruption storm. The serial port driver written by another project team
> failed to write the register in a specific condition. As a result, interrupts
> were repeatedly reported.

Very good!

> > half-way point would have been reached while all online CPUs were spinning
> > with interrupts disabled and the incoming CPU was getting hammered with
> > continual scheduling-clock interrupts.  So I suspect that the answer is
> > "no" because the incoming CPU was not blocking the grace period.
> > 
> > Instead of being snapshot halfway to the RCU CPU stall warning, should
> > the values be snapshot when the CPU notices the beginning or end of an
> > RCU grace period and when a CPU goes offline?
> 
> This won't work. Those normal counts that occurred before the failure
> have an impact on our analysis. For example, some software interrupts
> may have been generated before local_bh_disable() is called.

Fair enough, and thank you for considering this option.  But please be
prepared to adjust (somehow or another) as needed to accommodate other
failure scenarios as they arise.

							Thanx, Paul

> > But that would not suffice, because detailed information would not have
> > been dumped for the incoming CPU.
> > 
> > However, the lack of context switches and interrupts on the rest of the
> > CPUs would likely have been a big cluebat, so there is that.  It might
> > be better to rework the warning at the beginning of rcu_sched_clock_irq()
> > to complain if more than (say) 10 scheduling-clock interrupts occur on
> > a given CPU during a single jiffy.
> > 
> > Independent of Zhen Lei patch.
> > 
> > Thoughts?
> > 
> > 							Thanx, Paul
> > 
> >> Thanks.
> >>
> >>
> >>>
> >>> Thoughts?
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>> Thanks.
> >>>>
> >>>>>
> >>>>> Zhen Lei (2):
> >>>>>   rcu: Add RCU stall diagnosis information
> >>>>>   doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
> >>>>>
> >>>>>  Documentation/RCU/stallwarn.rst               | 88 +++++++++++++++++++
> >>>>>  .../admin-guide/kernel-parameters.txt         |  6 ++
> >>>>>  kernel/rcu/Kconfig.debug                      | 11 +++
> >>>>>  kernel/rcu/rcu.h                              |  1 +
> >>>>>  kernel/rcu/tree.c                             | 17 ++++
> >>>>>  kernel/rcu/tree.h                             | 19 ++++
> >>>>>  kernel/rcu/tree_stall.h                       | 29 ++++++
> >>>>>  kernel/rcu/update.c                           |  2 +
> >>>>>  8 files changed, 173 insertions(+)
> >>>>>
> >>>>> -- 
> >>>>> 2.25.1
> >>>>>
> > .
> > 
> 
> -- 
> Regards,
>   Zhen Lei


* Re: [PATCH v6 0/2] rcu: Add RCU stall diagnosis information
  2022-11-12 18:59           ` Paul E. McKenney
@ 2022-11-15  9:11             ` Leizhen (ThunderTown)
  0 siblings, 0 replies; 19+ messages in thread
From: Leizhen (ThunderTown) @ 2022-11-15  9:11 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	rcu, linux-kernel, Robert Elliott



On 2022/11/13 2:59, Paul E. McKenney wrote:
> On Thu, Nov 10, 2022 at 10:27:44AM +0800, Leizhen (ThunderTown) wrote:
>> On 2022/11/10 1:22, Paul E. McKenney wrote:
>>> On Wed, Nov 09, 2022 at 06:03:17PM +0100, Frederic Weisbecker wrote:
>>>> On Wed, Nov 09, 2022 at 07:59:01AM -0800, Paul E. McKenney wrote:
>>>>> On Wed, Nov 09, 2022 at 04:26:21PM +0100, Frederic Weisbecker wrote:
>>>>>> Hi Zhen Lei,
>>>>>>
>>>>>> On Wed, Nov 09, 2022 at 05:37:36PM +0800, Zhen Lei wrote:
>>>>>>> v5 --> v6:
>>>>>>> 1. When there are more than two continuous RCU stallings, correctly handle the
>>>>>>>    value of the second and subsequent sampling periods. Update comments and
>>>>>>>    document.
>>>>>>>    Thanks to Elliott, Robert for the test.
>>>>>>> 2. Change "rcu stall" to "RCU stall".
>>>>>>>
>>>>>>> v4 --> v5:
>>>>>>> 1. Resolve a git am conflict. No code change.
>>>>>>>
>>>>>>> v3 --> v4:
>>>>>>> 1. Rename rcu_cpu_stall_deep_debug to rcu_cpu_stall_cputime.
>>>>>>>
>>>>>>> v2 --> v3:
>>>>>>> 1. Fix the return type of kstat_cpu_irqs_sum()
>>>>>>> 2. Add Kconfig option CONFIG_RCU_CPU_STALL_DEEP_DEBUG and boot parameter
>>>>>>>    rcupdate.rcu_cpu_stall_deep_debug.
>>>>>>> 3. Add comments and normalize local variable name
>>>>>>>
>>>>>>>
>>>>>>> v1 --> v2:
>>>>>>> 1. Fixed a bug in the code. If the rcu stall is detected by another CPU,
>>>>>>>    kcpustat_this_cpu cannot be used.
>>>>>>> @@ -451,7 +451,7 @@ static void print_cpu_stat_info(int cpu)
>>>>>>>         if (r->gp_seq != rdp->gp_seq)
>>>>>>>                 return;
>>>>>>>
>>>>>>> -       cpustat = kcpustat_this_cpu->cpustat;
>>>>>>> +       cpustat = kcpustat_cpu(cpu).cpustat;
>>>>>>> 2. Move the start point of statistics from rcu_stall_kick_kthreads() to
>>>>>>>    rcu_implicit_dynticks_qs(), removing the dependency on irq_work.
>>>>>>>
>>>>>>> v1:
>>>>>>> In some extreme cases, such as the I/O pressure test, the CPU usage may
>>>>>>> be 100%, causing RCU stall. In this case, the printed information about
>>>>>>> current is not useful. Displays the number and usage of hard interrupts,
>>>>>>> soft interrupts, and context switches that are generated within half of
>>>>>>> the CPU stall timeout, can help us make a general judgment. In other
>>>>>>> cases, we can preliminarily determine whether an infinite loop occurs
>>>>>>> when local_irq, local_bh or preempt is disabled.
>>>>>>
>>>>>> That looks useful but I have to ask: what does it bring that the softlockup
>>>>>> and hardlockup watchdog can not already solve?
>>>>>
>>>>> This is a good point.  One possible benefit is putting the needed information
>>>>> in one spot, for example, in cases where the soft/hard lockup timeouts are
>>>>> significantly different than the RCU CPU stall warning timeout.
>>>>
>>>> Arguably, the hardlockup/softlockup detectors usually trigger after RCU stall,
>>>> unless all CPUs are caught into a hardlockup, in which case only the hardlockup
>>>> detector has a chance.
>>>>
>>>> Anyway I would say that in this case just lower the delay for the lockup
>>>> detectors to consider the situation is a lockup?
>>>
>>> Try it both ways and see how it works?  The rcutorture module parameters
>>> stall_cpu and stall_cpu_irqsoff are easy ways to generate these sorts
>>> of scenarios.
>>>
>>> Actually, that does remind me of something.  Back when I was chasing
>>> that interrupt storm, would this patch have helped me?  In that case, the
>>
>> Yes, this patch series originally addressed an RCU stall issue caused by an
>> interruption storm. The serial port driver written by another project team
>> failed to write the register in a specific condition. As a result, interrupts
>> were repeatedly reported.
> 
> Very good!
> 
>>> half-way point would have been reached while all online CPUs were spinning
>>> with interrupts disabled and the incoming CPU was getting hammered with
>>> continual scheduling-clock interrupts.  So I suspect that the answer is
>>> "no" because the incoming CPU was not blocking the grace period.
>>>
>>> Instead of being snapshot halfway to the RCU CPU stall warning, should
>>> the values be snapshot when the CPU notices the beginning or end of an
>>> RCU grace period and when a CPU goes offline?
>>
>> This won't work. Those normal counts that occurred before the failure
>> have an impact on our analysis. For example, some software interrupts
>> may have been generated before local_bh_disable() is called.
> 
> Fair enough, and thank you for considering this option.  But please be
> prepared to adjust (somehow or another) as needed to accommodate other
> failure scenarios as they arise.

Apart from a "make htmldocs" warning in the document, I can't think of
anything else to improve on v7 at the moment.

As for changing the type of softirqs[NR_SOFTIRQS] from "unsigned int" to
"unsigned long", I will post a separate patch in the future, because
more people may want to join that discussion.

How about I post v8 tomorrow?

> 
> 							Thanx, Paul
> 
>>> But that would not suffice, because detailed information would not have
>>> been dumped for the incoming CPU.
>>>
>>> However, the lack of context switches and interrupts on the rest of the
>>> CPUs would likely have been a big cluebat, so there is that.  It might
>>> be better to rework the warning at the beginning of rcu_sched_clock_irq()
>>> to complain if more than (say) 10 scheduling-clock interrupts occur on
>>> a given CPU during a single jiffy.
>>>
>>> Independent of Zhen Lei patch.
>>>
>>> Thoughts?
>>>
>>> 							Thanx, Paul
>>>
>>>> Thanks.
>>>>
>>>>
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> 							Thanx, Paul
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>>
>>>>>>> Zhen Lei (2):
>>>>>>>   rcu: Add RCU stall diagnosis information
>>>>>>>   doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information
>>>>>>>
>>>>>>>  Documentation/RCU/stallwarn.rst               | 88 +++++++++++++++++++
>>>>>>>  .../admin-guide/kernel-parameters.txt         |  6 ++
>>>>>>>  kernel/rcu/Kconfig.debug                      | 11 +++
>>>>>>>  kernel/rcu/rcu.h                              |  1 +
>>>>>>>  kernel/rcu/tree.c                             | 17 ++++
>>>>>>>  kernel/rcu/tree.h                             | 19 ++++
>>>>>>>  kernel/rcu/tree_stall.h                       | 29 ++++++
>>>>>>>  kernel/rcu/update.c                           |  2 +
>>>>>>>  8 files changed, 173 insertions(+)
>>>>>>>
>>>>>>> -- 
>>>>>>> 2.25.1
>>>>>>>
>>> .
>>>
>>
>> -- 
>> Regards,
>>   Zhen Lei
> .
> 

-- 
Regards,
  Zhen Lei


end of thread, other threads:[~2022-11-15  9:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
2022-11-09  9:37 [PATCH v6 0/2] rcu: Add RCU stall diagnosis information Zhen Lei
2022-11-09  9:37 ` [PATCH v6 1/2] " Zhen Lei
2022-11-09 15:20   ` Frederic Weisbecker
2022-11-10  6:55     ` Leizhen (ThunderTown)
2022-11-09 16:55   ` Elliott, Robert (Servers)
2022-11-09 17:03     ` Elliott, Robert (Servers)
2022-11-10  8:27     ` Leizhen (ThunderTown)
2022-11-09  9:37 ` [PATCH v6 2/2] doc: Document CONFIG_RCU_CPU_STALL_CPUTIME=y stall information Zhen Lei
2022-11-09 15:08   ` Frederic Weisbecker
2022-11-10  2:54     ` Leizhen (ThunderTown)
2022-11-09 15:26 ` [PATCH v6 0/2] rcu: Add RCU stall diagnosis information Frederic Weisbecker
2022-11-09 15:59   ` Paul E. McKenney
2022-11-09 17:03     ` Frederic Weisbecker
2022-11-09 17:22       ` Paul E. McKenney
2022-11-10  2:27         ` Leizhen (ThunderTown)
2022-11-12 18:59           ` Paul E. McKenney
2022-11-15  9:11             ` Leizhen (ThunderTown)
2022-11-10  7:29       ` Leizhen (ThunderTown)
2022-11-10 11:35         ` Frederic Weisbecker
