linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/4] printk: reduce deadlocks during panic
@ 2022-01-26 23:02 Stephen Brennan
  2022-01-26 23:02 ` [PATCH v2 1/4] panic: Add panic_in_progress helper Stephen Brennan
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Stephen Brennan @ 2022-01-26 23:02 UTC (permalink / raw)
  To: Andrew Morton, Sergey Senozhatsky, Petr Mladek
  Cc: Arnd Bergmann, Steven Rostedt, Stephen Brennan,
	Sebastian Reichel, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

When a caller writes heavily to the kernel log (e.g. writing to
/dev/kmsg in a loop) while another panics, there's currently a high
likelihood of a deadlock (see patch 2 for the full description of this
deadlock).

The principle fix is to disable the optimistic spin once panic_cpu is
set, so the panic CPU doesn't spin waiting for a halted CPU to hand over
the console_sem.

However, this exposed us to a livelock situation, where the panic CPU
holds the console_sem, and another CPU could fill up the log buffer
faster than the consoles could drain it, preventing the panic from
progressing and halting the other CPUs. To avoid this, patch 3 adds a
mechanism to suppress printk (from non-panic-CPU) during panic, if we
reach a threshold of dropped messages.

A major goal with all of these patches is to try to decrease the
likelihood that another CPU is holding the console_sem when we halt it
in panic(). This reduces the odds of needing to break locks and
potentially encountering further deadlocks with the console drivers.

To test, I use the following script, kmsg_panic.sh:

    #!/bin/bash
    date
    # 991 chars (based on log buffer size):
    chars="$(printf 'a%.0s' {1..991})"
    while :; do
        echo $chars > /dev/kmsg
    done &
    echo c > /proc/sysrq-trigger &
    date
    exit

I defined a hang as any time the system did not reboot to a login prompt
on the serial console within 60 seconds. Here are the statistics on
hangs using this script, before and after the patch.

before:  776 hangs / 1484 trials - 52.3%
after :    0 hangs /  408 trials -  0.0%

v2:
    Each patch has minor updates from code reviews. I've re-done testing and
    updated the above statistics. Exact changes are in each patch.

Stephen Brennan (4):
  panic: Add panic_in_progress helper
  printk: disable optimistic spin during panic
  printk: Avoid livelock with heavy printk during panic
  printk: Drop console_sem during panic

 include/linux/panic.h  |  6 +++++
 kernel/printk/printk.c | 51 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 56 insertions(+), 1 deletion(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/4] panic: Add panic_in_progress helper
  2022-01-26 23:02 [PATCH v2 0/4] printk: reduce deadlocks during panic Stephen Brennan
@ 2022-01-26 23:02 ` Stephen Brennan
  2022-01-27 14:34   ` Petr Mladek
  2022-01-26 23:02 ` [PATCH v2 2/4] printk: disable optimistic spin during panic Stephen Brennan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Stephen Brennan @ 2022-01-26 23:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arnd Bergmann, Steven Rostedt, Stephen Brennan,
	Sebastian Reichel, Sergey Senozhatsky, John Ogness,
	Andy Shevchenko, Luis Chamberlain, linux-kernel, Petr Mladek

This helper will be used in printk code to avoid deadlocks during
panic().

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---

Notes:
    v2: Switch to static inline function

 include/linux/panic.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/panic.h b/include/linux/panic.h
index f5844908a089..1022ec930d34 100644
--- a/include/linux/panic.h
+++ b/include/linux/panic.h
@@ -4,6 +4,7 @@
 
 #include <linux/compiler_attributes.h>
 #include <linux/types.h>
+#include <linux/atomic.h>
 
 struct pt_regs;
 
@@ -45,6 +46,11 @@ extern bool crash_kexec_post_notifiers;
 extern atomic_t panic_cpu;
 #define PANIC_CPU_INVALID	-1
 
+static inline bool panic_in_progress(void)
+{
+	return unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID);
+}
+
 /*
  * Only to be used by arch init code. If the user over-wrote the default
  * CONFIG_PANIC_TIMEOUT, honor it.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 2/4] printk: disable optimistic spin during panic
  2022-01-26 23:02 [PATCH v2 0/4] printk: reduce deadlocks during panic Stephen Brennan
  2022-01-26 23:02 ` [PATCH v2 1/4] panic: Add panic_in_progress helper Stephen Brennan
@ 2022-01-26 23:02 ` Stephen Brennan
  2022-01-26 23:02 ` [PATCH v2 3/4] printk: Avoid livelock with heavy printk " Stephen Brennan
  2022-01-26 23:02 ` [PATCH v2 4/4] printk: Drop console_sem " Stephen Brennan
  3 siblings, 0 replies; 14+ messages in thread
From: Stephen Brennan @ 2022-01-26 23:02 UTC (permalink / raw)
  To: Sergey Senozhatsky, Petr Mladek
  Cc: Arnd Bergmann, Steven Rostedt, Andrew Morton, Stephen Brennan,
	Sebastian Reichel, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

A CPU executing with console lock spinning enabled might be halted
during a panic. Before the panicking CPU calls console_flush_on_panic(),
it may call console_trylock(), which attempts to optimistically spin,
deadlocking the panic CPU:

CPU 0 (panic CPU)             CPU 1
-----------------             ------
                              printk() {
                                vprintk_func() {
                                  vprintk_default() {
                                    vprintk_emit() {
                                      console_unlock() {
                                        console_lock_spinning_enable();
                                        ... printing to console ...
panic() {
  crash_smp_send_stop() {
    NMI  -------------------> HALT
  }
  atomic_notifier_call_chain() {
    printk() {
      ...
      console_trylock_spinnning() {
        // optimistic spin infinitely

This hang during panic can be induced when a kdump kernel is loaded, and
crash_kexec_post_notifiers=1 is present on the kernel command line. The
following script which concurrently writes to /dev/kmsg, and triggers a
panic, can result in this hang:

    #!/bin/bash
    date
    # 991 chars (based on log buffer size):
    chars="$(printf 'a%.0s' {1..991})"
    while :; do
        echo $chars > /dev/kmsg
    done &
    echo c > /proc/sysrq-trigger &
    date
    exit

To avoid this deadlock, ensure that console_trylock_spinning() does not
allow spinning once a panic has begun.

Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
---
 kernel/printk/printk.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 57b132b658e1..20b4b71a1a07 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1843,6 +1843,16 @@ static int console_trylock_spinning(void)
 	if (console_trylock())
 		return 1;
 
+	/*
+	 * It's unsafe to spin once a panic has begun. If we are the
+	 * panic CPU, we may have already halted the owner of the
+	 * console_sem. If we are not the panic CPU, then we should
+	 * avoid taking console_sem, so the panic CPU has a better
+	 * chance of cleanly acquiring it later.
+	 */
+	if (panic_in_progress())
+		return 0;
+
 	printk_safe_enter_irqsave(flags);
 
 	raw_spin_lock(&console_owner_lock);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 3/4] printk: Avoid livelock with heavy printk during panic
  2022-01-26 23:02 [PATCH v2 0/4] printk: reduce deadlocks during panic Stephen Brennan
  2022-01-26 23:02 ` [PATCH v2 1/4] panic: Add panic_in_progress helper Stephen Brennan
  2022-01-26 23:02 ` [PATCH v2 2/4] printk: disable optimistic spin during panic Stephen Brennan
@ 2022-01-26 23:02 ` Stephen Brennan
  2022-01-27 14:50   ` Petr Mladek
  2022-01-26 23:02 ` [PATCH v2 4/4] printk: Drop console_sem " Stephen Brennan
  3 siblings, 1 reply; 14+ messages in thread
From: Stephen Brennan @ 2022-01-26 23:02 UTC (permalink / raw)
  To: Sergey Senozhatsky, Petr Mladek
  Cc: Arnd Bergmann, Steven Rostedt, Andrew Morton, Stephen Brennan,
	Sebastian Reichel, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

During panic(), if another CPU is writing heavily the kernel log (e.g.
via /dev/kmsg), then the panic CPU may livelock writing out its messages
to the console. Note when too many messages are dropped during panic and
suppress further printk, except from the panic CPU. This could result in
some important messages being dropped. However, messages are already
being dropped, so this approach at least prevents a livelock.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---

Notes:
    v2: Add pr_warn when we suppress printk on non-panic CPU

 kernel/printk/printk.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 20b4b71a1a07..18107db118d4 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -93,6 +93,12 @@ EXPORT_SYMBOL_GPL(console_drivers);
  */
 int __read_mostly suppress_printk;
 
+/*
+ * During panic, heavy printk by other CPUs can delay the
+ * panic and risk deadlock on console resources.
+ */
+int __read_mostly suppress_panic_printk;
+
 #ifdef CONFIG_LOCKDEP
 static struct lockdep_map console_lock_dep_map = {
 	.name = "console_lock"
@@ -2228,6 +2234,10 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (unlikely(suppress_printk))
 		return 0;
 
+	if (unlikely(suppress_panic_printk) &&
+	    atomic_read(&panic_cpu) != raw_smp_processor_id())
+		return 0;
+
 	if (level == LOGLEVEL_SCHED) {
 		level = LOGLEVEL_DEFAULT;
 		in_sched = true;
@@ -2613,6 +2623,7 @@ void console_unlock(void)
 {
 	static char ext_text[CONSOLE_EXT_LOG_MAX];
 	static char text[CONSOLE_LOG_MAX];
+	static int panic_console_dropped;
 	unsigned long flags;
 	bool do_cond_resched, retry;
 	struct printk_info info;
@@ -2667,6 +2678,11 @@ void console_unlock(void)
 		if (console_seq != r.info->seq) {
 			console_dropped += r.info->seq - console_seq;
 			console_seq = r.info->seq;
+			if (panic_in_progress() && panic_console_dropped++ > 10) {
+				suppress_panic_printk = 1;
+				pr_warn("Too many dropped messages. "
+				        "Supress messages on non-panic CPUs to prevent livelock.\n");
+			}
 		}
 
 		if (suppress_message_printing(r.info->level)) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v2 4/4] printk: Drop console_sem during panic
  2022-01-26 23:02 [PATCH v2 0/4] printk: reduce deadlocks during panic Stephen Brennan
                   ` (2 preceding siblings ...)
  2022-01-26 23:02 ` [PATCH v2 3/4] printk: Avoid livelock with heavy printk " Stephen Brennan
@ 2022-01-26 23:02 ` Stephen Brennan
  2022-01-27  9:22   ` John Ogness
  2022-01-27 15:12   ` Petr Mladek
  3 siblings, 2 replies; 14+ messages in thread
From: Stephen Brennan @ 2022-01-26 23:02 UTC (permalink / raw)
  To: Sergey Senozhatsky, Petr Mladek
  Cc: Arnd Bergmann, Steven Rostedt, Andrew Morton, Stephen Brennan,
	Sebastian Reichel, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

If another CPU is in panic, we are about to be halted. Try to gracefully
abandon the console_sem, leaving it free for the panic CPU to grab.

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---

Notes:
    v2: Factor check out to a helper, and check at the end of
    console_unlock() to prevent retry as well.

 kernel/printk/printk.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 18107db118d4..572363ff716f 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2593,6 +2593,25 @@ static int have_callable_console(void)
 	return 0;
 }
 
+/*
+ * Return true when this CPU should unlock console_sem without pushing all
+ * messages to the console. This reduces the chance that the console is
+ * locked when the panic CPU tries to use it.
+ */
+static bool abandon_console_lock_in_panic(void)
+{
+	if (!panic_in_progress())
+		return false;
+
+	/*
+	 * We can use raw_smp_processor_id() here because it is impossible for
+	 * the task to be migrated to the panic_cpu, or away from it. If
+	 * panic_cpu has already been set, and we're not currently executing on
+	 * that CPU, then we never will be.
+	 */
+	return atomic_read(&panic_cpu) != raw_smp_processor_id();
+}
+
 /*
  * Can we actually use the console at this time on this cpu?
  *
@@ -2742,6 +2761,10 @@ void console_unlock(void)
 		if (handover)
 			return;
 
+		/* Allow panic_cpu to take over the consoles safely */
+		if (abandon_console_lock_in_panic())
+			break;
+
 		if (do_cond_resched)
 			cond_resched();
 	}
@@ -2759,7 +2782,7 @@ void console_unlock(void)
 	 * flush, no worries.
 	 */
 	retry = prb_read_valid(prb, next_seq, NULL);
-	if (retry && console_trylock())
+	if (retry && !abandon_console_lock_in_panic() && console_trylock())
 		goto again;
 }
 EXPORT_SYMBOL(console_unlock);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] printk: Drop console_sem during panic
  2022-01-26 23:02 ` [PATCH v2 4/4] printk: Drop console_sem " Stephen Brennan
@ 2022-01-27  9:22   ` John Ogness
  2022-01-27 15:03     ` Petr Mladek
  2022-01-27 15:12   ` Petr Mladek
  1 sibling, 1 reply; 14+ messages in thread
From: John Ogness @ 2022-01-27  9:22 UTC (permalink / raw)
  To: Stephen Brennan, Sergey Senozhatsky, Petr Mladek
  Cc: Arnd Bergmann, Steven Rostedt, Andrew Morton, Stephen Brennan,
	Sebastian Reichel, Andy Shevchenko, Luis Chamberlain,
	linux-kernel

On 2022-01-26, Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2759,7 +2782,7 @@ void console_unlock(void)
>  	 * flush, no worries.
>  	 */
>  	retry = prb_read_valid(prb, next_seq, NULL);
> -	if (retry && console_trylock())
> +	if (retry && !abandon_console_lock_in_panic() && console_trylock())

As Sergey suggested [0], I would like to see the call to
abandon_console_lock_in_panic() move inside console_trylock(). This will
help to avoid the race between NMI CPU halt and the internal sema.lock
spinlock.

John Ogness

[0] https://lore.kernel.org/all/YfJFjHdg%2FkhNXiRd@google.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] panic: Add panic_in_progress helper
  2022-01-26 23:02 ` [PATCH v2 1/4] panic: Add panic_in_progress helper Stephen Brennan
@ 2022-01-27 14:34   ` Petr Mladek
  2022-01-27 16:02     ` Stephen Brennan
  0 siblings, 1 reply; 14+ messages in thread
From: Petr Mladek @ 2022-01-27 14:34 UTC (permalink / raw)
  To: Stephen Brennan
  Cc: Andrew Morton, Arnd Bergmann, Steven Rostedt, Sebastian Reichel,
	Sergey Senozhatsky, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

On Wed 2022-01-26 15:02:33, Stephen Brennan wrote:
> This helper will be used in printk code to avoid deadlocks during
> panic().
> 
> Suggested-by: Petr Mladek <pmladek@suse.com>
> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>

Reviewed-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] printk: Avoid livelock with heavy printk during panic
  2022-01-26 23:02 ` [PATCH v2 3/4] printk: Avoid livelock with heavy printk " Stephen Brennan
@ 2022-01-27 14:50   ` Petr Mladek
  2022-01-27 16:19     ` Stephen Brennan
  0 siblings, 1 reply; 14+ messages in thread
From: Petr Mladek @ 2022-01-27 14:50 UTC (permalink / raw)
  To: Stephen Brennan
  Cc: Sergey Senozhatsky, Arnd Bergmann, Steven Rostedt, Andrew Morton,
	Sebastian Reichel, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

On Wed 2022-01-26 15:02:35, Stephen Brennan wrote:
> During panic(), if another CPU is writing heavily the kernel log (e.g.
> via /dev/kmsg), then the panic CPU may livelock writing out its messages
> to the console. Note when too many messages are dropped during panic and
> suppress further printk, except from the panic CPU. This could result in
> some important messages being dropped. However, messages are already
> being dropped, so this approach at least prevents a livelock.
> 
> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
> ---
> 
> Notes:
>     v2: Add pr_warn when we suppress printk on non-panic CPU
> 
>  kernel/printk/printk.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 20b4b71a1a07..18107db118d4 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -93,6 +93,12 @@ EXPORT_SYMBOL_GPL(console_drivers);
>   */
>  int __read_mostly suppress_printk;
>  
> +/*
> + * During panic, heavy printk by other CPUs can delay the
> + * panic and risk deadlock on console resources.
> + */
> +int __read_mostly suppress_panic_printk;
> +
>  #ifdef CONFIG_LOCKDEP
>  static struct lockdep_map console_lock_dep_map = {
>  	.name = "console_lock"
> @@ -2228,6 +2234,10 @@ asmlinkage int vprintk_emit(int facility, int level,
>  	if (unlikely(suppress_printk))
>  		return 0;
>  
> +	if (unlikely(suppress_panic_printk) &&
> +	    atomic_read(&panic_cpu) != raw_smp_processor_id())
> +		return 0;
> +
>  	if (level == LOGLEVEL_SCHED) {
>  		level = LOGLEVEL_DEFAULT;
>  		in_sched = true;
> @@ -2613,6 +2623,7 @@ void console_unlock(void)
>  {
>  	static char ext_text[CONSOLE_EXT_LOG_MAX];
>  	static char text[CONSOLE_LOG_MAX];
> +	static int panic_console_dropped;
>  	unsigned long flags;
>  	bool do_cond_resched, retry;
>  	struct printk_info info;
> @@ -2667,6 +2678,11 @@ void console_unlock(void)
>  		if (console_seq != r.info->seq) {
>  			console_dropped += r.info->seq - console_seq;
>  			console_seq = r.info->seq;
> +			if (panic_in_progress() && panic_console_dropped++ > 10) {
> +				suppress_panic_printk = 1;
> +				pr_warn("Too many dropped messages. "
> +				        "Supress messages on non-panic CPUs to prevent livelock.\n");

It looks like the message might be printed more times when
panic_console_dropped++ > 10.

In theory, no message can be lost after we disable printk on another
CPUs. But the code might evolve in the future. Let's make it
more error-proof.

We could use (panic_console_dropped++ == 10) or pr_warn_once() or
both.

I prefer using pr_warn_once() because it looks the most error-proof.


Nit: printk() has exceptions from the 80 chars/line rule.
     The message string should be on a single line. It helps
     to find it using "git grep". I think that even checkpatch.pl
     handles this correctly.


With pr_warn_once() and message in single line:

Reviewed-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr


PS: I could fix the two problems when pushing to git. But there
    is still time to send v3. I have vacation the following week
    with limited internet access. I am not going to rush it into
    linux before I leave, ...
    

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] printk: Drop console_sem during panic
  2022-01-27  9:22   ` John Ogness
@ 2022-01-27 15:03     ` Petr Mladek
  2022-01-28  5:55       ` Sergey Senozhatsky
  0 siblings, 1 reply; 14+ messages in thread
From: Petr Mladek @ 2022-01-27 15:03 UTC (permalink / raw)
  To: John Ogness
  Cc: Stephen Brennan, Sergey Senozhatsky, Arnd Bergmann,
	Steven Rostedt, Andrew Morton, Sebastian Reichel,
	Andy Shevchenko, Luis Chamberlain, linux-kernel

On Thu 2022-01-27 10:28:53, John Ogness wrote:
> On 2022-01-26, Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2759,7 +2782,7 @@ void console_unlock(void)
> >  	 * flush, no worries.
> >  	 */
> >  	retry = prb_read_valid(prb, next_seq, NULL);
> > -	if (retry && console_trylock())
> > +	if (retry && !abandon_console_lock_in_panic() && console_trylock())
> 
> As Sergey suggested [0], I would like to see the call to
> abandon_console_lock_in_panic() move inside console_trylock(). This will
> help to avoid the race between NMI CPU halt and the internal sema.lock
> spinlock.

I would prefer if it is done as a followup patch. The code in this
patch is still needed. It helps when the non-panic CPU is busy
with pushing many pending messages. Also it is a more conservative
approach.

Always failing console_trylock() on non-panic CPU makes sense as well.
But it affects many more users. It is likely safe because it is
trylock. But the entire effect is not fully clear to me. So, I suggest
to do it in a separate patch. It might help with bisection.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] printk: Drop console_sem during panic
  2022-01-26 23:02 ` [PATCH v2 4/4] printk: Drop console_sem " Stephen Brennan
  2022-01-27  9:22   ` John Ogness
@ 2022-01-27 15:12   ` Petr Mladek
  1 sibling, 0 replies; 14+ messages in thread
From: Petr Mladek @ 2022-01-27 15:12 UTC (permalink / raw)
  To: Stephen Brennan
  Cc: Sergey Senozhatsky, Arnd Bergmann, Steven Rostedt, Andrew Morton,
	Sebastian Reichel, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

On Wed 2022-01-26 15:02:36, Stephen Brennan wrote:
> If another CPU is in panic, we are about to be halted. Try to gracefully
> abandon the console_sem, leaving it free for the panic CPU to grab.
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 18107db118d4..572363ff716f 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2742,6 +2761,10 @@ void console_unlock(void)
>  		if (handover)
>  			return;
>  
> +		/* Allow panic_cpu to take over the consoles safely */
> +		if (abandon_console_lock_in_panic())
> +			break;

Hmm, it makes some sense to have it before cond_resched(). But I would
like to have it at the beginning of the cycle so that console_unlock()
might leave quickly without processing any single message.

We could have it in both (three) locations. But it might be over
cautious.

Anyway, the beginning is more important. Sleeping with console_sem is
less risky from the panic and races point of view.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] panic: Add panic_in_progress helper
  2022-01-27 14:34   ` Petr Mladek
@ 2022-01-27 16:02     ` Stephen Brennan
  2022-01-28  8:24       ` Petr Mladek
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Brennan @ 2022-01-27 16:02 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Arnd Bergmann, Steven Rostedt, Sebastian Reichel,
	Sergey Senozhatsky, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

On 1/27/22 06:34, Petr Mladek wrote:
> On Wed 2022-01-26 15:02:33, Stephen Brennan wrote:
>> This helper will be used in printk code to avoid deadlocks during
>> panic().
>>
>> Suggested-by: Petr Mladek <pmladek@suse.com>
>> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
> 
> Reviewed-by: Petr Mladek <pmladek@suse.com>
> 
> Best Regards,
> Petr

Hi Petr,

Thanks for the review, however over the night I received two kernel test 
robot emails. One indicating a new build error caused by this change on 
m68k arch, and the other adding a new warning on riscv. From what I can 
tell, the issues are circular dependencies in #includes. So it may be 
better to either return to the macro, or move this static inline down to 
kernel/printk/printk.c. I think moving it into kernel/printk/printk.c 
makes most sense given that the macro requires the correct #includes anyway.

Thanks,
Stephen

------------- m68k

    In file included from include/asm-generic/preempt.h:5,
                     from ./arch/m68k/include/generated/asm/preempt.h:1,
                     from include/linux/preempt.h:78,
                     from arch/m68k/include/asm/irqflags.h:6,
                     from include/linux/irqflags.h:16,
                     from arch/m68k/include/asm/atomic.h:6,
                     from include/linux/atomic.h:7,
                     from include/linux/panic.h:7,
                     from include/asm-generic/bug.h:21,
                     from arch/m68k/include/asm/bug.h:32,
                     from include/linux/bug.h:5,
                     from include/linux/page-flags.h:10,
                     from kernel/bounds.c:10:
    include/linux/thread_info.h: In function 'copy_overflow':
 >> include/linux/thread_info.h:214:9: error: implicit declaration of 
function 'WARN' [-Werror=implicit-function-declaration]
      214 |         WARN(1, "Buffer overflow detected (%d < %lu)!\n", 
size, count);
          |         ^~~~
    include/linux/thread_info.h: In function 'check_copy_size':
 >> include/linux/thread_info.h:230:13: error: implicit declaration of 
function 'WARN_ON_ONCE' [-Werror=implicit-function-declaration]
      230 |         if (WARN_ON_ONCE(bytes > INT_MAX))
          |             ^~~~~~~~~~~~
    cc1: some warnings being treated as errors
    make[2]: *** [scripts/Makefile.build:121: kernel/bounds.s] Error 1
    make[2]: Target '__build' not remade because of errors.
    make[1]: *** [Makefile:1191: prepare0] Error 2
    make[1]: Target 'prepare' not remade because of errors.
    make: *** [Makefile:219: __sub-make] Error 2
    make: Target 'prepare' not remade because of errors.

----------------- riscv:

    In file included from arch/riscv/include/asm/atomic.h:19,
                     from include/linux/atomic.h:7,
                     from include/linux/panic.h:7,
                     from include/asm-generic/bug.h:21,
                     from arch/riscv/include/asm/bug.h:83,
                     from include/linux/bug.h:5,
                     from include/linux/page-flags.h:10,
                     from kernel/bounds.c:10:
    arch/riscv/include/asm/atomic.h: In function 'arch_atomic_xchg_relaxed':
 >> arch/riscv/include/asm/cmpxchg.h:35:17: error: implicit declaration 
of function 'BUILD_BUG' [-Werror=implicit-function-declaration]
       35 |                 BUILD_BUG(); 
             \
          |                 ^~~~~~~~~
    arch/riscv/include/asm/atomic.h:249:16: note: in expansion of macro 
'__xchg_relaxed'
      249 |         return __xchg_relaxed(&(v->counter), n, size); 
             \
          |                ^~~~~~~~~~~~~~
    arch/riscv/include/asm/atomic.h:295:9: note: in expansion of macro 
'ATOMIC_OP'
      295 |         ATOMIC_OP(int,   , 4) 
             \
          |         ^~~~~~~~~
    arch/riscv/include/asm/atomic.h:299:1: note: in expansion of macro 
'ATOMIC_OPS'
      299 | ATOMIC_OPS()
          | ^~~~~~~~~~
    cc1: some warnings being treated as errors
    make[2]: *** [scripts/Makefile.build:121: kernel/bounds.s] Error 1
    make[2]: Target '__build' not remade because of errors.
    make[1]: *** [Makefile:1191: prepare0] Error 2
    make[1]: Target 'prepare' not remade because of errors.
    make: *** [Makefile:219: __sub-make] Error 2
    make: Target 'prepare' not remade because of errors.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] printk: Avoid livelock with heavy printk during panic
  2022-01-27 14:50   ` Petr Mladek
@ 2022-01-27 16:19     ` Stephen Brennan
  0 siblings, 0 replies; 14+ messages in thread
From: Stephen Brennan @ 2022-01-27 16:19 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Arnd Bergmann, Steven Rostedt, Andrew Morton,
	Sebastian Reichel, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

On 1/27/22 06:50, Petr Mladek wrote:
> On Wed 2022-01-26 15:02:35, Stephen Brennan wrote:
>> During panic(), if another CPU is writing heavily the kernel log (e.g.
>> via /dev/kmsg), then the panic CPU may livelock writing out its messages
>> to the console. Note when too many messages are dropped during panic and
>> suppress further printk, except from the panic CPU. This could result in
>> some important messages being dropped. However, messages are already
>> being dropped, so this approach at least prevents a livelock.
>>
>> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
>> ---
>>
>> Notes:
>>      v2: Add pr_warn when we suppress printk on non-panic CPU
>>
>>   kernel/printk/printk.c | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index 20b4b71a1a07..18107db118d4 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -93,6 +93,12 @@ EXPORT_SYMBOL_GPL(console_drivers);
>>    */
>>   int __read_mostly suppress_printk;
>>   
>> +/*
>> + * During panic, heavy printk by other CPUs can delay the
>> + * panic and risk deadlock on console resources.
>> + */
>> +int __read_mostly suppress_panic_printk;
>> +
>>   #ifdef CONFIG_LOCKDEP
>>   static struct lockdep_map console_lock_dep_map = {
>>   	.name = "console_lock"
>> @@ -2228,6 +2234,10 @@ asmlinkage int vprintk_emit(int facility, int level,
>>   	if (unlikely(suppress_printk))
>>   		return 0;
>>   
>> +	if (unlikely(suppress_panic_printk) &&
>> +	    atomic_read(&panic_cpu) != raw_smp_processor_id())
>> +		return 0;
>> +
>>   	if (level == LOGLEVEL_SCHED) {
>>   		level = LOGLEVEL_DEFAULT;
>>   		in_sched = true;
>> @@ -2613,6 +2623,7 @@ void console_unlock(void)
>>   {
>>   	static char ext_text[CONSOLE_EXT_LOG_MAX];
>>   	static char text[CONSOLE_LOG_MAX];
>> +	static int panic_console_dropped;
>>   	unsigned long flags;
>>   	bool do_cond_resched, retry;
>>   	struct printk_info info;
>> @@ -2667,6 +2678,11 @@ void console_unlock(void)
>>   		if (console_seq != r.info->seq) {
>>   			console_dropped += r.info->seq - console_seq;
>>   			console_seq = r.info->seq;
>> +			if (panic_in_progress() && panic_console_dropped++ > 10) {
>> +				suppress_panic_printk = 1;
>> +				pr_warn("Too many dropped messages. "
>> +				        "Supress messages on non-panic CPUs to prevent livelock.\n");
> 
> It looks like the message might be printed more times when
> panic_console_dropped++ > 10.
> 
> In theory, no message can be lost after we disable printk on another
> CPUs. But the code might evolve in the future. Let's make it
> more error-proof.
> 
> We could use (panic_console_dropped++ == 10) or pr_warn_once() or
> both.
> 
> I prefer using pr_warn_once() because it looks the most error-proof.
> 
> 
> Nit: printk() has exceptions from the 80 chars/line rule.
>       The message string should be on a single line. It helps
>       to find it using "git grep". I think that even checkpatch.pl
>       handles this correctly.
> 
> 
> With pr_warn_once() and message in single line:
> 
> Reviewed-by: Petr Mladek <pmladek@suse.com>
> 
> Best Regards,
> Petr
> 
> 
> PS: I could fix the two problems when pushing to git. But there
>      is still time to send v3. I have vacation the following week
>      with limited internet access. I am not going to rush it into
>      linux before I leave, ...
>      

Thank you Petr, I will go ahead and resolve things in this patch, and 
the others in a v3 soon. Regardless, no need to rush, please enjoy your 
vacation!

Thanks for all the review and guidance!
Stephen

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 4/4] printk: Drop console_sem during panic
  2022-01-27 15:03     ` Petr Mladek
@ 2022-01-28  5:55       ` Sergey Senozhatsky
  0 siblings, 0 replies; 14+ messages in thread
From: Sergey Senozhatsky @ 2022-01-28  5:55 UTC (permalink / raw)
  To: John Ogness, Petr Mladek
  Cc: Stephen Brennan, Sergey Senozhatsky, Arnd Bergmann,
	Steven Rostedt, Andrew Morton, Sebastian Reichel,
	Andy Shevchenko, Luis Chamberlain, linux-kernel

On (22/01/27 16:03), Petr Mladek wrote:
> On Thu 2022-01-27 10:28:53, John Ogness wrote:
> > On 2022-01-26, Stephen Brennan <stephen.s.brennan@oracle.com> wrote:
> > > --- a/kernel/printk/printk.c
> > > +++ b/kernel/printk/printk.c
> > > @@ -2759,7 +2782,7 @@ void console_unlock(void)
> > >  	 * flush, no worries.
> > >  	 */
> > >  	retry = prb_read_valid(prb, next_seq, NULL);
> > > -	if (retry && console_trylock())
> > > +	if (retry && !abandon_console_lock_in_panic() && console_trylock())
> > 
> > As Sergey suggested [0], I would like to see the call to
> > abandon_console_lock_in_panic() move inside console_trylock(). This will
> > help to avoid the race between NMI CPU halt and the internal sema.lock
> > spinlock.

Thanks John.

> I would prefer if it is done as a followup patch. The code in this
> patch is still needed. It helps when the non-panic CPU is busy
> with pushing many pending messages. Also it is a more conservative
> approach.

No objections. This series fixes issue at hand, so conservative approach
makes sense.

On the other hand, we are at -rc1 and it seems like a very good time to
discuss/look into/work on/etc. solution for the remaining cases/races/etc.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/4] panic: Add panic_in_progress helper
  2022-01-27 16:02     ` Stephen Brennan
@ 2022-01-28  8:24       ` Petr Mladek
  0 siblings, 0 replies; 14+ messages in thread
From: Petr Mladek @ 2022-01-28  8:24 UTC (permalink / raw)
  To: Stephen Brennan
  Cc: Andrew Morton, Arnd Bergmann, Steven Rostedt, Sebastian Reichel,
	Sergey Senozhatsky, John Ogness, Andy Shevchenko,
	Luis Chamberlain, linux-kernel

On Thu 2022-01-27 08:02:19, Stephen Brennan wrote:
> On 1/27/22 06:34, Petr Mladek wrote:
> > On Wed 2022-01-26 15:02:33, Stephen Brennan wrote:
> > > This helper will be used in printk code to avoid deadlocks during
> > > panic().
> > > 
> > > Suggested-by: Petr Mladek <pmladek@suse.com>
> > > Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
> > 
> > Reviewed-by: Petr Mladek <pmladek@suse.com>
> > 
> > Best Regards,
> > Petr
> 
> Hi Petr,
> 
> Thanks for the review, however over the night I received two kernel test
> robot emails. One indicating a new build error caused by this change on m68k
> arch, and the other adding a new warning on riscv. From what I can tell, the
> issues are circular dependencies in #includes. So it may be better to either
> return to the macro, or move this static inline down to
> kernel/printk/printk.c. I think moving it into kernel/printk/printk.c makes
> most sense given that the macro requires the correct #includes anyway.

Yes, I prefer moving the inline down to printk.c. It looks a bit
cleaner than the macro that would not work without another include.

Please, explain this in the commit message.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-01-28  8:24 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-26 23:02 [PATCH v2 0/4] printk: reduce deadlocks during panic Stephen Brennan
2022-01-26 23:02 ` [PATCH v2 1/4] panic: Add panic_in_progress helper Stephen Brennan
2022-01-27 14:34   ` Petr Mladek
2022-01-27 16:02     ` Stephen Brennan
2022-01-28  8:24       ` Petr Mladek
2022-01-26 23:02 ` [PATCH v2 2/4] printk: disable optimistic spin during panic Stephen Brennan
2022-01-26 23:02 ` [PATCH v2 3/4] printk: Avoid livelock with heavy printk " Stephen Brennan
2022-01-27 14:50   ` Petr Mladek
2022-01-27 16:19     ` Stephen Brennan
2022-01-26 23:02 ` [PATCH v2 4/4] printk: Drop console_sem " Stephen Brennan
2022-01-27  9:22   ` John Ogness
2022-01-27 15:03     ` Petr Mladek
2022-01-28  5:55       ` Sergey Senozhatsky
2022-01-27 15:12   ` Petr Mladek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).