linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/4] printk: reduce deadlocks during panic
@ 2022-02-02 17:18 Stephen Brennan
  2022-02-02 17:18 ` [PATCH v4 1/4] printk: Add panic_in_progress helper Stephen Brennan
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Stephen Brennan @ 2022-02-02 17:18 UTC (permalink / raw)
  To: Sergey Senozhatsky, Petr Mladek; +Cc: linux-kernel, Steven Rostedt, John Ogness

When a caller writes heavily to the kernel log (e.g. writing to
/dev/kmsg in a loop) while another panics, there's currently a high
likelihood of a deadlock (see patch 2 for the full description of this
deadlock).

The principle fix is to disable the optimistic spin once panic_cpu is
set, so the panic CPU doesn't spin waiting for a halted CPU to hand over
the console_sem.

However, this exposed us to a livelock situation, where the panic CPU
holds the console_sem, and another CPU could fill up the log buffer
faster than the consoles could drain it, preventing the panic from
progressing and halting the other CPUs. To avoid this, patch 3 adds a
mechanism to suppress printk (from non-panic-CPU) during panic, if we
reach a threshold of dropped messages.

A major goal with all of these patches is to try to decrease the
likelihood that another CPU is holding the console_sem when we halt it
in panic(). This reduces the odds of needing to break locks and
potentially encountering further deadlocks with the console drivers.

To test, I use the following script, kmsg_panic.sh:

    #!/bin/bash
    date
    # 991 chars (based on log buffer size):
    chars="$(printf 'a%.0s' {1..991})"
    while :; do
        echo $chars > /dev/kmsg
    done &
    echo c > /proc/sysrq-trigger &
    date
    exit

I defined a hang as any time the system did not reboot to a login prompt
on the serial console within 60 seconds. Here are the statistics on
hangs using this script, before and after the patch.

before:  776 hangs / 1484 trials - 52.3%
after :    0 hangs /  15k trials -  0.0%

v4: Moved panic_in_progress() out of #define CONFIG_PRINTK
v3:
    Some mild style changes, none of which affect testing (which has run
    continuously and is now over 15 thousand trials without a hang!)
v2:
    Each patch has minor updates from code reviews. I've re-done testing
    and updated the above statistics. Exact changes are in each patch.


Stephen Brennan (4):
  printk: Add panic_in_progress helper
  printk: disable optimistic spin during panic
  printk: Avoid livelock with heavy printk during panic
  printk: Drop console_sem during panic

 kernel/printk/printk.c | 55 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v4 1/4] printk: Add panic_in_progress helper
  2022-02-02 17:18 [PATCH v4 0/4] printk: reduce deadlocks during panic Stephen Brennan
@ 2022-02-02 17:18 ` Stephen Brennan
  2022-02-02 17:18 ` [PATCH v4 2/4] printk: disable optimistic spin during panic Stephen Brennan
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Stephen Brennan @ 2022-02-02 17:18 UTC (permalink / raw)
  To: Sergey Senozhatsky, Petr Mladek
  Cc: linux-kernel, Steven Rostedt, John Ogness, Stephen Brennan

This will be used help avoid deadlocks during panics. Although it would
be better to include this in linux/panic.h, it would require that header
to include linux/atomic.h as well. On some architectures, this results
in a circular dependency as well. So instead add the helper directly to
printk.c.

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---

Notes:
    v4: Move helper outside CONFIG_PRINTK section
    v3: Move the helper into printk.c due to circular include
    v2: Switch from macro to static inline function

 kernel/printk/printk.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 57b132b658e1..e3d986dd49a1 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -256,6 +256,11 @@ static void __up_console_sem(unsigned long ip)
 }
 #define up_console_sem() __up_console_sem(_RET_IP_)
 
+static bool panic_in_progress(void)
+{
+	return unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID);
+}
+
 /*
  * This is used for debugging the mess that is the VT code by
  * keeping track if we have the console semaphore held. It's
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 2/4] printk: disable optimistic spin during panic
  2022-02-02 17:18 [PATCH v4 0/4] printk: reduce deadlocks during panic Stephen Brennan
  2022-02-02 17:18 ` [PATCH v4 1/4] printk: Add panic_in_progress helper Stephen Brennan
@ 2022-02-02 17:18 ` Stephen Brennan
  2022-02-02 17:18 ` [PATCH v4 3/4] printk: Avoid livelock with heavy printk " Stephen Brennan
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Stephen Brennan @ 2022-02-02 17:18 UTC (permalink / raw)
  To: Sergey Senozhatsky, Petr Mladek
  Cc: linux-kernel, Steven Rostedt, John Ogness, Stephen Brennan

A CPU executing with console lock spinning enabled might be halted
during a panic. Before the panicking CPU calls console_flush_on_panic(),
it may call console_trylock(), which attempts to optimistically spin,
deadlocking the panic CPU:

CPU 0 (panic CPU)             CPU 1
-----------------             ------
                              printk() {
                                vprintk_func() {
                                  vprintk_default() {
                                    vprintk_emit() {
                                      console_unlock() {
                                        console_lock_spinning_enable();
                                        ... printing to console ...
panic() {
  crash_smp_send_stop() {
    NMI  -------------------> HALT
  }
  atomic_notifier_call_chain() {
    printk() {
      ...
      console_trylock_spinnning() {
        // optimistic spin infinitely

This hang during panic can be induced when a kdump kernel is loaded, and
crash_kexec_post_notifiers=1 is present on the kernel command line. The
following script which concurrently writes to /dev/kmsg, and triggers a
panic, can result in this hang:

    #!/bin/bash
    date
    # 991 chars (based on log buffer size):
    chars="$(printf 'a%.0s' {1..991})"
    while :; do
        echo $chars > /dev/kmsg
    done &
    echo c > /proc/sysrq-trigger &
    date
    exit

To avoid this deadlock, ensure that console_trylock_spinning() does not
allow spinning once a panic has begun.

Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes")

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
---
 kernel/printk/printk.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index e3d986dd49a1..dfac84876e7a 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1848,6 +1848,16 @@ static int console_trylock_spinning(void)
 	if (console_trylock())
 		return 1;
 
+	/*
+	 * It's unsafe to spin once a panic has begun. If we are the
+	 * panic CPU, we may have already halted the owner of the
+	 * console_sem. If we are not the panic CPU, then we should
+	 * avoid taking console_sem, so the panic CPU has a better
+	 * chance of cleanly acquiring it later.
+	 */
+	if (panic_in_progress())
+		return 0;
+
 	printk_safe_enter_irqsave(flags);
 
 	raw_spin_lock(&console_owner_lock);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 3/4] printk: Avoid livelock with heavy printk during panic
  2022-02-02 17:18 [PATCH v4 0/4] printk: reduce deadlocks during panic Stephen Brennan
  2022-02-02 17:18 ` [PATCH v4 1/4] printk: Add panic_in_progress helper Stephen Brennan
  2022-02-02 17:18 ` [PATCH v4 2/4] printk: disable optimistic spin during panic Stephen Brennan
@ 2022-02-02 17:18 ` Stephen Brennan
  2022-02-02 17:18 ` [PATCH v4 4/4] printk: Drop console_sem " Stephen Brennan
  2022-02-10  9:22 ` [PATCH v4 0/4] printk: reduce deadlocks " Petr Mladek
  4 siblings, 0 replies; 10+ messages in thread
From: Stephen Brennan @ 2022-02-02 17:18 UTC (permalink / raw)
  To: Sergey Senozhatsky, Petr Mladek
  Cc: linux-kernel, Steven Rostedt, John Ogness, Stephen Brennan

During panic(), if another CPU is writing heavily the kernel log (e.g.
via /dev/kmsg), then the panic CPU may livelock writing out its messages
to the console. Note when too many messages are dropped during panic and
suppress further printk, except from the panic CPU. This could result in
some important messages being dropped. However, messages are already
being dropped, so this approach at least prevents a livelock.

Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---

Notes:
    v3: Use pr_warn_once, and don't break the message line
    v2: Add pr_warn when we suppress printk on non-panic CPU

 kernel/printk/printk.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index dfac84876e7a..736f00d33087 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -93,6 +93,12 @@ EXPORT_SYMBOL_GPL(console_drivers);
  */
 int __read_mostly suppress_printk;
 
+/*
+ * During panic, heavy printk by other CPUs can delay the
+ * panic and risk deadlock on console resources.
+ */
+int __read_mostly suppress_panic_printk;
+
 #ifdef CONFIG_LOCKDEP
 static struct lockdep_map console_lock_dep_map = {
 	.name = "console_lock"
@@ -2233,6 +2239,10 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (unlikely(suppress_printk))
 		return 0;
 
+	if (unlikely(suppress_panic_printk) &&
+	    atomic_read(&panic_cpu) != raw_smp_processor_id())
+		return 0;
+
 	if (level == LOGLEVEL_SCHED) {
 		level = LOGLEVEL_DEFAULT;
 		in_sched = true;
@@ -2618,6 +2628,7 @@ void console_unlock(void)
 {
 	static char ext_text[CONSOLE_EXT_LOG_MAX];
 	static char text[CONSOLE_LOG_MAX];
+	static int panic_console_dropped;
 	unsigned long flags;
 	bool do_cond_resched, retry;
 	struct printk_info info;
@@ -2672,6 +2683,10 @@ void console_unlock(void)
 		if (console_seq != r.info->seq) {
 			console_dropped += r.info->seq - console_seq;
 			console_seq = r.info->seq;
+			if (panic_in_progress() && panic_console_dropped++ > 10) {
+				suppress_panic_printk = 1;
+				pr_warn_once("Too many dropped messages. Suppress messages on non-panic CPUs to prevent livelock.\n");
+			}
 		}
 
 		if (suppress_message_printing(r.info->level)) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 4/4] printk: Drop console_sem during panic
  2022-02-02 17:18 [PATCH v4 0/4] printk: reduce deadlocks during panic Stephen Brennan
                   ` (2 preceding siblings ...)
  2022-02-02 17:18 ` [PATCH v4 3/4] printk: Avoid livelock with heavy printk " Stephen Brennan
@ 2022-02-02 17:18 ` Stephen Brennan
  2022-02-10  9:22 ` [PATCH v4 0/4] printk: reduce deadlocks " Petr Mladek
  4 siblings, 0 replies; 10+ messages in thread
From: Stephen Brennan @ 2022-02-02 17:18 UTC (permalink / raw)
  To: Sergey Senozhatsky, Petr Mladek
  Cc: linux-kernel, Steven Rostedt, John Ogness, Stephen Brennan

If another CPU is in panic, we are about to be halted. Try to gracefully
abandon the console_sem, leaving it free for the panic CPU to grab.

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---

Notes:
    v2: Factor check out to a helper, and check at the end of
    console_unlock() to prevent retry as well.

 kernel/printk/printk.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 736f00d33087..a174c4b2d156 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2598,6 +2598,25 @@ static int have_callable_console(void)
 	return 0;
 }
 
+/*
+ * Return true when this CPU should unlock console_sem without pushing all
+ * messages to the console. This reduces the chance that the console is
+ * locked when the panic CPU tries to use it.
+ */
+static bool abandon_console_lock_in_panic(void)
+{
+	if (!panic_in_progress())
+		return false;
+
+	/*
+	 * We can use raw_smp_processor_id() here because it is impossible for
+	 * the task to be migrated to the panic_cpu, or away from it. If
+	 * panic_cpu has already been set, and we're not currently executing on
+	 * that CPU, then we never will be.
+	 */
+	return atomic_read(&panic_cpu) != raw_smp_processor_id();
+}
+
 /*
  * Can we actually use the console at this time on this cpu?
  *
@@ -2746,6 +2765,10 @@ void console_unlock(void)
 		if (handover)
 			return;
 
+		/* Allow panic_cpu to take over the consoles safely */
+		if (abandon_console_lock_in_panic())
+			break;
+
 		if (do_cond_resched)
 			cond_resched();
 	}
@@ -2763,7 +2786,7 @@ void console_unlock(void)
 	 * flush, no worries.
 	 */
 	retry = prb_read_valid(prb, next_seq, NULL);
-	if (retry && console_trylock())
+	if (retry && !abandon_console_lock_in_panic() && console_trylock())
 		goto again;
 }
 EXPORT_SYMBOL(console_unlock);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/4] printk: reduce deadlocks during panic
  2022-02-02 17:18 [PATCH v4 0/4] printk: reduce deadlocks during panic Stephen Brennan
                   ` (3 preceding siblings ...)
  2022-02-02 17:18 ` [PATCH v4 4/4] printk: Drop console_sem " Stephen Brennan
@ 2022-02-10  9:22 ` Petr Mladek
  2022-02-10 20:06   ` Stephen Brennan
  2022-02-14  2:27   ` Sergey Senozhatsky
  4 siblings, 2 replies; 10+ messages in thread
From: Petr Mladek @ 2022-02-10  9:22 UTC (permalink / raw)
  To: Stephen Brennan
  Cc: Sergey Senozhatsky, linux-kernel, Steven Rostedt, John Ogness

On Wed 2022-02-02 09:18:17, Stephen Brennan wrote:
> When a caller writes heavily to the kernel log (e.g. writing to
> /dev/kmsg in a loop) while another panics, there's currently a high
> likelihood of a deadlock (see patch 2 for the full description of this
> deadlock).
> 
> The principle fix is to disable the optimistic spin once panic_cpu is
> set, so the panic CPU doesn't spin waiting for a halted CPU to hand over
> the console_sem.
> 
> However, this exposed us to a livelock situation, where the panic CPU
> holds the console_sem, and another CPU could fill up the log buffer
> faster than the consoles could drain it, preventing the panic from
> progressing and halting the other CPUs. To avoid this, patch 3 adds a
> mechanism to suppress printk (from non-panic-CPU) during panic, if we
> reach a threshold of dropped messages.
> 
> A major goal with all of these patches is to try to decrease the
> likelihood that another CPU is holding the console_sem when we halt it
> in panic(). This reduces the odds of needing to break locks and
> potentially encountering further deadlocks with the console drivers.
> 
> To test, I use the following script, kmsg_panic.sh:
> 
>     #!/bin/bash
>     date
>     # 991 chars (based on log buffer size):
>     chars="$(printf 'a%.0s' {1..991})"
>     while :; do
>         echo $chars > /dev/kmsg
>     done &
>     echo c > /proc/sysrq-trigger &
>     date
>     exit
> 
> I defined a hang as any time the system did not reboot to a login prompt
> on the serial console within 60 seconds. Here are the statistics on
> hangs using this script, before and after the patch.
> 
> before:  776 hangs / 1484 trials - 52.3%
> after :    0 hangs /  15k trials -  0.0%
> 
> Stephen Brennan (4):
>   printk: Add panic_in_progress helper
>   printk: disable optimistic spin during panic
>   printk: Avoid livelock with heavy printk during panic
>   printk: Drop console_sem during panic
> 
>  kernel/printk/printk.c | 55 +++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 54 insertions(+), 1 deletion(-)

For the entire patchset:

Reviewed-by: Petr Mladek <pmladek@suse.com>

It looks ready for linux-next from my POV. I am going to push it early
next week unless anyone complains in the meantime.

Best Regards,
Petr Mladek

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/4] printk: reduce deadlocks during panic
  2022-02-10  9:22 ` [PATCH v4 0/4] printk: reduce deadlocks " Petr Mladek
@ 2022-02-10 20:06   ` Stephen Brennan
  2022-02-14 13:54     ` Petr Mladek
  2022-02-14  2:27   ` Sergey Senozhatsky
  1 sibling, 1 reply; 10+ messages in thread
From: Stephen Brennan @ 2022-02-10 20:06 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Sergey Senozhatsky, linux-kernel, Steven Rostedt, John Ogness

On 2/10/22 01:22, Petr Mladek wrote:
> On Wed 2022-02-02 09:18:17, Stephen Brennan wrote:
>> When a caller writes heavily to the kernel log (e.g. writing to
>> /dev/kmsg in a loop) while another panics, there's currently a high
>> likelihood of a deadlock (see patch 2 for the full description of this
>> deadlock).
>>
>> The principle fix is to disable the optimistic spin once panic_cpu is
>> set, so the panic CPU doesn't spin waiting for a halted CPU to hand over
>> the console_sem.
>>
>> However, this exposed us to a livelock situation, where the panic CPU
>> holds the console_sem, and another CPU could fill up the log buffer
>> faster than the consoles could drain it, preventing the panic from
>> progressing and halting the other CPUs. To avoid this, patch 3 adds a
>> mechanism to suppress printk (from non-panic-CPU) during panic, if we
>> reach a threshold of dropped messages.
>>
>> A major goal with all of these patches is to try to decrease the
>> likelihood that another CPU is holding the console_sem when we halt it
>> in panic(). This reduces the odds of needing to break locks and
>> potentially encountering further deadlocks with the console drivers.
>>
>> To test, I use the following script, kmsg_panic.sh:
>>
>>      #!/bin/bash
>>      date
>>      # 991 chars (based on log buffer size):
>>      chars="$(printf 'a%.0s' {1..991})"
>>      while :; do
>>          echo $chars > /dev/kmsg
>>      done &
>>      echo c > /proc/sysrq-trigger &
>>      date
>>      exit
>>
>> I defined a hang as any time the system did not reboot to a login prompt
>> on the serial console within 60 seconds. Here are the statistics on
>> hangs using this script, before and after the patch.
>>
>> before:  776 hangs / 1484 trials - 52.3%
>> after :    0 hangs /  15k trials -  0.0%
>>
>> Stephen Brennan (4):
>>    printk: Add panic_in_progress helper
>>    printk: disable optimistic spin during panic
>>    printk: Avoid livelock with heavy printk during panic
>>    printk: Drop console_sem during panic
>>
>>   kernel/printk/printk.c | 55 +++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 54 insertions(+), 1 deletion(-)
> 
> For the entire patchset:
> 
> Reviewed-by: Petr Mladek <pmladek@suse.com>
> 
> It looks ready for linux-next from my POV. I am going to push it early
> next week unless anyone complains in the meantime.
Thank you Petr! It occurs to me that some of this could be 
stable-worthy, depending on your feelings on it. Patches 1-3 resolve 
real bugs on customer systems, and they'd apply back a decent way. 1-2 
apply all the way back to 4.14, and 3 would apply with some minor 
changes. I suppose the question is whether they are simple enough. Patch 
4 is useful but I don't have a real reproducer for a bug it fixes, so I 
wouldn't say it's stable worthy.

Of course we have the logbuf_lock in 5.10 and previous, and if a CPU is 
halted holding that lock, then printk hangs even before the optimistic 
spinning. I have patches which reinitialize those locks after the CPUs 
are halted if necessary. I think they are reasonable for stable - printk 
is guaranteed to hang without doing this, so in the worst case you trade 
a hang during a panic, with some other sort of printk log buffer bug 
during a panic. But in the common case, you eliminate the hang. I can 
send that patch to linux-stable as well.

What do you think about these patches and stable?

Thanks again,
Stephen

> 
> Best Regards,
> Petr Mladek

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/4] printk: reduce deadlocks during panic
  2022-02-10  9:22 ` [PATCH v4 0/4] printk: reduce deadlocks " Petr Mladek
  2022-02-10 20:06   ` Stephen Brennan
@ 2022-02-14  2:27   ` Sergey Senozhatsky
  1 sibling, 0 replies; 10+ messages in thread
From: Sergey Senozhatsky @ 2022-02-14  2:27 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Stephen Brennan, Sergey Senozhatsky, linux-kernel,
	Steven Rostedt, John Ogness

On (22/02/10 10:22), Petr Mladek wrote:
> 
> For the entire patchset:
> 
> Reviewed-by: Petr Mladek <pmladek@suse.com>

FWIW
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/4] printk: reduce deadlocks during panic
  2022-02-10 20:06   ` Stephen Brennan
@ 2022-02-14 13:54     ` Petr Mladek
  2022-02-15 23:24       ` Stephen Brennan
  0 siblings, 1 reply; 10+ messages in thread
From: Petr Mladek @ 2022-02-14 13:54 UTC (permalink / raw)
  To: Stephen Brennan
  Cc: Sergey Senozhatsky, linux-kernel, Steven Rostedt, John Ogness

On Thu 2022-02-10 12:06:44, Stephen Brennan wrote:
> On 2/10/22 01:22, Petr Mladek wrote:
> > On Wed 2022-02-02 09:18:17, Stephen Brennan wrote:
> > > When a caller writes heavily to the kernel log (e.g. writing to
> > > /dev/kmsg in a loop) while another panics, there's currently a high
> > > likelihood of a deadlock (see patch 2 for the full description of this
> > > deadlock).
> > > 
> > > The principle fix is to disable the optimistic spin once panic_cpu is
> > > set, so the panic CPU doesn't spin waiting for a halted CPU to hand over
> > > the console_sem.
> > > 
> > > However, this exposed us to a livelock situation, where the panic CPU
> > > holds the console_sem, and another CPU could fill up the log buffer
> > > faster than the consoles could drain it, preventing the panic from
> > > progressing and halting the other CPUs. To avoid this, patch 3 adds a
> > > mechanism to suppress printk (from non-panic-CPU) during panic, if we
> > > reach a threshold of dropped messages.
> > > 
> > > A major goal with all of these patches is to try to decrease the
> > > likelihood that another CPU is holding the console_sem when we halt it
> > > in panic(). This reduces the odds of needing to break locks and
> > > potentially encountering further deadlocks with the console drivers.
> > > 
> > > To test, I use the following script, kmsg_panic.sh:
> > > 
> > >      #!/bin/bash
> > >      date
> > >      # 991 chars (based on log buffer size):
> > >      chars="$(printf 'a%.0s' {1..991})"
> > >      while :; do
> > >          echo $chars > /dev/kmsg
> > >      done &
> > >      echo c > /proc/sysrq-trigger &
> > >      date
> > >      exit
> > > 
> > > I defined a hang as any time the system did not reboot to a login prompt
> > > on the serial console within 60 seconds. Here are the statistics on
> > > hangs using this script, before and after the patch.
> > > 
> > > before:  776 hangs / 1484 trials - 52.3%
> > > after :    0 hangs /  15k trials -  0.0%
> > > 
> > > Stephen Brennan (4):
> > >    printk: Add panic_in_progress helper
> > >    printk: disable optimistic spin during panic
> > >    printk: Avoid livelock with heavy printk during panic
> > >    printk: Drop console_sem during panic
> > > 
> > >   kernel/printk/printk.c | 55 +++++++++++++++++++++++++++++++++++++++++-
> > >   1 file changed, 54 insertions(+), 1 deletion(-)
> > 
> > For the entire patchset:
> > 
> > Reviewed-by: Petr Mladek <pmladek@suse.com>
> > 
> > It looks ready for linux-next from my POV. I am going to push it early
> > next week unless anyone complains in the meantime.

The patchset is committed in printk/linux.git, branch for-5.18-panic-deadlocks.

> Thank you Petr! It occurs to me that some of this could be stable-worthy,
> depending on your feelings on it. Patches 1-3 resolve real bugs on customer
> systems, and they'd apply back a decent way. 1-2 apply all the way back to
> 4.14, and 3 would apply with some minor changes. I suppose the question is
> whether they are simple enough. Patch 4 is useful but I don't have a real
> reproducer for a bug it fixes, so I wouldn't say it's stable worthy.

Good question. If you saw these deadlocks on customer systems in
the real life then it might be worth it.

I newer saw them. But they hard to debug and report. Also they are
visible only when CPUs are stopped by NMI. And the default
smp_send_stop() tries to stop CPUs using normal IRQ first.

Anyway, the patches 1,2,4 are pretty straightforward and should be
safe. Feel free to send them to stable.

3rd patch is a heuristic. It tries to prevent livelock and the cost
is a possible loss of information. I am not 100% sure that it will
do the right thing in all situations. I would wait one or two release
cycles before we backport it to older stable releases.


> Of course we have the logbuf_lock in 5.10 and previous, and if a CPU is
> halted holding that lock, then printk hangs even before the optimistic
> spinning. I have patches which reinitialize those locks after the CPUs are
> halted if necessary. I think they are reasonable for stable - printk is
> guaranteed to hang without doing this, so in the worst case you trade a hang
> during a panic, with some other sort of printk log buffer bug during a
> panic. But in the common case, you eliminate the hang. I can send that patch
> to linux-stable as well.

The main problem is that the locks can be safely re-initialized only
when the other CPUs were stopped using NMI. Otherwise, there is
a risk of double unlock. Such a patch would need to be arch-dependent.

Also stable people do not like much solutions that were not used
in the mainline. So, it might be a waste of time.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 0/4] printk: reduce deadlocks during panic
  2022-02-14 13:54     ` Petr Mladek
@ 2022-02-15 23:24       ` Stephen Brennan
  0 siblings, 0 replies; 10+ messages in thread
From: Stephen Brennan @ 2022-02-15 23:24 UTC (permalink / raw)
  To: Petr Mladek; +Cc: Sergey Senozhatsky, linux-kernel, Steven Rostedt, John Ogness

On 2/14/22 05:54, Petr Mladek wrote:
> On Thu 2022-02-10 12:06:44, Stephen Brennan wrote:
>> On 2/10/22 01:22, Petr Mladek wrote:
>>> On Wed 2022-02-02 09:18:17, Stephen Brennan wrote:
>>>> When a caller writes heavily to the kernel log (e.g. writing to
>>>> /dev/kmsg in a loop) while another panics, there's currently a high
>>>> likelihood of a deadlock (see patch 2 for the full description of this
>>>> deadlock).
>>>>
>>>> The principle fix is to disable the optimistic spin once panic_cpu is
>>>> set, so the panic CPU doesn't spin waiting for a halted CPU to hand over
>>>> the console_sem.
>>>>
>>>> However, this exposed us to a livelock situation, where the panic CPU
>>>> holds the console_sem, and another CPU could fill up the log buffer
>>>> faster than the consoles could drain it, preventing the panic from
>>>> progressing and halting the other CPUs. To avoid this, patch 3 adds a
>>>> mechanism to suppress printk (from non-panic-CPU) during panic, if we
>>>> reach a threshold of dropped messages.
>>>>
>>>> A major goal with all of these patches is to try to decrease the
>>>> likelihood that another CPU is holding the console_sem when we halt it
>>>> in panic(). This reduces the odds of needing to break locks and
>>>> potentially encountering further deadlocks with the console drivers.
>>>>
>>>> To test, I use the following script, kmsg_panic.sh:
>>>>
>>>>       #!/bin/bash
>>>>       date
>>>>       # 991 chars (based on log buffer size):
>>>>       chars="$(printf 'a%.0s' {1..991})"
>>>>       while :; do
>>>>           echo $chars > /dev/kmsg
>>>>       done &
>>>>       echo c > /proc/sysrq-trigger &
>>>>       date
>>>>       exit
>>>>
>>>> I defined a hang as any time the system did not reboot to a login prompt
>>>> on the serial console within 60 seconds. Here are the statistics on
>>>> hangs using this script, before and after the patch.
>>>>
>>>> before:  776 hangs / 1484 trials - 52.3%
>>>> after :    0 hangs /  15k trials -  0.0%
>>>>
>>>> Stephen Brennan (4):
>>>>     printk: Add panic_in_progress helper
>>>>     printk: disable optimistic spin during panic
>>>>     printk: Avoid livelock with heavy printk during panic
>>>>     printk: Drop console_sem during panic
>>>>
>>>>    kernel/printk/printk.c | 55 +++++++++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 54 insertions(+), 1 deletion(-)
>>>
>>> For the entire patchset:
>>>
>>> Reviewed-by: Petr Mladek <pmladek@suse.com>
>>>
>>> It looks ready for linux-next from my POV. I am going to push it early
>>> next week unless anyone complains in the meantime.
> 
> The patchset is committed in printk/linux.git, branch for-5.18-panic-deadlocks.

Thank you!

> 
>> Thank you Petr! It occurs to me that some of this could be stable-worthy,
>> depending on your feelings on it. Patches 1-3 resolve real bugs on customer
>> systems, and they'd apply back a decent way. 1-2 apply all the way back to
>> 4.14, and 3 would apply with some minor changes. I suppose the question is
>> whether they are simple enough. Patch 4 is useful but I don't have a real
>> reproducer for a bug it fixes, so I wouldn't say it's stable worthy.
> 
> Good question. If you saw these deadlocks on customer systems in
> the real life then it might be worth it.
> 
> I newer saw them. But they hard to debug and report. Also they are
> visible only when CPUs are stopped by NMI. And the default
> smp_send_stop() tries to stop CPUs using normal IRQ first.

Some of our default configurations like to ship with 
crash_kexec_post_notifiers=1, so I see the aggressive 
crash_smp_send_stop() code path a lot internally.

And interestingly, Hyper-V enables crash_kexec_post_notifiers in certain 
cases, without so much as a peep in the kernel log, even if kdump is not 
enabled... So folks running on Azure or otherwise running Hyper-V guests 
would be exposed to this. I actually just saw some patches related to 
this which you reviewed, so I guess you're aware of that :)

https://lore.kernel.org/lkml/YgDZX4PYwhrA1+Ct@MiWiFi-R3L-srv/

As somebody who mainly browses subsystem-focused mailing lists I didn't 
see this at all, as these printk/panic related patches only seem to go 
to LKML. I guess I need to check out the new lore+lei system so I can 
keep track of relevant work.

> 
> Anyway, the patches 1,2,4 are pretty straightforward and should be
> safe. Feel free to send them to stable.

Very reasonable. The customer systems experiencing the issue would be 
resolved by patches 1-2 alone. I did not have a customer experiencing 
any sort of livelock (and it would be a pretty ridiculous situation, 
most printk dies down quickly). I see the value of patch 4 for 
architectures without these NMI so I can send it too.

> 
> 3rd patch is a heuristic. It tries to prevent livelock and the cost
> is a possible loss of information. I am not 100% sure that it will
> do the right thing in all situations. I would wait one or two release
> cycles before we backport it to older stable releases.

Fair enough!

> 
>> Of course we have the logbuf_lock in 5.10 and previous, and if a CPU is
>> halted holding that lock, then printk hangs even before the optimistic
>> spinning. I have patches which reinitialize those locks after the CPUs are
>> halted if necessary. I think they are reasonable for stable - printk is
>> guaranteed to hang without doing this, so in the worst case you trade a hang
>> during a panic, with some other sort of printk log buffer bug during a
>> panic. But in the common case, you eliminate the hang. I can send that patch
>> to linux-stable as well.
> 
> The main problem is that the locks can be safely re-initialized only
> when the other CPUs were stopped using NMI. Otherwise, there is
> a risk of double unlock. Such a patch would need to be arch-dependent.

That's a very good point, thank you for mentioning it.

> 
> Also stable people do not like much solutions that were not used
> in the mainline. So, it might be a waste of time.

True. I may try my hand at it simply because I would much rather see 
work be done for all via -stable rather than in vendor-specific trees.

Thanks,
Stephen

> 
> Best Regards,
> Petr

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-02-15 23:25 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-02 17:18 [PATCH v4 0/4] printk: reduce deadlocks during panic Stephen Brennan
2022-02-02 17:18 ` [PATCH v4 1/4] printk: Add panic_in_progress helper Stephen Brennan
2022-02-02 17:18 ` [PATCH v4 2/4] printk: disable optimistic spin during panic Stephen Brennan
2022-02-02 17:18 ` [PATCH v4 3/4] printk: Avoid livelock with heavy printk " Stephen Brennan
2022-02-02 17:18 ` [PATCH v4 4/4] printk: Drop console_sem " Stephen Brennan
2022-02-10  9:22 ` [PATCH v4 0/4] printk: reduce deadlocks " Petr Mladek
2022-02-10 20:06   ` Stephen Brennan
2022-02-14 13:54     ` Petr Mladek
2022-02-15 23:24       ` Stephen Brennan
2022-02-14  2:27   ` Sergey Senozhatsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).