* [PATCH v2] printk: fix deadlock when kernel panic
@ 2021-02-06  5:41 Muchun Song
  2021-02-08  6:38 ` Sergey Senozhatsky
  2021-02-09  9:19 ` Petr Mladek
  0 siblings, 2 replies; 10+ messages in thread
From: Muchun Song @ 2021-02-06  5:41 UTC (permalink / raw)
  To: pmladek, sergey.senozhatsky, rostedt, john.ogness, akpm
  Cc: linux-kernel, Muchun Song

We found a deadlock bug on our server when the kernel panicked. It can
be described by the following diagram.

CPU0:                                         CPU1:
panic                                         rcu_dump_cpu_stacks
  kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
    register_nmi_handler(crash_nmi_callback)      printk_safe_flush
                                                    __printk_safe_flush
                                                      raw_spin_lock_irqsave(&read_lock)
    // send NMI to other processors
    apic_send_IPI_allbutself(NMI_VECTOR)
                                                        // NMI interrupt, dead loop
                                                        crash_nmi_callback
  printk_safe_flush_on_panic
    printk_safe_flush
      __printk_safe_flush
        // deadlock
        raw_spin_lock_irqsave(&read_lock)

register_nmi_handler() can be called from __crash_kexec() or
crash_smp_send_stop() on x86-64. Because CPU1 is interrupted by the
NMI while holding read_lock, and crash_nmi_callback() never returns,
CPU0 can deadlock when printk_safe_flush_on_panic() is called.

When we hold the read_lock and are then interrupted by an NMI whose
handler calls nmi_panic(), it can also lead to a deadlock, this time
on a single CPU.
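
For example, on a single CPU (an illustrative trace in the same style
as the diagram above, not taken from a real crash):

CPU0:
__printk_safe_flush
  raw_spin_lock_irqsave(&read_lock)
    // NMI arrives while read_lock is held
    nmi_panic
      panic
        printk_safe_flush_on_panic
          printk_safe_flush
            __printk_safe_flush
              // deadlock
              raw_spin_lock_irqsave(&read_lock)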

In order to fix it, we make read_lock global and rename it to
safe_read_lock. We handle safe_read_lock in printk_safe_flush_on_panic()
the same way we handle logbuf_lock there.

Fixes: cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
v2:
 - handle read_lock the same way as we handle logbuf_lock there.

 Thanks Petr.

 kernel/printk/printk_safe.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index a0e6f746de6c..2e9e3ed7d63e 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -45,6 +45,8 @@ struct printk_safe_seq_buf {
 static DEFINE_PER_CPU(struct printk_safe_seq_buf, safe_print_seq);
 static DEFINE_PER_CPU(int, printk_context);
 
+static DEFINE_RAW_SPINLOCK(safe_read_lock);
+
 #ifdef CONFIG_PRINTK_NMI
 static DEFINE_PER_CPU(struct printk_safe_seq_buf, nmi_print_seq);
 #endif
@@ -180,8 +182,6 @@ static void report_message_lost(struct printk_safe_seq_buf *s)
  */
 static void __printk_safe_flush(struct irq_work *work)
 {
-	static raw_spinlock_t read_lock =
-		__RAW_SPIN_LOCK_INITIALIZER(read_lock);
 	struct printk_safe_seq_buf *s =
 		container_of(work, struct printk_safe_seq_buf, work);
 	unsigned long flags;
@@ -195,7 +195,7 @@ static void __printk_safe_flush(struct irq_work *work)
 	 * different CPUs. This is especially important when printing
 	 * a backtrace.
 	 */
-	raw_spin_lock_irqsave(&read_lock, flags);
+	raw_spin_lock_irqsave(&safe_read_lock, flags);
 
 	i = 0;
 more:
@@ -232,7 +232,7 @@ static void __printk_safe_flush(struct irq_work *work)
 
 out:
 	report_message_lost(s);
-	raw_spin_unlock_irqrestore(&read_lock, flags);
+	raw_spin_unlock_irqrestore(&safe_read_lock, flags);
 }
 
 /**
@@ -278,6 +278,14 @@ void printk_safe_flush_on_panic(void)
 		raw_spin_lock_init(&logbuf_lock);
 	}
 
+	if (raw_spin_is_locked(&safe_read_lock)) {
+		if (num_online_cpus() > 1)
+			return;
+
+		debug_locks_off();
+		raw_spin_lock_init(&safe_read_lock);
+	}
+
 	printk_safe_flush();
 }
 
-- 
2.11.0



* Re: [PATCH v2] printk: fix deadlock when kernel panic
  2021-02-06  5:41 [PATCH v2] printk: fix deadlock when kernel panic Muchun Song
@ 2021-02-08  6:38 ` Sergey Senozhatsky
  2021-02-08  8:49   ` [External] " Muchun Song
  2021-02-09  9:19 ` Petr Mladek
  1 sibling, 1 reply; 10+ messages in thread
From: Sergey Senozhatsky @ 2021-02-08  6:38 UTC (permalink / raw)
  To: Muchun Song
  Cc: pmladek, sergey.senozhatsky, rostedt, john.ogness, akpm, linux-kernel

On (21/02/06 13:41), Muchun Song wrote:
> We found a deadlock bug on our server when the kernel panicked. It can
> be described by the following diagram.
> 
> CPU0:                                         CPU1:
> panic                                         rcu_dump_cpu_stacks
>   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
>     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
>                                                     __printk_safe_flush
>                                                       raw_spin_lock_irqsave(&read_lock)
>     // send NMI to other processors
>     apic_send_IPI_allbutself(NMI_VECTOR)
>                                                         // NMI interrupt, dead loop
>                                                         crash_nmi_callback

At what point does this decrement num_online_cpus()? Any chance that
panic CPU can apic_send_IPI_allbutself() and printk_safe_flush_on_panic()
before num_online_cpus() becomes 1?

>   printk_safe_flush_on_panic
>     printk_safe_flush
>       __printk_safe_flush
>         // deadlock
>         raw_spin_lock_irqsave(&read_lock)

	-ss


* Re: [External] Re: [PATCH v2] printk: fix deadlock when kernel panic
  2021-02-08  6:38 ` Sergey Senozhatsky
@ 2021-02-08  8:49   ` Muchun Song
  2021-02-08 13:12     ` Sergey Senozhatsky
  0 siblings, 1 reply; 10+ messages in thread
From: Muchun Song @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Steven Rostedt, john.ogness, Andrew Morton, LKML

On Mon, Feb 8, 2021 at 2:38 PM Sergey Senozhatsky
<sergey.senozhatsky@gmail.com> wrote:
>
> On (21/02/06 13:41), Muchun Song wrote:
> > We found a deadlock bug on our server when the kernel panicked. It can
> > be described by the following diagram.
> >
> > CPU0:                                         CPU1:
> > panic                                         rcu_dump_cpu_stacks
> >   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
> >     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
> >                                                     __printk_safe_flush
> >                                                       raw_spin_lock_irqsave(&read_lock)
> >     // send NMI to other processors
> >     apic_send_IPI_allbutself(NMI_VECTOR)
> >                                                         // NMI interrupt, dead loop
> >                                                         crash_nmi_callback
>
> At what point does this decrement num_online_cpus()? Any chance that
> panic CPU can apic_send_IPI_allbutself() and printk_safe_flush_on_panic()
> before num_online_cpus() becomes 1?

I took a closer look at the code. IIUC, it seems that there is no
point at which num_online_cpus() is decreased.
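
For reference, crash_nmi_callback() on x86 just parks the CPU and never
marks it offline, so num_online_cpus() stays unchanged. A simplified
sketch of the handler (based on arch/x86/kernel/reboot.c, trimmed for
illustration; not the exact upstream code):

static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
{
	/* ... save this CPU's registers for the crash dump ... */

	atomic_dec(&waiting_for_crash_ipi);

	/*
	 * Park the CPU forever. It is never removed from
	 * cpu_online_mask, so num_online_cpus() does not decrease.
	 */
	halt();
	for (;;)
		cpu_relax();

	return NMI_HANDLED;	/* never reached */
}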

>
> >   printk_safe_flush_on_panic
> >     printk_safe_flush
> >       __printk_safe_flush
> >         // deadlock
> >         raw_spin_lock_irqsave(&read_lock)
>
>         -ss


* Re: [External] Re: [PATCH v2] printk: fix deadlock when kernel panic
  2021-02-08  8:49   ` [External] " Muchun Song
@ 2021-02-08 13:12     ` Sergey Senozhatsky
  2021-02-08 15:40       ` Muchun Song
  0 siblings, 1 reply; 10+ messages in thread
From: Sergey Senozhatsky @ 2021-02-08 13:12 UTC (permalink / raw)
  To: Muchun Song
  Cc: Sergey Senozhatsky, Petr Mladek, Steven Rostedt, john.ogness,
	Andrew Morton, LKML

On (21/02/08 16:49), Muchun Song wrote:
> On Mon, Feb 8, 2021 at 2:38 PM Sergey Senozhatsky
> <sergey.senozhatsky@gmail.com> wrote:
> >
> > On (21/02/06 13:41), Muchun Song wrote:
> > > We found a deadlock bug on our server when the kernel panicked. It can
> > > be described by the following diagram.
> > >
> > > CPU0:                                         CPU1:
> > > panic                                         rcu_dump_cpu_stacks
> > >   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
> > >     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
> > >                                                     __printk_safe_flush
> > >                                                       raw_spin_lock_irqsave(&read_lock)
> > >     // send NMI to other processors
> > >     apic_send_IPI_allbutself(NMI_VECTOR)
> > >                                                         // NMI interrupt, dead loop
> > >                                                         crash_nmi_callback
> >
> > At what point does this decrement num_online_cpus()? Any chance that
> > panic CPU can apic_send_IPI_allbutself() and printk_safe_flush_on_panic()
> > before num_online_cpus() becomes 1?
>
> > I took a closer look at the code. IIUC, it seems that there is no
> > point at which num_online_cpus() is decreased.

So then this never re-inits the safe_read_lock?

               if (num_online_cpus() > 1)
                       return;

               debug_locks_off();
               raw_spin_lock_init(&safe_read_lock);

	-ss


* Re: [External] Re: [PATCH v2] printk: fix deadlock when kernel panic
  2021-02-08 13:12     ` Sergey Senozhatsky
@ 2021-02-08 15:40       ` Muchun Song
  2021-02-09  8:39         ` Petr Mladek
  0 siblings, 1 reply; 10+ messages in thread
From: Muchun Song @ 2021-02-08 15:40 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Steven Rostedt, john.ogness, Andrew Morton, LKML

On Mon, Feb 8, 2021 at 9:12 PM Sergey Senozhatsky
<sergey.senozhatsky@gmail.com> wrote:
>
> On (21/02/08 16:49), Muchun Song wrote:
> > On Mon, Feb 8, 2021 at 2:38 PM Sergey Senozhatsky
> > <sergey.senozhatsky@gmail.com> wrote:
> > >
> > > On (21/02/06 13:41), Muchun Song wrote:
> > > > We found a deadlock bug on our server when the kernel panicked. It can
> > > > be described by the following diagram.
> > > >
> > > > CPU0:                                         CPU1:
> > > > panic                                         rcu_dump_cpu_stacks
> > > >   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
> > > >     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
> > > >                                                     __printk_safe_flush
> > > >                                                       raw_spin_lock_irqsave(&read_lock)
> > > >     // send NMI to other processors
> > > >     apic_send_IPI_allbutself(NMI_VECTOR)
> > > >                                                         // NMI interrupt, dead loop
> > > >                                                         crash_nmi_callback
> > >
> > > At what point does this decrement num_online_cpus()? Any chance that
> > > panic CPU can apic_send_IPI_allbutself() and printk_safe_flush_on_panic()
> > > before num_online_cpus() becomes 1?
> >
> > I took a closer look at the code. IIUC, it seems that there is no
> > point at which num_online_cpus() is decreased.
>
> So then this never re-inits the safe_read_lock?

Right. If we encounter this case, we do not flush the printk
buffers. So it seems my previous patch is the right fix.
Right?

https://lore.kernel.org/patchwork/patch/1373563/

>
>                if (num_online_cpus() > 1)
>                        return;
>
>                debug_locks_off();
>                raw_spin_lock_init(&safe_read_lock);
>
>         -ss


* Re: [External] Re: [PATCH v2] printk: fix deadlock when kernel panic
  2021-02-08 15:40       ` Muchun Song
@ 2021-02-09  8:39         ` Petr Mladek
  2021-02-10  2:25           ` Sergey Senozhatsky
  0 siblings, 1 reply; 10+ messages in thread
From: Petr Mladek @ 2021-02-09  8:39 UTC (permalink / raw)
  To: Muchun Song
  Cc: Sergey Senozhatsky, Steven Rostedt, john.ogness, Andrew Morton, LKML

On Mon 2021-02-08 23:40:07, Muchun Song wrote:
> On Mon, Feb 8, 2021 at 9:12 PM Sergey Senozhatsky
> <sergey.senozhatsky@gmail.com> wrote:
> >
> > On (21/02/08 16:49), Muchun Song wrote:
> > > On Mon, Feb 8, 2021 at 2:38 PM Sergey Senozhatsky
> > > <sergey.senozhatsky@gmail.com> wrote:
> > > >
> > > > On (21/02/06 13:41), Muchun Song wrote:
> > > > > We found a deadlock bug on our server when the kernel panicked. It can
> > > > > be described by the following diagram.
> > > > >
> > > > > CPU0:                                         CPU1:
> > > > > panic                                         rcu_dump_cpu_stacks
> > > > >   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
> > > > >     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
> > > > >                                                     __printk_safe_flush
> > > > >                                                       raw_spin_lock_irqsave(&read_lock)
> > > > >     // send NMI to other processors
> > > > >     apic_send_IPI_allbutself(NMI_VECTOR)
> > > > >                                                         // NMI interrupt, dead loop
> > > > >                                                         crash_nmi_callback
> > > >
> > > > At what point does this decrement num_online_cpus()? Any chance that
> > > > panic CPU can apic_send_IPI_allbutself() and printk_safe_flush_on_panic()
> > > > before num_online_cpus() becomes 1?
> > >
> > > I took a closer look at the code. IIUC, it seems that there is no
> > > point at which num_online_cpus() is decreased.
> >
> > So then this never re-inits the safe_read_lock?

Yes, but it will also not cause the deadlock.
printk_safe_flush_on_panic() will return without flushing
the buffers.
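
With the patch applied, printk_safe_flush_on_panic() effectively
becomes the following (reassembled from the diff for illustration;
the logbuf_lock half is inferred from the context lines there):

	void printk_safe_flush_on_panic(void)
	{
		if (raw_spin_is_locked(&logbuf_lock)) {
			if (num_online_cpus() > 1)
				return;

			debug_locks_off();
			raw_spin_lock_init(&logbuf_lock);
		}

		if (raw_spin_is_locked(&safe_read_lock)) {
			if (num_online_cpus() > 1)
				return;

			debug_locks_off();
			raw_spin_lock_init(&safe_read_lock);
		}

		printk_safe_flush();
	}

So if a stopped CPU still holds one of the locks while other CPUs are
online, we return early and lose the not-yet-flushed messages instead
of deadlocking.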

> Right. If we encounter this case, we do not flush the printk
> buffers. So it seems my previous patch is the right fix.
> Right?
> 
> https://lore.kernel.org/patchwork/patch/1373563/

No, there is a risk of deadlock caused by logbuf_lock, see
https://lore.kernel.org/lkml/YB0nggSa7a95UCIK@alley/

> >                if (num_online_cpus() > 1)
> >                        return;
> >
> >                debug_locks_off();
> >                raw_spin_lock_init(&safe_read_lock);
> >
> >         -ss

I prefer this approach. It is straightforward because it handles
read_lock the same way as logbuf_lock.

IMHO, it is not worth inventing any more complexity. Both logbuf_lock
and read_lock are obsoleted by the lockless ringbuffer. And we need
something simple to get backported to the already released kernels.

Best Regards,
Petr


* Re: [PATCH v2] printk: fix deadlock when kernel panic
  2021-02-06  5:41 [PATCH v2] printk: fix deadlock when kernel panic Muchun Song
  2021-02-08  6:38 ` Sergey Senozhatsky
@ 2021-02-09  9:19 ` Petr Mladek
  2021-02-09 12:20   ` [External] " Muchun Song
  2021-02-10  2:16   ` Sergey Senozhatsky
  1 sibling, 2 replies; 10+ messages in thread
From: Petr Mladek @ 2021-02-09  9:19 UTC (permalink / raw)
  To: Muchun Song; +Cc: sergey.senozhatsky, rostedt, john.ogness, akpm, linux-kernel

On Sat 2021-02-06 13:41:24, Muchun Song wrote:
> We found a deadlock bug on our server when the kernel panicked. It can
> be described by the following diagram.
> 
> CPU0:                                         CPU1:
> panic                                         rcu_dump_cpu_stacks
>   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
>     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
>                                                     __printk_safe_flush
>                                                       raw_spin_lock_irqsave(&read_lock)
>     // send NMI to other processors
>     apic_send_IPI_allbutself(NMI_VECTOR)
>                                                         // NMI interrupt, dead loop
>                                                         crash_nmi_callback
>   printk_safe_flush_on_panic
>     printk_safe_flush
>       __printk_safe_flush
>         // deadlock
>         raw_spin_lock_irqsave(&read_lock)
> 
> register_nmi_handler() can be called from __crash_kexec() or
> crash_smp_send_stop() on x86-64. Because CPU1 is interrupted by the
> NMI while holding read_lock, and crash_nmi_callback() never returns,
> CPU0 can deadlock when printk_safe_flush_on_panic() is called.
> 
> When we hold the read_lock and are then interrupted by an NMI whose
> handler calls nmi_panic(), it can also lead to a deadlock, this time
> on a single CPU.
> 
> In order to fix it, we make read_lock global and rename it to
> safe_read_lock. We handle safe_read_lock in printk_safe_flush_on_panic()
> the same way we handle logbuf_lock there.

What about the following commit message? It uses imperative language
and explains that the patch just prevents the deadlock. It removes
some details. The diagram is better than many words.

<commit message>
printk_safe_flush_on_panic() caused the following deadlock on our server:

CPU0:                                         CPU1:
panic                                         rcu_dump_cpu_stacks
  kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
    register_nmi_handler(crash_nmi_callback)      printk_safe_flush
                                                    __printk_safe_flush
                                                      raw_spin_lock_irqsave(&read_lock)
    // send NMI to other processors
    apic_send_IPI_allbutself(NMI_VECTOR)
                                                        // NMI interrupt, dead loop
                                                        crash_nmi_callback
  printk_safe_flush_on_panic
    printk_safe_flush
      __printk_safe_flush
        // deadlock
        raw_spin_lock_irqsave(&read_lock)

DEADLOCK: read_lock is taken on CPU1 and will never get released.

It happens when panic() stops a CPU by NMI while that CPU is in
the middle of printk_safe_flush().

Handle the lock the same way as logbuf_lock. The printk_safe buffers
are flushed only when both locks can be safely taken.

Note: It would actually be safe to re-init the locks when all CPUs were
      stopped by NMI. But it would require passing this information
      from arch-specific code. It is not worth the complexity.
      Especially because logbuf_lock and printk_safe buffers have been
      obsoleted by the lockless ring buffer.
</commit message>

> Fixes: cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

With an updated commit message:

Reviewed-by: Petr Mladek <pmladek@suse.com>

Best Regards,
Petr


* Re: [External] Re: [PATCH v2] printk: fix deadlock when kernel panic
  2021-02-09  9:19 ` Petr Mladek
@ 2021-02-09 12:20   ` Muchun Song
  2021-02-10  2:16   ` Sergey Senozhatsky
  1 sibling, 0 replies; 10+ messages in thread
From: Muchun Song @ 2021-02-09 12:20 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Steven Rostedt, john.ogness, Andrew Morton, LKML

On Tue, Feb 9, 2021 at 5:19 PM Petr Mladek <pmladek@suse.com> wrote:
>
> On Sat 2021-02-06 13:41:24, Muchun Song wrote:
> > We found a deadlock bug on our server when the kernel panicked. It can
> > be described by the following diagram.
> >
> > CPU0:                                         CPU1:
> > panic                                         rcu_dump_cpu_stacks
> >   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
> >     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
> >                                                     __printk_safe_flush
> >                                                       raw_spin_lock_irqsave(&read_lock)
> >     // send NMI to other processors
> >     apic_send_IPI_allbutself(NMI_VECTOR)
> >                                                         // NMI interrupt, dead loop
> >                                                         crash_nmi_callback
> >   printk_safe_flush_on_panic
> >     printk_safe_flush
> >       __printk_safe_flush
> >         // deadlock
> >         raw_spin_lock_irqsave(&read_lock)
> >
> > register_nmi_handler() can be called from __crash_kexec() or
> > crash_smp_send_stop() on x86-64. Because CPU1 is interrupted by the
> > NMI while holding read_lock, and crash_nmi_callback() never returns,
> > CPU0 can deadlock when printk_safe_flush_on_panic() is called.
> >
> > When we hold the read_lock and are then interrupted by an NMI whose
> > handler calls nmi_panic(), it can also lead to a deadlock, this time
> > on a single CPU.
> >
> > In order to fix it, we make read_lock global and rename it to
> > safe_read_lock. We handle safe_read_lock in printk_safe_flush_on_panic()
> > the same way we handle logbuf_lock there.
>
> What about the following commit message? It uses imperative language
> and explains that the patch just prevents the deadlock. It removes
> some details. The diagram is better than many words.
>
> <commit message>
> printk_safe_flush_on_panic() caused the following deadlock on our server:
>
> CPU0:                                         CPU1:
> panic                                         rcu_dump_cpu_stacks
>   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
>     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
>                                                     __printk_safe_flush
>                                                       raw_spin_lock_irqsave(&read_lock)
>     // send NMI to other processors
>     apic_send_IPI_allbutself(NMI_VECTOR)
>                                                         // NMI interrupt, dead loop
>                                                         crash_nmi_callback
>   printk_safe_flush_on_panic
>     printk_safe_flush
>       __printk_safe_flush
>         // deadlock
>         raw_spin_lock_irqsave(&read_lock)
>
> DEADLOCK: read_lock is taken on CPU1 and will never get released.
>
> It happens when panic() stops a CPU by NMI while that CPU is in
> the middle of printk_safe_flush().
>
> Handle the lock the same way as logbuf_lock. The printk_safe buffers
> are flushed only when both locks can be safely taken.
>
> Note: It would actually be safe to re-init the locks when all CPUs were
>       stopped by NMI. But it would require passing this information
>       from arch-specific code. It is not worth the complexity.
>       Especially because logbuf_lock and printk_safe buffers have been
>       obsoleted by the lockless ring buffer.
> </commit message>

Many thanks. It is clear.

>
> > Fixes: cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic")
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> With an updated commit message:
>
> Reviewed-by: Petr Mladek <pmladek@suse.com>

Thanks. I will update the commit message.

>
> Best Regards,
> Petr


* Re: [PATCH v2] printk: fix deadlock when kernel panic
  2021-02-09  9:19 ` Petr Mladek
  2021-02-09 12:20   ` [External] " Muchun Song
@ 2021-02-10  2:16   ` Sergey Senozhatsky
  1 sibling, 0 replies; 10+ messages in thread
From: Sergey Senozhatsky @ 2021-02-10  2:16 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Muchun Song, sergey.senozhatsky, rostedt, john.ogness, akpm,
	linux-kernel

On (21/02/09 10:19), Petr Mladek wrote:
> On Sat 2021-02-06 13:41:24, Muchun Song wrote:

[..]

> What about the following commit message? It uses imperative language
> and explains that the patch just prevents the deadlock. It removes
> some details. The diagram is better than many words.
> 
> <commit message>
> printk_safe_flush_on_panic() caused the following deadlock on our server:
> 
> CPU0:                                         CPU1:
> panic                                         rcu_dump_cpu_stacks
>   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
>     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
>                                                     __printk_safe_flush
>                                                       raw_spin_lock_irqsave(&read_lock)
>     // send NMI to other processors
>     apic_send_IPI_allbutself(NMI_VECTOR)
>                                                         // NMI interrupt, dead loop
>                                                         crash_nmi_callback
>   printk_safe_flush_on_panic
>     printk_safe_flush
>       __printk_safe_flush
>         // deadlock
>         raw_spin_lock_irqsave(&read_lock)

[..]

I would also add to the commit message that it avoids the deadlock
_in this particular case_ at the expense of losing the contents of
the printk_safe buffers. This looks important enough to be mentioned.

	-ss


* Re: [External] Re: [PATCH v2] printk: fix deadlock when kernel panic
  2021-02-09  8:39         ` Petr Mladek
@ 2021-02-10  2:25           ` Sergey Senozhatsky
  0 siblings, 0 replies; 10+ messages in thread
From: Sergey Senozhatsky @ 2021-02-10  2:25 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Muchun Song, Sergey Senozhatsky, Steven Rostedt, john.ogness,
	Andrew Morton, LKML

On (21/02/09 09:39), Petr Mladek wrote:
> > > So then this never re-inits the safe_read_lock?
> 
> Yes, but it will also not cause the deadlock.

Right.

> I prefer this approach. It is straightforward because it handles
> read_lock the same way as logbuf_lock.

I'm fine with that approach, but this needs to be in the commit message.
Something like "lose printk_safe messages when we think we will deadlock
on the printk_safe flush".

	-ss

