Re: [PATCH 1/2] printk/panic: Access the main printk log in panic() only when safe

From: Petr Mladek <pmladek@suse.com>
To: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>,
	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <peterz@infradead.org>,
	x86@kernel.org, John Ogness <john.ogness@linutronix.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Petr Tesarik <ptesarik@suse.cz>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] printk/panic: Access the main printk log in panic() only when safe
Date: Wed, 24 Jul 2019 14:27:11 +0200	[thread overview]
Message-ID: <20190724122711.qquevkcuge24bhdd@pathway.suse.cz> (raw)
In-Reply-To: <20190723031340.GA19463@jagdpanzerIV>

On Tue 2019-07-23 12:13:40, Sergey Senozhatsky wrote:
> On (07/19/19 14:57), Petr Mladek wrote:
> > But there is one more scenario 3:
> 
> Yes!
> 
> >   - we have CPUB which loops or is deadlocked in IRQ context
> > 
> >   - we have CPUA which panic()-s the system, but can't bring CPUB down,
> >     so logbuf_lock might be takes and release from time to time
> >     by CPUB
> 
> Great!
> 
> This is the only case when we actually need to pay attention to
> num_online_cpus(), because there is an active logbuf_lock owner;
> in any other case we can unconditionally re-init printk() locks.
> 
> But there is more to it.
> 
> Note, that the problem in scenario 3 is bigger than just logbuf_lock.
> Regardless of logbuf implementation we will not be able to panic()
> the system.
> 
> If we have a never ending source of printk() messages, coming from
> misbehaving CPU which stuck in printing loop in IRQ context, then
> flush_on_panic() will never end or kmsg dump will never stop, etc.

Yes.

> We need to cut off misbehaving CPUs. Panic CPU waits (for up to 1
> second?) in smp_send_stop() for secondary CPUs to die, if some
> secondary CPUs are still chatty after that then most likely those
> CPUs don't have anything good to say, just a pointless flood of same
> messages over and over again; which, however, will not let panic
> CPU to proceed.

Makes sense.

> And this is where the idea of "disconnecting" those CPUs from main
> logbuf come from.
> 
> So what we can do:
> - smp_send_stop()
> - disconnect all-but-self from logbuf (via printk-safe)

printk_safe is not really necessary. As you wrote, nobody
is interested into messages from CPUs that are supposed
to be stopped.

It might be enough to set some global variable, for example,
with the CPU number that is calling panic() and is the only
one allowed to print messages from this point on.

It might even be used to force console lock owner to leave
the cycle immediately.

> - wait for 1 or 2 more extra seconds for secondary CPUs to leave
>   console_unlock() and to redirect printks to per-CPU buffers
> - after that we are sort of good-to-go: re-init printk locks
>   and do kmsg_dump, flush_on_panic().

Sounds good.

> Note, misbehaving CPUs will write to per-CPU buffers, they are not
> expected to be able to flush per-CPU buffers to the main logbuf. That
> will require enabled IRQs, which should deliver stop IPI. But we can
> do even more - just disable print_safe irq work on disconnect CPUs.
>
> So, shall we try one more time with the "disconnect" misbehaving CPUs
> approach? I can send an RFC patch.

IMHO, it will be acceptable only when it is reasonably simple and
straightforward. The panic() code is full of special hacks and
it is already complicated to follow all the twists.

Especially because this scenario came up from a theoretical
discussion. Note that my original, real bug report, was
with logbuf_lock NMI, enabled kdump notifiers, ...

Best Regards,
Petr