From: Mikulas Patocka <mpatocka@redhat.com>
To: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>, Nigel Croxon <ncroxon@redhat.com>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>,
	John Ogness <john.ogness@linutronix.de>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	dm-devel@redhat.com, linux-serial@vger.kernel.org
Subject: Re: Serial console is causing system lock-up
Date: Thu, 7 Mar 2019 07:54:44 -0500 (EST)
Message-ID: <alpine.LRH.2.02.1903070744580.20758@file01.intranet.prod.int.rdu2.redhat.com>
In-Reply-To: <20190307122642.GA10415@tigerII.localdomain>



On Thu, 7 Mar 2019, Sergey Senozhatsky wrote:

> On (03/07/19 11:37), John Ogness wrote:
> > > Sorry John, the reasoning is that I'm trying to understand
> > > why this does not look like a soft or hard lock-up or an RCU
> > > stall scenario.
> > 
> > The reason is that you are seeing data being printed on the console. The
> > watchdogs (soft, hard, rcu, nmi) are all touched with each emergency
> > message.
> 
> Correct. Please see below.
> 
> > > The CPU which spins on prb_lock() can have preemption disabled and,
> > > additionally, can have local IRQs disabled, or be under an RCU
> > > read-side lock. If consoles are busy, then there are CPUs which printk()
> > > data and keep prb_lock contended; prb_lock() does not seem to be
> > > fair. What am I missing?
> > 
> > You are correct. Making prb_lock fair might be something we want to look
> > into. Perhaps also based on the loglevel of what needs to be
> > printed. (For example, KERN_ALERT always wins over KERN_CRIT.)
> 
> Good.
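To illustrate the fairness point quoted above, here is a minimal userspace sketch of a FIFO "ticket" lock, one classic way to make a spinlock fair. It is not the actual prb_lock() code; the names (ticket_lock_acquire(), run_demo(), etc.) are hypothetical, and sched_yield() stands in for the kernel's cpu_relax(). It also shows why a waiter's delay adds up as described below: with strict FIFO ordering, a CPU waits for every holder queued ahead of it.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/*
 * Hypothetical FIFO "ticket" lock (not prb_lock). Each waiter takes the
 * next ticket and spins until now_serving reaches it, so the lock is
 * granted strictly in arrival order and no waiter can be starved.
 */
struct ticket_lock {
	atomic_uint next_ticket;
	atomic_uint now_serving;
};

static void ticket_lock_init(struct ticket_lock *l)
{
	atomic_init(&l->next_ticket, 0);
	atomic_init(&l->now_serving, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next_ticket, 1);

	/* Userspace stand-in for cpu_relax(): let the holder run. */
	while (atomic_load(&l->now_serving) != me)
		sched_yield();
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/* Hand the lock to the next ticket in line. */
	atomic_fetch_add(&l->now_serving, 1);
}

/* Demo: several threads increment a shared counter under the lock. */
struct demo {
	struct ticket_lock lock;
	long counter;
	int iters;
};

static void *demo_worker(void *arg)
{
	struct demo *d = arg;

	for (int i = 0; i < d->iters; i++) {
		ticket_lock_acquire(&d->lock);
		d->counter++;
		ticket_lock_release(&d->lock);
	}
	return NULL;
}

static long run_demo(int nthreads, int iters)
{
	pthread_t t[16];
	struct demo d = { .counter = 0, .iters = iters };

	if (nthreads > 16)
		nthreads = 16;
	ticket_lock_init(&d.lock);
	for (int i = 0; i < nthreads; i++)
		pthread_create(&t[i], NULL, demo_worker, &d);
	for (int i = 0; i < nthreads; i++)
		pthread_join(t[i], NULL);
	return d.counter;
}
```

Fairness alone does not bound the wait: a ticket lock only guarantees the order, so the last CPU in line still waits for every holder ahead of it.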
> 
> I'm not insisting, but I have a feeling that touching watchdogs after
> call_console_drivers() might sometimes be too late. When we spin in

There are still NMI stack traces; see here:
http://people.redhat.com/~mpatocka/testcases/console-lockup/

> prb_lock() we wait for all CPUs which are before/ahead of us to
> finish their call_console_drivers(), one by one. So if CPUZ is very
> unlucky and is in atomic context, then prb_lock() for that CPUZ can
> last for N * call_console_drivers(). And depending on N (which also
> includes unfairness) and call_console_drivers() timings, the NMI
> watchdog may pay CPUZ a visit before it gets its chance to touch watchdogs.
> 
> *Maybe* sometimes we might want to touch watchdogs in prb_lock().
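A minimal sketch of what "touching watchdogs while spinning" could look like, assuming the waiter polls a flag and calls a hook every touch_every iterations. In the kernel the hook would be something like touch_nmi_watchdog(); here it is a plain callback so the pattern runs in userspace. spin_until(), count_touch() and the max_spins bound are hypothetical illustration only, not part of any proposed patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical hook type; in the kernel this might wrap touch_nmi_watchdog(). */
typedef void (*touch_fn)(void *ctx);

/* Demo hook: counts how often the "watchdog" was touched. */
static void count_touch(void *ctx)
{
	++*(unsigned long *)ctx;
}

/*
 * Spin until *ready becomes true (or max_spins is reached, so the sketch
 * always terminates), touching the watchdog every touch_every iterations.
 * Returns the number of iterations spent waiting.
 */
static unsigned long spin_until(atomic_bool *ready, unsigned long max_spins,
				unsigned long touch_every, touch_fn touch,
				void *ctx)
{
	unsigned long spins = 0;

	while (!atomic_load(ready) && spins < max_spins) {
		spins++;
		if (touch_every && spins % touch_every == 0)
			touch(ctx);
	}
	return spins;
}
```

The trade-off is that touching the watchdog from the wait loop also hides a genuinely stuck lock holder, which is presumably why it is framed as something to do only "sometimes".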
> 
> So, given the design of the new printk(), I can't help thinking that
> the current
> 	"the winner takes it all"
> may become
> 	"the winner waits for all".
> 
> Mikulas mentioned that he observes "** X messages dropped" warnings.
> And this suggests that, _most likely_, we had significantly more than
> 2 CPUs calling printk() concurrently.

When I observe these messages (usually with a small log buffer size), it
doesn't lock up.

The lockups happen because the messages are stuffed into a 2MiB buffer and
then printed over a 115200 baud serial line.

You can see this: 
http://people.redhat.com/~mpatocka/testcases/console-lockup/5.0-total-lockup.txt

Here it attempted to write 1355277 bytes over the slow serial line, which
takes a few minutes.
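A quick back-of-envelope check of the "few minutes" figure, assuming the UART uses 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits on the wire per byte) and never stalls on flow control. serial_drain_seconds() is a hypothetical helper, not a kernel function.

```c
/*
 * Rough time to push `bytes` through a UART at `baud`, assuming 8N1
 * framing (10 bits per byte) and no flow-control stalls.
 */
static double serial_drain_seconds(unsigned long bytes, unsigned long baud)
{
	double bytes_per_sec = (double)baud / 10.0;	/* 115200 baud -> 11520 B/s */

	return (double)bytes / bytes_per_sec;
}
```

Under those assumptions, the 1355277 bytes above take about 118 seconds (roughly two minutes), and a completely full 2 MiB buffer takes about 182 seconds (roughly three minutes). Any CPU that has to wait for that backlog in atomic context is stalled for the same order of time.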

> - A single source (a single CPU calling printk()) would not lose messages,
>   because it would print its own message before calling printk() again (we
>   could still have another CPU rescheduling under console_sem, but I don't
>   think this is the case).
> 
> - Two CPUs would also probably not lose messages; Steven's console_owner
>   would throttle them down.
> 
> So I think what we had was a spike of WARN/ERR printk()s coming from
> N CPUs concurrently.

Losing messages is in my opinion reasonable (if they are produced faster 
than they could be printed). Another possibility is to always write the 
message synchronously and exit printk only after it is written.
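The "drop when the producer outruns the console" policy can be sketched as a small bounded queue that refuses new messages when full and counts them, so the console side can later report "** N messages dropped". This is a single-threaded sketch with no locking, and the names and sizes (log_store(), log_emit(), LOG_SLOTS, LOG_MSG_LEN) are hypothetical. It is not the printk ring buffer, which works differently.

```c
#include <stdbool.h>
#include <stdio.h>

#define LOG_SLOTS 4		/* hypothetical: tiny capacity for illustration */
#define LOG_MSG_LEN 64

/* Bounded FIFO of messages plus a count of messages that did not fit. */
struct log_queue {
	char msgs[LOG_SLOTS][LOG_MSG_LEN];
	unsigned int head;	/* index of the oldest queued message */
	unsigned int count;	/* messages currently queued */
	unsigned long dropped;	/* messages lost since the last report */
};

/* Producer side: store a message, or drop it if the queue is full. */
static bool log_store(struct log_queue *q, const char *msg)
{
	if (q->count == LOG_SLOTS) {
		q->dropped++;	/* producer outran the console */
		return false;
	}
	unsigned int tail = (q->head + q->count) % LOG_SLOTS;
	snprintf(q->msgs[tail], LOG_MSG_LEN, "%s", msg);
	q->count++;
	return true;
}

/* Console side: copy out the oldest message; false if the queue is empty. */
static bool log_emit(struct log_queue *q, char *out, size_t outlen)
{
	if (q->count == 0)
		return false;
	snprintf(out, outlen, "%s", q->msgs[q->head]);
	q->head = (q->head + 1) % LOG_SLOTS;
	q->count--;
	return true;
}

/* Console side: fetch and reset the drop count for a "** N dropped" notice. */
static unsigned long log_take_dropped(struct log_queue *q)
{
	unsigned long n = q->dropped;

	q->dropped = 0;
	return n;
}
```

The synchronous alternative would replace the "queue full" branch with waiting until the console has written the message, trading lost messages for printk() latency.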

> And this brings us to another pessimistic scenario: a very unlucky
> CPUZ has to spin in prb_lock() waiting for other CPUs to print out
> the very same 2 million chars. In terms of printk() latency, that
> looks to me just like the current printk.
> 
> John, sorry to ask this, but does the new printk() design always provide
> latency guarantees good enough for PREEMPT_RT?
> 
> I'm surely missing something. Where am I wrong?
> 
> 	-ss

Mikulas


Thread overview: 36+ messages
2019-03-06 14:27 Serial console is causing system lock-up Mikulas Patocka
2019-03-06 15:22 ` Petr Mladek
2019-03-06 16:07   ` Mikulas Patocka
2019-03-06 16:30     ` Theodore Y. Ts'o
2019-03-06 17:11       ` Mikulas Patocka
2019-03-06 22:19         ` Steven Rostedt
2019-03-06 22:43           ` John Ogness
2019-03-07  2:22             ` Sergey Senozhatsky
2019-03-07  8:17               ` John Ogness
2019-03-07  8:25                 ` Sergey Senozhatsky
2019-03-07  8:34                   ` John Ogness
2019-03-07  9:17                     ` Sergey Senozhatsky
2019-03-07 10:37                       ` John Ogness
2019-03-07 12:26                         ` Sergey Senozhatsky
2019-03-07 12:54                           ` Mikulas Patocka [this message]
2019-03-07 14:21                           ` John Ogness
2019-03-07 15:35                             ` Petr Mladek
2019-03-12  2:32                             ` Sergey Senozhatsky
2019-03-12  8:17                               ` John Ogness
2019-03-12  8:59                                 ` Sergey Senozhatsky
2019-03-12 10:05                                 ` Mikulas Patocka
2019-03-12 13:19                                   ` John Ogness
2019-03-12 13:44                                     ` Petr Mladek
2019-03-12 12:08                                 ` Petr Mladek
2019-03-12 15:19                                   ` John Ogness
2019-03-13  2:38                                   ` Sergey Senozhatsky
2019-03-13  8:43                                     ` John Ogness
2019-03-14 10:30                                       ` Sergey Senozhatsky
2019-03-07 14:08             ` John Stoffel
2019-03-07 14:26               ` Mikulas Patocka
2019-03-08  1:22                 ` Sergey Senozhatsky
2019-03-08  1:39                   ` Sergey Senozhatsky
2019-03-08  2:36                     ` John Ogness
2019-03-07 15:16         ` Petr Mladek
2019-03-07  1:56     ` Sergey Senozhatsky
2019-03-07 13:12       ` Mikulas Patocka
