From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751858AbdDNEmh (ORCPT ); Fri, 14 Apr 2017 00:42:37 -0400 Received: from mail-pf0-f196.google.com ([209.85.192.196]:32943 "EHLO mail-pf0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750747AbdDNEme (ORCPT ); Fri, 14 Apr 2017 00:42:34 -0400 Date: Fri, 14 Apr 2017 13:42:35 +0900 From: Sergey Senozhatsky To: Petr Mladek Cc: Sergey Senozhatsky , Pavel Machek , Jan Kara , "Eric W. Biederman" , Ye Xiaolong , Steven Rostedt , Andrew Morton , Linus Torvalds , Peter Zijlstra , "Rafael J . Wysocki" , Greg Kroah-Hartman , Jiri Slaby , Len Brown , linux-kernel@vger.kernel.org, Sergey Senozhatsky Subject: Re: [printk] fbc14616f4: BUG:kernel_reboot-without-warning_in_test_stage Message-ID: <20170414044235.GA462@jagdpanzerIV.localdomain> References: <20170407081449.GA12859@amd> <20170407121021.GA379@jagdpanzerIV.localdomain> <20170407124455.GC4756@amd> <20170407151306.GA384@tigerII.localdomain> <20170409101230.GB27363@amd> <20170410045339.GB2793@jagdpanzerIV.localdomain> <20170410184858.GA24226@amd> <20170411014635.GA3836@jagdpanzerIV.localdomain> <20170411161953.GA447@tigerII.localdomain> <20170413140321.GH16938@pathway.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170413140321.GH16938@pathway.suse.cz> User-Agent: Mutt/1.8.0 (2017-02-23) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Petr, thanks for taking a look! On (04/13/17 16:03), Petr Mladek wrote: > > +static inline bool console_offload_printing(void) > > +{ > > + static struct task_struct *printing_task = NULL; > > + static unsigned long lines_printed = 0; > > + > > + if (!atomic_print_limit || !printk_kthread_enabled()) > > + return false; > > + > > + /* We rescheduled - reset the counters. */ > > + if (printing_task != current) { > > + lines_printed = 0; > > + printing_task = current; > > + return false; > > + } > > If we want to check that the process rescheduled, we should > store/check also current->nvcsw + current->nivcsw. ok. [..] > My only fear is that it is getting more and more complicated. > On the other hand, any partial solution is asking for > troubles and complains. yeah. we have to aim slightly different and conflicting targets - introducing a new printk behavior, while preserving an already existing guarantees. which is a bit tricky. [..] > > + if (lines_printed > 2 * (unsigned long)atomic_print_limit) { > > + printk_enforce_emergency = true; > > + pr_crit("Declaring printk emergency mode.\n"); > > + return false; > > + } > > The only messages that are printed on my workstation are the same > few lines everytime I connect my phone over USB to get it charged. you are right. this is a known and yet to be resolved issue. > It might help to check the number of process switch counts as > suggested above. will take a look at your 'current->nvcsw + current->nivcsw' idea. thanks. [..] > > + /* > > + * Sometimes we may lock console_sem before printk_kthread. > > + * In this case we will jump to `again' label (if there are > > + * pending messages), print one more line from current > > + * process, break out of printing loop (we don't reset the > > + * counter of printed lines) and do up_console_sem() to > > + * wakeup printk_kthread again. > > + * > > + * If printk_kthread never wakes up (which may indicate that > > + * the system is unstable or something weird is going on), > > + * then we will keep jumping to `again' label and printing > > + * one message from the logbuf. This is a bit ugly, but at > > + * least we will print out the logbuf. > > + * > > + * If such condition occurs, console_offload_printing() can > > + * declare `printk_emergency' at some point. > > I am a bit confused by the comment above. The again goto target is > used only when there is a race between leaving the loop and releasing > the console_sem. It is a corner case. not really. this is the part where "preserve printk guarantees" jumps in. when we limit the number of lines we can print we have to leave this loop with not fully flushed logbuf. so `goto again' is not solely for corner case anymore. when we prematurely leave the printing loop, we wake_up printk_kthread, unlock console_sem... and then we have no idea if printk_kthread going to wake_up at all, and, if it's going to, how much time will it take. at the same time we have a task that is already in console_unlock() and, probably, we still have pending messages in the logbuf. that's why the process that just has left the printing loop [and there easily might be pending messages in the logbuf] does the whole 'retry' thing. we can have a misbehaving high priority process or something, that would prevent printk_kthread from becoming running just when we need it. so, at least sometimes, the printing process (the one that breaks ouf of printing loop and wakes up printk_kthread) can re-acquire console_sem and print one more line, then it up() console_sem, which, hopefully, will wake_up printk_kthread. if printk_kthread did become running then if would be in console_sem wait list at this point. if it didn't - then we a) wake up some other process that is probably in console_sem list (hopefully there is one) or b) continue printing from the current process. because printk_kthread is still not running and there are no other processes that want to console_lock(). not much we can do at this point. so in expected/normal scenario, we fail to re-acquire the console_sem lock (console_trylock()), which means that either printk_kthread or some other process from console_sem wait list acquired the console_sem and will take over printing. I do something like this --- @@ -2427,6 +2427,8 @@ void console_unlock(void) console_seq++; raw_spin_unlock(&logbuf_lock); + sprintf(text + 7, "{%s}", current->comm); + stop_critical_timings(); /* don't trace print latency */ call_console_drivers(ext_text, ext_len, text, len); start_critical_timings(); --- and fire up some silly printk tests (I don't care what exactly it prints, I'm curious what process prints it). it sort of makes it easier to observe the behaviour. > If there are messages from other CPUs, they most likely appear during > the slow call_console_drivers(). They are handled inside the > for(;;) cycle unless we reach the atomic_print_limit and > wake up the printk_kthread. but `atomic_print_limit' can be quite restrictive. we break out of the printing loop then `atomic_print_limit' tells us to do so, not when we the logbuf is empty. [..] > > @@ -2276,6 +2445,8 @@ void console_unlock(void) > > */ > > raw_spin_lock(&logbuf_lock); > > retry = console_seq != log_next_seq; > > This should be: > > retry = console_seq != log_next_seq && !did_offload; > > Otherwise, it would never leave. it is expected to leave at some point, when we would know for sure that a successful offloading took place. otherwise we probably shouldn't leave. we invoked wake_up_process(), but that does not really buy us anything (from printk guarantees POV). we trust one thing only - a process holding the console_sem lock. because that's "a fact"; wake_up_process() is nothing but "a promise". so, in short, the basic idea is: - the worse the situation is, the closer printk behavior to the original one. the bigger the scheduling latencies are, the more time we spent in console_unlock(), printing more than `atomic_print_limit' chars. in more details: once we cross the `atomic_print_limit' we start this thing @again label: 1) call_console_drivers() - print a single logbuf entry 1) up() - "somebody please take over" (we expect that at least one process will do this - printk_kthread. but we don't know when, and not entirely sure if it will. so we are increasing our chances by waking up printk_kthread, but we don't completely count on it). 2) console_trylock() - if successful then "eeehhh, OK... I'll print one more line" goto @again else return, someone console_lock()-ed yes, in absolutely bad scenario -- even printk_kthread can't lock console_sem - this means old printk behaviour. but that's sort of exactly what we want to do with the printk_emergency_begin/end annotations anyway. in exchange we print out the logbuf eventually. I'm trying to find a compromise that would make everyone happy. may be I'm missing an easier/better solution, wouldn't be the first time ever :) -ss