From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S966111AbdLSAwz (ORCPT ); Mon, 18 Dec 2017 19:52:55 -0500 Received: from mail-pg0-f65.google.com ([74.125.83.65]:46121 "EHLO mail-pg0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S966084AbdLSAwx (ORCPT ); Mon, 18 Dec 2017 19:52:53 -0500 X-Google-Smtp-Source: ACJfBouCuMBQ5U6lpqRmLFjlS4MjfdM3krhO5hlzaZdAFn74nGmsgC+h4iMEG83lwsSFwqZ9ZJ+2lw== Date: Tue, 19 Dec 2017 09:52:48 +0900 From: Sergey Senozhatsky To: Steven Rostedt Cc: Sergey Senozhatsky , Tejun Heo , Sergey Senozhatsky , Petr Mladek , Jan Kara , Andrew Morton , Peter Zijlstra , Rafael Wysocki , Pavel Machek , Tetsuo Handa , linux-kernel@vger.kernel.org Subject: Re: [RFC][PATCHv6 00/12] printk: introduce printing kernel thread Message-ID: <20171219005248.GA8892@jagdpanzerIV> References: <20171204134825.7822-1-sergey.senozhatsky@gmail.com> <20171214142709.trgl76hbcdwaczzd@pathway.suse.cz> <20171214152551.GY3919388@devbig577.frc2.facebook.com> <20171214125506.52a7e5fa@gandalf.local.home> <20171214181153.GZ3919388@devbig577.frc2.facebook.com> <20171215021024.GA11199@jagdpanzerIV> <20171214221831.3ead0298@vmware.local.home> <20171215050607.GC11199@jagdpanzerIV> <20171215101957.5d4a1004@gandalf.local.home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20171215101957.5d4a1004@gandalf.local.home> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Steven, I couldn't reply sooner. On (12/15/17 10:19), Steven Rostedt wrote: > > On (12/14/17 22:18), Steven Rostedt wrote: > > > > Steven, your approach works ONLY when we have the following preconditions: > > > > > > > > a) there is a CPU that is calling printk() from the 'safe' (non-atomic, > > > > etc) context > > > > > > > > what does guarantee that? what happens if there is NO non-atomic > > > > CPU or that non-atomic simplky missses the console_owner != false > > > > point? we are going to conclude > > > > > > > > "if printk() doesn't work for you, it's because you are holding it wrong"? > > > > > > > > > > > > what if that non-atomic CPU does not call printk(), but instead > > > > it does console_lock()/console_unlock()? why there is no handoff? > > The case here, you are talking about a CPU doing console_lock() from a > non printk() case. Which is what I was asking about how often this > happens. I'd say often enough. but the point I was trying to make is that we can have non-atomic CPUs which can do the print out, instead of "sharing the load" between atomic CPUs. > As for why there's no handoff. Does the non printk() > console_lock/unlock ever happen from a critical location? I don't think > it does (but I haven't checked). Then it is the perfect candidate to do > all the printing. that's right. that is the point I was trying making. we can have better candidates to do all the printing. [..] > > deep-stack spin_lock_irqsave() lockup reports from multiple CPUs (3 cpus) > > happening at the same moment + NMI backtraces from all the CPUs (more > > than 3 cpus) that follows the lockups, over not-so-fast serial console. > > exactly the bug report I received two days ago. so which one of the CPUs > > here is a good candidate to successfully emit all of the pending logbuf > > entries? none. all of them either have local IRQs disabled, or dump_stack() > > from either backtrace IPI or backtrace NMI (depending on the configuration). > > > > Is the above showing an issue of console_lock happening in the non > printk() case? > > > do we periodically do console_lock() on a running system? yes, we do. > > add to console_unlock() > > Right, and the non printk() console_lock() should be fine to do all > printing when it unlocks. that's right. > > this argument is really may be applied against your patch as well. I > > really don't want us to have this type of "technical" discussion. > > Sure, but my patch fixes the unfair approach that printk currently does. I did tests yesterday, traces are available. I can't conclude that the patch fixes the unfairness of printk(). > > printk() is a tool for developers. but developers can't use. > > > > > > > > c) the task that is looping in console_unlock() sees non-atomic CPU when > > > > console_owner is set. > > > > > > I haven't looked at the latest code, but my last patch didn't care > > > about "atomic" and "non-atomic" > > > > I know. and I think it is sort of a problem. > > Please show me the case that it is. And don't explain where it is. > Please apply the patch and have the problem occur and show it to me. > That's all that I'm asking for. I did some tests yesterday. I posted analysis and traces. [..] > No, because it is unrealistic. For example: right. > +static void test_noirq_console_unlock(void) > +{ > + unsigned long flags; > + unsigned long num_messages = 0; > + > + pr_err("=== TEST %s\n", __func__); > + > + num_messages = 0; > + console_lock(); > + while (num_messages++ < max_num_messages) > + pr_info("=== %s Append message %lu out of %lu\n", > + __func__, > + num_messages, > + max_num_messages); > + > + local_irq_save(flags); > + console_unlock(); > > Where in the kernel do we do this? the funny thing is that we _are going to start doing this_ with the console_owner hand off enabled. consider the following case we have console_lock() from non-atomic context. console_sem owner is getting preempted, under console_sem. which is totally possible and happens a lot. in the mean time we have OOM, which can print a lot of info. by the time console_sem returns back to TASK_RUNNING logbuf contains some number of pending messages [lets say 10 seconds worth of printing]. console owner goes to console_unlock(). accidentally we have printk from IRQ on CPUz. console_owner hands over printing duty to CPUz. so now we have to print 10 seconds worth of OOM messages from irq. CPU0 CPU1 ~ CPUx CPUz console_lock << preempted >> OOM OOM printouts, lots of OOM traces, etc. OOM end [progress done]. << back to RUNNING >> console_unlock() for (;;) sets console_owner call_console_drivers() IRQ printk sees console_owner sets console_waiter clears console_owner sees console_waier handoff for (;;) { call_console_drivers() ??? lockup } up() this is how we down() from non-atomic and up() from atomic [if we make it to up(). we might end up in NMI panic]. this scenario is totally possible, isn't it? the optimistic expectation here is that some other printk() from non-atomic CPU will jump in and take over printing from atomic CPUz. but I don't see why we are counting on it. -ss