From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752424AbeAWQCB (ORCPT ); Tue, 23 Jan 2018 11:02:01 -0500
Received: from mail-pf0-f173.google.com ([209.85.192.173]:34785 "EHLO
	mail-pf0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751401AbeAWQB6 (ORCPT );
	Tue, 23 Jan 2018 11:01:58 -0500
X-Google-Smtp-Source: AH8x226sgjDoxM/2hJvWIJD8UL61NDGY56zqa7Gvb+wOVpnZBtJ8YKM7jQ6fyebHCU04I+pjWcftZA==
Date: Wed, 24 Jan 2018 01:01:53 +0900
From: Sergey Senozhatsky
To: Steven Rostedt
Cc: Sergey Senozhatsky, Sergey Senozhatsky, Petr Mladek, Tejun Heo,
	akpm@linux-foundation.org, linux-mm@kvack.org, Cong Wang,
	Dave Hansen, Johannes Weiner, Mel Gorman, Michal Hocko,
	Vlastimil Babka, Peter Zijlstra, Linus Torvalds, Jan Kara,
	Mathieu Desnoyers, Tetsuo Handa, rostedt@rostedt.homelinux.com,
	Byungchul Park, Pavel Machek, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 0/2] printk: Console owner and waiter logic cleanup
Message-ID: <20180123160153.GC429@tigerII.localdomain>
References: <20180117121251.7283a56e@gandalf.local.home>
	<20180117134201.0a9cbbbf@gandalf.local.home>
	<20180119132052.02b89626@gandalf.local.home>
	<20180120071402.GB8371@jagdpanzerIV>
	<20180120104931.1942483e@gandalf.local.home>
	<20180121141521.GA429@tigerII.localdomain>
	<20180123064023.GA492@jagdpanzerIV>
	<20180123095652.5e14da85@gandalf.local.home>
	<20180123152130.GB429@tigerII.localdomain>
	<20180123104121.2ef96d81@gandalf.local.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20180123104121.2ef96d81@gandalf.local.home>
User-Agent: Mutt/1.9.2 (2017-12-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On (01/23/18 10:41), Steven Rostedt wrote:
[..]
> We can have more. But if printk is causing printks, that's a major bug.
> And work queues are not going to fix it, it will just spread out the
> pain. Have it be 100 printks, it needs to be fixed if it is happening.
> And having all printks just generate more printks is not helpful. Even
> if we slow them down. They will still never end.

Dropping the messages is not the solution either. The original bug
report was - this "locks up my kernel". That's it. That's all people
asked us to solve.

With WQ we don't lock up the kernel, because we flush printk_safe in
preemptible context. And people are very much expected to fix the
misbehaving consoles. But that should not be a printk_safe problem.

> A printk causing a printk is a special case, and we need to just show
> enough to let the user know that it's happening, and why printks are
> being throttled. Yes, we may lose data, but if every printk that goes
> out causes another printk, then there's going to be so much noise that
> we won't know what other things went wrong. Honestly, if someone showed
> me a report where the logs were filled with printks that caused
> printks, I'd stop right there and tell them that needs to be fixed
> before we do anything else. And if that recursion is happening because
> of another problem, I don't want to see the recursion printks. I want
> to see the printks that show what is causing the recursions.

I'll re-read this one tomorrow. Not quite following it.

> > The problem is - we flush printk_safe too soon and printing CPU ends up
> > in a lockup - it log_store()-s new messages while it's printing the pending
>
> No, the problem is that printks are causing more printks. Yes that will
> make flushing them soon more likely to lock up the system. But that's
> not the problem. The problem is printks causing printks.

Yes. And ignoring those printk()-s by simply dropping them does not fix
the problem by any means.

> > ones. It's fine to do so when CPU is in preemptible context. Really, we
> > should not care in printk_safe as long as we don't lock up the kernel. The
> > misbehaving console must be fixed. If CPU is not in preemptible context then
> > we do lock up the kernel. Because we flush printk_safe regardless of the
> > current CPU context. If we will flush printk_safe via WQ then we automatically
>
> And if we can throttle recursive printks, then we should be able to
> stop that from happening.

printk_safe was designed to be recursive. It was never designed to be
used to troubleshoot or debug consoles. But it was designed to be
recursive - because that's the sort of problem it was meant to handle:
recursive printks that would otherwise deadlock us. That's why we have
it in the first place.

> > add this "OK! The CPU is preemptible, we can log_store(), it's totally OK, we
> > will not lock it up." thing. Yes, we fill up the logbuf with probably needed
> > and appreciated or unneeded messages. But we should not care in printk_safe.
> > We don't lock up the kernel... And the misbehaving console must be fixed.
>
> I agree.

Good.

> > I disagree with "If we are having issues with irq_work, we are going to have
> > issues with a work queue". There is a tremendous difference between irq_work
> > on that CPU and queue_work_on(smp_processor_id()). One does not care about CPU
> > context, the other one does.
>
> But switching to work queue does not address the underlying problem
> that printks are causing more printks.

The only way to address those problems is to fix the console. That's the
only way.

But that's not what I'm doing with my proposal. I fix the lockup
scenario, the only reported problem so far. Whilst also keeping
printk_safe around.

	-ss
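
For readers following the irq_work versus workqueue argument above, here is a rough userspace toy model in Python. It is only a sketch of the reasoning, not kernel code and not the actual patch: the function `run_printing_cpu`, the constant `WATCHDOG_LIMIT`, and the "one scheduling point per printed message" behaviour are all assumptions made up for illustration. The model assumes a misbehaving console that generates one new message for every message it prints, and compares a flush that runs in atomic context with one that runs in preemptible context.

```python
from collections import deque

# Hypothetical threshold: how many messages may be printed back to back,
# with nothing else running on the CPU, before we call it a lockup.
WATCHDOG_LIMIT = 100


def run_printing_cpu(preemptible, total_budget=1000):
    """Toy model of one CPU printing a log buffer via a misbehaving console.

    Every printed message triggers one new recursive message, so the
    backlog never drains. Returns (locked_up, printed, other_work_runs).
    """
    logbuf = deque(["initial message"])
    printed = 0
    since_last_break = 0
    other_work_runs = 0

    while logbuf and printed < total_budget:
        logbuf.popleft()
        printed += 1
        since_last_break += 1
        # Misbehaving console: printing a message generates another one.
        logbuf.append("recursive message %d" % printed)

        if preemptible:
            # Stand-in for a scheduling point: other tasks get to run.
            other_work_runs += 1
            since_last_break = 0
        elif since_last_break >= WATCHDOG_LIMIT:
            # Atomic context: nothing else ran on this CPU for too long.
            return True, printed, other_work_runs

    # Budget exhausted; note that logbuf is still not empty at this point.
    return False, printed, other_work_runs
```

In this model the atomic run trips the (made-up) watchdog after 100 messages, while the preemptible run never does. The preemptible run also never empties the buffer, it only stops because the simulation budget runs out, which matches the point both sides agree on: moving the flush avoids the lockup, and the console still has to be fixed.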