From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933110AbaFLLu1 (ORCPT ); Thu, 12 Jun 2014 07:50:27 -0400 Received: from cantor2.suse.de ([195.135.220.15]:51990 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933018AbaFLLu0 (ORCPT ); Thu, 12 Jun 2014 07:50:26 -0400 Date: Thu, 12 Jun 2014 13:50:23 +0200 From: Petr =?iso-8859-1?Q?Ml=E1dek?= To: Frederic Weisbecker Cc: Jiri Kosina , Andrew Morton , Steven Rostedt , Dave Anderson , "Paul E. McKenney" , Kay Sievers , Michal Hocko , Jan Kara , linux-kernel@vger.kernel.org, Linus Torvalds Subject: Re: [RFC PATCH 00/11] printk: safe printing in NMI context Message-ID: <20140612115022.GM7772@pathway.suse.cz> References: <1399626665-29817-1-git-send-email-pmladek@suse.cz> <20140529000909.GC6507@localhost.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20140529000909.GC6507@localhost.localdomain> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 29-05-14 02:09:11, Frederic Weisbecker wrote: > On Thu, May 29, 2014 at 12:02:30AM +0200, Jiri Kosina wrote: > > On Fri, 9 May 2014, Petr Mladek wrote: > > > > > printk() cannot be used safely in NMI context because it uses internal locks > > > and thus could cause a deadlock. Unfortunately there are circumstances when > > > calling printk from NMI is very useful. For example, all WARN.*(in_nmi()) > > > would be much more helpful if they didn't lockup the machine. > > > > > > Another example would be arch_trigger_all_cpu_backtrace for x86 which uses NMI > > > to dump traces on all CPU (either triggered by sysrq+l or from RCU stall > > > detector). > > > > I am rather surprised that this patchset hasn't received a single review > > comment for 3 weeks. > > > > Let me point out that the issues Petr is talking about in the cover letter > > are real -- we've actually seen the lockups triggered by RCU stall > > detector trying to dump stacks on all CPUs, and hard-locking machine up > > while doing so. > > > > So this really needs to be solved. > > The lack of review may be partly due to a not very appealing changestat on an > old codebase that is already unpopular: > > Documentation/kernel-parameters.txt | 19 +- > kernel/printk/printk.c | 1218 +++++++++++++++++++++++++---------- > 2 files changed, 878 insertions(+), 359 deletions(-) > > > Your patches look clean and pretty nice actually. They must be seriously considered if > we want to keep the current locked ring buffer design and extend it to multiple per context > buffers. But I wonder if it's worth to continue that way with the printk ancient design. > > If it takes more than 1000 line changes (including 500 added) to make it finally work > correctly with NMIs by working around its fundamental flaws, shouldn't we rather redesign > it to use a lockless ring buffer like ftrace or perf ones? I like the idea of the lock less buffer. So I have spent some time to understand kernel/trace/ring_buffer.c and there are some challenges that would need to be solved. Some of them might be pretty hard :-( Here are the hardest missing features from my point of view: + storing the last message in all situations + reading from more locations in parallel + "aggressive" printing to console , see below for more details. Also note that the current code is already quite complex. There are many tricks to solve conflicts a lockless way and it might get worse if we want to solve the above stuff. -------- Bellow are more details that explain the above statements. But first, let me show how I understand the lock less ringbuffer: + there are separate buffers for each CPU + writers use a circular list of pages that are linked in both directions + writers reserve space before they copy the data + reader has an extra page that is not in the main ring and thus not accessible by writers + reader swap its page with another one from the main ring buffer when it wants to read some newer data + pages might have special flag/function: + reader: the separate page accessed by reader + head: page with the oldest data; the next one to be read + tail: page with the newest data; the next write will try to reserve the space here + commit: the newest page where we have already copied the data; it is usually the same as tail; the difference happens when the write is interrupted and followup pages are reserved and written; we must newer move tail over this page, otherwise we would reserve the same location twice, overwrite the data, and create mess I hope that I have got the above correctly. It is why I think that the missing features are hard to solve. Here are the details about the above mentioned problems: 1. storing the last message in all situations --------------------------------------------- The problem is if we reserve space in a normal context, get interrupted, and the interrupt wants to write more data than the size of the ring buffer. We must stop rotating when we hit the first reserved but not committed page. Here are the code pointers: ring_buffer_write() rb_reserve_next_event() rb_reserve_next() rb_move_tail() if (unlikely(next_page == commit_page)) { goto out_reset; This is must to have because the data are simply copied when the space is reserved, see memcpy(body, data, length) in ring_buffer_write() I think that we do not want this for printk(). The last messages are usually more important, especially in case of a fatal error. Possible solutions: + use different ring buffer for each context; it would need even more space + lock the page for the given context when a space is reserved; such locked page will be skipped when the buffer is rotated in a nested interrupt context; this will make the algorithm even more complicated; I am not sure if this would work at all + ignore this problem; each nested context should make sure that it does not use the whole buffer; it might even be realistic; we have separate buffer for each CPU; for example, one backtrace should should fit into one page; two pages are minimum...; but this is the type of assumption that might hit us in the future 2. reading from more locations in parallel ------------------------------------------ The printk() ring buffer is asynchronously accessed by different readers, e.g. console, syslog, /dev/kmsg. Each one might read from a different location. Possible solutions: + have more reader and header pages; it would create the algorithm even more complicated; I am not sure if it is possible at all + have another printk() ring buffer and a single reader that would copy the messages here; it is ugly; also I am not sure if we would save much from the current printk() code 3. "aggressive" printing to console ----------------------------------- printk() currently triggers console immediately when new data appears in the ring buffer. This might cause swapping the reader page even when there is only one entry. The result might be that each page would include only one line. Few lines might occupy a large ring buffer. Possible solution: Do some lazy reading but how? Best Regards, Petr