linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 00/11] printk: safe printing in NMI context
@ 2014-05-09  9:10 Petr Mladek
  2014-05-09  9:10 ` [RFC PATCH 01/11] printk: rename struct printk_log to printk_msg Petr Mladek
                   ` (11 more replies)
  0 siblings, 12 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

printk() cannot be used safely in NMI context because it takes internal locks
and thus could deadlock. Unfortunately, there are circumstances when calling
printk() from NMI is very useful. For example, all WARN.*(in_nmi()) would be
much more helpful if they didn't lock up the machine.

Another example is arch_trigger_all_cpu_backtrace() on x86, which uses NMI
to dump stack traces on all CPUs (triggered either by sysrq+l or by the RCU
stall detector).

This patch set solves the problem by using trylock rather than spin_lock.
If the lock cannot be taken, it uses NMI-specific buffers to temporarily
store the messages.

Patches 1-5 are preparation steps that allow the same functions to handle
more ring buffers.

Patch 6 adds the main logic that handles the NMI messages in a safe way.

Patches 7-11 improve various aspects of the NMI message handling.

It took a long time to reach a stable working solution that fulfills
the most important needs:

    + do not lock up the machine
    + prefer the latest messages if logging is too slow
    + pass the messages to the console as soon as possible
    + keep the order of messages, especially parts of continuation lines
    + reduce interleaving of messages from normal and NMI context

The current solution works pretty well. There are still some corner cases
where continuation lines get split. I still want to look into them, but
I do not want to keep the work hidden any longer. I look forward to hearing
your opinions and hints.

Note that the first two patches modify an API that is used by some external
tools, e.g. crash and makedumpfile. These tools will need to be updated.
The API was changed to make the solution cleaner.

I added Paul E. McKenney to CC because the 6th patch uses memory barriers.


The patch set is based on linux-next. The top commit is a42b108e06bb28348
(Add linux-next specific files for 20140507).

It can also be applied on Linus' tree if you first apply the recent
patches that touch kernel/printk/printk.c.

Petr Mladek (11):
  printk: rename struct printk_log to printk_msg
  printk: allow to handle more log buffers
  printk: rename "logbuf_lock" to "main_logbuf_lock"
  printk: add NMI ring and cont buffers
  printk: allow to modify NMI log buffer size using boot parameter
  printk: NMI safe printk
  printk: right ordering of the cont buffers from NMI context
  printk: try hard to print Oops message in NMI context
  printk: merge and flush NMI buffer predictably via IRQ work
  printk: survive rotation of sequence numbers
  printk: avoid staling when merging NMI log buffer

 Documentation/kernel-parameters.txt |   19 +-
 kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
 2 files changed, 878 insertions(+), 359 deletions(-)

-- 
1.8.4

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH 01/11] printk: rename struct printk_log to printk_msg
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
@ 2014-05-09  9:10 ` Petr Mladek
  2014-05-09  9:10 ` [RFC PATCH 02/11] printk: allow to handle more log buffers Petr Mladek
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

This is just a preparation patch for the NMI-safe printk implementation.
We will need an extra log buffer that will be used in NMI context when
the lock for the main ring buffer is already taken. We cannot wait for
the main lock because that could cause a deadlock.

A lot of information needs to be stored in the helper ring buffer, e.g.
facility, level, text, dict, and timestamp. It makes sense to reuse the
logic already implemented for the main log buffer.

This means that the same functions will need to work with different
log buffers. The code will be clearer when we wrap the log-buffer-related
variables into a struct.

This patch renames struct "printk_log" to "printk_msg". The old name was
pretty confusing because the struct contains information about a single
message; pointers to it are called "msg". The original name can then be
used for the struct that really describes the log buffer.

To keep things consistent, it also renames the helper functions that
manipulate the "printk_msg" struct and start with the "log_" prefix. Thus
"log_from_idx" is renamed to "msg_from_idx". On the other hand, "log_text"
and "log_dict" are renamed to "get_text" and "get_dict". This is clearer,
and more "set/get" functions will be needed for the NMI log buffer.

IMPORTANT: This change causes problems for external tools that access
these structures, e.g. crash and makedumpfile. They will need to be
updated.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 100 ++++++++++++++++++++++++-------------------------
 1 file changed, 50 insertions(+), 50 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 0fe25a11a034..e0fd7a1c0c1e 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -199,7 +199,7 @@ static int console_may_schedule;
  *         67                           "g"
  *   0032     00 00 00                  padding to next message header
  *
- * The 'struct printk_log' buffer header must never be directly exported to
+ * The 'struct printk_msg' buffer header must never be directly exported to
  * userspace, it is a kernel-private implementation detail that might
  * need to be changed in the future, when the requirements change.
  *
@@ -221,7 +221,7 @@ enum log_flags {
 	LOG_CONT	= 8,	/* text is a fragment of a continuation line */
 };
 
-struct printk_log {
+struct printk_msg {
 	u64 ts_nsec;		/* timestamp in nanoseconds */
 	u16 len;		/* length of entire record */
 	u16 text_len;		/* length of text buffer */
@@ -270,7 +270,7 @@ static u32 clear_idx;
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
 #define LOG_ALIGN 4
 #else
-#define LOG_ALIGN __alignof__(struct printk_log)
+#define LOG_ALIGN __alignof__(struct printk_msg)
 #endif
 #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
 static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
@@ -278,35 +278,35 @@ static char *log_buf = __log_buf;
 static u32 log_buf_len = __LOG_BUF_LEN;
 
 /* human readable text of the record */
-static char *log_text(const struct printk_log *msg)
+static char *get_text(const struct printk_msg *msg)
 {
-	return (char *)msg + sizeof(struct printk_log);
+	return (char *)msg + sizeof(struct printk_msg);
 }
 
 /* optional key/value pair dictionary attached to the record */
-static char *log_dict(const struct printk_log *msg)
+static char *get_dict(const struct printk_msg *msg)
 {
-	return (char *)msg + sizeof(struct printk_log) + msg->text_len;
+	return (char *)msg + sizeof(struct printk_msg) + msg->text_len;
 }
 
 /* get record by index; idx must point to valid msg */
-static struct printk_log *log_from_idx(u32 idx)
+static struct printk_msg *msg_from_idx(u32 idx)
 {
-	struct printk_log *msg = (struct printk_log *)(log_buf + idx);
+	struct printk_msg *msg = (struct printk_msg *)(log_buf + idx);
 
 	/*
 	 * A length == 0 record is the end of buffer marker. Wrap around and
 	 * read the message at the start of the buffer.
 	 */
 	if (!msg->len)
-		return (struct printk_log *)log_buf;
+		return (struct printk_msg *)log_buf;
 	return msg;
 }
 
 /* get next record; idx must point to valid msg */
 static u32 log_next(u32 idx)
 {
-	struct printk_log *msg = (struct printk_log *)(log_buf + idx);
+	struct printk_msg *msg = (struct printk_msg *)(log_buf + idx);
 
 	/* length == 0 indicates the end of the buffer; wrap */
 	/*
@@ -315,7 +315,7 @@ static u32 log_next(u32 idx)
 	 * return the one after that.
 	 */
 	if (!msg->len) {
-		msg = (struct printk_log *)log_buf;
+		msg = (struct printk_msg *)log_buf;
 		return msg->len;
 	}
 	return idx + msg->len;
@@ -343,7 +343,7 @@ static int logbuf_has_space(u32 msg_size, bool empty)
 	 * We need space also for an empty header that signalizes wrapping
 	 * of the buffer.
 	 */
-	return free >= msg_size + sizeof(struct printk_log);
+	return free >= msg_size + sizeof(struct printk_msg);
 }
 
 static int log_make_free_space(u32 msg_size)
@@ -368,7 +368,7 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
 {
 	u32 size;
 
-	size = sizeof(struct printk_log) + text_len + dict_len;
+	size = sizeof(struct printk_msg) + text_len + dict_len;
 	*pad_len = (-size) & (LOG_ALIGN - 1);
 	size += *pad_len;
 
@@ -407,7 +407,7 @@ static int log_store(int facility, int level,
 		     const char *dict, u16 dict_len,
 		     const char *text, u16 text_len)
 {
-	struct printk_log *msg;
+	struct printk_msg *msg;
 	u32 size, pad_len;
 	u16 trunc_msg_len = 0;
 
@@ -423,25 +423,25 @@ static int log_store(int facility, int level,
 			return 0;
 	}
 
-	if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
+	if (log_next_idx + size + sizeof(struct printk_msg) > log_buf_len) {
 		/*
 		 * This message + an additional empty header does not fit
 		 * at the end of the buffer. Add an empty header with len == 0
 		 * to signify a wrap around.
 		 */
-		memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
+		memset(log_buf + log_next_idx, 0, sizeof(struct printk_msg));
 		log_next_idx = 0;
 	}
 
 	/* fill message */
-	msg = (struct printk_log *)(log_buf + log_next_idx);
-	memcpy(log_text(msg), text, text_len);
+	msg = (struct printk_msg *)(log_buf + log_next_idx);
+	memcpy(get_text(msg), text, text_len);
 	msg->text_len = text_len;
 	if (trunc_msg_len) {
-		memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len);
+		memcpy(get_text(msg) + text_len, trunc_msg, trunc_msg_len);
 		msg->text_len += trunc_msg_len;
 	}
-	memcpy(log_dict(msg), dict, dict_len);
+	memcpy(get_dict(msg), dict, dict_len);
 	msg->dict_len = dict_len;
 	msg->facility = facility;
 	msg->level = level & 7;
@@ -450,7 +450,7 @@ static int log_store(int facility, int level,
 		msg->ts_nsec = ts_nsec;
 	else
 		msg->ts_nsec = local_clock();
-	memset(log_dict(msg) + dict_len, 0, pad_len);
+	memset(get_dict(msg) + dict_len, 0, pad_len);
 	msg->len = size;
 
 	/* insert message */
@@ -576,7 +576,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
 	struct devkmsg_user *user = file->private_data;
-	struct printk_log *msg;
+	struct printk_msg *msg;
 	u64 ts_usec;
 	size_t i;
 	char cont = '-';
@@ -614,7 +614,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 		goto out;
 	}
 
-	msg = log_from_idx(user->idx);
+	msg = msg_from_idx(user->idx);
 	ts_usec = msg->ts_nsec;
 	do_div(ts_usec, 1000);
 
@@ -639,7 +639,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 
 	/* escape non-printable characters */
 	for (i = 0; i < msg->text_len; i++) {
-		unsigned char c = log_text(msg)[i];
+		unsigned char c = get_text(msg)[i];
 
 		if (c < ' ' || c >= 127 || c == '\\')
 			len += sprintf(user->buf + len, "\\x%02x", c);
@@ -652,7 +652,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 		bool line = true;
 
 		for (i = 0; i < msg->dict_len; i++) {
-			unsigned char c = log_dict(msg)[i];
+			unsigned char c = get_dict(msg)[i];
 
 			if (line) {
 				user->buf[len++] = ' ';
@@ -821,14 +821,14 @@ void log_buf_kexec_setup(void)
 	VMCOREINFO_SYMBOL(log_first_idx);
 	VMCOREINFO_SYMBOL(log_next_idx);
 	/*
-	 * Export struct printk_log size and field offsets. User space tools can
+	 * Export struct printk_msg size and field offsets. User space tools can
 	 * parse it and detect any changes to structure down the line.
 	 */
-	VMCOREINFO_STRUCT_SIZE(printk_log);
-	VMCOREINFO_OFFSET(printk_log, ts_nsec);
-	VMCOREINFO_OFFSET(printk_log, len);
-	VMCOREINFO_OFFSET(printk_log, text_len);
-	VMCOREINFO_OFFSET(printk_log, dict_len);
+	VMCOREINFO_STRUCT_SIZE(printk_msg);
+	VMCOREINFO_OFFSET(printk_msg, ts_nsec);
+	VMCOREINFO_OFFSET(printk_msg, len);
+	VMCOREINFO_OFFSET(printk_msg, text_len);
+	VMCOREINFO_OFFSET(printk_msg, dict_len);
 }
 #endif
 
@@ -977,7 +977,7 @@ static size_t print_time(u64 ts, char *buf)
 		       (unsigned long)ts, rem_nsec / 1000);
 }
 
-static size_t print_prefix(const struct printk_log *msg, bool syslog, char *buf)
+static size_t print_prefix(const struct printk_msg *msg, bool syslog, char *buf)
 {
 	size_t len = 0;
 	unsigned int prefix = (msg->facility << 3) | msg->level;
@@ -1000,10 +1000,10 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog, char *buf)
 	return len;
 }
 
-static size_t msg_print_text(const struct printk_log *msg, enum log_flags prev,
+static size_t msg_print_text(const struct printk_msg *msg, enum log_flags prev,
 			     bool syslog, char *buf, size_t size)
 {
-	const char *text = log_text(msg);
+	const char *text = get_text(msg);
 	size_t text_size = msg->text_len;
 	bool prefix = true;
 	bool newline = true;
@@ -1062,7 +1062,7 @@ static size_t msg_print_text(const struct printk_log *msg, enum log_flags prev,
 static int syslog_print(char __user *buf, int size)
 {
 	char *text;
-	struct printk_log *msg;
+	struct printk_msg *msg;
 	int len = 0;
 
 	text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
@@ -1087,7 +1087,7 @@ static int syslog_print(char __user *buf, int size)
 		}
 
 		skip = syslog_partial;
-		msg = log_from_idx(syslog_idx);
+		msg = msg_from_idx(syslog_idx);
 		n = msg_print_text(msg, syslog_prev, true, text,
 				   LOG_LINE_MAX + PREFIX_MAX);
 		if (n - syslog_partial <= size) {
@@ -1153,7 +1153,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 		idx = clear_idx;
 		prev = 0;
 		while (seq < log_next_seq) {
-			struct printk_log *msg = log_from_idx(idx);
+			struct printk_msg *msg = msg_from_idx(idx);
 
 			len += msg_print_text(msg, prev, true, NULL, 0);
 			prev = msg->flags;
@@ -1166,7 +1166,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 		idx = clear_idx;
 		prev = 0;
 		while (len > size && seq < log_next_seq) {
-			struct printk_log *msg = log_from_idx(idx);
+			struct printk_msg *msg = msg_from_idx(idx);
 
 			len -= msg_print_text(msg, prev, true, NULL, 0);
 			prev = msg->flags;
@@ -1179,7 +1179,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 
 		len = 0;
 		while (len >= 0 && seq < next_seq) {
-			struct printk_log *msg = log_from_idx(idx);
+			struct printk_msg *msg = msg_from_idx(idx);
 			int textlen;
 
 			textlen = msg_print_text(msg, prev, true, text,
@@ -1325,7 +1325,7 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
 
 			error = 0;
 			while (seq < log_next_seq) {
-				struct printk_log *msg = log_from_idx(idx);
+				struct printk_msg *msg = msg_from_idx(idx);
 
 				error += msg_print_text(msg, prev, true, NULL, 0);
 				idx = log_next(idx);
@@ -1833,10 +1833,10 @@ static struct cont {
 	u8 level;
 	bool flushed:1;
 } cont;
-static struct printk_log *log_from_idx(u32 idx) { return NULL; }
+static struct printk_msg *msg_from_idx(u32 idx) { return NULL; }
 static u32 log_next(u32 idx) { return 0; }
 static void call_console_drivers(int level, const char *text, size_t len) {}
-static size_t msg_print_text(const struct printk_log *msg, enum log_flags prev,
+static size_t msg_print_text(const struct printk_msg *msg, enum log_flags prev,
 			     bool syslog, char *buf, size_t size) { return 0; }
 static size_t cont_print_text(char *text, size_t size) { return 0; }
 
@@ -2146,7 +2146,7 @@ void console_unlock(void)
 	console_cont_flush(text, sizeof(text));
 again:
 	for (;;) {
-		struct printk_log *msg;
+		struct printk_msg *msg;
 		u64 console_end_seq;
 		size_t len;
 		int level;
@@ -2174,7 +2174,7 @@ skip:
 		if (console_seq == log_next_seq)
 			break;
 
-		msg = log_from_idx(console_idx);
+		msg = msg_from_idx(console_idx);
 		if (msg->flags & LOG_NOCONS) {
 			/*
 			 * Skip record we have buffered and already printed
@@ -2775,7 +2775,7 @@ void kmsg_dump(enum kmsg_dump_reason reason)
 bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
 			       char *line, size_t size, size_t *len)
 {
-	struct printk_log *msg;
+	struct printk_msg *msg;
 	size_t l = 0;
 	bool ret = false;
 
@@ -2792,7 +2792,7 @@ bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
 	if (dumper->cur_seq >= log_next_seq)
 		goto out;
 
-	msg = log_from_idx(dumper->cur_idx);
+	msg = msg_from_idx(dumper->cur_idx);
 	l = msg_print_text(msg, 0, syslog, line, size);
 
 	dumper->cur_idx = log_next(dumper->cur_idx);
@@ -2887,7 +2887,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 	idx = dumper->cur_idx;
 	prev = 0;
 	while (seq < dumper->next_seq) {
-		struct printk_log *msg = log_from_idx(idx);
+		struct printk_msg *msg = msg_from_idx(idx);
 
 		l += msg_print_text(msg, prev, true, NULL, 0);
 		idx = log_next(idx);
@@ -2900,7 +2900,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 	idx = dumper->cur_idx;
 	prev = 0;
 	while (l > size && seq < dumper->next_seq) {
-		struct printk_log *msg = log_from_idx(idx);
+		struct printk_msg *msg = msg_from_idx(idx);
 
 		l -= msg_print_text(msg, prev, true, NULL, 0);
 		idx = log_next(idx);
@@ -2914,7 +2914,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 
 	l = 0;
 	while (seq < dumper->next_seq) {
-		struct printk_log *msg = log_from_idx(idx);
+		struct printk_msg *msg = msg_from_idx(idx);
 
 		l += msg_print_text(msg, prev, syslog, buf + l, size - l);
 		idx = log_next(idx);
-- 
1.8.4



* [RFC PATCH 02/11] printk: allow to handle more log buffers
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
  2014-05-09  9:10 ` [RFC PATCH 01/11] printk: rename struct printk_log to printk_msg Petr Mladek
@ 2014-05-09  9:10 ` Petr Mladek
  2014-05-09  9:10 ` [RFC PATCH 03/11] printk: rename "logbuf_lock" to "main_logbuf_lock" Petr Mladek
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

This is another preparation patch for the NMI-safe printk implementation.
An extra log buffer will be used in NMI context when the lock for the main
ring buffer is already taken. It is not possible to wait for the main lock
because that could cause a deadlock.

This patch creates struct "printk_log" and puts into it all variables that
are needed to handle the log buffer, including a pointer to the "cont"
buffer. This is why it renames "struct cont" to "struct printk_cont" and
moves the declaration to the top.

Unfortunately, the index and sequence values will need to be modified
without a lock when the NMI log buffer is merged back into the main log
buffer. The only safe solution is to read and write both the index and the
sequence number atomically. This means that both values together need to
fit into an "unsigned long".

Using "unsigned long" for both "idx" and "seq" is fine for the temporary
NMI log buffer, but it would limit the main log buffer too much, especially
on 32-bit architectures. Therefore, we put these values into the separate
"struct printk_main_log_pos" and embed it in "struct printk_log" via
a union.

The global buffers are declared as "main_cont" and "main_log" so that they
are clearly distinguished from the NMI variants.

The patch also introduces several helper functions to make it easier to
access the "idx" and "seq" values. Note that they will become more
complicated once we introduce the NMI log buffer, where both values need
to be updated atomically.

In addition, the function log_next() is renamed to inc_idx(). The intention
is to make it consistent with the new inc_seq(). Note that the "next_"
prefix could not be used because "next_idx" and "next_seq" are already used
to mark the last message in the ring buffer.

The follow-up patches will add "nmi_log" and "nmi_cont", which will be used
to temporarily store messages printed in NMI context. They will be merged
into the main log buffer when "logbuf_lock" is available.

The functions working with the console, syslog, and kmsg deal only with
the main ring buffer, so we can hardcode it there. Only the functions that
store information into a ring buffer work with the selected ring buffer,
and we need to pass it as a parameter.

IMPORTANT: This change causes problems for external tools that access
these structures, e.g. crash and makedumpfile. They will need to be
updated.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 530 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 324 insertions(+), 206 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index e0fd7a1c0c1e..c560ebdecc04 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -148,6 +148,13 @@ EXPORT_SYMBOL(console_set_on_cmdline);
 static int console_may_schedule;
 
 /*
+ * The logbuf_lock protects kmsg buffer, indexes, counters. This can be taken
+ * within the scheduler's rq lock. It must be released before calling
+ * console_unlock() or anything else that might wake up a process.
+ */
+static DEFINE_RAW_SPINLOCK(logbuf_lock);
+
+/*
  * The printk log buffer consists of a chain of concatenated variable
  * length records. Every record starts with a record header, containing
  * the overall length of the record.
@@ -231,29 +238,51 @@ struct printk_msg {
 	u8 level:3;		/* syslog level */
 };
 
+#ifdef CONFIG_PRINTK
 /*
- * The logbuf_lock protects kmsg buffer, indices, counters.  This can be taken
- * within the scheduler's rq lock. It must be released before calling
- * console_unlock() or anything else that might wake up a process.
+ * Continuation lines are buffered, and not committed to the record buffer
+ * until the line is complete, or a race forces it. The line fragments
+ * though, are printed immediately to the consoles to ensure everything has
+ * reached the console in case of a kernel crash.
  */
-static DEFINE_RAW_SPINLOCK(logbuf_lock);
+struct printk_cont {
+	char *buf;
+	size_t len;			/* length == 0 means unused buffer */
+	size_t cons;			/* bytes written to console */
+	struct task_struct *owner;	/* task of first print*/
+	u64 ts_nsec;			/* time of first print */
+	u8 level;			/* log level of first message */
+	u8 facility;			/* log level of first message */
+	enum log_flags flags;		/* prefix, newline flags */
+	bool flushed:1;			/* buffer sealed and committed */
+};
+
+/* positions of messages in the main log buffer */
+struct printk_main_log_pos {
+	u64 first_seq;		/* sequence number of the first record */
+	u32 first_idx;		/* index of the first record */
+	u64 next_seq;		/* sequence number of the next rec. to store */
+	u32 next_idx;		/* index of the next record to store */
+};
+
+/* information needed to manipulate the log buffer */
+struct printk_log {
+	struct printk_cont *cont;	/* merging continuous message */
+	char *buf;			/* ring buffer */
+	u32 buf_len;			/* size of the ring buffer */
+	union {
+		struct printk_main_log_pos main;
+	};
+};
 
-#ifdef CONFIG_PRINTK
 DECLARE_WAIT_QUEUE_HEAD(log_wait);
+
 /* the next printk record to read by syslog(READ) or /proc/kmsg */
 static u64 syslog_seq;
 static u32 syslog_idx;
 static enum log_flags syslog_prev;
 static size_t syslog_partial;
 
-/* index and sequence number of the first record stored in the buffer */
-static u64 log_first_seq;
-static u32 log_first_idx;
-
-/* index and sequence number of the next record to store in the buffer */
-static u64 log_next_seq;
-static u32 log_next_idx;
-
 /* the next printk record to write to the console */
 static u64 console_seq;
 static u32 console_idx;
@@ -274,8 +303,50 @@ static u32 clear_idx;
 #endif
 #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
 static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
-static char *log_buf = __log_buf;
-static u32 log_buf_len = __LOG_BUF_LEN;
+static char __main_cont_buf[LOG_LINE_MAX];
+
+static struct printk_cont main_cont = {
+	.buf = __main_cont_buf,
+};
+
+static struct printk_log main_log = {
+	.buf = __log_buf,
+	.buf_len = __LOG_BUF_LEN,
+	.cont = &main_cont,
+};
+
+/*
+ * Define functions needed to get the position values,
+ * for example, first_idx. Possible values are:
+ *	+ side: "first", "next"
+ *	+ pos: "idx", "seq"
+ */
+#define DEFINE_GET_POS(rettype, funcname, side, pos)		\
+static rettype funcname(const struct printk_log *log)		\
+{								\
+	return log->main.side##_##pos;				\
+}
+
+DEFINE_GET_POS(u32, get_first_idx, first, idx)
+DEFINE_GET_POS(u64, get_first_seq, first, seq)
+DEFINE_GET_POS(u32, get_next_idx, next, idx)
+DEFINE_GET_POS(u64, get_next_seq, next, seq)
+
+
+/*
+ * Define functions needed to set the position values,
+ * for example, first_idx. Possible values are:
+ *	+ side: "first", "next"
+ */
+#define DEFINE_SET_POS(funcname, side)					\
+static void funcname(struct printk_log *log, u32 idx, u64 seq)		\
+{									\
+	log->main.side##_idx = idx;					\
+	log->main.side##_seq = seq;				\
+}
+
+DEFINE_SET_POS(set_first_pos, first)
+DEFINE_SET_POS(set_next_pos, next)
 
 /* human readable text of the record */
 static char *get_text(const struct printk_msg *msg)
@@ -290,23 +361,23 @@ static char *get_dict(const struct printk_msg *msg)
 }
 
 /* get record by index; idx must point to valid msg */
-static struct printk_msg *msg_from_idx(u32 idx)
+static struct printk_msg *msg_from_idx(struct printk_log *log, u32 idx)
 {
-	struct printk_msg *msg = (struct printk_msg *)(log_buf + idx);
+	struct printk_msg *msg = (struct printk_msg *)(log->buf + idx);
 
 	/*
 	 * A length == 0 record is the end of buffer marker. Wrap around and
 	 * read the message at the start of the buffer.
 	 */
 	if (!msg->len)
-		return (struct printk_msg *)log_buf;
+		return (struct printk_msg *)log->buf;
 	return msg;
 }
 
 /* get next record; idx must point to valid msg */
-static u32 log_next(u32 idx)
+static u32 inc_idx(struct printk_log *log, u32 idx)
 {
-	struct printk_msg *msg = (struct printk_msg *)(log_buf + idx);
+	struct printk_msg *msg = (struct printk_msg *)(log->buf + idx);
 
 	/* length == 0 indicates the end of the buffer; wrap */
 	/*
@@ -315,12 +386,41 @@ static u32 log_next(u32 idx)
 	 * return the one after that.
 	 */
 	if (!msg->len) {
-		msg = (struct printk_msg *)log_buf;
+		msg = (struct printk_msg *)log->buf;
 		return msg->len;
 	}
 	return idx + msg->len;
 }
 
+/* get next sequence number for the given one */
+static u64 inc_seq(struct printk_log *log, u64 seq)
+{
+	return ++seq;
+}
+
+/*
+ * Define helper functions to move the position to the next message
+ * a safe way. Possible values are:
+ *	+ side: "first", "next"
+ */
+#define GENERATE_INC_POS(funcname, side)	\
+static void funcname(struct printk_log *log)	\
+{						\
+	u32 idx;				\
+	u64 seq;				\
+						\
+	idx = get_##side##_idx(log);		\
+	seq = get_##side##_seq(log);		\
+						\
+	idx = inc_idx(log, idx);		\
+	seq = inc_seq(log, seq);		\
+						\
+	set_##side##_pos(log, idx, seq);	\
+}
+
+GENERATE_INC_POS(inc_first_pos, first)
+GENERATE_INC_POS(inc_next_pos, next)
+
 /*
  * Check whether there is enough free space for the given message.
  *
@@ -330,14 +430,15 @@ static u32 log_next(u32 idx)
  * If the buffer is empty, we must respect the position of the indexes.
  * They cannot be reset to the beginning of the buffer.
  */
-static int logbuf_has_space(u32 msg_size, bool empty)
+static int logbuf_has_space(struct printk_log *log, u32 msg_size, bool empty)
 {
 	u32 free;
 
-	if (log_next_idx > log_first_idx || empty)
-		free = max(log_buf_len - log_next_idx, log_first_idx);
+	if (get_next_idx(log) > get_first_idx(log) || empty)
+		free = max(log->buf_len - get_next_idx(log),
+			   get_first_idx(log));
 	else
-		free = log_first_idx - log_next_idx;
+		free = get_first_idx(log) - get_next_idx(log);
 
 	/*
 	 * We need space also for an empty header that signalizes wrapping
@@ -346,18 +447,17 @@ static int logbuf_has_space(u32 msg_size, bool empty)
 	return free >= msg_size + sizeof(struct printk_msg);
 }
 
-static int log_make_free_space(u32 msg_size)
+static int log_make_free_space(struct printk_log *log, u32 msg_size)
 {
-	while (log_first_seq < log_next_seq) {
-		if (logbuf_has_space(msg_size, false))
+	while (get_first_seq(log) < get_next_seq(log)) {
+		if (logbuf_has_space(log, msg_size, false))
 			return 0;
 		/* drop old messages until we have enough continuous space */
-		log_first_idx = log_next(log_first_idx);
-		log_first_seq++;
+		inc_first_pos(log);
 	}
 
 	/* sequence numbers are equal, so the log buffer is empty */
-	if (logbuf_has_space(msg_size, true))
+	if (logbuf_has_space(log, msg_size, true))
 		return 0;
 
 	return -ENOMEM;
@@ -383,14 +483,15 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
 #define MAX_LOG_TAKE_PART 4
 static const char trunc_msg[] = "<truncated>";
 
-static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
+static u32 truncate_msg(struct printk_log *log,
+			u16 *text_len, u16 *trunc_msg_len,
 			u16 *dict_len, u32 *pad_len)
 {
 	/*
 	 * The message should not take the whole buffer. Otherwise, it might
 	 * get removed too soon.
 	 */
-	u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
+	u32 max_text_len = log->buf_len / MAX_LOG_TAKE_PART;
 	if (*text_len > max_text_len)
 		*text_len = max_text_len;
 	/* enable the warning message */
@@ -402,7 +503,7 @@ static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
 }
 
 /* insert record into the buffer, discard old ones, update heads */
-static int log_store(int facility, int level,
+static int log_store(struct printk_log *log, int facility, int level,
 		     enum log_flags flags, u64 ts_nsec,
 		     const char *dict, u16 dict_len,
 		     const char *text, u16 text_len)
@@ -414,27 +515,29 @@ static int log_store(int facility, int level,
 	/* number of '\0' padding bytes to next message */
 	size = msg_used_size(text_len, dict_len, &pad_len);
 
-	if (log_make_free_space(size)) {
+	if (log_make_free_space(log, size)) {
 		/* truncate the message if it is too long for empty buffer */
-		size = truncate_msg(&text_len, &trunc_msg_len,
+		size = truncate_msg(log, &text_len, &trunc_msg_len,
 				    &dict_len, &pad_len);
 		/* survive when the log buffer is too small for trunc_msg */
-		if (log_make_free_space(size))
+		if (log_make_free_space(log, size))
 			return 0;
 	}
 
-	if (log_next_idx + size + sizeof(struct printk_msg) > log_buf_len) {
+	if (get_next_idx(log) + size + sizeof(struct printk_msg) >
+	    log->buf_len) {
 		/*
 		 * This message + an additional empty header does not fit
 		 * at the end of the buffer. Add an empty header with len == 0
 		 * to signify a wrap around.
 		 */
-		memset(log_buf + log_next_idx, 0, sizeof(struct printk_msg));
-		log_next_idx = 0;
+		memset(log->buf + get_next_idx(log), 0,
+		       sizeof(struct printk_msg));
+		set_next_pos(log, 0, get_next_seq(log));
 	}
 
 	/* fill message */
-	msg = (struct printk_msg *)(log_buf + log_next_idx);
+	msg = (struct printk_msg *)(log->buf + get_next_idx(log));
 	memcpy(get_text(msg), text, text_len);
 	msg->text_len = text_len;
 	if (trunc_msg_len) {
@@ -454,8 +557,7 @@ static int log_store(int facility, int level,
 	msg->len = size;
 
 	/* insert message */
-	log_next_idx += msg->len;
-	log_next_seq++;
+	inc_next_pos(log);
 
 	return msg->text_len;
 }
@@ -590,7 +692,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 	if (ret)
 		return ret;
 	raw_spin_lock_irq(&logbuf_lock);
-	while (user->seq == log_next_seq) {
+	while (user->seq == get_next_seq(&main_log)) {
 		if (file->f_flags & O_NONBLOCK) {
 			ret = -EAGAIN;
 			raw_spin_unlock_irq(&logbuf_lock);
@@ -599,22 +701,23 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 
 		raw_spin_unlock_irq(&logbuf_lock);
 		ret = wait_event_interruptible(log_wait,
-					       user->seq != log_next_seq);
+					       user->seq !=
+					       get_next_seq(&main_log));
 		if (ret)
 			goto out;
 		raw_spin_lock_irq(&logbuf_lock);
 	}
 
-	if (user->seq < log_first_seq) {
+	if (user->seq < get_first_seq(&main_log)) {
 		/* our last seen message is gone, return error and reset */
-		user->idx = log_first_idx;
-		user->seq = log_first_seq;
+		user->idx = get_first_idx(&main_log);
+		user->seq = get_first_seq(&main_log);
 		ret = -EPIPE;
 		raw_spin_unlock_irq(&logbuf_lock);
 		goto out;
 	}
 
-	msg = msg_from_idx(user->idx);
+	msg = msg_from_idx(&main_log, user->idx);
 	ts_usec = msg->ts_nsec;
 	do_div(ts_usec, 1000);
 
@@ -675,7 +778,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 		user->buf[len++] = '\n';
 	}
 
-	user->idx = log_next(user->idx);
+	user->idx = inc_idx(&main_log, user->idx);
 	user->seq++;
 	raw_spin_unlock_irq(&logbuf_lock);
 
@@ -708,8 +811,8 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
 	switch (whence) {
 	case SEEK_SET:
 		/* the first record */
-		user->idx = log_first_idx;
-		user->seq = log_first_seq;
+		user->idx = get_first_idx(&main_log);
+		user->seq = get_first_seq(&main_log);
 		break;
 	case SEEK_DATA:
 		/*
@@ -722,8 +825,8 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
 		break;
 	case SEEK_END:
 		/* after the last record */
-		user->idx = log_next_idx;
-		user->seq = log_next_seq;
+		user->idx = get_next_idx(&main_log);
+		user->seq = get_next_seq(&main_log);
 		break;
 	default:
 		ret = -EINVAL;
@@ -743,9 +846,9 @@ static unsigned int devkmsg_poll(struct file *file, poll_table *wait)
 	poll_wait(file, &log_wait, wait);
 
 	raw_spin_lock_irq(&logbuf_lock);
-	if (user->seq < log_next_seq) {
+	if (user->seq < get_next_seq(&main_log)) {
 		/* return error when data has vanished underneath us */
-		if (user->seq < log_first_seq)
+		if (user->seq < get_first_seq(&main_log))
 			ret = POLLIN|POLLRDNORM|POLLERR|POLLPRI;
 		else
 			ret = POLLIN|POLLRDNORM;
@@ -776,8 +879,8 @@ static int devkmsg_open(struct inode *inode, struct file *file)
 	mutex_init(&user->lock);
 
 	raw_spin_lock_irq(&logbuf_lock);
-	user->idx = log_first_idx;
-	user->seq = log_first_seq;
+	user->idx = get_first_idx(&main_log);
+	user->seq = get_first_seq(&main_log);
 	raw_spin_unlock_irq(&logbuf_lock);
 
 	file->private_data = user;
@@ -816,10 +919,13 @@ const struct file_operations kmsg_fops = {
  */
 void log_buf_kexec_setup(void)
 {
-	VMCOREINFO_SYMBOL(log_buf);
-	VMCOREINFO_SYMBOL(log_buf_len);
-	VMCOREINFO_SYMBOL(log_first_idx);
-	VMCOREINFO_SYMBOL(log_next_idx);
+	VMCOREINFO_STRUCT_SIZE(printk_log);
+	VMCOREINFO_OFFSET(printk_log, buf);
+	VMCOREINFO_OFFSET(printk_log, buf_len);
+	VMCOREINFO_OFFSET(printk_log, main);
+	VMCOREINFO_STRUCT_SIZE(printk_main_log_pos);
+	VMCOREINFO_OFFSET(printk_main_log_pos, first_idx);
+	VMCOREINFO_OFFSET(printk_main_log_pos, next_idx);
 	/*
 	 * Export struct printk_msg size and field offsets. User space tools can
 	 * parse it and detect any changes to structure down the line.
@@ -842,7 +948,7 @@ static int __init log_buf_len_setup(char *str)
 
 	if (size)
 		size = roundup_pow_of_two(size);
-	if (size > log_buf_len)
+	if (size > main_log.buf_len)
 		new_log_buf_len = size;
 
 	return 0;
@@ -872,14 +978,14 @@ void __init setup_log_buf(int early)
 	}
 
 	raw_spin_lock_irqsave(&logbuf_lock, flags);
-	log_buf_len = new_log_buf_len;
-	log_buf = new_log_buf;
+	main_log.buf_len = new_log_buf_len;
+	main_log.buf = new_log_buf;
 	new_log_buf_len = 0;
-	free = __LOG_BUF_LEN - log_next_idx;
-	memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
+	free = __LOG_BUF_LEN - get_next_idx(&main_log);
+	memcpy(main_log.buf, __log_buf, __LOG_BUF_LEN);
 	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
 
-	pr_info("log_buf_len: %d\n", log_buf_len);
+	pr_info("log_buf_len: %d\n", main_log.buf_len);
 	pr_info("early log buf free: %d(%d%%)\n",
 		free, (free * 100) / __LOG_BUF_LEN);
 }
@@ -1074,25 +1180,25 @@ static int syslog_print(char __user *buf, int size)
 		size_t skip;
 
 		raw_spin_lock_irq(&logbuf_lock);
-		if (syslog_seq < log_first_seq) {
+		if (syslog_seq < get_first_seq(&main_log)) {
 			/* messages are gone, move to first one */
-			syslog_seq = log_first_seq;
-			syslog_idx = log_first_idx;
+			syslog_seq = get_first_seq(&main_log);
+			syslog_idx = get_first_idx(&main_log);
 			syslog_prev = 0;
 			syslog_partial = 0;
 		}
-		if (syslog_seq == log_next_seq) {
+		if (syslog_seq == get_next_seq(&main_log)) {
 			raw_spin_unlock_irq(&logbuf_lock);
 			break;
 		}
 
 		skip = syslog_partial;
-		msg = msg_from_idx(syslog_idx);
+		msg = msg_from_idx(&main_log, syslog_idx);
 		n = msg_print_text(msg, syslog_prev, true, text,
 				   LOG_LINE_MAX + PREFIX_MAX);
 		if (n - syslog_partial <= size) {
 			/* message fits into buffer, move forward */
-			syslog_idx = log_next(syslog_idx);
+			syslog_idx = inc_idx(&main_log, syslog_idx);
 			syslog_seq++;
 			syslog_prev = msg->flags;
 			n -= syslog_partial;
@@ -1139,10 +1245,10 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 		u32 idx;
 		enum log_flags prev;
 
-		if (clear_seq < log_first_seq) {
+		if (clear_seq < get_first_seq(&main_log)) {
 			/* messages are gone, move to first available one */
-			clear_seq = log_first_seq;
-			clear_idx = log_first_idx;
+			clear_seq = get_first_seq(&main_log);
+			clear_idx = get_first_idx(&main_log);
 		}
 
 		/*
@@ -1152,12 +1258,12 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 		seq = clear_seq;
 		idx = clear_idx;
 		prev = 0;
-		while (seq < log_next_seq) {
-			struct printk_msg *msg = msg_from_idx(idx);
+		while (seq < get_next_seq(&main_log)) {
+			struct printk_msg *msg = msg_from_idx(&main_log, idx);
 
 			len += msg_print_text(msg, prev, true, NULL, 0);
 			prev = msg->flags;
-			idx = log_next(idx);
+			idx = inc_idx(&main_log, idx);
 			seq++;
 		}
 
@@ -1165,21 +1271,21 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 		seq = clear_seq;
 		idx = clear_idx;
 		prev = 0;
-		while (len > size && seq < log_next_seq) {
-			struct printk_msg *msg = msg_from_idx(idx);
+		while (len > size && seq < get_next_seq(&main_log)) {
+			struct printk_msg *msg = msg_from_idx(&main_log, idx);
 
 			len -= msg_print_text(msg, prev, true, NULL, 0);
 			prev = msg->flags;
-			idx = log_next(idx);
+			idx = inc_idx(&main_log, idx);
 			seq++;
 		}
 
 		/* last message fitting into this dump */
-		next_seq = log_next_seq;
+		next_seq = get_next_seq(&main_log);
 
 		len = 0;
 		while (len >= 0 && seq < next_seq) {
-			struct printk_msg *msg = msg_from_idx(idx);
+			struct printk_msg *msg = msg_from_idx(&main_log, idx);
 			int textlen;
 
 			textlen = msg_print_text(msg, prev, true, text,
@@ -1188,7 +1294,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 				len = textlen;
 				break;
 			}
-			idx = log_next(idx);
+			idx = inc_idx(&main_log, idx);
 			seq++;
 			prev = msg->flags;
 
@@ -1199,18 +1305,18 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 				len += textlen;
 			raw_spin_lock_irq(&logbuf_lock);
 
-			if (seq < log_first_seq) {
+			if (seq < get_first_seq(&main_log)) {
 				/* messages are gone, move to next one */
-				seq = log_first_seq;
-				idx = log_first_idx;
+				seq = get_first_seq(&main_log);
+				idx = get_first_idx(&main_log);
 				prev = 0;
 			}
 		}
 	}
 
 	if (clear) {
-		clear_seq = log_next_seq;
-		clear_idx = log_next_idx;
+		clear_seq = get_next_seq(&main_log);
+		clear_idx = get_next_idx(&main_log);
 	}
 	raw_spin_unlock_irq(&logbuf_lock);
 
@@ -1249,7 +1355,8 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
 			goto out;
 		}
 		error = wait_event_interruptible(log_wait,
-						 syslog_seq != log_next_seq);
+						 syslog_seq !=
+						 get_next_seq(&main_log));
 		if (error)
 			goto out;
 		error = syslog_print(buf, len);
@@ -1304,10 +1411,10 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
 	/* Number of chars in the log buffer */
 	case SYSLOG_ACTION_SIZE_UNREAD:
 		raw_spin_lock_irq(&logbuf_lock);
-		if (syslog_seq < log_first_seq) {
+		if (syslog_seq < get_first_seq(&main_log)) {
 			/* messages are gone, move to first one */
-			syslog_seq = log_first_seq;
-			syslog_idx = log_first_idx;
+			syslog_seq = get_first_seq(&main_log);
+			syslog_idx = get_first_idx(&main_log);
 			syslog_prev = 0;
 			syslog_partial = 0;
 		}
@@ -1317,18 +1424,20 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
 			 * for pending data, not the size; return the count of
 			 * records, not the length.
 			 */
-			error = log_next_idx - syslog_idx;
+			error = get_next_idx(&main_log) - syslog_idx;
 		} else {
 			u64 seq = syslog_seq;
 			u32 idx = syslog_idx;
 			enum log_flags prev = syslog_prev;
 
 			error = 0;
-			while (seq < log_next_seq) {
-				struct printk_msg *msg = msg_from_idx(idx);
+			while (seq < get_next_seq(&main_log)) {
+				struct printk_msg *msg = msg_from_idx(&main_log,
+								      idx);
 
-				error += msg_print_text(msg, prev, true, NULL, 0);
-				idx = log_next(idx);
+				error += msg_print_text(msg, prev, true,
+							NULL, 0);
+				idx = inc_idx(&main_log, idx);
 				seq++;
 				prev = msg->flags;
 			}
@@ -1338,7 +1447,7 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
 		break;
 	/* Size of the log buffer */
 	case SYSLOG_ACTION_SIZE_BUFFER:
-		error = log_buf_len;
+		error = main_log.buf_len;
 		break;
 	default:
 		error = -EINVAL;
@@ -1471,106 +1580,98 @@ static inline void printk_delay(void)
 	}
 }
 
-/*
- * Continuation lines are buffered, and not committed to the record buffer
- * until the line is complete, or a race forces it. The line fragments
- * though, are printed immediately to the consoles to ensure everything has
- * reached the console in case of a kernel crash.
- */
-static struct cont {
-	char buf[LOG_LINE_MAX];
-	size_t len;			/* length == 0 means unused buffer */
-	size_t cons;			/* bytes written to console */
-	struct task_struct *owner;	/* task of first print*/
-	u64 ts_nsec;			/* time of first print */
-	u8 level;			/* log level of first message */
-	u8 facility;			/* log level of first message */
-	enum log_flags flags;		/* prefix, newline flags */
-	bool flushed:1;			/* buffer sealed and committed */
-} cont;
-
-static void cont_flush(enum log_flags flags)
+static void cont_flush(struct printk_log *log, enum log_flags flags)
 {
-	if (cont.flushed)
+	struct printk_cont *cont = log->cont;
+
+	if (cont->flushed)
 		return;
-	if (cont.len == 0)
+	if (cont->len == 0)
 		return;
 
-	if (cont.cons) {
+	if (cont->cons) {
 		/*
 		 * If a fragment of this line was directly flushed to the
 		 * console; wait for the console to pick up the rest of the
 		 * line. LOG_NOCONS suppresses a duplicated output.
 		 */
-		log_store(cont.facility, cont.level, flags | LOG_NOCONS,
-			  cont.ts_nsec, NULL, 0, cont.buf, cont.len);
-		cont.flags = flags;
-		cont.flushed = true;
+		log_store(log, cont->facility, cont->level, flags | LOG_NOCONS,
+			  cont->ts_nsec, NULL, 0, cont->buf, cont->len);
+		cont->flags = flags;
+		cont->flushed = true;
 	} else {
 		/*
 		 * If no fragment of this line ever reached the console,
 		 * just submit it to the store and free the buffer.
 		 */
-		log_store(cont.facility, cont.level, flags, 0,
-			  NULL, 0, cont.buf, cont.len);
-		cont.len = 0;
+		log_store(log, cont->facility, cont->level, flags, 0,
+			  NULL, 0, cont->buf, cont->len);
+		cont->len = 0;
 	}
 }
 
-static bool cont_add(int facility, int level, const char *text, size_t len)
+static bool cont_add(struct printk_log *log, int facility, int level,
+		     const char *text, size_t len)
 {
-	if (cont.len && cont.flushed)
+	struct printk_cont *cont = log->cont;
+
+	if (cont->len && cont->flushed)
 		return false;
 
-	if (cont.len + len > sizeof(cont.buf)) {
+	if (cont->len + len > sizeof(cont->buf)) {
 		/* the line gets too long, split it up in separate records */
-		cont_flush(LOG_CONT);
+		cont_flush(log, LOG_CONT);
 		return false;
 	}
 
-	if (!cont.len) {
-		cont.facility = facility;
-		cont.level = level;
-		cont.owner = current;
-		cont.ts_nsec = local_clock();
-		cont.flags = 0;
-		cont.cons = 0;
-		cont.flushed = false;
+	if (!cont->len) {
+		cont->facility = facility;
+		cont->level = level;
+		cont->owner = current;
+		cont->ts_nsec = local_clock();
+		cont->flags = 0;
+		cont->cons = 0;
+		cont->flushed = false;
 	}
 
-	memcpy(cont.buf + cont.len, text, len);
-	cont.len += len;
+	memcpy(cont->buf + cont->len, text, len);
+	cont->len += len;
 
-	if (cont.len > (sizeof(cont.buf) * 80) / 100)
-		cont_flush(LOG_CONT);
+	if (cont->len > (sizeof(cont->buf) * 80) / 100)
+		cont_flush(log, LOG_CONT);
 
 	return true;
 }
 
+/*
+ * Only messages from the main log buffer are printed directly to
+ * the console. Therefore this function operates directly with the main
+ * continuous buffer.
+ */
 static size_t cont_print_text(char *text, size_t size)
 {
 	size_t textlen = 0;
 	size_t len;
 
-	if (cont.cons == 0 && (console_prev & LOG_NEWLINE)) {
-		textlen += print_time(cont.ts_nsec, text);
+	if (main_cont.cons == 0 && (console_prev & LOG_NEWLINE)) {
+		textlen += print_time(main_cont.ts_nsec, text);
 		size -= textlen;
 	}
 
-	len = cont.len - cont.cons;
+	len = main_cont.len - main_cont.cons;
 	if (len > 0) {
 		if (len+1 > size)
 			len = size-1;
-		memcpy(text + textlen, cont.buf + cont.cons, len);
+		memcpy(text + textlen, main_cont.buf + main_cont.cons, len);
 		textlen += len;
-		cont.cons = cont.len;
+		main_cont.cons = main_cont.len;
 	}
 
-	if (cont.flushed) {
-		if (cont.flags & LOG_NEWLINE)
+	if (main_cont.flushed) {
+		if (main_cont.flags & LOG_NEWLINE)
 			text[textlen++] = '\n';
 		/* got everything, release buffer */
-		cont.len = 0;
+		main_cont.len = 0;
 	}
 	return textlen;
 }
@@ -1579,6 +1680,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 			    const char *dict, size_t dictlen,
 			    const char *fmt, va_list args)
 {
+	struct printk_log *log = &main_log;
+	struct printk_cont *cont = main_log.cont;
 	static int recursion_bug;
 	static char textbuf[LOG_LINE_MAX];
 	char *text = textbuf;
@@ -1633,7 +1736,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 		recursion_bug = 0;
 		text_len = strlen(recursion_msg);
 		/* emit KERN_CRIT message */
-		printed_len += log_store(0, 2, LOG_PREFIX|LOG_NEWLINE, 0,
+		printed_len += log_store(log, 0, 2, LOG_PREFIX|LOG_NEWLINE, 0,
 					 NULL, 0, recursion_msg, text_len);
 	}
 
@@ -1688,14 +1791,15 @@ asmlinkage int vprintk_emit(int facility, int level,
 		 * Flush the conflicting buffer. An earlier newline was missing,
 		 * or another task also prints continuation lines.
 		 */
-		if (cont.len && (lflags & LOG_PREFIX || cont.owner != current))
-			cont_flush(LOG_NEWLINE);
+		if (cont->len &&
+		    (lflags & LOG_PREFIX || cont->owner != current))
+			cont_flush(log, LOG_NEWLINE);
 
 		/* buffer line if possible, otherwise store it right away */
-		if (cont_add(facility, level, text, text_len))
+		if (cont_add(log, facility, level, text, text_len))
 			printed_len += text_len;
 		else
-			printed_len += log_store(facility, level,
+			printed_len += log_store(log, facility, level,
 						 lflags | LOG_CONT, 0,
 						 dict, dictlen, text, text_len);
 	} else {
@@ -1709,18 +1813,19 @@ asmlinkage int vprintk_emit(int facility, int level,
 		 * If the preceding printk was from a different task and missed
 		 * a newline, flush and append the newline.
 		 */
-		if (cont.len) {
-			if (cont.owner == current && !(lflags & LOG_PREFIX))
-				stored = cont_add(facility, level, text,
+		if (cont->len) {
+			if (cont->owner == current && !(lflags & LOG_PREFIX))
+				stored = cont_add(log, facility, level, text,
 						  text_len);
-			cont_flush(LOG_NEWLINE);
+			cont_flush(log, LOG_NEWLINE);
 		}
 
 		if (stored)
 			printed_len += text_len;
 		else
-			printed_len += log_store(facility, level, lflags, 0,
-						 dict, dictlen, text, text_len);
+			printed_len += log_store(log, facility, level, lflags,
+						 0, dict, dictlen,
+						 text, text_len);
 	}
 
 	logbuf_cpu = UINT_MAX;
@@ -1823,18 +1928,29 @@ static u32 syslog_idx;
 static u64 console_seq;
 static u32 console_idx;
 static enum log_flags syslog_prev;
-static u64 log_first_seq;
-static u32 log_first_idx;
-static u64 log_next_seq;
 static enum log_flags console_prev;
-static struct cont {
+static struct printk_cont {
 	size_t len;
 	size_t cons;
 	u8 level;
 	bool flushed:1;
-} cont;
-static struct printk_msg *msg_from_idx(u32 idx) { return NULL; }
-static u32 log_next(u32 idx) { return 0; }
+} main_cont;
+static struct printk_log {
+	u64 first_seq;
+	u32 first_idx;
+	u64 next_seq;
+} main_log;
+#define DEFINE_GET_POS(rettype, funcname)		\
+static rettype funcname(const struct printk_log *log)	\
+{							\
+	return 0;					\
+}
+DEFINE_GET_POS(u32, get_first_idx)
+DEFINE_GET_POS(u64, get_first_seq)
+DEFINE_GET_POS(u64, get_next_seq)
+static struct printk_msg *msg_from_idx(struct printk_log *log,
+				       u32 idx) { return NULL; }
+static u32 inc_idx(struct printk_log *log, u32 idx) { return 0; }
 static void call_console_drivers(int level, const char *text, size_t len) {}
 static size_t msg_print_text(const struct printk_msg *msg, enum log_flags prev,
 			     bool syslog, char *buf, size_t size) { return 0; }
@@ -2084,6 +2200,7 @@ int is_console_locked(void)
 	return console_locked;
 }
 
+/* only messages from the main cont buffer are flushed directly */
 static void console_cont_flush(char *text, size_t size)
 {
 	unsigned long flags;
@@ -2091,7 +2208,7 @@ static void console_cont_flush(char *text, size_t size)
 
 	raw_spin_lock_irqsave(&logbuf_lock, flags);
 
-	if (!cont.len)
+	if (!main_cont.len)
 		goto out;
 
 	/*
@@ -2099,13 +2216,13 @@ static void console_cont_flush(char *text, size_t size)
 	 * busy. The earlier ones need to be printed before this one, we
 	 * did not flush any fragment so far, so just let it queue up.
 	 */
-	if (console_seq < log_next_seq && !cont.cons)
+	if (console_seq < get_next_seq(&main_log) && !main_cont.cons)
 		goto out;
 
 	len = cont_print_text(text, size);
 	raw_spin_unlock(&logbuf_lock);
 	stop_critical_timings();
-	call_console_drivers(cont.level, text, len);
+	call_console_drivers(main_cont.level, text, len);
 	start_critical_timings();
 	local_irq_restore(flags);
 	return;
@@ -2152,35 +2269,36 @@ again:
 		int level;
 
 		raw_spin_lock_irqsave(&logbuf_lock, flags);
-		console_end_seq = log_next_seq;
+		console_end_seq = get_next_seq(&main_log);
 again_noirq:
-		if (seen_seq != log_next_seq) {
+		if (seen_seq != get_next_seq(&main_log)) {
 			wake_klogd = true;
-			seen_seq = log_next_seq;
+			seen_seq = get_next_seq(&main_log);
 		}
 
-		if (console_seq < log_first_seq) {
+		if (console_seq < get_first_seq(&main_log)) {
 			len = sprintf(text, "** %u printk messages dropped ** ",
-				      (unsigned)(log_first_seq - console_seq));
+				      (unsigned)(get_first_seq(&main_log) -
+						 console_seq));
 
 			/* messages are gone, move to first one */
-			console_seq = log_first_seq;
-			console_idx = log_first_idx;
+			console_seq = get_first_seq(&main_log);
+			console_idx = get_first_idx(&main_log);
 			console_prev = 0;
 		} else {
 			len = 0;
 		}
 skip:
-		if (console_seq == log_next_seq)
+		if (console_seq == get_next_seq(&main_log))
 			break;
 
-		msg = msg_from_idx(console_idx);
+		msg = msg_from_idx(&main_log, console_idx);
 		if (msg->flags & LOG_NOCONS) {
 			/*
 			 * Skip record we have buffered and already printed
 			 * directly to the console when we received it.
 			 */
-			console_idx = log_next(console_idx);
+			console_idx = inc_idx(&main_log, console_idx);
 			console_seq++;
 			/*
 			 * We will get here again when we register a new
@@ -2195,7 +2313,7 @@ skip:
 		level = msg->level;
 		len += msg_print_text(msg, console_prev, false,
 				      text + len, sizeof(text) - len);
-		console_idx = log_next(console_idx);
+		console_idx = inc_idx(&main_log, console_idx);
 		console_seq++;
 		console_prev = msg->flags;
 		raw_spin_unlock(&logbuf_lock);
@@ -2228,7 +2346,7 @@ skip:
 	 * flush, no worries.
 	 */
 	raw_spin_lock(&logbuf_lock);
-	retry = console_seq != log_next_seq;
+	retry = console_seq != get_next_seq(&main_log);
 	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
 
 	if (retry && console_trylock())
@@ -2740,8 +2858,8 @@ void kmsg_dump(enum kmsg_dump_reason reason)
 		raw_spin_lock_irqsave(&logbuf_lock, flags);
 		dumper->cur_seq = clear_seq;
 		dumper->cur_idx = clear_idx;
-		dumper->next_seq = log_next_seq;
-		dumper->next_idx = log_next_idx;
+		dumper->next_seq = get_next_seq(&main_log);
+		dumper->next_idx = get_next_idx(&main_log);
 		raw_spin_unlock_irqrestore(&logbuf_lock, flags);
 
 		/* invoke dumper which will iterate over records */
@@ -2782,20 +2900,20 @@ bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
 	if (!dumper->active)
 		goto out;
 
-	if (dumper->cur_seq < log_first_seq) {
+	if (dumper->cur_seq < get_first_seq(&main_log)) {
 		/* messages are gone, move to first available one */
-		dumper->cur_seq = log_first_seq;
-		dumper->cur_idx = log_first_idx;
+		dumper->cur_seq = get_first_seq(&main_log);
+		dumper->cur_idx = get_first_idx(&main_log);
 	}
 
 	/* last entry */
-	if (dumper->cur_seq >= log_next_seq)
+	if (dumper->cur_seq >= get_next_seq(&main_log))
 		goto out;
 
-	msg = msg_from_idx(dumper->cur_idx);
+	msg = msg_from_idx(&main_log, dumper->cur_idx);
 	l = msg_print_text(msg, 0, syslog, line, size);
 
-	dumper->cur_idx = log_next(dumper->cur_idx);
+	dumper->cur_idx = inc_idx(&main_log, dumper->cur_idx);
 	dumper->cur_seq++;
 	ret = true;
 out:
@@ -2870,10 +2988,10 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 		goto out;
 
 	raw_spin_lock_irqsave(&logbuf_lock, flags);
-	if (dumper->cur_seq < log_first_seq) {
+	if (dumper->cur_seq < get_first_seq(&main_log)) {
 		/* messages are gone, move to first available one */
-		dumper->cur_seq = log_first_seq;
-		dumper->cur_idx = log_first_idx;
+		dumper->cur_seq = get_first_seq(&main_log);
+		dumper->cur_idx = get_first_idx(&main_log);
 	}
 
 	/* last entry */
@@ -2887,10 +3005,10 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 	idx = dumper->cur_idx;
 	prev = 0;
 	while (seq < dumper->next_seq) {
-		struct printk_msg *msg = msg_from_idx(idx);
+		struct printk_msg *msg = msg_from_idx(&main_log, idx);
 
 		l += msg_print_text(msg, prev, true, NULL, 0);
-		idx = log_next(idx);
+		idx = inc_idx(&main_log, idx);
 		seq++;
 		prev = msg->flags;
 	}
@@ -2900,10 +3018,10 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 	idx = dumper->cur_idx;
 	prev = 0;
 	while (l > size && seq < dumper->next_seq) {
-		struct printk_msg *msg = msg_from_idx(idx);
+		struct printk_msg *msg = msg_from_idx(&main_log, idx);
 
 		l -= msg_print_text(msg, prev, true, NULL, 0);
-		idx = log_next(idx);
+		idx = inc_idx(&main_log, idx);
 		seq++;
 		prev = msg->flags;
 	}
@@ -2914,10 +3032,10 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 
 	l = 0;
 	while (seq < dumper->next_seq) {
-		struct printk_msg *msg = msg_from_idx(idx);
+		struct printk_msg *msg = msg_from_idx(&main_log, idx);
 
 		l += msg_print_text(msg, prev, syslog, buf + l, size - l);
-		idx = log_next(idx);
+		idx = inc_idx(&main_log, idx);
 		seq++;
 		prev = msg->flags;
 	}
@@ -2947,8 +3065,8 @@ void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper)
 {
 	dumper->cur_seq = clear_seq;
 	dumper->cur_idx = clear_idx;
-	dumper->next_seq = log_next_seq;
-	dumper->next_idx = log_next_idx;
+	dumper->next_seq = get_next_seq(&main_log);
+	dumper->next_idx = get_next_idx(&main_log);
 }
 
 /**
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 03/11] printk: rename "logbuf_lock" to "main_logbuf_lock"
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
  2014-05-09  9:10 ` [RFC PATCH 01/11] printk: rename struct printk_log to printk_msg Petr Mladek
  2014-05-09  9:10 ` [RFC PATCH 02/11] printk: allow to handle more log buffers Petr Mladek
@ 2014-05-09  9:10 ` Petr Mladek
  2014-05-09  9:10 ` [RFC PATCH 04/11] printk: add NMI ring and cont buffers Petr Mladek
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

Make the naming consistent with "main_log" and "main_cont".

This commit does not change any behavior.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 98 +++++++++++++++++++++++++-------------------------
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index c560ebdecc04..d64533f9e6b2 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -148,11 +148,11 @@ EXPORT_SYMBOL(console_set_on_cmdline);
 static int console_may_schedule;
 
 /*
- * The logbuf_lock protects kmsg buffer, indexes, counters. This can be taken
- * within the scheduler's rq lock. It must be released before calling
+ * The main_logbuf_lock protects kmsg buffer, indexes, counters. This can be
+ * taken within the scheduler's rq lock. It must be released before calling
  * console_unlock() or anything else that might wake up a process.
  */
-static DEFINE_RAW_SPINLOCK(logbuf_lock);
+static DEFINE_RAW_SPINLOCK(main_logbuf_lock);
 
 /*
  * The printk log buffer consists of a chain of concatenated variable
@@ -691,21 +691,21 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 	ret = mutex_lock_interruptible(&user->lock);
 	if (ret)
 		return ret;
-	raw_spin_lock_irq(&logbuf_lock);
+	raw_spin_lock_irq(&main_logbuf_lock);
 	while (user->seq == get_next_seq(&main_log)) {
 		if (file->f_flags & O_NONBLOCK) {
 			ret = -EAGAIN;
-			raw_spin_unlock_irq(&logbuf_lock);
+			raw_spin_unlock_irq(&main_logbuf_lock);
 			goto out;
 		}
 
-		raw_spin_unlock_irq(&logbuf_lock);
+		raw_spin_unlock_irq(&main_logbuf_lock);
 		ret = wait_event_interruptible(log_wait,
 					       user->seq !=
 					       get_next_seq(&main_log));
 		if (ret)
 			goto out;
-		raw_spin_lock_irq(&logbuf_lock);
+		raw_spin_lock_irq(&main_logbuf_lock);
 	}
 
 	if (user->seq < get_first_seq(&main_log)) {
@@ -713,7 +713,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 		user->idx = get_first_idx(&main_log);
 		user->seq = get_first_seq(&main_log);
 		ret = -EPIPE;
-		raw_spin_unlock_irq(&logbuf_lock);
+		raw_spin_unlock_irq(&main_logbuf_lock);
 		goto out;
 	}
 
@@ -780,7 +780,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 
 	user->idx = inc_idx(&main_log, user->idx);
 	user->seq++;
-	raw_spin_unlock_irq(&logbuf_lock);
+	raw_spin_unlock_irq(&main_logbuf_lock);
 
 	if (len > count) {
 		ret = -EINVAL;
@@ -807,7 +807,7 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
 	if (offset)
 		return -ESPIPE;
 
-	raw_spin_lock_irq(&logbuf_lock);
+	raw_spin_lock_irq(&main_logbuf_lock);
 	switch (whence) {
 	case SEEK_SET:
 		/* the first record */
@@ -831,7 +831,7 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
 	default:
 		ret = -EINVAL;
 	}
-	raw_spin_unlock_irq(&logbuf_lock);
+	raw_spin_unlock_irq(&main_logbuf_lock);
 	return ret;
 }
 
@@ -845,7 +845,7 @@ static unsigned int devkmsg_poll(struct file *file, poll_table *wait)
 
 	poll_wait(file, &log_wait, wait);
 
-	raw_spin_lock_irq(&logbuf_lock);
+	raw_spin_lock_irq(&main_logbuf_lock);
 	if (user->seq < get_next_seq(&main_log)) {
 		/* return error when data has vanished underneath us */
 		if (user->seq < get_first_seq(&main_log))
@@ -853,7 +853,7 @@ static unsigned int devkmsg_poll(struct file *file, poll_table *wait)
 		else
 			ret = POLLIN|POLLRDNORM;
 	}
-	raw_spin_unlock_irq(&logbuf_lock);
+	raw_spin_unlock_irq(&main_logbuf_lock);
 
 	return ret;
 }
@@ -878,10 +878,10 @@ static int devkmsg_open(struct inode *inode, struct file *file)
 
 	mutex_init(&user->lock);
 
-	raw_spin_lock_irq(&logbuf_lock);
+	raw_spin_lock_irq(&main_logbuf_lock);
 	user->idx = get_first_idx(&main_log);
 	user->seq = get_first_seq(&main_log);
-	raw_spin_unlock_irq(&logbuf_lock);
+	raw_spin_unlock_irq(&main_logbuf_lock);
 
 	file->private_data = user;
 	return 0;
@@ -977,13 +977,13 @@ void __init setup_log_buf(int early)
 		return;
 	}
 
-	raw_spin_lock_irqsave(&logbuf_lock, flags);
+	raw_spin_lock_irqsave(&main_logbuf_lock, flags);
 	main_log.buf_len = new_log_buf_len;
 	main_log.buf = new_log_buf;
 	new_log_buf_len = 0;
 	free = __LOG_BUF_LEN - get_next_idx(&main_log);
 	memcpy(main_log.buf, __log_buf, __LOG_BUF_LEN);
-	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+	raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 
 	pr_info("log_buf_len: %d\n", main_log.buf_len);
 	pr_info("early log buf free: %d(%d%%)\n",
@@ -1179,7 +1179,7 @@ static int syslog_print(char __user *buf, int size)
 		size_t n;
 		size_t skip;
 
-		raw_spin_lock_irq(&logbuf_lock);
+		raw_spin_lock_irq(&main_logbuf_lock);
 		if (syslog_seq < get_first_seq(&main_log)) {
 			/* messages are gone, move to first one */
 			syslog_seq = get_first_seq(&main_log);
@@ -1188,7 +1188,7 @@ static int syslog_print(char __user *buf, int size)
 			syslog_partial = 0;
 		}
 		if (syslog_seq == get_next_seq(&main_log)) {
-			raw_spin_unlock_irq(&logbuf_lock);
+			raw_spin_unlock_irq(&main_logbuf_lock);
 			break;
 		}
 
@@ -1209,7 +1209,7 @@ static int syslog_print(char __user *buf, int size)
 			syslog_partial += n;
 		} else
 			n = 0;
-		raw_spin_unlock_irq(&logbuf_lock);
+		raw_spin_unlock_irq(&main_logbuf_lock);
 
 		if (!n)
 			break;
@@ -1238,7 +1238,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 	if (!text)
 		return -ENOMEM;
 
-	raw_spin_lock_irq(&logbuf_lock);
+	raw_spin_lock_irq(&main_logbuf_lock);
 	if (buf) {
 		u64 next_seq;
 		u64 seq;
@@ -1298,12 +1298,12 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 			seq++;
 			prev = msg->flags;
 
-			raw_spin_unlock_irq(&logbuf_lock);
+			raw_spin_unlock_irq(&main_logbuf_lock);
 			if (copy_to_user(buf + len, text, textlen))
 				len = -EFAULT;
 			else
 				len += textlen;
-			raw_spin_lock_irq(&logbuf_lock);
+			raw_spin_lock_irq(&main_logbuf_lock);
 
 			if (seq < get_first_seq(&main_log)) {
 				/* messages are gone, move to next one */
@@ -1318,7 +1318,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 		clear_seq = get_next_seq(&main_log);
 		clear_idx = get_next_idx(&main_log);
 	}
-	raw_spin_unlock_irq(&logbuf_lock);
+	raw_spin_unlock_irq(&main_logbuf_lock);
 
 	kfree(text);
 	return len;
@@ -1410,7 +1410,7 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
 		break;
 	/* Number of chars in the log buffer */
 	case SYSLOG_ACTION_SIZE_UNREAD:
-		raw_spin_lock_irq(&logbuf_lock);
+		raw_spin_lock_irq(&main_logbuf_lock);
 		if (syslog_seq < get_first_seq(&main_log)) {
 			/* messages are gone, move to first one */
 			syslog_seq = get_first_seq(&main_log);
@@ -1443,7 +1443,7 @@ int do_syslog(int type, char __user *buf, int len, bool from_file)
 			}
 			error -= syslog_partial;
 		}
-		raw_spin_unlock_irq(&logbuf_lock);
+		raw_spin_unlock_irq(&main_logbuf_lock);
 		break;
 	/* Size of the log buffer */
 	case SYSLOG_ACTION_SIZE_BUFFER:
@@ -1509,7 +1509,7 @@ static void zap_locks(void)
 
 	debug_locks_off();
 	/* If a crash is occurring, make sure we can't deadlock */
-	raw_spin_lock_init(&logbuf_lock);
+	raw_spin_lock_init(&main_logbuf_lock);
 	/* And make sure that we print immediately */
 	sema_init(&console_sem, 1);
 }
@@ -1726,7 +1726,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 	}
 
 	lockdep_off();
-	raw_spin_lock(&logbuf_lock);
+	raw_spin_lock(&main_logbuf_lock);
 	logbuf_cpu = this_cpu;
 
 	if (recursion_bug) {
@@ -1829,7 +1829,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 	}
 
 	logbuf_cpu = UINT_MAX;
-	raw_spin_unlock(&logbuf_lock);
+	raw_spin_unlock(&main_logbuf_lock);
 	lockdep_on();
 	local_irq_restore(flags);
 
@@ -2206,7 +2206,7 @@ static void console_cont_flush(char *text, size_t size)
 	unsigned long flags;
 	size_t len;
 
-	raw_spin_lock_irqsave(&logbuf_lock, flags);
+	raw_spin_lock_irqsave(&main_logbuf_lock, flags);
 
 	if (!main_cont.len)
 		goto out;
@@ -2220,14 +2220,14 @@ static void console_cont_flush(char *text, size_t size)
 		goto out;
 
 	len = cont_print_text(text, size);
-	raw_spin_unlock(&logbuf_lock);
+	raw_spin_unlock(&main_logbuf_lock);
 	stop_critical_timings();
 	call_console_drivers(main_cont.level, text, len);
 	start_critical_timings();
 	local_irq_restore(flags);
 	return;
 out:
-	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+	raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 }
 
 /**
@@ -2268,7 +2268,7 @@ again:
 		size_t len;
 		int level;
 
-		raw_spin_lock_irqsave(&logbuf_lock, flags);
+		raw_spin_lock_irqsave(&main_logbuf_lock, flags);
 		console_end_seq = get_next_seq(&main_log);
 again_noirq:
 		if (seen_seq != get_next_seq(&main_log)) {
@@ -2316,14 +2316,14 @@ skip:
 		console_idx = inc_idx(&main_log, console_idx);
 		console_seq++;
 		console_prev = msg->flags;
-		raw_spin_unlock(&logbuf_lock);
+		raw_spin_unlock(&main_logbuf_lock);
 
 		stop_critical_timings();	/* don't trace print latency */
 		call_console_drivers(level, text, len);
 		start_critical_timings();
 
 		if (console_seq < console_end_seq) {
-			raw_spin_lock(&logbuf_lock);
+			raw_spin_lock(&main_logbuf_lock);
 			goto again_noirq;
 		}
 
@@ -2335,7 +2335,7 @@ skip:
 	if (unlikely(exclusive_console))
 		exclusive_console = NULL;
 
-	raw_spin_unlock(&logbuf_lock);
+	raw_spin_unlock(&main_logbuf_lock);
 
 	up_console_sem();
 
@@ -2345,9 +2345,9 @@ skip:
 	 * there's a new owner and the console_unlock() from them will do the
 	 * flush, no worries.
 	 */
-	raw_spin_lock(&logbuf_lock);
+	raw_spin_lock(&main_logbuf_lock);
 	retry = console_seq != get_next_seq(&main_log);
-	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+	raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 
 	if (retry && console_trylock())
 		goto again;
@@ -2584,11 +2584,11 @@ void register_console(struct console *newcon)
 		 * console_unlock(); will print out the buffered messages
 		 * for us.
 		 */
-		raw_spin_lock_irqsave(&logbuf_lock, flags);
+		raw_spin_lock_irqsave(&main_logbuf_lock, flags);
 		console_seq = syslog_seq;
 		console_idx = syslog_idx;
 		console_prev = syslog_prev;
-		raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+		raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 		/*
 		 * We're about to replay the log buffer.  Only do this to the
 		 * just-registered console to avoid excessive message spam to
@@ -2855,12 +2855,12 @@ void kmsg_dump(enum kmsg_dump_reason reason)
 		/* initialize iterator with data about the stored records */
 		dumper->active = true;
 
-		raw_spin_lock_irqsave(&logbuf_lock, flags);
+		raw_spin_lock_irqsave(&main_logbuf_lock, flags);
 		dumper->cur_seq = clear_seq;
 		dumper->cur_idx = clear_idx;
 		dumper->next_seq = get_next_seq(&main_log);
 		dumper->next_idx = get_next_idx(&main_log);
-		raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+		raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 
 		/* invoke dumper which will iterate over records */
 		dumper->dump(dumper, reason);
@@ -2945,9 +2945,9 @@ bool kmsg_dump_get_line(struct kmsg_dumper *dumper, bool syslog,
 	unsigned long flags;
 	bool ret;
 
-	raw_spin_lock_irqsave(&logbuf_lock, flags);
+	raw_spin_lock_irqsave(&main_logbuf_lock, flags);
 	ret = kmsg_dump_get_line_nolock(dumper, syslog, line, size, len);
-	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+	raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 
 	return ret;
 }
@@ -2987,7 +2987,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 	if (!dumper->active)
 		goto out;
 
-	raw_spin_lock_irqsave(&logbuf_lock, flags);
+	raw_spin_lock_irqsave(&main_logbuf_lock, flags);
 	if (dumper->cur_seq < get_first_seq(&main_log)) {
 		/* messages are gone, move to first available one */
 		dumper->cur_seq = get_first_seq(&main_log);
@@ -2996,7 +2996,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 
 	/* last entry */
 	if (dumper->cur_seq >= dumper->next_seq) {
-		raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+		raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 		goto out;
 	}
 
@@ -3043,7 +3043,7 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 	dumper->next_seq = next_seq;
 	dumper->next_idx = next_idx;
 	ret = true;
-	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+	raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 out:
 	if (len)
 		*len = l;
@@ -3081,9 +3081,9 @@ void kmsg_dump_rewind(struct kmsg_dumper *dumper)
 {
 	unsigned long flags;
 
-	raw_spin_lock_irqsave(&logbuf_lock, flags);
+	raw_spin_lock_irqsave(&main_logbuf_lock, flags);
 	kmsg_dump_rewind_nolock(dumper);
-	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+	raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 }
 EXPORT_SYMBOL_GPL(kmsg_dump_rewind);
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 04/11] printk: add NMI ring and cont buffers
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
                   ` (2 preceding siblings ...)
  2014-05-09  9:10 ` [RFC PATCH 03/11] printk: rename "logbuf_lock" to "main_logbuf_lock" Petr Mladek
@ 2014-05-09  9:10 ` Petr Mladek
  2014-05-09  9:10 ` [RFC PATCH 05/11] printk: allow to modify NMI log buffer size using boot parameter Petr Mladek
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

This is another preparation patch for the NMI-safe printk implementation.
It adds new ring and cont buffers to temporarily store messages printed
from NMI context. They are used when main_logbuf_lock for the main ring
buffer is already held.

Unfortunately, we need to store the index and the sequence number in
a single "unsigned long". I did not find a better way to keep them
consistent. Both values are modified when new messages are added in
NMI context. They are read when the messages are copied to the main
log buffer in normal context. These two operations cannot be guarded
by a common lock because it would cause the original deadlock.

Anyway, this patch adds a bunch of macros that do the needed bit operations.
It also extends the existing "set/get" functions so that the access is
transparent.

The ring buffer is allocated during early initialization, like log_buf.
It uses the same length for now; this will be improved in a later patch.

The patch also updates the messages printed when resizing the main ring
buffer, making them more descriptive and consistent with the messages
about the NMI log buffer.

This patch does not change the existing behavior, except for the printed
messages. The logic for using the NMI buffer will be added in follow-up
patches.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 104 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 91 insertions(+), 13 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index d64533f9e6b2..e8d0df2d3e01 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -265,6 +265,16 @@ struct printk_main_log_pos {
 	u32 next_idx;		/* index of the next record to store */
 };
 
+/*
+ * The "id" has to be read and written atomically. But we do not need other
+ * atomic operations. The value has to be as big as possible. Therefore we
+ * use unsigned long.
+ */
+struct printk_nmi_log_pos {
+	unsigned long first_id;
+	unsigned long next_id;
+};
+
 /* information needed to manipulate the log buffer */
 struct printk_log {
 	struct printk_cont *cont;	/* merging continuous message */
@@ -272,6 +282,7 @@ struct printk_log {
 	u32 buf_len;			/* size of the ring buffer */
 	union {
 		struct printk_main_log_pos main;
+		struct printk_nmi_log_pos nmi;
 	};
 };
 
@@ -304,16 +315,53 @@ static u32 clear_idx;
 #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
 static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
 static char __main_cont_buf[LOG_LINE_MAX];
+static char __nmi_cont_buf[LOG_LINE_MAX];
 
 static struct printk_cont main_cont = {
 	.buf = __main_cont_buf,
 };
 
+static struct printk_cont nmi_cont = {
+	.buf = __nmi_cont_buf,
+};
+
 static struct printk_log main_log = {
 	.buf = __log_buf,
 	.buf_len = __LOG_BUF_LEN,
 	.cont = &main_cont,
 };
+/*
+ * NMI ring buffer must be used if we are in NMI context and the lock for
+ * the main buffer is already in use by code that has been interrupted.
+ * The content of the NMI buffer is moved to the main buffer on the first
+ * occasion.
+ */
+static struct printk_log nmi_log = {
+	.cont = &nmi_cont,
+	.buf_len = __LOG_BUF_LEN,
+};
+
+/*
+ * Bit operations needed to manipulate index and sequence numbers for the NMI
+ * log buffer:
+ *	+ sequence number takes the lower half of the _id variable
+ *	+ index takes the higher half of the _id variable
+ */
+#define NMI_SEQ_BYTES (sizeof(nmi_log.nmi.first_id) * 8 / 2)
+#define NMI_IDX_BYTES NMI_SEQ_BYTES
+#define NMI_SEQ_MASK ((1UL << NMI_SEQ_BYTES) - 1)
+#define NMI_IDX_MASK (~NMI_SEQ_MASK)
+#define idx_from_id(id) ((id & NMI_IDX_MASK) >> NMI_SEQ_BYTES)
+#define seq_from_id(id) (id & NMI_SEQ_MASK)
+#define make_id(idx, seq) (((unsigned long)idx << NMI_SEQ_BYTES) | \
+			   (seq & NMI_SEQ_MASK))
+/*
+ * Maximum length of the allocated buffer. It has to be a power of two.
+ * It can be limited either by the maximum number of indexes or
+ * by the "buf_len" variable size.
+ */
+#define NMI_MAX_LEN_POWER (min(NMI_IDX_BYTES, sizeof(nmi_log.buf_len) * 8 - 1))
+#define NMI_MAX_LEN (1UL << NMI_MAX_LEN_POWER)
 
 /*
  * Define functions needed to get the position values,
@@ -324,7 +372,10 @@ static struct printk_log main_log = {
 #define DEFINE_GET_POS(rettype, funcname, side, pos)		\
 static rettype funcname(const struct printk_log *log)		\
 {								\
-	return log->main.side##_##pos;				\
+	if (log == &main_log)					\
+		return log->main.side##_##pos;			\
+	else							\
+		return pos##_from_id((log)->nmi.side##_id);	\
 }
 
 DEFINE_GET_POS(u32, get_first_idx, first, idx)
@@ -341,8 +392,12 @@ DEFINE_GET_POS(u64, get_next_seq, next, seq)
 #define DEFINE_SET_POS(funcname, side)					\
 static void funcname(struct printk_log *log, u32 idx, u64 seq)		\
 {									\
-	log->main.side##_idx = idx;					\
-	log->main.side##_seq = seq;				\
+	if (log == &main_log) {						\
+		(log)->main.side ## _idx = idx;				\
+		(log)->main.side ## _seq = seq;				\
+	} else {							\
+		(log)->nmi.side ## _id = make_id(idx, seq);		\
+	}								\
 }
 
 DEFINE_SET_POS(set_first_pos, first)
@@ -395,7 +450,10 @@ static u32 inc_idx(struct printk_log *log, u32 idx)
 /* get next sequence number for the given one */
 static u64 inc_seq(struct printk_log *log, u64 seq)
 {
-	return ++seq;
+	if (log == &main_log)
+		return ++seq;
+	else
+		return ++seq & NMI_SEQ_MASK;
 }
 
 /*
@@ -955,22 +1013,42 @@ static int __init log_buf_len_setup(char *str)
 }
 early_param("log_buf_len", log_buf_len_setup);
 
+char * __init alloc_log_buf(int early, unsigned len)
+{
+	if (early)
+		return memblock_virt_alloc(len, PAGE_SIZE);
+
+	return memblock_virt_alloc_nopanic(len, 0);
+}
+
 void __init setup_log_buf(int early)
 {
 	unsigned long flags;
 	char *new_log_buf;
 	int free;
 
+	if (!nmi_log.buf) {
+		/* use the same size that will be used for normal buffer */
+		if (new_log_buf_len > nmi_log.buf_len)
+			nmi_log.buf_len = new_log_buf_len;
+		if (nmi_log.buf_len > NMI_MAX_LEN)
+			nmi_log.buf_len = NMI_MAX_LEN;
+		nmi_log.buf = alloc_log_buf(early, nmi_log.buf_len);
+		if (!nmi_log.buf)
+			pr_err("%d bytes not available for NMI ring buffer\n",
+			       nmi_log.buf_len);
+		else
+			pr_info("NMI ring buffer size: %d\n", nmi_log.buf_len);
+	}
+
+	/*
+	 * The default static buffer is used when the size is not increased
+	 * by the boot parameter.
+	 */
 	if (!new_log_buf_len)
 		return;
 
-	if (early) {
-		new_log_buf =
-			memblock_virt_alloc(new_log_buf_len, PAGE_SIZE);
-	} else {
-		new_log_buf = memblock_virt_alloc_nopanic(new_log_buf_len, 0);
-	}
-
+	new_log_buf = alloc_log_buf(early, new_log_buf_len);
 	if (unlikely(!new_log_buf)) {
 		pr_err("log_buf_len: %ld bytes not available\n",
 			new_log_buf_len);
@@ -985,8 +1063,8 @@ void __init setup_log_buf(int early)
 	memcpy(main_log.buf, __log_buf, __LOG_BUF_LEN);
 	raw_spin_unlock_irqrestore(&main_logbuf_lock, flags);
 
-	pr_info("log_buf_len: %d\n", main_log.buf_len);
-	pr_info("early log buf free: %d(%d%%)\n",
+	pr_info("increased the main ring buffer: %d\n", main_log.buf_len);
+	pr_info("free space before resizing: %d(%d%%)\n",
 		free, (free * 100) / __LOG_BUF_LEN);
 }
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 05/11] printk: allow to modify NMI log buffer size using boot parameter
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
                   ` (3 preceding siblings ...)
  2014-05-09  9:10 ` [RFC PATCH 04/11] printk: add NMI ring and cont buffers Petr Mladek
@ 2014-05-09  9:10 ` Petr Mladek
  2014-05-09  9:11 ` [RFC PATCH 06/11] printk: NMI safe printk Petr Mladek
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

Having an NMI log buffer of the same size as the main log buffer might
be considered a waste of memory, especially when the main buffer is
increased to a large value. So create a separate kernel parameter,
nmi_log_buf_len, to set its size.

Some users might want to avoid the buffer entirely. This can be done
by passing a zero value. In this case, printk will stay safe in NMI
context but there will be a higher chance that some messages get lost.

The maximum size is checked when the NMI log buffer is allocated. Note
that even the default size might be too big if there are not enough
bits available to store the index.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 Documentation/kernel-parameters.txt | 19 +++++++++++++++++--
 kernel/printk/printk.c              | 22 +++++++++++++++++-----
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 296e6da5ce6c..5e90eab4d696 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -211,8 +211,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			    acpi.debug_layer=0x2 acpi.debug_level=0xffffffff
 
 			Some values produce so much output that the system is
-			unusable.  The "log_buf_len" parameter may be useful
-			if you need to capture more output.
+			unusable.  The "log_buf_len" and "nmi_log_buf_len"
+			parameters may be useful if you need to capture more
+			output.
 
 	acpi_irq_balance [HW,ACPI]
 			ACPI will balance active IRQs
@@ -2052,6 +2053,20 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			when a NMI is triggered.
 			Format: [state][,regs][,debounce][,die]
 
+	nmi_log_buf_len=n[KMG]	Sets the size of a helper ring buffer for
+			printk messages in NMI context. It is used only as
+			fallback when the lock for the main ring buffer is
+			already taken. The content is merged into the main
+			buffer when possible.
+
+			The size is in bytes and must be a power of two. The
+			default size is the same as for the main printk ring
+			buffer.
+
+			The size 0 can be used to disable the extra buffer
+			entirely. It saves space but there is a higher risk
+			that some messages will get lost.
+
 	nmi_watchdog=	[KNL,BUGS=X86] Debugging features for SMP kernels
 			Format: [panic,][nopanic,][num]
 			Valid num: 0
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index e8d0df2d3e01..7d0d5c714f71 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1013,6 +1013,18 @@ static int __init log_buf_len_setup(char *str)
 }
 early_param("log_buf_len", log_buf_len_setup);
 
+/* NMI log buffer can be completely disabled by setting 0 value */
+static int __init nmi_log_buf_len_setup(char *str)
+{
+	nmi_log.buf_len = memparse(str, &str);
+
+	if (nmi_log.buf_len)
+		nmi_log.buf_len = roundup_pow_of_two(nmi_log.buf_len);
+
+	return 0;
+}
+early_param("nmi_log_buf_len", nmi_log_buf_len_setup);
+
 char * __init alloc_log_buf(int early, unsigned len)
 {
 	if (early)
@@ -1028,13 +1040,13 @@ void __init setup_log_buf(int early)
 	int free;
 
 	if (!nmi_log.buf) {
-		/* use the same size that will be used for normal buffer */
-		if (new_log_buf_len > nmi_log.buf_len)
-			nmi_log.buf_len = new_log_buf_len;
 		if (nmi_log.buf_len > NMI_MAX_LEN)
 			nmi_log.buf_len = NMI_MAX_LEN;
-		nmi_log.buf = alloc_log_buf(early, nmi_log.buf_len);
-		if (!nmi_log.buf)
+		/* zero length means that the feature is disabled */
+		if (nmi_log.buf_len)
+			nmi_log.buf = alloc_log_buf(early, nmi_log.buf_len);
+
+		if (!nmi_log.buf && nmi_log.buf_len)
 			pr_err("%d bytes not available for NMI ring buffer\n",
 			       nmi_log.buf_len);
 		else
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 06/11] printk: NMI safe printk
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
                   ` (4 preceding siblings ...)
  2014-05-09  9:10 ` [RFC PATCH 05/11] printk: allow to modify NMI log buffer size using boot parameter Petr Mladek
@ 2014-05-09  9:11 ` Petr Mladek
  2014-05-09  9:11 ` [RFC PATCH 07/11] printk: right ordering of the cont buffers from NMI context Petr Mladek
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

printk cannot be used in NMI context safely because it uses internal locks
and thus could cause a deadlock. Unfortunately there are circumstances when
calling printk from NMI is very useful. For example, all WARN.*(in_nmi())
would be much more helpful if they didn't lock up the machine.

Another example would be arch_trigger_all_cpu_backtrace for x86 which uses NMI
to dump traces on all CPUs (either triggered by sysrq+l or from the RCU stall
detector).

This patch prevents a deadlock on main_logbuf_lock by using trylock rather
than spin_lock. If the lock cannot be taken, it uses NMI-specific text, cont,
and log buffers to handle the message.

In order to synchronize parallel printks from NMI context, a new lock is
introduced. It is held only in NMI context and does not nest inside any
other lock, so it is safe against deadlocks.

The trickiest part is the operation that merges the messages back into the
main log buffer. It is usually done in normal context, so it needs to read
from the NMI log buffer without taking "nmi_logbuf_lock".

The merging would be relatively easy if we could afford to drop the newest
messages when the NMI log buffer is full. Then it is a simple producer and
consumer problem: log_store() modifies only "next_idx" and "next_seq", and
merge_nmi_delayed_printk() modifies only "first_idx" and "first_seq".

But we want the last messages. Thus log_store() needs to modify all four
"first_*" and "next_*" position variables. Then it is really hard to read
them consistently in merge_nmi_delayed_printk(). Note that the function
can be interrupted at any time and the NMI context could rotate the NMI log
buffer even several times.

The NMI log buffer can be relatively small, so there is a relatively high
chance of rotation. Thus we cannot detect it by checking the "idx" values.

Comparing the sequence numbers would help because their rotation is almost
impossible. But then it is very hard to make sure that given "idx" and "seq"
values describe the same message.

The cleanest solution seems to be to modify the "idx" and "seq" values in one
atomic operation. This is why they are stored in a single "unsigned long" and
accessed via a bunch of helper functions.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 282 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 261 insertions(+), 21 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 7d0d5c714f71..12435f3ef2d4 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -154,6 +154,18 @@ static int console_may_schedule;
  */
 static DEFINE_RAW_SPINLOCK(main_logbuf_lock);
 
+#ifdef CONFIG_PRINTK
+/*
+ * The nmi_logbuf_lock protects writing into the nmi_log_buf. It is used in NMI
+ * context when main_logbuf_lock is already taken or it is not safe to take it.
+ * EVER take this lock outside of NMI context.
+ *
+ * The messages are moved from nmi_logbuf without this lock. Be careful about
+ * races between writing and reading, see log_store and merge_nmi_delayed_printk.
+ */
+static DEFINE_RAW_SPINLOCK(nmi_logbuf_lock);
+#endif
+
 /*
  * The printk log buffer consists of a chain of concatenated variable
  * length records. Every record starts with a record header, containing
@@ -355,6 +367,8 @@ static struct printk_log nmi_log = {
 #define seq_from_id(id) (id & NMI_SEQ_MASK)
 #define make_id(idx, seq) (((unsigned long)idx << NMI_SEQ_BYTES) | \
 			   (seq & NMI_SEQ_MASK))
+/* maximum number of sequence numbers */
+#define NMI_MAX_SEQ (1UL << NMI_SEQ_BYTES)
 /*
  * Maximum length of the allocated buffer. It has to be a power of two.
  * It can be limited either by the maximum number of indexes or
@@ -363,6 +377,13 @@ static struct printk_log nmi_log = {
 #define NMI_MAX_LEN_POWER (min(NMI_IDX_BYTES, sizeof(nmi_log.buf_len) * 8 - 1))
 #define NMI_MAX_LEN (1UL << NMI_MAX_LEN_POWER)
 
+/* maximum number of messages that fit into NMI log buffer */
+static unsigned int nmi_max_messages;
+/* temporary buffer to print the formatted string */
+static char textbuf[LOG_LINE_MAX];
+/* temporary buffer to print the formatted string in NMI context */
+static char nmi_textbuf[LOG_LINE_MAX];
+
 /*
  * Define functions needed to get the position values,
  * for example, first_idx. Possible values are:
@@ -383,7 +404,6 @@ DEFINE_GET_POS(u64, get_first_seq, first, seq)
 DEFINE_GET_POS(u32, get_next_idx, next, idx)
 DEFINE_GET_POS(u64, get_next_seq, next, seq)
 
-
 /*
  * Define functions needed to set the position values,
  * for example, first_idx. Possible values are:
@@ -415,6 +435,12 @@ static char *get_dict(const struct printk_msg *msg)
 	return (char *)msg + sizeof(struct printk_msg) + msg->text_len;
 }
 
+/* safe variant that can be used in a lock-less situation */
+static char *get_dict_safely(const struct printk_msg *msg, u16 text_len)
+{
+	return (char *)msg + sizeof(struct printk_msg) + text_len;
+}
+
 /* get record by index; idx must point to valid msg */
 static struct printk_msg *msg_from_idx(struct printk_log *log, u32 idx)
 {
@@ -507,18 +533,28 @@ static int logbuf_has_space(struct printk_log *log, u32 msg_size, bool empty)
 
 static int log_make_free_space(struct printk_log *log, u32 msg_size)
 {
+	int freed = 0;
+	int ret = 0;
+
 	while (get_first_seq(log) < get_next_seq(log)) {
 		if (logbuf_has_space(log, msg_size, false))
-			return 0;
+			goto out;
+
 		/* drop old messages until we have enough continuous space */
 		inc_first_pos(log);
+		freed = 1;
 	}
 
 	/* sequence numbers are equal, so the log buffer is empty */
-	if (logbuf_has_space(log, msg_size, true))
-		return 0;
+	if (!logbuf_has_space(log, msg_size, true))
+		ret = -ENOMEM;
 
-	return -ENOMEM;
+out:
+	/* propagate the updated values before the freed space is overwritten */
+	if (unlikely(log == &nmi_log) && freed)
+		smp_wmb();
+
+	return ret;
 }
 
 /* compute the message size including the padding bytes */
@@ -614,6 +650,10 @@ static int log_store(struct printk_log *log, int facility, int level,
 	memset(get_dict(msg) + dict_len, 0, pad_len);
 	msg->len = size;
 
+	/* write the data before we move to the next position */
+	if (log == &nmi_log)
+		smp_wmb();
+
 	/* insert message */
 	inc_next_pos(log);
 
@@ -1042,6 +1082,7 @@ void __init setup_log_buf(int early)
 	if (!nmi_log.buf) {
 		if (nmi_log.buf_len > NMI_MAX_LEN)
 			nmi_log.buf_len = NMI_MAX_LEN;
+		nmi_max_messages = nmi_log.buf_len / sizeof(struct printk_msg);
 		/* zero length means that the feature is disabled */
 		if (nmi_log.buf_len)
 			nmi_log.buf = alloc_log_buf(early, nmi_log.buf_len);
@@ -1766,6 +1807,170 @@ static size_t cont_print_text(char *text, size_t size)
 	return textlen;
 }
 
+/*
+ * This function copies one message from NMI log buffer to the main one.
+ * It cannot guarantee that valid data are copied because the NMI buffer
+ * can be modified in parallel. It just does some basic checks, especially
+ * to make sure that we do not read outside of the buffer. Anyway, the
+ * caller should do more checks, for example by validating the sequence
+ * number.
+ */
+static int merge_nmi_msg(u32 merge_idx, u64 merge_seq)
+{
+	struct printk_msg *msg;
+	u16 text_len, dict_len;
+	u32 after_msg_idx, pad_len, size, counted_size;
+
+	/*
+	 * The given idx might be invalid, especially when it was read via
+	 * inc_idx() without having the related log buffer lock.
+	 */
+	if (merge_idx > nmi_log.buf_len - sizeof(struct printk_msg))
+		return -EINVAL;
+
+	/*
+	 * Get more info about the message. The values could be modified in NMI
+	 * context at any time.
+	 */
+	msg = msg_from_idx(&nmi_log, merge_idx);
+	text_len = ACCESS_ONCE(msg->text_len);
+	dict_len = ACCESS_ONCE(msg->dict_len);
+	size = ACCESS_ONCE(msg->len);
+
+	/* check a bit the consistency */
+	counted_size = msg_used_size(text_len, dict_len, &pad_len);
+	if (counted_size != size)
+		return -EINVAL;
+	/*
+	 * Make sure that we do not read outside of NMI log buf.
+	 *
+	 * First, get the real idx. The original one might have pointed to the
+	 * zero-length message that signals the end of the buffer. Thus we
+	 * need to compute it from the "msg" pointer.
+	 */
+	merge_idx = (char *)msg - nmi_log.buf;
+	after_msg_idx = merge_idx + counted_size;
+	if (after_msg_idx > nmi_log.buf_len)
+		return -EINVAL;
+
+	log_store(&main_log, msg->facility, msg->level,
+		  msg->flags, msg->ts_nsec,
+		  get_dict_safely(msg, text_len), dict_len,
+		  get_text(msg), text_len);
+
+	return 0;
+}
+
+/*
+ * Unfortunately, we cannot guarantee that a sequence number
+ * is 100% valid because the whole range can get rotated in NMI
+ * context between readings in the normal context.
+ *
+ * This is just a best guess. Any valid sequence number must be in
+ * the range from first_seq to first_seq + the maximum number of
+ * messages that fit into the buffer.
+ *
+ * An invalid result is not critical because it will get detected
+ * by the consistency check in merge_nmi_msg(). In the worst case
+ * it will copy some garbage and the inconsistency will be detected
+ * and resolved later.
+ */
+static int nmi_seq_is_invalid(u64 seq, u64 first_seq)
+{
+	if (seq >= first_seq) {
+		if (seq - first_seq > nmi_max_messages)
+			return 1;
+		else
+			return 0;
+	}
+
+	if (first_seq - seq < NMI_MAX_SEQ - nmi_max_messages)
+		return 1;
+	else
+		return 0;
+}
+
+/*
+ * Called to merge strings from NMI ring buffer into the regular ring buffer.
+ *
+ * Messages can be asynchronously added and even removed in NMI context under
+ * nmi_logbuf_lock. We need to be VERY CAUTIOUS and work with valid indexes all
+ * the time. We might even need to revert a store operation if the message
+ * gets overwritten in the meantime.
+ */
+static void merge_nmi_delayed_printk(void)
+{
+	static u64 nmi_merge_seq;
+	static u32 nmi_merge_idx;
+	unsigned long nmi_first_id, nmi_next_id;
+	u32 old_main_next_idx;
+	u64 old_main_next_seq;
+
+	while (true) {
+		/*
+		 * Refresh next_id information in every iteration. There might
+		 * be new messages.
+		 */
+		nmi_next_id = ACCESS_ONCE(nmi_log.nmi.next_id);
+
+		/* we are done when all messages have been merged already */
+		if (likely(nmi_merge_seq == seq_from_id(nmi_next_id)))
+			return;
+
+		/* check if we lost some messages */
+		nmi_first_id = ACCESS_ONCE(nmi_log.nmi.first_id);
+		if (nmi_seq_is_invalid(nmi_merge_seq,
+				       seq_from_id(nmi_first_id))) {
+restart_merge:
+			nmi_merge_seq = seq_from_id(nmi_first_id);
+			nmi_merge_idx = idx_from_id(nmi_first_id);
+		}
+
+		/*
+		 * Make sure that the whole message has been written for the
+		 * given idx.
+		 */
+		smp_rmb();
+		/* store current state of the main log buffer */
+		old_main_next_idx = get_next_idx(&main_log);
+		old_main_next_seq = get_next_seq(&main_log);
+		/* restart merge if copying fails */
+		if (merge_nmi_msg(nmi_merge_idx, nmi_merge_seq)) {
+			nmi_first_id = ACCESS_ONCE(nmi_log.nmi.first_id);
+			goto restart_merge;
+		}
+
+		/*
+		 * Make sure that the data were copied before validating. We
+		 * check that we read the correct data, so a read barrier is
+		 * enough.
+		 */
+		smp_rmb();
+		/* check if we have copied a valid message */
+		nmi_first_id = ACCESS_ONCE(nmi_log.nmi.first_id);
+		if (nmi_seq_is_invalid(nmi_merge_seq,
+				       seq_from_id(nmi_first_id))) {
+			/*
+			 * The copied message no longer exists in the NMI
+			 * log buffer. It was most likely modified during
+			 * copying, so forget it and restart the merge.
+			 */
+			set_next_pos(&main_log,
+				     old_main_next_idx, old_main_next_seq);
+			goto restart_merge;
+		}
+
+		/*
+		 * The message was valid when copying. Go to next one. Note that
+		 * we might read broken idx here that might point outside of the
+		 * log buffer. But this will be detected in merge_nmi_msg() when
+		 * checking the index validity.
+		 */
+		nmi_merge_idx = inc_idx(&nmi_log, nmi_merge_idx);
+		nmi_merge_seq = inc_seq(&nmi_log, nmi_merge_seq);
+	}
+}
+
 asmlinkage int vprintk_emit(int facility, int level,
 			    const char *dict, size_t dictlen,
 			    const char *fmt, va_list args)
@@ -1773,7 +1978,6 @@ asmlinkage int vprintk_emit(int facility, int level,
 	struct printk_log *log = &main_log;
 	struct printk_cont *cont = main_log.cont;
 	static int recursion_bug;
-	static char textbuf[LOG_LINE_MAX];
 	char *text = textbuf;
 	size_t text_len = 0;
 	enum log_flags lflags = 0;
@@ -1799,7 +2003,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 	/*
 	 * Ouch, printk recursed into itself!
 	 */
-	if (unlikely(logbuf_cpu == this_cpu)) {
+	if (unlikely(logbuf_cpu == this_cpu) && !in_nmi()) {
 		/*
 		 * If a crash is occurring during printk() on this CPU,
 		 * then try to get the crash message out but make sure
@@ -1816,18 +2020,46 @@ asmlinkage int vprintk_emit(int facility, int level,
 	}
 
 	lockdep_off();
-	raw_spin_lock(&main_logbuf_lock);
-	logbuf_cpu = this_cpu;
 
-	if (recursion_bug) {
-		static const char recursion_msg[] =
-			"BUG: recent printk recursion!";
+	/*
+	 * Get lock for a log buffer. Make sure we are not going to deadlock
+	 * when we managed to preempt the currently running printk from NMI
+	 * context. When we are not sure, rather copy the current message
+	 * into NMI ring buffer and merge it later.
+	 */
+	if (likely(!in_nmi())) {
+		raw_spin_lock(&main_logbuf_lock);
+	} else {
+		if (!raw_spin_trylock(&main_logbuf_lock)) {
+			if (!nmi_log.buf) {
+				lockdep_on();
+				local_irq_restore(flags);
+				return 0;
+			}
 
-		recursion_bug = 0;
-		text_len = strlen(recursion_msg);
-		/* emit KERN_CRIT message */
-		printed_len += log_store(log, 0, 2, LOG_PREFIX|LOG_NEWLINE, 0,
-					 NULL, 0, recursion_msg, text_len);
+			raw_spin_lock(&nmi_logbuf_lock);
+			log = &nmi_log;
+			cont = &nmi_cont;
+			text = nmi_textbuf;
+		}
+	}
+
+	if (likely(log == &main_log)) {
+		logbuf_cpu = this_cpu;
+		merge_nmi_delayed_printk();
+
+		if (recursion_bug) {
+			static const char recursion_msg[] =
+				"BUG: recent printk recursion!";
+
+			recursion_bug = 0;
+			text_len = strlen(recursion_msg);
+			/* emit KERN_CRIT message */
+			printed_len += log_store(log, 0, 2,
+						 LOG_PREFIX|LOG_NEWLINE, 0,
+						 NULL, 0, recursion_msg,
+						 text_len);
+		}
 	}
 
 	/*
@@ -1918,13 +2150,21 @@ asmlinkage int vprintk_emit(int facility, int level,
 						 text, text_len);
 	}
 
-	logbuf_cpu = UINT_MAX;
-	raw_spin_unlock(&main_logbuf_lock);
+	if (likely(log == &main_log)) {
+		logbuf_cpu = UINT_MAX;
+		raw_spin_unlock(&main_logbuf_lock);
+	} else {
+		raw_spin_unlock(&nmi_logbuf_lock);
+	}
+
 	lockdep_on();
 	local_irq_restore(flags);
 
-	/* If called from the scheduler, we can not call up(). */
-	if (in_sched)
+	/*
+	 * If called from the scheduler or NMI context, we cannot get the
+	 * console without risking a deadlock.
+	 */
+	if (in_sched || in_nmi())
 		return printed_len;
 
 	/*
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 07/11] printk: right ordering of the cont buffers from NMI context
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
                   ` (5 preceding siblings ...)
  2014-05-09  9:11 ` [RFC PATCH 06/11] printk: NMI safe printk Petr Mladek
@ 2014-05-09  9:11 ` Petr Mladek
  2014-05-09  9:11 ` [RFC PATCH 08/11] printk: try hard to print Oops message in " Petr Mladek
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

When using the NMI log buffer, continuous messages sometimes got mixed up
after being merged into the main log buffer.

The problems are solved by the following two tricks. First, continuous messages
are always put into the NMI log buffer if it already contains something. Second,
the main cont buffer is always flushed before merging any message from the NMI
log buffer.

Note that the second part of a continuous message always has to be the first
message in the NMI log buffer. If the first part went to the main log buffer,
we held the main log buffer lock at that point and had already merged all
previous NMI messages.

The only drawback is that we check "nmi_cont" without having "nmi_logbuf_lock".
Well, parallel NMIs are very rare. In fact, they should not happen. In the
worst case, we create some mess in the log.

I tested this a lot. I still see a single line getting split into two lines
from time to time. In any case, this patch helped a lot; in particular, it
fixed problems with the ordering of the line parts.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 12435f3ef2d4..7c992b8f44a4 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1905,6 +1905,7 @@ static void merge_nmi_delayed_printk(void)
 	unsigned long nmi_first_id, nmi_next_id;
 	u32 old_main_next_idx;
 	u64 old_main_next_seq;
+	int main_cont_flushed = 0;
 
 	while (true) {
 		/*
@@ -1917,6 +1918,15 @@ static void merge_nmi_delayed_printk(void)
 		if (likely(nmi_merge_seq == seq_from_id(nmi_next_id)))
 			return;
 
+		/*
+		 * The main cont buffer might include the first part of the
+		 * first message from NMI context.
+		 */
+		if (unlikely(!main_cont_flushed)) {
+			cont_flush(&main_log, LOG_CONT);
+			main_cont_flushed = 1;
+		}
+
 		/* check if we lost some messages */
 		nmi_first_id = ACCESS_ONCE(nmi_log.nmi.first_id);
 		if (nmi_seq_is_invalid(nmi_merge_seq,
@@ -2030,7 +2040,12 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (likely(!in_nmi())) {
 		raw_spin_lock(&main_logbuf_lock);
 	} else {
-		if (!raw_spin_trylock(&main_logbuf_lock)) {
+		/*
+		 * Always use NMI ring buffer if something is already
+		 * in the cont buffer.
+		 */
+		if ((nmi_cont.len && nmi_cont.owner == current) ||
+		    !raw_spin_trylock(&main_logbuf_lock)) {
 			if (!nmi_log.buf) {
 				lockdep_on();
 				local_irq_restore(flags);
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 08/11] printk: try hard to print Oops message in NMI context
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
                   ` (6 preceding siblings ...)
  2014-05-09  9:11 ` [RFC PATCH 07/11] printk: right ordering of the cont buffers from NMI context Petr Mladek
@ 2014-05-09  9:11 ` Petr Mladek
  2014-05-09  9:11 ` [RFC PATCH 09/11] printk: merge and flush NMI buffer predictably via IRQ work Petr Mladek
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

Oops messages are important for debugging. We should try harder to get them into
the main ring buffer and print them to the console. This is problematic in NMI
context because the needed locks might already be taken.

What we can do, though, is to zap all printk locks. We already do this
when a printk recursion is detected. This should be safe because the system
is crashing and there shouldn't be any printk caller by now. In case somebody
manages to grab the logbuf_lock after zap_locks(), we just fall back to the
NMI ring buffer and hope that someone else will merge the message strings and
flush the buffer.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 7c992b8f44a4..cc6e77f6d72b 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2036,16 +2036,26 @@ asmlinkage int vprintk_emit(int facility, int level,
 	 * when we managed to preempt the currently running printk from NMI
 	 * context. When we are not sure, rather copy the current message
 	 * into NMI ring buffer and merge it later.
+	 *
+	 * Oops messages from NMI context are a special case. We try hard to
+	 * print them: we forcefully drop existing locks, pass them via the
+	 * main log buffer, and later even push them to the console.
 	 */
 	if (likely(!in_nmi())) {
 		raw_spin_lock(&main_logbuf_lock);
 	} else {
 		/*
 		 * Always use NMI ring buffer if something is already
-		 * in the cont buffer.
+		 * in the cont buffer, except for Oops.
 		 */
-		if ((nmi_cont.len && nmi_cont.owner == current) ||
-		    !raw_spin_trylock(&main_logbuf_lock)) {
+		bool force_nmi_logbuf = nmi_cont.len &&
+			nmi_cont.owner == current &&
+			!oops_in_progress;
+
+		if (oops_in_progress)
+			zap_locks();
+
+		if (force_nmi_logbuf || !raw_spin_trylock(&main_logbuf_lock)) {
 			if (!nmi_log.buf) {
 				lockdep_on();
 				local_irq_restore(flags);
@@ -2178,8 +2188,12 @@ asmlinkage int vprintk_emit(int facility, int level,
 	/*
 	 * If called from the scheduler or NMI context, we cannot get the
 	 * console without risking a deadlock.
+	 *
+	 * The only exception is Oops messages from NMI context where all
+	 * relevant locks have been forcefully dropped above. We have to try
+	 * to get the console, otherwise the last messages would get lost.
 	 */
-	if (in_sched || in_nmi())
+	if (in_sched || (in_nmi() && !oops_in_progress))
 		return printed_len;
 
 	/*
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 09/11] printk: merge and flush NMI buffer predictably via IRQ work
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
                   ` (7 preceding siblings ...)
  2014-05-09  9:11 ` [RFC PATCH 08/11] printk: try hard to print Oops message in " Petr Mladek
@ 2014-05-09  9:11 ` Petr Mladek
  2014-05-09  9:11 ` [RFC PATCH 10/11] printk: survive rotation of sequence numbers Petr Mladek
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

It might take ages until users see messages from NMI context. They cannot
be flushed to the console directly because the operation involves taking and
releasing a bunch of locks. If the main lock is not available, the messages
are even temporarily put into the NMI log buffer. Everything gets fixed by a
follow-up printk in normal context, but that is not predictable.

printk_sched() has the same problem, and this patch reuses its existing solution.

There is no special printk() variant for NMI context. Hence the IRQ work
needs to be queued from vprintk_emit(). This is why the functions are
moved to the top of the source file and the irq_work_queue() call is removed
from printk_sched().

In addition, wake_up_klogd_work_func() is extended to merge the NMI log buffer
when needed.

Oops messages are still merged and flushed immediately. We need to make sure
that they are visible before the system dies.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 131 ++++++++++++++++++++++++++++---------------------
 1 file changed, 74 insertions(+), 57 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cc6e77f6d72b..874ebab41ba3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -311,6 +311,10 @@ static u64 console_seq;
 static u32 console_idx;
 static enum log_flags console_prev;
 
+/* the next printk record to merge from NMI log buffer */
+static u64 nmi_merge_seq;
+static u32 nmi_merge_idx;
+
 /* the next printk record to read after the last 'clear' command */
 static u64 clear_seq;
 static u32 clear_idx;
@@ -1711,6 +1715,72 @@ static inline void printk_delay(void)
 	}
 }
 
+/*
+ * Delayed printk version, for scheduler-internal messages. The IRQ work is
+ * also used to handle delayed messages from NMI context.
+ */
+#define PRINTK_PENDING_WAKEUP	0x01
+#define PRINTK_PENDING_OUTPUT	0x02
+
+static DEFINE_PER_CPU(int, printk_pending);
+
+static void merge_nmi_delayed_printk(void);
+
+static void wake_up_klogd_work_func(struct irq_work *irq_work)
+{
+	int pending = __this_cpu_xchg(printk_pending, 0);
+	u64 nmi_next_id = ACCESS_ONCE(nmi_log.nmi.next_id);
+
+	/*
+	 * Check if there are any pending messages in the NMI log buffer.
+	 * We do not need any lock. If new messages are being added,
+	 * another IRQ work is scheduled automatically.
+	 */
+	if (ACCESS_ONCE(nmi_merge_seq) != seq_from_id(nmi_next_id)) {
+		raw_spin_lock(&main_logbuf_lock);
+		merge_nmi_delayed_printk();
+		raw_spin_unlock(&main_logbuf_lock);
+	}
+
+	if (pending & PRINTK_PENDING_OUTPUT) {
+		/* If trylock fails, someone else is doing the printing */
+		if (console_trylock())
+			console_unlock();
+	}
+
+	if (pending & PRINTK_PENDING_WAKEUP)
+		wake_up_interruptible(&log_wait);
+}
+
+static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
+	.func = wake_up_klogd_work_func,
+	.flags = IRQ_WORK_LAZY,
+};
+
+void wake_up_klogd(void)
+{
+	preempt_disable();
+	if (waitqueue_active(&log_wait)) {
+		this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
+		irq_work_queue(&__get_cpu_var(wake_up_klogd_work));
+	}
+	preempt_enable();
+}
+
+int printk_deferred(const char *fmt, ...)
+{
+	va_list args;
+	int r;
+
+	preempt_disable();
+	va_start(args, fmt);
+	r = vprintk_emit(0, SCHED_MESSAGE_LOGLEVEL, NULL, 0, fmt, args);
+	va_end(args);
+	preempt_enable();
+
+	return r;
+}
+
 static void cont_flush(struct printk_log *log, enum log_flags flags)
 {
 	struct printk_cont *cont = log->cont;
@@ -1900,8 +1970,6 @@ static int nmi_seq_is_invalid(u64 seq, u64 first_seq)
  */
 static void merge_nmi_delayed_printk(void)
 {
-	static u64 nmi_merge_seq;
-	static u32 nmi_merge_idx;
 	unsigned long nmi_first_id, nmi_next_id;
 	u32 old_main_next_idx;
 	u64 old_main_next_seq;
@@ -2193,8 +2261,11 @@ asmlinkage int vprintk_emit(int facility, int level,
 	 * relevant locks have been forcefully dropped above. We have to try
 	 * to get the console, otherwise the last messages would get lost.
 	 */
-	if (in_sched || (in_nmi() && !oops_in_progress))
+	if (in_sched || (in_nmi() && !oops_in_progress)) {
+		__this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT);
+		irq_work_queue(&__get_cpu_var(wake_up_klogd_work));
 		return printed_len;
+	}
 
 	/*
 	 * Disable preemption to avoid being preempted while holding
@@ -3039,60 +3110,6 @@ late_initcall(printk_late_init);
 
 #if defined CONFIG_PRINTK
 /*
- * Delayed printk version, for scheduler-internal messages:
- */
-#define PRINTK_PENDING_WAKEUP	0x01
-#define PRINTK_PENDING_OUTPUT	0x02
-
-static DEFINE_PER_CPU(int, printk_pending);
-
-static void wake_up_klogd_work_func(struct irq_work *irq_work)
-{
-	int pending = __this_cpu_xchg(printk_pending, 0);
-
-	if (pending & PRINTK_PENDING_OUTPUT) {
-		/* If trylock fails, someone else is doing the printing */
-		if (console_trylock())
-			console_unlock();
-	}
-
-	if (pending & PRINTK_PENDING_WAKEUP)
-		wake_up_interruptible(&log_wait);
-}
-
-static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
-	.func = wake_up_klogd_work_func,
-	.flags = IRQ_WORK_LAZY,
-};
-
-void wake_up_klogd(void)
-{
-	preempt_disable();
-	if (waitqueue_active(&log_wait)) {
-		this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP);
-		irq_work_queue(&__get_cpu_var(wake_up_klogd_work));
-	}
-	preempt_enable();
-}
-
-int printk_deferred(const char *fmt, ...)
-{
-	va_list args;
-	int r;
-
-	preempt_disable();
-	va_start(args, fmt);
-	r = vprintk_emit(0, SCHED_MESSAGE_LOGLEVEL, NULL, 0, fmt, args);
-	va_end(args);
-
-	__this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT);
-	irq_work_queue(&__get_cpu_var(wake_up_klogd_work));
-	preempt_enable();
-
-	return r;
-}
-
-/*
  * printk rate limiting, lifted from the networking subsystem.
  *
  * This enforces a rate limit: not more than 10 kernel messages
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 10/11] printk: survive rotation of sequence numbers
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
                   ` (8 preceding siblings ...)
  2014-05-09  9:11 ` [RFC PATCH 09/11] printk: merge and flush NMI buffer predictably via IRQ work Petr Mladek
@ 2014-05-09  9:11 ` Petr Mladek
  2014-05-09  9:11 ` [RFC PATCH 11/11] printk: avoid stalling when merging NMI log buffer Petr Mladek
  2014-05-28 22:02 ` [RFC PATCH 00/11] printk: safe printing in NMI context Jiri Kosina
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

The NMI log buffer uses half of an "unsigned long" to store the sequence
number. The maximum value might be relatively small, especially on 32-bit
systems. This patch makes the check for free space stricter, so it does not
give false results after the sequence number wraps around.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 874ebab41ba3..9c97b6daccc3 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -540,7 +540,7 @@ static int log_make_free_space(struct printk_log *log, u32 msg_size)
 	int freed = 0;
 	int ret = 0;
 
-	while (get_first_seq(log) < get_next_seq(log)) {
+	while (get_first_seq(log) != get_next_seq(log)) {
 		if (logbuf_has_space(log, msg_size, false))
 			goto out;
 
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 11/11] printk: avoid stalling when merging NMI log buffer
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
                   ` (9 preceding siblings ...)
  2014-05-09  9:11 ` [RFC PATCH 10/11] printk: survive rotation of sequence numbers Petr Mladek
@ 2014-05-09  9:11 ` Petr Mladek
  2014-05-28 22:02 ` [RFC PATCH 00/11] printk: safe printing in NMI context Jiri Kosina
  11 siblings, 0 replies; 39+ messages in thread
From: Petr Mladek @ 2014-05-09  9:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Jiri Kosina, Michal Hocko,
	Jan Kara, linux-kernel, Petr Mladek

New messages can appear in the NMI log buffer at any time. The code merging
them into the main log buffer might have a hard time catching up with the
flood. It can even be interrupted by an NMI that adds new messages itself.

It would make sense to limit the number of copied messages and postpone
the merging to a later time and even another CPU. We could nicely reuse
the IRQ work here.

One drawback is that it might create a mess when merging continuous lines,
but this would be very rare. IMHO, it is not worth making the code more
complicated.

To be honest, I am not too happy with the current limit of 5000 messages.
The number is pulled out of thin air ;-) I would feel better with some
limit, but I also might be too paranoid. Any better idea is welcome.

Signed-off-by: Petr Mladek <pmladek@suse.cz>
---
 kernel/printk/printk.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 9c97b6daccc3..26accbf2186d 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -380,6 +380,8 @@ static struct printk_log nmi_log = {
  */
 #define NMI_MAX_LEN_POWER (min(NMI_IDX_BYTES, sizeof(nmi_log.buf_len) * 8 - 1))
 #define NMI_MAX_LEN (1UL << NMI_MAX_LEN_POWER)
+/* maximum number of merged strings in one call */
+#define NMI_MAX_MERGE 5000
 
 /* maximum number of messages that fit into NMI log buffer */
 static unsigned int nmi_max_messages;
@@ -1974,6 +1976,7 @@ static void merge_nmi_delayed_printk(void)
 	u32 old_main_next_idx;
 	u64 old_main_next_seq;
 	int main_cont_flushed = 0;
+	int merged = 0;
 
 	while (true) {
 		/*
@@ -2004,6 +2007,13 @@ restart_merge:
 			nmi_merge_idx = idx_from_id(nmi_first_id);
 		}
 
+		/* do not stall the CPU with too many merges */
+		if (merged++ > NMI_MAX_MERGE) {
+			__this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT);
+			irq_work_queue(&__get_cpu_var(wake_up_klogd_work));
+			return;
+		}
+
 		/*
 		 * Make sure that the whole message has been written for the
 		 * given idx.
-- 
1.8.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
                   ` (10 preceding siblings ...)
  2014-05-09  9:11 ` [RFC PATCH 11/11] printk: avoid stalling when merging NMI log buffer Petr Mladek
@ 2014-05-28 22:02 ` Jiri Kosina
  2014-05-29  0:09   ` Frederic Weisbecker
  11 siblings, 1 reply; 39+ messages in thread
From: Jiri Kosina @ 2014-05-28 22:02 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Andrew Morton, Frederic Weisbecker, Steven Rostedt,
	Dave Anderson, Paul E. McKenney, Kay Sievers, Michal Hocko,
	Jan Kara, linux-kernel

On Fri, 9 May 2014, Petr Mladek wrote:

> printk() cannot be used safely in NMI context because it uses internal locks
> and thus could cause a deadlock. Unfortunately there are circumstances when
> calling printk from NMI is very useful. For example, all WARN.*(in_nmi())
> would be much more helpful if they didn't lockup the machine.
> 
> Another example would be arch_trigger_all_cpu_backtrace for x86 which uses NMI
> to dump traces on all CPU (either triggered by sysrq+l or from RCU stall
> detector).

I am rather surprised that this patchset hasn't received a single review 
comment for 3 weeks.

Let me point out that the issues Petr is talking about in the cover letter 
are real -- we've actually seen the lockups triggered by RCU stall 
detector trying to dump stacks on all CPUs, and hard-locking machine up 
while doing so.

So this really needs to be solved.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-05-28 22:02 ` [RFC PATCH 00/11] printk: safe printing in NMI context Jiri Kosina
@ 2014-05-29  0:09   ` Frederic Weisbecker
  2014-05-29  8:09     ` Jiri Kosina
                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2014-05-29  0:09 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Petr Mladek, Andrew Morton, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Michal Hocko, Jan Kara,
	linux-kernel

On Thu, May 29, 2014 at 12:02:30AM +0200, Jiri Kosina wrote:
> On Fri, 9 May 2014, Petr Mladek wrote:
> 
> > printk() cannot be used safely in NMI context because it uses internal locks
> > and thus could cause a deadlock. Unfortunately there are circumstances when
> > calling printk from NMI is very useful. For example, all WARN.*(in_nmi())
> > would be much more helpful if they didn't lockup the machine.
> > 
> > Another example would be arch_trigger_all_cpu_backtrace for x86 which uses NMI
> > to dump traces on all CPU (either triggered by sysrq+l or from RCU stall
> > detector).
> 
> I am rather surprised that this patchset hasn't received a single review 
> comment for 3 weeks.
> 
> Let me point out that the issues Petr is talking about in the cover letter 
> are real -- we've actually seen the lockups triggered by RCU stall 
> detector trying to dump stacks on all CPUs, and hard-locking machine up 
> while doing so.
> 
> So this really needs to be solved.

The lack of review may be partly due to a not very appealing changestat on an
old codebase that is already unpopular:

 Documentation/kernel-parameters.txt |   19 +-
 kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
 2 files changed, 878 insertions(+), 359 deletions(-)


Your patches look clean and pretty nice actually. They must be seriously considered if
we want to keep the current locked ring buffer design and extend it to multiple per context
buffers. But I wonder if it's worth to continue that way with the printk ancient design.

If it takes more than 1000 line changes (including 500 added) to make it finally work
correctly with NMIs by working around its fundamental flaws, shouldn't we rather redesign
it to use a lockless ring buffer like ftrace or perf ones?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-05-29  0:09   ` Frederic Weisbecker
@ 2014-05-29  8:09     ` Jiri Kosina
  2014-06-10 16:46       ` Frederic Weisbecker
  2014-05-30  8:13     ` Jan Kara
  2014-06-12 11:50     ` Petr Mládek
  2 siblings, 1 reply; 39+ messages in thread
From: Jiri Kosina @ 2014-05-29  8:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Petr Mladek, Andrew Morton, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Michal Hocko, Jan Kara,
	linux-kernel

On Thu, 29 May 2014, Frederic Weisbecker wrote:

> > I am rather surprised that this patchset hasn't received a single review 
> > comment for 3 weeks.
> > 
> > Let me point out that the issues Petr is talking about in the cover letter 
> > are real -- we've actually seen the lockups triggered by RCU stall 
> > detector trying to dump stacks on all CPUs, and hard-locking machine up 
> > while doing so.
> > 
> > So this really needs to be solved.
> 
> The lack of review may be partly due to a not very appealing changestat 
> on an old codebase that is already unpopular:
> 
>  Documentation/kernel-parameters.txt |   19 +-
>  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
>  2 files changed, 878 insertions(+), 359 deletions(-)
> 
> 
> Your patches look clean and pretty nice actually. They must be seriously 
> considered if we want to keep the current locked ring buffer design and 
> extend it to multiple per context buffers. But I wonder if it's worth to 
> continue that way with the printk ancient design.
> 
> If it takes more than 1000 line changes (including 500 added) to make it 
> finally work correctly with NMIs by working around its fundamental 
> flaws, shouldn't we rather redesign it to use a lockless ring buffer 
> like ftrace or perf ones?

Yeah, printk() has grown over the years into a stinking pile of you-know-what, 
no argument to that.

I also agree that performing a massive rewrite, which will make it use a 
lockless buffer, and therefore ultimately solve all its problems 
(scheduler deadlocks, NMI deadlocks, xtime_lock deadlocks) at once, is 
necessary in the long run.

On the other hand, I am completely sure that the diffstat for such rewrite 
is going to be much more scary :)

This is not adding fancy features to printk(), where we really should be 
saying no; horrible commits like 7ff9554bb5 are exactly something that 
should be pushed against *heavily*. But bugfixes for hard machine lockups 
are a completely different story to me (until we have a whole new printk() 
buffer handling implementation).

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-05-29  0:09   ` Frederic Weisbecker
  2014-05-29  8:09     ` Jiri Kosina
@ 2014-05-30  8:13     ` Jan Kara
  2014-05-30 10:10       ` Jiri Kosina
  2014-06-10 16:49       ` Frederic Weisbecker
  2014-06-12 11:50     ` Petr Mládek
  2 siblings, 2 replies; 39+ messages in thread
From: Jan Kara @ 2014-05-30  8:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jiri Kosina, Petr Mladek, Andrew Morton, Steven Rostedt,
	Dave Anderson, Paul E. McKenney, Kay Sievers, Michal Hocko,
	Jan Kara, linux-kernel

On Thu 29-05-14 02:09:11, Frederic Weisbecker wrote:
> On Thu, May 29, 2014 at 12:02:30AM +0200, Jiri Kosina wrote:
> > On Fri, 9 May 2014, Petr Mladek wrote:
> > 
> > > printk() cannot be used safely in NMI context because it uses internal locks
> > > and thus could cause a deadlock. Unfortunately there are circumstances when
> > > calling printk from NMI is very useful. For example, all WARN.*(in_nmi())
> > > would be much more helpful if they didn't lockup the machine.
> > > 
> > > Another example would be arch_trigger_all_cpu_backtrace for x86 which uses NMI
> > > to dump traces on all CPU (either triggered by sysrq+l or from RCU stall
> > > detector).
> > 
> > I am rather surprised that this patchset hasn't received a single review 
> > comment for 3 weeks.
> > 
> > Let me point out that the issues Petr is talking about in the cover letter 
> > are real -- we've actually seen the lockups triggered by RCU stall 
> > detector trying to dump stacks on all CPUs, and hard-locking machine up 
> > while doing so.
> > 
> > So this really needs to be solved.
> 
> The lack of review may be partly due to a not very appealing changestat on an
> old codebase that is already unpopular:
> 
>  Documentation/kernel-parameters.txt |   19 +-
>  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
>  2 files changed, 878 insertions(+), 359 deletions(-)
> 
> 
> Your patches look clean and pretty nice actually. They must be seriously
> considered if we want to keep the current locked ring buffer design and
> extend it to multiple per context buffers. But I wonder if it's worth to
> continue that way with the printk ancient design.
> 
> If it takes more than 1000 line changes (including 500 added) to make it
> finally work correctly with NMIs by working around its fundamental flaws,
> shouldn't we rather redesign it to use a lockless ring buffer like ftrace
> or perf ones?
  I agree that a lockless ring buffer would be a more elegant solution, but
a much more intrusive and complex one as well. Petr's patch set basically
leaves the ordinary printk path intact to avoid concerns about regressions
there.

Given how difficult / time consuming it is to push any complex changes to
printk, I'd push for fixing printk from NMI in this inelegant but relatively
non-contentious way and work on converting printk to a lockless
implementation long term. But before spending a huge amount of time on that,
I'd like to get some wider consensus that this is really the way we want to
go - at least from AKPM and Steven - something for discussion in the KS
topic I proposed, I think [1].

							Honza

[1]
http://lists.linuxfoundation.org/pipermail/ksummit-discuss/2014-May/000598.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-05-30  8:13     ` Jan Kara
@ 2014-05-30 10:10       ` Jiri Kosina
  2014-06-10 16:49       ` Frederic Weisbecker
  1 sibling, 0 replies; 39+ messages in thread
From: Jiri Kosina @ 2014-05-30 10:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Frederic Weisbecker, Petr Mladek, Andrew Morton, Steven Rostedt,
	Dave Anderson, Paul E. McKenney, Kay Sievers, Michal Hocko,
	linux-kernel

On Fri, 30 May 2014, Jan Kara wrote:

> >  Documentation/kernel-parameters.txt |   19 +-
> >  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
> >  2 files changed, 878 insertions(+), 359 deletions(-)
> > 
> > 
> > Your patches look clean and pretty nice actually. They must be seriously
> > considered if we want to keep the current locked ring buffer design and
> > extend it to multiple per context buffers. But I wonder if it's worth to
> > continue that way with the printk ancient design.
> > 
> > If it takes more than 1000 line changes (including 500 added) to make it
> > finally work correctly with NMIs by working around its fundamental flaws,
> > shouldn't we rather redesign it to use a lockless ring buffer like ftrace
> > or perf ones?
>   I agree that lockless ringbuffer would be a more elegant solution but a
> much more intrusive one and complex as well. Petr's patch set basically
> leaves ordinary printk path intact to avoid concerns about regressions
> there.

Fully agreed, the vast majority of the changes done by the patchset are on 
the unlikely in-NMI path, leaving normal printk operation as-is.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-05-29  8:09     ` Jiri Kosina
@ 2014-06-10 16:46       ` Frederic Weisbecker
  2014-06-10 16:57         ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2014-06-10 16:46 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Petr Mladek, Andrew Morton, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Michal Hocko, Jan Kara,
	linux-kernel, Linus Torvalds

On Thu, May 29, 2014 at 10:09:48AM +0200, Jiri Kosina wrote:
> On Thu, 29 May 2014, Frederic Weisbecker wrote:
> 
> > > I am rather surprised that this patchset hasn't received a single review 
> > > comment for 3 weeks.
> > > 
> > > Let me point out that the issues Petr is talking about in the cover letter 
> > > are real -- we've actually seen the lockups triggered by RCU stall 
> > > detector trying to dump stacks on all CPUs, and hard-locking machine up 
> > > while doing so.
> > > 
> > > So this really needs to be solved.
> > 
> > The lack of review may be partly due to a not very appealing diffstat 
> > on an old codebase that is already unpopular:
> > 
> >  Documentation/kernel-parameters.txt |   19 +-
> >  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
> >  2 files changed, 878 insertions(+), 359 deletions(-)
> > 
> > 
> > Your patches look clean and pretty nice actually. They must be seriously 
> > considered if we want to keep the current locked ring buffer design and 
> > extend it to multiple per-context buffers. But I wonder if it's worth 
> > continuing this way with printk's ancient design.
> > 
> > If it takes more than 1000 lines of changes (including 500 added) to make it 
> > finally work correctly with NMIs by working around its fundamental 
> > flaws, shouldn't we rather redesign it to use a lockless ring buffer 
> > like the ftrace or perf ones?
> 
> Yeah, printk() has grown over the years into a stinking pile of 
> you-know-what, no argument there.
> 
> I also agree that performing a massive rewrite, which will make it use a 
> lockless buffer, and therefore ultimately solve all its problems 
> (scheduler deadlocks, NMI deadlocks, xtime_lock deadlocks) at once, is 
> necessary in the long run.
> 
> On the other hand, I am completely sure that the diffstat for such rewrite 
> is going to be much more scary :)

Indeed, but probably much more valuable in the long term.

> 
> This is not adding fancy features to printk(), where we really should be 
> saying no; horrible commits like 7ff9554bb5 are exactly the sort of thing 
> that should be pushed against *heavily*. But bugfixes for hard machine 
> lockups are a completely different story to me (until we have a whole new 
> printk() buffer handling implementation).

Yeah, bugfixes are certainly another story. Still, it looks like yet another
layer of workarounds on top of a big hack.

But yeah, I'm certainly not in the right position to make anyone do a massive
rewrite of such a boring subsystem :)

There is also a big risk that if we push back on this bugfix, nobody will
actually do the desired rewrite.

Let's be crazy and Cc Linus on that.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-05-30  8:13     ` Jan Kara
  2014-05-30 10:10       ` Jiri Kosina
@ 2014-06-10 16:49       ` Frederic Weisbecker
  1 sibling, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2014-06-10 16:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jiri Kosina, Petr Mladek, Andrew Morton, Steven Rostedt,
	Dave Anderson, Paul E. McKenney, Kay Sievers, Michal Hocko,
	linux-kernel, Linus Torvalds

On Fri, May 30, 2014 at 10:13:28AM +0200, Jan Kara wrote:
> On Thu 29-05-14 02:09:11, Frederic Weisbecker wrote:
> > On Thu, May 29, 2014 at 12:02:30AM +0200, Jiri Kosina wrote:
> > > On Fri, 9 May 2014, Petr Mladek wrote:
> > > 
> > > > printk() cannot be used safely in NMI context because it uses internal locks
> > > > and thus could cause a deadlock. Unfortunately there are circumstances when
> > > > calling printk from NMI is very useful. For example, all WARN.*(in_nmi())
> > > > would be much more helpful if they didn't lockup the machine.
> > > > 
> > > > Another example would be arch_trigger_all_cpu_backtrace for x86 which uses NMI
> > > > to dump traces on all CPU (either triggered by sysrq+l or from RCU stall
> > > > detector).
> > > 
> > > I am rather surprised that this patchset hasn't received a single review 
> > > comment for 3 weeks.
> > > 
> > > Let me point out that the issues Petr is talking about in the cover letter 
> > > are real -- we've actually seen the lockups triggered by RCU stall 
> > > detector trying to dump stacks on all CPUs, and hard-locking machine up 
> > > while doing so.
> > > 
> > > So this really needs to be solved.
> > 
> > The lack of review may be partly due to a not very appealing diffstat on an
> > old codebase that is already unpopular:
> > 
> >  Documentation/kernel-parameters.txt |   19 +-
> >  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
> >  2 files changed, 878 insertions(+), 359 deletions(-)
> > 
> > 
> > Your patches look clean and pretty nice actually. They must be seriously
> > considered if we want to keep the current locked ring buffer design and
> > extend it to multiple per-context buffers. But I wonder if it's worth
> > continuing this way with printk's ancient design.
> > 
> > If it takes more than 1000 lines of changes (including 500 added) to make it
> > finally work correctly with NMIs by working around its fundamental flaws,
> > shouldn't we rather redesign it to use a lockless ring buffer like the
> > ftrace or perf ones?
>   I agree that a lockless ring buffer would be a more elegant solution, but
> also a much more intrusive and complex one. Petr's patch set basically
> leaves the ordinary printk path intact to avoid concerns about regressions
> there.
> 
> Given how difficult / time consuming it is to push any complex changes to
> printk, I'd push for fixing printk from NMI in this inelegant but relatively
> non-contentious way and work on converting printk to a lockless
> implementation long term. But before spending a huge amount of time on that
> I'd like to get some wider consensus that this is really the way we want to
> go - at least from AKPM and Steven - something for discussion in the KS
> topic I'd proposed, I think [1].

Agreed, let's wait for the others' opinions. Andrew, Steve?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-10 16:46       ` Frederic Weisbecker
@ 2014-06-10 16:57         ` Linus Torvalds
  2014-06-10 17:32           ` Jiri Kosina
  2014-06-18 11:03           ` Jiri Kosina
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2014-06-10 16:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jiri Kosina, Petr Mladek, Andrew Morton, Steven Rostedt,
	Dave Anderson, Paul E. McKenney, Kay Sievers, Michal Hocko,
	Jan Kara, Linux Kernel Mailing List

On Tue, Jun 10, 2014 at 9:46 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> There is also a big risk that if we push back on this bugfix, nobody will
> actually do the desired rewrite.
>
> Let's be crazy and Cc Linus on that.

Quite frankly, I hate seeing something like this:

 kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------

for something that is stupid and broken. Printing from NMI context
isn't really supposed to work, and we all *know* it's not supposed to
work.

I'd much rather disallow it, and if there is one or two places that
really want to print a warning and know that they are in NMI context,
have a special workaround just for them, with something that does
*not* try to make printk in general work any better.

Dammit, NMI context is special. I absolutely refuse to buy into the
broken concept that we should make more stuff work in NMI context.
Hell no, we should *not* try to make more crap work in NMI. NMI people
should be careful.

Make a trivial "printk_nmi()" wrapper that tries to do a trylock on
logbuf_lock, and *maybe* the existing sequence of

        if (console_trylock_for_printk())
                console_unlock();

then works for actually triggering the printout. But the wrapper
should be 15 lines of code for "if possible, try to print things", and
*not* a thousand lines of changes.

             Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-10 16:57         ` Linus Torvalds
@ 2014-06-10 17:32           ` Jiri Kosina
  2014-06-11  9:01             ` Petr Mládek
  2014-06-18 11:03           ` Jiri Kosina
  1 sibling, 1 reply; 39+ messages in thread
From: Jiri Kosina @ 2014-06-10 17:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Frederic Weisbecker, Petr Mladek, Andrew Morton, Steven Rostedt,
	Dave Anderson, Paul E. McKenney, Kay Sievers, Michal Hocko,
	Jan Kara, Linux Kernel Mailing List

On Tue, 10 Jun 2014, Linus Torvalds wrote:

> > Let's be crazy and Cc Linus on that.
> 
> Quite frankly, I hate seeing something like this:
> 
>  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
> 
> for something that is stupid and broken. Printing from NMI context
> isn't really supposed to work, and we all *know* it's not supposed to
> work.

It's OTOH rather useful in a few scenarios -- particularly it's the only 
way to dump stacktraces from remote CPUs in order to obtain traces that 
actually make sense (in situations like RCU stall); using workqueue-based 
dumping is useless there.

> I'd much rather disallow it, and if there is one or two places that
> really want to print a warning and know that they are in NMI context,
> have a special workaround just for them, with something that does
> *not* try to make printk in general work any better.

Well, that'd mean that at least our stack dumping mechanism would need to 
know both ways of printing; but yes, it'll still probably be less than 880 
lines added.

> Dammit, NMI context is special. I absolutely refuse to buy into the
> broken concept that we should make more stuff work in NMI context.
> Hell no, we should *not* try to make more crap work in NMI. NMI people
> should be careful.

In parallel, for the sake of argument, I'd propose just dropping the whole 
_CONT printing (and all the things that followed on top of it), as that made 
printk() a complete hell to maintain for a questionable gain, IMO.

> Make a trivial "printk_nmi()" wrapper that tries to do a trylock on
> logbuf_lock, and *maybe* the existing sequence of
> 
>         if (console_trylock_for_printk())
>                 console_unlock();
> 
> then works for actually triggering the printout. But the wrapper
> should be 15 lines of code for "if possible, try to print things", and
> *not* a thousand lines of changes.

Well, we are carrying a much simpler fix for this whole braindamage in our 
enterprise kernel, which is from the pre-7ff9554bb578 era, and it was a 
rather simple fix in principle (the diffstat is much larger than it had to 
be due to code movement):

	http://kernel.suse.com/cgit/kernel/commit/?h=SLE11-SP3&id=8d62ae68ff61d77ae3c4899f05dbd9c9742b14c9

But after the scary 7ff9554bb578 and its successors, things got a lot more 
complicated.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-10 17:32           ` Jiri Kosina
@ 2014-06-11  9:01             ` Petr Mládek
  0 siblings, 0 replies; 39+ messages in thread
From: Petr Mládek @ 2014-06-11  9:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jiri Kosina, Frederic Weisbecker, Andrew Morton, Steven Rostedt,
	Dave Anderson, Paul E. McKenney, Kay Sievers, Michal Hocko,
	Jan Kara, Linux Kernel Mailing List

On Tue 2014-06-10 19:32:51, Jiri Kosina wrote:
> On Tue, 10 Jun 2014, Linus Torvalds wrote:
> 
> > > Let's be crazy and Cc Linus on that.
> > 
> > Quite frankly, I hate seeing something like this:
> > 
> >  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
> > 
> > for something that is stupid and broken. Printing from NMI context
> > isn't really supposed to work, and we all *know* it's not supposed to
> > work.
> 
> It's OTOH rather useful in a few scenarios -- particularly it's the only 
> way to dump stacktraces from remote CPUs in order to obtain traces that 
> actually make sense (in situations like RCU stall); using workqueue-based 
> dumping is useless there.
> 
> > Make a trivial "printk_nmi()" wrapper that tries to do a trylock on
> > logbuf_lock, and *maybe* the existing sequence of
> > 
> >         if (console_trylock_for_printk())
> >                 console_unlock();
> > 
> > then works for actually triggering the printout. But the wrapper
> > should be 15 lines of code for "if possible, try to print things", and
> > *not* a thousand lines of changes.

I am afraid that this is basically what my patch set does. It does a
trylock and uses the main buffer when possible. I am just not able to
squeeze it into 15 lines :-(

One problem is that we do not want to lose the messages,
e.g. stacktraces. We need to store them somewhere and merge them into
the main ring buffer later.

> Well, we are carrying a much simpler fix for this whole braindamage in our 
> enterprise kernel, which is from the pre-7ff9554bb578 era, and it was a 
> rather simple fix in principle (the diffstat is much larger than it had to 
> be due to code movement):
> 
> 	http://kernel.suse.com/cgit/kernel/commit/?h=SLE11-SP3&id=8d62ae68ff61d77ae3c4899f05dbd9c9742b14c9
> 
> But after the scary 7ff9554bb578 and its successors, things got a lot more 
> complicated.

Yes, another big problem is the above-mentioned commit. Reading from the
temporary storage has to happen in normal context and thus must be lockless.
It is much more complicated when we work with whole messages and all
the added flags.

Also note that we want to save the last messages when the temporary storage
is full. This is why I used a ring buffer and was not able to use a
simpler producer/consumer algorithm.
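The keep-the-last-messages property can be sketched with a tiny overwriting ring. This is purely illustrative (the `nmi_ring` names are made up, and it is a single-writer model with none of the lockless merge-back machinery the patch set actually needs):

```c
#include <string.h>

/* Illustrative only: a fixed-size message ring that overwrites the
 * oldest entry when full, so the most recent messages survive.
 * Single-writer model; the real NMI buffer must also cope with a
 * lockless reader merging messages back into the main log. */
#define NMI_SLOTS  4
#define NMI_MSGLEN 32

struct nmi_ring {
	char msg[NMI_SLOTS][NMI_MSGLEN];
	unsigned int head;	/* total number of messages ever stored */
};

static void nmi_ring_store(struct nmi_ring *r, const char *m)
{
	/* head % NMI_SLOTS is the oldest slot once the ring is full */
	char *slot = r->msg[r->head % NMI_SLOTS];

	strncpy(slot, m, NMI_MSGLEN - 1);
	slot[NMI_MSGLEN - 1] = '\0';
	r->head++;
}

/* Oldest message still present, i.e. where a reader would start. */
static const char *nmi_ring_oldest(const struct nmi_ring *r)
{
	unsigned int first =
		r->head > NMI_SLOTS ? r->head - NMI_SLOTS : 0;

	return r->msg[first % NMI_SLOTS];
}
```

After five stores into a four-slot ring, the first message is gone but the last four survive, which is the behavior wanted for fatal errors.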


Best Regards,
Petr

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-05-29  0:09   ` Frederic Weisbecker
  2014-05-29  8:09     ` Jiri Kosina
  2014-05-30  8:13     ` Jan Kara
@ 2014-06-12 11:50     ` Petr Mládek
  2 siblings, 0 replies; 39+ messages in thread
From: Petr Mládek @ 2014-06-12 11:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jiri Kosina, Andrew Morton, Steven Rostedt, Dave Anderson,
	Paul E. McKenney, Kay Sievers, Michal Hocko, Jan Kara,
	linux-kernel, Linus Torvalds

On Thu 29-05-14 02:09:11, Frederic Weisbecker wrote:
> On Thu, May 29, 2014 at 12:02:30AM +0200, Jiri Kosina wrote:
> > On Fri, 9 May 2014, Petr Mladek wrote:
> > 
> > > printk() cannot be used safely in NMI context because it uses internal locks
> > > and thus could cause a deadlock. Unfortunately there are circumstances when
> > > calling printk from NMI is very useful. For example, all WARN.*(in_nmi())
> > > would be much more helpful if they didn't lockup the machine.
> > > 
> > > Another example would be arch_trigger_all_cpu_backtrace for x86 which uses NMI
> > > to dump traces on all CPU (either triggered by sysrq+l or from RCU stall
> > > detector).
> > 
> > I am rather surprised that this patchset hasn't received a single review 
> > comment for 3 weeks.
> > 
> > Let me point out that the issues Petr is talking about in the cover letter 
> > are real -- we've actually seen the lockups triggered by RCU stall 
> > detector trying to dump stacks on all CPUs, and hard-locking machine up 
> > while doing so.
> > 
> > So this really needs to be solved.
> 
> The lack of review may be partly due to a not very appealing diffstat on an
> old codebase that is already unpopular:
> 
>  Documentation/kernel-parameters.txt |   19 +-
>  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
>  2 files changed, 878 insertions(+), 359 deletions(-)
> 
> 
> Your patches look clean and pretty nice actually. They must be seriously considered if
> we want to keep the current locked ring buffer design and extend it to multiple per-context
> buffers. But I wonder if it's worth continuing this way with printk's ancient design.
> 
> If it takes more than 1000 lines of changes (including 500 added) to make it finally work
> correctly with NMIs by working around its fundamental flaws, shouldn't we rather redesign
> it to use a lockless ring buffer like the ftrace or perf ones?


I like the idea of the lockless buffer, so I have spent some time
understanding kernel/trace/ring_buffer.c, and there are some challenges
that would need to be solved. Some of them might be pretty hard :-(

Here are the hardest missing features from my point of view:

    + storing the last message in all situations
    + reading from more locations in parallel
    + "aggressive" printing to console

See below for more details.

Also note that the current code is already quite complex. There are
many tricks to resolve conflicts in a lockless way, and it might get worse
if we want to solve the above issues.

--------

Below are more details that explain the above statements. But first,
let me show how I understand the lockless ring buffer:

    + there are separate buffers for each CPU
    + writers use a circular list of pages that are linked in both
      directions
    + writers reserve space before they copy the data
    + the reader has an extra page that is not in the main ring and thus
      not accessible by writers
    + the reader swaps its page with another one from the main ring buffer
      when it wants to read some newer data

    + pages might have special flag/function:

	+ reader: the separate page accessed by reader
	+ head: page with the oldest data; the next one to be read
	+ tail: page with the newest data; the next write will try to
	    reserve the space here
	+ commit: the newest page where we have already copied the
	    data; it is usually the same as tail; the difference
	    happens when the write is interrupted and followup pages
	    are reserved and written; we must never move tail over
	    this page, otherwise we would reserve the same location
	    twice, overwrite the data, and create a mess

I hope that I have got the above right. This is why I think that the
missing features are hard to add.
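The reader-page swap described above can be sketched in miniature. All names here are illustrative, not the kernel's, and the real kernel/trace/ring_buffer.c performs the swap with lockless cmpxchg games that this toy deliberately omits:

```c
/* Toy model of the reader-page swap: the reader owns one spare page
 * outside the write ring; to consume data it exchanges that spare
 * with the head page, so writers can never scribble on the page the
 * reader is looking at. */
#define NPAGES 3

struct toy_page {
	int data;			/* stands in for a page of records */
};

struct toy_ring {
	struct toy_page *ring[NPAGES];	/* pages the writers cycle through */
	struct toy_page *reader;	/* spare page owned by the reader */
	int head;			/* index of the oldest unread page */
};

/* Swap the reader's spare page with the head page; returns the page
 * now holding the oldest data, which the reader can scan at leisure. */
static struct toy_page *toy_swap_reader(struct toy_ring *r)
{
	struct toy_page *old_head = r->ring[r->head];

	r->ring[r->head] = r->reader;	/* spare rejoins the write ring */
	r->reader = old_head;		/* reader takes the old head page */
	r->head = (r->head + 1) % NPAGES;
	return r->reader;
}
```

Because there is exactly one spare page and one head pointer, it is easy to see why supporting several independent readers (problem 2 below) does not fall out of this design naturally.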

Here are the details about the above mentioned problems:

1. storing the last message in all situations
---------------------------------------------

  The problem arises if we reserve space in normal context, get
  interrupted, and the interrupt wants to write more data than the size
  of the ring buffer. We must stop rotating when we hit the first
  reserved-but-not-committed page. Here are the code pointers:

  ring_buffer_write()
    rb_reserve_next_event()
      rb_reserve_next()
	rb_move_tail()
	  if (unlikely(next_page == commit_page)) {
		goto out_reset;

  This is a must-have because the data is simply copied when
  the space is reserved; see memcpy(body, data, length) in
  ring_buffer_write().

  I think that we do not want this for printk(). The last messages are
  usually more important, especially in case of a fatal error.

  Possible solutions:

  + use a different ring buffer for each context; it would need even
    more space

  + lock the page for the given context when space is reserved; such a
    locked page would be skipped when the buffer is rotated in a nested
    interrupt context; this would make the algorithm even more
    complicated; I am not sure whether it would work at all

  + ignore this problem; each nested context should make sure that
    it does not use the whole buffer; it might even be realistic; we have
    a separate buffer for each CPU; for example, one backtrace should
    fit into one page; two pages are the minimum...; but this
    is the type of assumption that might hit us in the future


2. reading from more locations in parallel
------------------------------------------

  The printk() ring buffer is asynchronously accessed by different
  readers, e.g. console, syslog, /dev/kmsg. Each one might read from
  a different location.

  Possible solutions:

  + have more reader and head pages; it would make the algorithm
    even more complicated; I am not sure whether it is possible at all

  + have another printk() ring buffer and a single reader that would
    copy the messages there; it is ugly; also, I am not sure how much
    of the current printk() code we would save


3. "aggressive" printing to console
-----------------------------------

  printk() currently triggers the console immediately when new data
  appears in the ring buffer.

  This might cause the reader page to be swapped even when there is only
  one entry. The result might be that each page would include only
  one line; a few lines might then occupy a large ring buffer.

  Possible solution:

  Do some lazy reading, but how?


Best Regards,
Petr

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-10 16:57         ` Linus Torvalds
  2014-06-10 17:32           ` Jiri Kosina
@ 2014-06-18 11:03           ` Jiri Kosina
  2014-06-18 14:36             ` Paul E. McKenney
  1 sibling, 1 reply; 39+ messages in thread
From: Jiri Kosina @ 2014-06-18 11:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Frederic Weisbecker, Petr Mladek, Andrew Morton, Steven Rostedt,
	Dave Anderson, Paul E. McKenney, Kay Sievers, Michal Hocko,
	Jan Kara, Linux Kernel Mailing List

On Tue, 10 Jun 2014, Linus Torvalds wrote:

> > Let's be crazy and Cc Linus on that.
> 
> Quite frankly, I hate seeing something like this:
> 
>  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
> 
> for something that is stupid and broken. Printing from NMI context
> isn't really supposed to work, and we all *know* it's not supposed to
> work.
> 
> I'd much rather disallow it, and if there is one or two places that
> really want to print a warning and know that they are in NMI context,
> have a special workaround just for them, with something that does
> *not* try to make printk in general work any better.
> 
> Dammit, NMI context is special. I absolutely refuse to buy into the
> broken concept that we should make more stuff work in NMI context.
> Hell no, we should *not* try to make more crap work in NMI. NMI people
> should be careful.
> 
> Make a trivial "printk_nmi()" wrapper that tries to do a trylock on
> logbuf_lock, and *maybe* the existing sequence of
> 
>         if (console_trylock_for_printk())
>                 console_unlock();
> 
> then works for actually triggering the printout. But the wrapper
> should be 15 lines of code for "if possible, try to print things", and
> *not* a thousand lines of changes.

Alright, so this went silent again without any real progress. Is everyone 
hoping this gets sorted out on kernel summit, or ... ?

Let me sum up the current situation:

- both RCU stall detector and 'echo l > sysrq-trigger' can (and we've 
  seen it happening for real) cause a complete, undebuggable, silent hang 
  of machine (deadlock in NMI context)

- before 7ff9554bb578 and friends, this was trivial to fix more or less 
  exactly the way Linus is proposing above. We've been carrying the 
  fix in our kernels for a while already [1]. With printk() having become 
  overly complicated recently, the "in principle trivial" fix turns 
  into a crazy mess due to the handling of all the indexes, sequence 
  numbers, etc.

- printk() from NMI is actually useful in rare cases (such as inter-CPU 
  stack dumping)

- using lockless buffers that trace_printk() is using has its own 
  problems, as described by Petr elsewhere in this thread


I find it rather outrageous that fixing *real bugs* (leading to hangs) 
becomes impossible due to printk() being too complex. It's very 
unfortunate that the same level of pushback didn't happen when the new 
features (that actually *made* it so complicated) were being pushed; that 
would have been much more valuable and appropriate.

I believe Jan Kara is in the same situation with his softlockup fixes for 
printk. Those are real problems, as they are bringing machines down, and 
after two years they are still not fixed, because "printk() code is scary 
enough as-is".

[1] http://kernel.suse.com/cgit/kernel/commit/?h=SLE11-SP3&id=8d62ae68ff61d77ae3c4899f05dbd9c9742b14c9

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 11:03           ` Jiri Kosina
@ 2014-06-18 14:36             ` Paul E. McKenney
  2014-06-18 14:41               ` Jiri Kosina
       [not found]               ` <CA+55aFwPgDC6gSEPfu3i-pA4f0ZbsTSvykxzX4sXMeLbdXuKrw@mail.gmail.com>
  0 siblings, 2 replies; 39+ messages in thread
From: Paul E. McKenney @ 2014-06-18 14:36 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Frederic Weisbecker, Petr Mladek, Andrew Morton,
	Steven Rostedt, Dave Anderson, Kay Sievers, Michal Hocko,
	Jan Kara, Linux Kernel Mailing List

On Wed, Jun 18, 2014 at 01:03:05PM +0200, Jiri Kosina wrote:
> On Tue, 10 Jun 2014, Linus Torvalds wrote:
> 
> > > Let's be crazy and Cc Linus on that.
> > 
> > Quite frankly, I hate seeing something like this:
> > 
> >  kernel/printk/printk.c              | 1218 +++++++++++++++++++++++++----------
> > 
> > for something that is stupid and broken. Printing from NMI context
> > isn't really supposed to work, and we all *know* it's not supposed to
> > work.
> > 
> > I'd much rather disallow it, and if there is one or two places that
> > really want to print a warning and know that they are in NMI context,
> > have a special workaround just for them, with something that does
> > *not* try to make printk in general work any better.
> > 
> > Dammit, NMI context is special. I absolutely refuse to buy into the
> > broken concept that we should make more stuff work in NMI context.
> > Hell no, we should *not* try to make more crap work in NMI. NMI people
> > should be careful.
> > 
> > Make a trivial "printk_nmi()" wrapper that tries to do a trylock on
> > logbuf_lock, and *maybe* the existing sequence of
> > 
> >         if (console_trylock_for_printk())
> >                 console_unlock();
> > 
> > then works for actually triggering the printout. But the wrapper
> > should be 15 lines of code for "if possible, try to print things", and
> > *not* a thousand lines of changes.
> 
> Alright, so this went silent again without any real progress. Is everyone 
> hoping this gets sorted out on kernel summit, or ... ?
> 
> Let me sum up the current situation:
> 
> - both RCU stall detector and 'echo l > sysrq-trigger' can (and we've 
>   seen it happening for real) cause a complete, undebuggable, silent hang 
>   of machine (deadlock in NMI context)

I could easily add an option to RCU to allow people to tell it not to
use NMIs to dump the stack.  Would that help?

							Thanx, Paul

> - before 7ff9554bb578 and friends, this was trivial to fix more or less 
>   exactly the way Linus is proposing above. We've been carrying the 
>   fix in our kernels for a while already [1]. With printk() having become 
>   overly complicated recently, the "in principle trivial" fix turns 
>   into a crazy mess due to the handling of all the indexes, sequence 
>   numbers, etc.
> 
> - printk() from NMI is actually useful in rare cases (such as inter-CPU 
>   stack dumping)
> 
> - using lockless buffers that trace_printk() is using has its own 
>   problems, as described by Petr elsewhere in this thread
> 
> 
> I find it rather outrageous that fixing *real bugs* (leading to hangs) 
> becomes impossible due to printk() being too complex. It's very 
> unfortunate that the same level of pushback didn't happen when the new 
> features (that actually *made* it so complicated) were being pushed; that 
> would have been much more valuable and appropriate.
> 
> I believe Jan Kara is in the same situation with his softlockup fixes for 
> printk. Those are real problems, as they are bringing machines down, and 
> after two years they are still not fixed, because "printk() code is scary 
> enough as-is".
> 
> [1] http://kernel.suse.com/cgit/kernel/commit/?h=SLE11-SP3&id=8d62ae68ff61d77ae3c4899f05dbd9c9742b14c9
> 
> -- 
> Jiri Kosina
> SUSE Labs
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 14:36             ` Paul E. McKenney
@ 2014-06-18 14:41               ` Jiri Kosina
  2014-06-18 14:44                 ` Paul E. McKenney
       [not found]               ` <CA+55aFwPgDC6gSEPfu3i-pA4f0ZbsTSvykxzX4sXMeLbdXuKrw@mail.gmail.com>
  1 sibling, 1 reply; 39+ messages in thread
From: Jiri Kosina @ 2014-06-18 14:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Frederic Weisbecker, Petr Mladek, Andrew Morton,
	Steven Rostedt, Dave Anderson, Kay Sievers, Michal Hocko,
	Jan Kara, Linux Kernel Mailing List

On Wed, 18 Jun 2014, Paul E. McKenney wrote:

> > - both RCU stall detector and 'echo l > sysrq-trigger' can (and we've 
> >   seen it happening for real) cause a complete, undebuggable, silent hang 
> >   of machine (deadlock in NMI context)
> 
> I could easily add an option to RCU to allow people to tell it not to
> use NMIs to dump the stack.  Would that help?

Well, that would unfortunately make the information provided by the RCU 
stall detector rather useless ... workqueue-based stack dumping is very 
unlikely to point its finger at the real offender, as it'd be coming way 
too late.

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 14:41               ` Jiri Kosina
@ 2014-06-18 14:44                 ` Paul E. McKenney
  2014-06-18 14:53                   ` Jiri Kosina
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2014-06-18 14:44 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Frederic Weisbecker, Petr Mladek, Andrew Morton,
	Steven Rostedt, Dave Anderson, Kay Sievers, Michal Hocko,
	Jan Kara, Linux Kernel Mailing List

On Wed, Jun 18, 2014 at 04:41:09PM +0200, Jiri Kosina wrote:
> On Wed, 18 Jun 2014, Paul E. McKenney wrote:
> 
> > > - both RCU stall detector and 'echo l > sysrq-trigger' can (and we've 
> > >   seen it happening for real) cause a complete, undebuggable, silent hang 
> > >   of machine (deadlock in NMI context)
> > 
> > I could easily add an option to RCU to allow people to tell it not to
> > use NMIs to dump the stack.  Would that help?
> 
> Well, that would unfortunately make the information provided by the RCU 
> stall detector rather useless ... workqueue-based stack dumping is very 
> unlikely to point its finger at the real offender, as it'd be coming way 
> too late.

I would not use workqueues, but rather have the CPU detecting the
stall grovel through the other CPUs' stacks, which is what I do now for
architectures that don't support NMI-based stack dumps.  Would that be
a reasonable approach?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 14:44                 ` Paul E. McKenney
@ 2014-06-18 14:53                   ` Jiri Kosina
  2014-06-18 15:07                     ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Jiri Kosina @ 2014-06-18 14:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Frederic Weisbecker, Petr Mladek, Andrew Morton,
	Steven Rostedt, Dave Anderson, Kay Sievers, Michal Hocko,
	Jan Kara, Linux Kernel Mailing List

On Wed, 18 Jun 2014, Paul E. McKenney wrote:

> > > > - both RCU stall detector and 'echo l > sysrq-trigger' can (and we've 
> > > >   seen it happening for real) cause a complete, undebuggable, silent hang 
> > > >   of machine (deadlock in NMI context)
> > > 
> > > I could easily add an option to RCU to allow people to tell it not to
> > > use NMIs to dump the stack.  Would that help?
> > 
> > Well, that would unfortunately make the information provided by the RCU 
> > stall detector rather useless ... workqueue-based stack dumping is very 
> > unlikely to point its finger at the real offender, as it'd be coming way 
> > too late.
> 
> I would not use workqueues, but rather have the CPU detecting the
> stall grovel through the other CPUs' stacks, which is what I do now for
> architectures that don't support NMI-based stack dumps.  Would that be
> a reasonable approach?

That would indeed solve lockups induced by RCU stall detector (and we 
should convert sysrq stack dumping code to use the same mechanism 
afterwards).

But then, the kernel is still polluted by quite a few instances of

	WARN_ON(in_nmi())

	BUG_ON(in_nmi())

	if (in_nmi())
		printk(....)

which need to be fixed separately afterwards anyway.

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 14:53                   ` Jiri Kosina
@ 2014-06-18 15:07                     ` Paul E. McKenney
  0 siblings, 0 replies; 39+ messages in thread
From: Paul E. McKenney @ 2014-06-18 15:07 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Frederic Weisbecker, Petr Mladek, Andrew Morton,
	Steven Rostedt, Dave Anderson, Kay Sievers, Michal Hocko,
	Jan Kara, Linux Kernel Mailing List

On Wed, Jun 18, 2014 at 04:53:14PM +0200, Jiri Kosina wrote:
> On Wed, 18 Jun 2014, Paul E. McKenney wrote:
> 
> > > > > - both RCU stall detector and 'echo l > sysrq-trigger' can (and we've 
> > > > >   seen it happening for real) cause a complete, undebuggable, silent hang 
> > > > >   of machine (deadlock in NMI context)
> > > > 
> > > > I could easily add an option to RCU to allow people to tell it not to
> > > > use NMIs to dump the stack.  Would that help?
> > > 
> > > Well, that would unfortunately make the information provided by RCU stall 
> > > detector rather useless ... workqueue-based stack dumping is very unlikely 
> > > to point its finger to the real offender, as it'd be coming way too late.
> > 
> > I would not use workqueues, but rather have the CPU detecting the
> > stall grovel through the other CPUs' stacks, which is what I do now for
> > architectures that don't support NMI-based stack dumps.  Would that be
> > a reasonable approach?
> 
> That would indeed solve lockups induced by RCU stall detector (and we 
> should convert sysrq stack dumping code to use the same mechanism 
> afterwards).
> 
> But then, the kernel is still polluted by quite a few instances of
> 
> 	WARN_ON(in_nmi())
> 
> 	BUG_ON(in_nmi())
> 
> 	if (in_nmi())
> 		printk(....)
> 
> which need to be fixed separately afterwards anyway.

True enough!

							Thanx, Paul



* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
       [not found]               ` <CA+55aFwPgDC6gSEPfu3i-pA4f0ZbsTSvykxzX4sXMeLbdXuKrw@mail.gmail.com>
@ 2014-06-18 16:21                 ` Paul E. McKenney
  2014-06-18 16:38                   ` Steven Rostedt
  2014-06-18 20:36                   ` Jiri Kosina
  0 siblings, 2 replies; 39+ messages in thread
From: Paul E. McKenney @ 2014-06-18 16:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Michal Hocko, Jan Kara,
	Frederic Weisbecker, Steven Rostedt, Dave Anderson, Jiri Kosina,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, Jun 18, 2014 at 05:58:40AM -1000, Linus Torvalds wrote:
> On Jun 18, 2014 4:36 AM, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> wrote:
> >
> > I could easily add an option to RCU to allow people to tell it not to
> > use NMIs to dump the stack.
> 
> I don't think it should be an "option".
> 
> We should stop using nmi as if it was something "normal". It isn't. Code
> running in nmi context should be special, and should be very very aware
> that it is special. That goes way beyond "don't use printk". We seem to
> have gone way way too far in using nmi context.
> 
> So we should get *rid* of code in nmi context rather than complain
> about printk being buggy.

OK, unconditional non-use of NMIs is even easier.  ;-)

Something like the following.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Don't use NMIs to dump other CPUs' stacks

Although NMI-based stack dumps are in principle more accurate, they are
also more likely to trigger deadlocks.  This commit therefore replaces
all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so
that the CPU detecting an RCU CPU stall does the stack dumping.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c590e1201c74..777624e1329b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -932,10 +932,7 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
 }
 
 /*
- * Dump stacks of all tasks running on stalled CPUs.  This is a fallback
- * for architectures that do not implement trigger_all_cpu_backtrace().
- * The NMI-triggered stack traces are more accurate because they are
- * printed by the target CPU.
+ * Dump stacks of all tasks running on stalled CPUs.
  */
 static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
 {
@@ -1013,7 +1010,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	       (long)rsp->gpnum, (long)rsp->completed, totqlen);
 	if (ndetected == 0)
 		pr_err("INFO: Stall ended before state dump start\n");
-	else if (!trigger_all_cpu_backtrace())
+	else
 		rcu_dump_cpu_stacks(rsp);
 
 	/* Complain about tasks blocking the grace period. */
@@ -1044,8 +1041,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
 	pr_cont(" (t=%lu jiffies g=%ld c=%ld q=%lu)\n",
 		jiffies - rsp->gp_start,
 		(long)rsp->gpnum, (long)rsp->completed, totqlen);
-	if (!trigger_all_cpu_backtrace())
-		dump_stack();
+	rcu_dump_cpu_stacks(rsp);
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
 	if (ULONG_CMP_GE(jiffies, ACCESS_ONCE(rsp->jiffies_stall)))



* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 16:21                 ` Paul E. McKenney
@ 2014-06-18 16:38                   ` Steven Rostedt
  2014-06-18 16:43                     ` Paul E. McKenney
  2014-06-18 20:36                   ` Jiri Kosina
  1 sibling, 1 reply; 39+ messages in thread
From: Steven Rostedt @ 2014-06-18 16:38 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Linux Kernel Mailing List, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Dave Anderson, Jiri Kosina,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, 18 Jun 2014 09:21:17 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Wed, Jun 18, 2014 at 05:58:40AM -1000, Linus Torvalds wrote:
> > On Jun 18, 2014 4:36 AM, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > wrote:
> > >
> > > I could easily add an option to RCU to allow people to tell it not to
> > > use NMIs to dump the stack.
> > 
> > I don't think it should be an "option".
> > 
> > We should stop using nmi as if it was something "normal". It isn't. Code
> > running in nmi context should be special, and should be very very aware
> > that it is special. That goes way beyond "don't use printk". We seem to
> > have gone way way too far in using nmi context.
> > 
> > So we should get *rid* of code in nmi context rather than complain
> > about printk being buggy.
> 
> OK, unconditional non-use of NMIs is even easier.  ;-)
> 
> Something like the following.
> 

I have found the RCU stalls extremely useful in debugging lockups. In
case this doesn't work as well, I'm willing to write up something that
could send NMIs to all CPUs that would write into the ftrace ring
buffer and when finished, the calling CPU can dump it out. No printk
from NMI context at all.

-- Steve


* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 16:38                   ` Steven Rostedt
@ 2014-06-18 16:43                     ` Paul E. McKenney
  0 siblings, 0 replies; 39+ messages in thread
From: Paul E. McKenney @ 2014-06-18 16:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Linux Kernel Mailing List, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Dave Anderson, Jiri Kosina,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, Jun 18, 2014 at 12:38:37PM -0400, Steven Rostedt wrote:
> On Wed, 18 Jun 2014 09:21:17 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Wed, Jun 18, 2014 at 05:58:40AM -1000, Linus Torvalds wrote:
> > > On Jun 18, 2014 4:36 AM, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > wrote:
> > > >
> > > > I could easily add an option to RCU to allow people to tell it not to
> > > > use NMIs to dump the stack.
> > > 
> > > I don't think it should be an "option".
> > > 
> > > We should stop using nmi as if it was something "normal". It isn't. Code
> > > running in nmi context should be special, and should be very very aware
> > > that it is special. That goes way beyond "don't use printk". We seem to
> > > have gone way way too far in using nmi context.
> > > 
> > > So we should get *rid* of code in nmi context rather than complain
> > > about printk being buggy.
> > 
> > OK, unconditional non-use of NMIs is even easier.  ;-)
> > 
> > Something like the following.
> 
> I have found the RCU stalls extremely useful in debugging lockups. In
> case this doesn't work as well, I'm willing to write up something that
> could send NMIs to all CPUs that would write into the ftrace ring
> buffer and when finished, the calling CPU can dump it out. No printk
> from NMI context at all.

Sounds like a good plan to me!

							Thanx, Paul



* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 16:21                 ` Paul E. McKenney
  2014-06-18 16:38                   ` Steven Rostedt
@ 2014-06-18 20:36                   ` Jiri Kosina
  2014-06-18 21:07                     ` Paul E. McKenney
  1 sibling, 1 reply; 39+ messages in thread
From: Jiri Kosina @ 2014-06-18 20:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Linux Kernel Mailing List, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, 18 Jun 2014, Paul E. McKenney wrote:

> OK, unconditional non-use of NMIs is even easier.  ;-)
> 
> Something like the following.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> rcu: Don't use NMIs to dump other CPUs' stacks
> 
> Although NMI-based stack dumps are in principle more accurate, they are
> also more likely to trigger deadlocks.  This commit therefore replaces
> all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so
> that the CPU detecting an RCU CPU stall does the stack dumping.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c590e1201c74..777624e1329b 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -932,10 +932,7 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
>  }
>  
>  /*
> - * Dump stacks of all tasks running on stalled CPUs.  This is a fallback
> - * for architectures that do not implement trigger_all_cpu_backtrace().
> - * The NMI-triggered stack traces are more accurate because they are
> - * printed by the target CPU.
> + * Dump stacks of all tasks running on stalled CPUs.
>   */
>  static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
>  {
> @@ -1013,7 +1010,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
>  	       (long)rsp->gpnum, (long)rsp->completed, totqlen);
>  	if (ndetected == 0)
>  		pr_err("INFO: Stall ended before state dump start\n");
> -	else if (!trigger_all_cpu_backtrace())
> +	else
>  		rcu_dump_cpu_stacks(rsp);
>  
>  	/* Complain about tasks blocking the grace period. */
> @@ -1044,8 +1041,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
>  	pr_cont(" (t=%lu jiffies g=%ld c=%ld q=%lu)\n",
>  		jiffies - rsp->gp_start,
>  		(long)rsp->gpnum, (long)rsp->completed, totqlen);
> -	if (!trigger_all_cpu_backtrace())
> -		dump_stack();
> +	rcu_dump_cpu_stacks(rsp);

This is prone to producing not really consistent stacktraces though, 
right? As the target task is still running at the time the stack is being 
walked, it might produce stacktraces that are potentially nonsensical.

How about sending NMI to the target CPU, so that the task is actually 
stopped, but printing its stacktrace from the CPU that detected the stall 
while it's stopped?

That way, there is no printk()-from-NMI, but also the stacktrace is 
guaranteed to be self-consistent.

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 20:36                   ` Jiri Kosina
@ 2014-06-18 21:07                     ` Paul E. McKenney
  2014-06-18 21:12                       ` Jiri Kosina
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2014-06-18 21:07 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Linux Kernel Mailing List, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, Jun 18, 2014 at 10:36:10PM +0200, Jiri Kosina wrote:
> On Wed, 18 Jun 2014, Paul E. McKenney wrote:
> 
> > OK, unconditional non-use of NMIs is even easier.  ;-)
> > 
> > Something like the following.
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > rcu: Don't use NMIs to dump other CPUs' stacks
> > 
> > Although NMI-based stack dumps are in principle more accurate, they are
> > also more likely to trigger deadlocks.  This commit therefore replaces
> > all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so
> > that the CPU detecting an RCU CPU stall does the stack dumping.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index c590e1201c74..777624e1329b 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -932,10 +932,7 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
> >  }
> >  
> >  /*
> > - * Dump stacks of all tasks running on stalled CPUs.  This is a fallback
> > - * for architectures that do not implement trigger_all_cpu_backtrace().
> > - * The NMI-triggered stack traces are more accurate because they are
> > - * printed by the target CPU.
> > + * Dump stacks of all tasks running on stalled CPUs.
> >   */
> >  static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
> >  {
> > @@ -1013,7 +1010,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
> >  	       (long)rsp->gpnum, (long)rsp->completed, totqlen);
> >  	if (ndetected == 0)
> >  		pr_err("INFO: Stall ended before state dump start\n");
> > -	else if (!trigger_all_cpu_backtrace())
> > +	else
> >  		rcu_dump_cpu_stacks(rsp);
> >  
> >  	/* Complain about tasks blocking the grace period. */
> > @@ -1044,8 +1041,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
> >  	pr_cont(" (t=%lu jiffies g=%ld c=%ld q=%lu)\n",
> >  		jiffies - rsp->gp_start,
> >  		(long)rsp->gpnum, (long)rsp->completed, totqlen);
> > -	if (!trigger_all_cpu_backtrace())
> > -		dump_stack();
> > +	rcu_dump_cpu_stacks(rsp);
> 
> This is prone to producing not really consistent stacktraces though, 
> right? As the target task is still running at the time the stack is being 
> walked, it might produce stacktraces that are potentially nonsensical.

If a CPU is stuck, the stack trace down to where it is stuck is
likely to be static.  But yes, there is some potential for confusion.
My (admittedly limited) rcutorture testing produced sensible stack traces,
but things might be a bit uglier in other situations.

> How about sending NMI to the target CPU, so that the task is actually 
> stopped, but printing its stacktrace from the CPU that detected the stall 
> while it's stopped?
> 
> That way, there is no printk()-from-NMI, but also the stacktrace is 
> guaranteed to be self-consistent.

I believe that this was what Steven was suggesting, though by using
tracing.  Of course, if my current approach isn't up to the job,
then something like this general approach would look quite good.

							Thanx, Paul



* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 21:07                     ` Paul E. McKenney
@ 2014-06-18 21:12                       ` Jiri Kosina
  2014-06-18 21:20                         ` Paul E. McKenney
  2014-06-18 23:20                         ` Steven Rostedt
  0 siblings, 2 replies; 39+ messages in thread
From: Jiri Kosina @ 2014-06-18 21:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Linux Kernel Mailing List, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, 18 Jun 2014, Paul E. McKenney wrote:

> > >  	/* Complain about tasks blocking the grace period. */
> > > @@ -1044,8 +1041,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
> > >  	pr_cont(" (t=%lu jiffies g=%ld c=%ld q=%lu)\n",
> > >  		jiffies - rsp->gp_start,
> > >  		(long)rsp->gpnum, (long)rsp->completed, totqlen);
> > > -	if (!trigger_all_cpu_backtrace())
> > > -		dump_stack();
> > > +	rcu_dump_cpu_stacks(rsp);
> > 
> > This is prone to producing not really consistent stacktraces though, 
> > right? As the target task is still running at the time the stack is being 
> > walked, it might produce stacktraces that are potentially nonsensical.
> 
> If a CPU is stuck, the stack trace down to where it is stuck is
> likely to be static.  But yes, there is some potential for confusion.
> My (admittedly limited) rcutorture testing produced sensible stack traces,
> but things might be a bit uglier in other situations.

I agree that it might work nicely for RCU stall detector indeed. I was 
looking for a solution that'd work nicely both for RCU and for sysrq-l 
(where we can't rely on processes being stuck in any way).

> > How about sending NMI to the target CPU, so that the task is actually 
> > stopped, but printing its stacktrace from the CPU that detected the stall 
> > while it's stopped?
> > 
> > That way, there is no printk()-from-NMI, but also the stacktrace is 
> > guaranteed to be self-consistent.
> 
> I believe that this was what Steven was suggesting, though by using
> tracing.  

My understanding was that Steven is suggesting using trace_printk() from 
NMI.

> Of course, if my current approach isn't up to the job, then something 
> like this general approach would look quite good.

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 21:12                       ` Jiri Kosina
@ 2014-06-18 21:20                         ` Paul E. McKenney
  2014-06-18 21:32                           ` Jiri Kosina
  2014-06-18 23:20                         ` Steven Rostedt
  1 sibling, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2014-06-18 21:20 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Linux Kernel Mailing List, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, Jun 18, 2014 at 11:12:48PM +0200, Jiri Kosina wrote:
> On Wed, 18 Jun 2014, Paul E. McKenney wrote:
> 
> > > >  	/* Complain about tasks blocking the grace period. */
> > > > @@ -1044,8 +1041,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
> > > >  	pr_cont(" (t=%lu jiffies g=%ld c=%ld q=%lu)\n",
> > > >  		jiffies - rsp->gp_start,
> > > >  		(long)rsp->gpnum, (long)rsp->completed, totqlen);
> > > > -	if (!trigger_all_cpu_backtrace())
> > > > -		dump_stack();
> > > > +	rcu_dump_cpu_stacks(rsp);
> > > 
> > > This is prone to producing not really consistent stacktraces though, 
> > > right? As the target task is still running at the time the stack is being 
> > > walked, it might produce stacktraces that are potentially nonsensical.
> > 
> > If a CPU is stuck, the stack trace down to where it is stuck is
> > likely to be static.  But yes, there is some potential for confusion.
> > My (admittedly limited) rcutorture testing produced sensible stack traces,
> > but things might be a bit uglier in other situations.
> 
> I agree that it might work nicely for RCU stall detector indeed. I was 
> looking for a solution that'd work nicely both for RCU and for sysrq-l 
> (where we can't rely on processes being stuck in any way).

Agreed.  And if some more generally useful approach appears, I will be
quite happy to adjust RCU to use it.  In the meantime, I expect that
my patch will be helpful.

							Thanx, Paul

> > > How about sending NMI to the target CPU, so that the task is actually 
> > > stopped, but printing its stacktrace from the CPU that detected the stall 
> > > while it's stopped?
> > > 
> > > That way, there is no printk()-from-NMI, but also the stacktrace is 
> > > guaranteed to be self-consistent.
> > 
> > I believe that this was what Steven was suggesting, though by using
> > tracing.  
> 
> My understanding was that Steven is suggesting using trace_printk() from 
> NMI.
> 
> > Of course, if my current approach isn't up to the job, then something 
> > like this general approach would look quite good.
> 
> Thanks,
> 
> -- 
> Jiri Kosina
> SUSE Labs
> 



* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 21:20                         ` Paul E. McKenney
@ 2014-06-18 21:32                           ` Jiri Kosina
  2014-06-18 21:37                             ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Jiri Kosina @ 2014-06-18 21:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Linux Kernel Mailing List, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, 18 Jun 2014, Paul E. McKenney wrote:

> > I agree that it might work nicely for RCU stall detector indeed. I was 
> > looking for a solution that'd work nicely both for RCU and for sysrq-l 
> > (where we can't rely on processes being stuck in any way).
> 
> Agreed.  And if some more generally useful approach appears, I will be
> quite happy to adjust RCU to use it.  In the meantime, I expect that
> my patch will be helpful.

Agreed. And we'll look into fixing sysrq-l in parallel I guess; once there 
is a working solution (hangs with sysrq-l can be trivially reproduced 
almost immediately), we can then migrate RCU to it.

Still, I feel bad about the fact that we are now hostages of our printk() 
implementation, which doesn't allow for any fixes/improvements. Having the 
possibility to printk() from NMI would be nice and more robust ... 
otherwise, we'll be getting people trying to do it in the future over and 
over again, even if we now get rid of it at once.

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 21:32                           ` Jiri Kosina
@ 2014-06-18 21:37                             ` Paul E. McKenney
  0 siblings, 0 replies; 39+ messages in thread
From: Paul E. McKenney @ 2014-06-18 21:37 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Linus Torvalds, Linux Kernel Mailing List, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Steven Rostedt, Dave Anderson,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, Jun 18, 2014 at 11:32:53PM +0200, Jiri Kosina wrote:
> On Wed, 18 Jun 2014, Paul E. McKenney wrote:
> 
> > > I agree that it might work nicely for RCU stall detector indeed. I was 
> > > looking for a solution that'd work nicely both for RCU and for sysrq-l 
> > > (where we can't rely on processes being stuck in any way).
> > 
> > Agreed.  And if some more generally useful approach appears, I will be
> > quite happy to adjust RCU to use it.  In the meantime, I expect that
> > my patch will be helpful.
> 
> Agreed. And we'll look into fixing sysrq-l in parallel I guess; once there 
> is a working solution (hangs with sysrq-l can be trivially reproduced 
> almost immediately), we can then migrate RCU to it.
> 
> Still, I feel bad about the fact that we are now hostages of our printk() 
> implementation, which doesn't allow for any fixes/improvements. Having the 
> possibility to printk() from NMI would be nice and more robust ... 
> otherwise, we'll be getting people trying to do it in the future over and 
> over again, even if we now get rid of it at once.

Well, we could always have printk() splat if invoked while in_nmi().

Oh, wait...  ;-)

More seriously, an in_nmi() printk() could taint the kernel, set a
flag that results in a deferred splat, do a trace_printk(), or any
number of things to let the developer know that this was a bad idea.

							Thanx, Paul



* Re: [RFC PATCH 00/11] printk: safe printing in NMI context
  2014-06-18 21:12                       ` Jiri Kosina
  2014-06-18 21:20                         ` Paul E. McKenney
@ 2014-06-18 23:20                         ` Steven Rostedt
  1 sibling, 0 replies; 39+ messages in thread
From: Steven Rostedt @ 2014-06-18 23:20 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Paul E. McKenney, Linus Torvalds, Linux Kernel Mailing List,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Andrew Morton, Petr Mladek, Kay Sievers

On Wed, 18 Jun 2014 23:12:48 +0200 (CEST)
Jiri Kosina <jkosina@suse.cz> wrote:


> > I believe that this was what Steven was suggesting, though by using
> > tracing.  
> 
> My understanding was that Steven is suggesting using trace_printk() from 
> NMI.

Not quite. I was suggesting using the ftrace ring buffer. It could have
its own way to write the stack and not depend on tracing itself.

-- Steve


end of thread, other threads:[~2014-06-18 23:20 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-09  9:10 [RFC PATCH 00/11] printk: safe printing in NMI context Petr Mladek
2014-05-09  9:10 ` [RFC PATCH 01/11] printk: rename struct printk_log to printk_msg Petr Mladek
2014-05-09  9:10 ` [RFC PATCH 02/11] printk: allow to handle more log buffers Petr Mladek
2014-05-09  9:10 ` [RFC PATCH 03/11] printk: rename "logbuf_lock" to "main_logbuf_lock" Petr Mladek
2014-05-09  9:10 ` [RFC PATCH 04/11] printk: add NMI ring and cont buffers Petr Mladek
2014-05-09  9:10 ` [RFC PATCH 05/11] printk: allow to modify NMI log buffer size using boot parameter Petr Mladek
2014-05-09  9:11 ` [RFC PATCH 06/11] printk: NMI safe printk Petr Mladek
2014-05-09  9:11 ` [RFC PATCH 07/11] printk: right ordering of the cont buffers from NMI context Petr Mladek
2014-05-09  9:11 ` [RFC PATCH 08/11] printk: try hard to print Oops message in " Petr Mladek
2014-05-09  9:11 ` [RFC PATCH 09/11] printk: merge and flush NMI buffer predictably via IRQ work Petr Mladek
2014-05-09  9:11 ` [RFC PATCH 10/11] printk: survive rotation of sequence numbers Petr Mladek
2014-05-09  9:11 ` [RFC PATCH 11/11] printk: avoid staling when merging NMI log buffer Petr Mladek
2014-05-28 22:02 ` [RFC PATCH 00/11] printk: safe printing in NMI context Jiri Kosina
2014-05-29  0:09   ` Frederic Weisbecker
2014-05-29  8:09     ` Jiri Kosina
2014-06-10 16:46       ` Frederic Weisbecker
2014-06-10 16:57         ` Linus Torvalds
2014-06-10 17:32           ` Jiri Kosina
2014-06-11  9:01             ` Petr Mládek
2014-06-18 11:03           ` Jiri Kosina
2014-06-18 14:36             ` Paul E. McKenney
2014-06-18 14:41               ` Jiri Kosina
2014-06-18 14:44                 ` Paul E. McKenney
2014-06-18 14:53                   ` Jiri Kosina
2014-06-18 15:07                     ` Paul E. McKenney
     [not found]               ` <CA+55aFwPgDC6gSEPfu3i-pA4f0ZbsTSvykxzX4sXMeLbdXuKrw@mail.gmail.com>
2014-06-18 16:21                 ` Paul E. McKenney
2014-06-18 16:38                   ` Steven Rostedt
2014-06-18 16:43                     ` Paul E. McKenney
2014-06-18 20:36                   ` Jiri Kosina
2014-06-18 21:07                     ` Paul E. McKenney
2014-06-18 21:12                       ` Jiri Kosina
2014-06-18 21:20                         ` Paul E. McKenney
2014-06-18 21:32                           ` Jiri Kosina
2014-06-18 21:37                             ` Paul E. McKenney
2014-06-18 23:20                         ` Steven Rostedt
2014-05-30  8:13     ` Jan Kara
2014-05-30 10:10       ` Jiri Kosina
2014-06-10 16:49       ` Frederic Weisbecker
2014-06-12 11:50     ` Petr Mládek
