All of lore.kernel.org
* [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
@ 2014-06-19 21:33 Steven Rostedt
  2014-06-19 21:33 ` [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/ Steven Rostedt
                   ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Steven Rostedt @ 2014-06-19 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

This is my proposal to print the NMI stack traces from an RCU stall safely.
Here's the gist of it.

Patch 1: move the trace_seq code out of the tracing code. It's useful for
 other purposes too, like writing from an NMI context.

Patch 2: Add a per_cpu "printk_func" that printk() calls. By default it calls
 vprintk_default(), which does what printk() has always done. This lets us
 override what printk() does on a per-CPU basis.

Patch 3: Have the NMI handler that dumps the stack trace just change the
 printk_func to call an NMI-safe printk function that writes to a per-CPU
 trace_seq. Once all NMI handlers have chimed in, the original caller prints
 out the trace_seqs for each CPU from a printk-safe context.

This is much less intrusive than the other versions out there.

Note: the biggest change is moving the trace_seq code out of trace_output.c
and into the new lib/trace_seq.c library.

Thoughts?


Steven Rostedt (Red Hat) (3):
      trace_seq: Move the trace_seq code to lib/
      printk: Add per_cpu printk func to allow printk to be diverted
      x86/nmi: Perform a safe NMI stack trace on all CPUs

---
 arch/x86/kernel/apic/hw_nmi.c |  66 ++++++++-
 include/linux/percpu.h        |   3 +
 include/linux/printk.h        |   2 +
 include/linux/trace_seq.h     |  68 ++--------
 kernel/printk/printk.c        |  38 ++++--
 kernel/trace/trace.c          |  24 ----
 kernel/trace/trace_output.c   | 268 -------------------------------------
 kernel/trace/trace_output.h   |  16 ---
 lib/Makefile                  |   2 +-
 lib/trace_seq.c               | 303 ++++++++++++++++++++++++++++++++++++++++++
 10 files changed, 412 insertions(+), 378 deletions(-)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-19 21:33 [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely Steven Rostedt
@ 2014-06-19 21:33 ` Steven Rostedt
  2014-06-20  4:45   ` Linus Torvalds
  2014-06-20  5:06   ` Andrew Morton
  2014-06-19 21:33 ` [RFC][PATCH 2/3] printk: Add per_cpu printk func to allow printk to be diverted Steven Rostedt
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 43+ messages in thread
From: Steven Rostedt @ 2014-06-19 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

[-- Attachment #1: 0001-trace_seq-Move-the-trace_seq-code-to-lib.patch --]
[-- Type: text/plain, Size: 17629 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

The trace_seq functions are rather useful outside of tracing. Instead
of having them depend on CONFIG_TRACING, move the code into lib/ and
allow other users to access it even when tracing is not configured.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/trace_seq.h   |  68 ++--------
 kernel/trace/trace.c        |  24 ----
 kernel/trace/trace_output.c | 268 ---------------------------------------
 kernel/trace/trace_output.h |  16 ---
 lib/Makefile                |   2 +-
 lib/trace_seq.c             | 303 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 316 insertions(+), 365 deletions(-)
 create mode 100644 lib/trace_seq.c

diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
index 136116924d8d..3fc21ee71389 100644
--- a/include/linux/trace_seq.h
+++ b/include/linux/trace_seq.h
@@ -25,10 +25,6 @@ trace_seq_init(struct trace_seq *s)
 	s->full = 0;
 }
 
-/*
- * Currently only defined when tracing is enabled.
- */
-#ifdef CONFIG_TRACING
 extern __printf(2, 3)
 int trace_seq_printf(struct trace_seq *s, const char *fmt, ...);
 extern __printf(2, 0)
@@ -49,59 +45,19 @@ extern int trace_seq_path(struct trace_seq *s, const struct path *path);
 extern int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
 			     int nmaskbits);
 
-#else /* CONFIG_TRACING */
-static inline int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
-{
-	return 0;
-}
-static inline int
-trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
-{
-	return 0;
-}
+#define MAX_MEMHEX_BYTES	8
 
-static inline int
-trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
-		  int nmaskbits)
-{
-	return 0;
-}
+#define SEQ_PUT_FIELD_RET(s, x)				\
+do {							\
+	if (!trace_seq_putmem(s, &(x), sizeof(x)))	\
+		return TRACE_TYPE_PARTIAL_LINE;		\
+} while (0)
 
-static inline int trace_print_seq(struct seq_file *m, struct trace_seq *s)
-{
-	return 0;
-}
-static inline ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf,
-				 size_t cnt)
-{
-	return 0;
-}
-static inline int trace_seq_puts(struct trace_seq *s, const char *str)
-{
-	return 0;
-}
-static inline int trace_seq_putc(struct trace_seq *s, unsigned char c)
-{
-	return 0;
-}
-static inline int
-trace_seq_putmem(struct trace_seq *s, const void *mem, size_t len)
-{
-	return 0;
-}
-static inline int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
-				       size_t len)
-{
-	return 0;
-}
-static inline void *trace_seq_reserve(struct trace_seq *s, size_t len)
-{
-	return NULL;
-}
-static inline int trace_seq_path(struct trace_seq *s, const struct path *path)
-{
-	return 0;
-}
-#endif /* CONFIG_TRACING */
+#define SEQ_PUT_HEX_FIELD_RET(s, x)			\
+do {							\
+	BUILD_BUG_ON(sizeof(x) > MAX_MEMHEX_BYTES);	\
+	if (!trace_seq_putmem_hex(s, &(x), sizeof(x)))	\
+		return TRACE_TYPE_PARTIAL_LINE;		\
+} while (0)
 
 #endif /* _LINUX_TRACE_SEQ_H */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 384ede311717..eeb233cbac4f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -923,30 +923,6 @@ out:
 	return ret;
 }
 
-ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf, size_t cnt)
-{
-	int len;
-	int ret;
-
-	if (!cnt)
-		return 0;
-
-	if (s->len <= s->readpos)
-		return -EBUSY;
-
-	len = s->len - s->readpos;
-	if (cnt > len)
-		cnt = len;
-	ret = copy_to_user(ubuf, s->buffer + s->readpos, cnt);
-	if (ret == cnt)
-		return -EFAULT;
-
-	cnt -= ret;
-
-	s->readpos += cnt;
-	return cnt;
-}
-
 static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt)
 {
 	int len;
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index f3dad80c20b2..b8930f79a04b 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -20,23 +20,6 @@ static struct hlist_head event_hash[EVENT_HASHSIZE] __read_mostly;
 
 static int next_event_type = __TRACE_LAST_TYPE + 1;
 
-int trace_print_seq(struct seq_file *m, struct trace_seq *s)
-{
-	int len = s->len >= PAGE_SIZE ? PAGE_SIZE - 1 : s->len;
-	int ret;
-
-	ret = seq_write(m, s->buffer, len);
-
-	/*
-	 * Only reset this buffer if we successfully wrote to the
-	 * seq_file buffer.
-	 */
-	if (!ret)
-		trace_seq_init(s);
-
-	return ret;
-}
-
 enum print_line_t trace_print_bputs_msg_only(struct trace_iterator *iter)
 {
 	struct trace_seq *s = &iter->seq;
@@ -85,257 +68,6 @@ enum print_line_t trace_print_printk_msg_only(struct trace_iterator *iter)
 	return TRACE_TYPE_HANDLED;
 }
 
-/**
- * trace_seq_printf - sequence printing of trace information
- * @s: trace sequence descriptor
- * @fmt: printf format string
- *
- * It returns 0 if the trace oversizes the buffer's free
- * space, 1 otherwise.
- *
- * The tracer may use either sequence operations or its own
- * copy to user routines. To simplify formating of a trace
- * trace_seq_printf is used to store strings into a special
- * buffer (@s). Then the output may be either used by
- * the sequencer or pulled into another buffer.
- */
-int
-trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
-{
-	int len = (PAGE_SIZE - 1) - s->len;
-	va_list ap;
-	int ret;
-
-	if (s->full || !len)
-		return 0;
-
-	va_start(ap, fmt);
-	ret = vsnprintf(s->buffer + s->len, len, fmt, ap);
-	va_end(ap);
-
-	/* If we can't write it all, don't bother writing anything */
-	if (ret >= len) {
-		s->full = 1;
-		return 0;
-	}
-
-	s->len += ret;
-
-	return 1;
-}
-EXPORT_SYMBOL_GPL(trace_seq_printf);
-
-/**
- * trace_seq_bitmask - put a list of longs as a bitmask print output
- * @s:		trace sequence descriptor
- * @maskp:	points to an array of unsigned longs that represent a bitmask
- * @nmaskbits:	The number of bits that are valid in @maskp
- *
- * It returns 0 if the trace oversizes the buffer's free
- * space, 1 otherwise.
- *
- * Writes a ASCII representation of a bitmask string into @s.
- */
-int
-trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
-		  int nmaskbits)
-{
-	int len = (PAGE_SIZE - 1) - s->len;
-	int ret;
-
-	if (s->full || !len)
-		return 0;
-
-	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
-	s->len += ret;
-
-	return 1;
-}
-EXPORT_SYMBOL_GPL(trace_seq_bitmask);
-
-/**
- * trace_seq_vprintf - sequence printing of trace information
- * @s: trace sequence descriptor
- * @fmt: printf format string
- *
- * The tracer may use either sequence operations or its own
- * copy to user routines. To simplify formating of a trace
- * trace_seq_printf is used to store strings into a special
- * buffer (@s). Then the output may be either used by
- * the sequencer or pulled into another buffer.
- */
-int
-trace_seq_vprintf(struct trace_seq *s, const char *fmt, va_list args)
-{
-	int len = (PAGE_SIZE - 1) - s->len;
-	int ret;
-
-	if (s->full || !len)
-		return 0;
-
-	ret = vsnprintf(s->buffer + s->len, len, fmt, args);
-
-	/* If we can't write it all, don't bother writing anything */
-	if (ret >= len) {
-		s->full = 1;
-		return 0;
-	}
-
-	s->len += ret;
-
-	return len;
-}
-EXPORT_SYMBOL_GPL(trace_seq_vprintf);
-
-int trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
-{
-	int len = (PAGE_SIZE - 1) - s->len;
-	int ret;
-
-	if (s->full || !len)
-		return 0;
-
-	ret = bstr_printf(s->buffer + s->len, len, fmt, binary);
-
-	/* If we can't write it all, don't bother writing anything */
-	if (ret >= len) {
-		s->full = 1;
-		return 0;
-	}
-
-	s->len += ret;
-
-	return len;
-}
-
-/**
- * trace_seq_puts - trace sequence printing of simple string
- * @s: trace sequence descriptor
- * @str: simple string to record
- *
- * The tracer may use either the sequence operations or its own
- * copy to user routines. This function records a simple string
- * into a special buffer (@s) for later retrieval by a sequencer
- * or other mechanism.
- */
-int trace_seq_puts(struct trace_seq *s, const char *str)
-{
-	int len = strlen(str);
-
-	if (s->full)
-		return 0;
-
-	if (len > ((PAGE_SIZE - 1) - s->len)) {
-		s->full = 1;
-		return 0;
-	}
-
-	memcpy(s->buffer + s->len, str, len);
-	s->len += len;
-
-	return len;
-}
-
-int trace_seq_putc(struct trace_seq *s, unsigned char c)
-{
-	if (s->full)
-		return 0;
-
-	if (s->len >= (PAGE_SIZE - 1)) {
-		s->full = 1;
-		return 0;
-	}
-
-	s->buffer[s->len++] = c;
-
-	return 1;
-}
-EXPORT_SYMBOL(trace_seq_putc);
-
-int trace_seq_putmem(struct trace_seq *s, const void *mem, size_t len)
-{
-	if (s->full)
-		return 0;
-
-	if (len > ((PAGE_SIZE - 1) - s->len)) {
-		s->full = 1;
-		return 0;
-	}
-
-	memcpy(s->buffer + s->len, mem, len);
-	s->len += len;
-
-	return len;
-}
-
-int trace_seq_putmem_hex(struct trace_seq *s, const void *mem, size_t len)
-{
-	unsigned char hex[HEX_CHARS];
-	const unsigned char *data = mem;
-	int i, j;
-
-	if (s->full)
-		return 0;
-
-#ifdef __BIG_ENDIAN
-	for (i = 0, j = 0; i < len; i++) {
-#else
-	for (i = len-1, j = 0; i >= 0; i--) {
-#endif
-		hex[j++] = hex_asc_hi(data[i]);
-		hex[j++] = hex_asc_lo(data[i]);
-	}
-	hex[j++] = ' ';
-
-	return trace_seq_putmem(s, hex, j);
-}
-
-void *trace_seq_reserve(struct trace_seq *s, size_t len)
-{
-	void *ret;
-
-	if (s->full)
-		return NULL;
-
-	if (len > ((PAGE_SIZE - 1) - s->len)) {
-		s->full = 1;
-		return NULL;
-	}
-
-	ret = s->buffer + s->len;
-	s->len += len;
-
-	return ret;
-}
-
-int trace_seq_path(struct trace_seq *s, const struct path *path)
-{
-	unsigned char *p;
-
-	if (s->full)
-		return 0;
-
-	if (s->len >= (PAGE_SIZE - 1)) {
-		s->full = 1;
-		return 0;
-	}
-
-	p = d_path(path, s->buffer + s->len, PAGE_SIZE - s->len);
-	if (!IS_ERR(p)) {
-		p = mangle_path(s->buffer + s->len, p, "\n");
-		if (p) {
-			s->len = p - s->buffer;
-			return 1;
-		}
-	} else {
-		s->buffer[s->len++] = '?';
-		return 1;
-	}
-
-	s->full = 1;
-	return 0;
-}
-
 const char *
 ftrace_print_flags_seq(struct trace_seq *p, const char *delim,
 		       unsigned long flags,
diff --git a/kernel/trace/trace_output.h b/kernel/trace/trace_output.h
index 127a9d8c8357..89e904540509 100644
--- a/kernel/trace/trace_output.h
+++ b/kernel/trace/trace_output.h
@@ -35,21 +35,5 @@ trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry);
 extern int __unregister_ftrace_event(struct trace_event *event);
 extern struct rw_semaphore trace_event_sem;
 
-#define MAX_MEMHEX_BYTES	8
-#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
-
-#define SEQ_PUT_FIELD_RET(s, x)				\
-do {							\
-	if (!trace_seq_putmem(s, &(x), sizeof(x)))	\
-		return TRACE_TYPE_PARTIAL_LINE;		\
-} while (0)
-
-#define SEQ_PUT_HEX_FIELD_RET(s, x)			\
-do {							\
-	BUILD_BUG_ON(sizeof(x) > MAX_MEMHEX_BYTES);	\
-	if (!trace_seq_putmem_hex(s, &(x), sizeof(x)))	\
-		return TRACE_TYPE_PARTIAL_LINE;		\
-} while (0)
-
 #endif
 
diff --git a/lib/Makefile b/lib/Makefile
index ba967a19edba..fe8d8c1b04aa 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-	 earlycpio.o
+	 earlycpio.o trace_seq.o
 
 obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
 lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/trace_seq.c b/lib/trace_seq.c
new file mode 100644
index 000000000000..5ba99c6cf834
--- /dev/null
+++ b/lib/trace_seq.c
@@ -0,0 +1,303 @@
+/*
+ * trace_seq.c
+ *
+ * Copyright (C) 2008-2014 Red Hat Inc, Steven Rostedt <srostedt@redhat.com>
+ *
+ */
+#include <linux/uaccess.h>
+#include <linux/seq_file.h>
+#include <linux/trace_seq.h>
+
+int trace_print_seq(struct seq_file *m, struct trace_seq *s)
+{
+	int len = s->len >= PAGE_SIZE ? PAGE_SIZE - 1 : s->len;
+	int ret;
+
+	ret = seq_write(m, s->buffer, len);
+
+	/*
+	 * Only reset this buffer if we successfully wrote to the
+	 * seq_file buffer.
+	 */
+	if (!ret)
+		trace_seq_init(s);
+
+	return ret;
+}
+
+/**
+ * trace_seq_printf - sequence printing of trace information
+ * @s: trace sequence descriptor
+ * @fmt: printf format string
+ *
+ * It returns 0 if the trace oversizes the buffer's free
+ * space, 1 otherwise.
+ *
+ * The tracer may use either sequence operations or its own
+ * copy to user routines. To simplify formating of a trace
+ * trace_seq_printf is used to store strings into a special
+ * buffer (@s). Then the output may be either used by
+ * the sequencer or pulled into another buffer.
+ */
+int
+trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
+{
+	int len = (PAGE_SIZE - 1) - s->len;
+	va_list ap;
+	int ret;
+
+	if (s->full || !len)
+		return 0;
+
+	va_start(ap, fmt);
+	ret = vsnprintf(s->buffer + s->len, len, fmt, ap);
+	va_end(ap);
+
+	/* If we can't write it all, don't bother writing anything */
+	if (ret >= len) {
+		s->full = 1;
+		return 0;
+	}
+
+	s->len += ret;
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(trace_seq_printf);
+
+/**
+ * trace_seq_bitmask - put a list of longs as a bitmask print output
+ * @s:		trace sequence descriptor
+ * @maskp:	points to an array of unsigned longs that represent a bitmask
+ * @nmaskbits:	The number of bits that are valid in @maskp
+ *
+ * It returns 0 if the trace oversizes the buffer's free
+ * space, 1 otherwise.
+ *
+ * Writes a ASCII representation of a bitmask string into @s.
+ */
+int
+trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
+		  int nmaskbits)
+{
+	int len = (PAGE_SIZE - 1) - s->len;
+	int ret;
+
+	if (s->full || !len)
+		return 0;
+
+	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
+	s->len += ret;
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(trace_seq_bitmask);
+
+/**
+ * trace_seq_vprintf - sequence printing of trace information
+ * @s: trace sequence descriptor
+ * @fmt: printf format string
+ *
+ * The tracer may use either sequence operations or its own
+ * copy to user routines. To simplify formating of a trace
+ * trace_seq_printf is used to store strings into a special
+ * buffer (@s). Then the output may be either used by
+ * the sequencer or pulled into another buffer.
+ */
+int
+trace_seq_vprintf(struct trace_seq *s, const char *fmt, va_list args)
+{
+	int len = (PAGE_SIZE - 1) - s->len;
+	int ret;
+
+	if (s->full || !len)
+		return 0;
+
+	ret = vsnprintf(s->buffer + s->len, len, fmt, args);
+
+	/* If we can't write it all, don't bother writing anything */
+	if (ret >= len) {
+		s->full = 1;
+		return 0;
+	}
+
+	s->len += ret;
+
+	return len;
+}
+EXPORT_SYMBOL_GPL(trace_seq_vprintf);
+
+int trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
+{
+	int len = (PAGE_SIZE - 1) - s->len;
+	int ret;
+
+	if (s->full || !len)
+		return 0;
+
+	ret = bstr_printf(s->buffer + s->len, len, fmt, binary);
+
+	/* If we can't write it all, don't bother writing anything */
+	if (ret >= len) {
+		s->full = 1;
+		return 0;
+	}
+
+	s->len += ret;
+
+	return len;
+}
+
+/**
+ * trace_seq_puts - trace sequence printing of simple string
+ * @s: trace sequence descriptor
+ * @str: simple string to record
+ *
+ * The tracer may use either the sequence operations or its own
+ * copy to user routines. This function records a simple string
+ * into a special buffer (@s) for later retrieval by a sequencer
+ * or other mechanism.
+ */
+int trace_seq_puts(struct trace_seq *s, const char *str)
+{
+	int len = strlen(str);
+
+	if (s->full)
+		return 0;
+
+	if (len > ((PAGE_SIZE - 1) - s->len)) {
+		s->full = 1;
+		return 0;
+	}
+
+	memcpy(s->buffer + s->len, str, len);
+	s->len += len;
+
+	return len;
+}
+
+int trace_seq_putc(struct trace_seq *s, unsigned char c)
+{
+	if (s->full)
+		return 0;
+
+	if (s->len >= (PAGE_SIZE - 1)) {
+		s->full = 1;
+		return 0;
+	}
+
+	s->buffer[s->len++] = c;
+
+	return 1;
+}
+EXPORT_SYMBOL(trace_seq_putc);
+
+int trace_seq_putmem(struct trace_seq *s, const void *mem, size_t len)
+{
+	if (s->full)
+		return 0;
+
+	if (len > ((PAGE_SIZE - 1) - s->len)) {
+		s->full = 1;
+		return 0;
+	}
+
+	memcpy(s->buffer + s->len, mem, len);
+	s->len += len;
+
+	return len;
+}
+
+#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
+
+int trace_seq_putmem_hex(struct trace_seq *s, const void *mem, size_t len)
+{
+	unsigned char hex[HEX_CHARS];
+	const unsigned char *data = mem;
+	int i, j;
+
+	if (s->full)
+		return 0;
+
+#ifdef __BIG_ENDIAN
+	for (i = 0, j = 0; i < len; i++) {
+#else
+	for (i = len-1, j = 0; i >= 0; i--) {
+#endif
+		hex[j++] = hex_asc_hi(data[i]);
+		hex[j++] = hex_asc_lo(data[i]);
+	}
+	hex[j++] = ' ';
+
+	return trace_seq_putmem(s, hex, j);
+}
+
+void *trace_seq_reserve(struct trace_seq *s, size_t len)
+{
+	void *ret;
+
+	if (s->full)
+		return NULL;
+
+	if (len > ((PAGE_SIZE - 1) - s->len)) {
+		s->full = 1;
+		return NULL;
+	}
+
+	ret = s->buffer + s->len;
+	s->len += len;
+
+	return ret;
+}
+
+int trace_seq_path(struct trace_seq *s, const struct path *path)
+{
+	unsigned char *p;
+
+	if (s->full)
+		return 0;
+
+	if (s->len >= (PAGE_SIZE - 1)) {
+		s->full = 1;
+		return 0;
+	}
+
+	p = d_path(path, s->buffer + s->len, PAGE_SIZE - s->len);
+	if (!IS_ERR(p)) {
+		p = mangle_path(s->buffer + s->len, p, "\n");
+		if (p) {
+			s->len = p - s->buffer;
+			return 1;
+		}
+	} else {
+		s->buffer[s->len++] = '?';
+		return 1;
+	}
+
+	s->full = 1;
+	return 0;
+}
+
+ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf, size_t cnt)
+{
+	int len;
+	int ret;
+
+	if (!cnt)
+		return 0;
+
+	if (s->len <= s->readpos)
+		return -EBUSY;
+
+	len = s->len - s->readpos;
+	if (cnt > len)
+		cnt = len;
+	ret = copy_to_user(ubuf, s->buffer + s->readpos, cnt);
+	if (ret == cnt)
+		return -EFAULT;
+
+	cnt -= ret;
+
+	s->readpos += cnt;
+	return cnt;
+}
-- 
2.0.0.rc2




* [RFC][PATCH 2/3] printk: Add per_cpu printk func to allow printk to be diverted
  2014-06-19 21:33 [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely Steven Rostedt
  2014-06-19 21:33 ` [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/ Steven Rostedt
@ 2014-06-19 21:33 ` Steven Rostedt
  2014-06-23 16:06   ` Paul E. McKenney
  2014-06-19 21:33 ` [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs Steven Rostedt
  2014-06-19 21:56 ` [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely Jiri Kosina
  3 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2014-06-19 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

[-- Attachment #1: 0002-printk-Add-per_cpu-printk-func-to-allow-printk-to-be.patch --]
[-- Type: text/plain, Size: 3373 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

Being able to divert printk() to call another function besides the normal
logging is useful for things like NMI handling. If a function that does
printk() is called from NMI context, it is possible to lock up the box
when the NMI handler triggers while another printk() is in progress.

One example of this use is performing a stack trace on all CPUs via NMI.
But if the NMI handler itself does the printk(), it can cause the system
to lock up. By allowing printk() to be diverted to another function that
safely records the output and prints it later from a safe context, NMIs
can safely call functions like show_regs().

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/percpu.h |  3 +++
 include/linux/printk.h |  2 ++
 kernel/printk/printk.c | 38 +++++++++++++++++++++++++++++---------
 3 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 8419053d0f2e..9997c92ce3bd 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -802,4 +802,7 @@ do { __this_cpu_preempt_check("or");					\
 	(__this_cpu_preempt_check("cmpxchg_double"),__pcpu_double_call_return_bool(raw_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2)))
 #endif
 
+/* To avoid include hell, as printk can not declare this, we declare it here */
+DECLARE_PER_CPU(printk_func_t, printk_func);
+
 #endif /* __LINUX_PERCPU_H */
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 319ff7e53efb..e26310b2d2fd 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -159,6 +159,8 @@ extern int kptr_restrict;
 
 extern void wake_up_klogd(void);
 
+typedef int(*printk_func_t)(const char *fmt, va_list args);
+
 void log_buf_kexec_setup(void);
 void __init setup_log_buf(int early);
 void dump_stack_set_arch_desc(const char *fmt, ...);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ea2d5f6962ed..d019ccd10d4c 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1764,6 +1764,30 @@ asmlinkage int printk_emit(int facility, int level,
 }
 EXPORT_SYMBOL(printk_emit);
 
+int vprintk_default(const char *fmt, va_list args)
+{
+	int r;
+
+#ifdef CONFIG_KGDB_KDB
+	if (unlikely(kdb_trap_printk)) {
+		r = vkdb_printf(fmt, args);
+		return r;
+	}
+#endif
+	r = vprintk_emit(0, -1, NULL, 0, fmt, args);
+
+	return r;
+}
+EXPORT_SYMBOL_GPL(vprintk_default);
+
+/*
+ * This allows printk to be diverted to another function per cpu.
+ * This is useful for calling printk functions from within NMI
+ * without worrying about race conditions that can lock up the
+ * box.
+ */
+DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
+	
 /**
  * printk - print a kernel message
  * @fmt: format string
@@ -1787,19 +1811,15 @@ EXPORT_SYMBOL(printk_emit);
  */
 asmlinkage __visible int printk(const char *fmt, ...)
 {
+	printk_func_t vprintk_func;
 	va_list args;
 	int r;
 
-#ifdef CONFIG_KGDB_KDB
-	if (unlikely(kdb_trap_printk)) {
-		va_start(args, fmt);
-		r = vkdb_printf(fmt, args);
-		va_end(args);
-		return r;
-	}
-#endif
 	va_start(args, fmt);
-	r = vprintk_emit(0, -1, NULL, 0, fmt, args);
+	preempt_disable();
+	vprintk_func = this_cpu_read(printk_func);
+	r = vprintk_func(fmt, args);
+	preempt_enable();
 	va_end(args);
 
 	return r;
-- 
2.0.0.rc2




* [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-06-19 21:33 [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely Steven Rostedt
  2014-06-19 21:33 ` [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/ Steven Rostedt
  2014-06-19 21:33 ` [RFC][PATCH 2/3] printk: Add per_cpu printk func to allow printk to be diverted Steven Rostedt
@ 2014-06-19 21:33 ` Steven Rostedt
  2014-06-20 13:58   ` Don Zickus
                     ` (2 more replies)
  2014-06-19 21:56 ` [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely Jiri Kosina
  3 siblings, 3 replies; 43+ messages in thread
From: Steven Rostedt @ 2014-06-19 21:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

[-- Attachment #1: 0003-x86-nmi-Perform-a-safe-NMI-stack-trace-on-all-CPUs.patch --]
[-- Type: text/plain, Size: 3704 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

When trigger_all_cpu_backtrace() is called on x86, it will trigger an
NMI on each CPU and call show_regs(). But this can lead to a hard lockup
if the NMI comes in during another printk().

To avoid this, when the NMI triggers, it switches the printk routine
for that CPU to an NMI-safe printk function that records the output in
a per-CPU trace_seq descriptor. After all NMIs have finished recording
their data, the trace_seqs are printed from a safe context.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/apic/hw_nmi.c | 66 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 62 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
index c3fcb5de5083..6731604bb1cd 100644
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -18,6 +18,7 @@
 #include <linux/nmi.h>
 #include <linux/module.h>
 #include <linux/delay.h>
+#include <linux/trace_seq.h>
 
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 u64 hw_nmi_get_sample_period(int watchdog_thresh)
@@ -30,11 +31,30 @@ u64 hw_nmi_get_sample_period(int watchdog_thresh)
 /* For reliability, we're prepared to waste bits here. */
 static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
 
+/* Safe printing in NMI context */
+static DEFINE_PER_CPU(struct trace_seq, nmi_print_seq);
+
 /* "in progress" flag of arch_trigger_all_cpu_backtrace */
 static unsigned long backtrace_flag;
 
+static void print_seq_line(struct trace_seq *s, int last, int pos)
+{
+	const char *buf = s->buffer + last;
+
+	/* Chop off the saved log level and update the length */
+	if (printk_get_level(buf)) {
+		buf += 2;
+		last += 2;
+	}
+
+	pr_emerg("%.*s", (pos - last) + 1, buf);
+}
+
 void arch_trigger_all_cpu_backtrace(void)
 {
+	struct trace_seq *s;
+	int len;
+	int cpu;
 	int i;
 
 	if (test_and_set_bit(0, &backtrace_flag))
@@ -44,6 +64,11 @@ void arch_trigger_all_cpu_backtrace(void)
 		 */
 		return;
 
+	for_each_possible_cpu(i) {
+		s = &per_cpu(nmi_print_seq, i);
+		trace_seq_init(s);
+	}
+
 	cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
 
 	printk(KERN_INFO "sending NMI to all CPUs:\n");
@@ -56,8 +81,39 @@ void arch_trigger_all_cpu_backtrace(void)
 		mdelay(1);
 	}
 
+	for_each_possible_cpu(cpu) {
+		int last_i = 0;
+
+		s = &per_cpu(nmi_print_seq, cpu);
+		len = s->len;
+		if (!len)
+			continue;
+
+		/* Print line by line. */
+		for (i = 0; i < len; i++) {
+			if (s->buffer[i] == '\n') {
+				print_seq_line(s, last_i, i);
+				last_i = i + 1;
+			}
+		}
+		if (last_i < i - 1) {
+			print_seq_line(s, last_i, i);
+			pr_cont("\n");
+		}
+	}
+
 	clear_bit(0, &backtrace_flag);
 	smp_mb__after_atomic();
+	return true;
+}
+
+static int nmi_vprintk(const char *fmt, va_list args)
+{
+	struct trace_seq *s = this_cpu_ptr(&nmi_print_seq);
+	int len = s->len;
+
+	trace_seq_vprintf(s, fmt, args);
+	return s->len - len;
 }
 
 static int
@@ -68,12 +124,14 @@ arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
 	cpu = smp_processor_id();
 
 	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
-		static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
+		printk_func_t printk_func_save = this_cpu_read(printk_func);
 
-		arch_spin_lock(&lock);
-		printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
+		/* Replace printk to write into the NMI seq */
+		this_cpu_write(printk_func, nmi_vprintk);
+		printk("NMI backtrace for cpu %d\n", cpu);
 		show_regs(regs);
-		arch_spin_unlock(&lock);
+		this_cpu_write(printk_func, printk_func_save);
+
 		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
 		return NMI_HANDLED;
 	}
-- 
2.0.0.rc2




* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-19 21:33 [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely Steven Rostedt
                   ` (2 preceding siblings ...)
  2014-06-19 21:33 ` [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs Steven Rostedt
@ 2014-06-19 21:56 ` Jiri Kosina
  2014-06-19 22:58   ` Steven Rostedt
  3 siblings, 1 reply; 43+ messages in thread
From: Jiri Kosina @ 2014-06-19 21:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

On Thu, 19 Jun 2014, Steven Rostedt wrote:

> This is my proposal to print the NMI stack traces from an RCU stall safely.
> Here's the gist of it.
> 
> Patch 1: move the trace_seq out of the tracing code. It's useful for other
>  purposes too. Like writing from an NMI context.
> 
> Patch 2: Add a per_cpu "printk_func" that printk calls. By default it calls
>  vprintk_default() which does what it has always done. This allows us to
>  override what printk() calls normally on a per cpu basis.
> 
> Patch 3: Have the NMI handler that dumps the stack trace just change the
>  printk_func to call a NMI safe printk function that writes to a per cpu
>  trace_seq. When all NMI handlers chimed in, the original caller prints
>  out the trace_seqs for each CPU from a printk safe context.
> 
> This is much less intrusive than the other versions out there.

I agree this is less intrusive than having printk() use two versions of 
the buffers and perform merging; OTOH, it doesn't really seem to be a 
fully clean and systematic solution either.

I had a different idea earlier today, and Petr seems to have implemented 
it already; I guess he'll be sending it out as RFC tomorrow for 
comparison.

The idea basically is to *switch* what arch_trigger_all_cpu_backtrace() 
and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI 
as a way to stop all the CPUs (one by one), and let the CPU that is 
sending the NMIs around actually walk and dump the stacks of the CPUs 
receiving the NMI IPI.

It's the most trivial approach I've been able to come up with, and should 
be usable for everybody (RCU stall detector and sysrq). The only tricky 
part is: if we want pt_regs to be part of the dump as well, how to pass 
those cleanly between the 'stopped' CPU and the CPU that is doing the 
printing. Other than that, it's just moving a few lines of code around, I 
believe.

What do you think?

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-19 21:56 ` [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely Jiri Kosina
@ 2014-06-19 22:58   ` Steven Rostedt
  2014-06-19 23:03     ` Jiri Kosina
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2014-06-19 22:58 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

On Thu, 19 Jun 2014 23:56:36 +0200 (CEST)
Jiri Kosina <jkosina@suse.cz> wrote:


> I agree this is less intrusive than having printk() use two versions of 
> the buffers and perform merging; OTOH, it doesn't really seem to be a 
> fully clean and systematic solution either.

I'm not sure why you consider it not fully clean. I think it is quite
elegant ;-)

> 
> I had a different idea earlier today, and Petr seems to have implemented 
> it already; I guess he'll be sending it out as an RFC tomorrow for 
> comparison.
> 
> The idea basically is to *switch* what arch_trigger_all_cpu_backtrace() 
> and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI 
> as a way to stop all the CPUs (one by one), and let the CPU that is 
> sending the NMIs around to actually walk and dump the stacks of the CPUs 
> receiving the NMI IPI.

And this is cleaner? Stopping a CPU via NMI and then what happens if
something else goes wrong and that CPU never starts back up? This
sounds like something that can cause more problems than it was
reporting on.

The code had better handle all the strange cases where an NMI comes in
after the code that sent it timed out waiting for the NMIs to return.
Then you also need to print out the data while the NMIs still spin.

> 
> It's the most trivial approach I've been able to come up with, and should 
> be usable for everybody (RCU stall detector and sysrq). The only tricky 
> part is: if we want pt_regs to be part of the dump as well, how to pass 
> those cleanly between the 'stopped' CPU and the CPU that is doing the 
> printing. Other than that, it's just moving a few lines of code around, I 
> believe.

Note, I first tried to just save the data and pass it to the calling
functions, but found that rather messy, as there is a lot of stuff
being dumped, and it differs between x86_64 and i386 and depending on
whether the NMI happened in userspace or in the kernel. That's when I
realized it would be trivial to just switch what printk does and then
have the NMI do the printing to another NMI-safe buffer.

It was even more convenient that trace_seq already existed and made it
much easier to implement.

-- Steve



* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-19 22:58   ` Steven Rostedt
@ 2014-06-19 23:03     ` Jiri Kosina
  2014-06-19 23:19       ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Jiri Kosina @ 2014-06-19 23:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

On Thu, 19 Jun 2014, Steven Rostedt wrote:

> > The idea basically is to *switch* what arch_trigger_all_cpu_backtrace() 
> > and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI 
> > as a way to stop all the CPUs (one by one), and let the CPU that is 
> > sending the NMIs around to actually walk and dump the stacks of the CPUs 
> > receiving the NMI IPI.
> 
> And this is cleaner? Stopping a CPU via NMI and then what happens if
> something else goes wrong and that CPU never starts back up? This
> sounds like something that can cause more problems than it was
> reporting on.

It's going to get an NMI in exactly the same situations it does with the 
current arch_trigger_all_cpu_backtrace(), the only difference being that 
it doesn't try to invoke printk() from inside NMI. The IPI-NMI is used 
solely as a point of synchronization for the stack dumping.

> Then you also need to print out the data while the NMIs still spin.

Exactly, that's actually the whole point.

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-19 23:03     ` Jiri Kosina
@ 2014-06-19 23:19       ` Steven Rostedt
  2014-06-19 23:27         ` Jiri Kosina
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2014-06-19 23:19 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

On Fri, 20 Jun 2014 01:03:28 +0200 (CEST)
Jiri Kosina <jkosina@suse.cz> wrote:

> On Thu, 19 Jun 2014, Steven Rostedt wrote:
> 
> > > The idea basically is to *switch* what arch_trigger_all_cpu_backtrace() 
> > > and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI 
> > > as a way to stop all the CPUs (one by one), and let the CPU that is 
> > > sending the NMIs around to actually walk and dump the stacks of the CPUs 
> > > receiving the NMI IPI.
> > 
> > And this is cleaner? Stopping a CPU via NMI and then what happens if
> > something else goes wrong and that CPU never starts back up? This
> > sounds like something that can cause more problems than it was
> > reporting on.
> 
> It's going to get NMI in exactly the same situations it does with the 
> current arch_trigger_all_cpu_backtrace(), the only difference being that 
> it doesn't try to invoke printk() from inside NMI. The IPI-NMI is used 
> solely as a point of synchronization for the stack dumping.

Well, all CPUs are going to be spinning until the main CPU prints
everything out. That's not quite the same thing as what it used to do.

> 
> > Then you also need to print out the data while the NMIs still spin.
> 
> Exactly, that's actually the whole point.

But this stops everything with a big hammer, until everything gets
printed out, not just the one CPU that happens to be stuck.

-- Steve



* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-19 23:19       ` Steven Rostedt
@ 2014-06-19 23:27         ` Jiri Kosina
  2014-06-19 23:36           ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Jiri Kosina @ 2014-06-19 23:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

On Thu, 19 Jun 2014, Steven Rostedt wrote:

> > > > The idea basically is to *switch* what arch_trigger_all_cpu_backtrace() 
> > > > and arch_trigger_all_cpu_backtrace_handler() are doing; i.e. use the NMI 
> > > > as a way to stop all the CPUs (one by one), and let the CPU that is 
> > > > sending the NMIs around to actually walk and dump the stacks of the CPUs 
> > > > receiving the NMI IPI.
> > > 
> > > And this is cleaner? Stopping a CPU via NMI and then what happens if
> > > something else goes wrong and that CPU never starts back up? This
> > > sounds like something that can cause more problems than it was
> > > reporting on.
> > 
> > It's going to get NMI in exactly the same situations it does with the 
> > current arch_trigger_all_cpu_backtrace(), the only difference being that 
> > it doesn't try to invoke printk() from inside NMI. The IPI-NMI is used 
> > solely as a point of synchronization for the stack dumping.
> 
> Well, all CPUs are going to be spinning until the main CPU prints
> everything out. That's not quite the same thing as what it used to do.

Is there a reason for all CPUs to be spinning until everything is printed 
out? Each CPU will be spinning only until its own stack is printed out, 
while the other CPUs keep running uninterrupted.

I don't think there is a need for a global stop_machine()-like 
synchronization here. The printing CPU will be sending IPI to the CPU N+1 
only after it has finished printing CPU N stacktrace.

Thanks,

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-19 23:27         ` Jiri Kosina
@ 2014-06-19 23:36           ` Steven Rostedt
  2014-06-19 23:38             ` Jiri Kosina
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2014-06-19 23:36 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

On Fri, 20 Jun 2014 01:27:26 +0200 (CEST)
Jiri Kosina <jkosina@suse.cz> wrote:

> 
> I don't think there is a need for a global stop_machine()-like 
> synchronization here. The printing CPU will be sending IPI to the CPU N+1 
> only after it has finished printing CPU N stacktrace.

So you plan on sending an IPI to a CPU then wait for it to acknowledge
that it is spinning, and then print out the data and then tell the CPU
it can stop spinning?

-- Steve


* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-19 23:36           ` Steven Rostedt
@ 2014-06-19 23:38             ` Jiri Kosina
  2014-06-20 14:35               ` Petr Mládek
  0 siblings, 1 reply; 43+ messages in thread
From: Jiri Kosina @ 2014-06-19 23:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

On Thu, 19 Jun 2014, Steven Rostedt wrote:

> > I don't think there is a need for a global stop_machine()-like 
> > synchronization here. The printing CPU will be sending IPI to the CPU N+1 
> > only after it has finished printing CPU N stacktrace.
> 
> So you plan on sending an IPI to a CPU then wait for it to acknowledge
> that it is spinning, and then print out the data and then tell the CPU
> it can stop spinning?

Yes, that was exactly my idea. You have to be synchronized with the CPU 
receiving the NMI anyway in case you'd like to get its pt_regs and dump 
those as part of the dump.

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-19 21:33 ` [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/ Steven Rostedt
@ 2014-06-20  4:45   ` Linus Torvalds
  2014-06-20 16:21     ` Steven Rostedt
  2014-06-20  5:06   ` Andrew Morton
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2014-06-20  4:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linux Kernel Mailing List, Ingo Molnar, Andrew Morton,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Thu, Jun 19, 2014 at 11:33 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
>
> The trace_seq functions are rather useful outside of tracing.

They are? Why and where? And why are they still called "trace_xyz()" then?

That commit message is useless and contains no actual information.

                 Linus


* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-19 21:33 ` [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/ Steven Rostedt
  2014-06-20  4:45   ` Linus Torvalds
@ 2014-06-20  5:06   ` Andrew Morton
  2014-06-20 16:58     ` Steven Rostedt
  1 sibling, 1 reply; 43+ messages in thread
From: Andrew Morton @ 2014-06-20  5:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek

On Thu, 19 Jun 2014 17:33:30 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:

> From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> 
> The trace_seq functions are rather useful outside of tracing. Instead
> of having it be dependent on CONFIG_TRACING, move the code into lib/
> and allow other users to have access to it even when tracing is not
> configured.

What LT said.  It's pileon time!

> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  include/linux/trace_seq.h   |  68 ++--------
>  kernel/trace/trace.c        |  24 ----
>  kernel/trace/trace_output.c | 268 ---------------------------------------
>  kernel/trace/trace_output.h |  16 ---
>  lib/Makefile                |   2 +-
>  lib/trace_seq.c             | 303 ++++++++++++++++++++++++++++++++++++++++++++

Putting it in there makes me look at it ;)

> --- a/include/linux/trace_seq.h
> +++ b/include/linux/trace_seq.h
>
> ...
>
> +#define SEQ_PUT_FIELD_RET(s, x)				\
> +do {							\
> +	if (!trace_seq_putmem(s, &(x), sizeof(x)))	\

hm, does sizeof(x++) increment x?  I guess it does.

> +		return TRACE_TYPE_PARTIAL_LINE;		\
> +} while (0)
>  
>
> ...
>
> +#define SEQ_PUT_HEX_FIELD_RET(s, x)			\
> +do {							\
> +	BUILD_BUG_ON(sizeof(x) > MAX_MEMHEX_BYTES);	\
> +	if (!trace_seq_putmem_hex(s, &(x), sizeof(x)))	\
> +		return TRACE_TYPE_PARTIAL_LINE;		\
> +} while (0)

Also has side-effects.

>  #endif /* _LINUX_TRACE_SEQ_H */
>
> ...
>
> --- /dev/null
> +++ b/lib/trace_seq.c
> @@ -0,0 +1,303 @@
> +/*
> + * trace_seq.c
> + *
> + * Copyright (C) 2008-2014 Red Hat Inc, Steven Rostedt <srostedt@redhat.com>
> + *
> + */
> +#include <linux/uaccess.h>
> +#include <linux/seq_file.h>
> +#include <linux/trace_seq.h>
> +
> +int trace_print_seq(struct seq_file *m, struct trace_seq *s)

-ENODOC

> +{
> +	int len = s->len >= PAGE_SIZE ? PAGE_SIZE - 1 : s->len;

	int = uint >= ulong ? ulong : uint

that's spastic.  Can we choose a type and stick to it?

Also, min().

> +	int ret;
> +
> +	ret = seq_write(m, s->buffer, len);
> +
> +	/*
> +	 * Only reset this buffer if we successfully wrote to the
> +	 * seq_file buffer.

why?

> +	 */
> +	if (!ret)
> +		trace_seq_init(s);
> +
> +	return ret;
> +}
> +
> +/**
> + * trace_seq_printf - sequence printing of trace information
> + * @s: trace sequence descriptor
> + * @fmt: printf format string
> + *
> + * It returns 0 if the trace oversizes the buffer's free
> + * space, 1 otherwise.

s/oversizes/would overrun/?

> + * The tracer may use either sequence operations or its own
> + * copy to user routines. To simplify formating of a trace
> + * trace_seq_printf is used to store strings into a special
> + * buffer (@s). Then the output may be either used by
> + * the sequencer or pulled into another buffer.
> + */
> +int
> +trace_seq_printf(struct trace_seq *s, const char *fmt, ...)

unneeded newline

> +{
> +	int len = (PAGE_SIZE - 1) - s->len;

	int = ulong - uint;

> +	va_list ap;
> +	int ret;
> +
> +	if (s->full || !len)
> +		return 0;
> +
> +	va_start(ap, fmt);
> +	ret = vsnprintf(s->buffer + s->len, len, fmt, ap);
> +	va_end(ap);
> +
> +	/* If we can't write it all, don't bother writing anything */

This is somewhat unusual behavior for a write()-style thing.  Comment
should explain "why", not "what".

> +	if (ret >= len) {
> +		s->full = 1;
> +		return 0;
> +	}
> +
> +	s->len += ret;
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL_GPL(trace_seq_printf);
> +
> +/**
> + * trace_seq_bitmask - put a list of longs as a bitmask print output

Is that grammatical?

> + * @s:		trace sequence descriptor
> + * @maskp:	points to an array of unsigned longs that represent a bitmask
> + * @nmaskbits:	The number of bits that are valid in @maskp
> + *
> + * It returns 0 if the trace oversizes the buffer's free
> + * space, 1 otherwise.

Ditto

> + * Writes a ASCII representation of a bitmask string into @s.
> + */
> +int
> +trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> +		  int nmaskbits)
> +{
> +	int len = (PAGE_SIZE - 1) - s->len;
> +	int ret;
> +
> +	if (s->full || !len)
> +		return 0;
> +
> +	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
> +	s->len += ret;
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL_GPL(trace_seq_bitmask);

More dittos.

> +/**
> + * trace_seq_vprintf - sequence printing of trace information
> + * @s: trace sequence descriptor
> + * @fmt: printf format string
> + *
> + * The tracer may use either sequence operations or its own
> + * copy to user routines. To simplify formating of a trace
> + * trace_seq_printf is used to store strings into a special

"trace_seq_printf()".  Apparently it makes the kerneldoc output come
out right.

> + * buffer (@s). Then the output may be either used by
> + * the sequencer or pulled into another buffer.
> + */
> +int
> +trace_seq_vprintf(struct trace_seq *s, const char *fmt, va_list args)
> +{
> +	int len = (PAGE_SIZE - 1) - s->len;
> +	int ret;
> +
> +	if (s->full || !len)
> +		return 0;
> +
> +	ret = vsnprintf(s->buffer + s->len, len, fmt, args);
> +
> +	/* If we can't write it all, don't bother writing anything */
> +	if (ret >= len) {
> +		s->full = 1;
> +		return 0;
> +	}
> +
> +	s->len += ret;
> +
> +	return len;
> +}
> +EXPORT_SYMBOL_GPL(trace_seq_vprintf);

Several dittos.

> +int trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)

-ENODOC

> +{
> +	int len = (PAGE_SIZE - 1) - s->len;
> +	int ret;
> +
> +	if (s->full || !len)
> +		return 0;
> +
> +	ret = bstr_printf(s->buffer + s->len, len, fmt, binary);
> +
> +	/* If we can't write it all, don't bother writing anything */
> +	if (ret >= len) {
> +		s->full = 1;
> +		return 0;
> +	}
> +
> +	s->len += ret;
> +
> +	return len;
> +}

Dittos.

> +/**
> + * trace_seq_puts - trace sequence printing of simple string
> + * @s: trace sequence descriptor
> + * @str: simple string to record
> + *
> + * The tracer may use either the sequence operations or its own
> + * copy to user routines. This function records a simple string
> + * into a special buffer (@s) for later retrieval by a sequencer
> + * or other mechanism.
> + */
> +int trace_seq_puts(struct trace_seq *s, const char *str)
> +{
> +	int len = strlen(str);
> +
> +	if (s->full)
> +		return 0;
> +
> +	if (len > ((PAGE_SIZE - 1) - s->len)) {
> +		s->full = 1;
> +		return 0;
> +	}
> +
> +	memcpy(s->buffer + s->len, str, len);
> +	s->len += len;
> +
> +	return len;
> +}

Missing EXPORT_SYMBOL?

> +int trace_seq_putc(struct trace_seq *s, unsigned char c)
> +{
> +	if (s->full)
> +		return 0;
> +
> +	if (s->len >= (PAGE_SIZE - 1)) {
> +		s->full = 1;
> +		return 0;
> +	}
> +
> +	s->buffer[s->len++] = c;
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL(trace_seq_putc);

Mix of EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL()

> +int trace_seq_putmem(struct trace_seq *s, const void *mem, size_t len)
> +{
> +	if (s->full)
> +		return 0;
> +
> +	if (len > ((PAGE_SIZE - 1) - s->len)) {
> +		s->full = 1;
> +		return 0;
> +	}
> +
> +	memcpy(s->buffer + s->len, mem, len);
> +	s->len += len;
> +
> +	return len;
> +}
> +

Lotsa dittos

> +#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
> +
> +int trace_seq_putmem_hex(struct trace_seq *s, const void *mem, size_t len)
> +{
> +	unsigned char hex[HEX_CHARS];
> +	const unsigned char *data = mem;
> +	int i, j;
> +
> +	if (s->full)
> +		return 0;

What's this ->full thing all about anyway?  Some central comment which
explains the design is needed.

Is this test really needed?  trace_seq_putmem() will handle this.

> +#ifdef __BIG_ENDIAN
> +	for (i = 0, j = 0; i < len; i++) {
> +#else
> +	for (i = len-1, j = 0; i >= 0; i--) {
> +#endif
> +		hex[j++] = hex_asc_hi(data[i]);
> +		hex[j++] = hex_asc_lo(data[i]);
> +	}
> +	hex[j++] = ' ';
> +
> +	return trace_seq_putmem(s, hex, j);
> +}

-ENODOC, missing EXPORT_SYMBOL.

> +void *trace_seq_reserve(struct trace_seq *s, size_t len)

`len' is a size_t here, a uint in trace_seq and an int when it's a local.

> +{
> +	void *ret;
> +
> +	if (s->full)
> +		return NULL;
> +
> +	if (len > ((PAGE_SIZE - 1) - s->len)) {
> +		s->full = 1;
> +		return NULL;
> +	}
> +
> +	ret = s->buffer + s->len;
> +	s->len += len;
> +
> +	return ret;
> +}

Dittos

> +int trace_seq_path(struct trace_seq *s, const struct path *path)
> +{
> +	unsigned char *p;
> +
> +	if (s->full)
> +		return 0;
> +
> +	if (s->len >= (PAGE_SIZE - 1)) {
> +		s->full = 1;
> +		return 0;
> +	}
> +
> +	p = d_path(path, s->buffer + s->len, PAGE_SIZE - s->len);
> +	if (!IS_ERR(p)) {
> +		p = mangle_path(s->buffer + s->len, p, "\n");
> +		if (p) {
> +			s->len = p - s->buffer;
> +			return 1;
> +		}
> +	} else {
> +		s->buffer[s->len++] = '?';
> +		return 1;
> +	}
> +
> +	s->full = 1;
> +	return 0;
> +}

Dittos

> +ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf, size_t cnt)
> +{
> +	int len;
> +	int ret;
> +
> +	if (!cnt)
> +		return 0;
> +
> +	if (s->len <= s->readpos)
> +		return -EBUSY;
> +
> +	len = s->len - s->readpos;
> +	if (cnt > len)
> +		cnt = len;
> +	ret = copy_to_user(ubuf, s->buffer + s->readpos, cnt);
> +	if (ret == cnt)
> +		return -EFAULT;
> +
> +	cnt -= ret;
> +
> +	s->readpos += cnt;
> +	return cnt;
> +}

Dittos



* Re: [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-06-19 21:33 ` [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs Steven Rostedt
@ 2014-06-20 13:58   ` Don Zickus
  2014-06-20 14:21     ` Steven Rostedt
  2014-06-20 14:55   ` Petr Mládek
  2014-06-23 16:12   ` Paul E. McKenney
  2 siblings, 1 reply; 43+ messages in thread
From: Don Zickus @ 2014-06-20 13:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Thu, Jun 19, 2014 at 05:33:32PM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> 
> When trigger_all_cpu_backtrace() is called on x86, it will trigger an
> NMI on each CPU and call show_regs(). But this can lead to a hard lock
> up if the NMI comes in on another printk().
> 
> In order to avoid this, when the NMI triggers, it switches the printk
> routine for that CPU to call a NMI safe printk function that records the
> printk in a per_cpu trace_seq descriptor. After all NMIs have finished
> recording its data, the trace_seqs are printed in a safe context.

Ah, if there are other places that need to call printk from an NMI
context, do we have to copy and paste this code there too?  That would
seem a little much. :-)  I guess my only concern really is the seq_init
and seq print stuff.  The handler seems relatively simple.

Can we isolate this to a pr_nmi() routine or something?  Or maybe a
pr_nmi_queue and pr_nmi_dump or something?

Cheers,
Don

> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  arch/x86/kernel/apic/hw_nmi.c | 66 ++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
> index c3fcb5de5083..6731604bb1cd 100644
> --- a/arch/x86/kernel/apic/hw_nmi.c
> +++ b/arch/x86/kernel/apic/hw_nmi.c
> @@ -18,6 +18,7 @@
>  #include <linux/nmi.h>
>  #include <linux/module.h>
>  #include <linux/delay.h>
> +#include <linux/trace_seq.h>
>  
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>  u64 hw_nmi_get_sample_period(int watchdog_thresh)
> @@ -30,11 +31,30 @@ u64 hw_nmi_get_sample_period(int watchdog_thresh)
>  /* For reliability, we're prepared to waste bits here. */
>  static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
>  
> +/* Safe printing in NMI context */
> +static DEFINE_PER_CPU(struct trace_seq, nmi_print_seq);
> +
>  /* "in progress" flag of arch_trigger_all_cpu_backtrace */
>  static unsigned long backtrace_flag;
>  
> +static void print_seq_line(struct trace_seq *s, int last, int pos)
> +{
> +	const char *buf = s->buffer + last;
> +
> +	/* Chop off the saved log level and update the length */
> +	if (printk_get_level(buf)) {
> +		buf += 2;
> +		last += 2;
> +	}
> +
> +	pr_emerg("%.*s", (pos - last) + 1, buf);
> +}
> +
>  void arch_trigger_all_cpu_backtrace(void)
>  {
> +	struct trace_seq *s;
> +	int len;
> +	int cpu;
>  	int i;
>  
>  	if (test_and_set_bit(0, &backtrace_flag))
> @@ -44,6 +64,11 @@ void arch_trigger_all_cpu_backtrace(void)
>  		 */
>  		return;
>  
> +	for_each_possible_cpu(i) {
> +		s = &per_cpu(nmi_print_seq, i);
> +		trace_seq_init(s);
> +	}
> +
>  	cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
>  
>  	printk(KERN_INFO "sending NMI to all CPUs:\n");
> @@ -56,8 +81,39 @@ void arch_trigger_all_cpu_backtrace(void)
>  		mdelay(1);
>  	}
>  
> +	for_each_possible_cpu(cpu) {
> +		int last_i = 0;
> +
> +		s = &per_cpu(nmi_print_seq, cpu);
> +		len = s->len;
> +		if (!len)
> +			continue;
> +
> +		/* Print line by line. */
> +		for (i = 0; i < len; i++) {
> +			if (s->buffer[i] == '\n') {
> +				print_seq_line(s, last_i, i);
> +				last_i = i + 1;
> +			}
> +		}
> +		if (last_i < i - 1) {
> +			print_seq_line(s, last_i, i);
> +			pr_cont("\n");
> +		}
> +	}
> +
>  	clear_bit(0, &backtrace_flag);
>  	smp_mb__after_atomic();
> +	return true;
> +}
> +
> +static int nmi_vprintk(const char *fmt, va_list args)
> +{
> +	struct trace_seq *s = this_cpu_ptr(&nmi_print_seq);
> +	int len = s->len;
> +
> +	trace_seq_vprintf(s, fmt, args);
> +	return s->len - len;
>  }
>  
>  static int
> @@ -68,12 +124,14 @@ arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
>  	cpu = smp_processor_id();
>  
>  	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> -		static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
> +		printk_func_t printk_func_save = this_cpu_read(printk_func);
>  
> -		arch_spin_lock(&lock);
> -		printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
> +		/* Replace printk to write into the NMI seq */
> +		this_cpu_write(printk_func, nmi_vprintk);
> +		printk("NMI backtrace for cpu %d\n", cpu);
>  		show_regs(regs);
> -		arch_spin_unlock(&lock);
> +		this_cpu_write(printk_func, printk_func_save);
> +
>  		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
>  		return NMI_HANDLED;
>  	}
> -- 
> 2.0.0.rc2
> 


* Re: [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-06-20 13:58   ` Don Zickus
@ 2014-06-20 14:21     ` Steven Rostedt
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2014-06-20 14:21 UTC (permalink / raw)
  To: Don Zickus
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Fri, 20 Jun 2014 09:58:15 -0400
Don Zickus <dzickus@redhat.com> wrote:

> On Thu, Jun 19, 2014 at 05:33:32PM -0400, Steven Rostedt wrote:
> > From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> > 
> > When trigger_all_cpu_backtrace() is called on x86, it will trigger an
> > NMI on each CPU and call show_regs(). But this can lead to a hard lock
> > up if the NMI comes in on another printk().
> > 
> > In order to avoid this, when the NMI triggers, it switches the printk
> > routine for that CPU to call a NMI safe printk function that records the
> > printk in a per_cpu trace_seq descriptor. After all NMIs have finished
> > recording its data, the trace_seqs are printed in a safe context.
> 
> Ah, if there is other places that need to call a printk from an NMI
> context, do we have to copy and paste this code there too?  That would
> seem a little much. :-)  I guess my only concern really is the seq_init
> and seq print stuff.  The handler seems relatively simple.

This is a special case for printing from NMI, as you can trigger it
with sysrq-l. But any other prints from NMI should only be in the BUG()
case, where we should just force the output and override any locks in
printk.

> 
> Can we isolate this to a pr_nmi() routine or something?  Or maybe a
> pr_nmi_queue and pr_nmi_dump or something?
> 

We'll have to see how much traction this gets. It's just an RFC.

-- Steve


* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-19 23:38             ` Jiri Kosina
@ 2014-06-20 14:35               ` Petr Mládek
  2014-06-24 13:32                 ` Konstantin Khlebnikov
  0 siblings, 1 reply; 43+ messages in thread
From: Petr Mládek @ 2014-06-20 14:35 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Steven Rostedt, linux-kernel, Linus Torvalds, Ingo Molnar,
	Andrew Morton, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson

On Fri 2014-06-20 01:38:59, Jiri Kosina wrote:
> On Thu, 19 Jun 2014, Steven Rostedt wrote:
> 
> > > I don't think there is a need for a global stop_machine()-like 
> > > synchronization here. The printing CPU will be sending IPI to the CPU N+1 
> > > only after it has finished printing CPU N stacktrace.
> > 
> > So you plan on sending an IPI to a CPU then wait for it to acknowledge
> > that it is spinning, and then print out the data and then tell the CPU
> > it can stop spinning?
> 
> Yes, that was exactly my idea. You have to be synchronized with the CPU 
> receiving the NMI anyway in case you'd like to get its pt_regs and dump 
> those as part of the dump.

This approach did not work after all. There was still the same
race. If we stop a CPU in the middle of printk(), it does not help
to move the printing task to another CPU ;-) We would need to
make a copy of regs and all the stacks to unblock the CPU.

Hmm, in general, if we want a consistent snapshot, we need to temporarily
store the information in NMI context and put it into the main ring buffer
in normal context. We either need to copy the stacks or copy the printed text.


I start to like Steven's solution with the trace_seq buffer. I see the
following advantages:

    + the snapshot is pretty good;
	+ we still send NMI to all CPUs at the "same" time

    + only minimal time is spent in NMI context;
	+ CPUs are not blocked by each other to get sequential output

    + minimum of new code
	+ trace_seq buffer is already implemented and used
	+ it might be even better after getting attention from new users


Of course, it has also some disadvantages:

   + needs quite big per-CPU buffer;
	+ but we would need some extra space to copy the data anyway

   + the trace might be truncated;
	+ but 1 page should be enough in most cases;
	+ we could make it configurable

   + delay until the message appears in the ringbuffer and console
	+ better than freezing
	+ still saved in core file
	+ crash tool could get improved to find the traces


Note that the above solution solves only printing of the stack.
There are still other locations when printk is called in NMI
context. IMHO, some of them are helpful:

    ./arch/x86/kernel/nmi.c:                        WARN(in_nmi(),
    ./arch/x86/mm/kmemcheck/kmemcheck.c:    WARN_ON_ONCE(in_nmi());
    ./arch/x86/mm/fault.c:  WARN_ON_ONCE(in_nmi());
    ./arch/x86/mm/fault.c:  WARN_ON_ONCE(in_nmi());

    ./mm/vmalloc.c: BUG_ON(in_nmi());
    ./lib/genalloc.c:       BUG_ON(in_nmi());
    ./lib/genalloc.c:       BUG_ON(in_nmi());
    ./include/linux/hardirq.h:              BUG_ON(in_nmi());

And some are probably less important:

    ./arch/x86/platform/uv/uv_nmi.c   several locations here
    ./arch/m68k/mac/macints.c-              printk("... pausing, press NMI to resume ...");


Well, there are only a few. Maybe we could share the trace_seq buffer
here.

Of course, there is still the possibility to implement a lockless
buffer. But it would be much more complicated than the current one.
I am not sure that we really want it.


Best Regards,
Petr

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-06-19 21:33 ` [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs Steven Rostedt
  2014-06-20 13:58   ` Don Zickus
@ 2014-06-20 14:55   ` Petr Mládek
  2014-06-20 15:17     ` Steven Rostedt
  2014-06-23 16:12   ` Paul E. McKenney
  2 siblings, 1 reply; 43+ messages in thread
From: Petr Mládek @ 2014-06-20 14:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson

On Thu 2014-06-19 17:33:32, Steven Rostedt wrote:
> From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> 
> When trigger_all_cpu_backtrace() is called on x86, it will trigger an
> NMI on each CPU and call show_regs(). But this can lead to a hard lock
> up if the NMI comes in on another printk().
> 
> In order to avoid this, when the NMI triggers, it switches the printk
> routine for that CPU to call a NMI safe printk function that records the
> printk in a per_cpu trace_seq descriptor. After all NMIs have finished
> recording its data, the trace_seqs are printed in a safe context.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  arch/x86/kernel/apic/hw_nmi.c | 66 ++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
> index c3fcb5de5083..6731604bb1cd 100644
> --- a/arch/x86/kernel/apic/hw_nmi.c
> +++ b/arch/x86/kernel/apic/hw_nmi.c
> @@ -18,6 +18,7 @@
>  #include <linux/nmi.h>
>  #include <linux/module.h>
>  #include <linux/delay.h>
> +#include <linux/trace_seq.h>
>  
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>  u64 hw_nmi_get_sample_period(int watchdog_thresh)
> @@ -30,11 +31,30 @@ u64 hw_nmi_get_sample_period(int watchdog_thresh)
>  /* For reliability, we're prepared to waste bits here. */
>  static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
>  
> +/* Safe printing in NMI context */
> +static DEFINE_PER_CPU(struct trace_seq, nmi_print_seq);
> +
>  /* "in progress" flag of arch_trigger_all_cpu_backtrace */
>  static unsigned long backtrace_flag;
>  
> +static void print_seq_line(struct trace_seq *s, int last, int pos)
> +{
> +	const char *buf = s->buffer + last;
> +
> +	/* Chop off the saved log level and update the length */
> +	if (printk_get_level(buf)) {
> +		buf += 2;
> +		last += 2;
> +	}
>
> +	pr_emerg("%.*s", (pos - last) + 1, buf);

I wonder why you chop off the log level and force the emergency one.
I think this would be better left to the caller. For example,
drivers/tty/sysrq.c when it wants to push the output on the console.

> +}
> +
>  void arch_trigger_all_cpu_backtrace(void)
>  {
> +	struct trace_seq *s;
> +	int len;
> +	int cpu;
>  	int i;
>  
>  	if (test_and_set_bit(0, &backtrace_flag))
> @@ -44,6 +64,11 @@ void arch_trigger_all_cpu_backtrace(void)
>  		 */
>  		return;
>  
> +	for_each_possible_cpu(i) {
> +		s = &per_cpu(nmi_print_seq, i);
> +		trace_seq_init(s);
> +	}
> +
>  	cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
>  
>  	printk(KERN_INFO "sending NMI to all CPUs:\n");
> @@ -56,8 +81,39 @@ void arch_trigger_all_cpu_backtrace(void)
>  		mdelay(1);
>  	}

I would add a comment here. e.g.

	/*
	 * Finally, copy the backtraces to the main ring buffer in a safe
	 * context, outside NMI.
	 */
 
> +	for_each_possible_cpu(cpu) {
> +		int last_i = 0;
> +
> +		s = &per_cpu(nmi_print_seq, cpu);
> +		len = s->len;
> +		if (!len)
> +			continue;
> +
> +		/* Print line by line. */
> +		for (i = 0; i < len; i++) {
> +			if (s->buffer[i] == '\n') {
> +				print_seq_line(s, last_i, i);
> +				last_i = i + 1;
> +			}
> +		}
> +		if (last_i < i - 1) {
> +			print_seq_line(s, last_i, i);
> +			pr_cont("\n");
> +		}
> +	}
> +
>  	clear_bit(0, &backtrace_flag);
>  	smp_mb__after_atomic();
> +	return true;

This function returns void ;-)

> +}
> +
> +static int nmi_vprintk(const char *fmt, va_list args)
> +{
> +	struct trace_seq *s = this_cpu_ptr(&nmi_print_seq);
> +	int len = s->len;
> +
> +	trace_seq_vprintf(s, fmt, args);
> +	return s->len - len;
>  }
>  
>  static int
> @@ -68,12 +124,14 @@ arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
>  	cpu = smp_processor_id();
>  
>  	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> -		static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
> +		printk_func_t printk_func_save = this_cpu_read(printk_func);
>  
> -		arch_spin_lock(&lock);
> -		printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
> +		/* Replace printk to write into the NMI seq */
> +		this_cpu_write(printk_func, nmi_vprintk);
> +		printk("NMI backtrace for cpu %d\n", cpu);

I would leave the log level as is.

>  		show_regs(regs);
> -		arch_spin_unlock(&lock);
> +		this_cpu_write(printk_func, printk_func_save);
> +
>  		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
>  		return NMI_HANDLED;
>  	}
> -- 
> 2.0.0.rc2

Otherwise, I basically like the approach.

Best Regards,
Petr


* Re: [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-06-20 14:55   ` Petr Mládek
@ 2014-06-20 15:17     ` Steven Rostedt
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2014-06-20 15:17 UTC (permalink / raw)
  To: Petr Mládek
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson

On Fri, 20 Jun 2014 16:55:05 +0200
Petr Mládek <pmladek@suse.cz> wrote:

> On Thu 2014-06-19 17:33:32, Steven Rostedt wrote:
> > From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> > 
> > When trigger_all_cpu_backtrace() is called on x86, it will trigger an
> > NMI on each CPU and call show_regs(). But this can lead to a hard lock
> > up if the NMI comes in on another printk().
> > 
> > In order to avoid this, when the NMI triggers, it switches the printk
> > routine for that CPU to call a NMI safe printk function that records the
> > printk in a per_cpu trace_seq descriptor. After all NMIs have finished
> > recording its data, the trace_seqs are printed in a safe context.
> > 
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> > ---
> >  arch/x86/kernel/apic/hw_nmi.c | 66 ++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 62 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
> > index c3fcb5de5083..6731604bb1cd 100644
> > --- a/arch/x86/kernel/apic/hw_nmi.c
> > +++ b/arch/x86/kernel/apic/hw_nmi.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/nmi.h>
> >  #include <linux/module.h>
> >  #include <linux/delay.h>
> > +#include <linux/trace_seq.h>
> >  
> >  #ifdef CONFIG_HARDLOCKUP_DETECTOR
> >  u64 hw_nmi_get_sample_period(int watchdog_thresh)
> > @@ -30,11 +31,30 @@ u64 hw_nmi_get_sample_period(int watchdog_thresh)
> >  /* For reliability, we're prepared to waste bits here. */
> >  static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
> >  
> > +/* Safe printing in NMI context */
> > +static DEFINE_PER_CPU(struct trace_seq, nmi_print_seq);
> > +
> >  /* "in progress" flag of arch_trigger_all_cpu_backtrace */
> >  static unsigned long backtrace_flag;
> >  
> > +static void print_seq_line(struct trace_seq *s, int last, int pos)
> > +{
> > +	const char *buf = s->buffer + last;
> > +
> > +	/* Chop off the saved log level and update the length */
> > +	if (printk_get_level(buf)) {
> > +		buf += 2;
> > +		last += 2;
> > +	}
> >
> > +	pr_emerg("%.*s", (pos - last) + 1, buf);
> 
> I wonder why you chop off the log level and force the emergency one.
> I think this would be better left to the caller. For example,
> drivers/tty/sysrq.c when it wants to push the output on the console.

Ah, that's because I originally printed it as one buffer, and then
parsed it up, stripping all of the log levels to be consistent. But I
guess just calling printk() directly would work.

> 
> > +}
> > +
> >  void arch_trigger_all_cpu_backtrace(void)
> >  {
> > +	struct trace_seq *s;
> > +	int len;
> > +	int cpu;
> >  	int i;
> >  
> >  	if (test_and_set_bit(0, &backtrace_flag))
> > @@ -44,6 +64,11 @@ void arch_trigger_all_cpu_backtrace(void)
> >  		 */
> >  		return;
> >  
> > +	for_each_possible_cpu(i) {
> > +		s = &per_cpu(nmi_print_seq, i);
> > +		trace_seq_init(s);
> > +	}
> > +
> >  	cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
> >  
> >  	printk(KERN_INFO "sending NMI to all CPUs:\n");
> > @@ -56,8 +81,39 @@ void arch_trigger_all_cpu_backtrace(void)
> >  		mdelay(1);
> >  	}
> 
> I would add a comment here. e.g.
> 
> 	/*
> 	 * Finally, copy the backtraces to the main ring buffer in a safe
> 	 * context, outside NMI.

Sure.

> 	 */
>  
> > +	for_each_possible_cpu(cpu) {
> > +		int last_i = 0;
> > +
> > +		s = &per_cpu(nmi_print_seq, cpu);
> > +		len = s->len;
> > +		if (!len)
> > +			continue;
> > +
> > +		/* Print line by line. */
> > +		for (i = 0; i < len; i++) {
> > +			if (s->buffer[i] == '\n') {
> > +				print_seq_line(s, last_i, i);
> > +				last_i = i + 1;
> > +			}
> > +		}
> > +		if (last_i < i - 1) {
> > +			print_seq_line(s, last_i, i);
> > +			pr_cont("\n");
> > +		}
> > +	}
> > +
> >  	clear_bit(0, &backtrace_flag);
> >  	smp_mb__after_atomic();
> > +	return true;
> 
> This function returns void ;-)

Oops. I went back and forth on changing the function to return bool,
because I was also going back and forth on allocating the trace_seqs
instead of having them around at boot up (just to save space). That was
going to be an add-on patch if we decide to allocate. But if we do
allocate, we need to return whether or not the allocation succeeded,
and if not, then trigger_all_cpu_backtrace() needs to fail. (This would
need to update sparc as well.)

> 
> > +}
> > +
> > +static int nmi_vprintk(const char *fmt, va_list args)
> > +{
> > +	struct trace_seq *s = this_cpu_ptr(&nmi_print_seq);
> > +	int len = s->len;
> > +
> > +	trace_seq_vprintf(s, fmt, args);
> > +	return s->len - len;
> >  }
> >  
> >  static int
> > @@ -68,12 +124,14 @@ arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
> >  	cpu = smp_processor_id();
> >  
> >  	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> > -		static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
> > +		printk_func_t printk_func_save = this_cpu_read(printk_func);
> >  
> > -		arch_spin_lock(&lock);
> > -		printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
> > +		/* Replace printk to write into the NMI seq */
> > +		this_cpu_write(printk_func, nmi_vprintk);
> > +		printk("NMI backtrace for cpu %d\n", cpu);
> 
> I would leave the log level as is.

Sure.

> 
> >  		show_regs(regs);
> > -		arch_spin_unlock(&lock);
> > +		this_cpu_write(printk_func, printk_func_save);
> > +
> >  		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
> >  		return NMI_HANDLED;
> >  	}
> > -- 
> > 2.0.0.rc2
> 
> Otherwise, I basically like the approach.

Thanks, I'll make some updates.

-- Steve


* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-20  4:45   ` Linus Torvalds
@ 2014-06-20 16:21     ` Steven Rostedt
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2014-06-20 16:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Ingo Molnar, Andrew Morton,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Thu, 19 Jun 2014 18:45:57 -1000
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Jun 19, 2014 at 11:33 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> >
> > The trace_seq functions are rather useful outside of tracing.
> 
> They are?

They will be.

> Why and where?

Why? for NMI safe output.

Where? Well, in patch 3. :-)

> And why are they still called "trace_xyz()" then?

Good point, I'd like to rename it too. It was designed like seq_file,
so what about seq_buffer_xxx()?


> 
> That commit message is useless and contains no actual information.

Yeah, it was rather weak. It was quick and not stand-alone; it
depended on reading the rest of the patches.

OK, I can update the name and make a better change log on the move.

Of course if you hate this patch series, I can just not waste any time
on it.

-- Steve



* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-20  5:06   ` Andrew Morton
@ 2014-06-20 16:58     ` Steven Rostedt
  2014-06-20 17:12       ` Andrew Morton
  2014-06-22  7:38       ` Johannes Berg
  0 siblings, 2 replies; 43+ messages in thread
From: Steven Rostedt @ 2014-06-20 16:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek, Johannes Berg

On Thu, 19 Jun 2014 22:06:07 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 19 Jun 2014 17:33:30 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> > 
> > The trace_seq functions are rather useful outside of tracing. Instead
> > of having it be dependent on CONFIG_TRACING, move the code into lib/
> > and allow other users to have access to it even when tracing is not
> > configured.
> 
> What LT said.  It's pileon time!
> 
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> > ---
> >  include/linux/trace_seq.h   |  68 ++--------
> >  kernel/trace/trace.c        |  24 ----
> >  kernel/trace/trace_output.c | 268 ---------------------------------------
> >  kernel/trace/trace_output.h |  16 ---
> >  lib/Makefile                |   2 +-
> >  lib/trace_seq.c             | 303 ++++++++++++++++++++++++++++++++++++++++++++
> 
> Putting it in there makes me look at it ;)

Damn it! If I had known this, I would have moved it there on day one.

You can't say you didn't know about this. I cc you on all my patches :-)

So next time, to get you to review my code, I just need to threaten to
add it to lib?

> 
> > --- a/include/linux/trace_seq.h
> > +++ b/include/linux/trace_seq.h
> >
> > ...
> >
> > +#define SEQ_PUT_FIELD_RET(s, x)				\
> > +do {							\
> > +	if (!trace_seq_putmem(s, &(x), sizeof(x)))	\
> 
> hm, does sizeof(x++) increment x?  I guess it does.

I hate this macro. I moved it just to make the kernel build. I would
love to change it. It is more to do with tracing than trace_seq anyway.

> 
> > +		return TRACE_TYPE_PARTIAL_LINE;		\
> > +} while (0)
> >  
> >
> > ...
> >
> > +#define SEQ_PUT_HEX_FIELD_RET(s, x)			\
> > +do {							\
> > +	BUILD_BUG_ON(sizeof(x) > MAX_MEMHEX_BYTES);	\
> > +	if (!trace_seq_putmem_hex(s, &(x), sizeof(x)))	\
> > +		return TRACE_TYPE_PARTIAL_LINE;		\
> > +} while (0)
> 
> Also has side-effects.

Yep, and it shouldn't be in the lib anyway. Again, HATE IT!

> 
> >  #endif /* _LINUX_TRACE_SEQ_H */
> >
> > ...
> >
> > --- /dev/null
> > +++ b/lib/trace_seq.c
> > @@ -0,0 +1,303 @@
> > +/*
> > + * trace_seq.c
> > + *
> > + * Copyright (C) 2008-2014 Red Hat Inc, Steven Rostedt <srostedt@redhat.com>
> > + *
> > + */
> > +#include <linux/uaccess.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/trace_seq.h>
> > +
> > +int trace_print_seq(struct seq_file *m, struct trace_seq *s)
> 
> -ENODOC

Heh, I noticed this too, and almost sent out a 2/4 patch that added the
docs. I should add it first before moving it. But I was lazy and sent
this out because I was more interested in the NMI code feedback than
this.


> 
> > +{
> > +	int len = s->len >= PAGE_SIZE ? PAGE_SIZE - 1 : s->len;
> 
> 	int = uint >= ulong ? ulog : uint
> 
> that's spastic.  Can we choose a type and stick to it?

Ah, yeah that should be fixed.

> 
> Also, min().

Right.

> 
> > +	int ret;
> > +
> > +	ret = seq_write(m, s->buffer, len);
> > +
> > +	/*
> > +	 * Only reset this buffer if we successfully wrote to the
> > +	 * seq_file buffer.
> 
> why?

Because we might want to try again. I'll update the comment.

> 
> > +	 */
> > +	if (!ret)
> > +		trace_seq_init(s);
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * trace_seq_printf - sequence printing of trace information
> > + * @s: trace sequence descriptor
> > + * @fmt: printf format string
> > + *
> > + * It returns 0 if the trace oversizes the buffer's free
> > + * space, 1 otherwise.
> 
> s/oversizes/would overrun/?

It made sense when I first read it (Jiri wrote it). How about:

  Returns 0 if the recorded print is bigger than the free space
  of the buffer. Returns 1 otherwise.

> 
> > + * The tracer may use either sequence operations or its own
> > + * copy to user routines. To simplify formating of a trace
> > + * trace_seq_printf is used to store strings into a special
> > + * buffer (@s). Then the output may be either used by
> > + * the sequencer or pulled into another buffer.
> > + */
> > +int
> > +trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
> 
> unneeded newline

OK.

> 
> > +{
> > +	int len = (PAGE_SIZE - 1) - s->len;
> 
> 	int = ulong - uint;

Maybe it should just be initialized via a macro too?

> 
> > +	va_list ap;
> > +	int ret;
> > +
> > +	if (s->full || !len)
> > +		return 0;
> > +
> > +	va_start(ap, fmt);
> > +	ret = vsnprintf(s->buffer + s->len, len, fmt, ap);
> > +	va_end(ap);
> > +
> > +	/* If we can't write it all, don't bother writing anything */
> 
> This is somewhat unusual behavior for a write()-style thing.  Comment
> should explain "why", not "what".

Yeah. It's like that because this was driven by the code that calls it,
and is pretty ingrained in the process. But taking it out of its
natural habitat, we need to explain these strange rituals that it does.

The why is: this is a buffer write, and users of it expect it to be all
or nothing, because a partial write is usually solved by flushing and
trying again with the same write arguments. If we did partial writes,
then the user would need another buffer to store the full write to be
able to just write out the rest of the line. That basically destroys
the point of trace_seq, as it is supposed to be that extra buffer.

> 
> > +	if (ret >= len) {
> > +		s->full = 1;
> > +		return 0;
> > +	}
> > +
> > +	s->len += ret;
> > +
> > +	return 1;
> > +}
> > +EXPORT_SYMBOL_GPL(trace_seq_printf);
> > +
> > +/**
> > + * trace_seq_bitmask - put a list of longs as a bitmask print output
> 
> Is that grammatical?

No but it's almost a haiku:

      trace_seq_bitmask -
   put a list of longs as a
      bitmask print output


> 
> > + * @s:		trace sequence descriptor
> > + * @maskp:	points to an array of unsigned longs that represent a bitmask
> > + * @nmaskbits:	The number of bits that are valid in @maskp
> > + *
> > + * It returns 0 if the trace oversizes the buffer's free
> > + * space, 1 otherwise.
> 
> Ditto

The beauty of cut-and-paste.

> 
> > + * Writes a ASCII representation of a bitmask string into @s.
> > + */
> > +int
> > +trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> > +		  int nmaskbits)
> > +{
> > +	int len = (PAGE_SIZE - 1) - s->len;
> > +	int ret;
> > +
> > +	if (s->full || !len)
> > +		return 0;
> > +
> > +	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
> > +	s->len += ret;
> > +
> > +	return 1;
> > +}
> > +EXPORT_SYMBOL_GPL(trace_seq_bitmask);
> 
> More dittos.

Confused. What dittos is this dittoing?

> 
> > +/**
> > + * trace_seq_vprintf - sequence printing of trace information
> > + * @s: trace sequence descriptor
> > + * @fmt: printf format string
> > + *
> > + * The tracer may use either sequence operations or its own
> > + * copy to user routines. To simplify formating of a trace
> > + * trace_seq_printf is used to store strings into a special
> 
> "trace_seq_printf()".  Apparently it makes the kerneldoc output come
> out right.

OK.

> 
> > + * buffer (@s). Then the output may be either used by
> > + * the sequencer or pulled into another buffer.
> > + */
> > +int
> > +trace_seq_vprintf(struct trace_seq *s, const char *fmt, va_list args)
> > +{
> > +	int len = (PAGE_SIZE - 1) - s->len;
> > +	int ret;
> > +
> > +	if (s->full || !len)
> > +		return 0;
> > +
> > +	ret = vsnprintf(s->buffer + s->len, len, fmt, args);
> > +
> > +	/* If we can't write it all, don't bother writing anything */
> > +	if (ret >= len) {
> > +		s->full = 1;
> > +		return 0;
> > +	}
> > +
> > +	s->len += ret;
> > +
> > +	return len;
> > +}
> > +EXPORT_SYMBOL_GPL(trace_seq_vprintf);
> 
> Several dittos.

Oh, just on the function. You're not dittoing a comment about the
EXPORT_SYMBOL_GPL() that you forgot to add, are you?

> 
> > +int trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
> 
> -ENODOC

Ditto

> 
> > +{
> > +	int len = (PAGE_SIZE - 1) - s->len;
> > +	int ret;
> > +
> > +	if (s->full || !len)
> > +		return 0;
> > +
> > +	ret = bstr_printf(s->buffer + s->len, len, fmt, binary);
> > +
> > +	/* If we can't write it all, don't bother writing anything */
> > +	if (ret >= len) {
> > +		s->full = 1;
> > +		return 0;
> > +	}
> > +
> > +	s->len += ret;
> > +
> > +	return len;
> > +}
> 
> Dittos.

Dittos

> 
> > +/**
> > + * trace_seq_puts - trace sequence printing of simple string
> > + * @s: trace sequence descriptor
> > + * @str: simple string to record
> > + *
> > + * The tracer may use either the sequence operations or its own
> > + * copy to user routines. This function records a simple string
> > + * into a special buffer (@s) for later retrieval by a sequencer
> > + * or other mechanism.
> > + */
> > +int trace_seq_puts(struct trace_seq *s, const char *str)
> > +{
> > +	int len = strlen(str);
> > +
> > +	if (s->full)
> > +		return 0;
> > +
> > +	if (len > ((PAGE_SIZE - 1) - s->len)) {
> > +		s->full = 1;
> > +		return 0;
> > +	}
> > +
> > +	memcpy(s->buffer + s->len, str, len);
> > +	s->len += len;
> > +
> > +	return len;
> > +}
> 
> Missing EXPORT_SYMBOL?

The others were used by the tracepoint infrastructure. But I'm not even
sure they are needed anymore, because the ftrace.h file now calls
another function that does more work, which is exported and then calls
these.

But if I'm going to add this to the library, they probably all should
be exported.

> 
> > +int trace_seq_putc(struct trace_seq *s, unsigned char c)
> > +{
> > +	if (s->full)
> > +		return 0;
> > +
> > +	if (s->len >= (PAGE_SIZE - 1)) {
> > +		s->full = 1;
> > +		return 0;
> > +	}
> > +
> > +	s->buffer[s->len++] = c;
> > +
> > +	return 1;
> > +}
> > +EXPORT_SYMBOL(trace_seq_putc);
> 
> Mix of EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL()

That's a mistake. They all should be EXPORT_SYMBOL_GPL(). Hmm, I caught
this once before and made a patch to fix it. That got lost somehow :-/


> 
> > +int trace_seq_putmem(struct trace_seq *s, const void *mem, size_t len)
> > +{
> > +	if (s->full)
> > +		return 0;
> > +
> > +	if (len > ((PAGE_SIZE - 1) - s->len)) {
> > +		s->full = 1;
> > +		return 0;
> > +	}
> > +
> > +	memcpy(s->buffer + s->len, mem, len);
> > +	s->len += len;
> > +
> > +	return len;
> > +}
> > +
> 
> Lotsa dittos

Ditto back at'cha

> 
> > +#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
> > +
> > +int trace_seq_putmem_hex(struct trace_seq *s, const void *mem, size_t len)
> > +{
> > +	unsigned char hex[HEX_CHARS];
> > +	const unsigned char *data = mem;
> > +	int i, j;
> > +
> > +	if (s->full)
> > +		return 0;
> 
> What's this ->full thing all about anyway?  Some central comment which
> explains the design is needed.

Comment? What? Git blame isn't good enough for ya? ;-)

> 
> Is this test really needed?  trace_seq_putmem() will handle this.

It was added as an optimization, because once the buffer filled up, you
could still have multiple calls to the trace_seq_*() functions that
would waste time trying to write to the buffer.

It seemed like a good idea at the time. I Cc'd Johannes Berg as he's
the one that implemented it.

Johannes, is this really needed, should we bother keeping it?

> 
> > +#ifdef __BIG_ENDIAN
> > +	for (i = 0, j = 0; i < len; i++) {
> > +#else
> > +	for (i = len-1, j = 0; i >= 0; i--) {
> > +#endif
> > +		hex[j++] = hex_asc_hi(data[i]);
> > +		hex[j++] = hex_asc_lo(data[i]);
> > +	}
> > +	hex[j++] = ' ';
> > +
> > +	return trace_seq_putmem(s, hex, j);
> > +}
> 
> -ENODOC, missing EXPORT_SYMBOL.

Ditto.

> 
> > +void *trace_seq_reserve(struct trace_seq *s, size_t len)
> 
> `len' is a size_t here, a uint in trace_seq and an int when it's a local.

We like to be diverse.

> 
> > +{
> > +	void *ret;
> > +
> > +	if (s->full)
> > +		return NULL;
> > +
> > +	if (len > ((PAGE_SIZE - 1) - s->len)) {
> > +		s->full = 1;
> > +		return NULL;
> > +	}
> > +
> > +	ret = s->buffer + s->len;
> > +	s->len += len;
> > +
> > +	return ret;
> > +}
> 
> Dittos

Return dittos

> 
> > +int trace_seq_path(struct trace_seq *s, const struct path *path)
> > +{
> > +	unsigned char *p;
> > +
> > +	if (s->full)
> > +		return 0;
> > +
> > +	if (s->len >= (PAGE_SIZE - 1)) {
> > +		s->full = 1;
> > +		return 0;
> > +	}
> > +
> > +	p = d_path(path, s->buffer + s->len, PAGE_SIZE - s->len);
> > +	if (!IS_ERR(p)) {
> > +		p = mangle_path(s->buffer + s->len, p, "\n");
> > +		if (p) {
> > +			s->len = p - s->buffer;
> > +			return 1;
> > +		}
> > +	} else {
> > +		s->buffer[s->len++] = '?';
> > +		return 1;
> > +	}
> > +
> > +	s->full = 1;
> > +	return 0;
> > +}
> 
> Dittos

sottiD

> 
> > +ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf, size_t cnt)
> > +{
> > +	int len;
> > +	int ret;
> > +
> > +	if (!cnt)
> > +		return 0;
> > +
> > +	if (s->len <= s->readpos)
> > +		return -EBUSY;
> > +
> > +	len = s->len - s->readpos;
> > +	if (cnt > len)
> > +		cnt = len;
> > +	ret = copy_to_user(ubuf, s->buffer + s->readpos, cnt);
> > +	if (ret == cnt)
> > +		return -EFAULT;
> > +
> > +	cnt -= ret;
> > +
> > +	s->readpos += cnt;
> > +	return cnt;
> > +}
> 
> Dittos

Triple ditto!


Hey! Thanks for the review. Much appreciated. And maybe you should read
those messages in your /dev/null folder that I cc you with. :-)

-- Steve


* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-20 16:58     ` Steven Rostedt
@ 2014-06-20 17:12       ` Andrew Morton
  2014-06-20 17:17         ` Steven Rostedt
  2014-06-20 20:28         ` Steven Rostedt
  2014-06-22  7:38       ` Johannes Berg
  1 sibling, 2 replies; 43+ messages in thread
From: Andrew Morton @ 2014-06-20 17:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek, Johannes Berg

On Fri, 20 Jun 2014 12:58:23 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:

>
> ...
>
> > 
> > > + * Writes a ASCII representation of a bitmask string into @s.
> > > + */
> > > +int
> > > +trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> > > +		  int nmaskbits)
> > > +{
> > > +	int len = (PAGE_SIZE - 1) - s->len;
> > > +	int ret;
> > > +
> > > +	if (s->full || !len)
> > > +		return 0;
> > > +
> > > +	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
> > > +	s->len += ret;
> > > +
> > > +	return 1;
> > > +}
> > > +EXPORT_SYMBOL_GPL(trace_seq_bitmask);
> > 
> > More dittos.
> 
> Confused. What dittos is this dittoing?

Unneeded newline, poorly considered choice of types.

>
> ...
>
> > > + * buffer (@s). Then the output may be either used by
> > > + * the sequencer or pulled into another buffer.
> > > + */
> > > +int
> > > +trace_seq_vprintf(struct trace_seq *s, const char *fmt, va_list args)
> > > +{
> > > +	int len = (PAGE_SIZE - 1) - s->len;
> > > +	int ret;
> > > +
> > > +	if (s->full || !len)
> > > +		return 0;
> > > +
> > > +	ret = vsnprintf(s->buffer + s->len, len, fmt, args);
> > > +
> > > +	/* If we can't write it all, don't bother writing anything */
> > > +	if (ret >= len) {
> > > +		s->full = 1;
> > > +		return 0;
> > > +	}
> > > +
> > > +	s->len += ret;
> > > +
> > > +	return len;
> > > +}
> > > +EXPORT_SYMBOL_GPL(trace_seq_vprintf);
> > 
> > Several dittos.
> 
> Oh, just on the function. You're not dittoing a comment about the
> EXPORT_SYMBOL_GPL() that you forgot to add, are you?

yup.  Unneeded newline, types, EXPORT_ confusion.

>
> ...
>
> > 
> > > +#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
> > > +
> > > +int trace_seq_putmem_hex(struct trace_seq *s, const void *mem, size_t len)
> > > +{
> > > +	unsigned char hex[HEX_CHARS];
> > > +	const unsigned char *data = mem;
> > > +	int i, j;
> > > +
> > > +	if (s->full)
> > > +		return 0;
> > 
> > What's this ->full thing all about anyway?  Some central comment which
> > explains the design is needed.
> 
> Comment? What? Git blame isn't good enough for ya? ;-)

There's always that.  There's also googling for the original list
discussion.  But it's a bit user-unfriendly, particularly when the
code has aged and was subsequently altered many times.

>
> ...
>
> 
> Hey! Thanks for the review. Much appreciated. And maybe you should read
> those messages in your /dev/null folder that I cc you with. :-)

I sometimes do.


* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-20 17:12       ` Andrew Morton
@ 2014-06-20 17:17         ` Steven Rostedt
  2014-06-20 20:28         ` Steven Rostedt
  1 sibling, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2014-06-20 17:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek, Johannes Berg

On Fri, 20 Jun 2014 10:12:44 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:


> > Comment? What? Git blame isn't good enough for ya? ;-)
> 
> There's always that.  There's also googling for the original list
> discussion.  But it's a bit user-unfriendly, particularly when the
> code has aged and was subsequently altered many times.

I was replying mostly in jest as that was what I did to remember why we
added it.

> 
> >
> > ...
> >
> > 
> > Hey! Thanks for the review. Much appreciated. And maybe you should read
> > those messages in your /dev/null folder that I cc you with. :-)
> 
> I sometimes do.

Ah, the lurker.

-- Steve


* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-20 17:12       ` Andrew Morton
  2014-06-20 17:17         ` Steven Rostedt
@ 2014-06-20 20:28         ` Steven Rostedt
  2014-06-20 20:51           ` Steven Rostedt
  1 sibling, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2014-06-20 20:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek, Johannes Berg

On Fri, 20 Jun 2014 10:12:44 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:


> > > > +#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
> > > > +
> > > > +int trace_seq_putmem_hex(struct trace_seq *s, const void *mem, size_t len)
> > > > +{
> > > > +	unsigned char hex[HEX_CHARS];
> > > > +	const unsigned char *data = mem;
> > > > +	int i, j;
> > > > +
> > > > +	if (s->full)
> > > > +		return 0;
> > > 
> > > What's this ->full thing all about anyway?  Some central comment which
> > > explains the design is needed.
> > 
> > Comment? What? Git blame isn't good enough for ya? ;-)
> 
> There's always that.  There's also googling for the original list
> discussion.  But it's a bit user-unfriendly, particularly when the
> code has aged and was subsequently altered many times.

Although I did not address this because I'm waiting to hear back from
Johannes Berg, I updated for your other comments. I hope I got them all.

Regardless of this patch series, I pulled out the code from
trace_output.c and made a separate file for the trace_seq_*() functions
in kernel/trace/trace_seq.c. I then updated the file with your
comments. I found a bug or two and I will be dealing with them later, as
this will only be a clean-up patch, not a bug-fix patch.

Anyway, here's the new file:

-- Steve

/*
 * trace_seq.c
 *
 * Copyright (C) 2008-2014 Red Hat Inc, Steven Rostedt <srostedt@redhat.com>
 *
 * The trace_seq is a handy tool that allows you to pass around a
 * descriptor for a buffer that other functions can write to. It is
 * similar to the seq_file functionality but has some differences.
 *
 * To use it, the trace_seq must be initialized with trace_seq_init().
 * This will set up the counters within the descriptor. You can call
 * trace_seq_init() more than once to reset the trace_seq to start
 * from scratch.
 * 
 * The buffer size is currently PAGE_SIZE, although it may become dynamic
 * in the future.
 *
 * A write to the buffer will either succeed or fail. That is, unlike
 * sprintf() there will not be a partial write (well, it may write into
 * the buffer but it won't update the pointers). This allows users to
 * try to write something into the trace_seq buffer and, if it fails,
 * they can flush it and try again.
 *
 */
#include <linux/uaccess.h>
#include <linux/seq_file.h>
#include <linux/trace_seq.h>

/* How much buffer is left on the trace_seq? */
#define TRACE_SEQ_BUF_LEFT(s) ((PAGE_SIZE - 1) - (s)->len)

/* How much buffer is written? */
#define TRACE_SEQ_BUF_USED(s) min((s)->len, PAGE_SIZE - 1)

/**
 * trace_print_seq - move the contents of trace_seq into a seq_file
 * @m: the seq_file descriptor that is the destination
 * @s: the trace_seq descriptor that is the source.
 *
 * Returns 0 on success and non-zero on error. If the write to the
 * seq_file succeeds, the trace_seq is reset; otherwise the trace_seq
 * is left unmodified so the caller can try again.
 */
int trace_print_seq(struct seq_file *m, struct trace_seq *s)
{
	unsigned int len = TRACE_SEQ_BUF_USED(s);
	int ret;

	ret = seq_write(m, s->buffer, len);

	/*
	 * Only reset this buffer if we successfully wrote to the
	 * seq_file buffer. This lets the caller try again or
	 * do something else with the contents.
	 */
	if (!ret)
		trace_seq_init(s);

	return ret;
}

/**
 * trace_seq_printf - sequence printing of trace information
 * @s: trace sequence descriptor
 * @fmt: printf format string
 *
 * The tracer may use either sequence operations or its own
 * copy to user routines. To simplify the formatting of a trace,
 * trace_seq_printf() is used to store strings into a special
 * buffer (@s). Then the output may be either used by
 * the sequencer or pulled into another buffer.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
{
	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
	va_list ap;
	int ret;

	if (s->full || !len)
		return 0;

	va_start(ap, fmt);
	ret = vsnprintf(s->buffer + s->len, len, fmt, ap);
	va_end(ap);

	/* If we can't write it all, don't bother writing anything */
	if (ret >= len) {
		s->full = 1;
		return 0;
	}

	s->len += ret;

	return 1;
}
EXPORT_SYMBOL_GPL(trace_seq_printf);

/**
 * trace_seq_bitmask - write a bitmask array in its ASCII representation
 * @s:		trace sequence descriptor
 * @maskp:	points to an array of unsigned longs that represent a bitmask
 * @nmaskbits:	The number of bits that are valid in @maskp
 *
 * Writes an ASCII representation of a bitmask string into @s.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
		      int nmaskbits)
{
	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
	int ret;

	if (s->full || !len)
		return 0;

	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
	s->len += ret;

	return 1;
}
EXPORT_SYMBOL_GPL(trace_seq_bitmask);

/**
 * trace_seq_vprintf - sequence printing of trace information
 * @s: trace sequence descriptor
 * @fmt: printf format string
 *
 * The tracer may use either sequence operations or its own
 * copy to user routines. To simplify the formatting of a trace,
 * trace_seq_printf() is used to store strings into a special
 * buffer (@s). Then the output may be either used by
 * the sequencer or pulled into another buffer.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_vprintf(struct trace_seq *s, const char *fmt, va_list args)
{
	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
	int ret;

	if (s->full || !len)
		return 0;

	ret = vsnprintf(s->buffer + s->len, len, fmt, args);

	/* If we can't write it all, don't bother writing anything */
	if (ret >= len) {
		s->full = 1;
		return 0;
	}

	s->len += ret;

	return len;
}
EXPORT_SYMBOL_GPL(trace_seq_vprintf);

/**
 * trace_seq_bprintf - Write the printf string from binary arguments
 * @s: trace sequence descriptor
 * @fmt: The format string for the @binary arguments
 * @binary: The binary arguments for @fmt.
 *
 * When recording in a fast path, a printf may be recorded with just
 * saving the format and the arguments as they were passed to the
 * function, instead of wasting cycles converting the arguments into
 * ASCII characters. Instead, the arguments are saved in a 32 bit
 * word array that is defined by the format string constraints.
 *
 * This function will take the format and the binary array and finish
 * the conversion into the ASCII string within the buffer.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
{
	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
	int ret;

	if (s->full || !len)
		return 0;

	ret = bstr_printf(s->buffer + s->len, len, fmt, binary);

	/* If we can't write it all, don't bother writing anything */
	if (ret >= len) {
		s->full = 1;
		return 0;
	}

	s->len += ret;

	return len;
}
EXPORT_SYMBOL_GPL(trace_seq_bprintf);

/**
 * trace_seq_puts - trace sequence printing of simple string
 * @s: trace sequence descriptor
 * @str: simple string to record
 *
 * The tracer may use either the sequence operations or its own
 * copy to user routines. This function records a simple string
 * into a special buffer (@s) for later retrieval by a sequencer
 * or other mechanism.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_puts(struct trace_seq *s, const char *str)
{
	unsigned int len = strlen(str);

	if (s->full)
		return 0;

	if (len > TRACE_SEQ_BUF_LEFT(s)) {
		s->full = 1;
		return 0;
	}

	memcpy(s->buffer + s->len, str, len);
	s->len += len;

	return len;
}
EXPORT_SYMBOL_GPL(trace_seq_puts);

/**
 * trace_seq_putc - trace sequence printing of simple character
 * @s: trace sequence descriptor
 * @c: simple character to record
 *
 * The tracer may use either the sequence operations or its own
 * copy to user routines. This function records a simple character
 * into a special buffer (@s) for later retrieval by a sequencer
 * or other mechanism.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_putc(struct trace_seq *s, unsigned char c)
{
	if (s->full)
		return 0;

	if (TRACE_SEQ_BUF_LEFT(s) < 1) {
		s->full = 1;
		return 0;
	}

	s->buffer[s->len++] = c;

	return 1;
}
EXPORT_SYMBOL_GPL(trace_seq_putc);

/**
 * trace_seq_putmem - write raw data into the trace_seq buffer
 * @s: trace sequence descriptor
 * @mem: The raw memory to copy into the buffer
 * @len: The length of the raw memory to copy (in bytes)
 *
 * There may be cases where raw memory needs to be written into the
 * buffer and a strcpy() would not work. Using this function allows
 * for such cases.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_putmem(struct trace_seq *s, const void *mem, unsigned int len)
{
	if (s->full)
		return 0;

	if (len > TRACE_SEQ_BUF_LEFT(s)) {
		s->full = 1;
		return 0;
	}

	memcpy(s->buffer + s->len, mem, len);
	s->len += len;

	return len;
}
EXPORT_SYMBOL_GPL(trace_seq_putmem);

#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)

/**
 * trace_seq_putmem_hex - write raw memory into the buffer in ASCII hex
 * @s: trace sequence descriptor
 * @mem: The raw memory to write its hex ASCII representation of
 * @len: The length of the raw memory to copy (in bytes)
 *
 * This is similar to trace_seq_putmem() except that instead of just
 * copying the raw memory into the buffer, it writes the ASCII
 * representation of it in hex characters.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
			 unsigned int len)
{
	unsigned char hex[HEX_CHARS];
	const unsigned char *data = mem;
	int i, j;

	if (s->full)
		return 0;

#ifdef __BIG_ENDIAN
	for (i = 0, j = 0; i < len; i++) {
#else
	for (i = len-1, j = 0; i >= 0; i--) {
#endif
		hex[j++] = hex_asc_hi(data[i]);
		hex[j++] = hex_asc_lo(data[i]);
	}
	hex[j++] = ' ';

	return trace_seq_putmem(s, hex, j);
}
EXPORT_SYMBOL_GPL(trace_seq_putmem_hex);

/**
 * trace_seq_reserve - reserve space on the sequence buffer
 * @s: trace sequence descriptor
 * @len: The amount to reserve.
 *
 * If for some reason there is a need to save some space on the
 * buffer to fill in later, this function is used for that purpose.
 * The given length will be reserved and a pointer to that
 * location in the buffer is returned. If there is not enough
 * buffer left to hold the given length, NULL is returned instead.
 */
void *trace_seq_reserve(struct trace_seq *s, unsigned int len)
{
	void *ret;

	if (s->full)
		return NULL;

	if (len > TRACE_SEQ_BUF_LEFT(s)) {
		s->full = 1;
		return NULL;
	}

	ret = s->buffer + s->len;
	s->len += len;

	return ret;
}

/**
 * trace_seq_path - copy a path into the sequence buffer
 * @s: trace sequence descriptor
 * @path: path to write into the sequence buffer.
 *
 * Write a path name into the sequence buffer.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_path(struct trace_seq *s, const struct path *path)
{
	unsigned char *p;

	if (s->full)
		return 0;

	if (TRACE_SEQ_BUF_LEFT(s) < 1) {
		s->full = 1;
		return 0;
	}

	p = d_path(path, s->buffer + s->len, PAGE_SIZE - s->len);
	if (!IS_ERR(p)) {
		p = mangle_path(s->buffer + s->len, p, "\n");
		if (p) {
			s->len = p - s->buffer;
			return 1;
		}
	} else {
		s->buffer[s->len++] = '?';
		return 1;
	}

	s->full = 1;
	return 0;
}

/**
 * trace_seq_to_user - copy the sequence buffer to user space
 * @s: trace sequence descriptor
 * @ubuf: The userspace memory location to copy to
 * @cnt: The amount to copy
 *
 * Copies the sequence buffer into the userspace memory pointed to
 * by @ubuf. It starts from the last read position (@s->readpos)
 * and writes up to @cnt characters or until it reaches the end of
 * the content in the buffer (@s->len), whichever comes first.
 *
 * On success, it returns the number of bytes copied.
 *
 * It returns -EBUSY if all of the content in the sequence has
 * already been read, including the case where there is nothing in
 * the sequence (@s->len == @s->readpos).
 *
 * Returns -EFAULT if the copy to userspace fails.
 */
int trace_seq_to_user(struct trace_seq *s, char __user *ubuf, int cnt)
{
	int len;
	int ret;

	if (!cnt)
		return 0;

	if (s->len <= s->readpos)
		return -EBUSY;

	len = s->len - s->readpos;
	if (cnt > len)
		cnt = len;
	ret = copy_to_user(ubuf, s->buffer + s->readpos, cnt);
	if (ret == cnt)
		return -EFAULT;

	cnt -= ret;

	s->readpos += cnt;
	return cnt;
}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-20 20:28         ` Steven Rostedt
@ 2014-06-20 20:51           ` Steven Rostedt
  2014-06-23 16:33             ` Petr Mládek
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2014-06-20 20:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Jiri Kosina,
	Michal Hocko, Jan Kara, Frederic Weisbecker, Dave Anderson,
	Petr Mladek, Johannes Berg

On Fri, 20 Jun 2014 16:28:45 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 20 Jun 2014 10:12:44 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> 
> > > > > +#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
> > > > > +
> > > > > +int trace_seq_putmem_hex(struct trace_seq *s, const void *mem, size_t len)
> > > > > +{
> > > > > +	unsigned char hex[HEX_CHARS];
> > > > > +	const unsigned char *data = mem;
> > > > > +	int i, j;
> > > > > +
> > > > > +	if (s->full)
> > > > > +		return 0;
> > > > 
> > > > What's this ->full thing all about anyway?  Some central comment which
> > > > explains the design is needed.
> > > 
> > > Comment? What? Git blame isn't good enough for ya? ;-)
> > 
> > There's always that.  There's also googling for the original list
> > discussion.  But it's a bit user-unfriendly, particularly when the
> > code has aged and was subsequently altered many times.
> 
> Although I did not address this because I'm waiting to hear back from
> Johannes Berg, I updated for your other comments. I hope I got them all.
> 
> Regardless of this patch series, I pulled out the code from
> trace_output.c and made a separate file for the trace_seq_*() functions
> in kernel/trace/trace_seq.c. I then updated the file with your
> comments. I found a bug or two and I will be dealing with them later, as
> this will only be a clean-up patch, not a bug-fix patch.
> 
> Anyway, here's the new file:

Would be helpful if I compiled it before posting it :-)

Also, I missed a few EXPORT_SYMBOL_GPL()s along the way.

New update:

-- Steve

/*
 * trace_seq.c
 *
 * Copyright (C) 2008-2014 Red Hat Inc, Steven Rostedt <srostedt@redhat.com>
 *
 * The trace_seq is a handy tool that allows you to pass around a
 * descriptor for a buffer that other functions can write to. It is
 * similar to the seq_file functionality but has some differences.
 *
 * To use it, the trace_seq must be initialized with trace_seq_init().
 * This will set up the counters within the descriptor. You can call
 * trace_seq_init() more than once to reset the trace_seq to start
 * from scratch.
 * 
 * The buffer size is currently PAGE_SIZE, although it may become dynamic
 * in the future.
 *
 * A write to the buffer will either succeed or fail. That is, unlike
 * sprintf() there will not be a partial write (well, it may write into
 * the buffer but it won't update the pointers). This allows users to
 * try to write something into the trace_seq buffer and, if it fails,
 * they can flush it and try again.
 *
 */
#include <linux/uaccess.h>
#include <linux/seq_file.h>
#include <linux/trace_seq.h>

/* How much buffer is left on the trace_seq? */
#define TRACE_SEQ_BUF_LEFT(s) ((PAGE_SIZE - 1) - (s)->len)

/* How much buffer is written? */
#define TRACE_SEQ_BUF_USED(s) min((s)->len, (unsigned int)(PAGE_SIZE - 1))

/**
 * trace_print_seq - move the contents of trace_seq into a seq_file
 * @m: the seq_file descriptor that is the destination
 * @s: the trace_seq descriptor that is the source.
 *
 * Returns 0 on success and non-zero on error. If the write to the
 * seq_file succeeds, the trace_seq is reset; otherwise the trace_seq
 * is left unmodified so the caller can try again.
 */
int trace_print_seq(struct seq_file *m, struct trace_seq *s)
{
	unsigned int len = TRACE_SEQ_BUF_USED(s);
	int ret;

	ret = seq_write(m, s->buffer, len);

	/*
	 * Only reset this buffer if we successfully wrote to the
	 * seq_file buffer. This lets the caller try again or
	 * do something else with the contents.
	 */
	if (!ret)
		trace_seq_init(s);

	return ret;
}

/**
 * trace_seq_printf - sequence printing of trace information
 * @s: trace sequence descriptor
 * @fmt: printf format string
 *
 * The tracer may use either sequence operations or its own
 * copy to user routines. To simplify the formatting of a trace,
 * trace_seq_printf() is used to store strings into a special
 * buffer (@s). Then the output may be either used by
 * the sequencer or pulled into another buffer.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
{
	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
	va_list ap;
	int ret;

	if (s->full || !len)
		return 0;

	va_start(ap, fmt);
	ret = vsnprintf(s->buffer + s->len, len, fmt, ap);
	va_end(ap);

	/* If we can't write it all, don't bother writing anything */
	if (ret >= len) {
		s->full = 1;
		return 0;
	}

	s->len += ret;

	return 1;
}
EXPORT_SYMBOL_GPL(trace_seq_printf);

/**
 * trace_seq_bitmask - write a bitmask array in its ASCII representation
 * @s:		trace sequence descriptor
 * @maskp:	points to an array of unsigned longs that represent a bitmask
 * @nmaskbits:	The number of bits that are valid in @maskp
 *
 * Writes an ASCII representation of a bitmask string into @s.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
		      int nmaskbits)
{
	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
	int ret;

	if (s->full || !len)
		return 0;

	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
	s->len += ret;

	return 1;
}
EXPORT_SYMBOL_GPL(trace_seq_bitmask);

/**
 * trace_seq_vprintf - sequence printing of trace information
 * @s: trace sequence descriptor
 * @fmt: printf format string
 *
 * The tracer may use either sequence operations or its own
 * copy to user routines. To simplify the formatting of a trace,
 * trace_seq_printf() is used to store strings into a special
 * buffer (@s). Then the output may be either used by
 * the sequencer or pulled into another buffer.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_vprintf(struct trace_seq *s, const char *fmt, va_list args)
{
	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
	int ret;

	if (s->full || !len)
		return 0;

	ret = vsnprintf(s->buffer + s->len, len, fmt, args);

	/* If we can't write it all, don't bother writing anything */
	if (ret >= len) {
		s->full = 1;
		return 0;
	}

	s->len += ret;

	return len;
}
EXPORT_SYMBOL_GPL(trace_seq_vprintf);

/**
 * trace_seq_bprintf - Write the printf string from binary arguments
 * @s: trace sequence descriptor
 * @fmt: The format string for the @binary arguments
 * @binary: The binary arguments for @fmt.
 *
 * When recording in a fast path, a printf may be recorded with just
 * saving the format and the arguments as they were passed to the
 * function, instead of wasting cycles converting the arguments into
 * ASCII characters. Instead, the arguments are saved in a 32 bit
 * word array that is defined by the format string constraints.
 *
 * This function will take the format and the binary array and finish
 * the conversion into the ASCII string within the buffer.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_bprintf(struct trace_seq *s, const char *fmt, const u32 *binary)
{
	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
	int ret;

	if (s->full || !len)
		return 0;

	ret = bstr_printf(s->buffer + s->len, len, fmt, binary);

	/* If we can't write it all, don't bother writing anything */
	if (ret >= len) {
		s->full = 1;
		return 0;
	}

	s->len += ret;

	return len;
}
EXPORT_SYMBOL_GPL(trace_seq_bprintf);

/**
 * trace_seq_puts - trace sequence printing of simple string
 * @s: trace sequence descriptor
 * @str: simple string to record
 *
 * The tracer may use either the sequence operations or its own
 * copy to user routines. This function records a simple string
 * into a special buffer (@s) for later retrieval by a sequencer
 * or other mechanism.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_puts(struct trace_seq *s, const char *str)
{
	unsigned int len = strlen(str);

	if (s->full)
		return 0;

	if (len > TRACE_SEQ_BUF_LEFT(s)) {
		s->full = 1;
		return 0;
	}

	memcpy(s->buffer + s->len, str, len);
	s->len += len;

	return len;
}
EXPORT_SYMBOL_GPL(trace_seq_puts);

/**
 * trace_seq_putc - trace sequence printing of simple character
 * @s: trace sequence descriptor
 * @c: simple character to record
 *
 * The tracer may use either the sequence operations or its own
 * copy to user routines. This function records a simple character
 * into a special buffer (@s) for later retrieval by a sequencer
 * or other mechanism.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_putc(struct trace_seq *s, unsigned char c)
{
	if (s->full)
		return 0;

	if (TRACE_SEQ_BUF_LEFT(s) < 1) {
		s->full = 1;
		return 0;
	}

	s->buffer[s->len++] = c;

	return 1;
}
EXPORT_SYMBOL_GPL(trace_seq_putc);

/**
 * trace_seq_putmem - write raw data into the trace_seq buffer
 * @s: trace sequence descriptor
 * @mem: The raw memory to copy into the buffer
 * @len: The length of the raw memory to copy (in bytes)
 *
 * There may be cases where raw memory needs to be written into the
 * buffer and a strcpy() would not work. Using this function allows
 * for such cases.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_putmem(struct trace_seq *s, const void *mem, unsigned int len)
{
	if (s->full)
		return 0;

	if (len > TRACE_SEQ_BUF_LEFT(s)) {
		s->full = 1;
		return 0;
	}

	memcpy(s->buffer + s->len, mem, len);
	s->len += len;

	return len;
}
EXPORT_SYMBOL_GPL(trace_seq_putmem);

#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)

/**
 * trace_seq_putmem_hex - write raw memory into the buffer in ASCII hex
 * @s: trace sequence descriptor
 * @mem: The raw memory to write its hex ASCII representation of
 * @len: The length of the raw memory to copy (in bytes)
 *
 * This is similar to trace_seq_putmem() except that instead of just
 * copying the raw memory into the buffer, it writes the ASCII
 * representation of it in hex characters.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
			 unsigned int len)
{
	unsigned char hex[HEX_CHARS];
	const unsigned char *data = mem;
	int i, j;

	if (s->full)
		return 0;

#ifdef __BIG_ENDIAN
	for (i = 0, j = 0; i < len; i++) {
#else
	for (i = len-1, j = 0; i >= 0; i--) {
#endif
		hex[j++] = hex_asc_hi(data[i]);
		hex[j++] = hex_asc_lo(data[i]);
	}
	hex[j++] = ' ';

	return trace_seq_putmem(s, hex, j);
}
EXPORT_SYMBOL_GPL(trace_seq_putmem_hex);

/**
 * trace_seq_reserve - reserve space on the sequence buffer
 * @s: trace sequence descriptor
 * @len: The amount to reserve.
 *
 * If for some reason there is a need to save some space on the
 * buffer to fill in later, this function is used for that purpose.
 * The given length will be reserved and a pointer to that
 * location in the buffer is returned. If there is not enough
 * buffer left to hold the given length, NULL is returned instead.
 */
void *trace_seq_reserve(struct trace_seq *s, unsigned int len)
{
	void *ret;

	if (s->full)
		return NULL;

	if (len > TRACE_SEQ_BUF_LEFT(s)) {
		s->full = 1;
		return NULL;
	}

	ret = s->buffer + s->len;
	s->len += len;

	return ret;
}
EXPORT_SYMBOL_GPL(trace_seq_reserve);

/**
 * trace_seq_path - copy a path into the sequence buffer
 * @s: trace sequence descriptor
 * @path: path to write into the sequence buffer.
 *
 * Write a path name into the sequence buffer.
 *
 * Returns 1 if we successfully wrote all the contents to
 *   the buffer.
 * Returns 0 if the length to write is bigger than the
 *   reserved buffer space. In this case, nothing gets written.
 */
int trace_seq_path(struct trace_seq *s, const struct path *path)
{
	unsigned char *p;

	if (s->full)
		return 0;

	if (TRACE_SEQ_BUF_LEFT(s) < 1) {
		s->full = 1;
		return 0;
	}

	p = d_path(path, s->buffer + s->len, PAGE_SIZE - s->len);
	if (!IS_ERR(p)) {
		p = mangle_path(s->buffer + s->len, p, "\n");
		if (p) {
			s->len = p - s->buffer;
			return 1;
		}
	} else {
		s->buffer[s->len++] = '?';
		return 1;
	}

	s->full = 1;
	return 0;
}
EXPORT_SYMBOL_GPL(trace_seq_path);

/**
 * trace_seq_to_user - copy the sequence buffer to user space
 * @s: trace sequence descriptor
 * @ubuf: The userspace memory location to copy to
 * @cnt: The amount to copy
 *
 * Copies the sequence buffer into the userspace memory pointed to
 * by @ubuf. It starts from the last read position (@s->readpos)
 * and writes up to @cnt characters or until it reaches the end of
 * the content in the buffer (@s->len), whichever comes first.
 *
 * On success, it returns the number of bytes copied.
 *
 * It returns -EBUSY if all of the content in the sequence has
 * already been read, including the case where there is nothing in
 * the sequence (@s->len == @s->readpos).
 *
 * Returns -EFAULT if the copy to userspace fails.
 */
int trace_seq_to_user(struct trace_seq *s, char __user *ubuf, int cnt)
{
	int len;
	int ret;

	if (!cnt)
		return 0;

	if (s->len <= s->readpos)
		return -EBUSY;

	len = s->len - s->readpos;
	if (cnt > len)
		cnt = len;
	ret = copy_to_user(ubuf, s->buffer + s->readpos, cnt);
	if (ret == cnt)
		return -EFAULT;

	cnt -= ret;

	s->readpos += cnt;
	return cnt;
}
EXPORT_SYMBOL_GPL(trace_seq_to_user);

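[Editor's note: the all-or-nothing write behaviour described in the file's header comment can be sketched as a small userspace model. This is a hypothetical illustration only, not the kernel code; `demo_seq`, `DEMO_BUF_SIZE`, and the function names are invented for the sketch.]

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical userspace model of trace_seq's all-or-nothing writes:
 * a fixed buffer, a length counter, and a "full" flag that latches
 * once a write does not fit. Names and sizes are invented.
 */
#define DEMO_BUF_SIZE 16

struct demo_seq {
	char buffer[DEMO_BUF_SIZE];
	unsigned int len;
	int full;
};

/* Room left, keeping one byte for the terminating NUL */
#define DEMO_BUF_LEFT(s) ((DEMO_BUF_SIZE - 1) - (s)->len)

static void demo_seq_init(struct demo_seq *s)
{
	s->len = 0;
	s->full = 0;
}

static int demo_seq_printf(struct demo_seq *s, const char *fmt, ...)
{
	unsigned int len = DEMO_BUF_LEFT(s);
	va_list ap;
	int ret;

	if (s->full || !len)
		return 0;

	va_start(ap, fmt);
	ret = vsnprintf(s->buffer + s->len, len, fmt, ap);
	va_end(ap);

	/*
	 * vsnprintf() may have scribbled into the buffer, but the
	 * length is not advanced, so the partial write is discarded.
	 */
	if (ret >= (int)len) {
		s->full = 1;
		return 0;
	}

	s->len += ret;
	return 1;
}
```

A write that fits advances the length; a write that does not fit leaves the committed length untouched, so the caller can flush and retry, as the header comment describes.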

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-20 16:58     ` Steven Rostedt
  2014-06-20 17:12       ` Andrew Morton
@ 2014-06-22  7:38       ` Johannes Berg
  2014-06-23 16:08         ` Steven Rostedt
  1 sibling, 1 reply; 43+ messages in thread
From: Johannes Berg @ 2014-06-22  7:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, linux-kernel, Linus Torvalds, Ingo Molnar,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Fri, 2014-06-20 at 12:58 -0400, Steven Rostedt wrote:

> > > +#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
> > > +
> > > +int trace_seq_putmem_hex(struct trace_seq *s, const void *mem, size_t len)
> > > +{
> > > +	unsigned char hex[HEX_CHARS];
> > > +	const unsigned char *data = mem;
> > > +	int i, j;
> > > +
> > > +	if (s->full)
> > > +		return 0;
> > 
> > What's this ->full thing all about anyway?  Some central comment which
> > explains the design is needed.
> 
> Comment? What? Git blame isn't good enough for ya? ;-)
> 
> > 
> > Is this test really needed?  trace_seq_putmem() will handle this.
> 
> It was added as an optimization, because once it filled up, you could
> still have multiple calls to the trace_seq() functions that would waste
> time trying to write the buffer.
> 
> It seemed like a good idea at the time. I Cc'd Johannes Berg as he's
> the one that implemented it.
> 
> Johannes, is this really needed, should we bother keeping it?

Honestly, I don't remember, sorry.

Looking at the code though, I'm not sure it's a pure optimisation - if
you do, say, a putc() after a failed puts(), without this check the
putc() would succeed? I can't tell right now if that's really a problem,
but it seems you could get some odd behaviour out of it.

johannes

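[Editor's note: the behaviour Johannes describes — a small putc() succeeding after a larger puts() failed, unless the latched flag rejects it — can be sketched in a minimal userspace model. Hypothetical names and sizes, not the kernel code.]

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical model of the latched "full" flag. Without the
 * s->full check in putc(), a one-byte write could still succeed
 * after a longer write failed, leaving content recorded out of
 * order relative to what the caller asked for.
 */
#define BUF_SIZE 8

struct seq {
	char buffer[BUF_SIZE];
	unsigned int len;
	int full;
};

static int seq_puts(struct seq *s, const char *str)
{
	unsigned int len = strlen(str);

	if (s->full)
		return 0;
	if (len > (BUF_SIZE - 1) - s->len) {
		s->full = 1;	/* latch: reject all later writes too */
		return 0;
	}
	memcpy(s->buffer + s->len, str, len);
	s->len += len;
	return 1;
}

static int seq_putc(struct seq *s, char c)
{
	if (s->full)		/* this is the check in question */
		return 0;
	if ((BUF_SIZE - 1) - s->len < 1) {
		s->full = 1;
		return 0;
	}
	s->buffer[s->len++] = c;
	return 1;
}
```

So the check is not purely an optimisation: it keeps the buffer from accumulating writes that skip over a failed one.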

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 2/3] printk: Add per_cpu printk func to allow printk to be diverted
  2014-06-19 21:33 ` [RFC][PATCH 2/3] printk: Add per_cpu printk func to allow printk to be diverted Steven Rostedt
@ 2014-06-23 16:06   ` Paul E. McKenney
  0 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-23 16:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Thu, Jun 19, 2014 at 05:33:31PM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> 
> Being able to divert printk() to call another function besides the normal
> logging is useful for things like NMI handling. If functions that call
> printk() are invoked from NMI context, it is possible to lock up the box
> if the NMI handler triggers while another printk() is in progress.
> 
> One example of this use is to perform a stack trace on all CPUs via NMI.
> But if the NMI itself does the printk() it can cause the system to lock up.
> By allowing printk() to be diverted to another function that can safely
> record the output and then print it later from a safe context, NMI
> handlers can safely call functions like show_regs().
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  include/linux/percpu.h |  3 +++
>  include/linux/printk.h |  2 ++
>  kernel/printk/printk.c | 38 +++++++++++++++++++++++++++++---------
>  3 files changed, 34 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index 8419053d0f2e..9997c92ce3bd 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -802,4 +802,7 @@ do { __this_cpu_preempt_check("or");					\
>  	(__this_cpu_preempt_check("cmpxchg_double"),__pcpu_double_call_return_bool(raw_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2)))
>  #endif
>  
> +/* To avoid include hell, as printk can not declare this, we declare it here */
> +DECLARE_PER_CPU(printk_func_t, printk_func);
> +
>  #endif /* __LINUX_PERCPU_H */
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 319ff7e53efb..e26310b2d2fd 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -159,6 +159,8 @@ extern int kptr_restrict;
>  
>  extern void wake_up_klogd(void);
>  
> +typedef int(*printk_func_t)(const char *fmt, va_list args);
> +
>  void log_buf_kexec_setup(void);
>  void __init setup_log_buf(int early);
>  void dump_stack_set_arch_desc(const char *fmt, ...);
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index ea2d5f6962ed..d019ccd10d4c 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1764,6 +1764,30 @@ asmlinkage int printk_emit(int facility, int level,
>  }
>  EXPORT_SYMBOL(printk_emit);
>  
> +int vprintk_default(const char *fmt, va_list args)
> +{
> +	int r;
> +
> +#ifdef CONFIG_KGDB_KDB
> +	if (unlikely(kdb_trap_printk)) {
> +		r = vkdb_printf(fmt, args);
> +		return r;
> +	}
> +#endif
> +	r = vprintk_emit(0, -1, NULL, 0, fmt, args);
> +
> +	return r;
> +}
> +EXPORT_SYMBOL_GPL(vprintk_default);
> +
> +/*
> + * This allows printk to be diverted to another function per cpu.
> + * This is useful for calling printk functions from within NMI
> + * without worrying about race conditions that can lock up the
> + * box.
> + */
> +DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
> +	
>  /**
>   * printk - print a kernel message
>   * @fmt: format string
> @@ -1787,19 +1811,15 @@ EXPORT_SYMBOL(printk_emit);
>   */
>  asmlinkage __visible int printk(const char *fmt, ...)
>  {
> +	printk_func_t vprintk_func;
>  	va_list args;
>  	int r;
>  
> -#ifdef CONFIG_KGDB_KDB
> -	if (unlikely(kdb_trap_printk)) {
> -		va_start(args, fmt);
> -		r = vkdb_printf(fmt, args);
> -		va_end(args);
> -		return r;
> -	}
> -#endif
>  	va_start(args, fmt);
> -	r = vprintk_emit(0, -1, NULL, 0, fmt, args);
> +	preempt_disable();
> +	vprintk_func = this_cpu_read(printk_func);
> +	r = vprintk_func(fmt, args);
> +	preempt_enable();
>  	va_end(args);
>  
>  	return r;
> -- 
> 2.0.0.rc2
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 43+ messages in thread
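The diversion mechanism in the patch above is easy to demonstrate outside the kernel. Below is a minimal user-space C sketch of the same pattern (hypothetical names; a single global function pointer stands in for the per-CPU `printk_func` variable, so there is no preemption handling here):

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Mirrors printk_func_t from the patch */
typedef int (*printk_func_t)(const char *fmt, va_list args);

/* Default sink: behaves like vprintk_default() by just printing */
static int vprintk_default_sketch(const char *fmt, va_list args)
{
	return vprintf(fmt, args);
}

/* Alternate sink: record into a buffer instead of emitting directly,
 * the way nmi_vprintk() records into the per-CPU trace_seq */
static char capture_buf[256];
static int capture_len;

static int vprintk_capture(const char *fmt, va_list args)
{
	int r = vsnprintf(capture_buf + capture_len,
			  sizeof(capture_buf) - capture_len, fmt, args);
	if (r > 0)
		capture_len += r;
	return r;
}

/* Stand-in for the per-CPU printk_func variable */
static printk_func_t printk_func = vprintk_default_sketch;

/* Stand-in for printk(): dispatch through the current sink */
static int printk_sketch(const char *fmt, ...)
{
	va_list args;
	int r;

	va_start(args, fmt);
	r = printk_func(fmt, args);	/* this_cpu_read() in the real code */
	va_end(args);
	return r;
}
```

A caller diverts output exactly as the NMI handler in patch 3 does: save the pointer, point it at the capture sink, run the code that printks, then restore the saved pointer.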

* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-22  7:38       ` Johannes Berg
@ 2014-06-23 16:08         ` Steven Rostedt
  2014-06-23 17:38           ` Johannes Berg
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2014-06-23 16:08 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Andrew Morton, linux-kernel, Linus Torvalds, Ingo Molnar,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Sun, 22 Jun 2014 09:38:05 +0200
Johannes Berg <johannes@sipsolutions.net> wrote:


> Looking at the code though, I'm not sure it's a pure optimisation - if
> you do say putc() after a failed puts(), without this code the putc()
> would succeed? I can't tell right now if that's really a problem, but it
> seems you could get some odd behaviour out of it.

How would putc() still succeed? We're just talking about the "full"
field. It would still do the length check:

	if (s->len >= (PAGE_SIZE - 1))
		return 0;

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-06-19 21:33 ` [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs Steven Rostedt
  2014-06-20 13:58   ` Don Zickus
  2014-06-20 14:55   ` Petr Mládek
@ 2014-06-23 16:12   ` Paul E. McKenney
  2 siblings, 0 replies; 43+ messages in thread
From: Paul E. McKenney @ 2014-06-23 16:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Thu, Jun 19, 2014 at 05:33:32PM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>
> 
> When trigger_all_cpu_backtrace() is called on x86, it will trigger an
> NMI on each CPU and call show_regs(). But this can lead to a hard lock
> up if the NMI comes in on another printk().
> 
> In order to avoid this, when the NMI triggers, it switches the printk
> routine for that CPU to an NMI-safe printk function that records the
> printk in a per_cpu trace_seq descriptor. After all NMIs have finished
> recording their data, the trace_seqs are printed from a safe context.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

I do like the idea of safe printk from NMI, so given the changes later
in this thread:

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  arch/x86/kernel/apic/hw_nmi.c | 66 ++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 62 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
> index c3fcb5de5083..6731604bb1cd 100644
> --- a/arch/x86/kernel/apic/hw_nmi.c
> +++ b/arch/x86/kernel/apic/hw_nmi.c
> @@ -18,6 +18,7 @@
>  #include <linux/nmi.h>
>  #include <linux/module.h>
>  #include <linux/delay.h>
> +#include <linux/trace_seq.h>
>  
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>  u64 hw_nmi_get_sample_period(int watchdog_thresh)
> @@ -30,11 +31,30 @@ u64 hw_nmi_get_sample_period(int watchdog_thresh)
>  /* For reliability, we're prepared to waste bits here. */
>  static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
>  
> +/* Safe printing in NMI context */
> +static DEFINE_PER_CPU(struct trace_seq, nmi_print_seq);
> +
>  /* "in progress" flag of arch_trigger_all_cpu_backtrace */
>  static unsigned long backtrace_flag;
>  
> +static void print_seq_line(struct trace_seq *s, int last, int pos)
> +{
> +	const char *buf = s->buffer + last;
> +
> +	/* Chop off the saved log level and update the length */
> +	if (printk_get_level(buf)) {
> +		buf += 2;
> +		last += 2;
> +	}
> +
> +	pr_emerg("%.*s", (pos - last) + 1, buf);
> +}
> +
>  void arch_trigger_all_cpu_backtrace(void)
>  {
> +	struct trace_seq *s;
> +	int len;
> +	int cpu;
>  	int i;
>  
>  	if (test_and_set_bit(0, &backtrace_flag))
> @@ -44,6 +64,11 @@ void arch_trigger_all_cpu_backtrace(void)
>  		 */
>  		return;
>  
> +	for_each_possible_cpu(i) {
> +		s = &per_cpu(nmi_print_seq, i);
> +		trace_seq_init(s);
> +	}
> +
>  	cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
>  
>  	printk(KERN_INFO "sending NMI to all CPUs:\n");
> @@ -56,8 +81,39 @@ void arch_trigger_all_cpu_backtrace(void)
>  		mdelay(1);
>  	}
>  
> +	for_each_possible_cpu(cpu) {
> +		int last_i = 0;
> +
> +		s = &per_cpu(nmi_print_seq, cpu);
> +		len = s->len;
> +		if (!len)
> +			continue;
> +
> +		/* Print line by line. */
> +		for (i = 0; i < len; i++) {
> +			if (s->buffer[i] == '\n') {
> +				print_seq_line(s, last_i, i);
> +				last_i = i + 1;
> +			}
> +		}
> +		if (last_i < i - 1) {
> +			print_seq_line(s, last_i, i);
> +			pr_cont("\n");
> +		}
> +	}
> +
>  	clear_bit(0, &backtrace_flag);
>  	smp_mb__after_atomic();
> +	return true;
> +}
> +
> +static int nmi_vprintk(const char *fmt, va_list args)
> +{
> +	struct trace_seq *s = this_cpu_ptr(&nmi_print_seq);
> +	int len = s->len;
> +
> +	trace_seq_vprintf(s, fmt, args);
> +	return s->len - len;
>  }
>  
>  static int
> @@ -68,12 +124,14 @@ arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
>  	cpu = smp_processor_id();
>  
>  	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> -		static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
> +		printk_func_t printk_func_save = this_cpu_read(printk_func);
>  
> -		arch_spin_lock(&lock);
> -		printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
> +		/* Replace printk to write into the NMI seq */
> +		this_cpu_write(printk_func, nmi_vprintk);
> +		printk("NMI backtrace for cpu %d\n", cpu);
>  		show_regs(regs);
> -		arch_spin_unlock(&lock);
> +		this_cpu_write(printk_func, printk_func_save);
> +
>  		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
>  		return NMI_HANDLED;
>  	}
> -- 
> 2.0.0.rc2
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply	[flat|nested] 43+ messages in thread
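The line-splitting loop in arch_trigger_all_cpu_backtrace() above can be sketched in user-space C (hypothetical helper names; the trailing chunk is bounded at `len - 1` so the sketch never indexes one past the buffer, and, as in the patch, a single trailing character without a newline is skipped):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Recorded for demonstration: line count and the last line emitted */
static int lines_printed;
static char last_line[128];

/* Stand-in for print_seq_line(): emit buf[last..pos] as one line */
static void print_line(const char *buf, int last, int pos)
{
	int n = pos - last + 1;

	snprintf(last_line, sizeof(last_line), "%.*s", n, buf + last);
	lines_printed++;
}

/* Mirrors the per-CPU printing loop: scan for '\n', emit each complete
 * line, then emit any trailing text that lacks a final newline */
static void print_buffer_by_line(const char *buf, int len)
{
	int last_i = 0, i;

	for (i = 0; i < len; i++) {
		if (buf[i] == '\n') {
			print_line(buf, last_i, i);
			last_i = i + 1;
		}
	}
	/* Trailing chunk, bounded inside the buffer */
	if (last_i < len - 1)
		print_line(buf, last_i, len - 1);
}
```

Each emitted chunk includes its terminating newline, which is what lets pr_emerg() in the real code reproduce the recorded output line by line.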

* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-20 20:51           ` Steven Rostedt
@ 2014-06-23 16:33             ` Petr Mládek
  2014-06-23 17:03               ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Petr Mládek @ 2014-06-23 16:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, linux-kernel, Linus Torvalds, Ingo Molnar,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Johannes Berg

On Fri 2014-06-20 16:51:45, Steven Rostedt wrote:
> On Fri, 20 Jun 2014 16:28:45 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Fri, 20 Jun 2014 10:12:44 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > 
> > > > > > +#define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
> > > > > > +
> > > > > > +int trace_seq_putmem_hex(struct trace_seq *s, const void *mem, size_t len)
> > > > > > +{
> > > > > > +	unsigned char hex[HEX_CHARS];
> > > > > > +	const unsigned char *data = mem;
> > > > > > +	int i, j;
> > > > > > +
> > > > > > +	if (s->full)
> > > > > > +		return 0;
> > > > > 
> > > > > What's this ->full thing all about anyway?  Some central comment which
> > > > > explains the design is needed.
> > > > 
> > > > Comment? What? Git blame isn't good enough for ya? ;-)
> > > 
> > > There's always that.  There's also googling for the original list
> > > discussion.  But it's a bit user-unfriendly, particularly when the
> > > code has aged and was subsequently altered many times.
> > 
> > Although I did not address this because I'm waiting to hear back from
> > Johannes Berg, I updated for your other comments. I hope I got them all.
> > 
> > Regardless of this patch series, I pulled out the code from
> > trace_output.c and made a separate file for the trace_seq_*() functions
> > in kernel/trace/trace_seq.c. I then updated the file with your
> > comments. I found a bug or two and I will be dealing with them later as
> > this will only be a clean up patch not a bug fix patch.
> > 
> > Anyway, here's the new file:
> 
> Would be helpful if I compiled it before posting it :-)
> 
> Also, I missed a few EXPORT_SYMBOL_GPL()s along the way.
> 
> New update:
> 
> -- Steve
> 
> /*
>  * trace_seq.c
>  *
>  * Copyright (C) 2008-2014 Red Hat Inc, Steven Rostedt <srostedt@redhat.com>
>  *
>  * The trace_seq is a handy tool that allows you to pass a descriptor around
>  * to a buffer that other functions can write to. It is similar to the
>  * seq_file functionality but has some differences.
>  *
>  * To use it, the trace_seq must be initialized with trace_seq_init().
>  * This will set up the counters within the descriptor. You can call
>  * trace_seq_init() more than once to reset the trace_seq to start
>  * from scratch.
>  * 
>  * The buffer size is currently PAGE_SIZE, although it may become dynamic
>  * in the future.
>  *
>  * A write to the buffer will either succeed or fail. That is, unlike
>  * sprintf() there will not be a partial write (well, it may write into
>  * the buffer but it won't update the pointers). This allows users to
>  * try to write something into the trace_seq buffer and if it fails
>  * they can flush it and try again.

We might want to write as much as possible to the buffer if it is used
for the backtrace.

Well, if I understand it correctly, this is only about the last
line. If the trace does not fit, we are in troubles anyway. We might
solve this later when needed.

>  */
> #include <linux/uaccess.h>
> #include <linux/seq_file.h>
> #include <linux/trace_seq.h>
> 
> /* How much buffer is left on the trace_seq? */
> #define TRACE_SEQ_BUF_LEFT(s) ((PAGE_SIZE - 1) - (s)->len)
> 
> /* How much buffer is written? */
> #define TRACE_SEQ_BUF_USED(s) min((s)->len, (unsigned int)(PAGE_SIZE - 1))

[...]
 
> /**
>  * trace_seq_printf - sequence printing of trace information
>  * @s: trace sequence descriptor
>  * @fmt: printf format string
>  *
>  * The tracer may use either sequence operations or its own
>  * copy to user routines. To simplify formating of a trace
>  * trace_seq_printf() is used to store strings into a special
>  * buffer (@s). Then the output may be either used by
>  * the sequencer or pulled into another buffer.
>  *
>  * Returns 1 if we successfully wrote all the contents to
>  *   the buffer.
>  * Returns 0 if the length to write is bigger than the
>  *   reserved buffer space. In this case, nothing gets written.
>  */
> int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
> {
> 	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
> 	va_list ap;
> 	int ret;
> 
> 	if (s->full || !len)
> 		return 0;
> 
> 	va_start(ap, fmt);
> 	ret = vsnprintf(s->buffer + s->len, len, fmt, ap);
> 	va_end(ap);
> 
> 	/* If we can't write it all, don't bother writing anything */
> 	if (ret >= len) {
> 		s->full = 1;
> 		return 0;
> 	}
> 
> 	s->len += ret;
> 
> 	return 1;
> }
> EXPORT_SYMBOL_GPL(trace_seq_printf);
> 
> /**
>  * trace_seq_bitmask - write a bitmask array in its ASCII representation
>  * @s:		trace sequence descriptor
>  * @maskp:	points to an array of unsigned longs that represent a bitmask
>  * @nmaskbits:	The number of bits that are valid in @maskp
>  *
>  * Writes an ASCII representation of a bitmask string into @s.
>  *
>  * Returns 1 if we successfully wrote all the contents to
>  *   the buffer.
>  * Returns 0 if the length to write is bigger than the
>  *   reserved buffer space. In this case, nothing gets written.
>  */
> int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> 		      int nmaskbits)
> {
> 	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
> 	int ret;
> 
> 	if (s->full || !len)
> 		return 0;
> 
> 	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);

We do not detect here if the whole bitmap is written. Hmm, the perfect
solution is not easy. What about using the following workaround?

If we wrote until the end of the buffer, the write was most likely
incomplete:

	if (ret == len) {
		s->full = 1;
		return 0;
	}

It is more consistent with what we do in the other functions.

> 	s->len += ret;
>
> 	return 1;
> }
> EXPORT_SYMBOL_GPL(trace_seq_bitmask);
> 

[...]

> /**
>  * trace_seq_putmem_hex - write raw memory into the buffer in ASCII hex
>  * @s: trace sequence descriptor
>  * @mem: The raw memory to write its hex ASCII representation of
>  * @len: The length of the raw memory to copy (in bytes)
>  *
>  * This is similar to trace_seq_putmem() except instead of just copying the
>  * raw memory into the buffer it writes its ASCII representation of it
>  * in hex characters.
>  *
>  * Returns 1 if we successfully wrote all the contents to
>  *   the buffer.
>  * Returns 0 if the length to write is bigger than the
>  *   reserved buffer space. In this case, nothing gets written.
>  */
> int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
> 			 unsigned int len)
> {
> 	unsigned char hex[HEX_CHARS];
> 	const unsigned char *data = mem;
> 	int i, j;
> 
> 	if (s->full)
> 		return 0;
> 
> #ifdef __BIG_ENDIAN
> 	for (i = 0, j = 0; i < len; i++) {
> #else
> 	for (i = len-1, j = 0; i >= 0; i--) {
> #endif

This might overflow when "len" > (HEX_CHARS-1)/2.

I think that we should either limit "len" or do it in two cycles.

Hmm, if we use two cycles, it might be a problem to handle the
endianness. IMHO, it depends on what type of data is written into the
buffer (single vs. set of values).

> 		hex[j++] = hex_asc_hi(data[i]);
> 		hex[j++] = hex_asc_lo(data[i]);
> 	}
> 	hex[j++] = ' ';
> 
> 	return trace_seq_putmem(s, hex, j);
> }
> EXPORT_SYMBOL_GPL(trace_seq_putmem_hex);
> 
> /**
>  * trace_seq_reserve - reserve space on the sequence buffer
>  * @s: trace sequence descriptor
>  * @len: The amount to reserve.
>  *
>  * If for some reason there is a need to save some space on the
>  * buffer to fill in later, this function is used for that purpose.
>  * The given length will be reserved and the pointer to that
>  * location on the buffer is returned, unless there is not enough
>  * buffer left to hold the given length then NULL is returned.
>  */
> void *trace_seq_reserve(struct trace_seq *s, unsigned int len)
> {
> 	void *ret;
> 
> 	if (s->full)
> 		return NULL;
> 
> 	if (len > TRACE_SEQ_BUF_LEFT(s)) {
> 		s->full = 1;
> 		return NULL;
> 	}
> 
> 	ret = s->buffer + s->len;
> 	s->len += len;
> 
> 	return ret;
> }
> EXPORT_SYMBOL_GPL(trace_seq_reserve);

This sounds like a slightly dangerous operation. The reserve is valid only
until the buffer is re-initialized. Is this really intended?
Do we really want to export it?

Well, the whole buffer needs to be used carefully. IMHO, it can't be accessed
in parallel without extra locks. So, this might be fine after all. We
could only warn about this in the function description.

[...]

The rest looks fine to me.

Also the proposed name "seq_buf" sounds fine to me.


Best Regards,
Petr

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-23 16:33             ` Petr Mládek
@ 2014-06-23 17:03               ` Steven Rostedt
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Rostedt @ 2014-06-23 17:03 UTC (permalink / raw)
  To: Petr Mládek
  Cc: Andrew Morton, linux-kernel, Linus Torvalds, Ingo Molnar,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Johannes Berg, Eduard - Gabriel Munteanu

On Mon, 23 Jun 2014 18:33:06 +0200
Petr Mládek <pmladek@suse.cz> wrote:


> >  * A write to the buffer will either succeed or fail. That is, unlike
> >  * sprintf() there will not be a partial write (well, it may write into
> >  * the buffer but it won't update the pointers). This allows users to
> >  * try to write something into the trace_seq buffer and if it fails
> >  * they can flush it and try again.
> 
> We might want to write as much as possible to the buffer if it is used
> for the backtrace.

Understood, but there is a reason for this way of doing things.

> 
> Well, if I understand it correctly, this is only about the last
> line. If the trace does not fit, we are in troubles anyway. We might
> solve this later when needed.
> 
> >  */
> > #include <linux/uaccess.h>
> > #include <linux/seq_file.h>
> > #include <linux/trace_seq.h>
> > 
> > /* How much buffer is left on the trace_seq? */
> > #define TRACE_SEQ_BUF_LEFT(s) ((PAGE_SIZE - 1) - (s)->len)
> > 
> > /* How much buffer is written? */
> > #define TRACE_SEQ_BUF_USED(s) min((s)->len, (unsigned int)(PAGE_SIZE - 1))
> 
> [...]
>  
> > /**
> >  * trace_seq_printf - sequence printing of trace information
> >  * @s: trace sequence descriptor
> >  * @fmt: printf format string
> >  *
> >  * The tracer may use either sequence operations or its own
> >  * copy to user routines. To simplify formating of a trace
> >  * trace_seq_printf() is used to store strings into a special
> >  * buffer (@s). Then the output may be either used by
> >  * the sequencer or pulled into another buffer.
> >  *
> >  * Returns 1 if we successfully wrote all the contents to
> >  *   the buffer.
> >  * Returns 0 if the length to write is bigger than the
> >  *   reserved buffer space. In this case, nothing gets written.
> >  */
> > int trace_seq_printf(struct trace_seq *s, const char *fmt, ...)
> > {
> > 	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
> > 	va_list ap;
> > 	int ret;
> > 
> > 	if (s->full || !len)
> > 		return 0;
> > 
> > 	va_start(ap, fmt);
> > 	ret = vsnprintf(s->buffer + s->len, len, fmt, ap);
> > 	va_end(ap);
> > 
> > 	/* If we can't write it all, don't bother writing anything */
> > 	if (ret >= len) {
> > 		s->full = 1;
> > 		return 0;
> > 	}
> > 
> > 	s->len += ret;
> > 
> > 	return 1;
> > }
> > EXPORT_SYMBOL_GPL(trace_seq_printf);
> > 
> > /**
> >  * trace_seq_bitmask - write a bitmask array in its ASCII representation
> >  * @s:		trace sequence descriptor
> >  * @maskp:	points to an array of unsigned longs that represent a bitmask
> >  * @nmaskbits:	The number of bits that are valid in @maskp
> >  *
> >  * Writes an ASCII representation of a bitmask string into @s.
> >  *
> >  * Returns 1 if we successfully wrote all the contents to
> >  *   the buffer.
> >  * Returns 0 if the length to write is bigger than the
> >  *   reserved buffer space. In this case, nothing gets written.
> >  */
> > int trace_seq_bitmask(struct trace_seq *s, const unsigned long *maskp,
> > 		      int nmaskbits)
> > {
> > 	unsigned int len = TRACE_SEQ_BUF_LEFT(s);
> > 	int ret;
> > 
> > 	if (s->full || !len)
> > 		return 0;
> > 
> > 	ret = bitmap_scnprintf(s->buffer, len, maskp, nmaskbits);
> 
> We do not detect here if the whole bitmap is written. Hmm, the perfect
> solution is not easy. What about using the following workaround?

Ah right. Nice catch.

> 
> If we wrote until the end of the buffer, the write was most likely
> incomplete:
> 
> 	if (ret == len) {
> 		s->full = 1;
> 		return 0;
> 	}
> 
> It is more consistent with what we do in the other functions.
> 
> > 	s->len += ret;
> >
> > 	return 1;
> > }

I'll have to look into this too. Yeah, the results should be more
consistent. Maybe have zero mean it fit, and return how much buffer is
left when it doesn't fit?

> > EXPORT_SYMBOL_GPL(trace_seq_bitmask);
> > 
> 
> [...]
> 
> > /**
> >  * trace_seq_putmem_hex - write raw memory into the buffer in ASCII hex
> >  * @s: trace sequence descriptor
> >  * @mem: The raw memory to write its hex ASCII representation of
> >  * @len: The length of the raw memory to copy (in bytes)
> >  *
> >  * This is similar to trace_seq_putmem() except instead of just copying the
> >  * raw memory into the buffer it writes its ASCII representation of it
> >  * in hex characters.
> >  *
> >  * Returns 1 if we successfully wrote all the contents to
> >  *   the buffer.
> >  * Returns 0 if the length to write is bigger than the
> >  *   reserved buffer space. In this case, nothing gets written.
> >  */
> > int trace_seq_putmem_hex(struct trace_seq *s, const void *mem,
> > 			 unsigned int len)
> > {
> > 	unsigned char hex[HEX_CHARS];
> > 	const unsigned char *data = mem;
> > 	int i, j;
> > 
> > 	if (s->full)
> > 		return 0;
> > 
> > #ifdef __BIG_ENDIAN
> > 	for (i = 0, j = 0; i < len; i++) {
> > #else
> > 	for (i = len-1, j = 0; i >= 0; i--) {
> > #endif
> 
> This might overflow when "len" > (HEX_CHARS-1)/2.

Yes I caught this while doing updates per Andrew's replies. Currently
it's fine because the only user of it is wrapped in the macro. It
originally had a BUG_ON() if len > HEX_CHARS but that was moved to the
macro to make it a build bug on. I hate this function, but we need to
keep it to keep backward compatibility for what was in the original -rt
latency tracer :-p

Anyway, last week I rewrote this function to be much more robust. You
can look at my branch ftrace/next (caution, it rebases constantly).



> 
> I think that we should either limit "len" or do it in two cycles.
> 
> Hmm, if we use two cycles, it might be a problem to handle the
> endianness. IMHO, it depends on what type of data is written into the
> buffer (single vs. set of values).
> 
> > 		hex[j++] = hex_asc_hi(data[i]);
> > 		hex[j++] = hex_asc_lo(data[i]);
> > 	}
> > 	hex[j++] = ' ';
> > 
> > 	return trace_seq_putmem(s, hex, j);
> > }
> > EXPORT_SYMBOL_GPL(trace_seq_putmem_hex);
> > 
> > /**
> >  * trace_seq_reserve - reserve space on the sequence buffer
> >  * @s: trace sequence descriptor
> >  * @len: The amount to reserve.
> >  *
> >  * If for some reason there is a need to save some space on the
> >  * buffer to fill in later, this function is used for that purpose.
> >  * The given length will be reserved and the pointer to that
> >  * location on the buffer is returned, unless there is not enough
> >  * buffer left to hold the given length then NULL is returned.
> >  */
> > void *trace_seq_reserve(struct trace_seq *s, unsigned int len)
> > {
> > 	void *ret;
> > 
> > 	if (s->full)
> > 		return NULL;
> > 
> > 	if (len > TRACE_SEQ_BUF_LEFT(s)) {
> > 		s->full = 1;
> > 		return NULL;
> > 	}
> > 
> > 	ret = s->buffer + s->len;
> > 	s->len += len;
> > 
> > 	return ret;
> > }
> > EXPORT_SYMBOL_GPL(trace_seq_reserve);
> 
> This sounds like a slightly dangerous operation. The reserve is valid only
> until the buffer is re-initialized. Is this really intended?

Well, the only one reserving it should be whoever is using it. It is
like allocating memory and then letting something else free it.

> Do we really want to export it?

Honestly, I don't know why this was added. It was created by Eduard and
pulled in by Ingo. There's no users in the kernel. I think I may just
nuke it.


> 
> Well, the whole buffer needs to be used carefully. IMHO, it can't be accessed
> in parallel without extra locks. So, this might be fine after all. We
> could only warn about this in the function description.
> 

Well, neither can seq_file. It's based on that. Take a look at some of
my changes in ftrace/next and see.

> [...]
> 
> The rest looks fine to me.
> 
> Also the proposed name "seq_buf" sounds fine to me.

Yeah, that may change soon too. Right now I'm working on cleaning up
the code within the trace directory, and then I'll push it to lib if
this is the way we are going.

-- Steve


> 
> 
> Best Regards,
> Petr


^ permalink raw reply	[flat|nested] 43+ messages in thread
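The overflow Petr points out in trace_seq_putmem_hex() can be closed by clamping "len" before the conversion. A user-space sketch of that fix (hypothetical helper name; only the little-endian branch of the original loop, with len clamped so a HEX_CHARS-sized stack buffer can never overflow):

```c
#include <assert.h>
#include <string.h>

#define MAX_MEMHEX_BYTES 8
#define HEX_CHARS (MAX_MEMHEX_BYTES * 2 + 1)

static const char hex_digits[] = "0123456789abcdef";

/* Convert raw memory to hex ASCII, clamping len so the output can
 * never exceed a HEX_CHARS-sized buffer (this is the missing bound
 * check). Bytes are emitted in little-endian order, as in the #else
 * branch of the original. Returns the number of characters written. */
static int mem_to_hex(char *out, size_t out_size, const void *mem, size_t len)
{
	const unsigned char *data = mem;
	size_t j = 0;
	int i;

	if (len > MAX_MEMHEX_BYTES)
		len = MAX_MEMHEX_BYTES;
	if (out_size < len * 2 + 1)
		return 0;

	for (i = (int)len - 1; i >= 0; i--) {
		out[j++] = hex_digits[data[i] >> 4];
		out[j++] = hex_digits[data[i] & 0x0f];
	}
	out[j] = '\0';
	return (int)j;
}
```

An alternative, as the thread notes, is to keep the build-time bound in the calling macro; the clamp just makes the function itself safe for any caller.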

* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-23 16:08         ` Steven Rostedt
@ 2014-06-23 17:38           ` Johannes Berg
  2014-06-23 18:04             ` Steven Rostedt
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Berg @ 2014-06-23 17:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, linux-kernel, Linus Torvalds, Ingo Molnar,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Mon, 2014-06-23 at 12:08 -0400, Steven Rostedt wrote:
> On Sun, 22 Jun 2014 09:38:05 +0200
> Johannes Berg <johannes@sipsolutions.net> wrote:
> 
> 
> > Looking at the code though, I'm not sure it's a pure optimisation - if
> > you do say putc() after a failed puts(), without this code the putc()
> > would succeed? I can't tell right now if that's really a problem, but it
> > seems you could get some odd behaviour out of it.
> 
> How would putc() still succeed? We're just talking about the "full"
> field. It would still do the length check:
> 
> 	if (s->len >= (PAGE_SIZE - 1))
> 		return 0;

Right, but the puts() could fail if not all of the string fits, and a
subsequent putc() might fit.

johannes


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-23 17:38           ` Johannes Berg
@ 2014-06-23 18:04             ` Steven Rostedt
  2014-06-24  8:19               ` Johannes Berg
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2014-06-23 18:04 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Andrew Morton, linux-kernel, Linus Torvalds, Ingo Molnar,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Mon, 23 Jun 2014 19:38:14 +0200
Johannes Berg <johannes@sipsolutions.net> wrote:

> On Mon, 2014-06-23 at 12:08 -0400, Steven Rostedt wrote:
> > On Sun, 22 Jun 2014 09:38:05 +0200
> > Johannes Berg <johannes@sipsolutions.net> wrote:
> > 
> > 
> > > Looking at the code though, I'm not sure it's a pure optimisation - if
> > > you do say putc() after a failed puts(), without this code the putc()
> > > would succeed? I can't tell right now if that's really a problem, but it
> > > seems you could get some odd behaviour out of it.
> > 
> > How would putc() still succeed? We're just talking about the "full"
> > field. It would still do the length check:
> > 
> > 	if (s->len >= (PAGE_SIZE - 1))
> > 		return 0;
> 
> Right, but the puts() could fail if not all of the string fits, and a
> subsequent putc() might fit.
> 

Ah! I see what you are saying. You are saying that once you are full
you need to flush. Honestly, the trace_puts() should check the return.
Perhaps it wants to try again with a shorter string?

The full flag means that once we hit full, we are done, no more can be
written till the buffer is flushed.

Is that what we would want from this utility? Or do we want to allow
the user to try again with a smaller string?

I was thinking of reversing the error code and have the trace_seq_*()
return 0 as success and the remaining buffer size on error.

-- Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread
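The all-or-nothing semantics Steve describes can be sketched in user-space C (hypothetical names, a deliberately tiny buffer): once a write overflows, the full flag latches, and every later write is refused, even one that would fit, until the buffer is re-initialized.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define SKETCH_BUF_SIZE 16	/* tiny, so overflow is easy to hit */

struct seq_sketch {
	char buffer[SKETCH_BUF_SIZE];
	unsigned int len;
	int full;
};

/* All-or-nothing write: on overflow, set ->full and leave ->len
 * unchanged (vsnprintf may have scribbled into the buffer, but the
 * pointers are not advanced, matching the trace_seq behavior) */
static int seq_sketch_printf(struct seq_sketch *s, const char *fmt, ...)
{
	unsigned int left = (SKETCH_BUF_SIZE - 1) - s->len;
	va_list ap;
	int ret;

	if (s->full || !left)
		return 0;

	va_start(ap, fmt);
	ret = vsnprintf(s->buffer + s->len, left, fmt, ap);
	va_end(ap);

	if (ret >= (int)left) {
		s->full = 1;	/* latch: later writes fail until reset */
		return 0;
	}
	s->len += ret;
	return 1;
}
```

This is the behavior Johannes is probing: after one failed puts()-style write, even a single character is refused, because the latched flag, not the length check, decides.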

* Re: [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/
  2014-06-23 18:04             ` Steven Rostedt
@ 2014-06-24  8:19               ` Johannes Berg
  0 siblings, 0 replies; 43+ messages in thread
From: Johannes Berg @ 2014-06-24  8:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, linux-kernel, Linus Torvalds, Ingo Molnar,
	Jiri Kosina, Michal Hocko, Jan Kara, Frederic Weisbecker,
	Dave Anderson, Petr Mladek

On Mon, 2014-06-23 at 14:04 -0400, Steven Rostedt wrote:

> Ah! I see what you are saying. You are saying that once you are full
> you need to flush. Honestly, the trace_puts() should check the return.
> Perhaps it wants to try again with a shorter string?

Perhaps, but perhaps not? I don't remember exactly the issue that I saw
without the full flag, and it may very well be a purely theoretical
thing too. I don't think I mind either way.

> The full flag means that once we hit full, we are done, no more can be
> written till the buffer is flushed.

Right. Arguably, it was already full or overflowed, but as it now stands
we make sure the data is consistent with the input (or rejected if the
input didn't fit).

> Is that what we would want from this utility? Or do we want to allow
> the user to try again with a smaller string?

That's a good question - right now of course the first attempt would
make it full, and then any further attempt would fail.

> I was thinking of reversing the error code and have the trace_seq_*()
> return 0 as success and the remaining buffer size on error.

That would make some sense if we wanted to make it post-determined (you
fail, and then you can use the remaining buffer). Maybe it would make
sense to pre-determine that, e.g. by having a function that can reserve
a range, say

 void trace_seq_put_buf(..., int min_len, int max_len,
                        int *out_avail, void *out_data_ptr)

(of course that prototype needs to be adjusted appropriately)

If less than min_len is available, we could still have the "full"
semantics.

Either way, I don't really know or even care. I'm only using this
indirectly through the tracing, where I don't think it makes a lot of
sense without the full semantics since afaik the buffers need to be in
the right order in the page(s), and where you likely don't want to cut a
string short since you couldn't continue it on the next page, but you're
clearly the expert on this and I'm probably wrong anyway :)

johannes


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-20 14:35               ` Petr Mládek
@ 2014-06-24 13:32                 ` Konstantin Khlebnikov
  2014-06-25 10:01                   ` Jiri Kosina
  2014-06-25 12:21                   ` Petr Mládek
  0 siblings, 2 replies; 43+ messages in thread
From: Konstantin Khlebnikov @ 2014-06-24 13:32 UTC (permalink / raw)
  To: Petr Mládek
  Cc: Jiri Kosina, Steven Rostedt, Linux Kernel Mailing List,
	Linus Torvalds, Ingo Molnar, Andrew Morton, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Dave Anderson

On Fri, Jun 20, 2014 at 6:35 PM, Petr Mládek <pmladek@suse.cz> wrote:
> On Fri 2014-06-20 01:38:59, Jiri Kosina wrote:
>> On Thu, 19 Jun 2014, Steven Rostedt wrote:
>>
>> > > I don't think there is a need for a global stop_machine()-like
>> > > synchronization here. The printing CPU will be sending IPI to the CPU N+1
>> > > only after it has finished printing CPU N stacktrace.
>> >
>> > So you plan on sending an IPI to a CPU then wait for it to acknowledge
>> > that it is spinning, and then print out the data and then tell the CPU
>> > it can stop spinning?
>>
>> Yes, that was exactly my idea. You have to be synchronized with the CPU
>> receiving the NMI anyway in case you'd like to get its pt_regs and dump
>> those as part of the dump.
>
> This approach did not work after all. There was still the same
> race. If we stop a CPU in the middle of printk(), it does not help
> to move the printing task to another CPU ;-) We would need to
> make a copy of regs and all the stacks to unblock the CPU.
>
> Hmm, in general, if we want a consistent snapshot, we need to temporarily
> store the information in NMI context and put it into the main ring buffer
> in normal context. We either need to copy the stacks or copy the printed text.
>
>
> I am starting to like Steven's solution with the trace_seq buffer. I see
> the following advantages:
>
>     + the snapshot is pretty good;
>         + we still send NMI to all CPUs at the "same" time
>
>     + only minimal time is spent in NMI context;
>         + CPUs are not blocked by each other to get sequential output
>
>     + minimum of new code
>         + trace_seq buffer is already implemented and used
>         + it might be even better after getting attention from new users
>
>
> Of course, it has also some disadvantages:
>
>    + needs quite big per-CPU buffer;
>         + but we would need some extra space to copy the data anyway
>
>    + trace might be shrunken;
>         + but 1 page should be enough in most cases;
>         + we could make it configurable
>
>    + delay until the message appears in the ringbuffer and console
>         + better than freezing
>         + still saved in core file
>         + crash tool could get improved to find the traces
>
>
> Note that the above solution solves only printing of the stack.
> There are still other locations when printk is called in NMI
> context. IMHO, some of them are helpful:
>
>     ./arch/x86/kernel/nmi.c:                        WARN(in_nmi(),
>     ./arch/x86/mm/kmemcheck/kmemcheck.c:    WARN_ON_ONCE(in_nmi());
>     ./arch/x86/mm/fault.c:  WARN_ON_ONCE(in_nmi());
>     ./arch/x86/mm/fault.c:  WARN_ON_ONCE(in_nmi());
>
>     ./mm/vmalloc.c: BUG_ON(in_nmi());
>     ./lib/genalloc.c:       BUG_ON(in_nmi());
>     ./lib/genalloc.c:       BUG_ON(in_nmi());
>     ./include/linux/hardirq.h:              BUG_ON(in_nmi());
>
> And some are probably less important:
>
>     ./arch/x86/platform/uv/uv_nmi.c   several locations here
>     ./arch/m68k/mac/macints.c-              printk("... pausing, press NMI to resume ...");
>
>
> Well, there are only a few. Maybe we could share the trace_seq buffer
> here.
>
> Of course, there is still the possibility to implement a lockless
> buffer. But it will be much more complicated than the current one.
> I am not sure that we really want it.

Let me join the discussion.
In the past I've implemented a similar feature for x86 in the OpenVZ
kernel, and now I'm thinking about it for arm.

Originally I thought that seizing all cpus one by one and printing from
the initiator was the best approach, and I had started preparing
arguments against an over-engineered printk... But it seems deferring
printk output isn't such a bad idea; it is much easier and
arch-independent. Saving context is very arch-specific. Printing is a
problem too: show_regs() is a horrible mess; on most arches it always
prints something from the current context. I'm thinking about cleaning
up this mess, but doing it for all arches will certainly take some time.

Instead of per-cpu buffers, printk might use part of the existing ring
buffer -- the initiator cpu allocates space for the target cpu and
flushes it into the common stream after it finishes printing.
This kind of transactional model could probably also be used on a
single cpu for multi-line KERN_CONT.
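Roughly, the reserve/commit model I have in mind looks like this userspace toy (not the real printk ring buffer; all names are invented):

```c
#include <stdio.h>
#include <string.h>

/* Toy transactional log: a writer reserves a slot, fills it, commits it.
 * Readers only consume the contiguous committed prefix, so a slot that
 * commits late holds back the reader, not other writers. */
#define NSLOTS 8
#define SLOT_LEN 64

static struct slot {
	int  committed;
	char text[SLOT_LEN];
} ring[NSLOTS];

static int head;	/* next slot to reserve */
static int tail;	/* next slot to flush   */

static int reserve(void)
{
	if (head - tail >= NSLOTS)
		return -1;		/* ring is full */
	return head++ % NSLOTS;
}

static void commit(int slot, const char *msg)
{
	snprintf(ring[slot].text, SLOT_LEN, "%s", msg);
	ring[slot].committed = 1;
}

/* Copy the committed prefix into `out`; stop at the first hole. */
static int flush(char *out, size_t outsz)
{
	int n = 0;

	out[0] = '\0';
	while (tail < head && ring[tail % NSLOTS].committed) {
		strncat(out, ring[tail % NSLOTS].text,
			outsz - strlen(out) - 1);
		ring[tail % NSLOTS].committed = 0;
		tail++;
		n++;
	}
	return n;
}
```

The reader stalls at an uncommitted slot instead of reordering output, which is exactly the property KERN_CONT continuation would want.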

>
>
> Best Regards,
> Petr


* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-24 13:32                 ` Konstantin Khlebnikov
@ 2014-06-25 10:01                   ` Jiri Kosina
  2014-06-25 11:04                     ` Konstantin Khlebnikov
  2014-06-25 12:21                   ` Petr Mládek
  1 sibling, 1 reply; 43+ messages in thread
From: Jiri Kosina @ 2014-06-25 10:01 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Petr Mládek, Steven Rostedt, Linux Kernel Mailing List,
	Linus Torvalds, Ingo Molnar, Andrew Morton, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Dave Anderson

On Tue, 24 Jun 2014, Konstantin Khlebnikov wrote:

> Originally I thought that seizing all cpus one by one and printing from
> the initiator was the best approach, and I had started preparing
> arguments against an over-engineered printk...

By "seizing" I guess you mean sending IPI, right?

What do you do if you interrupt it in the middle of printk(), so that
the initiator CPU does not deadlock trying to take the buffer lock?

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-25 10:01                   ` Jiri Kosina
@ 2014-06-25 11:04                     ` Konstantin Khlebnikov
  2014-06-25 11:57                       ` Petr Mládek
  0 siblings, 1 reply; 43+ messages in thread
From: Konstantin Khlebnikov @ 2014-06-25 11:04 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Petr Mládek, Steven Rostedt, Linux Kernel Mailing List,
	Linus Torvalds, Ingo Molnar, Andrew Morton, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Dave Anderson

On Wed, Jun 25, 2014 at 2:01 PM, Jiri Kosina <jkosina@suse.cz> wrote:
> On Tue, 24 Jun 2014, Konstantin Khlebnikov wrote:
>
>> Originally I thought that seizing all cpus one by one and printing from
>> the initiator was the best approach, and I had started preparing
>> arguments against an over-engineered printk...
>
> By "seizing" I guess you mean sending IPI, right?
>
> What do you do if you interrupt it in the middle of printk(), so that
> the initiator CPU does not deadlock trying to take the buffer lock?

Yep, by "seize" I mean an NMI IPI that gathers the register state from
the cpu. It can also keep the cpu in the NMI handler while the initiator
cpu unwinds its stack trace. The dump-all-stacks code in arch/sparc does
something like that.

You're right, moving printk from one cpu to another doesn't fix the
recursion, but we could send the NMI IPI under logbuf_lock. I don't
think that this is a good approach, but it seems realistic. Of course,
all printing from other NMI handlers would have to be done in some
weird way.


I like the context-dependent printk routing from this patchset.
I'm just not quite happy with this straightforward implementation.
Rerouting the flow into different buffers might serve different purposes;
it should make the code simpler instead of adding yet another knot.
For example, printk continuation might use it too.

>
> --
> Jiri Kosina
> SUSE Labs


* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-25 11:04                     ` Konstantin Khlebnikov
@ 2014-06-25 11:57                       ` Petr Mládek
  0 siblings, 0 replies; 43+ messages in thread
From: Petr Mládek @ 2014-06-25 11:57 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Jiri Kosina, Steven Rostedt, Linux Kernel Mailing List,
	Linus Torvalds, Ingo Molnar, Andrew Morton, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Dave Anderson

On Wed 2014-06-25 15:04:41, Konstantin Khlebnikov wrote:
> On Wed, Jun 25, 2014 at 2:01 PM, Jiri Kosina <jkosina@suse.cz> wrote:
> > On Tue, 24 Jun 2014, Konstantin Khlebnikov wrote:
> 
> I like the context-dependent printk routing from this patchset.
> I'm just not quite happy with this straightforward implementation.
> Rerouting the flow into different buffers might serve different purposes;
> it should make the code simpler instead of adding yet another knot.
> For example, printk continuation might use it too.

Note that we could not easily share the buffer between normal and NMI
context. We would end up with the same locking problems as we have now.

In fact, any sharing is complicated when NMI is involved, especially
when we need to access the data from both NMI and normal context.
And we need to copy the data into the main ring buffer in a normal context.

This patchset works only because arch_trigger_all_cpu_backtrace() is
called from normal context. Also all other calls are ignored until the
previous call is finished, see "backtrace_flag".


Best Regards,
Petr


* Re: [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely
  2014-06-24 13:32                 ` Konstantin Khlebnikov
  2014-06-25 10:01                   ` Jiri Kosina
@ 2014-06-25 12:21                   ` Petr Mládek
  1 sibling, 0 replies; 43+ messages in thread
From: Petr Mládek @ 2014-06-25 12:21 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Jiri Kosina, Steven Rostedt, Linux Kernel Mailing List,
	Linus Torvalds, Ingo Molnar, Andrew Morton, Michal Hocko,
	Jan Kara, Frederic Weisbecker, Dave Anderson

On Tue 2014-06-24 17:32:15, Konstantin Khlebnikov wrote:
> 
> Instead of per-cpu buffers printk might use part of existing ring
> buffer  -- initiator cpu allocates space for
> target cpu and flushes it into common stream after it finish printing.
> Probably this kind of transactional model might be used on single cpu
> for multi-line KERN_CONT.

I wanted to think more about this.

The evident advantage is that we do not need extra buffers. Also it might
be used instead of the cont buffer.

One challenge is that we do not know how much space is needed. If
we do not reserve enough, we might not be able to reserve more later.
This would happen when the main ring buffer is locked by some
blocked process. If we reserve too much, we waste space, and therefore
information, because old messages must be removed to make room.

Also we could not rotate the ring buffer over the reserved space. Then
we might miss the last messages from the normal context when the
system goes down.

Another question is if it could simplify the printk() code. The reservation
logic would be probably more complicated than the current logic around
the cont buffer.

Best Regards,
Petr


* Re: [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-11-19 10:41   ` Borislav Petkov
  2014-11-19 10:44     ` Jiri Kosina
@ 2014-11-19 13:02     ` Petr Mladek
  1 sibling, 0 replies; 43+ messages in thread
From: Petr Mladek @ 2014-11-19 13:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Steven Rostedt, linux-kernel, Linus Torvalds, Ingo Molnar,
	Andrew Morton, Jiri Kosina, Paul E. McKenney, Andy Lutomirski,
	Tony Luck

On Wed 2014-11-19 11:41:14, Borislav Petkov wrote:
> On Tue, Nov 18, 2014 at 11:39:20PM -0500, Steven Rostedt wrote:
> >  static int
> >  arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
> >  {
> > @@ -78,12 +157,14 @@ arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
> >  	cpu = smp_processor_id();
> >  
> >  	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> > -		static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
> > +		printk_func_t printk_func_save = this_cpu_read(printk_func);
> >  
> > -		arch_spin_lock(&lock);
> > +		/* Replace printk to write into the NMI seq */
> > +		this_cpu_write(printk_func, nmi_vprintk);
> >  		printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
> >  		show_regs(regs);
> > -		arch_spin_unlock(&lock);
> > +		this_cpu_write(printk_func, printk_func_save);
> 
> I'm wondering if this could be used in a generic manner throughout code
> where we could say "ok, I'm in an NMI context, so lemme switch printk's
> and do some printing" so that NMI and NMI-like atomic contexts could use
> printk. Lemme do an mce example:
> 
> do_machine_check(..)
> {
> 	printk_func_t printk_func_save = this_cpu_read(printk_func);
> 
> 	...
> 
> 	/* in #MC handler, switch printks */
> 	this_cpu_write(printk_func, nmi_vprintk);
> 
> 	printk("This is a hw error, details: ...\n");
> 
> 	/* more bla */
> 
> 	this_cpu_write(printk_func, printk_func_save);
> }
> 
> or should we change that in entry.S, before we call the handler?
> 
> Because, if we could do something like that, then we finally get to use
> printk in an NMI context which would be a good thing.

Unfortunately, the problem is more complicated. The alternative
printk_func() just stores the string into the seq_buf. Then someone
needs to move the data to the main ring buffer; see print_seq_line()
and the related loop in this patch.
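In userspace terms, the flush side is roughly this (a simplified model of print_seq_line() and the surrounding loop, not the kernel code itself):

```c
#include <stdio.h>

/* Simplified model of print_seq_line() and the loop around it: walk the
 * captured buffer and emit it line by line, so every chunk goes through
 * one ordinary output call from safe context.  Returns the number of
 * lines emitted, counting a trailing partial line. */
static int flush_lines(const char *buf, int len, FILE *out)
{
	int i, last_i = 0, lines = 0;

	for (i = 0; i < len; i++) {
		if (buf[i] == '\n') {
			fprintf(out, "%.*s\n", i - last_i, buf + last_i);
			last_i = i + 1;
			lines++;
		}
	}
	if (last_i < len) {	/* partial last line, terminate it */
		fprintf(out, "%.*s\n", len - last_i, buf + last_i);
		lines++;
	}
	return lines;
}
```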

The copying needs to be done in the normal context because it will
need to take the lock for the main ring buffer. Then it will need to
somehow protect the seq_buf against other accesses from NMI context.

In other words, any serious generic solution would end up implementing
an alternative ring buffer with lockless operations. It would be
something similar to my older patch set, see
https://lkml.org/lkml/2014/5/9/118. It was not
accepted, and it triggered this solution for NMI backtraces.

Note that this patch works because arch_trigger_all_cpu_backtrace() is
called from normal context and any further calls are ignored until the
current call is finished. So, there is a safe place to copy the data.
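The gating is the usual test-and-set pattern; modeled here with C11 atomics instead of the kernel's test_and_set_bit()/clear_bit():

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of backtrace_flag: only one dump in flight, later triggers are
 * dropped rather than queued.  That is what makes the normal-context
 * flush safe -- nothing can rewrite the per-CPU buffers while they are
 * being copied out. */
static atomic_flag dump_in_progress = ATOMIC_FLAG_INIT;

static bool trigger_dump(void)
{
	if (atomic_flag_test_and_set(&dump_in_progress))
		return false;	/* a dump is already running, ignore */
	/* ... send NMIs, wait for them, flush the per-CPU buffers ... */
	return true;
}

static void dump_done(void)
{
	atomic_flag_clear(&dump_in_progress);
}
```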


I have another idea for the generic solution. Linus was against using
printk() in NMI context, see https://lkml.org/lkml/2014/6/10/388.
The backtraces are the only serious usage. Other than
that, there are only a few warnings. My proposal is to handle the
warnings in a similar way to the recursive printk() warning: the
recursive printk() just sets a flag, the message is dropped, and a
warning about the lost message is printed from the next printk() call.
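A minimal userspace model of that proposal (all names are invented; the kernel side would use the real printk internals):

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stdio.h>

/* Counter bumped from "NMI" context; no locks, no buffers. */
static atomic_int nmi_dropped;

/* In NMI context the message itself is dropped, only the fact is kept. */
static void printk_from_nmi(const char *msg)
{
	(void)msg;
	atomic_fetch_add(&nmi_dropped, 1);
}

/* The next normal-context printk reports the loss, like the recursive
 * printk warning does.  Output goes to `out` for easy testing. */
static int printk_normal(char *out, size_t outsz, const char *fmt, ...)
{
	va_list ap;
	int n = 0;
	int dropped = atomic_exchange(&nmi_dropped, 0);

	if (dropped)
		n = snprintf(out, outsz,
			     "** %d printk messages dropped in NMI **\n",
			     dropped);
	va_start(ap, fmt);
	n += vsnprintf(out + n, outsz - n, fmt, ap);
	va_end(ap);
	return n;
}
```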

Best Regards,
Petr


* Re: [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-11-19 10:44     ` Jiri Kosina
@ 2014-11-19 10:53       ` Borislav Petkov
  0 siblings, 0 replies; 43+ messages in thread
From: Borislav Petkov @ 2014-11-19 10:53 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Steven Rostedt, linux-kernel, Linus Torvalds, Ingo Molnar,
	Andrew Morton, Petr Mladek, Paul E. McKenney, Andy Lutomirski,
	Tony Luck

On Wed, Nov 19, 2014 at 11:44:56AM +0100, Jiri Kosina wrote:
> If we are going down this path, do_nmi() should be early enough to do
> it, no need to pollute NMI assembly code with this.

do_nmi() doesn't cover everything -- do_machine_check(),
do_double_fault(), and the rest of the handlers, for example. I'm
suggesting we do it before any handler gets run so that you can use
printk in the handler itself.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-11-19 10:41   ` Borislav Petkov
@ 2014-11-19 10:44     ` Jiri Kosina
  2014-11-19 10:53       ` Borislav Petkov
  2014-11-19 13:02     ` Petr Mladek
  1 sibling, 1 reply; 43+ messages in thread
From: Jiri Kosina @ 2014-11-19 10:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Steven Rostedt, linux-kernel, Linus Torvalds, Ingo Molnar,
	Andrew Morton, Petr Mladek, Paul E. McKenney, Andy Lutomirski,
	Tony Luck

On Wed, 19 Nov 2014, Borislav Petkov wrote:

> I'm wondering if this could be used in a generic manner throughout code 
> where we could say "ok, I'm in an NMI context, so lemme switch printk's 
> and do some printing" so that NMI and NMI-like atomic contexts could use 
> printk. Lemme do an mce example:
> 
> do_machine_check(..)
> {
> 	printk_func_t printk_func_save = this_cpu_read(printk_func);
> 
> 	...
> 
> 	/* in #MC handler, switch printks */
> 	this_cpu_write(printk_func, nmi_vprintk);
> 
> 	printk("This is a hw error, details: ...\n");
> 
> 	/* more bla */
> 
> 	this_cpu_write(printk_func, printk_func_save);
> }
> 
> or should we change that in entry.S, before we call the handler?

If we are going down this path, do_nmi() should be early enough to do it, 
no need to pollute NMI assembly code with this.

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-11-19  4:39 ` [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs Steven Rostedt
@ 2014-11-19 10:41   ` Borislav Petkov
  2014-11-19 10:44     ` Jiri Kosina
  2014-11-19 13:02     ` Petr Mladek
  0 siblings, 2 replies; 43+ messages in thread
From: Borislav Petkov @ 2014-11-19 10:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Linus Torvalds, Ingo Molnar, Andrew Morton,
	Jiri Kosina, Petr Mladek, Paul E. McKenney, Andy Lutomirski,
	Tony Luck

On Tue, Nov 18, 2014 at 11:39:20PM -0500, Steven Rostedt wrote:
>  static int
>  arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
>  {
> @@ -78,12 +157,14 @@ arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
>  	cpu = smp_processor_id();
>  
>  	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> -		static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
> +		printk_func_t printk_func_save = this_cpu_read(printk_func);
>  
> -		arch_spin_lock(&lock);
> +		/* Replace printk to write into the NMI seq */
> +		this_cpu_write(printk_func, nmi_vprintk);
>  		printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
>  		show_regs(regs);
> -		arch_spin_unlock(&lock);
> +		this_cpu_write(printk_func, printk_func_save);

I'm wondering if this could be used in a generic manner throughout code
where we could say "ok, I'm in an NMI context, so lemme switch printk's
and do some printing" so that NMI and NMI-like atomic contexts could use
printk. Lemme do an mce example:

do_machine_check(..)
{
	printk_func_t printk_func_save = this_cpu_read(printk_func);

	...

	/* in #MC handler, switch printks */
	this_cpu_write(printk_func, nmi_vprintk);

	printk("This is a hw error, details: ...\n");

	/* more bla */

	this_cpu_write(printk_func, printk_func_save);
}

or should we change that in entry.S, before we call the handler?

Because, if we could do something like that, then we finally get to use
printk in an NMI context which would be a good thing.

:-)
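To make the idea concrete, here is a single-threaded userspace model of the save/replace/restore dance (the per-CPU variable becomes a plain global; everything here is illustrative, not the kernel API):

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

typedef int (*printk_func_t)(const char *fmt, va_list args);

static char nmi_buf[256];	/* stand-in for the per-CPU seq_buf */
static size_t nmi_len;

static int default_vprintk(const char *fmt, va_list args)
{
	return vprintf(fmt, args);	/* "console" output */
}

static int nmi_vprintk(const char *fmt, va_list args)
{
	int n = vsnprintf(nmi_buf + nmi_len, sizeof(nmi_buf) - nmi_len,
			  fmt, args);

	nmi_len += (size_t)n;
	return n;
}

/* Stand-in for the per-CPU printk_func variable. */
static printk_func_t printk_func = default_vprintk;

static int printk(const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = printk_func(fmt, ap);
	va_end(ap);
	return n;
}

static void fake_mce_handler(void)
{
	printk_func_t saved = printk_func;

	printk_func = nmi_vprintk;	/* divert into the buffer */
	printk("hw error, bank %d\n", 3);
	printk_func = saved;		/* restore the normal path */
}
```

The handler never touches the console path; everything it prints lands in the buffer until the pointer is restored.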

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs
  2014-11-19  4:39 [RFC][PATCH 0/3] printk/seq-buf/NMI: Revisit of safe NMI printing with seq_buf code Steven Rostedt
@ 2014-11-19  4:39 ` Steven Rostedt
  2014-11-19 10:41   ` Borislav Petkov
  0 siblings, 1 reply; 43+ messages in thread
From: Steven Rostedt @ 2014-11-19  4:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Jiri Kosina,
	Petr Mladek, Paul E. McKenney

[-- Attachment #1: 0003-x86-nmi-Perform-a-safe-NMI-stack-trace-on-all-CPUs.patch --]
[-- Type: text/plain, Size: 5451 bytes --]

From: "Steven Rostedt (Red Hat)" <rostedt@goodmis.org>

When trigger_all_cpu_backtrace() is called on x86, it will trigger an
NMI on each CPU and call show_regs(). But this can lead to a hard
lockup if the NMI comes in on another printk().

In order to avoid this, when the NMI triggers, it switches the printk
routine for that CPU to an NMI-safe printk function that records the
printk in a per_cpu seq_buf descriptor. After all NMIs have finished
recording their data, the seq_bufs are printed in a safe context.

Link: http://lkml.kernel.org/p/20140619213952.360076309@goodmis.org
Link: http://lkml.kernel.org/r/20141115050605.055232587@goodmis.org

Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/apic/hw_nmi.c | 91 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 86 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
index 6a1e71bde323..c95c3e9ce196 100644
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -18,6 +18,7 @@
 #include <linux/nmi.h>
 #include <linux/module.h>
 #include <linux/delay.h>
+#include <linux/seq_buf.h>
 
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 u64 hw_nmi_get_sample_period(int watchdog_thresh)
@@ -29,14 +30,35 @@ u64 hw_nmi_get_sample_period(int watchdog_thresh)
 #ifdef arch_trigger_all_cpu_backtrace
 /* For reliability, we're prepared to waste bits here. */
 static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
+static cpumask_var_t printtrace_mask;
+
+#define NMI_BUF_SIZE		4096
+
+struct nmi_seq_buf {
+	unsigned char		buffer[NMI_BUF_SIZE];
+	struct seq_buf		seq;
+};
+
+/* Safe printing in NMI context */
+static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq);
 
 /* "in progress" flag of arch_trigger_all_cpu_backtrace */
 static unsigned long backtrace_flag;
 
+static void print_seq_line(struct nmi_seq_buf *s, int start, int end)
+{
+	const char *buf = s->buffer + start;
+
+	printk("%.*s", (end - start) + 1, buf);
+}
+
 void arch_trigger_all_cpu_backtrace(bool include_self)
 {
+	struct nmi_seq_buf *s;
+	int len;
+	int cpu;
 	int i;
-	int cpu = get_cpu();
+	int this_cpu = get_cpu();
 
 	if (test_and_set_bit(0, &backtrace_flag)) {
 		/*
@@ -49,7 +71,17 @@ void arch_trigger_all_cpu_backtrace(bool include_self)
 
 	cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
 	if (!include_self)
-		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
+		cpumask_clear_cpu(this_cpu, to_cpumask(backtrace_mask));
+
+	cpumask_copy(printtrace_mask, to_cpumask(backtrace_mask));
+	/*
+	 * Set up per_cpu seq_buf buffers that the NMIs running on the other
+	 * CPUs will write to.
+	 */
+	for_each_cpu(cpu, to_cpumask(backtrace_mask)) {
+		s = &per_cpu(nmi_print_seq, cpu);
+		seq_buf_init(&s->seq, s->buffer, NMI_BUF_SIZE);
+	}
 
 	if (!cpumask_empty(to_cpumask(backtrace_mask))) {
 		pr_info("sending NMI to %s CPUs:\n",
@@ -65,11 +97,58 @@ void arch_trigger_all_cpu_backtrace(bool include_self)
 		touch_softlockup_watchdog();
 	}
 
+	/*
+	 * Now that all the NMIs have triggered, we can dump out their
+	 * back traces safely to the console.
+	 */
+	for_each_cpu(cpu, printtrace_mask) {
+		int last_i = 0;
+
+		s = &per_cpu(nmi_print_seq, cpu);
+		len = seq_buf_used(&s->seq);
+		if (!len)
+			continue;
+
+		/* Print line by line. */
+		for (i = 0; i < len; i++) {
+			if (s->buffer[i] == '\n') {
+				print_seq_line(s, last_i, i);
+				last_i = i + 1;
+			}
+		}
+		/* Check if there was a partial line. */
+		if (last_i < len) {
+			print_seq_line(s, last_i, len - 1);
+			pr_cont("\n");
+		}
+	}
+
 	clear_bit(0, &backtrace_flag);
 	smp_mb__after_atomic();
 	put_cpu();
 }
 
+/*
+ * It is not safe to call printk() directly from NMI handlers.
+ * It may be fine if the NMI detected a lockup and we have no choice
+ * but to do so, but doing an NMI on all other CPUs to get a back trace
+ * can be done with sysrq-l. We don't want that to lock up, which
+ * can happen if the NMI interrupts a printk in progress.
+ *
+ * Instead, we redirect the vprintk() to this nmi_vprintk() that writes
+ * the content into a per cpu seq_buf buffer. Then when the NMIs are
+ * all done, we can safely dump the contents of the seq_buf to a printk()
+ * from a non NMI context.
+ */
+static int nmi_vprintk(const char *fmt, va_list args)
+{
+	struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
+	unsigned int len = seq_buf_used(&s->seq);
+
+	seq_buf_vprintf(&s->seq, fmt, args);
+	return seq_buf_used(&s->seq) - len;
+}
+
 static int
 arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
 {
@@ -78,12 +157,14 @@ arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
 	cpu = smp_processor_id();
 
 	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
-		static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
+		printk_func_t printk_func_save = this_cpu_read(printk_func);
 
-		arch_spin_lock(&lock);
+		/* Replace printk to write into the NMI seq */
+		this_cpu_write(printk_func, nmi_vprintk);
 		printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
 		show_regs(regs);
-		arch_spin_unlock(&lock);
+		this_cpu_write(printk_func, printk_func_save);
+
 		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
 		return NMI_HANDLED;
 	}
-- 
2.1.1




end of thread, other threads:[~2014-11-19 13:03 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-19 21:33 [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely Steven Rostedt
2014-06-19 21:33 ` [RFC][PATCH 1/3] trace_seq: Move the trace_seq code to lib/ Steven Rostedt
2014-06-20  4:45   ` Linus Torvalds
2014-06-20 16:21     ` Steven Rostedt
2014-06-20  5:06   ` Andrew Morton
2014-06-20 16:58     ` Steven Rostedt
2014-06-20 17:12       ` Andrew Morton
2014-06-20 17:17         ` Steven Rostedt
2014-06-20 20:28         ` Steven Rostedt
2014-06-20 20:51           ` Steven Rostedt
2014-06-23 16:33             ` Petr Mládek
2014-06-23 17:03               ` Steven Rostedt
2014-06-22  7:38       ` Johannes Berg
2014-06-23 16:08         ` Steven Rostedt
2014-06-23 17:38           ` Johannes Berg
2014-06-23 18:04             ` Steven Rostedt
2014-06-24  8:19               ` Johannes Berg
2014-06-19 21:33 ` [RFC][PATCH 2/3] printk: Add per_cpu printk func to allow printk to be diverted Steven Rostedt
2014-06-23 16:06   ` Paul E. McKenney
2014-06-19 21:33 ` [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs Steven Rostedt
2014-06-20 13:58   ` Don Zickus
2014-06-20 14:21     ` Steven Rostedt
2014-06-20 14:55   ` Petr Mládek
2014-06-20 15:17     ` Steven Rostedt
2014-06-23 16:12   ` Paul E. McKenney
2014-06-19 21:56 ` [RFC][PATCH 0/3] x86/nmi: Print all cpu stacks from NMI safely Jiri Kosina
2014-06-19 22:58   ` Steven Rostedt
2014-06-19 23:03     ` Jiri Kosina
2014-06-19 23:19       ` Steven Rostedt
2014-06-19 23:27         ` Jiri Kosina
2014-06-19 23:36           ` Steven Rostedt
2014-06-19 23:38             ` Jiri Kosina
2014-06-20 14:35               ` Petr Mládek
2014-06-24 13:32                 ` Konstantin Khlebnikov
2014-06-25 10:01                   ` Jiri Kosina
2014-06-25 11:04                     ` Konstantin Khlebnikov
2014-06-25 11:57                       ` Petr Mládek
2014-06-25 12:21                   ` Petr Mládek
2014-11-19  4:39 [RFC][PATCH 0/3] printk/seq-buf/NMI: Revisit of safe NMI printing with seq_buf code Steven Rostedt
2014-11-19  4:39 ` [RFC][PATCH 3/3] x86/nmi: Perform a safe NMI stack trace on all CPUs Steven Rostedt
2014-11-19 10:41   ` Borislav Petkov
2014-11-19 10:44     ` Jiri Kosina
2014-11-19 10:53       ` Borislav Petkov
2014-11-19 13:02     ` Petr Mladek
