linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools
@ 2021-01-15 13:03 Alexander Potapenko
  2021-01-15 13:03 ` [PATCH v2 1/5] tracing: add error_report trace points Alexander Potapenko
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-15 13:03 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: andreyknvl, dvyukov, mingo, elver, pmladek, rostedt,
	sergey.senozhatsky, glider, linux-mm

This patchset adds a library that captures error reports from debugging
tools like KASAN or KFENCE and exposes those reports to userspace via
sysfs. Report capturing is controlled by two new types of tracepoints:
error_report_start and error_report_end, that must be added to the tools
that want to use this new feature.

v2:
 - added ABI documentation for /sys/kernel/error_report/
 - changed error_report_start and error_report end tracepoints to take
   a fixed set of values for the error detector

Alexander Potapenko (5):
  tracing: add error_report trace points
  lib: add error_report_notify to collect debugging tools' reports
  docs: ABI: add /sys/kernel/error_report/ documentation
  kfence: use error_report_start and error_report_end tracepoints
  kasan: use error_report_start and error_report_end tracepoints

 .../ABI/testing/sysfs-kernel-error_report     |  41 +++
 include/trace/events/error_report.h           |  84 ++++++
 kernel/trace/Makefile                         |   1 +
 kernel/trace/error_report-traces.c            |  11 +
 lib/Kconfig.debug                             |  14 +
 lib/Makefile                                  |   2 +
 lib/error_report_notify.c                     | 278 ++++++++++++++++++
 mm/kasan/report.c                             |  15 +-
 mm/kfence/report.c                            |   3 +
 9 files changed, 443 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-error_report
 create mode 100644 include/trace/events/error_report.h
 create mode 100644 kernel/trace/error_report-traces.c
 create mode 100644 lib/error_report_notify.c

-- 
2.30.0.284.gd98b1dd5eaa7-goog



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/5] tracing: add error_report trace points
  2021-01-15 13:03 [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Alexander Potapenko
@ 2021-01-15 13:03 ` Alexander Potapenko
  2021-01-15 13:03 ` [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports Alexander Potapenko
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-15 13:03 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: andreyknvl, dvyukov, mingo, elver, pmladek, rostedt,
	sergey.senozhatsky, glider, linux-mm

Introduce error_report_start and error_report_end tracepoints.
Those can be used in debugging tools like KASAN, KFENCE, etc.
to provide extensions to the error reporting mechanisms (e.g. allow
tests hook into error reporting, ease error report collection from
production kernels).
Another benefit would be making use of ftrace for debugging or
benchmarking the tools themselves.

Suggested-by: Marco Elver <elver@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Alexander Potapenko <glider@google.com>

---
v2: - change error_report_start and error_report_end prototypes
      to accept enum error_detector instead of char*
      (as suggested by Steven Rostedt)
---
 include/trace/events/error_report.h | 84 +++++++++++++++++++++++++++++
 kernel/trace/Makefile               |  1 +
 kernel/trace/error_report-traces.c  | 11 ++++
 3 files changed, 96 insertions(+)
 create mode 100644 include/trace/events/error_report.h
 create mode 100644 kernel/trace/error_report-traces.c

diff --git a/include/trace/events/error_report.h b/include/trace/events/error_report.h
new file mode 100644
index 000000000000..bbf8543fc31a
--- /dev/null
+++ b/include/trace/events/error_report.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM error_report
+
+#if !defined(_TRACE_ERROR_REPORT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ERROR_REPORT_H
+
+#include <linux/tracepoint.h>
+
+#ifndef __ERROR_REPORT_DECLARE_TRACE_ENUMS_ONCE_ONLY
+#define __ERROR_REPORT_DECLARE_TRACE_ENUMS_ONCE_ONLY
+
+enum error_detector {
+	ERROR_DETECTOR_KFENCE,
+	ERROR_DETECTOR_KASAN
+};
+
+#endif /* __ERROR_REPORT_DECLARE_TRACE_ENUMS_ONCE_ONLY */
+
+#define error_detector_list	\
+	EM(ERROR_DETECTOR_KFENCE, "kfence")	\
+	EMe(ERROR_DETECTOR_KASAN, "kasan")
+/* Always end the list with an EMe. */
+
+#undef EM
+#undef EMe
+
+#define EM(a, b)	TRACE_DEFINE_ENUM(a);
+#define EMe(a, b)	TRACE_DEFINE_ENUM(a);
+
+error_detector_list
+
+#undef EM
+#undef EMe
+
+#define EM(a, b) { a, b },
+#define EMe(a, b) { a, b }
+
+#define show_error_detector_list(val) \
+	__print_symbolic(val, error_detector_list)
+
+DECLARE_EVENT_CLASS(error_report_template,
+		    TP_PROTO(enum error_detector error_detector, unsigned long id),
+		    TP_ARGS(error_detector, id),
+		    TP_STRUCT__entry(__field(enum error_detector, error_detector)
+					     __field(unsigned long, id)),
+		    TP_fast_assign(__entry->error_detector = error_detector;
+				   __entry->id = id;),
+		    TP_printk("[%s] %lx",
+			      show_error_detector_list(__entry->error_detector),
+			      __entry->id));
+
+/**
+ * error_report_start - called before printing the error report
+ * @error_detector:	short string describing the error detection tool
+ * @id:			pseudo-unique descriptor that can help distinguish reports
+ *			from one another. Depending on the tool, good examples
+ *			could be: memory access address, call site, allocation
+ *			site, etc.
+ *
+ * This event occurs right before a debugging tool starts printing the error
+ * report.
+ */
+DEFINE_EVENT(error_report_template, error_report_start,
+	     TP_PROTO(enum error_detector error_detector, unsigned long id),
+	     TP_ARGS(error_detector, id));
+
+/**
+ * error_report_end - called after printing the error report
+ * @error_detector:	short string describing the error detection tool
+ * @id:			pseudo-unique descriptor, matches that passed to
+ *			error_report_start
+ *
+ * This event occurs right after a debugging tool finishes printing the error
+ * report.
+ */
+DEFINE_EVENT(error_report_template, error_report_end,
+	     TP_PROTO(enum error_detector error_detector, unsigned long id),
+	     TP_ARGS(error_detector, id));
+
+#endif /* _TRACE_ERROR_REPORT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 7e44cea89fdc..b28d3e5013cd 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -81,6 +81,7 @@ obj-$(CONFIG_SYNTH_EVENTS) += trace_events_synth.o
 obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o
 obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o
 obj-$(CONFIG_KPROBE_EVENTS) += trace_kprobe.o
+obj-$(CONFIG_TRACEPOINTS) += error_report-traces.o
 obj-$(CONFIG_TRACEPOINTS) += power-traces.o
 ifeq ($(CONFIG_PM),y)
 obj-$(CONFIG_TRACEPOINTS) += rpm-traces.o
diff --git a/kernel/trace/error_report-traces.c b/kernel/trace/error_report-traces.c
new file mode 100644
index 000000000000..80960c52c705
--- /dev/null
+++ b/kernel/trace/error_report-traces.c
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Error reporting trace points
+ */
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/error_report.h>
+
+EXPORT_TRACEPOINT_SYMBOL_GPL(error_report_start);
+EXPORT_TRACEPOINT_SYMBOL_GPL(error_report_end);
+
-- 
2.30.0.284.gd98b1dd5eaa7-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports
  2021-01-15 13:03 [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Alexander Potapenko
  2021-01-15 13:03 ` [PATCH v2 1/5] tracing: add error_report trace points Alexander Potapenko
@ 2021-01-15 13:03 ` Alexander Potapenko
  2021-01-15 13:50   ` Greg KH
  2021-01-18 11:38   ` Petr Mladek
  2021-01-15 13:03 ` [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation Alexander Potapenko
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-15 13:03 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: andreyknvl, dvyukov, mingo, elver, pmladek, rostedt,
	sergey.senozhatsky, glider, linux-mm

With the introduction of various production error-detection tools, such as
MTE-based KASAN and KFENCE, the need arises to efficiently notify the
userspace OS components about kernel errors. Currently, no facility exists
to notify userspace about a kernel error from such bug-detection tools.
The problem is obviously not restricted to the above bug detection tools,
and applies to any error reporting mechanism that does not panic the
kernel; this series, however, will only add support for KASAN and KFENCE
reporting.

All such error reports appear in the kernel log. But, when such errors
occur, userspace would normally need to read the entire kernel log and
parse the relevant errors. This is error prone and inefficient, as
userspace needs to continuously monitor the kernel log for error messages.
On certain devices, this is unfortunately not acceptable. Therefore, we
need to revisit how reports are propagated to userspace.

The library added, error_report_notify (CONFIG_ERROR_REPORT_NOTIFY),
solves the above by using the error_report_start/error_report_end tracing
events and exposing the last report and the total report count to the
userspace via /sys/kernel/error_report/last_report and
/sys/kernel/error_report/report_count.

Userspace apps can call poll(POLLPRI) on those files to get notified about
the new reports without having to watch dmesg in a loop.

Suggested-by: Marco Elver <elver@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Alexander Potapenko <glider@google.com>

---
v2:
 - s/memory error reports/error reports from debugging tools/
   (per Andrew Morton's comment)
 - change error_report_start and error_report_end prototypes
   to accept enum error_detector instead of char*
   (as suggested by Steven Rostedt)
---
 lib/Kconfig.debug         |  14 ++
 lib/Makefile              |   2 +
 lib/error_report_notify.c | 278 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 294 insertions(+)
 create mode 100644 lib/error_report_notify.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 4b8f9e018f0f..b950bf21b15c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -209,6 +209,20 @@ config DEBUG_BUGVERBOSE
 	  of the BUG call as well as the EIP and oops trace.  This aids
 	  debugging but costs about 70-100K of memory.
 
+config ERROR_REPORT_NOTIFY
+	bool "Expose error reports from debugging tools to the userspace"
+	depends on TRACING
+	help
+	  When enabled, captures error reports from debugging tools (such as
+	  KFENCE or KASAN) using console tracing, and exposes reports in
+	  /sys/kernel/error_report/: the file last_report contains the last
+	  report (with maximum report length of PAGE_SIZE), and report_count,
+	  the total report count.
+
+	  Userspace programs can call poll(POLLPRI) on those files to get
+	  notified about the new reports without having to watch dmesg in a
+	  loop.
+
 endmenu # "printk and dmesg options"
 
 menu "Compile-time checks and compiler options"
diff --git a/lib/Makefile b/lib/Makefile
index afeff05fa8c5..96ed8bc0cc5b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -240,6 +240,8 @@ obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
 
 obj-$(CONFIG_DQL) += dynamic_queue_limits.o
 
+obj-$(CONFIG_ERROR_REPORT_NOTIFY) += error_report_notify.o
+
 obj-$(CONFIG_GLOB) += glob.o
 obj-$(CONFIG_GLOB_SELFTEST) += globtest.o
 
diff --git a/lib/error_report_notify.c b/lib/error_report_notify.c
new file mode 100644
index 000000000000..66176cd94ba0
--- /dev/null
+++ b/lib/error_report_notify.c
@@ -0,0 +1,278 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Userspace notification interface for debugging tools.
+ *
+ * Provide two sysfs files:
+ *  - /sys/kernel/error_report/last_report
+ *  - /sys/kernel/error_report/report_count
+ * that contain the last debugging tool report (taken from dmesg, delimited by
+ * the error_report_start/error_report_end tracing events) and the total report
+ * count.
+ */
+
+#include <linux/atomic.h>
+#include <linux/fs.h>
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/tracepoint.h>
+#include <linux/workqueue.h>
+#include <trace/events/error_report.h>
+#include <trace/events/printk.h>
+
+static struct kobject *error_report_kobj;
+
+/* sysfs files are capped at PAGE_SIZE. */
+#define BUF_SIZE PAGE_SIZE
+/* Two buffers to store the finished report and the report being recorded. */
+static char report_buffer[2][BUF_SIZE];
+/*
+ * Total report count. Also serves as a latch for report_buffer:
+ * report_buffer[num_reports % 2] is the currently available report,
+ * report_buffer[(num_reports + 1) % 2] is the report being recorded.
+ */
+static atomic_t num_reports;
+
+/*
+ * PID of the task currently recording the report, as returned by
+ * get_encoded_pid(), or -1. Used as a writer lock for report_buffer.
+ * A regular spinlock couldn't be used here, as probe_console() can be called
+ * from any thread, and it needs to know whether that thread is holding the
+ * lock.
+ */
+static atomic_t current_pid = ATOMIC_INIT(-1);
+
+static size_t current_pos;
+static bool truncated;
+static const char TRUNC_MSG[] = "<truncated>\n";
+
+static struct delayed_work reporting_done;
+
+static void error_report_notify(struct work_struct *work)
+{
+	sysfs_notify(error_report_kobj, NULL, "last_report");
+	sysfs_notify(error_report_kobj, NULL, "report_count");
+}
+static DECLARE_DELAYED_WORK(reporting_done, error_report_notify);
+
+/*
+ * Return the current PID combined together with in_task(). This lets us
+ * distinguish between normal task context and IRQ context.
+ */
+static int get_encoded_pid(void)
+{
+	return (current->pid << 1) | (!!in_task());
+}
+
+/*
+ * Trace hook for the error_report_start event. In an unlikely case of another
+ * task already printing a report bail out, otherwise save the current pid
+ * together with in_task() return value.
+ *
+ * Because reporting code can be called from low-level routines (e.g. locking
+ * primitives or allocator guts), report recording is implemented using a
+ * seqlock lock-free algorithm.
+ */
+static void probe_report_start(void *ignore, enum error_detector detector,
+			       unsigned long id)
+{
+	/*
+	 * Acquire the writer lock. Any racing probe_report_start will not
+	 * record anything. Pairs with the release in probe_report_end().
+	 */
+	if (atomic_cmpxchg_acquire(&current_pid, -1, get_encoded_pid()) != -1)
+		return;
+	current_pos = 0;
+	truncated = false;
+}
+
+/*
+ * Trace hook for the error_report_end event. If an event from the mismatching
+ * error_report_start is received, it is ignored. Otherwise, null-terminate the
+ * buffer, increase the report count (effectively releasing the report to
+ * last_report_show() and schedule a notification about a new report.
+ */
+static void probe_report_end(void *ignore, enum error_detector detector,
+			     unsigned long id)
+{
+	pid_t pid = atomic_read(&current_pid);
+	int idx;
+
+	if (pid != get_encoded_pid())
+		return;
+
+	idx = (atomic_read(&num_reports) + 1) % 2;
+	if (current_pos == BUF_SIZE)
+		report_buffer[idx][current_pos - 1] = 0;
+	else
+		report_buffer[idx][current_pos] = 0;
+
+	/* Pairs with acquire in last_report_show(). */
+	atomic_inc_return_release(&num_reports);
+	schedule_delayed_work(&reporting_done, 0);
+	/*
+	 * Release the writer lock. Pairs with the acquire in
+	 * probe_report_start().
+	 */
+	atomic_set_release(&current_pid, -1);
+}
+
+/*
+ * Skip one or two leading pair of brackets containing the log timestamp and
+ * the task/CPU ID, plus the leading space, from the report line, e.g.:
+ *   [    0.698431][    T7] BUG: KFENCE: use-after-free ...
+ * becomes:
+ *   BUG: KFENCE: use-after-free ...
+ *
+ * Report size is only 4K, and this boilerplate can easily account for half of
+ * that amount.
+ */
+static void skip_extra_info(const char **buf, size_t *len)
+{
+	int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) +
+			   IS_ENABLED(CONFIG_PRINTK_CALLER);
+	const char *found;
+
+	if (!buf || !len)
+		return;
+
+	while (num_brackets--) {
+		if (!*len || *buf[0] != '[')
+			return;
+		found = strnchr(*buf, *len, ']');
+		if (!found)
+			return;
+		*len -= found - *buf + 1;
+		*buf = found + 1;
+	}
+	if (*len && *buf[0] == ' ') {
+		++*buf;
+		--*len;
+	}
+}
+
+/*
+ * Trace hook for the console event. If a line comes from a task/CPU that did
+ * not send the error_report_start event, that line is ignored. Otherwise, it
+ * is stored in the report_buffer[(num_reports + 1) % 2].
+ *
+ * To save space, the leading timestamps and (when enabled) CPU/task info is
+ * stripped away. The buffer may contain newlines, so this procedure is
+ * repeated for every line.
+ */
+static void probe_console(void *ignore, const char *buf, size_t len)
+{
+	int pid = atomic_read(&current_pid);
+	size_t to_copy, cur_len;
+	char *newline;
+	int idx;
+
+	if (pid != get_encoded_pid() || truncated)
+		return;
+
+	idx = (atomic_read(&num_reports) + 1) % 2;
+	while (len) {
+		newline = strnchr(buf, len, '\n');
+		if (newline)
+			cur_len = newline - buf + 1;
+		else
+			cur_len = len;
+		/* Adjust len now, because skip_extra_info() may change cur_len. */
+		len -= cur_len;
+		skip_extra_info(&buf, &cur_len);
+		to_copy = min(cur_len, BUF_SIZE - current_pos);
+		memcpy(report_buffer[idx] + current_pos, buf, to_copy);
+		current_pos += to_copy;
+		if (cur_len > to_copy) {
+			truncated = true;
+			memcpy(report_buffer[idx] + current_pos - sizeof(TRUNC_MSG),
+			       TRUNC_MSG, sizeof(TRUNC_MSG));
+			break;
+		}
+		buf += cur_len;
+	}
+}
+
+static void register_tracepoints(void)
+{
+	register_trace_console(probe_console, NULL);
+	register_trace_error_report_start(probe_report_start, NULL);
+	register_trace_error_report_end(probe_report_end, NULL);
+}
+
+/*
+ * read() handler for /sys/kernel/error_report/last_report.
+ * Because the number of reports can change under our feet, check it again
+ * after copying the report, and retry if the numbers mismatch.
+ */
+static ssize_t last_report_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	ssize_t ret;
+	int index;
+
+	do {
+		/* Pairs with release in probe_report_end(). */
+		index = atomic_read_acquire(&num_reports);
+		/*
+		 * If index and old_index mismatch, we might be accessing
+		 * report_buffer concurrently with a writer thread. In that
+		 * case the read data will be discarded.
+		 */
+		ret = data_race(strscpy(buf, report_buffer[index % 2], BUF_SIZE));
+		/*
+		 * Prevent reordering between the memcpy above and the atomic
+		 * read below.
+		 * See the comments in include/linux/seqlock.h for more
+		 * details.
+		 */
+		smp_rmb();
+	} while (index != atomic_read(&num_reports));
+	return ret;
+}
+
+/*
+ * read() handler for /sys/kernel/error_report/report_count.
+ */
+static ssize_t report_count_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d\n", atomic_read(&num_reports));
+}
+
+static struct kobj_attribute last_report_attr = __ATTR_RO(last_report);
+static struct kobj_attribute report_count_attr = __ATTR_RO(report_count);
+static struct attribute *error_report_sysfs_attrs[] = {
+	&last_report_attr.attr,
+	&report_count_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group error_report_sysfs_attr_group = {
+	.attrs = error_report_sysfs_attrs,
+};
+
+/*
+ * Set up report notification: register tracepoints and create
+ * /sys/kernel/error_report/.
+ */
+static void error_report_notify_setup(void)
+{
+	int err;
+
+	register_tracepoints();
+	error_report_kobj = kobject_create_and_add("error_report", kernel_kobj);
+	if (!error_report_kobj)
+		goto error;
+	err = sysfs_create_group(error_report_kobj,
+				 &error_report_sysfs_attr_group);
+	if (err)
+		goto error;
+	return;
+
+error:
+	if (error_report_kobj)
+		kobject_del(error_report_kobj);
+}
+late_initcall(error_report_notify_setup);
-- 
2.30.0.284.gd98b1dd5eaa7-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation
  2021-01-15 13:03 [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Alexander Potapenko
  2021-01-15 13:03 ` [PATCH v2 1/5] tracing: add error_report trace points Alexander Potapenko
  2021-01-15 13:03 ` [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports Alexander Potapenko
@ 2021-01-15 13:03 ` Alexander Potapenko
  2021-01-15 13:45   ` Greg KH
  2021-01-15 13:03 ` [PATCH v2 4/5] kfence: use error_report_start and error_report_end tracepoints Alexander Potapenko
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-15 13:03 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: andreyknvl, dvyukov, mingo, elver, pmladek, rostedt,
	sergey.senozhatsky, glider, linux-mm

Add ABI documentation for files in /sys/kernel/error_report/

Requested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Alexander Potapenko <glider@google.com>
---
 .../ABI/testing/sysfs-kernel-error_report     | 41 +++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-error_report

diff --git a/Documentation/ABI/testing/sysfs-kernel-error_report b/Documentation/ABI/testing/sysfs-kernel-error_report
new file mode 100644
index 000000000000..666d039f93a9
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-error_report
@@ -0,0 +1,41 @@
+What:		/sys/kernel/error_report/
+Date:		January 2021
+Contact:	Alexander Potapenko <glider@google.com>,
+		Marco Elver <elver@google.com>
+Description:
+		/sys/kernel/error_report/ contains two files: "report_count"
+		and "last_report". These files are used to notify userspace
+		about error reports from the enrolled kernel subsystems (those
+		that use error_report_start/error_report_end tracepoints).
+
+		"report_count" contains the current number of reported errors.
+		This number is incremented every time the error_report_end
+		trace event occurs in the kernel.
+
+		"last_report" contains the most recent error report; concurrent
+		report generation results in collection of any one report
+		("last_report" may not be the last shown on the console).
+		A "report" is everything the task had printed to the console
+		between issuing the error_report_start and error_report_end
+		trace events.
+
+		Due to sysfs limitations, the report size is truncated at
+		PAGE_SIZE. To save space, the leading info in square brackets
+		printed by CONFIG_PRINTK_TIME and CONFIG_PRINTK_CALLER is
+		trimmed from the output lines.
+
+		Both files use sysfs_notify() to notify userspace about
+		changes. Userspace programs can use poll() to block until an
+		error is reported:
+
+			pfd.fd = fd;
+			pfd.events = POLLPRI;
+			while (1) {
+				lseek(pfd.fd, 0, SEEK_SET);
+				poll(&pfd, 1, -1);
+				read(pfd.fd, buffer, PAGE_SIZE);
+				/* Process the report in @buffer. */
+			}
+
+		Files in /sys/kernel/error_report/ are available when
+		CONFIG_ERROR_REPORT_NOTIFY is enabled.
-- 
2.30.0.284.gd98b1dd5eaa7-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 4/5] kfence: use error_report_start and error_report_end tracepoints
  2021-01-15 13:03 [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Alexander Potapenko
                   ` (2 preceding siblings ...)
  2021-01-15 13:03 ` [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation Alexander Potapenko
@ 2021-01-15 13:03 ` Alexander Potapenko
  2021-01-15 13:03 ` [PATCH v2 5/5] kasan: " Alexander Potapenko
  2021-01-15 13:06 ` [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Vlastimil Babka
  5 siblings, 0 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-15 13:03 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: andreyknvl, dvyukov, mingo, elver, pmladek, rostedt,
	sergey.senozhatsky, glider, linux-mm

Make it possible to trace KFENCE error reporting (in particular, for
compatibility with CONFIG_ERROR_REPORT_NOTIFY).

Suggested-by: Marco Elver <elver@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Alexander Potapenko <glider@google.com>

---
v2:
 - change error_report_start and error_report_end prototypes
   to accept enum error_detector instead of char*
   (as suggested by Steven Rostedt)
---
 mm/kfence/report.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/kfence/report.c b/mm/kfence/report.c
index 4dedc2ff8f28..bbe7dad5f249 100644
--- a/mm/kfence/report.c
+++ b/mm/kfence/report.c
@@ -9,6 +9,7 @@
 #include <linux/seq_file.h>
 #include <linux/stacktrace.h>
 #include <linux/string.h>
+#include <trace/events/error_report.h>
 
 #include <asm/kfence.h>
 
@@ -184,6 +185,7 @@ void kfence_report_error(unsigned long address, struct pt_regs *regs,
 	lockdep_off();
 
 	pr_err("==================================================================\n");
+	trace_error_report_start(ERROR_DETECTOR_KFENCE, address);
 	/* Print report header. */
 	switch (type) {
 	case KFENCE_ERROR_OOB: {
@@ -232,6 +234,7 @@ void kfence_report_error(unsigned long address, struct pt_regs *regs,
 		show_regs(regs);
 	else
 		dump_stack_print_info(KERN_ERR);
+	trace_error_report_end(ERROR_DETECTOR_KFENCE, address);
 	pr_err("==================================================================\n");
 
 	lockdep_on();
-- 
2.30.0.284.gd98b1dd5eaa7-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 5/5] kasan: use error_report_start and error_report_end tracepoints
  2021-01-15 13:03 [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Alexander Potapenko
                   ` (3 preceding siblings ...)
  2021-01-15 13:03 ` [PATCH v2 4/5] kfence: use error_report_start and error_report_end tracepoints Alexander Potapenko
@ 2021-01-15 13:03 ` Alexander Potapenko
  2021-01-15 13:06 ` [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Vlastimil Babka
  5 siblings, 0 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-15 13:03 UTC (permalink / raw)
  To: linux-kernel, akpm
  Cc: andreyknvl, dvyukov, mingo, elver, pmladek, rostedt,
	sergey.senozhatsky, glider, linux-mm

Make it possible to trace KASAN error reporting (in particular, for
compatibility with CONFIG_ERROR_REPORT_NOTIFY).

Suggested-by: Marco Elver <elver@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Alexander Potapenko <glider@google.com>

---
v2:
 - change error_report_start and error_report_end prototypes
   to accept enum error_detector instead of char*
   (as suggested by Steven Rostedt)
---
 mm/kasan/report.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/kasan/report.c b/mm/kasan/report.c
index c0fb21797550..78b8bbc84dc2 100644
--- a/mm/kasan/report.c
+++ b/mm/kasan/report.c
@@ -25,6 +25,7 @@
 #include <linux/module.h>
 #include <linux/sched/task_stack.h>
 #include <linux/uaccess.h>
+#include <trace/events/error_report.h>
 
 #include <asm/sections.h>
 
@@ -74,7 +75,7 @@ static void print_error_description(struct kasan_access_info *info)
 
 static DEFINE_SPINLOCK(report_lock);
 
-static void start_report(unsigned long *flags)
+static void start_report(unsigned long *flags, unsigned long addr)
 {
 	/*
 	 * Make sure we don't end up in loop.
@@ -82,10 +83,12 @@ static void start_report(unsigned long *flags)
 	kasan_disable_current();
 	spin_lock_irqsave(&report_lock, *flags);
 	pr_err("==================================================================\n");
+	trace_error_report_start(ERROR_DETECTOR_KASAN, addr);
 }
 
-static void end_report(unsigned long *flags)
+static void end_report(unsigned long *flags, unsigned long addr)
 {
+	trace_error_report_end(ERROR_DETECTOR_KASAN, addr);
 	pr_err("==================================================================\n");
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 	spin_unlock_irqrestore(&report_lock, *flags);
@@ -348,14 +351,14 @@ void kasan_report_invalid_free(void *object, unsigned long ip)
 		kasan_update_kunit_status(current->kunit_test);
 #endif /* IS_ENABLED(CONFIG_KUNIT) */
 
-	start_report(&flags);
+	start_report(&flags, (unsigned long)object);
 	pr_err("BUG: KASAN: double-free or invalid-free in %pS\n", (void *)ip);
 	print_tags(tag, object);
 	pr_err("\n");
 	print_address_description(object, tag);
 	pr_err("\n");
 	print_memory_metadata(object);
-	end_report(&flags);
+	end_report(&flags, (unsigned long)object);
 }
 
 static void __kasan_report(unsigned long addr, size_t size, bool is_write,
@@ -385,7 +388,7 @@ static void __kasan_report(unsigned long addr, size_t size, bool is_write,
 	info.is_write = is_write;
 	info.ip = ip;
 
-	start_report(&flags);
+	start_report(&flags, addr);
 
 	print_error_description(&info);
 	if (addr_has_metadata(untagged_addr))
@@ -400,7 +403,7 @@ static void __kasan_report(unsigned long addr, size_t size, bool is_write,
 		dump_stack();
 	}
 
-	end_report(&flags);
+	end_report(&flags, addr);
 }
 
 bool kasan_report(unsigned long addr, size_t size, bool is_write,
-- 
2.30.0.284.gd98b1dd5eaa7-goog



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools
  2021-01-15 13:03 [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Alexander Potapenko
                   ` (4 preceding siblings ...)
  2021-01-15 13:03 ` [PATCH v2 5/5] kasan: " Alexander Potapenko
@ 2021-01-15 13:06 ` Vlastimil Babka
  2021-01-15 13:09   ` Alexander Potapenko
  5 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2021-01-15 13:06 UTC (permalink / raw)
  To: Alexander Potapenko, linux-kernel, akpm, Linux API
  Cc: andreyknvl, dvyukov, mingo, elver, pmladek, rostedt,
	sergey.senozhatsky, linux-mm

Should have CCd linux-api@, please do next time

On 1/15/21 2:03 PM, Alexander Potapenko wrote:
> This patchset adds a library that captures error reports from debugging
> tools like KASAN or KFENCE and exposes those reports to userspace via
> sysfs. Report capturing is controlled by two new types of tracepoints:
> error_report_start and error_report_end, that must be added to the tools
> that want to use this new feature.
> 
> v2:
>  - added ABI documentation for /sys/kernel/error_report/
>  - changed error_report_start and error_report end tracepoints to take
>    a fixed set of values for the error detector
> 
> Alexander Potapenko (5):
>   tracing: add error_report trace points
>   lib: add error_report_notify to collect debugging tools' reports
>   docs: ABI: add /sys/kernel/error_report/ documentation
>   kfence: use error_report_start and error_report_end tracepoints
>   kasan: use error_report_start and error_report_end tracepoints
> 
>  .../ABI/testing/sysfs-kernel-error_report     |  41 +++
>  include/trace/events/error_report.h           |  84 ++++++
>  kernel/trace/Makefile                         |   1 +
>  kernel/trace/error_report-traces.c            |  11 +
>  lib/Kconfig.debug                             |  14 +
>  lib/Makefile                                  |   2 +
>  lib/error_report_notify.c                     | 278 ++++++++++++++++++
>  mm/kasan/report.c                             |  15 +-
>  mm/kfence/report.c                            |   3 +
>  9 files changed, 443 insertions(+), 6 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-error_report
>  create mode 100644 include/trace/events/error_report.h
>  create mode 100644 kernel/trace/error_report-traces.c
>  create mode 100644 lib/error_report_notify.c
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools
  2021-01-15 13:06 ` [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Vlastimil Babka
@ 2021-01-15 13:09   ` Alexander Potapenko
  2021-01-21 12:56     ` Alexander Potapenko
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-15 13:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: LKML, Andrew Morton, Linux API, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Petr Mladek, Steven Rostedt,
	Sergey Senozhatsky, Linux Memory Management List

On Fri, Jan 15, 2021 at 2:06 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Should have CCd linux-api@, please do next time
Thanks, will do!
Shall I also CC the v2 ABI patch explicitly?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation
  2021-01-15 13:03 ` [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation Alexander Potapenko
@ 2021-01-15 13:45   ` Greg KH
  2021-01-15 15:26     ` Alexander Potapenko
  0 siblings, 1 reply; 22+ messages in thread
From: Greg KH @ 2021-01-15 13:45 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: linux-kernel, akpm, andreyknvl, dvyukov, mingo, elver, pmladek,
	rostedt, sergey.senozhatsky, linux-mm

On Fri, Jan 15, 2021 at 02:03:34PM +0100, Alexander Potapenko wrote:
> Add ABI documentation for files in /sys/kernel/error_report/
> 
> Requested-by: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andrey Konovalov <andreyknvl@google.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Marco Elver <elver@google.com>
> Cc: Petr Mladek <pmladek@suse.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Alexander Potapenko <glider@google.com>
> ---
>  .../ABI/testing/sysfs-kernel-error_report     | 41 +++++++++++++++++++
>  1 file changed, 41 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-error_report
> 
> diff --git a/Documentation/ABI/testing/sysfs-kernel-error_report b/Documentation/ABI/testing/sysfs-kernel-error_report
> new file mode 100644
> index 000000000000..666d039f93a9
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-error_report
> @@ -0,0 +1,41 @@
> +What:		/sys/kernel/error_report/
> +Date:		January 2021
> +Contact:	Alexander Potapenko <glider@google.com>,
> +		Marco Elver <elver@google.com>
> +Description:
> +		/sys/kernel/error_report/ contains two files: "report_count"
> +		and "last_report". These files are used to notify userspace
> +		about error reports from the enrolled kernel subsystems (those
> +		that use error_report_start/error_report_end tracepoints).
> +
> +		"report_count" contains the current number of reported errors.
> +		This number is incremented every time the error_report_end
> +		trace event occurs in the kernel.
> +
> +		"last_report" contains the most recent error report; concurrent
> +		report generation results in collection of any one report
> +		("last_report" may not be the last shown on the console).
> +		A "report" is everything the task had printed to the console
> +		between issuing the error_report_start and error_report_end
> +		trace events.
> +
> +		Due to sysfs limitations, the report size is truncated at
> +		PAGE_SIZE. To save space, the leading info in square brackets
> +		printed by CONFIG_PRINTK_TIME and CONFIG_PRINTK_CALLER is
> +		trimmed from the output lines.
> +
> +		Both files use sysfs_notify() to notify userspace about
> +		changes. Userspace programs can use poll() to block until an
> +		error is reported:
> +
> +			pfd.fd = fd;
> +			pfd.events = POLLPRI;
> +			while (1) {
> +				lseek(pfd.fd, 0, SEEK_SET);
> +				poll(&pfd, 1, -1);
> +				read(pfd.fd, buffer, PAGE_SIZE);
> +				/* Process the report in @buffer. */
> +			}
> +
> +		Files in /sys/kernel/error_report/ are available when
> +		CONFIG_ERROR_REPORT_NOTIFY is enabled.

sysfs is "one value per file", please put something like this in
tracefs, as there is no such rules there.  Or debugfs, but please, not
sysfs.

Also, any reason you didn't cc: the sysfs maintainers?

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports
  2021-01-15 13:03 ` [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports Alexander Potapenko
@ 2021-01-15 13:50   ` Greg KH
  2021-01-15 17:17     ` Alexander Potapenko
  2021-01-18 11:38   ` Petr Mladek
  1 sibling, 1 reply; 22+ messages in thread
From: Greg KH @ 2021-01-15 13:50 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: linux-kernel, akpm, andreyknvl, dvyukov, mingo, elver, pmladek,
	rostedt, sergey.senozhatsky, linux-mm

Minor comments, if in the future, you really do want to mess around in sysfs:

On Fri, Jan 15, 2021 at 02:03:33PM +0100, Alexander Potapenko wrote:
> diff --git a/lib/error_report_notify.c b/lib/error_report_notify.c
> new file mode 100644
> index 000000000000..66176cd94ba0
> --- /dev/null
> +++ b/lib/error_report_notify.c
> @@ -0,0 +1,278 @@
> +// SPDX-License-Identifier: GPL-2.0
> +

No copyright notice for the file?  While acceptable, odds are your
corporate lawyers will not be happy with that :(

> +/*
> + * Userspace notification interface for debugging tools.
> + *
> + * Provide two sysfs files:
> + *  - /sys/kernel/error_report/last_report
> + *  - /sys/kernel/error_report/report_count
> + * that contain the last debugging tool report (taken from dmesg, delimited by
> + * the error_report_start/error_report_end tracing events) and the total report
> + * count.
> + */
> +
> +#include <linux/atomic.h>
> +#include <linux/fs.h>
> +#include <linux/kobject.h>
> +#include <linux/string.h>
> +#include <linux/sysfs.h>
> +#include <linux/tracepoint.h>
> +#include <linux/workqueue.h>
> +#include <trace/events/error_report.h>
> +#include <trace/events/printk.h>
> +
> +static struct kobject *error_report_kobj;
> +
> +/* sysfs files are capped at PAGE_SIZE. */
> +#define BUF_SIZE PAGE_SIZE
> +/* Two buffers to store the finished report and the report being recorded. */
> +static char report_buffer[2][BUF_SIZE];
> +/*
> + * Total report count. Also serves as a latch for report_buffer:
> + * report_buffer[num_reports % 2] is the currently available report,
> + * report_buffer[(num_reports + 1) % 2] is the report being recorded.
> + */
> +static atomic_t num_reports;
> +
> +/*
> + * PID of the task currently recording the report, as returned by
> + * get_encoded_pid(), or -1. Used as a writer lock for report_buffer.
> + * A regular spinlock couldn't be used here, as probe_console() can be called
> + * from any thread, and it needs to know whether that thread is holding the
> + * lock.
> + */
> +static atomic_t current_pid = ATOMIC_INIT(-1);

how do you handle pid namespaces?

> +
> +static size_t current_pos;
> +static bool truncated;
> +static const char TRUNC_MSG[] = "<truncated>\n";
> +
> +static struct delayed_work reporting_done;
> +
> +static void error_report_notify(struct work_struct *work)
> +{
> +	sysfs_notify(error_report_kobj, NULL, "last_report");
> +	sysfs_notify(error_report_kobj, NULL, "report_count");
> +}
> +static DECLARE_DELAYED_WORK(reporting_done, error_report_notify);
> +
> +/*
> + * Return the current PID combined together with in_task(). This lets us
> + * distinguish between normal task context and IRQ context.
> + */
> +static int get_encoded_pid(void)
> +{
> +	return (current->pid << 1) | (!!in_task());
> +}
> +
> +/*
> + * Trace hook for the error_report_start event. In an unlikely case of another
> + * task already printing a report bail out, otherwise save the current pid
> + * together with in_task() return value.
> + *
> + * Because reporting code can be called from low-level routines (e.g. locking
> + * primitives or allocator guts), report recording is implemented using a
> + * seqlock lock-free algorithm.
> + */
> +static void probe_report_start(void *ignore, enum error_detector detector,
> +			       unsigned long id)
> +{
> +	/*
> +	 * Acquire the writer lock. Any racing probe_report_start will not
> +	 * record anything. Pairs with the release in probe_report_end().
> +	 */
> +	if (atomic_cmpxchg_acquire(&current_pid, -1, get_encoded_pid()) != -1)
> +		return;

pid namespaces?

> +	current_pos = 0;
> +	truncated = false;
> +}
> +
> +/*
> + * Trace hook for the error_report_end event. If an event from the mismatching
> + * error_report_start is received, it is ignored. Otherwise, null-terminate the
> + * buffer, increase the report count (effectively releasing the report to
> + * last_report_show() and schedule a notification about a new report.
> + */
> +static void probe_report_end(void *ignore, enum error_detector detector,
> +			     unsigned long id)
> +{
> +	pid_t pid = atomic_read(&current_pid);

pid namespaces?

> +	int idx;
> +
> +	if (pid != get_encoded_pid())
> +		return;
> +
> +	idx = (atomic_read(&num_reports) + 1) % 2;

You read, but it could change before:

> +	if (current_pos == BUF_SIZE)
> +		report_buffer[idx][current_pos - 1] = 0;
> +	else
> +		report_buffer[idx][current_pos] = 0;
> +
> +	/* Pairs with acquire in last_report_show(). */
> +	atomic_inc_return_release(&num_reports);

Not good?

> +	schedule_delayed_work(&reporting_done, 0);
> +	/*
> +	 * Release the writer lock. Pairs with the acquire in
> +	 * probe_report_start().
> +	 */
> +	atomic_set_release(&current_pid, -1);
> +}
> +
> +/*
> + * Skip one or two leading pair of brackets containing the log timestamp and
> + * the task/CPU ID, plus the leading space, from the report line, e.g.:
> + *   [    0.698431][    T7] BUG: KFENCE: use-after-free ...
> + * becomes:
> + *   BUG: KFENCE: use-after-free ...
> + *
> + * Report size is only 4K, and this boilerplate can easily account for half of
> + * that amount.
> + */
> +static void skip_extra_info(const char **buf, size_t *len)
> +{
> +	int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) +
> +			   IS_ENABLED(CONFIG_PRINTK_CALLER);
> +	const char *found;
> +
> +	if (!buf || !len)
> +		return;
> +
> +	while (num_brackets--) {
> +		if (!*len || *buf[0] != '[')
> +			return;
> +		found = strnchr(*buf, *len, ']');
> +		if (!found)
> +			return;
> +		*len -= found - *buf + 1;
> +		*buf = found + 1;
> +	}
> +	if (*len && *buf[0] == ' ') {
> +		++*buf;
> +		--*len;
> +	}
> +}
> +
> +/*
> + * Trace hook for the console event. If a line comes from a task/CPU that did
> + * not send the error_report_start event, that line is ignored. Otherwise, it
> + * is stored in the report_buffer[(num_reports + 1) % 2].
> + *
> + * To save space, the leading timestamps and (when enabled) CPU/task info is
> + * stripped away. The buffer may contain newlines, so this procedure is
> + * repeated for every line.
> + */
> +static void probe_console(void *ignore, const char *buf, size_t len)
> +{
> +	int pid = atomic_read(&current_pid);
> +	size_t to_copy, cur_len;
> +	char *newline;
> +	int idx;
> +
> +	if (pid != get_encoded_pid() || truncated)
> +		return;
> +
> +	idx = (atomic_read(&num_reports) + 1) % 2;
> +	while (len) {
> +		newline = strnchr(buf, len, '\n');
> +		if (newline)
> +			cur_len = newline - buf + 1;
> +		else
> +			cur_len = len;
> +		/* Adjust len now, because skip_extra_info() may change cur_len. */
> +		len -= cur_len;
> +		skip_extra_info(&buf, &cur_len);
> +		to_copy = min(cur_len, BUF_SIZE - current_pos);
> +		memcpy(report_buffer[idx] + current_pos, buf, to_copy);
> +		current_pos += to_copy;
> +		if (cur_len > to_copy) {
> +			truncated = true;
> +			memcpy(report_buffer[idx] + current_pos - sizeof(TRUNC_MSG),
> +			       TRUNC_MSG, sizeof(TRUNC_MSG));
> +			break;
> +		}
> +		buf += cur_len;
> +	}
> +}
> +
> +static void register_tracepoints(void)
> +{
> +	register_trace_console(probe_console, NULL);
> +	register_trace_error_report_start(probe_report_start, NULL);
> +	register_trace_error_report_end(probe_report_end, NULL);
> +}
> +
> +/*
> + * read() handler for /sys/kernel/error_report/last_report.
> + * Because the number of reports can change under our feet, check it again
> + * after copying the report, and retry if the numbers mismatch.
> + */
> +static ssize_t last_report_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf)
> +{
> +	ssize_t ret;
> +	int index;
> +
> +	do {
> +		/* Pairs with release in probe_report_end(). */
> +		index = atomic_read_acquire(&num_reports);
> +		/*
> +		 * If index and old_index mismatch, we might be accessing
> +		 * report_buffer concurrently with a writer thread. In that
> +		 * case the read data will be discarded.
> +		 */
> +		ret = data_race(strscpy(buf, report_buffer[index % 2], BUF_SIZE));
> +		/*
> +		 * Prevent reordering between the memcpy above and the atomic
> +		 * read below.
> +		 * See the comments in include/linux/seqlock.h for more
> +		 * details.
> +		 */
> +		smp_rmb();
> +	} while (index != atomic_read(&num_reports));

endless loops, what could go wrong...

Why are you rolling your own hacky locks in here?

And again, sysfs is "one value" not "one buffer".

> +	return ret;
> +}
> +
> +/*
> + * read() handler for /sys/kernel/error_report/report_count.
> + */
> +static ssize_t report_count_show(struct kobject *kobj,
> +				 struct kobj_attribute *attr, char *buf)
> +{
> +	return scnprintf(buf, PAGE_SIZE, "%d\n", atomic_read(&num_reports));

sysfs_emit()?

And you just read it, but what keeps it from changing?

> +}
> +
> +static struct kobj_attribute last_report_attr = __ATTR_RO(last_report);
> +static struct kobj_attribute report_count_attr = __ATTR_RO(report_count);
> +static struct attribute *error_report_sysfs_attrs[] = {
> +	&last_report_attr.attr,
> +	&report_count_attr.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group error_report_sysfs_attr_group = {
> +	.attrs = error_report_sysfs_attrs,
> +};

ATTRIBUTE_GROUPS()?

> +
> +/*
> + * Set up report notification: register tracepoints and create
> + * /sys/kernel/error_report/.
> + */
> +static void error_report_notify_setup(void)
> +{
> +	int err;
> +
> +	register_tracepoints();
> +	error_report_kobj = kobject_create_and_add("error_report", kernel_kobj);
> +	if (!error_report_kobj)
> +		goto error;
> +	err = sysfs_create_group(error_report_kobj,
> +				 &error_report_sysfs_attr_group);
> +	if (err)
> +		goto error;
> +	return;
> +
> +error:
> +	if (error_report_kobj)
> +		kobject_del(error_report_kobj);
> +}
> +late_initcall(error_report_notify_setup);

You never clean up the kobject or files?

Anyway, please move this to tracefs, that's where it belongs.

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation
  2021-01-15 13:45   ` Greg KH
@ 2021-01-15 15:26     ` Alexander Potapenko
  2021-01-15 15:45       ` Greg KH
  2021-01-15 16:52       ` Steven Rostedt
  0 siblings, 2 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-15 15:26 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, Andrew Morton, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Petr Mladek, Steven Rostedt,
	Sergey Senozhatsky, Linux Memory Management List

> sysfs is "one value per file"
What about existing interfaces that even export binary blobs through
sysfs (e.g. /sys/firmware, /sys/boot_params)?
What qualifies as a "value" in that case?

> please put something like this in
> tracefs, as there is no such rules there.  Or debugfs, but please, not
> sysfs.
Does tracefs have anything similar to sysfs_notify() or any other way
to implement a poll() handler?
Our main goal is to let users wait on poll(), so that they don't have
to check the file for new contents every now and then. Is it possible
with tracefs or debugfs?

Also, for our goal debugfs isn't a particularly good fit, as Android
kernels do not enable debugfs.
Not sure about tracefs, they seem to have it, need to check.

Do you think it is viable to keep
/sys/kernel/error_report/report_count in sysfs and use it for
notifications, but move last_report somewhere else?
(I'd probably prefer procfs, but it could be tracefs as well, if you
find that better).
This way it will still be possible to easily notify userspace about
new reports, but the report itself won't have any atomicity
guarantees. Users will have to check the report count to ensure it
didn't change under their feet.

> Also, any reason you didn't cc: the sysfs maintainers?
Only my lack of common sense :)
I'll add them should the following patches rely on sysfs, thank you!

Alex




--
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation
  2021-01-15 15:26     ` Alexander Potapenko
@ 2021-01-15 15:45       ` Greg KH
  2021-01-15 16:52       ` Steven Rostedt
  1 sibling, 0 replies; 22+ messages in thread
From: Greg KH @ 2021-01-15 15:45 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: LKML, Andrew Morton, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Petr Mladek, Steven Rostedt,
	Sergey Senozhatsky, Linux Memory Management List

On Fri, Jan 15, 2021 at 04:26:21PM +0100, Alexander Potapenko wrote:
> > sysfs is "one value per file"
> What about existing interfaces that even export binary blobs through
> sysfs (e.g. /sys/firmware, /sys/boot_params)?
> What qualifies as a "value" in that case?

binary files are fine as the kernel is just a "pipe" through to the
hardware / firmware.  No translation or parsing happens in the kernel.
And that's NOT trace events :)

> > please put something like this in
> > tracefs, as there is no such rules there.  Or debugfs, but please, not
> > sysfs.
> Does tracefs have anything similar to sysfs_notify() or any other way
> to implement a poll() handler?

Don't know, look and see!

> Our main goal is to let users wait on poll(), so that they don't have
> to check the file for new contents every now and then. Is it possible
> with tracefs or debugfs?

debugfs, yes, you can do whatever you want.  tracefs probably has this,
otherwise how would trace tools work?  :)

> Also, for our goal debugfs isn't a particularly good fit, as Android
> kernels do not enable debugfs.

Do you care about Android kernels?  If so, yes, debugfs is not good.
And have you asked the Android kernel team about this?

> Not sure about tracefs, they seem to have it, need to check.

It should be there.

> Do you think it is viable to keep
> /sys/kernel/error_report/report_count in sysfs and use it for
> notifications, but move last_report somewhere else?

No, not at all, please keep it out of sysfs.

> (I'd probably prefer procfs, but it could be tracefs as well, if you
> find that better).

If it does not have to do with processes, it does not belong in procfs.

Again, this seems to fit in with tracing, so please use tracefs.

> This way it will still be possible to easily notify userspace about
> new reports, but the report itself won't have any atomicity
> guarantees. Users will have to check the report count to ensure it
> didn't change under their feet.

Again, use a file in tracefs.

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation
  2021-01-15 15:26     ` Alexander Potapenko
  2021-01-15 15:45       ` Greg KH
@ 2021-01-15 16:52       ` Steven Rostedt
  2021-01-18 10:22         ` Alexander Potapenko
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2021-01-15 16:52 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Greg KH, LKML, Andrew Morton, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Petr Mladek, Sergey Senozhatsky,
	Linux Memory Management List

On Fri, 15 Jan 2021 16:26:21 +0100
Alexander Potapenko <glider@google.com> wrote:

> > please put something like this in
> > tracefs, as there is no such rules there.  Or debugfs, but please, not
> > sysfs.  
> Does tracefs have anything similar to sysfs_notify() or any other way
> to implement a poll() handler?
> Our main goal is to let users wait on poll(), so that they don't have
> to check the file for new contents every now and then. Is it possible
> with tracefs or debugfs?

Polling should work on tracefs. I hope it does, as I depend on it ;-)

And if there's an issue, we can always add more features.

> 
> Also, for our goal debugfs isn't a particularly good fit, as Android
> kernels do not enable debugfs.
> Not sure about tracefs, they seem to have it, need to check.

I believe tracefs is used extensively on Android.

> 
> Do you think it is viable to keep
> /sys/kernel/error_report/report_count in sysfs and use it for
> notifications, but move last_report somewhere else?
> (I'd probably prefer procfs, but it could be tracefs as well, if you
> find that better).

If you do use tracefs, add it to the top level tracing directory (no need
to have instances of it), and rename it to "kernel_warnings", as
"error_report" is too close to the existing "error_log" which holds error
messages of syntactic errors done by users entering in commands to some of
the special files.

That is, /sys/kernel/tracing/kernel_warnings/ would be your error_report/
directory you have now.

Use the function in kernel/trace/trace.c: tracer_init_tracefs() to add that
directory. That's for files in the tracefs directory that will not be
duplicated by instances.

-- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports
  2021-01-15 13:50   ` Greg KH
@ 2021-01-15 17:17     ` Alexander Potapenko
  0 siblings, 0 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-15 17:17 UTC (permalink / raw)
  To: Greg KH
  Cc: LKML, Andrew Morton, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Petr Mladek, Steven Rostedt,
	Sergey Senozhatsky, Linux Memory Management List

On Fri, Jan 15, 2021 at 2:50 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> Minor comments, if in the future, you really do want to mess around in sysfs:
>
Thanks! Guess most of these comments apply even if I choose another FS
to mess around with.

> No copyright notice for the file?  While acceptable, odds are your
> corporate lawyers will not be happy with that :(

You are right, will fix.

> > +/*
> > + * Userspace notification interface for debugging tools.
> > + *
> > + * Provide two sysfs files:
> > + *  - /sys/kernel/error_report/last_report
> > + *  - /sys/kernel/error_report/report_count
> > + * that contain the last debugging tool report (taken from dmesg, delimited by
> > + * the error_report_start/error_report_end tracing events) and the total report
> > + * count.
> > + */
> > +
> > +#include <linux/atomic.h>
> > +#include <linux/fs.h>
> > +#include <linux/kobject.h>
> > +#include <linux/string.h>
> > +#include <linux/sysfs.h>
> > +#include <linux/tracepoint.h>
> > +#include <linux/workqueue.h>
> > +#include <trace/events/error_report.h>
> > +#include <trace/events/printk.h>
> > +
> > +static struct kobject *error_report_kobj;
> > +
> > +/* sysfs files are capped at PAGE_SIZE. */
> > +#define BUF_SIZE PAGE_SIZE
> > +/* Two buffers to store the finished report and the report being recorded. */
> > +static char report_buffer[2][BUF_SIZE];
> > +/*
> > + * Total report count. Also serves as a latch for report_buffer:
> > + * report_buffer[num_reports % 2] is the currently available report,
> > + * report_buffer[(num_reports + 1) % 2] is the report being recorded.
> > + */
> > +static atomic_t num_reports;
> > +
> > +/*
> > + * PID of the task currently recording the report, as returned by
> > + * get_encoded_pid(), or -1. Used as a writer lock for report_buffer.
> > + * A regular spinlock couldn't be used here, as probe_console() can be called
> > + * from any thread, and it needs to know whether that thread is holding the
> > + * lock.
> > + */
> > +static atomic_t current_pid = ATOMIC_INIT(-1);
>
> how do you handle pid namespaces?

Doesn't current->pid hold the global PID of the task?
See the description of task_pid_nr() here:
https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L1386,
which is supposed to return a global task ID.

> > +     if (atomic_cmpxchg_acquire(&current_pid, -1, get_encoded_pid()) != -1)
> > +             return;
>
> pid namespaces?
See above.

>
> pid namespaces?
>
Same.

> > +     int idx;
> > +
> > +     if (pid != get_encoded_pid())
> > +             return;
> > +
> > +     idx = (atomic_read(&num_reports) + 1) % 2;
>
> You read, but it could change before:

Not sure I follow. num_reports can be only incremented by the same
task that started the report, this cannot happen concurrently.


>
> > +     if (current_pos == BUF_SIZE)
> > +             report_buffer[idx][current_pos - 1] = 0;
> > +     else
> > +             report_buffer[idx][current_pos] = 0;
> > +
> > +     /* Pairs with acquire in last_report_show(). */
> > +     atomic_inc_return_release(&num_reports);
>
> Not good?

> > +static ssize_t last_report_show(struct kobject *kobj,
> > +                             struct kobj_attribute *attr, char *buf)
> > +{
> > +     ssize_t ret;
> > +     int index;
> > +
> > +     do {
> > +             /* Pairs with release in probe_report_end(). */
> > +             index = atomic_read_acquire(&num_reports);
> > +             /*
> > +              * If index and old_index mismatch, we might be accessing
> > +              * report_buffer concurrently with a writer thread. In that
> > +              * case the read data will be discarded.
> > +              */
> > +             ret = data_race(strscpy(buf, report_buffer[index % 2], BUF_SIZE));
> > +             /*
> > +              * Prevent reordering between the memcpy above and the atomic
> > +              * read below.
> > +              * See the comments in include/linux/seqlock.h for more
> > +              * details.
> > +              */
> > +             smp_rmb();
> > +     } while (index != atomic_read(&num_reports));
>
> endless loops, what could go wrong...
Fair enough, this needs to be fixed.

>
> Why are you rolling your own hacky locks in here?

We've also considered using a seqlock here, but thought that required
too much boilerplate code (the current implementation reuses the
report counter as a seqlock latch, whereas otherwise we'd need to
introduce an extra seqcount_latch_t plus call the seqlock API
functions). I think this can be reconsidered.

> And again, sysfs is "one value" not "one buffer".
>
> > +     return ret;
> > +}
> > +
> > +/*
> > + * read() handler for /sys/kernel/error_report/report_count.
> > + */
> > +static ssize_t report_count_show(struct kobject *kobj,
> > +                              struct kobj_attribute *attr, char *buf)
> > +{
> > +     return scnprintf(buf, PAGE_SIZE, "%d\n", atomic_read(&num_reports));
>
> sysfs_emit()?

Good, haven't seen that one. I think I just took
Documentation/filesystems/sysfs.txt as an example.

> And you just read it, but what keeps it from changing?
Nothing; we can't really guarantee nobody reported another error while
we were processing the previous one.
Similarly, we cannot be sure that any other vfs file still has the
same contents once we read it ;)


> > +static const struct attribute_group error_report_sysfs_attr_group = {
> > +     .attrs = error_report_sysfs_attrs,
> > +};
>
> ATTRIBUTE_GROUPS()?

Ack.

> > +late_initcall(error_report_notify_setup);
>
> You never clean up the kobject or files?
Will fix, thanks!

> Anyway, please move this to tracefs, that's where it belongs.
Will do in v3.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation
  2021-01-15 16:52       ` Steven Rostedt
@ 2021-01-18 10:22         ` Alexander Potapenko
  2021-01-18 14:52           ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-18 10:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Greg KH, LKML, Andrew Morton, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Petr Mladek, Sergey Senozhatsky,
	Linux Memory Management List

> If you do use tracefs, add it to the top level tracing directory (no need
> to have instances of it), and rename it to "kernel_warnings", as
> "error_report" is too close to the existing "error_log" which holds error
> messages of syntactic errors done by users entering in commands to some of
> the special files.

Will do. Is it conventional to add a new dentry* to struct trace_array for that?
If not, maybe it's better to create this dir in
error_report_notify_setup(), like this is done e.g. for
/sys/kernel/tracing/trace_stat - just to loosen the coupling?

> That is, /sys/kernel/tracing/kernel_warnings/ would be your error_report/
> directory you have now.

WDYT about "kernel_errors" or "kernel_error_reports"?
"warnings" suggest we'll be notifying about any occurrence of WARN(),
which is not what we are planning to do.
Also, shall I rename the library/config/etc. accordingly (to e.g.
CONFIG_KERNEL_WARN_NOTIFY)?

> Use the function in kernel/trace/trace.c: tracer_init_tracefs() to add that
> directory. That's for files in the tracefs directory that will not be
> duplicated by instances.
>
> -- Steve



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports
  2021-01-15 13:03 ` [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports Alexander Potapenko
  2021-01-15 13:50   ` Greg KH
@ 2021-01-18 11:38   ` Petr Mladek
  2021-01-18 13:08     ` Alexander Potapenko
  1 sibling, 1 reply; 22+ messages in thread
From: Petr Mladek @ 2021-01-18 11:38 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: linux-kernel, akpm, andreyknvl, dvyukov, mingo, elver, rostedt,
	sergey.senozhatsky, linux-mm

On Fri 2021-01-15 14:03:33, Alexander Potapenko wrote:
> With the introduction of various production error-detection tools, such as
> MTE-based KASAN and KFENCE, the need arises to efficiently notify the
> userspace OS components about kernel errors. Currently, no facility exists
> to notify userspace about a kernel error from such bug-detection tools.
> The problem is obviously not restricted to the above bug detection tools,
> and applies to any error reporting mechanism that does not panic the
> kernel; this series, however, will only add support for KASAN and KFENCE
> reporting.
> 
> +++ b/lib/error_report_notify.c
> +/*
> + * Trace hook for the error_report_start event. In an unlikely case of another
> + * task already printing a report bail out, otherwise save the current pid
> + * together with in_task() return value.

This is not reliable. Some events might get lost.

> + * Because reporting code can be called from low-level routines (e.g. locking
> + * primitives or allocator guts), report recording is implemented using a
> + * seqlock lock-free algorithm.
> + */
> +static void probe_report_start(void *ignore, enum error_detector detector,
> +			       unsigned long id)
> +{
> +	/*
> +	 * Acquire the writer lock. Any racing probe_report_start will not
> +	 * record anything. Pairs with the release in probe_report_end().
> +	 */
> +	if (atomic_cmpxchg_acquire(&current_pid, -1, get_encoded_pid()) != -1)
> +		return;
> +	current_pos = 0;
> +	truncated = false;
> +}
> +
> +/*
> + * Trace hook for the error_report_end event. If an event from the mismatching
> + * error_report_start is received, it is ignored. Otherwise, null-terminate the
> + * buffer, increase the report count (effectively releasing the report to
> + * last_report_show() and schedule a notification about a new report.
> + */
> +static void probe_report_end(void *ignore, enum error_detector detector,
> +			     unsigned long id)
> +{
> +	pid_t pid = atomic_read(&current_pid);
> +	int idx;
> +
> +	if (pid != get_encoded_pid())
> +		return;
> +
> +	idx = (atomic_read(&num_reports) + 1) % 2;
> +	if (current_pos == BUF_SIZE)
> +		report_buffer[idx][current_pos - 1] = 0;
> +	else
> +		report_buffer[idx][current_pos] = 0;
> +
> +	/* Pairs with acquire in last_report_show(). */
> +	atomic_inc_return_release(&num_reports);
> +	schedule_delayed_work(&reporting_done, 0);

Why delayed work when it gets queued immediately?

> +	/*
> +	 * Release the writer lock. Pairs with the acquire in
> +	 * probe_report_start().
> +	 */
> +	atomic_set_release(&current_pid, -1);
> +}
> +
> +/*
> + * Skip one or two leading pair of brackets containing the log timestamp and
> + * the task/CPU ID, plus the leading space, from the report line, e.g.:
> + *   [    0.698431][    T7] BUG: KFENCE: use-after-free ...
> + * becomes:
> + *   BUG: KFENCE: use-after-free ...
> + *
> + * Report size is only 4K, and this boilerplate can easily account for half of
> + * that amount.
> + */
> +static void skip_extra_info(const char **buf, size_t *len)
> +{
> +	int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) +
> +			   IS_ENABLED(CONFIG_PRINTK_CALLER);

The timestamp can be disabled also at runtime by
/sys/module/printk/parameters/time

> +	const char *found;
> +
> +	if (!buf || !len)
> +		return;
> +
> +	while (num_brackets--) {
> +		if (!*len || *buf[0] != '[')
> +			return;
> +		found = strnchr(*buf, *len, ']');
> +		if (!found)
> +			return;
> +		*len -= found - *buf + 1;
> +		*buf = found + 1;
> +	}
> +	if (*len && *buf[0] == ' ') {
> +		++*buf;
> +		--*len;
> +	}
> +}
> +
> +/*
> + * Trace hook for the console event. If a line comes from a task/CPU that did
> + * not send the error_report_start event, that line is ignored. Otherwise, it
> + * is stored in the report_buffer[(num_reports + 1) % 2].
> + *
> + * To save space, the leading timestamps and (when enabled) CPU/task info is
> + * stripped away. The buffer may contain newlines, so this procedure is
> + * repeated for every line.

IMHO, removing the timestamp is a bad idea. It will complicate
matching the message with other events.

I am afraid that some reports would get shrunken anyway.
I suggest to use some lockless ringbuffer to store these
messages. Either ftrace or printk one.

But the biggest problem would be that console might be called
asynchronously from another process. The messages might get
lost in the meantime.

There are many other more reliable solutions:

Either add hook into vprintk_store() and make another copy of the
message into your buffer. But there will still be the problem
that you could not store more reports in parallel. So, it won't
be reliable anyway.

Or add a hook into /dev/kmsg interface. It allows to read any
stored message immediately and quickly. Well, you would need
to have your own reader of this interface.

Or add your own reader of the main printk log.

Or give up on having your own buffers. Instead just inform userspace
about a new report via the pooled interface. And get the messages
from the main log, e.g. via /dev/kmsg.


I agree with Andrew that this might be _very_ dangerous feature.
It is another way how to get messages from kernel. It looks like
a hack how to get certain messages from the main kernel log
into a separate log. This might end up with implementing
yet another printk() or trace_printk().

I suggest to integrate this better with printk():

   + We already have printk_context() that allows to handle
     messages a special way.

   + Just store extra flag into the main log buffer for these
     messages.

   + Allow to read the messages via some interface, either
     /dev/kmsg or a new one. We already store PID and context
     info into the main log buffer. It will allow to put
     relevant lines together.

   + Anyway, I strongly suggest to avoid your own buffer. It looks
     like an overhead. It it would get more complicated when
     there are more users, ...


> + */
> +static void probe_console(void *ignore, const char *buf, size_t len)
> +{
> +	int pid = atomic_read(&current_pid);
> +	size_t to_copy, cur_len;
> +	char *newline;
> +	int idx;
> +
> +	if (pid != get_encoded_pid() || truncated)
> +		return;
> +
> +	idx = (atomic_read(&num_reports) + 1) % 2;
> +	while (len) {
> +		newline = strnchr(buf, len, '\n');
> +		if (newline)
> +			cur_len = newline - buf + 1;
> +		else
> +			cur_len = len;
> +		/* Adjust len now, because skip_extra_info() may change cur_len. */
> +		len -= cur_len;
> +		skip_extra_info(&buf, &cur_len);
> +		to_copy = min(cur_len, BUF_SIZE - current_pos);
> +		memcpy(report_buffer[idx] + current_pos, buf, to_copy);
> +		current_pos += to_copy;
> +		if (cur_len > to_copy) {
> +			truncated = true;
> +			memcpy(report_buffer[idx] + current_pos - sizeof(TRUNC_MSG),
> +			       TRUNC_MSG, sizeof(TRUNC_MSG));
> +			break;
> +		}
> +		buf += cur_len;
> +	}
> +}

I still have to think about it. This feature heavily depends on
printk(). It makes sense to integrate it there. But we also
need to keep it simple because printk() is already pretty
complicated.

Best Regards,
Petr


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports
  2021-01-18 11:38   ` Petr Mladek
@ 2021-01-18 13:08     ` Alexander Potapenko
  2021-01-18 13:14       ` Alexander Potapenko
  2021-01-18 16:43       ` Petr Mladek
  0 siblings, 2 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-18 13:08 UTC (permalink / raw)
  To: Petr Mladek
  Cc: LKML, Andrew Morton, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Steven Rostedt, Sergey Senozhatsky,
	Linux Memory Management List

On Mon, Jan 18, 2021 at 12:38 PM Petr Mladek <pmladek@suse.com> wrote:
Thanks for your input! Some responses below.

>
> On Fri 2021-01-15 14:03:33, Alexander Potapenko wrote:
> > With the introduction of various production error-detection tools, such as
> > MTE-based KASAN and KFENCE, the need arises to efficiently notify the
> > userspace OS components about kernel errors. Currently, no facility exists
> > to notify userspace about a kernel error from such bug-detection tools.
> > The problem is obviously not restricted to the above bug detection tools,
> > and applies to any error reporting mechanism that does not panic the
> > kernel; this series, however, will only add support for KASAN and KFENCE
> > reporting.
> >
> > +++ b/lib/error_report_notify.c
> > +/*
> > + * Trace hook for the error_report_start event. In an unlikely case of another
> > + * task already printing a report bail out, otherwise save the current pid
> > + * together with in_task() return value.
>
> This is not reliable. Some events might get lost.


> > +
> > +     /* Pairs with acquire in last_report_show(). */
> > +     atomic_inc_return_release(&num_reports);
> > +     schedule_delayed_work(&reporting_done, 0);
>
> Why delayed work when it gets queued immediately?

Because error reports may be sent from a place where waiting is
forbidden (in the case of KFENCE that is the page fault handler).
A delayed work ensures the notification occurs in a wait-safe context.


> > +static void skip_extra_info(const char **buf, size_t *len)
> > +{
> > +     int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) +
> > +                        IS_ENABLED(CONFIG_PRINTK_CALLER);
>
> The timestamp can be disabled also at runtime by
> /sys/module/printk/parameters/time

Guess that would result in disabling it for the whole kernel, which is
too aggressive.


> IMHO, removing the timestamp is a bad idea. It will complicate
> matching the message with other events.

Given that we are moving away from sysfs, the report size limit can be
increased, therefore we don't have to remove the timestamp to save
space anymore.

> I am afraid that some reports would get shrunken anyway.
Do you have any particular subsystem in mind? We could probably
increase the report buffer to, say, 16K, to avoid this.

> I suggest to use some lockless ringbuffer to store these
> messages. Either ftrace or printk one.
>
> But the biggest problem would be that console might be called
> asynchronously from another process. The messages might get
> lost in the meantime.

Do you mean that one process may call printk(), but the console probe
for that message will be executed in the context of another process?
If so, that would be a problem. Could it be possible to extend the
console trace event to pass an extra caller id?

>
> There are many other more reliable solutions:
>
> Either add hook into vprintk_store() and make another copy of the
> message into your buffer. But there will still be the problem
> that you could not store more reports in parallel. So, it won't
> be reliable anyway.

At least for now, we do not expect many reports to come in parallel:
this feature is meant for collecting the long tail of memory
corruptions from production devices.

>
> Or add a hook into /dev/kmsg interface. It allows to read any
> stored message immediately and quickly. Well, you would need
> to have your own reader of this interface.

This probably could be done, but isn't the console tracepoint designed
exactly for this?

> Or add your own reader of the main printk log.
>
> Or give up on having your own buffers. Instead just inform userspace
> about a new report via the pooled interface. And get the messages
> from the main log, e.g. via /dev/kmsg.

Yes, that could be an option, although the need to search for
tool-specific strings in kmsg and figure out where the report starts
and ends still persists.
At the end of the day, this sounds like the best approach for now,
that we could later extend to store the logs.


> I agree with Andrew that this might be _very_ dangerous feature.
> It is another way how to get messages from kernel. It looks like
> a hack how to get certain messages from the main kernel log
> into a separate log. This might end up with implementing
> yet another printk() or trace_printk().

Initially our intent was to refactor KFENCE error reporting so that
the whole report is stored in a buffer and can be exposed to userspace
via a VFS entry.
Then it turned out we'd need something similar for KASAN, which
would've also resulted in a major rewrite.
So we thought a much more generic solution would be to let the tools
"tee" their output to a specific file on demand.
Yes, sounds like duplicating the kernel log :)
On the other hand, we never really planned to extend printk or
duplicate parts of it, so unless trace_console() is a good fit we'd
probably refrain from storing the logs for now.

> I suggest to integrate this better with printk():
>
>    + We already have printk_context() that allows to handle
>      messages a special way.
>
>    + Just store extra flag into the main log buffer for these
>      messages.
>
>    + Allow to read the messages via some interface, either
>      /dev/kmsg or a new one. We already store PID and context
>      info into the main log buffer. It will allow to put
>      relevant lines together.
>
>    + Anyway, I strongly suggest to avoid your own buffer. It looks
>      like an overhead. It it would get more complicated when
>      there are more users, ...

Having the special flag for the error reports is handy, but given the
existence of CONFIG_PRINTK_CALLER may not be worth the effort, at
least for now.

>
>
> I still have to think about it. This feature heavily depends on
> printk(). It makes sense to integrate it there. But we also
> need to keep it simple because printk() is already pretty
> complicated.
>
> Best Regards,
> Petr



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports
  2021-01-18 13:08     ` Alexander Potapenko
@ 2021-01-18 13:14       ` Alexander Potapenko
  2021-01-18 16:43       ` Petr Mladek
  1 sibling, 0 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-18 13:14 UTC (permalink / raw)
  To: Petr Mladek
  Cc: LKML, Andrew Morton, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Steven Rostedt, Sergey Senozhatsky,
	Linux Memory Management List

> > > +
> > > +     /* Pairs with acquire in last_report_show(). */
> > > +     atomic_inc_return_release(&num_reports);
> > > +     schedule_delayed_work(&reporting_done, 0);
> >
> > Why delayed work when it gets queued immediately?
>
> Because error reports may be sent from a place where waiting is
> forbidden (in the case of KFENCE that is the page fault handler).
> A delayed work ensures the notification occurs in a wait-safe context.
>
Agreed, there's actually no point in calling schedule_delayed_work()
instead of schedule_work() here.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation
  2021-01-18 10:22         ` Alexander Potapenko
@ 2021-01-18 14:52           ` Steven Rostedt
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2021-01-18 14:52 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Greg KH, LKML, Andrew Morton, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Petr Mladek, Sergey Senozhatsky,
	Linux Memory Management List

On Mon, 18 Jan 2021 11:22:27 +0100
Alexander Potapenko <glider@google.com> wrote:

> > If you do use tracefs, add it to the top level tracing directory (no need
> > to have instances of it), and rename it to "kernel_warnings", as
> > "error_report" is too close to the existing "error_log" which holds error
> > messages of syntactic errors done by users entering in commands to some of
> > the special files.  
> 
> Will do. Is it conventional to add a new dentry* to struct trace_array for that?
> If not, maybe it's better to create this dir in
> error_report_notify_setup(), like this is done e.g. for
> /sys/kernel/tracing/trace_stat - just to loosen the coupling?

Yeah, keep the dentry static in your own code. No need to add it to the
trace_array. The dentry's in the trace array are for instances (as there
are more than one of instance of them).

> 
> > That is, /sys/kernel/tracing/kernel_warnings/ would be your error_report/
> > directory you have now.  
> 
> WDYT about "kernel_errors" or "kernel_error_reports"?
> "warnings" suggest we'll be notifying about any occurrence of WARN(),
> which is not what we are planning to do.
> Also, shall I rename the library/config/etc. accordingly (to e.g.
> CONFIG_KERNEL_WARN_NOTIFY)?

I'm fine with "kernel_errors" as that helps to differentiate what they are.
Also, you may want to add a field about this in the documentation under
Documentation/trace/ftrace.rst (just a reference) and add a separate file
called, kernel_errors.rst.

-- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports
  2021-01-18 13:08     ` Alexander Potapenko
  2021-01-18 13:14       ` Alexander Potapenko
@ 2021-01-18 16:43       ` Petr Mladek
  2021-01-21 13:13         ` Alexander Potapenko
  1 sibling, 1 reply; 22+ messages in thread
From: Petr Mladek @ 2021-01-18 16:43 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: LKML, Andrew Morton, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Steven Rostedt, Sergey Senozhatsky,
	Linux Memory Management List

On Mon 2021-01-18 14:08:19, Alexander Potapenko wrote:
> On Mon, Jan 18, 2021 at 12:38 PM Petr Mladek <pmladek@suse.com> wrote:
> Thanks for your input! Some responses below.
> 
> >
> > On Fri 2021-01-15 14:03:33, Alexander Potapenko wrote:
> > > With the introduction of various production error-detection tools, such as
> > > MTE-based KASAN and KFENCE, the need arises to efficiently notify the
> > > userspace OS components about kernel errors. Currently, no facility exists
> > > to notify userspace about a kernel error from such bug-detection tools.
> > > The problem is obviously not restricted to the above bug detection tools,
> > > and applies to any error reporting mechanism that does not panic the
> > > kernel; this series, however, will only add support for KASAN and KFENCE
> > > reporting.
> > >
> > > +++ b/lib/error_report_notify.c
> > > +/*
> > > + * Trace hook for the error_report_start event. In an unlikely case of another
> > > + * task already printing a report bail out, otherwise save the current pid
> > > + * together with in_task() return value.
> >
> > This is not reliable. Some events might get lost.
> 
> 
> > > +
> > > +     /* Pairs with acquire in last_report_show(). */
> > > +     atomic_inc_return_release(&num_reports);
> > > +     schedule_delayed_work(&reporting_done, 0);
> >
> > Why delayed work when it gets queued immediately?
> 
> Because error reports may be sent from a place where waiting is
> forbidden (in the case of KFENCE that is the page fault handler).
> A delayed work ensures the notification occurs in a wait-safe context.
> 
> 
> > > +static void skip_extra_info(const char **buf, size_t *len)
> > > +{
> > > +     int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) +
> > > +                        IS_ENABLED(CONFIG_PRINTK_CALLER);
> >
> > The timestamp can be disabled also at runtime by
> > /sys/module/printk/parameters/time
> 
> Guess that would result in disabling it for the whole kernel, which is
> too aggressive.

I wanted to say that it was not enough to check
IS_ENABLED(CONFIG_PRINTK_TIME). Anyway, I think that you will not use
this code in the end.


> > IMHO, removing the timestamp is a bad idea. It will complicate
> > matching the message with other events.
> 
> Given that we are moving away from sysfs, the report size limit can be
> increased, therefore we don't have to remove the timestamp to save
> space anymore.
> 
> > I am afraid that some reports would get shrunken anyway.
> Do you have any particular subsystem in mind? We could probably

printk() is storing recursive messages and also messages in NMI
context into temporary per-CPU buffers. They include just a plain
text (no timestamps).

The original size was 4kB but it caused shrinking some backtraces.
They are 8kB by default now. But they still shrink, for example,
lockdep report.

I am not sure about the KASAN or KFENCE reports. But the
experience is that amount of debug messages is just growing year
over year.

JFYI, these temporary buffers should get removed once the new lockless
ringbuffer is fully integrated.


> increase the report buffer to, say, 16K, to avoid this.
> 
> > I suggest to use some lockless ringbuffer to store these
> > messages. Either ftrace or printk one.
> >
> > But the biggest problem would be that console might be called
> > asynchronously from another process. The messages might get
> > lost in the meantime.
> 
> Do you mean that one process may call printk(), but the console probe
> for that message will be executed in the context of another process?

Yes, this might happen. And it is going to be the default behavior.
The current solution might cause softlockups. John Ogness is working
on offloading the console handling into separate kthreads. It means
that call_console_drivers() will always get called by a separate
process.

> If so, that would be a problem. Could it be possible to extend the
> console trace event to pass an extra caller id?

We should pass the messages some other way. The good news is
that the caller id is always stored in the new lockless
ring buffer that has been integreated into 4.10. It is
used intarnally for handling continuous lines.

The caller id is printed on the console only when the configure
option is enabled. But we should create another interface for
these debugging dumps anyway.

I would implement one more dumper, like kmsg_dump(),
syslog_print_all(), console_unlock(). They cycle over all/recently
stored messages and print them.

console_unlock() filter the messages according to console_loglevel.
The new funtion might fitler them by the flag specific for
the interested tools, like KASAN.

> > Or add a hook into /dev/kmsg interface. It allows to read any
> > stored message immediately and quickly. Well, you would need
> > to have your own reader of this interface.
> 
> This probably could be done, but isn't the console tracepoint designed
> exactly for this?

Hmm, the tracepoint has been added in 2011 by the commit
95100358491abaa2 ("printk/tracing: Add console output tracing").

It was a time when the messages were added into the main logbuffer
one-by-one character. I guess that it was easier to add the
tracing here because the console code already found the beginning
and end of each message.

/dev/kmsg interface has been added later. Also the messages were
later stored as records (one message with its metadata) into
the log buffer. So, it might be done an easier way now.


> Initially our intent was to refactor KFENCE error reporting so that
> the whole report is stored in a buffer and can be exposed to userspace
> via a VFS entry.
> Then it turned out we'd need something similar for KASAN, which
> would've also resulted in a major rewrite.
> So we thought a much more generic solution would be to let the tools
> "tee" their output to a specific file on demand.

Thanks for the background.

> Yes, sounds like duplicating the kernel log :)
> On the other hand, we never really planned to extend printk or
> duplicate parts of it, so unless trace_console() is a good fit we'd
> probably refrain from storing the logs for now.

IMHO, using the tracepoint for your purpose looks too complicated
and less reliable. It might be interesting for 3rd-party code because
the tracepoint is rather stable. But we might do it better in upstream.


> Having the special flag for the error reports is handy, but given the
> existence of CONFIG_PRINTK_CALLER may not be worth the effort, at
> least for now.

Of course, the less we need to complicate the printk() code the
better. So if you could find the right messages using a normal
interface, it would be great.

That said, there are currently the following ways to read the messages:

   + consoles
   + syslog syscall (man 2 syslog)
   + /dev/kmsg
   + kmsg dumpers

Three of them use record_print_text() to format the message. It means
that they could show the caller_id in the brackets.

The only exception is /dev/kmsg. It uses info_print_ext_header()
and msg_print_ext_body() to format the output. It currently does
not show caller_id. Nobody wanted it yet. We could update it.
We just need to be careful because systemd journald uses this
interface and relies on the format.

Best Regards,
Petr


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools
  2021-01-15 13:09   ` Alexander Potapenko
@ 2021-01-21 12:56     ` Alexander Potapenko
  0 siblings, 0 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-21 12:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: LKML, Andrew Morton, Linux API, Andrey Konovalov, Dmitriy Vyukov,
	Ingo Molnar, Marco Elver, Petr Mladek, Steven Rostedt,
	Sergey Senozhatsky, Linux Memory Management List

On Fri, Jan 15, 2021 at 2:09 PM Alexander Potapenko <glider@google.com> wrote:
>
> On Fri, Jan 15, 2021 at 2:06 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Should have CCd linux-api@, please do next time
> Thanks, will do!
> Shall I also CC the v2 ABI patch explicitly?

I'll be dropping the sysfs changes from this patchset, but assume it
still makes sense to CC linux-api@ since v3 will be introducing
tracepoints.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports
  2021-01-18 16:43       ` Petr Mladek
@ 2021-01-21 13:13         ` Alexander Potapenko
  0 siblings, 0 replies; 22+ messages in thread
From: Alexander Potapenko @ 2021-01-21 13:13 UTC (permalink / raw)
  To: Petr Mladek, Andrew Morton, Steven Rostedt, Greg Kroah-Hartman
  Cc: LKML, Andrey Konovalov, Dmitriy Vyukov, Ingo Molnar, Marco Elver,
	Sergey Senozhatsky, Linux Memory Management List

Thank you all for the comments!

Since concerns have been raised that another error reporting system
may quickly go out of control, we've decided to not pursue this
solution any further.
Instead, we will try to rely on existing ftrace mechanisms to deliver
notifications to the userspace, and perform dmesg collection/parsing
once a notification is received.
As a result, we won't need the error_report_start tracepoint anytime
soon (I didn't rename error_report_end to something like error_report
anticipating that error_report_start may be needed in the future).
I've also dropped the error_report_notify library and the
corresponding documentation from the patchset.


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-01-21 13:13 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-15 13:03 [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Alexander Potapenko
2021-01-15 13:03 ` [PATCH v2 1/5] tracing: add error_report trace points Alexander Potapenko
2021-01-15 13:03 ` [PATCH v2 2/5] lib: add error_report_notify to collect debugging tools' reports Alexander Potapenko
2021-01-15 13:50   ` Greg KH
2021-01-15 17:17     ` Alexander Potapenko
2021-01-18 11:38   ` Petr Mladek
2021-01-18 13:08     ` Alexander Potapenko
2021-01-18 13:14       ` Alexander Potapenko
2021-01-18 16:43       ` Petr Mladek
2021-01-21 13:13         ` Alexander Potapenko
2021-01-15 13:03 ` [PATCH v2 3/5] docs: ABI: add /sys/kernel/error_report/ documentation Alexander Potapenko
2021-01-15 13:45   ` Greg KH
2021-01-15 15:26     ` Alexander Potapenko
2021-01-15 15:45       ` Greg KH
2021-01-15 16:52       ` Steven Rostedt
2021-01-18 10:22         ` Alexander Potapenko
2021-01-18 14:52           ` Steven Rostedt
2021-01-15 13:03 ` [PATCH v2 4/5] kfence: use error_report_start and error_report_end tracepoints Alexander Potapenko
2021-01-15 13:03 ` [PATCH v2 5/5] kasan: " Alexander Potapenko
2021-01-15 13:06 ` [PATCH v2 0/5] Add sysfs interface to collect reports from debugging tools Vlastimil Babka
2021-01-15 13:09   ` Alexander Potapenko
2021-01-21 12:56     ` Alexander Potapenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).