All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/8] kernel: taint when the driver firmware crashes
@ 2020-05-26 14:58 Luis Chamberlain
  2020-05-26 14:58 ` [PATCH v3 1/8] kernel.h: move taint and system state flags to uapi Luis Chamberlain
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, Luis Chamberlain

To those new on CC -- this is intended to be a simple generic interface
to the kernel to annotate when the firwmare has crashed leaving the
driver or system in a questionable state, in the worst case requiring
full system reboot. This series is first addressing only a few
networking patches, however, I already have an idea of where such
firmware crashes happen across the tree. The goal with this series then
is to first introduce the simple framework, and only if that moves
forward will I continue to chug on with the rest of the drivers /
subsystems.

This is *not* a networking specific problem only.

This v3 augments the last series by introducing the uevent for panic
events, one of them is during tainting. The uvent mechanism is
independent from any of this firmware taint mechanism. I've also
addressed Jessica Yu's feedback. Given I've extended the patches a bit
with other minor cleanup which checkpatch.pl complains over, and since
this infrastructure is still being discussed, I've trimmed the patch
series size to only cover drivers for which I've received an Acked-by
from the respective driver maintainer, or where we have bug reports to
support such dire situations on the driver such as ath10k.

During the last v2 it was discussed that we should instead use devlink
for this work, however the initial RFC patches produced by Jakub
Kicinski [0] shows how devlink is networking specific, and the intent
behind this series is to produce simple helpers which can be used by *any*
device driver, for any subsystem, not just networking. Subsystem
specific infrastructure to help address firwmare crashes may still make
sense, however that does not mean we *don't* need something even more
generic regardless of the subsystem the issue happens on. Since uevents
for taints are exposed, we now expose these through uapi as well, and
that was something which eventually had to happen given that the current
scheme of relying on sensible character representations for each taint
will not scale beyond the alphabet.

This series is avaialble my 20200526-taint-firmware-net-intro branch, based on
linux-next tag next-20200526 [1].

[0] https://lkml.kernel.org/r/20200519211531.3702593-1-kuba@kernel.org
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200526-taint-firmware-net-intro

Luis Chamberlain (7):
  kernel.h: move taint and system state flags to uapi
  panic: add uevent support
  taint: add firmware crash taint support
  panic: make taint data type clearer
  ath10k: use new taint_firmware_crashed()
  liquidio: use new taint_firmware_crashed()
  qed: use new taint_firmware_crashed()

Vasundhara Volam (1):
  bnxt_en: use new taint_firmware_crashed()

 Documentation/admin-guide/tainted-kernels.rst |   6 +
 MAINTAINERS                                   |   8 +
 .../net/ethernet/broadcom/bnxt/bnxt_devlink.c |   1 +
 .../net/ethernet/cavium/liquidio/lio_main.c   |   1 +
 drivers/net/ethernet/qlogic/qed/qed_mcp.c     |   1 +
 drivers/net/wireless/ath/ath10k/pci.c         |   2 +
 drivers/net/wireless/ath/ath10k/sdio.c        |   2 +
 drivers/net/wireless/ath/ath10k/snoc.c        |   1 +
 include/asm-generic/bug.h                     |   4 +-
 include/linux/kernel.h                        |  40 +--
 include/linux/module.h                        |  13 +
 include/linux/panic_events.h                  |  26 ++
 include/trace/events/module.h                 |   3 +-
 include/uapi/linux/kernel.h                   |  36 +++
 include/uapi/linux/panic_events.h             |  17 ++
 init/main.c                                   |   1 +
 kernel/Makefile                               |   1 +
 kernel/module.c                               |  13 +-
 kernel/panic.c                                |  16 +-
 kernel/panic_events.c                         | 289 ++++++++++++++++++
 lib/Kconfig.debug                             |  13 +
 tools/debugging/kernel-chktaint               |   7 +
 22 files changed, 454 insertions(+), 47 deletions(-)
 create mode 100644 include/linux/panic_events.h
 create mode 100644 include/uapi/linux/panic_events.h
 create mode 100644 kernel/panic_events.c

-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3 1/8] kernel.h: move taint and system state flags to uapi
  2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
@ 2020-05-26 14:58 ` Luis Chamberlain
  2020-05-26 14:58 ` [PATCH v3 2/8] panic: add uevent support Luis Chamberlain
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, Luis Chamberlain

The taint and system state flags will be used in a subsequent patch
exposing these to userspace, so move them to uapi. We keep the
TAINT_FLAGS_COUNT outside of uapi, as this value can change per release.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/kernel.h      | 34 +---------------------------------
 include/uapi/linux/kernel.h | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 82d91547d122..337634363d00 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -569,40 +569,8 @@ extern unsigned long get_taint(void);
 extern int root_mountflags;
 
 extern bool early_boot_irqs_disabled;
+extern enum system_states  system_state;
 
-/*
- * Values used for system_state. Ordering of the states must not be changed
- * as code checks for <, <=, >, >= STATE.
- */
-extern enum system_states {
-	SYSTEM_BOOTING,
-	SYSTEM_SCHEDULING,
-	SYSTEM_RUNNING,
-	SYSTEM_HALT,
-	SYSTEM_POWER_OFF,
-	SYSTEM_RESTART,
-	SYSTEM_SUSPEND,
-} system_state;
-
-/* This cannot be an enum because some may be used in assembly source. */
-#define TAINT_PROPRIETARY_MODULE	0
-#define TAINT_FORCED_MODULE		1
-#define TAINT_CPU_OUT_OF_SPEC		2
-#define TAINT_FORCED_RMMOD		3
-#define TAINT_MACHINE_CHECK		4
-#define TAINT_BAD_PAGE			5
-#define TAINT_USER			6
-#define TAINT_DIE			7
-#define TAINT_OVERRIDDEN_ACPI_TABLE	8
-#define TAINT_WARN			9
-#define TAINT_CRAP			10
-#define TAINT_FIRMWARE_WORKAROUND	11
-#define TAINT_OOT_MODULE		12
-#define TAINT_UNSIGNED_MODULE		13
-#define TAINT_SOFTLOCKUP		14
-#define TAINT_LIVEPATCH			15
-#define TAINT_AUX			16
-#define TAINT_RANDSTRUCT		17
 #define TAINT_FLAGS_COUNT		18
 #define TAINT_FLAGS_MAX			((1UL << TAINT_FLAGS_COUNT) - 1)
 
diff --git a/include/uapi/linux/kernel.h b/include/uapi/linux/kernel.h
index 0ff8f7477847..4bbd4093eb64 100644
--- a/include/uapi/linux/kernel.h
+++ b/include/uapi/linux/kernel.h
@@ -12,4 +12,39 @@
 
 #define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
 
+/*
+ * Values used for system_state. Ordering of the states must not be changed
+ * as code checks for <, <=, >, >= STATE.
+ */
+enum system_states {
+	SYSTEM_BOOTING,
+	SYSTEM_SCHEDULING,
+	SYSTEM_RUNNING,
+	SYSTEM_HALT,
+	SYSTEM_POWER_OFF,
+	SYSTEM_RESTART,
+	SYSTEM_SUSPEND,
+};
+
+/* This cannot be an enum because some may be used in assembly source. */
+#define TAINT_PROPRIETARY_MODULE	0
+#define TAINT_FORCED_MODULE		1
+#define TAINT_CPU_OUT_OF_SPEC		2
+#define TAINT_FORCED_RMMOD		3
+#define TAINT_MACHINE_CHECK		4
+#define TAINT_BAD_PAGE			5
+#define TAINT_USER			6
+#define TAINT_DIE			7
+#define TAINT_OVERRIDDEN_ACPI_TABLE	8
+#define TAINT_WARN			9
+#define TAINT_CRAP			10
+#define TAINT_FIRMWARE_WORKAROUND	11
+#define TAINT_OOT_MODULE		12
+#define TAINT_UNSIGNED_MODULE		13
+#define TAINT_SOFTLOCKUP		14
+#define TAINT_LIVEPATCH			15
+#define TAINT_AUX			16
+#define TAINT_RANDSTRUCT		17
+/* be sure to update TAINT_FLAGS_COUNT when extending this */
+
 #endif /* _UAPI_LINUX_KERNEL_H */
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 2/8] panic: add uevent support
  2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
  2020-05-26 14:58 ` [PATCH v3 1/8] kernel.h: move taint and system state flags to uapi Luis Chamberlain
@ 2020-05-26 14:58 ` Luis Chamberlain
  2020-05-31  4:46   ` kbuild test robot
  2020-06-03 17:55   ` kernel test robot
  2020-05-26 14:58 ` [PATCH v3 3/8] taint: add firmware crash taint support Luis Chamberlain
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, Luis Chamberlain

Add support to let panic events such as taints trigger uevents.
If you don't have a journalctl -k -f window going and you're not
actively monitoring what is going on in the kernel you may not
realize that your kernel is hosed. Only those clueful that
something may be off might inspect the kernel logs. Since
not everyone is, add support to throw some bones to userspace
that something might be fishy.

Let userspace figure out how to inform users, and what to do.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 MAINTAINERS                       |   8 +
 include/linux/panic_events.h      |  26 +++
 include/uapi/linux/panic_events.h |  17 ++
 init/main.c                       |   1 +
 kernel/Makefile                   |   1 +
 kernel/module.c                   |   2 +
 kernel/panic.c                    |   7 +-
 kernel/panic_events.c             | 289 ++++++++++++++++++++++++++++++
 lib/Kconfig.debug                 |  13 ++
 9 files changed, 363 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/panic_events.h
 create mode 100644 include/uapi/linux/panic_events.h
 create mode 100644 kernel/panic_events.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3a003f310574..5bb467c0b36f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12876,6 +12876,14 @@ L:	platform-driver-x86@vger.kernel.org
 S:	Maintained
 F:	drivers/platform/x86/panasonic-laptop.c
 
+PANIC EVENTS
+M:	Luis Chamberlain <mcgrof@kernel.org>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+F:	include/linux/panic_events.h
+F:	include/uapi/linux/panic_events.h
+F:	kernel/panic_events.c
+
 PARALLAX PING IIO SENSOR DRIVER
 M:	Andreas Klinger <ak@it-klinger.de>
 L:	linux-iio@vger.kernel.org
diff --git a/include/linux/panic_events.h b/include/linux/panic_events.h
new file mode 100644
index 000000000000..8e7e331ecaaf
--- /dev/null
+++ b/include/linux/panic_events.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+#ifndef _LINUX_PANIC_EVENTS_H
+
+#include <linux/module.h>
+#include <uapi/linux/panic_events.h>
+
+#ifdef CONFIG_PANIC_EVENTS
+
+void panic_uevent(enum panic_uevent event);
+void panic_uevent_taint(unsigned int flag, struct module *mod);
+
+#else
+static inline void panic_events_init(void)
+{
+}
+
+static inline panic_uevent(enum panic_uevent event)
+{
+}
+
+static inline void panic_uevent_taint(unsigned int flag, struct module *mod)
+{
+}
+#endif
+
+#endif /* _LINUX_PANIC_EVENTS_H */
diff --git a/include/uapi/linux/panic_events.h b/include/uapi/linux/panic_events.h
new file mode 100644
index 000000000000..4f409597be52
--- /dev/null
+++ b/include/uapi/linux/panic_events.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+#ifndef _UAPI_PANIC_EVENTS_H
+#define _UAPI_PANIC_EVENTS_H
+
+/**
+ * enum panic_uevent - panic uevents
+ *
+ * @PANIC_LOCKDEP_DISABLED: lockdep has been disabled
+ * @PANIC_TAINT: lockdep has been disabled
+ */
+enum panic_uevent {
+	PANIC_LOCKDEP_DISABLED,
+	PANIC_TAINT,
+	__PANIC_MAX, /* non-ABI */
+};
+
+#endif /* _UAPI_PANIC_EVENTS_H */
diff --git a/init/main.c b/init/main.c
index 0ead83e86b5a..2bf33b5ec7b4 100644
--- a/init/main.c
+++ b/init/main.c
@@ -96,6 +96,7 @@
 #include <linux/jump_label.h>
 #include <linux/mem_encrypt.h>
 #include <linux/kcsan.h>
+#include <linux/panic_events.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
diff --git a/kernel/Makefile b/kernel/Makefile
index 0bd4ed7ca157..0c1794818579 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -12,6 +12,7 @@ obj-y     = fork.o exec_domain.o panic.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
 	    async.o range.o smpboot.o ucount.o
 
+obj-$(CONFIG_PANIC_EVENTS) += panic_events.o
 obj-$(CONFIG_MODULES) += kmod.o
 obj-$(CONFIG_MULTIUSER) += groups.o
 
diff --git a/kernel/module.c b/kernel/module.c
index 128bfc3e7ada..9b85d58441a2 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -56,6 +56,7 @@
 #include <linux/bsearch.h>
 #include <linux/dynamic_debug.h>
 #include <linux/audit.h>
+#include <linux/panic_events.h>
 #include <uapi/linux/module.h>
 #include "module-internal.h"
 
@@ -330,6 +331,7 @@ static inline void add_taint_module(struct module *mod, unsigned flag,
 {
 	add_taint(flag, lockdep_ok);
 	set_bit(flag, &mod->taints);
+	panic_uevent_taint(flag, mod);
 }
 
 /*
diff --git a/kernel/panic.c b/kernel/panic.c
index e2157ca387c8..48e9e2efa5bb 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -30,6 +30,7 @@
 #include <linux/console.h>
 #include <linux/bug.h>
 #include <linux/ratelimit.h>
+#include <linux/panic_events.h>
 #include <linux/debugfs.h>
 #include <asm/sections.h>
 
@@ -440,8 +441,10 @@ unsigned long get_taint(void)
  */
 void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)
 {
-	if (lockdep_ok == LOCKDEP_NOW_UNRELIABLE && __debug_locks_off())
+	if (lockdep_ok == LOCKDEP_NOW_UNRELIABLE && __debug_locks_off()) {
 		pr_warn("Disabling lock debugging due to kernel taint\n");
+		panic_uevent(PANIC_LOCKDEP_DISABLED);
+	}
 
 	set_bit(flag, &tainted_mask);
 
@@ -449,6 +452,8 @@ void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)
 		panic_on_taint = 0;
 		panic("panic_on_taint set ...");
 	}
+
+	panic_uevent_taint(flag, NULL);
 }
 EXPORT_SYMBOL(add_taint);
 
diff --git a/kernel/panic_events.c b/kernel/panic_events.c
new file mode 100644
index 000000000000..5a53a1b1fd9a
--- /dev/null
+++ b/kernel/panic_events.c
@@ -0,0 +1,289 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/panic_events.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+/**
+ *  DOC: Panic events
+ *
+ *  Panic events are sent to userspace to inform userspace that something
+ *  critical has happened to the kernel. These events can happen in any
+ *  context, and so to send these events to userspace we preallocate memory
+ *  needed during initialization as needed for operation. The events are
+ *  queued and later dispatched. The uevents sent are best effort, if we are
+ *  short of memory kobject_uevent_env() can fail.
+ */
+
+/* The max amount of lines on a uevent we support */
+#define PANIC_UEVENT_MAX_LINES	8
+
+/* We assume each possible CPU can trigger these events */
+#define PANIC_MAX_EVENTS_PER_CPU 4
+
+/* The max number of concurrent uvents we support, otherwise we drop events */
+#define PANIC_NUM_CACHE_EVENTS (num_possible_cpus() * PANIC_MAX_EVENTS_PER_CPU)
+
+static LIST_HEAD(panic_free_list);
+static spinlock_t free_lock;
+
+static LIST_HEAD(panic_pend_list);
+static spinlock_t pend_lock;
+
+struct panic_event {
+	enum panic_uevent uevent;
+	char module_name[MODULE_NAME_LEN];
+	enum system_states sys_state;
+	unsigned int flag;
+	struct list_head list;
+};
+
+static struct panic_event *panic_events;
+
+static struct kset *panic_kset;
+static struct kobj_type empty_ktype;
+static struct kobject *panic_events_kobj;
+static struct kobject _panic_events_kobj;
+
+static void panic_process_events(struct work_struct *work);
+static DECLARE_WORK(panic_events_work, panic_process_events);
+
+static char (*panic_envp)[PATH_MAX];
+/* Protects panic_envp */
+static DEFINE_MUTEX(panic_mutex);
+
+struct panic_event *get_free_panic_event(void)
+{
+	struct panic_event *event = NULL;
+
+	spin_lock(&free_lock);
+	if (list_empty(&panic_free_list)) {
+		pr_warn_once("Not enough free panic pool events, we need to bump PANIC_NUM_CACHE_EVENTS, please report this\n");
+		goto out;
+	}
+	event = list_first_entry(&panic_free_list, struct panic_event, list);
+	list_del_init(&event->list);
+
+out:
+	spin_unlock(&free_lock);
+
+	return event;
+}
+
+static void queue_panic_event(struct panic_event *event)
+{
+	spin_lock(&pend_lock);
+	list_add_tail(&event->list, &panic_pend_list);
+	spin_unlock(&pend_lock);
+}
+
+struct panic_event *get_pend_panic_event(void)
+{
+	struct panic_event *event = NULL;
+
+	spin_lock(&pend_lock);
+	if (list_empty(&panic_pend_list))
+		goto out;
+
+	event = list_first_entry(&panic_pend_list, struct panic_event, list);
+	list_del_init(&event->list);
+
+out:
+	spin_unlock(&pend_lock);
+
+	return event;
+}
+
+static void panic_send_event(struct panic_event *event)
+{
+	unsigned int idx = 0, i;
+	bool pending = false;
+	char *envp[PANIC_UEVENT_MAX_LINES];
+	int r;
+
+	mutex_lock(&panic_mutex);
+	memset(panic_envp, 0, PATH_MAX * PANIC_UEVENT_MAX_LINES);
+	snprintf(panic_envp[idx++], PATH_MAX, "SYSTEM_STATE=%d",
+		 event->sys_state);
+	snprintf(panic_envp[idx++], PATH_MAX, "EVENT=%d", event->uevent);
+
+	if (event->uevent == PANIC_LOCKDEP_DISABLED)
+		goto out_send;
+
+	/*
+	 * add_taint_module() will trigger two uevents, one for the kernel,
+	 * and another for the module, if the module was not built-in.
+	 */
+	if (event->uevent == PANIC_TAINT) {
+		snprintf(panic_envp[idx++], PATH_MAX, "TAINT=%d", event->flag);
+		if (strcmp(event->module_name, "") != 0)
+			snprintf(panic_envp[idx++], PATH_MAX, "MODULE_NAME=%s",
+				 event->module_name);
+	}
+
+out_send:
+	for (i = 0; i < idx; i++)
+		envp[i] = panic_envp[i];
+	envp[idx] = NULL;
+
+	r = kobject_uevent_env(panic_events_kobj, KOBJ_CHANGE, envp);
+	if (!r)
+		pr_debug("failed to sent uevent: %d\n", event->uevent);
+	mutex_unlock(&panic_mutex);
+
+	memset(event, 0, sizeof(struct panic_event));
+
+	spin_lock(&free_lock);
+	list_add_tail(&event->list, &panic_free_list);
+	spin_unlock(&free_lock);
+
+	spin_lock(&pend_lock);
+	if (!list_empty(&panic_pend_list))
+		pending = true;
+	spin_unlock(&pend_lock);
+
+	if (pending)
+		schedule_work(&panic_events_work);
+}
+
+static void panic_process_events(struct work_struct *work)
+{
+	struct panic_event *event;
+	bool pending = false;
+
+	event = get_pend_panic_event();
+	if (!event)
+		goto out;
+
+	panic_send_event(event);
+
+out:
+	spin_lock(&pend_lock);
+	if (!list_empty(&panic_pend_list))
+		pending = true;
+	spin_unlock(&pend_lock);
+
+	if (pending)
+		schedule_work(&panic_events_work);
+}
+
+/* For simple panic uvents which only need an event type specified */
+void panic_uevent(enum panic_uevent uevent)
+{
+	struct panic_event *event;
+
+	if (!panic_events_kobj)
+		return;
+
+	event = get_free_panic_event();
+	if (!event)
+		return;
+
+	event->uevent = uevent;
+	event->sys_state = system_state;
+
+	queue_panic_event(event);
+	schedule_work(&panic_events_work);
+}
+
+void panic_uevent_taint(unsigned int flag, struct module *mod)
+{
+	struct panic_event *event;
+
+	if (!panic_events_kobj)
+		return;
+
+	event = get_free_panic_event();
+	if (!event)
+		return;
+
+	event->uevent = PANIC_TAINT;
+	event->sys_state = system_state;
+	event->flag = flag;
+
+	if (mod)
+		strncpy(event->module_name, module_name(mod), MODULE_NAME_LEN);
+
+	queue_panic_event(event);
+	schedule_work(&panic_events_work);
+}
+
+static __init void panic_events_init(void)
+{
+	struct panic_event *event;
+	char *envp[2];
+	unsigned int i;
+	size_t used;
+	int r;
+
+	spin_lock_init(&free_lock);
+	spin_lock_init(&pend_lock);
+
+	mutex_lock(&panic_mutex);
+
+	panic_envp = kzalloc(PATH_MAX * PANIC_UEVENT_MAX_LINES, GFP_KERNEL);
+	if (!panic_envp)
+		goto out_unlock;
+
+	panic_events = kzalloc(sizeof(struct panic_event) *
+			       PANIC_NUM_CACHE_EVENTS, GFP_KERNEL);
+	if (!panic_events)
+		goto out_env;
+
+	for (i = 0; i < PANIC_NUM_CACHE_EVENTS; i++) {
+		event = &panic_events[i];
+		list_add_tail(&event->list, &panic_free_list);
+	}
+
+	snprintf(panic_envp[0], PATH_MAX, "MAX_EVENT_SUPPORTED=%d",
+		 __PANIC_MAX-1);
+
+	envp[0] = panic_envp[0];
+	envp[1] = NULL;
+
+	panic_kset = kset_create_and_add("panic", NULL, kernel_kobj);
+	if (!panic_kset)
+		goto out_events;
+
+	_panic_events_kobj.kset = panic_kset;
+
+	r = kobject_init_and_add(&_panic_events_kobj, &empty_ktype,
+				 NULL, "events");
+	if (r) {
+		pr_warn("Could not add panic events kobject\n");
+		goto out_kset;
+	}
+
+	/* Without this set this infrastructure is ignored */
+	panic_events_kobj = &_panic_events_kobj;
+
+	/* Inform userspace which events we'll send uevents for */
+	r = kobject_uevent_env(panic_events_kobj, KOBJ_ADD, envp);
+	if (r != 0)
+		pr_debug("failed to send first event\n");
+
+	mutex_unlock(&panic_mutex);
+
+	used = (PATH_MAX * PANIC_UEVENT_MAX_LINES) +
+	       (sizeof(struct panic_event) * PANIC_NUM_CACHE_EVENTS);
+
+	pr_info("initialized using %zu preallocated bytes\n", used);
+
+	return;
+
+out_kset:
+	kset_unregister(panic_kset);
+out_events:
+	kfree(panic_events);
+out_env:
+	kfree(panic_envp);
+out_unlock:
+	mutex_unlock(&panic_mutex);
+}
+
+core_initcall(panic_events_init);
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index cf77b3881a21..966e8eaad09e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -835,6 +835,19 @@ config DEBUG_SHIRQ
 
 menu "Debug Oops, Lockups and Hangs"
 
+config PANIC_EVENTS
+	bool "Enable uevents relating to panics or taints"
+	help
+	  Say Y here to enable the kernel to send uevents relating to panics or
+	  when taint events happen. This may be useful if you want to craft some
+	  userspace component to analyze a taint, or inform the user of this
+	  event. These events are useful later in boot, prior to device driver
+	  initialization, so it won't capture events early in boot. The events
+	  are also best effort, if the system is low on memory some uevents
+	  not reach userspace.
+
+	  Say N if unsure.
+
 config PANIC_ON_OOPS
 	bool "Panic on Oops"
 	help
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 3/8] taint: add firmware crash taint support
  2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
  2020-05-26 14:58 ` [PATCH v3 1/8] kernel.h: move taint and system state flags to uapi Luis Chamberlain
  2020-05-26 14:58 ` [PATCH v3 2/8] panic: add uevent support Luis Chamberlain
@ 2020-05-26 14:58 ` Luis Chamberlain
  2020-05-26 14:58 ` [PATCH v3 4/8] panic: make taint data type clearer Luis Chamberlain
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, Luis Chamberlain

Device drivers firmware can crash, and *sometimes*, this can leave your
system in a state which makes the device or subsystem completely useless.
In the worst of cases not even removing the module and adding it back again
will correct your situation and you are left with no other option but to do
a full system reboot. Some drivers have work arounds for these situations,
and sometimes they can recover the device / functionality but not all
device drivers have these arrangements and in the worst cases requiring
a full reboot is completely hidden from the user experience, leaving them
dumbfounded with what has happened.

Detecting this by inspecting /proc/sys/kernel/tainted instead of scraping
some magical words from the kernel log, which is driver specific, is much
easier. So instead provide a helper which lets drivers annotate this.

Once this happens, scrapers can easily look for modules taint flags for a
firmware crash. This will taint both the kernel and respective calling
module.

The new helper taint_firmware_crashed() uses LOCKDEP_STILL_OK as this fact
should in no way shape or form affect lockdep. This taint is device driver
specific.

While extending the declaration of add_taint_module(), just make the flag
unsigned int clear.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 Documentation/admin-guide/tainted-kernels.rst |  6 ++++++
 include/linux/kernel.h                        |  2 +-
 include/linux/module.h                        | 13 +++++++++++++
 include/trace/events/module.h                 |  3 ++-
 include/uapi/linux/kernel.h                   |  1 +
 kernel/module.c                               | 13 +++++++++----
 kernel/panic.c                                |  1 +
 tools/debugging/kernel-chktaint               |  7 +++++++
 8 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 71e9184a9079..92530f1d60ae 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -100,6 +100,7 @@ Bit  Log  Number  Reason that got the kernel tainted
  15  _/K   32768  kernel has been live patched
  16  _/X   65536  auxiliary taint, defined for and used by distros
  17  _/T  131072  kernel was built with the struct randomization plugin
+ 18  _/Q  262144  driver firmware crash annotation
 ===  ===  ======  ========================================================
 
 Note: The character ``_`` is representing a blank in this table to make reading
@@ -162,3 +163,8 @@ More detailed explanation for tainting
      produce extremely unusual kernel structure layouts (even performance
      pathological ones), which is important to know when debugging. Set at
      build time.
+
+ 18) ``Q`` used by device drivers to annotate that the device driver's firmware
+     has crashed and the device's operation has been severely affected. The
+     device may be left in a crippled state, requiring full driver removal /
+     addition, system reboot, or it is unclear how long recovery will take.
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 337634363d00..a1974907c320 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -571,7 +571,7 @@ extern int root_mountflags;
 extern bool early_boot_irqs_disabled;
 extern enum system_states  system_state;
 
-#define TAINT_FLAGS_COUNT		18
+#define TAINT_FLAGS_COUNT		19
 #define TAINT_FLAGS_MAX			((1UL << TAINT_FLAGS_COUNT) - 1)
 
 struct taint_flag {
diff --git a/include/linux/module.h b/include/linux/module.h
index 2e6670860d27..b3e143d2993e 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -705,6 +705,14 @@ static inline bool is_livepatch_module(struct module *mod)
 bool is_module_sig_enforced(void);
 void set_module_sig_enforced(void);
 
+void add_taint_module(struct module *mod, unsigned int flag,
+		      enum lockdep_ok lockdep_ok);
+
+static inline void taint_firmware_crashed(void)
+{
+	add_taint_module(THIS_MODULE, TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
+}
+
 #else /* !CONFIG_MODULES... */
 
 static inline struct module *__module_address(unsigned long addr)
@@ -852,6 +860,11 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr)
 	return ptr;
 }
 
+static inline void taint_firmware_crashed(void)
+{
+	add_taint(TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
+}
+
 #endif /* CONFIG_MODULES */
 
 #ifdef CONFIG_SYSFS
diff --git a/include/trace/events/module.h b/include/trace/events/module.h
index 097485c73c01..b749ea25affd 100644
--- a/include/trace/events/module.h
+++ b/include/trace/events/module.h
@@ -26,7 +26,8 @@ struct module;
 	{ (1UL << TAINT_OOT_MODULE),		"O" },		\
 	{ (1UL << TAINT_FORCED_MODULE),		"F" },		\
 	{ (1UL << TAINT_CRAP),			"C" },		\
-	{ (1UL << TAINT_UNSIGNED_MODULE),	"E" })
+	{ (1UL << TAINT_UNSIGNED_MODULE),	"E" },		\
+	{ (1UL << TAINT_FIRMWARE_CRASH),	"Q" })
 
 TRACE_EVENT(module_load,
 
diff --git a/include/uapi/linux/kernel.h b/include/uapi/linux/kernel.h
index 4bbd4093eb64..1e364659afca 100644
--- a/include/uapi/linux/kernel.h
+++ b/include/uapi/linux/kernel.h
@@ -45,6 +45,7 @@ enum system_states {
 #define TAINT_LIVEPATCH			15
 #define TAINT_AUX			16
 #define TAINT_RANDSTRUCT		17
+#define TAINT_FIRMWARE_CRASH		18
 /* be sure to update TAINT_FLAGS_COUNT when extending this */
 
 #endif /* _UAPI_LINUX_KERNEL_H */
diff --git a/kernel/module.c b/kernel/module.c
index 9b85d58441a2..538def226332 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -326,13 +326,18 @@ static inline int strong_try_module_get(struct module *mod)
 		return -ENOENT;
 }
 
-static inline void add_taint_module(struct module *mod, unsigned flag,
-				    enum lockdep_ok lockdep_ok)
+void add_taint_module(struct module *mod, unsigned int flag,
+		      enum lockdep_ok lockdep_ok)
 {
 	add_taint(flag, lockdep_ok);
-	set_bit(flag, &mod->taints);
-	panic_uevent_taint(flag, mod);
+
+	/* Skip this if the module is built-in */
+	if (mod) {
+		set_bit(flag, &mod->taints);
+		panic_uevent_taint(flag, mod);
+	}
 }
+EXPORT_SYMBOL_GPL(add_taint_module);
 
 /*
  * A thread that wants to hold a reference to a module only while it
diff --git a/kernel/panic.c b/kernel/panic.c
index 48e9e2efa5bb..cb1c5619e983 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -387,6 +387,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
 	[ TAINT_LIVEPATCH ]		= { 'K', ' ', true },
 	[ TAINT_AUX ]			= { 'X', ' ', true },
 	[ TAINT_RANDSTRUCT ]		= { 'T', ' ', true },
+	[ TAINT_FIRMWARE_CRASH ]	= { 'Q', ' ', true },
 };
 
 /**
diff --git a/tools/debugging/kernel-chktaint b/tools/debugging/kernel-chktaint
index 2240cb56e6e5..c397c6aabea7 100755
--- a/tools/debugging/kernel-chktaint
+++ b/tools/debugging/kernel-chktaint
@@ -194,6 +194,13 @@ else
 	addout "T"
 	echo " * kernel was built with the struct randomization plugin (#17)"
 fi
+T=`expr $T / 2`
+if [ `expr $T % 2` -eq 0 ]; then
+	addout " "
+else
+	addout "Q"
+	echo " * a device driver's firmware has crashed (#18)"
+fi
 
 echo "For a more detailed explanation of the various taint flags see"
 echo " Documentation/admin-guide/tainted-kernels.rst in the the Linux kernel sources"
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 4/8] panic: make taint data type clearer
  2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
                   ` (2 preceding siblings ...)
  2020-05-26 14:58 ` [PATCH v3 3/8] taint: add firmware crash taint support Luis Chamberlain
@ 2020-05-26 14:58 ` Luis Chamberlain
  2020-05-26 14:58   ` Luis Chamberlain
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, Luis Chamberlain

Let us be clearer about the the data type for the taint flag.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/asm-generic/bug.h | 4 ++--
 include/linux/kernel.h    | 4 ++--
 kernel/panic.c            | 8 ++++----
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index c94e33ae3e7b..87dbe57301f4 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -80,7 +80,7 @@ struct bug_entry {
  */
 #ifndef __WARN_FLAGS
 extern __printf(4, 5)
-void warn_slowpath_fmt(const char *file, const int line, unsigned taint,
+void warn_slowpath_fmt(const char *file, const int line, unsigned int taint,
 		       const char *fmt, ...);
 #define __WARN()		__WARN_printf(TAINT_WARN, NULL)
 #define __WARN_printf(taint, arg...) do {				\
@@ -110,7 +110,7 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 struct warn_args;
 struct pt_regs;
 
-void __warn(const char *file, int line, void *caller, unsigned taint,
+void __warn(const char *file, int line, void *caller, unsigned int taint,
 	    struct pt_regs *regs, struct warn_args *args);
 
 #ifndef WARN_ON
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index a1974907c320..d154844eb9cd 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -563,8 +563,8 @@ enum lockdep_ok {
 	LOCKDEP_STILL_OK,
 	LOCKDEP_NOW_UNRELIABLE
 };
-extern void add_taint(unsigned flag, enum lockdep_ok);
-extern int test_taint(unsigned flag);
+extern void add_taint(unsigned int flag, enum lockdep_ok);
+extern int test_taint(unsigned int flag);
 extern unsigned long get_taint(void);
 extern int root_mountflags;
 
diff --git a/kernel/panic.c b/kernel/panic.c
index cb1c5619e983..3cfe84318ecf 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -421,7 +421,7 @@ const char *print_tainted(void)
 	return buf;
 }
 
-int test_taint(unsigned flag)
+int test_taint(unsigned int flag)
 {
 	return test_bit(flag, &tainted_mask);
 }
@@ -440,7 +440,7 @@ unsigned long get_taint(void)
  * If something bad has gone wrong, you'll want @lockdebug_ok = false, but for
  * some notewortht-but-not-corrupting cases, it can be set to true.
  */
-void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)
+void add_taint(unsigned int flag, enum lockdep_ok lockdep_ok)
 {
 	if (lockdep_ok == LOCKDEP_NOW_UNRELIABLE && __debug_locks_off()) {
 		pr_warn("Disabling lock debugging due to kernel taint\n");
@@ -579,7 +579,7 @@ struct warn_args {
 	va_list args;
 };
 
-void __warn(const char *file, int line, void *caller, unsigned taint,
+void __warn(const char *file, int line, void *caller, unsigned int taint,
 	    struct pt_regs *regs, struct warn_args *args)
 {
 	disable_trace_on_warning();
@@ -622,7 +622,7 @@ void __warn(const char *file, int line, void *caller, unsigned taint,
 }
 
 #ifndef __WARN_FLAGS
-void warn_slowpath_fmt(const char *file, int line, unsigned taint,
+void warn_slowpath_fmt(const char *file, int line, unsigned int taint,
 		       const char *fmt, ...)
 {
 	struct warn_args args;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 5/8] ath10k: use new taint_firmware_crashed()
  2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
@ 2020-05-26 14:58   ` Luis Chamberlain
  2020-05-26 14:58 ` [PATCH v3 2/8] panic: add uevent support Luis Chamberlain
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, Luis Chamberlain,
	linux-wireless, ath10k

This makes use of the new taint_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

I have run into this situation with this driver with the latest
firmware as of today, May 21, 2020 using v5.6.0, leaving me at
a state at which my only option is to reboot. Driver removal and
addition does not fix the situation. This is reported on kernel.org
bugzilla korg#207851 [0]. But this isn't the first firmware crash reported,
others have been filed before and none of these bugs have yet been
addressed [1] [2] [3].  Including my own I see these firmware crash
reports:

  * korg#207851 [0]
  * korg#197013 [1]
  * korg#201237 [2]
  * korg#195987 [3]

[0] https://bugzilla.kernel.org/show_bug.cgi?id=207851
[1] https://bugzilla.kernel.org/show_bug.cgi?id=197013
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201237
[3] https://bugzilla.kernel.org/show_bug.cgi?id=195987

Cc: linux-wireless@vger.kernel.org
Cc: ath10k@lists.infradead.org
Cc: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
 drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
 drivers/net/wireless/ath/ath10k/snoc.c | 1 +
 3 files changed, 5 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 1d941d53fdc9..818c3acc2468 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	taint_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_pci_dump_registers(ar, crash_data);
 	ath10k_ce_dump_registers(ar, crash_data);
@@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
 	if (ret) {
 		if (ath10k_pci_has_fw_crashed(ar)) {
 			ath10k_warn(ar, "firmware crashed during chip reset\n");
+			taint_firmware_crashed();
 			ath10k_pci_fw_crashed_clear(ar);
 			ath10k_pci_fw_crashed_dump(ar);
 		}
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index e2aff2254a40..8b2fc0b89be4 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
 
 	/* TODO: Add firmware crash handling */
 	ath10k_warn(ar, "firmware crashed\n");
+	taint_firmware_crashed();
 
 	/* read counter to clear the interrupt, the debug error interrupt is
 	 * counter 0.
@@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
 	if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
 		ath10k_err(ar, "firmware crashed!\n");
 		queue_work(ar->workqueue, &ar->restart_work);
+		taint_firmware_crashed();
 	}
 	return ret;
 }
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 354d49b1cd45..071ee7607a4c 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	taint_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_msa_dump_memory(ar, crash_data);
 	mutex_unlock(&ar->dump_mutex);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 5/8] ath10k: use new taint_firmware_crashed()
@ 2020-05-26 14:58   ` Luis Chamberlain
  0 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: linux-wireless, aquini, linux-doc, peterz, daniel.vetter, linux,
	linux-kernel, yamada.masahiro, glider, GR-everest-linux-l2,
	mchehab+samsung, will, michael.chan, robh, paulmck, bhe, corbet,
	mchehab+huawei, Luis Chamberlain, ath10k, derosier, tiwai, mingo,
	dvyukov, samitolvanen, yzaikin, dyoung, pmladek, elver, sburla,
	aelior, keescook, arnd, sfr, gpiccoli, rostedt, fmanlunas, cai,
	tglx, andriy.shevchenko, johannes, kvalo, netdev, rdunlap,
	schlad, dianders, vkoul, mhiramat, akpm, dchickles, bauerman

This makes use of the new taint_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

I have run into this situation with this driver with the latest
firmware as of today, May 21, 2020 using v5.6.0, leaving me at
a state at which my only option is to reboot. Driver removal and
addition does not fix the situation. This is reported on kernel.org
bugzilla korg#207851 [0]. But this isn't the first firmware crash reported,
others have been filed before and none of these bugs have yet been
addressed [1] [2] [3].  Including my own I see these firmware crash
reports:

  * korg#207851 [0]
  * korg#197013 [1]
  * korg#201237 [2]
  * korg#195987 [3]

[0] https://bugzilla.kernel.org/show_bug.cgi?id=207851
[1] https://bugzilla.kernel.org/show_bug.cgi?id=197013
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201237
[3] https://bugzilla.kernel.org/show_bug.cgi?id=195987

Cc: linux-wireless@vger.kernel.org
Cc: ath10k@lists.infradead.org
Cc: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
 drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
 drivers/net/wireless/ath/ath10k/snoc.c | 1 +
 3 files changed, 5 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 1d941d53fdc9..818c3acc2468 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	taint_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_pci_dump_registers(ar, crash_data);
 	ath10k_ce_dump_registers(ar, crash_data);
@@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
 	if (ret) {
 		if (ath10k_pci_has_fw_crashed(ar)) {
 			ath10k_warn(ar, "firmware crashed during chip reset\n");
+			taint_firmware_crashed();
 			ath10k_pci_fw_crashed_clear(ar);
 			ath10k_pci_fw_crashed_dump(ar);
 		}
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index e2aff2254a40..8b2fc0b89be4 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
 
 	/* TODO: Add firmware crash handling */
 	ath10k_warn(ar, "firmware crashed\n");
+	taint_firmware_crashed();
 
 	/* read counter to clear the interrupt, the debug error interrupt is
 	 * counter 0.
@@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
 	if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
 		ath10k_err(ar, "firmware crashed!\n");
 		queue_work(ar->workqueue, &ar->restart_work);
+		taint_firmware_crashed();
 	}
 	return ret;
 }
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 354d49b1cd45..071ee7607a4c 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	taint_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_msa_dump_memory(ar, crash_data);
 	mutex_unlock(&ar->dump_mutex);
-- 
2.26.2


_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 6/8] bnxt_en: use new taint_firmware_crashed()
  2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
                   ` (4 preceding siblings ...)
  2020-05-26 14:58   ` Luis Chamberlain
@ 2020-05-26 14:58 ` Luis Chamberlain
  2020-05-26 18:09   ` Michael Chan
  2020-05-26 14:58 ` [PATCH v3 7/8] liquidio: " Luis Chamberlain
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, Vasundhara Volam,
	Luis Chamberlain

From: Vasundhara Volam <vasundhara-v.volam@broadcom.com>

This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c
index a812beb46325..776a7ae0ef7f 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c
@@ -121,6 +121,7 @@ static int bnxt_fw_fatal_recover(struct devlink_health_reporter *reporter,
 	if (!priv_ctx)
 		return -EOPNOTSUPP;
 
+	taint_firmware_crashed();
 	bp->fw_health->fatal = true;
 	event = fw_reporter_ctx->sp_event;
 	if (event == BNXT_FW_RESET_NOTIFY_SP_EVENT)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 7/8] liquidio: use new taint_firmware_crashed()
  2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
                   ` (5 preceding siblings ...)
  2020-05-26 14:58 ` [PATCH v3 6/8] bnxt_en: " Luis Chamberlain
@ 2020-05-26 14:58 ` Luis Chamberlain
  2020-05-26 14:58 ` [PATCH v3 8/8] qed: " Luis Chamberlain
  2020-05-26 22:46 ` [PATCH v3 0/8] kernel: taint when the driver firmware crashes Jakub Kicinski
  8 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, Luis Chamberlain

This makes use of the new taint_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Derek Chickles <dchickles@marvell.com>
Cc: Satanand Burla <sburla@marvell.com>
Cc: Felix Manlunas <fmanlunas@marvell.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Derek Chickles <dchickles@marvell.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 66d31c018c7e..ee1796ea4818 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -801,6 +801,7 @@ static int liquidio_watchdog(void *param)
 			continue;
 
 		WRITE_ONCE(oct->cores_crashed, true);
+		taint_firmware_crashed();
 		other_oct = get_other_octeon_device(oct);
 		if (other_oct)
 			WRITE_ONCE(other_oct->cores_crashed, true);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 8/8] qed: use new taint_firmware_crashed()
  2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
                   ` (6 preceding siblings ...)
  2020-05-26 14:58 ` [PATCH v3 7/8] liquidio: " Luis Chamberlain
@ 2020-05-26 14:58 ` Luis Chamberlain
  2020-05-26 22:46 ` [PATCH v3 0/8] kernel: taint when the driver firmware crashes Jakub Kicinski
  8 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 14:58 UTC (permalink / raw)
  To: jeyu, davem, kuba
  Cc: michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, Luis Chamberlain, Igor Russkikh

This makes use of the new taint_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Ariel Elior <aelior@marvell.com>
Cc: GR-everest-linux-l2@marvell.com
Reviewed-by: Igor Russkikh <irusskikh@marvell.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 9624616806e7..dd4357b0b5d1 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -566,6 +566,7 @@ _qed_mcp_cmd_and_union(struct qed_hwfn *p_hwfn,
 		DP_NOTICE(p_hwfn,
 			  "The MFW failed to respond to command 0x%08x [param 0x%08x].\n",
 			  p_mb_params->cmd, p_mb_params->param);
+		taint_firmware_crashed();
 		qed_mcp_print_cpu_info(p_hwfn, p_ptt);
 
 		spin_lock_bh(&p_hwfn->mcp_info->cmd_lock);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 6/8] bnxt_en: use new taint_firmware_crashed()
  2020-05-26 14:58 ` [PATCH v3 6/8] bnxt_en: " Luis Chamberlain
@ 2020-05-26 18:09   ` Michael Chan
  0 siblings, 0 replies; 23+ messages in thread
From: Michael Chan @ 2020-05-26 18:09 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: jeyu, David Miller, Jakub Kicinski, dchickles, sburla, fmanlunas,
	aelior, GR-everest-linux-l2, kvalo, johannes, akpm,
	Arnd Bergmann, rostedt, mingo, aquini, cai, dyoung, Baoquan He,
	peterz, tglx, gpiccoli, pmladek, tiwai, schlad,
	andriy.shevchenko, derosier, keescook, daniel.vetter, will,
	mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat, sfr,
	linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	Netdev, open list, linux-doc, Vasundhara Volam

On Tue, May 26, 2020 at 7:58 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> From: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
>
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
>
> Using a taint flag allows us to annotate when this happens clearly.
>
> Cc: Michael Chan <michael.chan@broadcom.com>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Acked-by: Rafael Aquini <aquini@redhat.com>
> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Reviewed-by: Michael Chan <michael.chan@broadcom.com>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes
  2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
                   ` (7 preceding siblings ...)
  2020-05-26 14:58 ` [PATCH v3 8/8] qed: " Luis Chamberlain
@ 2020-05-26 22:46 ` Jakub Kicinski
  2020-05-26 23:07   ` Luis Chamberlain
  8 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2020-05-26 22:46 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: jeyu, davem, michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc

On Tue, 26 May 2020 14:58:07 +0000 Luis Chamberlain wrote:
> To those new on CC -- this is intended to be a simple generic interface
> to the kernel to annotate when the firwmare has crashed leaving the
> driver or system in a questionable state, in the worst case requiring
> full system reboot. This series is first addressing only a few
> networking patches, however, I already have an idea of where such
> firmware crashes happen across the tree. The goal with this series then
> is to first introduce the simple framework, and only if that moves
> forward will I continue to chug on with the rest of the drivers /
> subsystems.
> 
> This is *not* a networking specific problem only.
> 
> This v3 augments the last series by introducing the uevent for panic
> events, one of them is during tainting. The uvent mechanism is
> independent from any of this firmware taint mechanism. I've also
> addressed Jessica Yu's feedback. Given I've extended the patches a bit
> with other minor cleanup which checkpatch.pl complains over, and since
> this infrastructure is still being discussed, I've trimmed the patch
> series size to only cover drivers for which I've received an Acked-by
> from the respective driver maintainer, or where we have bug reports to
> support such dire situations on the driver such as ath10k.
> 
> During the last v2 it was discussed that we should instead use devlink
> for this work, however the initial RFC patches produced by Jakub
> Kicinski [0] shows how devlink is networking specific, and the intent
> behind this series is to produce simple helpers which can be used by *any*
> device driver, for any subsystem, not just networking. Subsystem
> specific infrastructure to help address firwmare crashes may still make
> sense, however that does not mean we *don't* need something even more
> generic regardless of the subsystem the issue happens on. Since uevents
> for taints are exposed, we now expose these through uapi as well, and
> that was something which eventually had to happen given that the current
> scheme of relying on sensible character representations for each taint
> will not scale beyond the alphabet.

Nacked-by: Jakub Kicinski <kuba@kernel.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes
  2020-05-26 22:46 ` [PATCH v3 0/8] kernel: taint when the driver firmware crashes Jakub Kicinski
@ 2020-05-26 23:07   ` Luis Chamberlain
  2020-05-26 23:30     ` Jakub Kicinski
  0 siblings, 1 reply; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-26 23:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: jeyu, davem, michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc

On Tue, May 26, 2020 at 03:46:06PM -0700, Jakub Kicinski wrote:
> On Tue, 26 May 2020 14:58:07 +0000 Luis Chamberlain wrote:
> > To those new on CC -- this is intended to be a simple generic interface
> > to the kernel to annotate when the firwmare has crashed leaving the
> > driver or system in a questionable state, in the worst case requiring
> > full system reboot. This series is first addressing only a few
> > networking patches, however, I already have an idea of where such
> > firmware crashes happen across the tree. The goal with this series then
> > is to first introduce the simple framework, and only if that moves
> > forward will I continue to chug on with the rest of the drivers /
> > subsystems.
> > 
> > This is *not* a networking specific problem only.
> > 
> > This v3 augments the last series by introducing the uevent for panic
> > events, one of them is during tainting. The uvent mechanism is
> > independent from any of this firmware taint mechanism. I've also
> > addressed Jessica Yu's feedback. Given I've extended the patches a bit
> > with other minor cleanup which checkpatch.pl complains over, and since
> > this infrastructure is still being discussed, I've trimmed the patch
> > series size to only cover drivers for which I've received an Acked-by
> > from the respective driver maintainer, or where we have bug reports to
> > support such dire situations on the driver such as ath10k.
> > 
> > During the last v2 it was discussed that we should instead use devlink
> > for this work, however the initial RFC patches produced by Jakub
> > Kicinski [0] shows how devlink is networking specific, and the intent
> > behind this series is to produce simple helpers which can be used by *any*
> > device driver, for any subsystem, not just networking. Subsystem
> > specific infrastructure to help address firwmare crashes may still make
> > sense, however that does not mean we *don't* need something even more
> > generic regardless of the subsystem the issue happens on. Since uevents
> > for taints are exposed, we now expose these through uapi as well, and
> > that was something which eventually had to happen given that the current
> > scheme of relying on sensible character representations for each taint
> > will not scale beyond the alphabet.
> 
> Nacked-by: Jakub Kicinski <kuba@kernel.org>

Care to elaborate?
 
  Luis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes
  2020-05-26 23:07   ` Luis Chamberlain
@ 2020-05-26 23:30     ` Jakub Kicinski
  2020-05-27  3:19       ` Luis Chamberlain
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2020-05-26 23:30 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: jeyu, davem, michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc

On Tue, 26 May 2020 23:07:48 +0000 Luis Chamberlain wrote:
> On Tue, May 26, 2020 at 03:46:06PM -0700, Jakub Kicinski wrote:
> > On Tue, 26 May 2020 14:58:07 +0000 Luis Chamberlain wrote:  
> > > To those new on CC -- this is intended to be a simple generic interface
> > > to the kernel to annotate when the firwmare has crashed leaving the
> > > driver or system in a questionable state, in the worst case requiring
> > > full system reboot. This series is first addressing only a few
> > > networking patches, however, I already have an idea of where such
> > > firmware crashes happen across the tree. The goal with this series then
> > > is to first introduce the simple framework, and only if that moves
> > > forward will I continue to chug on with the rest of the drivers /
> > > subsystems.
> > > 
> > > This is *not* a networking specific problem only.
> > > 
> > > This v3 augments the last series by introducing the uevent for panic
> > > events, one of them is during tainting. The uvent mechanism is
> > > independent from any of this firmware taint mechanism. I've also
> > > addressed Jessica Yu's feedback. Given I've extended the patches a bit
> > > with other minor cleanup which checkpatch.pl complains over, and since
> > > this infrastructure is still being discussed, I've trimmed the patch
> > > series size to only cover drivers for which I've received an Acked-by
> > > from the respective driver maintainer, or where we have bug reports to
> > > support such dire situations on the driver such as ath10k.
> > > 
> > > During the last v2 it was discussed that we should instead use devlink
> > > for this work, however the initial RFC patches produced by Jakub
> > > Kicinski [0] shows how devlink is networking specific, and the intent
> > > behind this series is to produce simple helpers which can be used by *any*
> > > device driver, for any subsystem, not just networking. Subsystem
> > > specific infrastructure to help address firwmare crashes may still make
> > > sense, however that does not mean we *don't* need something even more
> > > generic regardless of the subsystem the issue happens on. Since uevents
> > > for taints are exposed, we now expose these through uapi as well, and
> > > that was something which eventually had to happen given that the current
> > > scheme of relying on sensible character representations for each taint
> > > will not scale beyond the alphabet.  
> > 
> > Nacked-by: Jakub Kicinski <kuba@kernel.org>  
> 
> Care to elaborate?

I elaborated in the previous thread and told you I will nack this, 
but sure let's go over this again.

For the third time saying the devlink is networking specific is not
true. It was created as a netlink configuration channel for devices
when there is no networking reference that could be used. It can be
compiled in or out much like sysfs.

And as I've shown you devlink already has the uAPI for what you're
trying to achieve.

Regardless of your opinions about wider interfaces, networking drivers
should implement devlink, and not have to sprinkle magic taint calls.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes
  2020-05-26 23:30     ` Jakub Kicinski
@ 2020-05-27  3:19       ` Luis Chamberlain
  2020-05-27 21:36         ` Jakub Kicinski
  0 siblings, 1 reply; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-27  3:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: jeyu, davem, michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc

On Tue, May 26, 2020 at 04:30:31PM -0700, Jakub Kicinski wrote:
> On Tue, 26 May 2020 23:07:48 +0000 Luis Chamberlain wrote:
> > On Tue, May 26, 2020 at 03:46:06PM -0700, Jakub Kicinski wrote:
> > > On Tue, 26 May 2020 14:58:07 +0000 Luis Chamberlain wrote:  
> > > > To those new on CC -- this is intended to be a simple generic interface
> > > > to the kernel to annotate when the firwmare has crashed leaving the
> > > > driver or system in a questionable state, in the worst case requiring
> > > > full system reboot. This series is first addressing only a few
> > > > networking patches, however, I already have an idea of where such
> > > > firmware crashes happen across the tree. The goal with this series then
> > > > is to first introduce the simple framework, and only if that moves
> > > > forward will I continue to chug on with the rest of the drivers /
> > > > subsystems.
> > > > 
> > > > This is *not* a networking specific problem only.
> > > > 
> > > > This v3 augments the last series by introducing the uevent for panic
> > > > events, one of them is during tainting. The uvent mechanism is
> > > > independent from any of this firmware taint mechanism. I've also
> > > > addressed Jessica Yu's feedback. Given I've extended the patches a bit
> > > > with other minor cleanup which checkpatch.pl complains over, and since
> > > > this infrastructure is still being discussed, I've trimmed the patch
> > > > series size to only cover drivers for which I've received an Acked-by
> > > > from the respective driver maintainer, or where we have bug reports to
> > > > support such dire situations on the driver such as ath10k.
> > > > 
> > > > During the last v2 it was discussed that we should instead use devlink
> > > > for this work, however the initial RFC patches produced by Jakub
> > > > Kicinski [0] shows how devlink is networking specific, and the intent
> > > > behind this series is to produce simple helpers which can be used by *any*
> > > > device driver, for any subsystem, not just networking. Subsystem
> > > > specific infrastructure to help address firwmare crashes may still make
> > > > sense, however that does not mean we *don't* need something even more
> > > > generic regardless of the subsystem the issue happens on. Since uevents
> > > > for taints are exposed, we now expose these through uapi as well, and
> > > > that was something which eventually had to happen given that the current
> > > > scheme of relying on sensible character representations for each taint
> > > > will not scale beyond the alphabet.  
> > > 
> > > Nacked-by: Jakub Kicinski <kuba@kernel.org>  
> > 
> > Care to elaborate?
> 
> I elaborated in the previous thread

No you didn't.

> and told you I will nack this, 

That's all you said.

> but sure let's go over this again.
> 
> For the third time saying the devlink is networking specific is not
> true. It was created as a netlink configuration channel for devices
> when there is no networking reference that could be used. It can be
> compiled in or out much like sysfs.

Perhaps I didn't get your email but this clarification was in no way
shape or form present in your reply on that thread.

> And as I've shown you devlink already has the uAPI for what you're
> trying to achieve.

I read your patch, and granted, I will accept I was under the incorrect
assumption that this can only be used by networking devices, however it
the devlink approach achieves getting userspace the ability with
iproute2 devlink util to query a device health, on to which we can peg
firmware health. But *this* patch series is not about health status and
letting users query it, its about a *critical* situation which has come up
with firmware requiring me to reboot my system, and the lack of *any*
infrastructure in the kernel today to inform userspace about it.

So say we use netlink to report a critical health situation, how are we
informing userspace with your patch series about requring a reboot?

  Luis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes
  2020-05-27  3:19       ` Luis Chamberlain
@ 2020-05-27 21:36         ` Jakub Kicinski
  2020-05-28 14:27           ` Luis Chamberlain
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2020-05-27 21:36 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: jeyu, davem, michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc

On Wed, 27 May 2020 03:19:18 +0000 Luis Chamberlain wrote:
> I read your patch, and granted, I will accept I was under the incorrect
> assumption that this can only be used by networking devices, however it
> the devlink approach achieves getting userspace the ability with
> iproute2 devlink util to query a device health, on to which we can peg
> firmware health. But *this* patch series is not about health status and
> letting users query it, its about a *critical* situation which has come up
> with firmware requiring me to reboot my system, and the lack of *any*
> infrastructure in the kernel today to inform userspace about it.
> 
> So say we use netlink to report a critical health situation, how are we
> informing userspace with your patch series about requring a reboot?

One of main features of netlink is pub/sub model of notifications.

Whatever you imagine listening to your uevent can listen to
devlink-health notifications via devlink. 

In fact I've shown this off in the RFC patches I sent to you, see 
the devlink mon health command being used.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes
  2020-05-27 21:36         ` Jakub Kicinski
@ 2020-05-28 14:27           ` Luis Chamberlain
  2020-05-28 15:04             ` Ben Greear
  0 siblings, 1 reply; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-28 14:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: jeyu, davem, michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, linux-wireless

On Wed, May 27, 2020 at 02:36:42PM -0700, Jakub Kicinski wrote:
> On Wed, 27 May 2020 03:19:18 +0000 Luis Chamberlain wrote:
> > I read your patch, and granted, I will accept I was under the incorrect
> > assumption that this can only be used by networking devices, however it
> > the devlink approach achieves getting userspace the ability with
> > iproute2 devlink util to query a device health, on to which we can peg
> > firmware health. But *this* patch series is not about health status and
> > letting users query it, its about a *critical* situation which has come up
> > with firmware requiring me to reboot my system, and the lack of *any*
> > infrastructure in the kernel today to inform userspace about it.
> > 
> > So say we use netlink to report a critical health situation, how are we
> > informing userspace with your patch series about requring a reboot?
> 
> One of main features of netlink is pub/sub model of notifications.
> 
> Whatever you imagine listening to your uevent can listen to
> devlink-health notifications via devlink. 
> 
> In fact I've shown this off in the RFC patches I sent to you, see 
> the devlink mon health command being used.

Yes but I looked at iputils2 devlink and seems I made an incorrect
assumption this can only be used for a network device rather than
a struct device.

I'll take a second look.

  Luis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes
  2020-05-28 14:27           ` Luis Chamberlain
@ 2020-05-28 15:04             ` Ben Greear
  2020-05-28 16:33               ` Luis Chamberlain
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Greear @ 2020-05-28 15:04 UTC (permalink / raw)
  To: Luis Chamberlain, Jakub Kicinski
  Cc: jeyu, davem, michael.chan, dchickles, sburla, fmanlunas, aelior,
	GR-everest-linux-l2, kvalo, johannes, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, derosier, keescook, daniel.vetter,
	will, mchehab+samsung, vkoul, mchehab+huawei, robh, mhiramat,
	sfr, linux, glider, paulmck, elver, bauerman, yamada.masahiro,
	samitolvanen, yzaikin, dvyukov, rdunlap, corbet, dianders,
	netdev, linux-kernel, linux-doc, linux-wireless



On 05/28/2020 07:27 AM, Luis Chamberlain wrote:
> On Wed, May 27, 2020 at 02:36:42PM -0700, Jakub Kicinski wrote:
>> On Wed, 27 May 2020 03:19:18 +0000 Luis Chamberlain wrote:
>>> I read your patch, and granted, I will accept I was under the incorrect
>>> assumption that this can only be used by networking devices, however it
>>> the devlink approach achieves getting userspace the ability with
>>> iproute2 devlink util to query a device health, on to which we can peg
>>> firmware health. But *this* patch series is not about health status and
>>> letting users query it, its about a *critical* situation which has come up
>>> with firmware requiring me to reboot my system, and the lack of *any*
>>> infrastructure in the kernel today to inform userspace about it.
>>>
>>> So say we use netlink to report a critical health situation, how are we
>>> informing userspace with your patch series about requring a reboot?
>>
>> One of main features of netlink is pub/sub model of notifications.
>>
>> Whatever you imagine listening to your uevent can listen to
>> devlink-health notifications via devlink.
>>
>> In fact I've shown this off in the RFC patches I sent to you, see
>> the devlink mon health command being used.
>
> Yes but I looked at iputils2 devlink and seems I made an incorrect
> assumption this can only be used for a network device rather than
> a struct device.
>
> I'll take a second look.

Hello Jakub,

I'm thinking about something similar to what Luis is proposing, but in
my case I'd like to report just when the driver knows the hardware is gone
and cannot be recovered, like when this is reported:

[ 2548.851832] WARNING: CPU: 3 PID: 98 at backports-4.19.98-1/net/mac80211/util.c:2040 ieee80211_reconfig+0x98/0xb64 [mac80211]
[ 2548.856020] Hardware became unavailable during restart.

I'd like to be able to tie this into a watch-dog program to allow automatic reboot
of the system soon after this event is seen, for instance.

Could you post your devlink RFC patches somewhere public?

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/8] kernel: taint when the driver firmware crashes
  2020-05-28 15:04             ` Ben Greear
@ 2020-05-28 16:33               ` Luis Chamberlain
  0 siblings, 0 replies; 23+ messages in thread
From: Luis Chamberlain @ 2020-05-28 16:33 UTC (permalink / raw)
  To: Ben Greear
  Cc: Jakub Kicinski, jeyu, davem, michael.chan, dchickles, sburla,
	fmanlunas, aelior, GR-everest-linux-l2, kvalo, johannes, akpm,
	arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx,
	gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, derosier,
	keescook, daniel.vetter, will, mchehab+samsung, vkoul,
	mchehab+huawei, robh, mhiramat, sfr, linux, glider, paulmck,
	elver, bauerman, yamada.masahiro, samitolvanen, yzaikin, dvyukov,
	rdunlap, corbet, dianders, netdev, linux-kernel, linux-doc,
	linux-wireless

On Thu, May 28, 2020 at 08:04:50AM -0700, Ben Greear wrote:
> 
> Could you post your devlink RFC patches somewhere public?

This cover letter provided a URL to these.

  Luis

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/8] panic: add uevent support
  2020-05-26 14:58 ` [PATCH v3 2/8] panic: add uevent support Luis Chamberlain
@ 2020-05-31  4:46   ` kbuild test robot
  2020-06-03 17:55   ` kernel test robot
  1 sibling, 0 replies; 23+ messages in thread
From: kbuild test robot @ 2020-05-31  4:46 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1488 bytes --]

Hi Luis,

I love your patch! Yet something to improve:

[auto build test ERROR on next-20200526]
[cannot apply to ath6kl/ath-next jeyu/modules-next tip/perf/core sparc-next/master linus/master asm-generic/master v5.7-rc7 v5.7-rc6 v5.7-rc5 v5.7-rc7]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Luis-Chamberlain/kernel-taint-when-the-driver-firmware-crashes/20200526-231556
base:    b0523c7b1c9d0edcd6c0fe6d2cb558a9ad5c60a8
config: i386-randconfig-m021-20200531 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-13) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> error: include/uapi/linux/panic_events.h: missing "WITH Linux-syscall-note" for SPDX-License-Identifier
make[2]: *** [scripts/Makefile.headersinst:63: usr/include/linux/panic_events.h] Error 1
make[2]: Target '__headers' not remade because of errors.
make[1]: *** [Makefile:1251: headers] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:185: __sub-make] Error 2

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 41136 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/8] ath10k: use new taint_firmware_crashed()
  2020-05-26 14:58   ` Luis Chamberlain
@ 2020-06-02 21:01     ` Brian Norris
  -1 siblings, 0 replies; 23+ messages in thread
From: Brian Norris @ 2020-06-02 21:01 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: jeyu, David S. Miller, kuba, linux-wireless, aquini, linux-doc,
	peterz, Daniel Vetter, linux, Linux Kernel, Masahiro Yamada,
	glider, GR-everest-linux-l2, mchehab+samsung, will, michael.chan,
	Rob Herring, paulmck, bhe, corbet, mchehab+huawei, ath10k,
	derosier, Takashi Iwai, mingo, Dmitry Vyukov, Sami Tolvanen,
	yzaikin, dyoung, pmladek, elver, sburla, aelior, Kees Cook,
	Arnd Bergmann, sfr, gpiccoli, Steven Rostedt, fmanlunas, cai,
	tglx, Andy Shevchenko, Johannes Berg, Kalle Valo,
	<netdev@vger.kernel.org>,
	rdunlap, schlad, Doug Anderson, vkoul, mhiramat, Andrew Morton,
	dchickles, bauerman

On Tue, May 26, 2020 at 7:58 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> This makes use of the new taint_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.

Just for the record, the underlying problem you seem to be complaining
about does not appear to be a firmware crash at all. It does happen to
result in a firmware crash report much later on (because when the PCIe
endpoint is this hosed, sooner or later the driver thinks the firmware
is dead), but it's not likely the root cause. More below.

> Using a taint flag allows us to annotate when this happens clearly.
>
> I have run into this situation with this driver with the latest
> firmware as of today, May 21, 2020 using v5.6.0, leaving me at
> a state at which my only option is to reboot. Driver removal and
> addition does not fix the situation. This is reported on kernel.org
> bugzilla korg#207851 [0].

I took a look, and replied there:
https://bugzilla.kernel.org/show_bug.cgi?id=207851#c2

Per the above, it seems more likely you have a PCI or power management
problem, not an ath10k or ath10k-firmware problem.

> But this isn't the first firmware crash reported,
> others have been filed before and none of these bugs have yet been
> addressed [1] [2] [3].  Including my own I see these firmware crash
> reports:

Yes, firmware does crash. Sometimes repeatedly. It also happens to be
closed source, so it's nearly impossible for the average Linux dev to
debug. But FWIW, those 3 all appear to be recoverable -- and then they
crash again a few minutes later. So just as claimed on prior
iterations of this patchset, ath10k is doing fine at recovery [*] --
it's "only" the firmware that's a problem. (And, if a WiFi firmware
doesn't like something in the RF environment...it's totally
understandable that the crash will happen more than once. Of course
that sucks, but it's not unexpected.) Crucially, rebooting won't
really do anything to help these people, AIUI.

Maybe what you really want is to taint the kernel every time a
non-free firmware is loaded ;)

I'd also note that those 3 reports are 3 years old. There have been
many ath10k-firmware updates since then, so it's not necessarily fair
to dig those back up. Also, bugzilla.kernel.org is totally ignored by
many linux-wireless@ folks. But I digress...

All in all, I have no interest in this proposal, for many of the
reasons already mentioned on previous iterations. It's way too coarse
and won't be useful in understanding what's going on in a system, IMO,
at least for ath10k. But it's also easy enough to ignore, so if it
makes somebody happy to claim a taint, then so be it.

Regards,
Brian

[*] Although, at least one of those doesn't appear to be as "clean" of
a recovery attempt as typical. Maybe there are some lurking driver
bugs in there too.


>   * korg#207851 [0]
>   * korg#197013 [1]
>   * korg#201237 [2]
>   * korg#195987 [3]
>
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=207851
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=197013
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201237
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=195987

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/8] ath10k: use new taint_firmware_crashed()
@ 2020-06-02 21:01     ` Brian Norris
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Norris @ 2020-06-02 21:01 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: aquini, linux-doc, peterz, Daniel Vetter, linux, Doug Anderson,
	Masahiro Yamada, glider, GR-everest-linux-l2, mchehab+samsung,
	will, tglx, Rob Herring, Arnd Bergmann, bhe, corbet,
	mchehab+huawei, ath10k, derosier, Takashi Iwai, mingo,
	Kalle Valo, Sami Tolvanen, kuba, yzaikin, dyoung, mhiramat,
	pmladek, elver, gpiccoli, aelior, Kees Cook, paulmck, sfr,
	sburla, Steven Rostedt, fmanlunas, cai, michael.chan,
	Andy Shevchenko, Andrew Morton, Dmitry Vyukov,
	<netdev@vger.kernel.org>,
	rdunlap, linux-wireless, Linux Kernel, vkoul, schlad, jeyu,
	Johannes Berg, dchickles, David S. Miller, bauerman

On Tue, May 26, 2020 at 7:58 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> This makes use of the new taint_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.

Just for the record, the underlying problem you seem to be complaining
about does not appear to be a firmware crash at all. It does happen to
result in a firmware crash report much later on (because when the PCIe
endpoint is this hosed, sooner or later the driver thinks the firmware
is dead), but it's not likely the root cause. More below.

> Using a taint flag allows us to annotate when this happens clearly.
>
> I have run into this situation with this driver with the latest
> firmware as of today, May 21, 2020 using v5.6.0, leaving me at
> a state at which my only option is to reboot. Driver removal and
> addition does not fix the situation. This is reported on kernel.org
> bugzilla korg#207851 [0].

I took a look, and replied there:
https://bugzilla.kernel.org/show_bug.cgi?id=207851#c2

Per the above, it seems more likely you have a PCI or power management
problem, not an ath10k or ath10k-firmware problem.

> But this isn't the first firmware crash reported,
> others have been filed before and none of these bugs have yet been
> addressed [1] [2] [3].  Including my own I see these firmware crash
> reports:

Yes, firmware does crash. Sometimes repeatedly. It also happens to be
closed source, so it's nearly impossible for the average Linux dev to
debug. But FWIW, those 3 all appear to be recoverable -- and then they
crash again a few minutes later. So just as claimed on prior
iterations of this patchset, ath10k is doing fine at recovery [*] --
it's "only" the firmware that's a problem. (And, if a WiFi firmware
doesn't like something in the RF environment...it's totally
understandable that the crash will happen more than once. Of course
that sucks, but it's not unexpected.) Crucially, rebooting won't
really do anything to help these people, AIUI.

Maybe what you really want is to taint the kernel every time a
non-free firmware is loaded ;)

I'd also note that those 3 reports are 3 years old. There have been
many ath10k-firmware updates since then, so it's not necessarily fair
to dig those back up. Also, bugzilla.kernel.org is totally ignored by
many linux-wireless@ folks. But I digress...

All in all, I have no interest in this proposal, for many of the
reasons already mentioned on previous iterations. It's way too coarse
and won't be useful in understanding what's going on in a system, IMO,
at least for ath10k. But it's also easy enough to ignore, so if it
makes somebody happy to claim a taint, then so be it.

Regards,
Brian

[*] Although, at least one of those doesn't appear to be as "clean" of
a recovery attempt as typical. Maybe there are some lurking driver
bugs in there too.


>   * korg#207851 [0]
>   * korg#197013 [1]
>   * korg#201237 [2]
>   * korg#195987 [3]
>
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=207851
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=197013
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201237
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=195987

_______________________________________________
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 2/8] panic: add uevent support
  2020-05-26 14:58 ` [PATCH v3 2/8] panic: add uevent support Luis Chamberlain
  2020-05-31  4:46   ` kbuild test robot
@ 2020-06-03 17:55   ` kernel test robot
  1 sibling, 0 replies; 23+ messages in thread
From: kernel test robot @ 2020-06-03 17:55 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2327 bytes --]

Hi Luis,

I love your patch! Yet something to improve:

[auto build test ERROR on next-20200526]
[cannot apply to ath6kl/ath-next jeyu/modules-next tip/perf/core sparc-next/master linus/master asm-generic/master v5.7-rc7 v5.7-rc6 v5.7-rc5 v5.7]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Luis-Chamberlain/kernel-taint-when-the-driver-firmware-crashes/20200526-231556
base:    b0523c7b1c9d0edcd6c0fe6d2cb558a9ad5c60a8
config: riscv-randconfig-r016-20200603 (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 16437992cac249f6fe1efd392d20e3469b47e39e)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> error: include/uapi/linux/panic_events.h: missing "WITH Linux-syscall-note" for SPDX-License-Identifier
make[2]: *** [scripts/Makefile.headersinst:63: usr/include/linux/panic_events.h] Error 1
make[2]: Target '__headers' not remade because of errors.
make[1]: *** [Makefile:1251: headers] Error 2
arch/riscv/kernel/asm-offsets.c:14:6: warning: no previous prototype for function 'asm_offsets' [-Wmissing-prototypes]
void asm_offsets(void)
^
arch/riscv/kernel/asm-offsets.c:14:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void asm_offsets(void)
^
static
1 warning generated.
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:185: __sub-make] Error 2

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 23541 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-06-03 17:55 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-26 14:58 [PATCH v3 0/8] kernel: taint when the driver firmware crashes Luis Chamberlain
2020-05-26 14:58 ` [PATCH v3 1/8] kernel.h: move taint and system state flags to uapi Luis Chamberlain
2020-05-26 14:58 ` [PATCH v3 2/8] panic: add uevent support Luis Chamberlain
2020-05-31  4:46   ` kbuild test robot
2020-06-03 17:55   ` kernel test robot
2020-05-26 14:58 ` [PATCH v3 3/8] taint: add firmware crash taint support Luis Chamberlain
2020-05-26 14:58 ` [PATCH v3 4/8] panic: make taint data type clearer Luis Chamberlain
2020-05-26 14:58 ` [PATCH v3 5/8] ath10k: use new taint_firmware_crashed() Luis Chamberlain
2020-05-26 14:58   ` Luis Chamberlain
2020-06-02 21:01   ` Brian Norris
2020-06-02 21:01     ` Brian Norris
2020-05-26 14:58 ` [PATCH v3 6/8] bnxt_en: " Luis Chamberlain
2020-05-26 18:09   ` Michael Chan
2020-05-26 14:58 ` [PATCH v3 7/8] liquidio: " Luis Chamberlain
2020-05-26 14:58 ` [PATCH v3 8/8] qed: " Luis Chamberlain
2020-05-26 22:46 ` [PATCH v3 0/8] kernel: taint when the driver firmware crashes Jakub Kicinski
2020-05-26 23:07   ` Luis Chamberlain
2020-05-26 23:30     ` Jakub Kicinski
2020-05-27  3:19       ` Luis Chamberlain
2020-05-27 21:36         ` Jakub Kicinski
2020-05-28 14:27           ` Luis Chamberlain
2020-05-28 15:04             ` Ben Greear
2020-05-28 16:33               ` Luis Chamberlain

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.