[PATCH v2 12/15] ath10k: use new module_firmware

linux-wireless.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
       [not found] <20200515212846.1347-1-mcgrof@kernel.org>
@ 2020-05-15 21:28 ` Luis Chamberlain
  2020-05-16  4:11   ` Rafael Aquini
  2020-05-16 13:24   ` Johannes Berg
  2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-15 21:28 UTC (permalink / raw)
  To: jeyu
  Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz,
	tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko,
	keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem,
	netdev, linux-kernel, Luis Chamberlain, linux-wireless, ath10k

This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wireless@vger.kernel.org
Cc: ath10k@lists.infradead.org
Cc: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
 drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
 drivers/net/wireless/ath/ath10k/snoc.c | 1 +
 3 files changed, 5 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 1d941d53fdc9..6bd0f3b518b9 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	module_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_pci_dump_registers(ar, crash_data);
 	ath10k_ce_dump_registers(ar, crash_data);
@@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
 	if (ret) {
 		if (ath10k_pci_has_fw_crashed(ar)) {
 			ath10k_warn(ar, "firmware crashed during chip reset\n");
+			module_firmware_crashed();
 			ath10k_pci_fw_crashed_clear(ar);
 			ath10k_pci_fw_crashed_dump(ar);
 		}
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index e2aff2254a40..d34ad289380f 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
 
 	/* TODO: Add firmware crash handling */
 	ath10k_warn(ar, "firmware crashed\n");
+	module_firmware_crashed();
 
 	/* read counter to clear the interrupt, the debug error interrupt is
 	 * counter 0.
@@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
 	if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
 		ath10k_err(ar, "firmware crashed!\n");
 		queue_work(ar->workqueue, &ar->restart_work);
+		module_firmware_crashed();
 	}
 	return ret;
 }
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 354d49b1cd45..7cfc123c345c 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
 		scnprintf(guid, sizeof(guid), "n/a");
 
 	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+	module_firmware_crashed();
 	ath10k_print_driver_info(ar);
 	ath10k_msa_dump_memory(ar, crash_data);
 	mutex_unlock(&ar->dump_mutex);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain
@ 2020-05-16  4:11   ` Rafael Aquini
  2020-05-16 13:24   ` Johannes Berg
  1 sibling, 0 replies; 51+ messages in thread
From: Rafael Aquini @ 2020-05-16  4:11 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: jeyu, akpm, arnd, rostedt, mingo, cai, dyoung, bhe, peterz, tglx,
	gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev,
	linux-kernel, linux-wireless, ath10k

On Fri, May 15, 2020 at 09:28:43PM +0000, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: linux-wireless@vger.kernel.org
> Cc: ath10k@lists.infradead.org
> Cc: Kalle Valo <kvalo@codeaurora.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
>  drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
>  drivers/net/wireless/ath/ath10k/snoc.c | 1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
> index 1d941d53fdc9..6bd0f3b518b9 100644
> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work)
>  		scnprintf(guid, sizeof(guid), "n/a");
>  
>  	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> +	module_firmware_crashed();
>  	ath10k_print_driver_info(ar);
>  	ath10k_pci_dump_registers(ar, crash_data);
>  	ath10k_ce_dump_registers(ar, crash_data);
> @@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
>  	if (ret) {
>  		if (ath10k_pci_has_fw_crashed(ar)) {
>  			ath10k_warn(ar, "firmware crashed during chip reset\n");
> +			module_firmware_crashed();
>  			ath10k_pci_fw_crashed_clear(ar);
>  			ath10k_pci_fw_crashed_dump(ar);
>  		}
> diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
> index e2aff2254a40..d34ad289380f 100644
> --- a/drivers/net/wireless/ath/ath10k/sdio.c
> +++ b/drivers/net/wireless/ath/ath10k/sdio.c
> @@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
>  
>  	/* TODO: Add firmware crash handling */
>  	ath10k_warn(ar, "firmware crashed\n");
> +	module_firmware_crashed();
>  
>  	/* read counter to clear the interrupt, the debug error interrupt is
>  	 * counter 0.
> @@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
>  	if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
>  		ath10k_err(ar, "firmware crashed!\n");
>  		queue_work(ar->workqueue, &ar->restart_work);
> +		module_firmware_crashed();
>  	}
>  	return ret;
>  }
> diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
> index 354d49b1cd45..7cfc123c345c 100644
> --- a/drivers/net/wireless/ath/ath10k/snoc.c
> +++ b/drivers/net/wireless/ath/ath10k/snoc.c
> @@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
>  		scnprintf(guid, sizeof(guid), "n/a");
>  
>  	ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
> +	module_firmware_crashed();
>  	ath10k_print_driver_info(ar);
>  	ath10k_msa_dump_memory(ar, crash_data);
>  	mutex_unlock(&ar->dump_mutex);
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain
  2020-05-16  4:11   ` Rafael Aquini
@ 2020-05-16 13:24   ` Johannes Berg
  2020-05-16 13:50     ` Johannes Berg
  2020-05-18 16:51     ` Luis Chamberlain
  1 sibling, 2 replies; 51+ messages in thread
From: Johannes Berg @ 2020-05-16 13:24 UTC (permalink / raw)
  To: Luis Chamberlain, jeyu
  Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz,
	tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko,
	keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem,
	netdev, linux-kernel, linux-wireless, ath10k

On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed

You didn't CC me or the wireless list on the rest of the patches, so I'm
replying to a random one, but ...

What is the point here?

This should in no way affect the integrity of the system/kernel, for
most devices anyway.

So what if ath10k's firmware crashes? If there's a driver bug it will
not handle it right (and probably crash, WARN_ON, or something else),
but if the driver is working right then that will not affect the kernel
at all.

So maybe I can understand that maybe you want an easy way to discover -
per device - that the firmware crashed, but that still doesn't warrant a
complete kernel taint.

Instead of the kernel taint, IMHO you should provide an annotation in
sysfs (or somewhere else) for the *struct device* that had its firmware
crash. Or maybe, if it's too complex to walk the entire hierarchy
checking for that, have a uevent, or add the ability for the kernel to
print out elsewhere in debugfs the list of devices that crashed at some
point... All of that is fine, but a kernel taint?

johannes

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-16 13:24   ` Johannes Berg
@ 2020-05-16 13:50     ` Johannes Berg
  2020-05-18 16:56       ` Luis Chamberlain
  2020-05-19  1:23       ` Brian Norris
  2020-05-18 16:51     ` Luis Chamberlain
  1 sibling, 2 replies; 51+ messages in thread
From: Johannes Berg @ 2020-05-16 13:50 UTC (permalink / raw)
  To: Luis Chamberlain, jeyu
  Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz,
	tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko,
	keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem,
	netdev, linux-kernel, linux-wireless, ath10k

On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote:

> Instead of the kernel taint, IMHO you should provide an annotation in
> sysfs (or somewhere else) for the *struct device* that had its firmware
> crash. Or maybe, if it's too complex to walk the entire hierarchy
> checking for that, have a uevent, or add the ability for the kernel to
> print out elsewhere in debugfs the list of devices that crashed at some

I mean sysfs, oops.

In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
detect that the device is really wedged enough that the only way we can
still try to recover is by completely unbinding the driver from it, then
we give userspace a uevent for that. I don't remember exactly how and
where that gets used (ChromeOS) though, but it'd be nice to have that
sort of thing as part of the infrastructure, in a sort of two-level
notification?

Level 1: firmware crashed, but we're recovering, at least mostly, and
it's more informational

Level 2: device is wedged, going to try to recover by some more forceful
means (perhaps some devices can be power-cycled? etc.) but (more) state
would be lost in these cases?

Still don't think a kernel taint is appropriate for either of these.

johannes

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-16 13:50     ` Johannes Berg
@ 2020-05-18 16:56       ` Luis Chamberlain
  2020-05-19  1:23       ` Brian Norris
  1 sibling, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-18 16:56 UTC (permalink / raw)
  To: Johannes Berg
  Cc: jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe,
	peterz, tglx, gpiccoli, pmladek, tiwai, schlad,
	andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k

On Sat, May 16, 2020 at 03:50:55PM +0200, Johannes Berg wrote:
> On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote:
> 
> > Instead of the kernel taint, IMHO you should provide an annotation in
> > sysfs (or somewhere else) for the *struct device* that had its firmware
> > crash. Or maybe, if it's too complex to walk the entire hierarchy
> > checking for that, have a uevent, or add the ability for the kernel to
> > print out elsewhere in debugfs the list of devices that crashed at some
> 
> I mean sysfs, oops.
> 
> 
> In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> detect that the device is really wedged enough that the only way we can
> still try to recover is by completely unbinding the driver from it, then
> we give userspace a uevent for that.

Nice! Indeed a uevent is in order for these sorts of things, and I'd
argue that it begs the question if we should even uevent for any taint
as well. Today these are silent. If the kernel crashes, today we only
give userspace a log.

> I don't remember exactly how and
> where that gets used (ChromeOS) though, but it'd be nice to have that
> sort of thing as part of the infrastructure, in a sort of two-level
> notification?
> 
> Level 1: firmware crashed, but we're recovering, at least mostly, and
> it's more informational
> 
> Level 2: device is wedged, going to try to recover by some more forceful
> means (perhaps some devices can be power-cycled? etc.) but (more) state
> would be lost in these cases?

I agree that *all* this would be ideal. I don't see this as mutually
exclusive with a taint on the kernel and module for the device.

> Still don't think a kernel taint is appropriate for either of these.

From a support perspective, I do think it is vital quick information.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-16 13:50     ` Johannes Berg
  2020-05-18 16:56       ` Luis Chamberlain
@ 2020-05-19  1:23       ` Brian Norris
  2020-05-19 14:02         ` Luis Chamberlain
  1 sibling, 1 reply; 51+ messages in thread
From: Brian Norris @ 2020-05-19  1:23 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Luis Chamberlain, jeyu, Andrew Morton, Arnd Bergmann,
	Steven Rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx,
	gpiccoli, pmladek, Takashi Iwai, schlad, Andy Shevchenko,
	Kees Cook, daniel.vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, <netdev@vger.kernel.org>,
	Linux Kernel, linux-wireless, ath10k

On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> detect that the device is really wedged enough that the only way we can
> still try to recover is by completely unbinding the driver from it, then
> we give userspace a uevent for that. I don't remember exactly how and
> where that gets used (ChromeOS) though, but it'd be nice to have that
> sort of thing as part of the infrastructure, in a sort of two-level
> notification?

<slight side track>
We use this on certain devices where we know the underlying hardware
has design issues that may lead to device failure -- then when we see
this sort of unrecoverable "firmware-death", we remove the
device[*]+driver, force-reset the PCI device (SBR), and try to
reload/reattach the driver. This all happens by way of a udev rule. We
also log this sort of stuff (and metrics around it) for bug reports
and health statistics, since we really hope to not see this happen
often.

[*] "We" (user space) don't actually do this...it happens via the
'remove_when_gone' module parameter abomination found in iwlwifi. I'd
personally rather see the EVENT=INACESSIBLE stuff on its own, and let
user space deal with when and how to remove and reset the device. But
I digress too much here ;)
</slight side track>

I really came to this thread to say that I also love the idea of a
generic mechanism (a la $subject) to report firmware crashes, but I
also have no interest in seeing a taint flag for it. For Chrome OS, I
would readily (as in, we're already looking at more-hacky /
non-generic ways to do this for drivers we care about) process these
kinds of stats as they happen, logging metrics for bug reports and/or
for automated crash statistics, when we see a firmware crash. A uevent
would suit us very well I think, although it would be nice if drivers
could also supply some small amount of informative text along with it
(e.g., a sort of "reason code", in case we can possibly aggregate
certain failure types). We already do this sort of thing for WARN()
and friends (not via uevent, but via log parsing; at least it has nice
"cut here" markers!).

Perhaps devlink (as proposed down-thread) would also fit the bill. I
don't think sysfs alone would fit our needs, as we'd like to process
these things as they happen, not only when a user submits a bug
report.

> Level 1: firmware crashed, but we're recovering, at least mostly, and
> it's more informational

Chrome OS would love to track these things too, since we'd like to see
these minimized, even if they're usually recoverable ;)

> Level 2: device is wedged, going to try to recover by some more forceful
> means (perhaps some devices can be power-cycled? etc.) but (more) state
> would be lost in these cases?

And we'd definitely want to know about these. We already get this for
the iwlwifi case described above, in a non-generic way.

In general, it's probably not that easy to tell the difference between
1 and 2, since even as you and Luis have noted, with the same driver
(and the same driver location), you find the same crashes may or may
not be recoverable. iwlwifi has extracted certain level 2 cases into
iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the
ways in which level 1 crashes actually lead to severe
(non-recoverable) failure.

Regards,
Brian

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-19  1:23       ` Brian Norris
@ 2020-05-19 14:02         ` Luis Chamberlain
  2020-05-20  0:47           ` Brian Norris
  0 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-19 14:02 UTC (permalink / raw)
  To: Brian Norris
  Cc: Johannes Berg, linux-wireless, aquini, peterz, daniel.vetter,
	mchehab+samsung, will, bhe, ath10k, Takashi Iwai, mingo, dyoung,
	pmladek, Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai,
	tglx, Andy Shevchenko, Kalle Valo, <netdev@vger.kernel.org>,
	schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller

On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > detect that the device is really wedged enough that the only way we can
> > still try to recover is by completely unbinding the driver from it, then
> > we give userspace a uevent for that. I don't remember exactly how and
> > where that gets used (ChromeOS) though, but it'd be nice to have that
> > sort of thing as part of the infrastructure, in a sort of two-level
> > notification?
> 
> <slight side track>
> We use this on certain devices where we know the underlying hardware
> has design issues that may lead to device failure 

Ah, after reading below I see you meant for iwlwifi.

If userspace can indeed grow to support this, that would be fantastic.

I should note that I don't discourage hiding firmware or hardware
issues. Quite the contrary, I suspect that taking pride in being
trasnparent about it, and dealing with it fast can help lead the pack.
I wrote about this long ago in 2015 [0], and stand by it.

[0] https://www.do-not-panic.com/2015/04/god-complex-why-open-models-will-win.html

> -- then when we see
> this sort of unrecoverable "firmware-death", we remove the
> device[*]+driver, force-reset the PCI device (SBR), and try to
> reload/reattach the driver. This all happens by way of a udev rule.

So you've sprikled your own udev event here as part of your kernel delta?

> We
> also log this sort of stuff (and metrics around it) for bug reports
> and health statistics, since we really hope to not see this happen
> often.

Assuming perfection is ideal but silly. So, what infrastructure do you
use for this sort of issue?

> [*] "We" (user space) don't actually do this...it happens via the
> 'remove_when_gone' module parameter abomination found in iwlwifi.

Holy moly.. but hey, at least it may seem a bit more seemless than forcing
a reboot / manual driver removal / addition to the user.

BTW is this likely a place on iwlwifi where the firmware likely crashed?

> I'd
> personally rather see the EVENT=INACESSIBLE stuff on its own, and let
> user space deal with when and how to remove and reset the device. But
> I digress too much here ;)
> </slight side track>

This is all useful information. We are just touching the surface of the
topic by addressing networking first. Imagine when we address other
subsystems.

> I really came to this thread to say that I also love the idea of a
> generic mechanism (a la $subject) to report firmware crashes, but I
> also have no interest in seeing a taint flag for it. For Chrome OS, I
> would readily (as in, we're already looking at more-hacky /
> non-generic ways to do this for drivers we care about) process these
> kinds of stats as they happen, logging metrics for bug reports and/or
> for automated crash statistics, when we see a firmware crash.

Great!

> A uevent
> would suit us very well I think, although it would be nice if drivers
> could also supply some small amount of informative text along with it

A follow up to this series was to add a uevent to add_taint(), however
since a *count* is not considered I think it is correct to seek
alternatives at this point. The leaner the solution the better though.

Do you have a pointer to what guys use so I can read?

> (e.g., a sort of "reason code", in case we can possibly aggregate
> certain failure types). We already do this sort of thing for WARN()
> and friends (not via uevent, but via log parsing; at least it has nice
> "cut here" markers!).

Indeed, similar things can indeed be argued about WARN*()... this
however can be non-device specific. With panic-on-warn becoming a
"thing", the more important it becomes to really tally exactly *why*
these WARN*()s may trigger.

> Perhaps 

Note below.

> devlink (as proposed down-thread) would also fit the bill. I
> don't think sysfs alone would fit our needs, as we'd like to process
> these things as they happen, not only when a user submits a bug
> report.

I think we've reached a point where using "*Perhaps*" does not suffice,
and if there is already a *user* of similar desired infrastructure I
think we should jump on the opportunity to replace what you have with
something which could be used by other devices / subsystems which
require firmware. And indeed, also even consider in the abstract sense,
the possibility to leverage something like this for WARN*()s later too.

> > Level 1: firmware crashed, but we're recovering, at least mostly, and
> > it's more informational
> 
> Chrome OS would love to track these things too, since we'd like to see
> these minimized, even if they're usually recoverable ;)
> 
> > Level 2: device is wedged, going to try to recover by some more forceful
> > means (perhaps some devices can be power-cycled? etc.) but (more) state
> > would be lost in these cases?
> 
> And we'd definitely want to know about these. We already get this for
> the iwlwifi case described above, in a non-generic way.
> 
> In general, it's probably not that easy to tell the difference between
> 1 and 2, since even as you and Luis have noted, with the same driver
> (and the same driver location), you find the same crashes may or may
> not be recoverable. iwlwifi has extracted certain level 2 cases into
> iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the
> ways in which level 1 crashes actually lead to severe
> (non-recoverable) failure.

And that is fine, accepting these for what they are will help. However,
leaving the user in the *dark*, is what we should *not do*.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-19 14:02         ` Luis Chamberlain
@ 2020-05-20  0:47           ` Brian Norris
  2020-05-20  5:37             ` Emmanuel Grumbach
  0 siblings, 1 reply; 51+ messages in thread
From: Brian Norris @ 2020-05-20  0:47 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Johannes Berg, linux-wireless, aquini, peterz, Daniel Vetter,
	mchehab+samsung, will, bhe, ath10k, Takashi Iwai, mingo, dyoung,
	pmladek, Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai,
	tglx, Andy Shevchenko, Kalle Valo, <netdev@vger.kernel.org>,
	schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller

Hi Luis,

On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > > detect that the device is really wedged enough that the only way we can
> > > still try to recover is by completely unbinding the driver from it, then
> > > we give userspace a uevent for that. I don't remember exactly how and
> > > where that gets used (ChromeOS) though, but it'd be nice to have that
> > > sort of thing as part of the infrastructure, in a sort of two-level
> > > notification?
> >
> > <slight side track>
> > We use this on certain devices where we know the underlying hardware
> > has design issues that may lead to device failure
>
> Ah, after reading below I see you meant for iwlwifi.

Sorry, I was replying to Johannes, who I believe had his "we"="Intel"
hat (as iwlwifi maintainer) on, and was pointing at
iwl_trans_pcie_removal_wk().

> If userspace can indeed grow to support this, that would be fantastic.

Well, Chrome OS tailors its user space a bit more to the hardware (and
kernel/drivers in use) than the average distro might. We already do
this (for some values of "this") today. Is that "fantastic" to you? :D

> > -- then when we see
> > this sort of unrecoverable "firmware-death", we remove the
> > device[*]+driver, force-reset the PCI device (SBR), and try to
> > reload/reattach the driver. This all happens by way of a udev rule.
>
> So you've sprikled your own udev event here as part of your kernel delta?

No kernel delta -- the event is there already:
iwl_trans_pcie_removal_wk()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027

And you can see our udev rules and scripts, in all their ugly details
here, if you really care:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/

> > We
> > also log this sort of stuff (and metrics around it) for bug reports
> > and health statistics, since we really hope to not see this happen
> > often.
>
> Assuming perfection is ideal but silly. So, what infrastructure do you
> use for this sort of issue?

We don't yet log firmware crashes generally, but for all our current
crash reports (including WARN()), they go through this:
https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md

For example, look for "cut here" in:
https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc

For other specific metrics (like counting "EVENT=INACCESSIBLE"), we
use the Chrome UMA system:
https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md

I don't imagine the "infrastructure" side of any of that would be
useful to you, but maybe the client-side gathering can at least show
you what we do.

> > [*] "We" (user space) don't actually do this...it happens via the
> > 'remove_when_gone' module parameter abomination found in iwlwifi.
>
> BTW is this likely a place on iwlwifi where the firmware likely crashed?

iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out
in a way that is likely due to a dead PCIe endpoint. It's not directly
a firmware crash, although there may be firmware crashes reported
around the same time.

Intel folks can probably give a more nuanced answer, but their
firmware crashes usually land in something like iwl_mvm_nic_error() ->
iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). If you can make
your proposal less punishing (e.g., removing the "taint" as Johannes
requested), maybe iwlwifi would be another good candidate for
$subject. iwlwifi generally expects to recover seamlessly, but it's
also good to know if you've seen a hundred of these in a row.

> > A uevent
> > would suit us very well I think, although it would be nice if drivers
> > could also supply some small amount of informative text along with it
>
> A follow up to this series was to add a uevent to add_taint(), however
> since a *count* is not considered I think it is correct to seek
> alternatives at this point. The leaner the solution the better though.
>
> Do you have a pointer to what guys use so I can read?

Hopefully the above pointers are useful to you. We don't yet have
firmware-crash parsing implemented, so I don't have pointers for that
piece yet.

> > (e.g., a sort of "reason code", in case we can possibly aggregate
> > certain failure types). We already do this sort of thing for WARN()
> > and friends (not via uevent, but via log parsing; at least it has nice
> > "cut here" markers!).
>
> Indeed, similar things can indeed be argued about WARN*()... this
> however can be non-device specific. With panic-on-warn becoming a
> "thing", the more important it becomes to really tally exactly *why*
> these WARN*()s may trigger.

panic-on-warn? Yikes. I guess such folks don't run mac80211... I'm
probably not supposed to publicize information related to how many
Chromebooks are out there, but we regularly see a scary number of
non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users
every day, and a scary number of those are in WiFi drivers...

> > Perhaps
>
> Note below.

(My use of "perhaps" is only because I'm not familiar with devlink at all.)

> > devlink (as proposed down-thread) would also fit the bill. I
> > don't think sysfs alone would fit our needs, as we'd like to process
> > these things as they happen, not only when a user submits a bug
> > report.
>
> I think we've reached a point where using "*Perhaps*" does not suffice,
> and if there is already a *user* of similar desired infrastructure I
> think we should jump on the opportunity to replace what you have with
> something which could be used by other devices / subsystems which
> require firmware. And indeed, also even consider in the abstract sense,
> the possibility to leverage something like this for WARN*()s later too.

To precisely lay out what Chrome OS does today:

* WARN() and similar: implemented, see anomaly_detector.cc above. It's
just log parsing, and that handy "cut here" stuff -- I'm nearly
certain there are other distros using this already? A uevent would
probably be nice, but log parsing is good enough for us today.
* EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked
above. It's a uevent, and we handle it via udev. Code is linked above.
* Other firmware crashes: not yet implemented here, but we're looking
at it. Currently, we plan to do something similar to WARN(), but it
will be ugly and non-generic. A uevent would be a lovely replacement,
if it has some basic metadata with it. Stats in sysfs might be icing
on the cake, but mostly redundant for us, if we can grab it on the fly
via uevent.

HTH,
Brian

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-20  0:47           ` Brian Norris
@ 2020-05-20  5:37             ` Emmanuel Grumbach
  2020-05-20  8:32               ` Andy Shevchenko
  2020-05-21 19:01               ` Brian Norris
  0 siblings, 2 replies; 51+ messages in thread
From: Emmanuel Grumbach @ 2020-05-20  5:37 UTC (permalink / raw)
  To: Brian Norris
  Cc: Luis Chamberlain, Johannes Berg, linux-wireless, aquini, peterz,
	Daniel Vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai,
	mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli,
	Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo,
	<netdev@vger.kernel.org>,
	schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller

Hi all,

<top post intro>

Since I have been involved quite a bit in the firmware debugging
features in iwlwifi, I think I can give a few insights here.

But before this, we need to understand that there are several sources of issues:
1) the firmware may crash but the bus is still alive, you can still
use the bus to get the crash data
2) the bus is dead, when that happens, the firmware might even be in a
good condition, but since the bus is dead, you stop getting any
information about the firmware, and then, at some point, you get to
the conclusion that the firmware is dead. You can't get the crash data
that resides on the other side of the bus (you may have gathered data
in the DRAM directly, but that's a different thing), and you don't
have much recovery to do besides re-starting the PCI enumeration.

At Intel, we have seen both unfortunately. The bus issues are the ones
that are trickier obviously. Trickier to detect (because you just get
garbage from any request you issue on the bus), and trickier to
handle. One can argue that the kernel should *not* handle those and
let this in userspace hands. I guess it all depends on what component
you ship to your customer and what you customer asks from you  :).

</top post intro>

>
> Hi Luis,
>
> On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> > > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > > > detect that the device is really wedged enough that the only way we can
> > > > still try to recover is by completely unbinding the driver from it, then
> > > > we give userspace a uevent for that. I don't remember exactly how and
> > > > where that gets used (ChromeOS) though, but it'd be nice to have that
> > > > sort of thing as part of the infrastructure, in a sort of two-level
> > > > notification?
> > >
> > > <slight side track>
> > > We use this on certain devices where we know the underlying hardware
> > > has design issues that may lead to device failure
> >
> > Ah, after reading below I see you meant for iwlwifi.
>
> Sorry, I was replying to Johannes, who I believe had his "we"="Intel"
> hat (as iwlwifi maintainer) on, and was pointing at
> iwl_trans_pcie_removal_wk().
>

This pcie_removal thing is for the bus dead thing. My 2) above.

> > If userspace can indeed grow to support this, that would be fantastic.
>
> Well, Chrome OS tailors its user space a bit more to the hardware (and
> kernel/drivers in use) than the average distro might. We already do
> this (for some values of "this") today. Is that "fantastic" to you? :D

I guess it can be fantastic if other vendors also suffer from this. Or
maybe that could be done as part of the PCI bus driver inside the
kernel?

>
> > > -- then when we see
> > > this sort of unrecoverable "firmware-death", we remove the
> > > device[*]+driver, force-reset the PCI device (SBR), and try to
> > > reload/reattach the driver. This all happens by way of a udev rule.
> >
> > So you've sprikled your own udev event here as part of your kernel delta?
>
> No kernel delta -- the event is there already:
> iwl_trans_pcie_removal_wk()
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027
>
> And you can see our udev rules and scripts, in all their ugly details
> here, if you really care:
> https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/
>
> > > We
> > > also log this sort of stuff (and metrics around it) for bug reports
> > > and health statistics, since we really hope to not see this happen
> > > often.
> >
> > Assuming perfection is ideal but silly. So, what infrastructure do you
> > use for this sort of issue?
>
> We don't yet log firmware crashes generally, but for all our current
> crash reports (including WARN()), they go through this:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md
>
> For example, look for "cut here" in:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc
>
> For other specific metrics (like counting "EVENT=INACCESSIBLE"), we
> use the Chrome UMA system:
> https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md
>
> I don't imagine the "infrastructure" side of any of that would be
> useful to you, but maybe the client-side gathering can at least show
> you what we do.
>
> > > [*] "We" (user space) don't actually do this...it happens via the
> > > 'remove_when_gone' module parameter abomination found in iwlwifi.
> >
> > BTW is this likely a place on iwlwifi where the firmware likely crashed?
>
> iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out
> in a way that is likely due to a dead PCIe endpoint. It's not directly
> a firmware crash, although there may be firmware crashes reported
> around the same time.

iwl_trans_pcie_removal_wk()  is only because of a dead bus, not
because of a firmware crash.
>
> Intel folks can probably give a more nuanced answer, but their
> firmware crashes usually land in something like iwl_mvm_nic_error() ->
> iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). If you can make
> your proposal less punishing (e.g., removing the "taint" as Johannes
> requested), maybe iwlwifi would be another good candidate for
> $subject. iwlwifi generally expects to recover seamlessly, but it's
> also good to know if you've seen a hundred of these in a row.

Yes, you are right, mvm_nic_error is the place.

So here is the proposal.
I think that a standard interface that can allow a driver to report a
firmware crash along with a proprietary unique id that makes sense to
the vendor can be useful. Then, yes, distros can listen to this,
optionally open bugs (automatically?) when that happens. I even
planned to do this long ago and integrated with a specific distro, but
it never rolled out. The vendor supplied unique id is very important
for the bug de-duplication part. For iwlwifi, we have the SYSASSERT
number which is basically how the firmware tells us briefly what
happened. Of course, there is always more data that can be useful, but
for a first level of bug de-duplication this can be enough. Then, you
have a standard way to track the firmware crashes.
We need to remember that the firmware recovery path is expected to
work, it is, yet, less tested. I specifically remember a bug where a
crash because by a bad handling of a CSA frame caused a firmware crash
in a flow that caused the driver not to re-add a station or something
like that, and because of that, you get another firmware crash. So
sometimes it is interesting to see not only the data on which firmware
crash happened and how many times, but if there is a timing
relationship between them. I guess that's for the next level though...
Not really critical for now.

The next problem here is that when you tell the firmware folks that
they have a bug, the first they'll do is to ask for more data. Back
then, I enabled our firmware debug infra on Linux, and from there
devcoredump infra was born (thanks Johannes). What we do at Intel, is
that we have a script that runs when a udev event reports the creation
of a devcoredump. It parses the kernel log (dmesg) to determine the
unique id I mentioned earlier (the SYSSASSERT number basically) and
then it creates a file in /var/log/ whose name contain the SYSSASSERT
number. It is then quite easy to match the firmware data with the
firmware crash.
So I believe the right way to do this is to create the devcoredump
along with the id. Meaning that we basically don't need another
interface, we just need to use the same one, but to provide the unique
id of the crash. This unique id can then be part of the uevent that is
thrown to the userspace and userspace can use it to name the dump file
with the right id. This way, it is fairly easy (and standard across
vendors) to track the firmware crashes, but the most important is that
you already have the firmware data that goes along with it. It would
look like this.

driver_code.c:
void my_vendor_error_interrupt()
{
  u64 uid = get_the_unique_id_from_your_device();
  void *firmware_data = get_the_data_you_need();

 dev_coredumpsg(dev_struct, firmware_data, firmware_data_len,
                              GFP_whatver, uid);
}

And then, this would cause a:
/var/log/wifi-crash-$(KBUILD_MODNAME)-,uid.bin to appear our your filesystem.

>
> > > A uevent
> > > would suit us very well I think, although it would be nice if drivers
> > > could also supply some small amount of informative text along with it
> >
> > A follow up to this series was to add a uevent to add_taint(), however
> > since a *count* is not considered I think it is correct to seek
> > alternatives at this point. The leaner the solution the better though.
> >
> > Do you have a pointer to what guys use so I can read?
>
> Hopefully the above pointers are useful to you. We don't yet have
> firmware-crash parsing implemented, so I don't have pointers for that
> piece yet.

See above. I don't think it is the device driver's role to count those.
We can add this count this in userspace I think. Debatable though.

>
> > > (e.g., a sort of "reason code", in case we can possibly aggregate
> > > certain failure types). We already do this sort of thing for WARN()
> > > and friends (not via uevent, but via log parsing; at least it has nice
> > > "cut here" markers!).
> >
> > Indeed, similar things can indeed be argued about WARN*()... this
> > however can be non-device specific. With panic-on-warn becoming a
> > "thing", the more important it becomes to really tally exactly *why*
> > these WARN*()s may trigger.
>
> panic-on-warn? Yikes. I guess such folks don't run mac80211... I'm
> probably not supposed to publicize information related to how many
> Chromebooks are out there, but we regularly see a scary number of
> non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users
> every day, and a scary number of those are in WiFi drivers...
>
> > > Perhaps
> >
> > Note below.
>
> (My use of "perhaps" is only because I'm not familiar with devlink at all.)
>
> > > devlink (as proposed down-thread) would also fit the bill. I
> > > don't think sysfs alone would fit our needs, as we'd like to process
> > > these things as they happen, not only when a user submits a bug
> > > report.
> >
> > I think we've reached a point where using "*Perhaps*" does not suffice,
> > and if there is already a *user* of similar desired infrastructure I
> > think we should jump on the opportunity to replace what you have with
> > something which could be used by other devices / subsystems which
> > require firmware. And indeed, also even consider in the abstract sense,
> > the possibility to leverage something like this for WARN*()s later too.
>
> To precisely lay out what Chrome OS does today:
>
> * WARN() and similar: implemented, see anomaly_detector.cc above. It's
> just log parsing, and that handy "cut here" stuff -- I'm nearly
> certain there are other distros using this already? A uevent would
> probably be nice, but log parsing is good enough for us today.
> * EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked
> above. It's a uevent, and we handle it via udev. Code is linked above.
> * Other firmware crashes: not yet implemented here, but we're looking
> at it. Currently, we plan to do something similar to WARN(), but it
> will be ugly and non-generic. A uevent would be a lovely replacement,
> if it has some basic metadata with it. Stats in sysfs might be icing
> on the cake, but mostly redundant for us, if we can grab it on the fly
> via uevent.

So I believe we already have this uevent, it is the devcoredump. All
we need is to add the unique id.
Note that all this is good for firmware crashes, and not for bus dead
scenarios, but those two are fundamentally different even if they look
alike at the beginning of your error detection flow.
>
> HTH,
> Brian

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-20  5:37             ` Emmanuel Grumbach
@ 2020-05-20  8:32               ` Andy Shevchenko
  2020-05-21 19:01               ` Brian Norris
  1 sibling, 0 replies; 51+ messages in thread
From: Andy Shevchenko @ 2020-05-20  8:32 UTC (permalink / raw)
  To: Emmanuel Grumbach
  Cc: Brian Norris, Luis Chamberlain, Johannes Berg, linux-wireless,
	aquini, Peter Zijlstra (Intel),
	Daniel Vetter, Mauro Carvalho Chehab, Will Deacon, Baoquan He,
	ath10k, Takashi Iwai, Ingo Molnar, Dave Young, Petr Mladek,
	Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai,
	Thomas Gleixner, Andy Shevchenko, Kalle Valo,
	<netdev@vger.kernel.org>,
	schlad, Linux Kernel, Jessica Yu, Andrew Morton, David S. Miller

On Wed, May 20, 2020 at 8:40 AM Emmanuel Grumbach <egrumbach@gmail.com> wrote:

> Since I have been involved quite a bit in the firmware debugging
> features in iwlwifi, I think I can give a few insights here.
>
> But before this, we need to understand that there are several sources of issues:
> 1) the firmware may crash but the bus is still alive, you can still
> use the bus to get the crash data
> 2) the bus is dead, when that happens, the firmware might even be in a
> good condition, but since the bus is dead, you stop getting any
> information about the firmware, and then, at some point, you get to
> the conclusion that the firmware is dead. You can't get the crash data
> that resides on the other side of the bus (you may have gathered data
> in the DRAM directly, but that's a different thing), and you don't
> have much recovery to do besides re-starting the PCI enumeration.
>
> At Intel, we have seen both unfortunately. The bus issues are the ones
> that are trickier obviously. Trickier to detect (because you just get
> garbage from any request you issue on the bus), and trickier to
> handle. One can argue that the kernel should *not* handle those and
> let this in userspace hands. I guess it all depends on what component
> you ship to your customer and what you customer asks from you  :).

Or the two best approaches:
1) get rid of firmware completely;
2) make it OSS (like SOF).

I think any of these is a right thing to do in long-term perspective.

How many firmwares average computer has? 50? 100? Any of them is a
burden and PITA.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-20  5:37             ` Emmanuel Grumbach
  2020-05-20  8:32               ` Andy Shevchenko
@ 2020-05-21 19:01               ` Brian Norris
  2020-05-22  5:12                 ` Emmanuel Grumbach
  1 sibling, 1 reply; 51+ messages in thread
From: Brian Norris @ 2020-05-21 19:01 UTC (permalink / raw)
  To: Emmanuel Grumbach
  Cc: Luis Chamberlain, Johannes Berg, linux-wireless, aquini, peterz,
	Daniel Vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai,
	mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli,
	Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo,
	<netdev@vger.kernel.org>,
	schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller

On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> So I believe we already have this uevent, it is the devcoredump. All
> we need is to add the unique id.

I think there are a few reasons that devcoredump doesn't satisfy what
either Luis or I want.

1) it can be disabled entirely [1], for good reasons (e.g., think of
non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
with the opaque dumps provided by closed-source firmware)
2) not all drivers necessarily have a useful dump to provide when
there's a crash; look at the rest of Luis's series to see the kinds of
drivers-with-firmware that are crashing, some of which aren't dumping
anything
3) for those that do support devcoredump, it may be used for purposes
that are not "crashes" -- e.g., some provide debugfs or other knobs to
initiate dumps, for diagnostic or debugging purposes

Brian

[1] devcd_disabled
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-21 19:01               ` Brian Norris
@ 2020-05-22  5:12                 ` Emmanuel Grumbach
  2020-05-22  5:23                   ` Luis Chamberlain
  0 siblings, 1 reply; 51+ messages in thread
From: Emmanuel Grumbach @ 2020-05-22  5:12 UTC (permalink / raw)
  To: Brian Norris
  Cc: Luis Chamberlain, Johannes Berg, linux-wireless, aquini, peterz,
	Daniel Vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai,
	mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli,
	Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo,
	<netdev@vger.kernel.org>,
	schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller

>
> On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> > So I believe we already have this uevent, it is the devcoredump. All
> > we need is to add the unique id.
>
> I think there are a few reasons that devcoredump doesn't satisfy what
> either Luis or I want.
>
> 1) it can be disabled entirely [1], for good reasons (e.g., think of
> non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
> with the opaque dumps provided by closed-source firmware)

Ok, if all you're interested into is the information that this event
happen (as opposed to report a bug and providing the data), then I
agree. True, not everybody want or can enable devcoredump. I am just a
bit concerned that we may end up with two interface that notify the
same event basically. The ideal maybe would be to be able to
optionally reduce the content of the devoredump to nothing more that
is already in the dmesg output. But then, it is not what it is meant
to be: namely, a core dump..

> 2) not all drivers necessarily have a useful dump to provide when
> there's a crash; look at the rest of Luis's series to see the kinds of
> drivers-with-firmware that are crashing, some of which aren't dumping
> anything

Fair enouh.

> 3) for those that do support devcoredump, it may be used for purposes
> that are not "crashes" -- e.g., some provide debugfs or other knobs to
> initiate dumps, for diagnostic or debugging purposes

Not sure I really think we need to care about those cases, but you
already have 2 good arguments :)

>
> Brian
>
> [1] devcd_disabled
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-22  5:12                 ` Emmanuel Grumbach
@ 2020-05-22  5:23                   ` Luis Chamberlain
  0 siblings, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-22  5:23 UTC (permalink / raw)
  To: Emmanuel Grumbach
  Cc: Brian Norris, Johannes Berg, linux-wireless, aquini, peterz,
	Daniel Vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai,
	mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli,
	Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo,
	<netdev@vger.kernel.org>,
	schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller

On Fri, May 22, 2020 at 08:12:59AM +0300, Emmanuel Grumbach wrote:
> >
> > On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote:
> > > So I believe we already have this uevent, it is the devcoredump. All
> > > we need is to add the unique id.
> >
> > I think there are a few reasons that devcoredump doesn't satisfy what
> > either Luis or I want.
> >
> > 1) it can be disabled entirely [1], for good reasons (e.g., think of
> > non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything
> > with the opaque dumps provided by closed-source firmware)
> 
> Ok, if all you're interested into is the information that this event
> happen (as opposed to report a bug and providing the data), then I
> agree. 

I've now hit again a firmware crash with ath10k with the latest firwmare
and kernel and the *only* thing that helped recovery was a full reboot,
so that is a crystal clear case that this needs to taint the kernel, and
yes we do want to inform users too, so I've just added uevent support
for a few panic / taint events in the kernel now and rolled into my
series. I'll run some final tests and then post this as a follow up.

devlink didn't cut it, its networking specific.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-16 13:24   ` Johannes Berg
  2020-05-16 13:50     ` Johannes Berg
@ 2020-05-18 16:51     ` Luis Chamberlain
  2020-05-18 16:58       ` Ben Greear
  1 sibling, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-18 16:51 UTC (permalink / raw)
  To: Johannes Berg
  Cc: jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe,
	peterz, tglx, gpiccoli, pmladek, tiwai, schlad,
	andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k

On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> 
> You didn't CC me or the wireless list on the rest of the patches, so I'm
> replying to a random one, but ...
> 
> What is the point here?
> 
> This should in no way affect the integrity of the system/kernel, for
> most devices anyway.

Keyword you used here is "most device". And in the worst case, *who*
knows what other odd things may happen afterwards.

> So what if ath10k's firmware crashes? If there's a driver bug it will
> not handle it right (and probably crash, WARN_ON, or something else),
> but if the driver is working right then that will not affect the kernel
> at all.

Sometimes the device can go into a state which requires driver removal
and addition to get things back up.

> So maybe I can understand that maybe you want an easy way to discover -
> per device - that the firmware crashed, but that still doesn't warrant a
> complete kernel taint.

That is one reason, another is that a taint helps support cases *fast*
easily detect if the issue was a firmware crash, instead of scraping
logs for driver specific ways to say the firmware has crashed.

> Instead of the kernel taint, IMHO you should provide an annotation in
> sysfs (or somewhere else) for the *struct device* that had its firmware
> crash.

It would seem the way some folks are thinking about getting more details
would be through devlink.

> Or maybe, if it's too complex to walk the entire hierarchy
> checking for that, have a uevent,  or add the ability for the kernel to
> print out elsewhere in debugfs the list of devices that crashed at some
> point... All of that is fine, but a kernel taint?

debugfs is optional, a taint is simple, and device agnostic. From a
support perspective it is very easy to see if a possible issue may
be device firmware specific.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 16:51     ` Luis Chamberlain
@ 2020-05-18 16:58       ` Ben Greear
  2020-05-18 17:09         ` Luis Chamberlain
  0 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2020-05-18 16:58 UTC (permalink / raw)
  To: Luis Chamberlain, Johannes Berg
  Cc: jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe,
	peterz, tglx, gpiccoli, pmladek, tiwai, schlad,
	andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k



On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
>> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
>>
>> You didn't CC me or the wireless list on the rest of the patches, so I'm
>> replying to a random one, but ...
>>
>> What is the point here?
>>
>> This should in no way affect the integrity of the system/kernel, for
>> most devices anyway.
>
> Keyword you used here is "most device". And in the worst case, *who*
> knows what other odd things may happen afterwards.
>
>> So what if ath10k's firmware crashes? If there's a driver bug it will
>> not handle it right (and probably crash, WARN_ON, or something else),
>> but if the driver is working right then that will not affect the kernel
>> at all.
>
> Sometimes the device can go into a state which requires driver removal
> and addition to get things back up.

It would be lovely to be able to detect this case in the driver/system
somehow!  I haven't seen any such cases recently, but in case there is
some common case you see, maybe we can think of a way to detect it?

>
>> So maybe I can understand that maybe you want an easy way to discover -
>> per device - that the firmware crashed, but that still doesn't warrant a
>> complete kernel taint.
>
> That is one reason, another is that a taint helps support cases *fast*
> easily detect if the issue was a firmware crash, instead of scraping
> logs for driver specific ways to say the firmware has crashed.

You can listen for udev events (I think that is the right term),
and find crashes that way.  You get the actual crash info as well.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 16:58       ` Ben Greear
@ 2020-05-18 17:09         ` Luis Chamberlain
  2020-05-18 17:15           ` Ben Greear
  0 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-18 17:09 UTC (permalink / raw)
  To: Ben Greear
  Cc: Johannes Berg, jeyu, akpm, arnd, rostedt, mingo, aquini, cai,
	dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad,
	andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k

On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> 
> 
> On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> > > 
> > > You didn't CC me or the wireless list on the rest of the patches, so I'm
> > > replying to a random one, but ...
> > > 
> > > What is the point here?
> > > 
> > > This should in no way affect the integrity of the system/kernel, for
> > > most devices anyway.
> > 
> > Keyword you used here is "most device". And in the worst case, *who*
> > knows what other odd things may happen afterwards.
> > 
> > > So what if ath10k's firmware crashes? If there's a driver bug it will
> > > not handle it right (and probably crash, WARN_ON, or something else),
> > > but if the driver is working right then that will not affect the kernel
> > > at all.
> > 
> > Sometimes the device can go into a state which requires driver removal
> > and addition to get things back up.
> 
> It would be lovely to be able to detect this case in the driver/system
> somehow!  I haven't seen any such cases recently,

I assure you that I have run into it. Once it does again I'll report
the crash, but the problem with some of this is that unless you scrape
the log you won't know. Eventually, a uevent would indeed tell inform
me.

> but in case there is
> some common case you see, maybe we can think of a way to detect it?

ath10k is just one case, this patch series addresses a simple way to
annotate this tree-wide.

> > > So maybe I can understand that maybe you want an easy way to discover -
> > > per device - that the firmware crashed, but that still doesn't warrant a
> > > complete kernel taint.
> > 
> > That is one reason, another is that a taint helps support cases *fast*
> > easily detect if the issue was a firmware crash, instead of scraping
> > logs for driver specific ways to say the firmware has crashed.
> 
> You can listen for udev events (I think that is the right term),
> and find crashes that way.  You get the actual crash info as well.

My follow up to this was to add uevent to add_taint() as well, this way
these could generically be processed by userspace.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 17:09         ` Luis Chamberlain
@ 2020-05-18 17:15           ` Ben Greear
  2020-05-18 17:18             ` Luis Chamberlain
  0 siblings, 1 reply; 51+ messages in thread
From: Ben Greear @ 2020-05-18 17:15 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Johannes Berg, jeyu, akpm, arnd, rostedt, mingo, aquini, cai,
	dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad,
	andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k



On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
>>
>>
>> On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
>>> On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
>>>> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
>>>>
>>>> You didn't CC me or the wireless list on the rest of the patches, so I'm
>>>> replying to a random one, but ...
>>>>
>>>> What is the point here?
>>>>
>>>> This should in no way affect the integrity of the system/kernel, for
>>>> most devices anyway.
>>>
>>> Keyword you used here is "most device". And in the worst case, *who*
>>> knows what other odd things may happen afterwards.
>>>
>>>> So what if ath10k's firmware crashes? If there's a driver bug it will
>>>> not handle it right (and probably crash, WARN_ON, or something else),
>>>> but if the driver is working right then that will not affect the kernel
>>>> at all.
>>>
>>> Sometimes the device can go into a state which requires driver removal
>>> and addition to get things back up.
>>
>> It would be lovely to be able to detect this case in the driver/system
>> somehow!  I haven't seen any such cases recently,
>
> I assure you that I have run into it. Once it does again I'll report
> the crash, but the problem with some of this is that unless you scrape
> the log you won't know. Eventually, a uevent would indeed tell inform
> me.
>
>> but in case there is
>> some common case you see, maybe we can think of a way to detect it?
>
> ath10k is just one case, this patch series addresses a simple way to
> annotate this tree-wide.
>
>>>> So maybe I can understand that maybe you want an easy way to discover -
>>>> per device - that the firmware crashed, but that still doesn't warrant a
>>>> complete kernel taint.
>>>
>>> That is one reason, another is that a taint helps support cases *fast*
>>> easily detect if the issue was a firmware crash, instead of scraping
>>> logs for driver specific ways to say the firmware has crashed.
>>
>> You can listen for udev events (I think that is the right term),
>> and find crashes that way.  You get the actual crash info as well.
>
> My follow up to this was to add uevent to add_taint() as well, this way
> these could generically be processed by userspace.

I'm not opposed to the taint, though I have not thought much on it.

But, if you can already get the crash info from uevent, and it automatically
comes without polling or scraping logs, then what benefit beyond that does
the taint give you?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 17:15           ` Ben Greear
@ 2020-05-18 17:18             ` Luis Chamberlain
  2020-05-18 18:06               ` Steve deRosier
  0 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-18 17:18 UTC (permalink / raw)
  To: Ben Greear
  Cc: Johannes Berg, jeyu, akpm, arnd, rostedt, mingo, aquini, cai,
	dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad,
	andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k

On Mon, May 18, 2020 at 10:15:45AM -0700, Ben Greear wrote:
> 
> 
> On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> > > 
> > > 
> > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> > > > > 
> > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm
> > > > > replying to a random one, but ...
> > > > > 
> > > > > What is the point here?
> > > > > 
> > > > > This should in no way affect the integrity of the system/kernel, for
> > > > > most devices anyway.
> > > > 
> > > > Keyword you used here is "most device". And in the worst case, *who*
> > > > knows what other odd things may happen afterwards.
> > > > 
> > > > > So what if ath10k's firmware crashes? If there's a driver bug it will
> > > > > not handle it right (and probably crash, WARN_ON, or something else),
> > > > > but if the driver is working right then that will not affect the kernel
> > > > > at all.
> > > > 
> > > > Sometimes the device can go into a state which requires driver removal
> > > > and addition to get things back up.
> > > 
> > > It would be lovely to be able to detect this case in the driver/system
> > > somehow!  I haven't seen any such cases recently,
> > 
> > I assure you that I have run into it. Once it does again I'll report
> > the crash, but the problem with some of this is that unless you scrape
> > the log you won't know. Eventually, a uevent would indeed tell inform
> > me.
> > 
> > > but in case there is
> > > some common case you see, maybe we can think of a way to detect it?
> > 
> > ath10k is just one case, this patch series addresses a simple way to
> > annotate this tree-wide.
> > 
> > > > > So maybe I can understand that maybe you want an easy way to discover -
> > > > > per device - that the firmware crashed, but that still doesn't warrant a
> > > > > complete kernel taint.
> > > > 
> > > > That is one reason, another is that a taint helps support cases *fast*
> > > > easily detect if the issue was a firmware crash, instead of scraping
> > > > logs for driver specific ways to say the firmware has crashed.
> > > 
> > > You can listen for udev events (I think that is the right term),
> > > and find crashes that way.  You get the actual crash info as well.
> > 
> > My follow up to this was to add uevent to add_taint() as well, this way
> > these could generically be processed by userspace.
> 
> I'm not opposed to the taint, though I have not thought much on it.
> 
> But, if you can already get the crash info from uevent, and it automatically
> comes without polling or scraping logs, then what benefit beyond that does
> the taint give you?

From a support perspective it is a *crystal* clear sign that the device
and / or device driver may be in a very bad state, in a generic way.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 17:18             ` Luis Chamberlain
@ 2020-05-18 18:06               ` Steve deRosier
  2020-05-18 19:09                 ` Luis Chamberlain
  0 siblings, 1 reply; 51+ messages in thread
From: Steve deRosier @ 2020-05-18 18:06 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Ben Greear, Johannes Berg, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek,
	Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter,
	will, mchehab+samsung, Kalle Valo, David S. Miller,
	Network Development, LKML, linux-wireless, ath10k

On Mon, May 18, 2020 at 10:19 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Mon, May 18, 2020 at 10:15:45AM -0700, Ben Greear wrote:
> >
> >
> > On 05/18/2020 10:09 AM, Luis Chamberlain wrote:
> > > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote:
> > > >
> > > >
> > > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote:
> > > > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote:
> > > > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed
> > > > > >
> > > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm
> > > > > > replying to a random one, but ...
> > > > > >
> > > > > > What is the point here?
> > > > > >
> > > > > > This should in no way affect the integrity of the system/kernel, for
> > > > > > most devices anyway.
> > > > >
> > > > > Keyword you used here is "most device". And in the worst case, *who*
> > > > > knows what other odd things may happen afterwards.
> > > > >
> > > > > > So what if ath10k's firmware crashes? If there's a driver bug it will
> > > > > > not handle it right (and probably crash, WARN_ON, or something else),
> > > > > > but if the driver is working right then that will not affect the kernel
> > > > > > at all.
> > > > >
> > > > > Sometimes the device can go into a state which requires driver removal
> > > > > and addition to get things back up.
> > > >
> > > > It would be lovely to be able to detect this case in the driver/system
> > > > somehow!  I haven't seen any such cases recently,
> > >
> > > I assure you that I have run into it. Once it does again I'll report
> > > the crash, but the problem with some of this is that unless you scrape
> > > the log you won't know. Eventually, a uevent would indeed tell inform
> > > me.
> > >
> > > > but in case there is
> > > > some common case you see, maybe we can think of a way to detect it?
> > >
> > > ath10k is just one case, this patch series addresses a simple way to
> > > annotate this tree-wide.
> > >
> > > > > > So maybe I can understand that maybe you want an easy way to discover -
> > > > > > per device - that the firmware crashed, but that still doesn't warrant a
> > > > > > complete kernel taint.
> > > > >
> > > > > That is one reason, another is that a taint helps support cases *fast*
> > > > > easily detect if the issue was a firmware crash, instead of scraping
> > > > > logs for driver specific ways to say the firmware has crashed.
> > > >
> > > > You can listen for udev events (I think that is the right term),
> > > > and find crashes that way.  You get the actual crash info as well.
> > >
> > > My follow up to this was to add uevent to add_taint() as well, this way
> > > these could generically be processed by userspace.
> >
> > I'm not opposed to the taint, though I have not thought much on it.
> >
> > But, if you can already get the crash info from uevent, and it automatically
> > comes without polling or scraping logs, then what benefit beyond that does
> > the taint give you?
>
> From a support perspective it is a *crystal* clear sign that the device
> and / or device driver may be in a very bad state, in a generic way.
>

Unfortunately a "taint" is interpreted by many users as: "your kernel
is really F#*D up, you better do something about it right now."
Assuming they're paying attention at all in the first place of course.

The fact is, WiFi chip firmware crashes, and in most cases the driver
is able to recover seamlessly. At least that is the case with most QCA
chipsets I work with. And the users or our ability to do anything
about it is minimal to none as we don't have access to firmware
source. It's too bad and I wish it weren't the case, but we have
embraced reality and most drivers have a recovery mechanism built in
for this case. In short, it's a non-event. I fear that elevating this
to a kernel taint will significantly increase "support" requests that
really are nothing but noise; similar to how the firmware load failure
messages (fail to load fw-2.bin, fail to load fw-1.bin, yay loaded
fw-0.bin) cause a lot of noise.

Not specifically opposed, but I wonder what it really accomplishes in
a world where the firmware crashing is pretty much a normal
occurrence.

If it goes in, I think that the drivers shouldn't trigger the taint if
they're able to recover normally. Only trigger on failure to come back
up.  In other words, the ideal place in the ath10k driver isn't where
you have proposed as at that point operation is normal and we're doing
a routine recovery.

- Steve





>   Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 18:06               ` Steve deRosier
@ 2020-05-18 19:09                 ` Luis Chamberlain
  2020-05-18 19:25                   ` Johannes Berg
  0 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-18 19:09 UTC (permalink / raw)
  To: Steve deRosier
  Cc: Ben Greear, Johannes Berg, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek,
	Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter,
	will, mchehab+samsung, Kalle Valo, David S. Miller,
	Network Development, LKML, linux-wireless, ath10k

On Mon, May 18, 2020 at 11:06:27AM -0700, Steve deRosier wrote:
> On Mon, May 18, 2020 at 10:19 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> > From a support perspective it is a *crystal* clear sign that the device
> > and / or device driver may be in a very bad state, in a generic way.
> >
> 
> Unfortunately a "taint" is interpreted by many users as: "your kernel
> is really F#*D up, you better do something about it right now."
> Assuming they're paying attention at all in the first place of course.

Taint historically has been used and still is today to help rule out
whether or not you get support, or how you get support.

For instance, a staging driver is not supported by some upstream
developers, but it will be by those who help staging and Greg. TAINT_CRAP
cannot be even more clear.

So, no, it is not just about "hey your kernel is messed up", there are
clear support boundaries being drawn.

> The fact is, WiFi chip firmware crashes, and in most cases the driver
> is able to recover seamlessly. At least that is the case with most QCA
> chipsets I work with. 

That has not been my exerience with the same driver, and so how do we
know? And this patch set is not about ath10k alone, I want you to
think about *all* device drivers with firmware. In my journey to scrape
the kernel for these cases I was very surprised by the amount of code
which clearly annotates these situations.

> And the users or our ability to do anything
> about it is minimal to none as we don't have access to firmware
> source.

This is not true, we have open firmware in WiFi. Some vendors choose
to not open source their firmware, that is their decision.

These days though, I think we all admit, that firmware crashes can use
a better generic infrastructure for ensuring that clearly affecting-user
experience issues. This patch is about that *when and if these happen*,
we annotate it in the kernel for support pursposes.

> It's too bad and I wish it weren't the case, but we have
> embraced reality and most drivers have a recovery mechanism built in
> for this case.

The mentality about firmware crashes being the end of the world is
certainly what will lead developers to often hide these. Where this
is openly clear, and not obfucscated I'd argue that firmware issues
get fixed likely more common.

So what you describe is not bad, its just accepting evolution.

> In short, it's a non-event. I fear that elevating this
> to a kernel taint will significantly increase "support" requests that
> really are nothing but noise;

That will depend on where you put this on the driver, and that is
why it is important to place it in the right place, if any.

> similar to how the firmware load failure
> messages (fail to load fw-2.bin, fail to load fw-1.bin, yay loaded
> fw-0.bin) cause a lot of noise.

That can be fixed, the developers behind this series gave up on it.
It has to do with a range version of supported firmwares, and all
being optional, but at least one is required.

> Not specifically opposed, but I wonder what it really accomplishes in
> a world where the firmware crashing is pretty much a normal
> occurrence.

Recovery without affecting user experience would be great, the taint is
*not* for those cases. The taint definition has:

+ 18) ``Q`` used by device drivers to annotate that the device driver's firmware
+     has crashed and the device's operation has been severely affected. The    
+     device may be left in a crippled state, requiring full driver removal /   
+     addition, system reboot, or it is unclear how long recovery will take.

Let me know if this is not clear.

> If it goes in, I think that the drivers shouldn't trigger the taint if
> they're able to recover normally. Only trigger on failure to come back
> up.  In other words, the ideal place in the ath10k driver isn't where
> you have proposed as at that point operation is normal and we're doing
> a routine recovery.

Sure, happy to remove it if indeed it is the case that the firwmare
crash is not happening to cripple the device, but I can vouch for the
fact that the exact place where I placed it left my device driver in a
state where I had to remove / add again.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 19:09                 ` Luis Chamberlain
@ 2020-05-18 19:25                   ` Johannes Berg
  2020-05-18 19:59                     ` Luis Chamberlain
  2020-05-18 20:28                     ` Jakub Kicinski
  0 siblings, 2 replies; 51+ messages in thread
From: Johannes Berg @ 2020-05-18 19:25 UTC (permalink / raw)
  To: Luis Chamberlain, Steve deRosier
  Cc: Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai,
	dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai,
	schlad, andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, Kalle Valo, David S. Miller,
	Network Development, LKML, linux-wireless, ath10k

On Mon, 2020-05-18 at 19:09 +0000, Luis Chamberlain wrote:

> > Unfortunately a "taint" is interpreted by many users as: "your kernel
> > is really F#*D up, you better do something about it right now."
> > Assuming they're paying attention at all in the first place of course.
> 
> Taint historically has been used and still is today to help rule out
> whether or not you get support, or how you get support.
> 
> For instance, a staging driver is not supported by some upstream
> developers, but it will be by those who help staging and Greg. TAINT_CRAP
> cannot be even more clear.
> 
> So, no, it is not just about "hey your kernel is messed up", there are
> clear support boundaries being drawn.

Err, no. Those two are most definitely related. Have you looked at (most
or some or whatever) staging drivers recently? Those contain all kinds
of garbage that might do whatever with your kernel.

Of course that's not a completely clear boundary, maybe you can find a
driver in staging that's perfect code just not written to kernel style?
But I find that hard to believe, in most cases.

So no, it's really not about "[a] staging driver is not supported" vs.
"your kernel is messed up". The very fact that you loaded one of those
things might very well have messed up your kernel entirely.

> These days though, I think we all admit, that firmware crashes can use
> a better generic infrastructure for ensuring that clearly affecting-user
> experience issues. This patch is about that *when and if these happen*,
> we annotate it in the kernel for support pursposes.

That's all fine, I just don't think it's appropriate to pretend that
your kernel is now 'tainted' (think about the meaning of that word) when
the firmware of some random device crashed. Heck, that could have been a
USB device that was since unplugged. Unless the driver is complete
garbage (hello staging again?) that really should have no lasting effect
on the system itself.

> Recovery without affecting user experience would be great, the taint is
> *not* for those cases. The taint definition has:
> 
> + 18) ``Q`` used by device drivers to annotate that the device driver's firmware
> +     has crashed and the device's operation has been severely affected. The    
> +     device may be left in a crippled state, requiring full driver removal /   
> +     addition, system reboot, or it is unclear how long recovery will take.
> 
> Let me know if this is not clear.

It's pretty clear, but even then, first of all I doubt this is the case
for many of the places that you've sprinkled the annotation on, and
secondly it actually hides useful information.

Regardless of the support issue, I think this hiding of information is
also problematic.

I really think we'd all be better off if you just made a sysfs file (I
mistyped debugfs in some other email, sorry, apparently you didn't see
the correction in time) that listed which device(s) crashed and how many
times. That would actually be useful. Because honestly, if a random
device crashed for some random reason, that's pretty much a non-event.
If it keeps happening, then we might even want to know about it.

You can obviously save the contents of this file into your bug reports
automatically and act accordingly, but I think you'll find that this is
far more useful than saying "TAINT_FIRMWARE_CRASHED" so I'll ignore this
report. Yeah, that might be reasonable thing if the bug report is about
slow wifi *and* you see that ath10k firmware crashed every 10 seconds,
but if it just crashed once a few days earlier it's of no importance to
the system anymore ... And certainly a reasonable driver (which I
believe ath10k to be) would _not_ randomly start corrupting memory
because its firmware crashed. Which really is what tainting the kernel
is about.

So no, even with all that, I still really believe you're solving the
wrong problem. Having information about firmware crashes, preferably
with some kind of frequency information attached, and *clearly* with
information about which device attached would be _great_. Munging it all
into one bit is actively harmful, IMO.

johannes

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 19:25                   ` Johannes Berg
@ 2020-05-18 19:59                     ` Luis Chamberlain
  2020-05-18 20:07                       ` Johannes Berg
  2020-05-18 20:28                     ` Jakub Kicinski
  1 sibling, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-18 19:59 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek,
	Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter,
	will, mchehab+samsung, Kalle Valo, David S. Miller,
	Network Development, LKML, linux-wireless, ath10k

On Mon, May 18, 2020 at 09:25:09PM +0200, Johannes Berg wrote:
> On Mon, 2020-05-18 at 19:09 +0000, Luis Chamberlain wrote:
> 
> > > Unfortunately a "taint" is interpreted by many users as: "your kernel
> > > is really F#*D up, you better do something about it right now."
> > > Assuming they're paying attention at all in the first place of course.
> > 
> > Taint historically has been used and still is today to help rule out
> > whether or not you get support, or how you get support.
> > 
> > For instance, a staging driver is not supported by some upstream
> > developers, but it will be by those who help staging and Greg. TAINT_CRAP
> > cannot be even more clear.
> > 
> > So, no, it is not just about "hey your kernel is messed up", there are
> > clear support boundaries being drawn.
> 
> Err, no. Those two are most definitely related. Have you looked at (most
> or some or whatever) staging drivers recently? Those contain all kinds
> of garbage that might do whatever with your kernel.

No, I stay away :)

> Of course that's not a completely clear boundary, maybe you can find a
> driver in staging that's perfect code just not written to kernel style?
> But I find that hard to believe, in most cases.
> 
> So no, it's really not about "[a] staging driver is not supported" vs.
> "your kernel is messed up". The very fact that you loaded one of those
> things might very well have messed up your kernel entirely.
> 
> > These days though, I think we all admit, that firmware crashes can use
> > a better generic infrastructure for ensuring that clearly affecting-user
> > experience issues. This patch is about that *when and if these happen*,
> > we annotate it in the kernel for support pursposes.
> 
> That's all fine, I just don't think it's appropriate to pretend that
> your kernel is now 'tainted' (think about the meaning of that word) when
> the firmware of some random device crashed.

If the firmware crash *does* require driver remove / addition again,
or a reboot, would you think that this is a situation that merits a taint?

> > Recovery without affecting user experience would be great, the taint is
> > *not* for those cases. The taint definition has:
> > 
> > + 18) ``Q`` used by device drivers to annotate that the device driver's firmware
> > +     has crashed and the device's operation has been severely affected. The    
> > +     device may be left in a crippled state, requiring full driver removal /   
> > +     addition, system reboot, or it is unclear how long recovery will take.
> > 
> > Let me know if this is not clear.
> 
> It's pretty clear, but even then, first of all I doubt this is the case
> for many of the places that you've sprinkled the annotation on,

We can remove it, for this driver I can vouch for its location as it did
reach a state where I required a reboot. And its not the first time this
has happened. This got me thinking about the bigger picture of the lack
of proper way to address these cases in the kernel, and how the user is
left dumbfounded.

> and secondly it actually hides useful information.

What is it hiding?

> Regardless of the support issue, I think this hiding of information is
> also problematic.
> 
> I really think we'd all be better off if you just made a sysfs file (I
> mistyped debugfs in some other email, sorry, apparently you didn't see
> the correction in time) that listed which device(s) crashed and how many
> times. 

Ah yes, count. The taint does not address count.

> That would actually be useful. Because honestly, if a random
> device crashed for some random reason, that's pretty much a non-event.
> If it keeps happening, then we might even want to know about it.

True.

> You can obviously save the contents of this file into your bug reports
> automatically and act accordingly, but I think you'll find that this is
> far more useful than saying "TAINT_FIRMWARE_CRASHED" so I'll ignore this
> report.

Absolutely.

> Yeah, that might be reasonable thing if the bug report is about
> slow wifi *and* you see that ath10k firmware crashed every 10 seconds,
> but if it just crashed once a few days earlier it's of no importance to
> the system anymore ... And certainly a reasonable driver (which I
> believe ath10k to be) would _not_ randomly start corrupting memory
> because its firmware crashed. Which really is what tainting the kernel
> is about.

I still see it as a support thing too. But discussing this further is
pointless as I agree that taint does not cover count and that it is
important.


  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 19:59                     ` Luis Chamberlain
@ 2020-05-18 20:07                       ` Johannes Berg
  2020-05-18 21:18                         ` Luis Chamberlain
  0 siblings, 1 reply; 51+ messages in thread
From: Johannes Berg @ 2020-05-18 20:07 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek,
	Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter,
	will, mchehab+samsung, Kalle Valo, David S. Miller,
	Network Development, LKML, linux-wireless, ath10k

On Mon, 2020-05-18 at 19:59 +0000, Luis Chamberlain wrote:

> > Err, no. Those two are most definitely related. Have you looked at (most
> > or some or whatever) staging drivers recently? Those contain all kinds
> > of garbage that might do whatever with your kernel.
> 
> No, I stay away :)

:)

> > That's all fine, I just don't think it's appropriate to pretend that
> > your kernel is now 'tainted' (think about the meaning of that word) when
> > the firmware of some random device crashed.
> 
> If the firmware crash *does* require driver remove / addition again,
> or a reboot, would you think that this is a situation that merits a taint?

Not really. In my experience, that's more likely a hardware issue (card
not properly seated, for example) that a bus reset happens to "fix".

> > It's pretty clear, but even then, first of all I doubt this is the case
> > for many of the places that you've sprinkled the annotation on,
> 
> We can remove it, for this driver I can vouch for its location as it did
> reach a state where I required a reboot. And its not the first time this
> has happened. This got me thinking about the bigger picture of the lack
> of proper way to address these cases in the kernel, and how the user is
> left dumbfounded.

Fair, so the driver is still broken wrt. recovery here. I still don't
think that's a situation where e.g. the system should say "hey you have
a taint here, if your graphics go bad now you should not report that
bug" (which is effectively what the single taint bit does).

> > and secondly it actually hides useful information.
> 
> What is it hiding?

Most importantly, which device crashed. Secondarily I'd say how many
times (*).

The information "firmware crashed" is really only useful in relation to
the device. If your graphics firmware crashed, yeah, well, you probably
won't even see this. If your USB wifi firmware crashed? Not really
interesting, you'll anyway just unplug. In fact it's very hard for a USB
driver (short of arbitrary memory corruption) to significantly mess up
the system.

johannes

(*) though if it crashed only once, was that because it was wedged
enough to be unusable afterwards, or because everything was fine?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 20:07                       ` Johannes Berg
@ 2020-05-18 21:18                         ` Luis Chamberlain
  0 siblings, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-18 21:18 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek,
	Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter,
	will, mchehab+samsung, Kalle Valo, David S. Miller,
	Network Development, LKML, linux-wireless, ath10k

On Mon, May 18, 2020 at 10:07:49PM +0200, Johannes Berg wrote:
> On Mon, 2020-05-18 at 19:59 +0000, Luis Chamberlain wrote:
> 
> > > Err, no. Those two are most definitely related. Have you looked at (most
> > > or some or whatever) staging drivers recently? Those contain all kinds
> > > of garbage that might do whatever with your kernel.
> > 
> > No, I stay away :)
> 
> :)
> 
> > > That's all fine, I just don't think it's appropriate to pretend that
> > > your kernel is now 'tainted' (think about the meaning of that word) when
> > > the firmware of some random device crashed.
> > 
> > If the firmware crash *does* require driver remove / addition again,
> > or a reboot, would you think that this is a situation that merits a taint?
> 
> Not really. In my experience, that's more likely a hardware issue (card
> not properly seated, for example) that a bus reset happens to "fix".
> 
> > > It's pretty clear, but even then, first of all I doubt this is the case
> > > for many of the places that you've sprinkled the annotation on,
> > 
> > We can remove it, for this driver I can vouch for its location as it did
> > reach a state where I required a reboot. And its not the first time this
> > has happened. This got me thinking about the bigger picture of the lack
> > of proper way to address these cases in the kernel, and how the user is
> > left dumbfounded.
> 
> Fair, so the driver is still broken wrt. recovery here. I still don't
> think that's a situation where e.g. the system should say "hey you have
> a taint here, if your graphics go bad now you should not report that
> bug" (which is effectively what the single taint bit does).

But again, let's think about the generic type of issue, and the
unexpected type of state that can be reached. The circumstance here
*does* lead to a case which is not recoverable. Now, consider how
many cases in the kernel where similar situations can happen and leave
the device or driver in a non-functional state.

> > > and secondly it actually hides useful information.
> > 
> > What is it hiding?
> 
> Most importantly, which device crashed. Secondarily I'd say how many
> times (*).

The device is implied by the module, the taint is applied to both.
If you had multiple devices, however, yes, it would not be possible
to distinguish from the taint which exact device it happened on.

So the only thing *generic* which would be left out is count.

> The information "firmware crashed" is really only useful in relation to
> the device.

If you have to reboot to get a functional network again then the device
is quite useless for many people, regardless of which device that
happened on.

But from a support perspective a sysfs interface which provides a tiny
bit more generic information indeed provides more value than a taint.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 19:25                   ` Johannes Berg
  2020-05-18 19:59                     ` Luis Chamberlain
@ 2020-05-18 20:28                     ` Jakub Kicinski
  2020-05-18 20:29                       ` Johannes Berg
  1 sibling, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2020-05-18 20:28 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k

On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote:
> It's pretty clear, but even then, first of all I doubt this is the case
> for many of the places that you've sprinkled the annotation on, and
> secondly it actually hides useful information.
> 
> Regardless of the support issue, I think this hiding of information is
> also problematic.
> 
> I really think we'd all be better off if you just made a sysfs file (I
> mistyped debugfs in some other email, sorry, apparently you didn't see
> the correction in time) that listed which device(s) crashed and how many
> times. That would actually be useful. Because honestly, if a random
> device crashed for some random reason, that's pretty much a non-event.
> If it keeps happening, then we might even want to know about it.

Johannes - have you seen devlink health? I think we should just use
that interface, since it supports all the things you're requesting,
rather than duplicate it in sysfs.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 20:28                     ` Jakub Kicinski
@ 2020-05-18 20:29                       ` Johannes Berg
  2020-05-18 20:35                         ` Jakub Kicinski
  0 siblings, 1 reply; 51+ messages in thread
From: Johannes Berg @ 2020-05-18 20:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k

On Mon, 2020-05-18 at 13:28 -0700, Jakub Kicinski wrote:
> On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote:
> > It's pretty clear, but even then, first of all I doubt this is the case
> > for many of the places that you've sprinkled the annotation on, and
> > secondly it actually hides useful information.
> > 
> > Regardless of the support issue, I think this hiding of information is
> > also problematic.
> > 
> > I really think we'd all be better off if you just made a sysfs file (I
> > mistyped debugfs in some other email, sorry, apparently you didn't see
> > the correction in time) that listed which device(s) crashed and how many
> > times. That would actually be useful. Because honestly, if a random
> > device crashed for some random reason, that's pretty much a non-event.
> > If it keeps happening, then we might even want to know about it.
> 
> Johannes - have you seen devlink health? I think we should just use
> that interface, since it supports all the things you're requesting,
> rather than duplicate it in sysfs.

I haven't, and I'm glad to hear that's there, sounds good!

I suspect that Luis wants something more generic though, that isn't just
applicable to netdevices, unless devlink grew some kind of non-netdev
stuff while I wasn't looking? :)

johannes


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 20:29                       ` Johannes Berg
@ 2020-05-18 20:35                         ` Jakub Kicinski
  2020-05-18 20:41                           ` Johannes Berg
  0 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2020-05-18 20:35 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k

On Mon, 18 May 2020 22:29:53 +0200 Johannes Berg wrote:
> On Mon, 2020-05-18 at 13:28 -0700, Jakub Kicinski wrote:
> > On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote:  
> > > It's pretty clear, but even then, first of all I doubt this is the case
> > > for many of the places that you've sprinkled the annotation on, and
> > > secondly it actually hides useful information.
> > > 
> > > Regardless of the support issue, I think this hiding of information is
> > > also problematic.
> > > 
> > > I really think we'd all be better off if you just made a sysfs file (I
> > > mistyped debugfs in some other email, sorry, apparently you didn't see
> > > the correction in time) that listed which device(s) crashed and how many
> > > times. That would actually be useful. Because honestly, if a random
> > > device crashed for some random reason, that's pretty much a non-event.
> > > If it keeps happening, then we might even want to know about it.  
> > 
> > Johannes - have you seen devlink health? I think we should just use
> > that interface, since it supports all the things you're requesting,
> > rather than duplicate it in sysfs.  
> 
> I haven't, and I'm glad to hear that's there, sounds good!
> 
> I suspect that Luis wants something more generic though, that isn't just
> applicable to netdevices, unless devlink grew some kind of non-netdev
> stuff while I wasn't looking? :)

It's intended to be a generic netlink channel for configuring devices.

All the firmware-related interfaces have no dependencies on netdevs,
in fact that's one of the reasons we moved to devlink - we don't want
to hold rtnl lock just for talking to device firmware.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 20:35                         ` Jakub Kicinski
@ 2020-05-18 20:41                           ` Johannes Berg
  2020-05-18 20:46                             ` Jakub Kicinski
  0 siblings, 1 reply; 51+ messages in thread
From: Johannes Berg @ 2020-05-18 20:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k

On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote:
> 
> It's intended to be a generic netlink channel for configuring devices.
> 
> All the firmware-related interfaces have no dependencies on netdevs,
> in fact that's one of the reasons we moved to devlink - we don't want
> to hold rtnl lock just for talking to device firmware.

Sounds good :)

So I guess Luis just has to add some way in devlink to hook up devlink
health in a simple way to drivers, perhaps? I mean, many drivers won't
really want to use devlink for anything else, so I guess it should be as
simple as the API that Luis proposed ("firmware crashed for this struct
device"), if nothing more interesting is done with devlink?

Dunno. But anyway sounds like it should somehow integrate there rather
than the way this patchset proposed?

johannes


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 20:41                           ` Johannes Berg
@ 2020-05-18 20:46                             ` Jakub Kicinski
  2020-05-18 21:22                               ` Luis Chamberlain
  0 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2020-05-18 20:46 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k

On Mon, 18 May 2020 22:41:48 +0200 Johannes Berg wrote:
> On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote:
> > It's intended to be a generic netlink channel for configuring devices.
> > 
> > All the firmware-related interfaces have no dependencies on netdevs,
> > in fact that's one of the reasons we moved to devlink - we don't want
> > to hold rtnl lock just for talking to device firmware.  
> 
> Sounds good :)
> 
> So I guess Luis just has to add some way in devlink to hook up devlink
> health in a simple way to drivers, perhaps? I mean, many drivers won't
> really want to use devlink for anything else, so I guess it should be as
> simple as the API that Luis proposed ("firmware crashed for this struct
> device"), if nothing more interesting is done with devlink?
> 
> Dunno. But anyway sounds like it should somehow integrate there rather
> than the way this patchset proposed?

Right, that'd be great. Simple API to register a devlink instance with
whatever number of reporters the device would need. I'm happy to help.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 20:46                             ` Jakub Kicinski
@ 2020-05-18 21:22                               ` Luis Chamberlain
  2020-05-18 22:16                                 ` Jakub Kicinski
  0 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-18 21:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Johannes Berg, Steve deRosier, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k

On Mon, May 18, 2020 at 01:46:43PM -0700, Jakub Kicinski wrote:
> On Mon, 18 May 2020 22:41:48 +0200 Johannes Berg wrote:
> > On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote:
> > > It's intended to be a generic netlink channel for configuring devices.
> > > 
> > > All the firmware-related interfaces have no dependencies on netdevs,
> > > in fact that's one of the reasons we moved to devlink - we don't want
> > > to hold rtnl lock just for talking to device firmware.  
> > 
> > Sounds good :)
> > 
> > So I guess Luis just has to add some way in devlink to hook up devlink
> > health in a simple way to drivers, perhaps? I mean, many drivers won't
> > really want to use devlink for anything else, so I guess it should be as
> > simple as the API that Luis proposed ("firmware crashed for this struct
> > device"), if nothing more interesting is done with devlink?
> > 
> > Dunno. But anyway sounds like it should somehow integrate there rather
> > than the way this patchset proposed?
> 
> Right, that'd be great. Simple API to register a devlink instance with
> whatever number of reporters the device would need. I'm happy to help.

Indeed my issue with devlink is that it did not seem generic enough for
all devices which use firmware and for which firmware can crash. Support
should not have to be "add devlink support" + "now use this new hook",
but rather a very lighweight devlink_crash(device) call we can sprinkly
with only the device as a functional requirement.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 21:22                               ` Luis Chamberlain
@ 2020-05-18 22:16                                 ` Jakub Kicinski
  2020-05-19  1:05                                   ` Luis Chamberlain
  0 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2020-05-18 22:16 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Johannes Berg, Steve deRosier, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k

On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote:
> Indeed my issue with devlink is that it did not seem generic enough for
> all devices which use firmware and for which firmware can crash. Support
> should not have to be "add devlink support" + "now use this new hook",
> but rather a very lighweight devlink_crash(device) call we can sprinkly
> with only the device as a functional requirement.

We can provide a lightweight devlink_crash(device) which only generates
the notification, without the need to register the health reporter or a
devlink instance upfront. But then we loose the ability to control the
recovery, count errors, etc. So I'd think that's not the direction we
want to go in.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
  2020-05-18 22:16                                 ` Jakub Kicinski
@ 2020-05-19  1:05                                   ` Luis Chamberlain
  2020-05-19 21:15                                     ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski
  2020-05-19 21:15                                     ` [RFC 2/2] i2400m: use devlink health reporter Jakub Kicinski
  0 siblings, 2 replies; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-19  1:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Johannes Berg, Steve deRosier, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k

On Mon, May 18, 2020 at 03:16:45PM -0700, Jakub Kicinski wrote:
> On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote:
> > Indeed my issue with devlink is that it did not seem generic enough for
> > all devices which use firmware and for which firmware can crash. Support
> > should not have to be "add devlink support" + "now use this new hook",
> > but rather a very lighweight devlink_crash(device) call we can sprinkly
> > with only the device as a functional requirement.
> 
> We can provide a lightweight devlink_crash(device) which only generates
> the notification, without the need to register the health reporter or a
> devlink instance upfront. But then we loose the ability to control the
> recovery, count errors, etc. So I'd think that's not the direction we
> want to go in.

Care to show me what the diff stat for a non devlink driver would look
like? Just keep in mind this taint is 1 line addition. Granted, if we
can use SmPL grammar to automate addition of an initial framework to a
driver that'd be great, but since the device addition is subsystem
specific (device_add() and friends), I don't suspect this will be easily
automated.

   Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-19  1:05                                   ` Luis Chamberlain
@ 2020-05-19 21:15                                     ` Jakub Kicinski
  2020-05-22  5:20                                       ` Luis Chamberlain
  2020-05-19 21:15                                     ` [RFC 2/2] i2400m: use devlink health reporter Jakub Kicinski
  1 sibling, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2020-05-19 21:15 UTC (permalink / raw)
  To: mcgrof
  Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k, jiri, briannorris, Jakub Kicinski

Add infra for creating devlink instances for a device to report
fw crashes. This patch expects the devlink instance to be registered
at probe time. I belive to be the cleanest. We can also add a devm
version of the helpers, so that we don't have to do the clean up.
Or we can go even further and register the devlink instance only
once error has happened (for the first time, then we can just
find out if already registered by traversing the list like we
do here).

With the patch applied and a sample driver converted we get:

$ devlink dev
pci/0000:07:00.0

Then monitor for errors:

$ devlink mon health
[health,status] pci/0000:07:00.0:
  reporter fw
    state error error 1 recover 0
[health,status] pci/0000:07:00.0:
  reporter fw
    state error error 2 recover 0

These are the events I triggered on purpose. One can also inspect
the health of all devices capable of reporting fw errors:

$ devlink health
pci/0000:07:00.0:
  reporter fw
    state error error 7 recover 0

Obviously drivers may upgrade to the full devlink health API
which includes state dump, state dump auto-collect and automatic
error recovery control.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/linux/devlink.h               |  11 +++
 net/core/Makefile                     |   2 +-
 net/core/devlink_simple_fw_reporter.c | 101 ++++++++++++++++++++++++++
 3 files changed, 113 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/devlink.h
 create mode 100644 net/core/devlink_simple_fw_reporter.c

diff --git a/include/linux/devlink.h b/include/linux/devlink.h
new file mode 100644
index 000000000000..2b73987eefca
--- /dev/null
+++ b/include/linux/devlink.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_DEVLINK_H_
+#define _LINUX_DEVLINK_H_
+
+struct device;
+
+void devlink_simple_fw_reporter_prepare(struct device *dev);
+void devlink_simple_fw_reporter_cleanup(struct device *dev);
+void devlink_simple_fw_reporter_report_crash(struct device *dev);
+
+#endif
diff --git a/net/core/Makefile b/net/core/Makefile
index 3e2c378e5f31..6f1513781c17 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
 obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o
 obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
-obj-$(CONFIG_NET_DEVLINK) += devlink.o
+obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o
 obj-$(CONFIG_GRO_CELLS) += gro_cells.o
 obj-$(CONFIG_FAILOVER) += failover.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
diff --git a/net/core/devlink_simple_fw_reporter.c b/net/core/devlink_simple_fw_reporter.c
new file mode 100644
index 000000000000..48dde9123c3c
--- /dev/null
+++ b/net/core/devlink_simple_fw_reporter.c
@@ -0,0 +1,101 @@
+#include <linux/devlink.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <net/devlink.h>
+
+struct devlink_simple_fw_reporter {
+	struct list_head list;
+	struct devlink_health_reporter *reporter;
+};
+
+
+static LIST_HEAD(devlink_simple_fw_reporters);
+static DEFINE_MUTEX(devlink_simple_fw_reporters_mutex);
+
+static const struct devlink_health_reporter_ops simple_devlink_health = {
+	.name = "fw",
+};
+
+static const struct devlink_ops simple_devlink_ops = {
+};
+
+static struct devlink_simple_fw_reporter *
+devlink_simple_fw_reporter_find_for_dev(struct device *dev)
+{
+	struct devlink_simple_fw_reporter *simple_devlink, *ret = NULL;
+	struct devlink *devlink;
+
+	mutex_lock(&devlink_simple_fw_reporters_mutex);
+	list_for_each_entry(simple_devlink, &devlink_simple_fw_reporters,
+			    list) {
+		devlink = priv_to_devlink(simple_devlink);
+		if (devlink->dev == dev) {
+			ret = simple_devlink;
+			break;
+		}
+	}
+	mutex_unlock(&devlink_simple_fw_reporters_mutex);
+
+	return ret;
+}
+
+void devlink_simple_fw_reporter_report_crash(struct device *dev)
+{
+	struct devlink_simple_fw_reporter *simple_devlink;
+
+	simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev);
+	if (!simple_devlink)
+		return;
+
+	devlink_health_report(simple_devlink->reporter, "firmware crash", NULL);
+}
+EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_report_crash);
+
+void devlink_simple_fw_reporter_prepare(struct device *dev)
+{
+	struct devlink_simple_fw_reporter *simple_devlink;
+	struct devlink *devlink;
+
+	devlink = devlink_alloc(&simple_devlink_ops,
+				sizeof(struct devlink_simple_fw_reporter));
+	if (!devlink)
+		return;
+
+	if (devlink_register(devlink, dev))
+		goto err_free;
+
+	simple_devlink = devlink_priv(devlink);
+	simple_devlink->reporter =
+		devlink_health_reporter_create(devlink, &simple_devlink_health,
+					       0, NULL);
+	if (IS_ERR(simple_devlink->reporter))
+		goto err_unregister;
+
+	mutex_lock(&devlink_simple_fw_reporters_mutex);
+	list_add_tail(&simple_devlink->list, &devlink_simple_fw_reporters);
+	mutex_unlock(&devlink_simple_fw_reporters_mutex);
+
+	return;
+
+err_unregister:
+	devlink_unregister(devlink);
+err_free:
+	devlink_free(devlink);
+}
+EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_prepare);
+
+void devlink_simple_fw_reporter_cleanup(struct device *dev)
+{
+	struct devlink_simple_fw_reporter *simple_devlink;
+	struct devlink *devlink;
+
+	simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev);
+	if (!simple_devlink)
+		return;
+
+	devlink = priv_to_devlink(simple_devlink);
+	devlink_health_reporter_destroy(simple_devlink->reporter);
+	devlink_unregister(devlink);
+	devlink_free(devlink);
+}
+EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_cleanup);
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-19 21:15                                     ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski
@ 2020-05-22  5:20                                       ` Luis Chamberlain
  2020-05-22 17:17                                         ` Jakub Kicinski
  0 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-22  5:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k, jiri, briannorris

On Tue, May 19, 2020 at 02:15:30PM -0700, Jakub Kicinski wrote:
> Add infra for creating devlink instances for a device to report

Thanks for doing this series as a PoC, counter to the module_firmware_crash()
which I proposed to taint the kernel with a firmware crash flag to the kernel
and module.

For those not famliar about devlink:

https://lwn.net/Articles/677967/
https://www.kernel.org/doc/html/latest/networking/devlink/index.html

The github page also is now 404 as Jiri merged that stuff into iproute2:

git://git.kernel.org/pub/scm/network/iproute2/iproute2.git

> fw crashes. This patch expects the devlink instance to be registered
> at probe time. I belive to be the cleanest. We can also add a devm
> version of the helpers, so that we don't have to do the clean up.
> Or we can go even further and register the devlink instance only
> once error has happened (for the first time, then we can just
> find out if already registered by traversing the list like we
> do here).
> 
> With the patch applied and a sample driver converted we get:
> 
> $ devlink dev
> pci/0000:07:00.0
> 
> Then monitor for errors:
> 
> $ devlink mon health
> [health,status] pci/0000:07:00.0:
>   reporter fw
>     state error error 1 recover 0
> [health,status] pci/0000:07:00.0:
>   reporter fw
>     state error error 2 recover 0
> 
> These are the events I triggered on purpose. One can also inspect
> the health of all devices capable of reporting fw errors:
> 
> $ devlink health
> pci/0000:07:00.0:
>   reporter fw
>     state error error 7 recover 0
> 
> Obviously drivers may upgrade to the full devlink health API
> which includes state dump, state dump auto-collect and automatic
> error recovery control.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  include/linux/devlink.h               |  11 +++
>  net/core/Makefile                     |   2 +-
>  net/core/devlink_simple_fw_reporter.c | 101 ++++++++++++++++++++++++++
>  3 files changed, 113 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/devlink.h
>  create mode 100644 net/core/devlink_simple_fw_reporter.c
> 
> diff --git a/include/linux/devlink.h b/include/linux/devlink.h
> new file mode 100644
> index 000000000000..2b73987eefca
> --- /dev/null
> +++ b/include/linux/devlink.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +#ifndef _LINUX_DEVLINK_H_
> +#define _LINUX_DEVLINK_H_
> +
> +struct device;
> +
> +void devlink_simple_fw_reporter_prepare(struct device *dev);
> +void devlink_simple_fw_reporter_cleanup(struct device *dev);
> +void devlink_simple_fw_reporter_report_crash(struct device *dev);
> +
> +#endif
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 3e2c378e5f31..6f1513781c17 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
>  obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o
>  obj-$(CONFIG_DST_CACHE) += dst_cache.o
>  obj-$(CONFIG_HWBM) += hwbm.o
> -obj-$(CONFIG_NET_DEVLINK) += devlink.o
> +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o

This was looking super sexy up to here. This is networking specific.
We want something generic for *anything* that requests firmware.

I'm afraid this won't work for something generic. I don't think its
throw-away work though, the idea to provide a generic interface to
dump firmware through netlink might be nice for networking, or other
things.

But I have a feeling we'll want something still more generic than this.

So networking may want to be aware that a firmware crash happened as
part of this network device health thing, but firmware crashing is a
generic thing.

I have now extended my patch set to include uvents and I am more set on
that we need the taint now more than ever.

  Luis

>  obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>  obj-$(CONFIG_FAILOVER) += failover.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
> diff --git a/net/core/devlink_simple_fw_reporter.c b/net/core/devlink_simple_fw_reporter.c
> new file mode 100644
> index 000000000000..48dde9123c3c
> --- /dev/null
> +++ b/net/core/devlink_simple_fw_reporter.c
> @@ -0,0 +1,101 @@
> +#include <linux/devlink.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <net/devlink.h>
> +
> +struct devlink_simple_fw_reporter {
> +	struct list_head list;
> +	struct devlink_health_reporter *reporter;
> +};
> +
> +
> +static LIST_HEAD(devlink_simple_fw_reporters);
> +static DEFINE_MUTEX(devlink_simple_fw_reporters_mutex);
> +
> +static const struct devlink_health_reporter_ops simple_devlink_health = {
> +	.name = "fw",
> +};
> +
> +static const struct devlink_ops simple_devlink_ops = {
> +};
> +
> +static struct devlink_simple_fw_reporter *
> +devlink_simple_fw_reporter_find_for_dev(struct device *dev)
> +{
> +	struct devlink_simple_fw_reporter *simple_devlink, *ret = NULL;
> +	struct devlink *devlink;
> +
> +	mutex_lock(&devlink_simple_fw_reporters_mutex);
> +	list_for_each_entry(simple_devlink, &devlink_simple_fw_reporters,
> +			    list) {
> +		devlink = priv_to_devlink(simple_devlink);
> +		if (devlink->dev == dev) {
> +			ret = simple_devlink;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&devlink_simple_fw_reporters_mutex);
> +
> +	return ret;
> +}
> +
> +void devlink_simple_fw_reporter_report_crash(struct device *dev)
> +{
> +	struct devlink_simple_fw_reporter *simple_devlink;
> +
> +	simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev);
> +	if (!simple_devlink)
> +		return;
> +
> +	devlink_health_report(simple_devlink->reporter, "firmware crash", NULL);
> +}
> +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_report_crash);
> +
> +void devlink_simple_fw_reporter_prepare(struct device *dev)
> +{
> +	struct devlink_simple_fw_reporter *simple_devlink;
> +	struct devlink *devlink;
> +
> +	devlink = devlink_alloc(&simple_devlink_ops,
> +				sizeof(struct devlink_simple_fw_reporter));
> +	if (!devlink)
> +		return;
> +
> +	if (devlink_register(devlink, dev))
> +		goto err_free;
> +
> +	simple_devlink = devlink_priv(devlink);
> +	simple_devlink->reporter =
> +		devlink_health_reporter_create(devlink, &simple_devlink_health,
> +					       0, NULL);
> +	if (IS_ERR(simple_devlink->reporter))
> +		goto err_unregister;
> +
> +	mutex_lock(&devlink_simple_fw_reporters_mutex);
> +	list_add_tail(&simple_devlink->list, &devlink_simple_fw_reporters);
> +	mutex_unlock(&devlink_simple_fw_reporters_mutex);
> +
> +	return;
> +
> +err_unregister:
> +	devlink_unregister(devlink);
> +err_free:
> +	devlink_free(devlink);
> +}
> +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_prepare);
> +
> +void devlink_simple_fw_reporter_cleanup(struct device *dev)
> +{
> +	struct devlink_simple_fw_reporter *simple_devlink;
> +	struct devlink *devlink;
> +
> +	simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev);
> +	if (!simple_devlink)
> +		return;
> +
> +	devlink = priv_to_devlink(simple_devlink);
> +	devlink_health_reporter_destroy(simple_devlink->reporter);
> +	devlink_unregister(devlink);
> +	devlink_free(devlink);
> +}
> +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_cleanup);
> -- 
> 2.25.4
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-22  5:20                                       ` Luis Chamberlain
@ 2020-05-22 17:17                                         ` Jakub Kicinski
  2020-05-22 20:46                                           ` Johannes Berg
  2020-05-22 21:49                                           ` Luis Chamberlain
  0 siblings, 2 replies; 51+ messages in thread
From: Jakub Kicinski @ 2020-05-22 17:17 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k, jiri, briannorris

On Fri, 22 May 2020 05:20:46 +0000 Luis Chamberlain wrote:
> > diff --git a/net/core/Makefile b/net/core/Makefile
> > index 3e2c378e5f31..6f1513781c17 100644
> > --- a/net/core/Makefile
> > +++ b/net/core/Makefile
> > @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
> >  obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o
> >  obj-$(CONFIG_DST_CACHE) += dst_cache.o
> >  obj-$(CONFIG_HWBM) += hwbm.o
> > -obj-$(CONFIG_NET_DEVLINK) += devlink.o
> > +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o  
> 
> This was looking super sexy up to here. This is networking specific.
> We want something generic for *anything* that requests firmware.

You can't be serious. It's network specific because of how the Kconfig
is named?

Working for a company operating large data centers I would strongly
prefer if we didn't have ten different ways of reporting firmware
problems in the fleet.

> I'm afraid this won't work for something generic. I don't think its
> throw-away work though, the idea to provide a generic interface to
> dump firmware through netlink might be nice for networking, or other
> things.
> 
> But I have a feeling we'll want something still more generic than this.

Please be specific. Saying generic a lot is not helpful. The code (as
you can see in this patch) is in no way network specific. Or are you
saying there are machines out there running without netlink sockets?

> So networking may want to be aware that a firmware crash happened as
> part of this network device health thing, but firmware crashing is a
> generic thing.
> 
> I have now extended my patch set to include uvents and I am more set on
> that we need the taint now more than ever.

Please expect my nack if you're trying to add this to networking
drivers.

The irony is you have a problem with a networking device and all the
devices your initial set touched are networking. Two of the drivers 
you touched either have or will soon have devlink health reporters
implemented.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-22 17:17                                         ` Jakub Kicinski
@ 2020-05-22 20:46                                           ` Johannes Berg
  2020-05-22 21:51                                             ` Luis Chamberlain
  2020-05-25 20:57                                             ` Jakub Kicinski
  2020-05-22 21:49                                           ` Luis Chamberlain
  1 sibling, 2 replies; 51+ messages in thread
From: Johannes Berg @ 2020-05-22 20:46 UTC (permalink / raw)
  To: Jakub Kicinski, Luis Chamberlain
  Cc: derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai,
	dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad,
	andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k, jiri, briannorris

On Fri, 2020-05-22 at 10:17 -0700, Jakub Kicinski wrote:

> > > --- a/net/core/Makefile
> > > +++ b/net/core/Makefile
> > > @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
> > >  obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o
> > >  obj-$(CONFIG_DST_CACHE) += dst_cache.o
> > >  obj-$(CONFIG_HWBM) += hwbm.o
> > > -obj-$(CONFIG_NET_DEVLINK) += devlink.o
> > > +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o  
> > 
> > This was looking super sexy up to here. This is networking specific.
> > We want something generic for *anything* that requests firmware.
> 
> You can't be serious. It's network specific because of how the Kconfig
> is named?

Wait, yeah, what?

> Working for a company operating large data centers I would strongly
> prefer if we didn't have ten different ways of reporting firmware
> problems in the fleet.

Agree. I don't actually operate anything, but still ...

Thinking about this - maybe there's a way to still combine devcoredump
and devlink somehow?

Or (optionally) make devlink trigger devcoredump while userspace
migrates?

> > So networking may want to be aware that a firmware crash happened as
> > part of this network device health thing, but firmware crashing is a
> > generic thing.
> > 
> > I have now extended my patch set to include uvents and I am more set on
> > that we need the taint now more than ever.

FWIW, I still completely disagree on that taint. You (Luis) obviously
have been running into a bug in that driver, I doubt the firmware
actually managed to wedge the hardware.

But even if it did, that's still not really a kernel taint. The kernel
itself isn't in any way affected by this.

Yes, the system is in a weird state now. But that's *not* equivalent to
"kernel tainted".

> The irony is you have a problem with a networking device and all the
> devices your initial set touched are networking. Two of the drivers 
> you touched either have or will soon have devlink health reporters
> implemented.

Like I said above, do you think it'd be feasible to make a devcoredump
out of devlink health reports? And can the report be in a way that we
control the file format, or are there limits? I guess I should read the
code to find out, but I figure you probably just know. But feel free to
tell me to read it :)

The reason I'm asking is that it's starting to sound like we really
ought to be implementing devlink, but we've got a bunch of
infrastructure that uses the devcoredump, and it'll take time
(significantly so) to change all that...

johannes

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-22 20:46                                           ` Johannes Berg
@ 2020-05-22 21:51                                             ` Luis Chamberlain
  2020-05-22 23:23                                               ` Steve deRosier
  2020-05-25 20:57                                             ` Jakub Kicinski
  1 sibling, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-22 21:51 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Jakub Kicinski, derosier, greearb, jeyu, akpm, arnd, rostedt,
	mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek,
	tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k, jiri, briannorris

On Fri, May 22, 2020 at 10:46:07PM +0200, Johannes Berg wrote:
> FWIW, I still completely disagree on that taint. You (Luis) obviously
> have been running into a bug in that driver, I doubt the firmware
> actually managed to wedge the hardware.

This hasn't happened just once, its happed many times sporadically now,
once a week or two weeks I'd say. And the system isn't being moved
around.

> But even if it did, that's still not really a kernel taint. The kernel
> itself isn't in any way affected by this.

Of course it is, a full reboot is required.

> Yes, the system is in a weird state now. But that's *not* equivalent to
> "kernel tainted".

Requiring a full reboot is a dire situation to be in, and loosing
connectivity to the point this is not recoverable likewise.

You guys are making out a taint to be the end of the world. We have a
taint even for a kernel warning, and as others have mentioned mac80211
already produces these.

What exactly is the opposition to a taint to clarify that a device
firmware has crashed and your system requires a full reboot?

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-22 21:51                                             ` Luis Chamberlain
@ 2020-05-22 23:23                                               ` Steve deRosier
  2020-05-22 23:44                                                 ` Luis Chamberlain
  2020-05-25  9:07                                                 ` Andy Shevchenko
  0 siblings, 2 replies; 51+ messages in thread
From: Steve deRosier @ 2020-05-22 23:23 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Johannes Berg, Jakub Kicinski, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, Kees Cook,
	Daniel Vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k, jiri, briannorris

On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On Fri, May 22, 2020 at 10:46:07PM +0200, Johannes Berg wrote:
> > FWIW, I still completely disagree on that taint. You (Luis) obviously
> > have been running into a bug in that driver, I doubt the firmware
> > actually managed to wedge the hardware.
>
> This hasn't happened just once, its happed many times sporadically now,
> once a week or two weeks I'd say. And the system isn't being moved
> around.
>
> > But even if it did, that's still not really a kernel taint. The kernel
> > itself isn't in any way affected by this.
>
> Of course it is, a full reboot is required.
>
> > Yes, the system is in a weird state now. But that's *not* equivalent to
> > "kernel tainted".
>
> Requiring a full reboot is a dire situation to be in, and loosing
> connectivity to the point this is not recoverable likewise.
>
> You guys are making out a taint to be the end of the world. We have a
> taint even for a kernel warning, and as others have mentioned mac80211
> already produces these.
>

I had to go RTFM re: kernel taints because it has been a very long
time since I looked at them. It had always seemed to me that most were
caused by "kernel-unfriendly" user actions.  The most famous of course
is loading proprietary modules, out-of-tree modules, forced module
loads, etc...  Honestly, I had forgotten the large variety of uses of
the taint flags. For anyone who hasn't looked at taints recently, I
recommend: https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html

In light of this I don't object to setting a taint on this anymore.
I'm a little uneasy, but I've softened on it now, and now I feel it
depends on implementation.

Specifically, I don't think we should set a taint flag when a driver
easily handles a routine firmware crash and is confident that things
have come up just fine again. In other words, triggering the taint in
every driver module where it spits out a log comment that it had a
firmware crash and had to recover seems too much. Sure, firmware
shouldn't crash, sure it should be open source so we can fix it,
whatever... those sort of wishful comments simply ignore reality and
our ability to affect effective change. A lot of WiFi firmware crashes
and for well-known cases the drivers handle them well. And in some
cases, not so well and that should be a place the driver should detect
and thus raise a red flag.  If a WiFi firmware crash can bring down
the kernel, there's either a major driver bug or some very funky
hardware crap going on. That sort of thing we should be able to
detect, mark with a taint (or something), and fix if within our sphere
of influence. I guess what it comes down to me is how aggressive we
are about setting the flag.

I would like there to be a single solution, or a minimized set
depending on what makes sense for the requirements. I haven't had time
to look into the alternatives mentioned here so I don't have an
informed opinion about the solution. I do think Luis is trying to
solve a real problem though. Can we look at this from the point of
view of what are the requirements?  What is it we're trying to solve?

I _think_ that the goal of Luis's original proposal is to report up to
the user, at some future point when the user is interested (because
something super drastic just occured, but long after the fw crash),
that there was a firmware crash without the user having to grep
through all logs on the machine. And then if the user sees that flag
and suspects it, then they can bother to find it in the logs or do
more drastic debugging steps like finding the fw crash in the log and
pulling firmware crash dumps, etc.

I think the various alternate solutions are great but perhaps solving
a superset of features (like adding in user-space notifications etc)?
Perhaps different people on these related threads are trying to solve
different problems?

- Steve

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-22 23:23                                               ` Steve deRosier
@ 2020-05-22 23:44                                                 ` Luis Chamberlain
  2020-05-25  9:07                                                 ` Andy Shevchenko
  1 sibling, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-22 23:44 UTC (permalink / raw)
  To: Steve deRosier
  Cc: Johannes Berg, Jakub Kicinski, Ben Greear, jeyu, akpm, arnd,
	rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli,
	pmladek, Takashi Iwai, schlad, andriy.shevchenko, Kees Cook,
	Daniel Vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k, jiri, briannorris

On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote:
> Specifically, I don't think we should set a taint flag when a driver
> easily handles a routine firmware crash and is confident that things
> have come up just fine again. In other words, triggering the taint in
> every driver module where it spits out a log comment that it had a
> firmware crash and had to recover seems too much. Sure, firmware
> shouldn't crash, sure it should be open source so we can fix it,
> whatever... those sort of wishful comments simply ignore reality and
> our ability to affect effective change. A lot of WiFi firmware crashes
> and for well-known cases the drivers handle them well. And in some
> cases, not so well and that should be a place the driver should detect
> and thus raise a red flag.  If a WiFi firmware crash can bring down
> the kernel, there's either a major driver bug or some very funky
> hardware crap going on. That sort of thing we should be able to
> detect, mark with a taint (or something), and fix if within our sphere
> of influence. I guess what it comes down to me is how aggressive we
> are about setting the flag.

Exactly the crux of the issue.

I hope that by now we should all be in agreement that at least a
firmware crash requiring a reboot is something we should record and
inform the user of. A taint seems like a reasonable standard practice
for these sorts of things.

> I would like there to be a single solution, or a minimized set
> depending on what makes sense for the requirements. I haven't had time
> to look into the alternatives mentioned here so I don't have an
> informed opinion about the solution. I do think Luis is trying to
> solve a real problem though. Can we look at this from the point of
> view of what are the requirements?  What is it we're trying to solve?
> 
> I _think_ that the goal of Luis's original proposal is to report up to
> the user, at some future point when the user is interested (because
> something super drastic just occured, but long after the fw crash),
> that there was a firmware crash without the user having to grep
> through all logs on the machine. And then if the user sees that flag
> and suspects it, then they can bother to find it in the logs or do
> more drastic debugging steps like finding the fw crash in the log and
> pulling firmware crash dumps, etc.

Yes, that's exactly it. Not all users are clueful to inspect logs. I now
have a generic uevent mechanism drafted which sends a uevent for *any*
taint. So that is, it does not even depend on this series. But it
accomplishes the goal of informing the user of taints.

> I think the various alternate solutions are great but perhaps solving
> a superset of features (like adding in user-space notifications etc)?
> Perhaps different people on these related threads are trying to solve
> different problems?

The uevent mechanism I implemented (but not yet posted for review) at
least sends out a smoke signal. I think that if each subsystem wants
to expand on this with dumping facilities that is great too!

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-22 23:23                                               ` Steve deRosier
  2020-05-22 23:44                                                 ` Luis Chamberlain
@ 2020-05-25  9:07                                                 ` Andy Shevchenko
  2020-05-25 17:08                                                   ` Ben Greear
  1 sibling, 1 reply; 51+ messages in thread
From: Andy Shevchenko @ 2020-05-25  9:07 UTC (permalink / raw)
  To: Steve deRosier
  Cc: Luis Chamberlain, Johannes Berg, Jakub Kicinski, Ben Greear,
	jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe,
	peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, Kees Cook,
	Daniel Vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k, jiri, briannorris

On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote:
> On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain <mcgrof@kernel.org> wrote:

> I had to go RTFM re: kernel taints because it has been a very long
> time since I looked at them. It had always seemed to me that most were
> caused by "kernel-unfriendly" user actions.  The most famous of course
> is loading proprietary modules, out-of-tree modules, forced module
> loads, etc...  Honestly, I had forgotten the large variety of uses of
> the taint flags. For anyone who hasn't looked at taints recently, I
> recommend: https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
> 
> In light of this I don't object to setting a taint on this anymore.
> I'm a little uneasy, but I've softened on it now, and now I feel it
> depends on implementation.
> 
> Specifically, I don't think we should set a taint flag when a driver
> easily handles a routine firmware crash and is confident that things
> have come up just fine again. In other words, triggering the taint in
> every driver module where it spits out a log comment that it had a
> firmware crash and had to recover seems too much. Sure, firmware
> shouldn't crash, sure it should be open source so we can fix it,
> whatever...

While it may sound idealistic the firmware for the end-user, and even for mere
kernel developer like me, is a complete blackbox which has more access than
root user in the kernel. We have tons of firmwares and each of them potentially
dangerous beast. As a user I really care about my data and privacy (hacker can
oops a firmware in order to set a specific vector attack). So, tainting kernel
is _a least_ we can do there, the strict rules would be to reboot immediately.

> those sort of wishful comments simply ignore reality and
> our ability to affect effective change.

We can encourage users not to buy cheap crap for the starter.

> A lot of WiFi firmware crashes
> and for well-known cases the drivers handle them well. And in some
> cases, not so well and that should be a place the driver should detect
> and thus raise a red flag.  If a WiFi firmware crash can bring down
> the kernel, there's either a major driver bug or some very funky
> hardware crap going on. That sort of thing we should be able to
> detect, mark with a taint (or something), and fix if within our sphere
> of influence. I guess what it comes down to me is how aggressive we
> are about setting the flag.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-25  9:07                                                 ` Andy Shevchenko
@ 2020-05-25 17:08                                                   ` Ben Greear
  0 siblings, 0 replies; 51+ messages in thread
From: Ben Greear @ 2020-05-25 17:08 UTC (permalink / raw)
  To: Andy Shevchenko, Steve deRosier
  Cc: Luis Chamberlain, Johannes Berg, Jakub Kicinski, jeyu, akpm,
	arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx,
	gpiccoli, pmladek, Takashi Iwai, schlad, Kees Cook,
	Daniel Vetter, will, mchehab+samsung, Kalle Valo,
	David S. Miller, Network Development, LKML, linux-wireless,
	ath10k, jiri, briannorris



On 05/25/2020 02:07 AM, Andy Shevchenko wrote:
> On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote:
>> On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
>> I had to go RTFM re: kernel taints because it has been a very long
>> time since I looked at them. It had always seemed to me that most were
>> caused by "kernel-unfriendly" user actions.  The most famous of course
>> is loading proprietary modules, out-of-tree modules, forced module
>> loads, etc...  Honestly, I had forgotten the large variety of uses of
>> the taint flags. For anyone who hasn't looked at taints recently, I
>> recommend: https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html
>>
>> In light of this I don't object to setting a taint on this anymore.
>> I'm a little uneasy, but I've softened on it now, and now I feel it
>> depends on implementation.
>>
>> Specifically, I don't think we should set a taint flag when a driver
>> easily handles a routine firmware crash and is confident that things
>> have come up just fine again. In other words, triggering the taint in
>> every driver module where it spits out a log comment that it had a
>> firmware crash and had to recover seems too much. Sure, firmware
>> shouldn't crash, sure it should be open source so we can fix it,
>> whatever...
>
> While it may sound idealistic the firmware for the end-user, and even for mere
> kernel developer like me, is a complete blackbox which has more access than
> root user in the kernel. We have tons of firmwares and each of them potentially
> dangerous beast. As a user I really care about my data and privacy (hacker can
> oops a firmware in order to set a specific vector attack). So, tainting kernel
> is _a least_ we can do there, the strict rules would be to reboot immediately.
>
>> those sort of wishful comments simply ignore reality and
>> our ability to affect effective change.
>
> We can encourage users not to buy cheap crap for the starter.

There is no stable wifi firmware for any price.

There is also no obvious feedback from even name-brand NICs like ath10k or AX200
when you report a crash.

That said, at least in my experience with ath10k-ct, the OS normally recovers fine
from firmware crashes.  ath10k already reports full crash reports on udev, so
easy for user-space to notice and report bug reports upstream if it cares to.  Probably
other NICs do the same, and if not, they certainly could.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-22 20:46                                           ` Johannes Berg
  2020-05-22 21:51                                             ` Luis Chamberlain
@ 2020-05-25 20:57                                             ` Jakub Kicinski
  2020-07-30 13:56                                               ` Johannes Berg
  1 sibling, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2020-05-25 20:57 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Luis Chamberlain, derosier, greearb, jeyu, akpm, arnd, rostedt,
	mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek,
	tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k, jiri, briannorris

On Fri, 22 May 2020 22:46:07 +0200 Johannes Berg wrote:
> > The irony is you have a problem with a networking device and all the
> > devices your initial set touched are networking. Two of the drivers 
> > you touched either have or will soon have devlink health reporters
> > implemented.  
> 
> Like I said above, do you think it'd be feasible to make a devcoredump
> out of devlink health reports? And can the report be in a way that we
> control the file format, or are there limits? I guess I should read the
> code to find out, but I figure you probably just know. But feel free to
> tell me to read it :)
> 
> The reason I'm asking is that it's starting to sound like we really
> ought to be implementing devlink, but we've got a bunch of
> infrastructure that uses the devcoredump, and it'll take time
> (significantly so) to change all that...

In devlink world pure FW core dumps are exposed by devlink regions.
An API allowing reading device memory, registers, etc., but also 
creating dumps of memory regions when things go wrong. It should be
a fairly straightforward migration.

Devlink health is more targeted, the dump is supposed to contain only
relevant information, selected and formatted by the driver. When device
misbehaves driver reads the relevant registers and FW state and creates
a formatted state dump. I believe each element of the dump must fit
into a netlink message (but there may be multiple elements, see
devlink_fmsg_prepare_skb()).

We should be able to convert dl-regions dumps and dl-health dumps into
devcoredumps, but since health reporters are supposed to be more
targeted there's usually multiple of them per device.

Conversely devcoredumps can be trivially exposed as dl-region dumps,
but I believe dl-health would require driver changes to format the
information appropriately.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-25 20:57                                             ` Jakub Kicinski
@ 2020-07-30 13:56                                               ` Johannes Berg
  0 siblings, 0 replies; 51+ messages in thread
From: Johannes Berg @ 2020-07-30 13:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Luis Chamberlain, derosier, greearb, jeyu, akpm, arnd, rostedt,
	mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek,
	tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k, jiri, briannorris

Hi,

Sorry ... long delay.

> > The reason I'm asking is that it's starting to sound like we really
> > ought to be implementing devlink, but we've got a bunch of
> > infrastructure that uses the devcoredump, and it'll take time
> > (significantly so) to change all that...
> 
> In devlink world pure FW core dumps are exposed by devlink regions.
> An API allowing reading device memory, registers, etc., but also 
> creating dumps of memory regions when things go wrong. It should be
> a fairly straightforward migration.

Right. We also have regions (various memory pieces, registers, ...).

One issue might be that for devlink we wouldn't want to expose these as
a single blob, I guess, but for devcoredump we already have a custom
format to glue all the things together. Since it seems unlikely that
anyone else would want to use the *iwlwifi* format to glue things
together, we'd have to do something there.

But perhaps that could be a matter of providing a "glue things into a
devcoredump" function that would have a reasonable default but could be
overridden by the driver for those migration cases.

> Devlink health is more targeted, the dump is supposed to contain only
> relevant information, selected and formatted by the driver. When device
> misbehaves driver reads the relevant registers and FW state and creates
> a formatted state dump. I believe each element of the dump must fit
> into a netlink message (but there may be multiple elements, see
> devlink_fmsg_prepare_skb()).

That wouldn't help for our big memory dumps, but OK.

> We should be able to convert dl-regions dumps and dl-health dumps into
> devcoredumps, but since health reporters are supposed to be more
> targeted there's usually multiple of them per device.

Right.

> Conversely devcoredumps can be trivially exposed as dl-region dumps,
> but I believe dl-health would require driver changes to format the
> information appropriately.

Agree.

Anyway, thanks. I'll put it on my list of things to look at ... not too
hopeful that will be soon, given how long it even took me to get back to
this email :)

johannes


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC 1/2] devlink: add simple fw crash helpers
  2020-05-22 17:17                                         ` Jakub Kicinski
  2020-05-22 20:46                                           ` Johannes Berg
@ 2020-05-22 21:49                                           ` Luis Chamberlain
  1 sibling, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-22 21:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k, jiri, briannorris

On Fri, May 22, 2020 at 10:17:38AM -0700, Jakub Kicinski wrote:
> On Fri, 22 May 2020 05:20:46 +0000 Luis Chamberlain wrote:
> > > diff --git a/net/core/Makefile b/net/core/Makefile
> > > index 3e2c378e5f31..6f1513781c17 100644
> > > --- a/net/core/Makefile
> > > +++ b/net/core/Makefile
> > > @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
> > >  obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o
> > >  obj-$(CONFIG_DST_CACHE) += dst_cache.o
> > >  obj-$(CONFIG_HWBM) += hwbm.o
> > > -obj-$(CONFIG_NET_DEVLINK) += devlink.o
> > > +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o  
> > 
> > This was looking super sexy up to here. This is networking specific.
> > We want something generic for *anything* that requests firmware.
> 
> You can't be serious. It's network specific because of how the Kconfig
> is named?

Kconfig? What has that to do with anything? The issue I have is that the
solution I am looking for is for it to be agnostic to the subsystem. I
have found similar firmware crashes on gpu, media, scsci.

> Working for a company operating large data centers I would strongly
> prefer if we didn't have ten different ways of reporting firmware
> problems in the fleet.

Indeed.

> > I'm afraid this won't work for something generic. I don't think its
> > throw-away work though, the idea to provide a generic interface to
> > dump firmware through netlink might be nice for networking, or other
> > things.
> > 
> > But I have a feeling we'll want something still more generic than this.
> 
> Please be specific. Saying generic a lot is not helpful. The code (as
> you can see in this patch) is in no way network specific. Or are you
> saying there are machines out there running without netlink sockets?

No, I am saying I want something to work with any struct device.

> > So networking may want to be aware that a firmware crash happened as
> > part of this network device health thing, but firmware crashing is a
> > generic thing.
> > 
> > I have now extended my patch set to include uvents and I am more set on
> > that we need the taint now more than ever.
> 
> Please expect my nack if you're trying to add this to networking
> drivers.

The uevent mechanism is not for networking.

The taint however is, and I'd like to undertand how it is you do not see
that an undesirable requirement for a reboot is a clear case for a taint.

> The irony is you have a problem with a networking device and all the
> devices your initial set touched are networking. Two of the drivers 
> you touched either have or will soon have devlink health reporters
> implemented.

That is all great, and I don't think its a bad idea to add
infrastructure / extend it to get more information about a firmware
crash dump. However, suggesting that devlink is the only solution we
need in the kernel without considering other subsystems is what I am
suggesting doesn't suit my needs. Networking was just the first
subsystem I am taclking now but I have patches where similar situations
happen across the kernel.

  Luis

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [RFC 2/2] i2400m: use devlink health reporter
  2020-05-19  1:05                                   ` Luis Chamberlain
  2020-05-19 21:15                                     ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski
@ 2020-05-19 21:15                                     ` Jakub Kicinski
  1 sibling, 0 replies; 51+ messages in thread
From: Jakub Kicinski @ 2020-05-19 21:15 UTC (permalink / raw)
  To: mcgrof
  Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo,
	aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai,
	schlad, andriy.shevchenko, keescook, daniel.vetter, will,
	mchehab+samsung, kvalo, davem, netdev, linux-kernel,
	linux-wireless, ath10k, jiri, briannorris, Jakub Kicinski

It builds.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/wimax/i2400m/rx.c  | 2 ++
 drivers/net/wimax/i2400m/usb.c | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/drivers/net/wimax/i2400m/rx.c b/drivers/net/wimax/i2400m/rx.c
index c9fb619a9e01..cc7fe78f2df0 100644
--- a/drivers/net/wimax/i2400m/rx.c
+++ b/drivers/net/wimax/i2400m/rx.c
@@ -144,6 +144,7 @@
  *       i2400m_msg_size_check
  *       wimax_msg
  */
+#include <linux/devlink.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
 #include <linux/if_arp.h>
@@ -712,6 +713,7 @@ void __i2400m_roq_queue(struct i2400m *i2400m, struct i2400m_roq *roq,
 	dev_err(dev, "SW BUG? failed to insert packet\n");
 	dev_err(dev, "ERX: roq %p [ws %u] skb %p nsn %d sn %u\n",
 		roq, roq->ws, skb, nsn, roq_data->sn);
+	devlink_simple_fw_reporter_report_crash(dev);
 	skb_queue_walk(&roq->queue, skb_itr) {
 		roq_data_itr = (struct i2400m_roq_data *) &skb_itr->cb;
 		nsn_itr = __i2400m_roq_nsn(roq, roq_data_itr->sn);
diff --git a/drivers/net/wimax/i2400m/usb.c b/drivers/net/wimax/i2400m/usb.c
index 9659f9e1aaa6..5c811dccbf1d 100644
--- a/drivers/net/wimax/i2400m/usb.c
+++ b/drivers/net/wimax/i2400m/usb.c
@@ -49,6 +49,7 @@
  *   usb_reset_device()
  */
 #include "i2400m-usb.h"
+#include <linux/devlink.h>
 #include <linux/wimax/i2400m.h>
 #include <linux/debugfs.h>
 #include <linux/slab.h>
@@ -423,6 +424,8 @@ int i2400mu_probe(struct usb_interface *iface,
 	if (usb_dev->speed != USB_SPEED_HIGH)
 		dev_err(dev, "device not connected as high speed\n");
 
+	devlink_simple_fw_reporter_prepare(dev);
+
 	/* Allocate instance [calls i2400m_netdev_setup() on it]. */
 	result = -ENOMEM;
 	net_dev = alloc_netdev(sizeof(*i2400mu), "wmx%d", NET_NAME_UNKNOWN,
@@ -506,6 +509,7 @@ int i2400mu_probe(struct usb_interface *iface,
 	usb_put_dev(i2400mu->usb_dev);
 	free_netdev(net_dev);
 error_alloc_netdev:
+	devlink_simple_fw_reporter_cleanup(dev);
 	return result;
 }
 
@@ -532,6 +536,7 @@ void i2400mu_disconnect(struct usb_interface *iface)
 	usb_set_intfdata(iface, NULL);
 	usb_put_dev(i2400mu->usb_dev);
 	free_netdev(net_dev);
+	devlink_simple_fw_reporter_cleanup(dev);
 	d_fnend(3, dev, "(iface %p i2400m %p) = void\n", iface, i2400m);
 }
 
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 13/15] ath6kl: use new module_firmware_crashed()
       [not found] <20200515212846.1347-1-mcgrof@kernel.org>
  2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain
@ 2020-05-15 21:28 ` Luis Chamberlain
  2020-05-16  4:12   ` Rafael Aquini
  2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain
  2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain
  3 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-15 21:28 UTC (permalink / raw)
  To: jeyu
  Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz,
	tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko,
	keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem,
	netdev, linux-kernel, Luis Chamberlain, linux-wireless, ath10k

This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wireless@vger.kernel.org
Cc: ath10k@lists.infradead.org
Cc: Kalle Valo <kvalo@codeaurora.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/wireless/ath/ath6kl/hif.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/ath/ath6kl/hif.c b/drivers/net/wireless/ath/ath6kl/hif.c
index d1942537ea10..cfd838607544 100644
--- a/drivers/net/wireless/ath/ath6kl/hif.c
+++ b/drivers/net/wireless/ath/ath6kl/hif.c
@@ -120,6 +120,7 @@ static int ath6kl_hif_proc_dbg_intr(struct ath6kl_device *dev)
 	int ret;
 
 	ath6kl_warn("firmware crashed\n");
+	module_firmware_crashed();
 
 	/*
 	 * read counter to clear the interrupt, the debug error interrupt is
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 13/15] ath6kl: use new module_firmware_crashed()
  2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain
@ 2020-05-16  4:12   ` Rafael Aquini
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael Aquini @ 2020-05-16  4:12 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: jeyu, akpm, arnd, rostedt, mingo, cai, dyoung, bhe, peterz, tglx,
	gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev,
	linux-kernel, linux-wireless, ath10k

On Fri, May 15, 2020 at 09:28:44PM +0000, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: linux-wireless@vger.kernel.org
> Cc: ath10k@lists.infradead.org
> Cc: Kalle Valo <kvalo@codeaurora.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/net/wireless/ath/ath6kl/hif.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/wireless/ath/ath6kl/hif.c b/drivers/net/wireless/ath/ath6kl/hif.c
> index d1942537ea10..cfd838607544 100644
> --- a/drivers/net/wireless/ath/ath6kl/hif.c
> +++ b/drivers/net/wireless/ath/ath6kl/hif.c
> @@ -120,6 +120,7 @@ static int ath6kl_hif_proc_dbg_intr(struct ath6kl_device *dev)
>  	int ret;
>  
>  	ath6kl_warn("firmware crashed\n");
> +	module_firmware_crashed();
>  
>  	/*
>  	 * read counter to clear the interrupt, the debug error interrupt is
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v2 14/15] brcm80211: use new module_firmware_crashed()
       [not found] <20200515212846.1347-1-mcgrof@kernel.org>
  2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain
  2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain
@ 2020-05-15 21:28 ` Luis Chamberlain
  2020-05-16  4:13   ` Rafael Aquini
  2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain
  3 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-15 21:28 UTC (permalink / raw)
  To: jeyu
  Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz,
	tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko,
	keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem,
	netdev, linux-kernel, Luis Chamberlain, linux-wireless,
	brcm80211-dev-list.pdl, brcm80211-dev-list, Arend van Spriel,
	Franky Lin, Hante Meuleman, Chi-Hsien Lin, Wright Feng,
	Rafał Miłecki, Pieter-Paul Giesberts

This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wireless@vger.kernel.org
Cc: brcm80211-dev-list.pdl@broadcom.com
Cc: brcm80211-dev-list@cypress.com
Cc: Arend van Spriel <arend.vanspriel@broadcom.com>
Cc: Franky Lin <franky.lin@broadcom.com>
Cc: Hante Meuleman <hante.meuleman@broadcom.com>
Cc: Chi-Hsien Lin <chi-hsien.lin@cypress.com>
Cc: Wright Feng <wright.feng@cypress.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: "Rafał Miłecki" <rafal@milecki.pl>
Cc: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
index c88655acc78c..d623f83568b3 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
@@ -1393,6 +1393,7 @@ void brcmf_fw_crashed(struct device *dev)
 	struct brcmf_pub *drvr = bus_if->drvr;
 
 	bphy_err(drvr, "Firmware has halted or crashed\n");
+	module_firmware_crashed();
 
 	brcmf_dev_coredump(dev);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 14/15] brcm80211: use new module_firmware_crashed()
  2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain
@ 2020-05-16  4:13   ` Rafael Aquini
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael Aquini @ 2020-05-16  4:13 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: jeyu, akpm, arnd, rostedt, mingo, cai, dyoung, bhe, peterz, tglx,
	gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev,
	linux-kernel, linux-wireless, brcm80211-dev-list.pdl,
	brcm80211-dev-list, Arend van Spriel, Franky Lin, Hante Meuleman,
	Chi-Hsien Lin, Wright Feng, Rafał Miłecki,
	Pieter-Paul Giesberts

On Fri, May 15, 2020 at 09:28:45PM +0000, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: linux-wireless@vger.kernel.org
> Cc: brcm80211-dev-list.pdl@broadcom.com
> Cc: brcm80211-dev-list@cypress.com
> Cc: Arend van Spriel <arend.vanspriel@broadcom.com>
> Cc: Franky Lin <franky.lin@broadcom.com>
> Cc: Hante Meuleman <hante.meuleman@broadcom.com>
> Cc: Chi-Hsien Lin <chi-hsien.lin@cypress.com>
> Cc: Wright Feng <wright.feng@cypress.com>
> Cc: Kalle Valo <kvalo@codeaurora.org>
> Cc: "Rafał Miłecki" <rafal@milecki.pl>
> Cc: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
> index c88655acc78c..d623f83568b3 100644
> --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
> +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
> @@ -1393,6 +1393,7 @@ void brcmf_fw_crashed(struct device *dev)
>  	struct brcmf_pub *drvr = bus_if->drvr;
>  
>  	bphy_err(drvr, "Firmware has halted or crashed\n");
> +	module_firmware_crashed();
>  
>  	brcmf_dev_coredump(dev);
>  
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v2 15/15] mwl8k: use new module_firmware_crashed()
       [not found] <20200515212846.1347-1-mcgrof@kernel.org>
                   ` (2 preceding siblings ...)
  2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain
@ 2020-05-15 21:28 ` Luis Chamberlain
  2020-05-16  4:13   ` Rafael Aquini
  3 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2020-05-15 21:28 UTC (permalink / raw)
  To: jeyu
  Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz,
	tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko,
	keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem,
	netdev, linux-kernel, Luis Chamberlain, linux-wireless,
	Lennert Buytenhek, Gustavo A. R. Silva, Johannes Berg,
	Ganapathi Bhat

This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crash. When firmware
crashes devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wireless@vger.kernel.org
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Ganapathi Bhat <ganapathi.bhat@nxp.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 drivers/net/wireless/marvell/mwl8k.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/marvell/mwl8k.c b/drivers/net/wireless/marvell/mwl8k.c
index 97f23f93f6e7..d609ef1bb879 100644
--- a/drivers/net/wireless/marvell/mwl8k.c
+++ b/drivers/net/wireless/marvell/mwl8k.c
@@ -1551,6 +1551,7 @@ static int mwl8k_tx_wait_empty(struct ieee80211_hw *hw)
 	 * the firmware has crashed
 	 */
 	if (priv->hw_restart_in_progress) {
+		module_firmware_crashed();
 		if (priv->hw_restart_owner == current)
 			return 0;
 		else
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v2 15/15] mwl8k: use new module_firmware_crashed()
  2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain
@ 2020-05-16  4:13   ` Rafael Aquini
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael Aquini @ 2020-05-16  4:13 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: jeyu, akpm, arnd, rostedt, mingo, cai, dyoung, bhe, peterz, tglx,
	gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook,
	daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev,
	linux-kernel, linux-wireless, Lennert Buytenhek,
	Gustavo A. R. Silva, Johannes Berg, Ganapathi Bhat

On Fri, May 15, 2020 at 09:28:46PM +0000, Luis Chamberlain wrote:
> This makes use of the new module_firmware_crashed() to help
> annotate when firmware for device drivers crash. When firmware
> crashes devices can sometimes become unresponsive, and recovery
> sometimes requires a driver unload / reload and in the worst cases
> a reboot.
> 
> Using a taint flag allows us to annotate when this happens clearly.
> 
> Cc: linux-wireless@vger.kernel.org
> Cc: Lennert Buytenhek <buytenh@wantstofly.org>
> Cc: Kalle Valo <kvalo@codeaurora.org>
> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
> Cc: Johannes Berg <johannes.berg@intel.com>
> Cc: Ganapathi Bhat <ganapathi.bhat@nxp.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  drivers/net/wireless/marvell/mwl8k.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/wireless/marvell/mwl8k.c b/drivers/net/wireless/marvell/mwl8k.c
> index 97f23f93f6e7..d609ef1bb879 100644
> --- a/drivers/net/wireless/marvell/mwl8k.c
> +++ b/drivers/net/wireless/marvell/mwl8k.c
> @@ -1551,6 +1551,7 @@ static int mwl8k_tx_wait_empty(struct ieee80211_hw *hw)
>  	 * the firmware has crashed
>  	 */
>  	if (priv->hw_restart_in_progress) {
> +		module_firmware_crashed();
>  		if (priv->hw_restart_owner == current)
>  			return 0;
>  		else
> -- 
> 2.26.2
> 
Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2020-07-30 13:57 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200515212846.1347-1-mcgrof@kernel.org>
2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain
2020-05-16  4:11   ` Rafael Aquini
2020-05-16 13:24   ` Johannes Berg
2020-05-16 13:50     ` Johannes Berg
2020-05-18 16:56       ` Luis Chamberlain
2020-05-19  1:23       ` Brian Norris
2020-05-19 14:02         ` Luis Chamberlain
2020-05-20  0:47           ` Brian Norris
2020-05-20  5:37             ` Emmanuel Grumbach
2020-05-20  8:32               ` Andy Shevchenko
2020-05-21 19:01               ` Brian Norris
2020-05-22  5:12                 ` Emmanuel Grumbach
2020-05-22  5:23                   ` Luis Chamberlain
2020-05-18 16:51     ` Luis Chamberlain
2020-05-18 16:58       ` Ben Greear
2020-05-18 17:09         ` Luis Chamberlain
2020-05-18 17:15           ` Ben Greear
2020-05-18 17:18             ` Luis Chamberlain
2020-05-18 18:06               ` Steve deRosier
2020-05-18 19:09                 ` Luis Chamberlain
2020-05-18 19:25                   ` Johannes Berg
2020-05-18 19:59                     ` Luis Chamberlain
2020-05-18 20:07                       ` Johannes Berg
2020-05-18 21:18                         ` Luis Chamberlain
2020-05-18 20:28                     ` Jakub Kicinski
2020-05-18 20:29                       ` Johannes Berg
2020-05-18 20:35                         ` Jakub Kicinski
2020-05-18 20:41                           ` Johannes Berg
2020-05-18 20:46                             ` Jakub Kicinski
2020-05-18 21:22                               ` Luis Chamberlain
2020-05-18 22:16                                 ` Jakub Kicinski
2020-05-19  1:05                                   ` Luis Chamberlain
2020-05-19 21:15                                     ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski
2020-05-22  5:20                                       ` Luis Chamberlain
2020-05-22 17:17                                         ` Jakub Kicinski
2020-05-22 20:46                                           ` Johannes Berg
2020-05-22 21:51                                             ` Luis Chamberlain
2020-05-22 23:23                                               ` Steve deRosier
2020-05-22 23:44                                                 ` Luis Chamberlain
2020-05-25  9:07                                                 ` Andy Shevchenko
2020-05-25 17:08                                                   ` Ben Greear
2020-05-25 20:57                                             ` Jakub Kicinski
2020-07-30 13:56                                               ` Johannes Berg
2020-05-22 21:49                                           ` Luis Chamberlain
2020-05-19 21:15                                     ` [RFC 2/2] i2400m: use devlink health reporter Jakub Kicinski
2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain
2020-05-16  4:12   ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain
2020-05-16  4:13   ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain
2020-05-16  4:13   ` Rafael Aquini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).