* [PATCH v2 12/15] ath10k: use new module_firmware_crashed() [not found] <20200515212846.1347-1-mcgrof@kernel.org> @ 2020-05-15 21:28 ` Luis Chamberlain 2020-05-16 4:11 ` Rafael Aquini 2020-05-16 13:24 ` Johannes Berg 2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain ` (2 subsequent siblings) 3 siblings, 2 replies; 51+ messages in thread From: Luis Chamberlain @ 2020-05-15 21:28 UTC (permalink / raw) To: jeyu Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, Luis Chamberlain, linux-wireless, ath10k This makes use of the new module_firmware_crashed() to help annotate when firmware for device drivers crash. When firmware crashes devices can sometimes become unresponsive, and recovery sometimes requires a driver unload / reload and in the worst cases a reboot. Using a taint flag allows us to annotate when this happens clearly. Cc: linux-wireless@vger.kernel.org Cc: ath10k@lists.infradead.org Cc: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- drivers/net/wireless/ath/ath10k/pci.c | 2 ++ drivers/net/wireless/ath/ath10k/sdio.c | 2 ++ drivers/net/wireless/ath/ath10k/snoc.c | 1 + 3 files changed, 5 insertions(+) diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c index 1d941d53fdc9..6bd0f3b518b9 100644 --- a/drivers/net/wireless/ath/ath10k/pci.c +++ b/drivers/net/wireless/ath/ath10k/pci.c @@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work) scnprintf(guid, sizeof(guid), "n/a"); ath10k_err(ar, "firmware crashed! (guid %s)\n", guid); + module_firmware_crashed(); ath10k_print_driver_info(ar); ath10k_pci_dump_registers(ar, crash_data); ath10k_ce_dump_registers(ar, crash_data); @@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar, if (ret) { if (ath10k_pci_has_fw_crashed(ar)) { ath10k_warn(ar, "firmware crashed during chip reset\n"); + module_firmware_crashed(); ath10k_pci_fw_crashed_clear(ar); ath10k_pci_fw_crashed_dump(ar); } diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c index e2aff2254a40..d34ad289380f 100644 --- a/drivers/net/wireless/ath/ath10k/sdio.c +++ b/drivers/net/wireless/ath/ath10k/sdio.c @@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar) /* TODO: Add firmware crash handling */ ath10k_warn(ar, "firmware crashed\n"); + module_firmware_crashed(); /* read counter to clear the interrupt, the debug error interrupt is * counter 0. @@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar) if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) { ath10k_err(ar, "firmware crashed!\n"); queue_work(ar->workqueue, &ar->restart_work); + module_firmware_crashed(); } return ret; } diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c index 354d49b1cd45..7cfc123c345c 100644 --- a/drivers/net/wireless/ath/ath10k/snoc.c +++ b/drivers/net/wireless/ath/ath10k/snoc.c @@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar) scnprintf(guid, sizeof(guid), "n/a"); ath10k_err(ar, "firmware crashed! (guid %s)\n", guid); + module_firmware_crashed(); ath10k_print_driver_info(ar); ath10k_msa_dump_memory(ar, crash_data); mutex_unlock(&ar->dump_mutex); -- 2.26.2 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain @ 2020-05-16 4:11 ` Rafael Aquini 2020-05-16 13:24 ` Johannes Berg 1 sibling, 0 replies; 51+ messages in thread From: Rafael Aquini @ 2020-05-16 4:11 UTC (permalink / raw) To: Luis Chamberlain Cc: jeyu, akpm, arnd, rostedt, mingo, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On Fri, May 15, 2020 at 09:28:43PM +0000, Luis Chamberlain wrote: > This makes use of the new module_firmware_crashed() to help > annotate when firmware for device drivers crash. When firmware > crashes devices can sometimes become unresponsive, and recovery > sometimes requires a driver unload / reload and in the worst cases > a reboot. > > Using a taint flag allows us to annotate when this happens clearly. > > Cc: linux-wireless@vger.kernel.org > Cc: ath10k@lists.infradead.org > Cc: Kalle Valo <kvalo@codeaurora.org> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > drivers/net/wireless/ath/ath10k/pci.c | 2 ++ > drivers/net/wireless/ath/ath10k/sdio.c | 2 ++ > drivers/net/wireless/ath/ath10k/snoc.c | 1 + > 3 files changed, 5 insertions(+) > > diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c > index 1d941d53fdc9..6bd0f3b518b9 100644 > --- a/drivers/net/wireless/ath/ath10k/pci.c > +++ b/drivers/net/wireless/ath/ath10k/pci.c > @@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work) > scnprintf(guid, sizeof(guid), "n/a"); > > ath10k_err(ar, "firmware crashed! (guid %s)\n", guid); > + module_firmware_crashed(); > ath10k_print_driver_info(ar); > ath10k_pci_dump_registers(ar, crash_data); > ath10k_ce_dump_registers(ar, crash_data); > @@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar, > if (ret) { > if (ath10k_pci_has_fw_crashed(ar)) { > ath10k_warn(ar, "firmware crashed during chip reset\n"); > + module_firmware_crashed(); > ath10k_pci_fw_crashed_clear(ar); > ath10k_pci_fw_crashed_dump(ar); > } > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c > index e2aff2254a40..d34ad289380f 100644 > --- a/drivers/net/wireless/ath/ath10k/sdio.c > +++ b/drivers/net/wireless/ath/ath10k/sdio.c > @@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar) > > /* TODO: Add firmware crash handling */ > ath10k_warn(ar, "firmware crashed\n"); > + module_firmware_crashed(); > > /* read counter to clear the interrupt, the debug error interrupt is > * counter 0. > @@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar) > if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) { > ath10k_err(ar, "firmware crashed!\n"); > queue_work(ar->workqueue, &ar->restart_work); > + module_firmware_crashed(); > } > return ret; > } > diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c > index 354d49b1cd45..7cfc123c345c 100644 > --- a/drivers/net/wireless/ath/ath10k/snoc.c > +++ b/drivers/net/wireless/ath/ath10k/snoc.c > @@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar) > scnprintf(guid, sizeof(guid), "n/a"); > > ath10k_err(ar, "firmware crashed! (guid %s)\n", guid); > + module_firmware_crashed(); > ath10k_print_driver_info(ar); > ath10k_msa_dump_memory(ar, crash_data); > mutex_unlock(&ar->dump_mutex); > -- > 2.26.2 > Acked-by: Rafael Aquini <aquini@redhat.com> ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain 2020-05-16 4:11 ` Rafael Aquini @ 2020-05-16 13:24 ` Johannes Berg 2020-05-16 13:50 ` Johannes Berg 2020-05-18 16:51 ` Luis Chamberlain 1 sibling, 2 replies; 51+ messages in thread From: Johannes Berg @ 2020-05-16 13:24 UTC (permalink / raw) To: Luis Chamberlain, jeyu Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed You didn't CC me or the wireless list on the rest of the patches, so I'm replying to a random one, but ... What is the point here? This should in no way affect the integrity of the system/kernel, for most devices anyway. So what if ath10k's firmware crashes? If there's a driver bug it will not handle it right (and probably crash, WARN_ON, or something else), but if the driver is working right then that will not affect the kernel at all. So maybe I can understand that maybe you want an easy way to discover - per device - that the firmware crashed, but that still doesn't warrant a complete kernel taint. Instead of the kernel taint, IMHO you should provide an annotation in sysfs (or somewhere else) for the *struct device* that had its firmware crash. Or maybe, if it's too complex to walk the entire hierarchy checking for that, have a uevent, or add the ability for the kernel to print out elsewhere in debugfs the list of devices that crashed at some point... All of that is fine, but a kernel taint? johannes ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-16 13:24 ` Johannes Berg @ 2020-05-16 13:50 ` Johannes Berg 2020-05-18 16:56 ` Luis Chamberlain 2020-05-19 1:23 ` Brian Norris 2020-05-18 16:51 ` Luis Chamberlain 1 sibling, 2 replies; 51+ messages in thread From: Johannes Berg @ 2020-05-16 13:50 UTC (permalink / raw) To: Luis Chamberlain, jeyu Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote: > Instead of the kernel taint, IMHO you should provide an annotation in > sysfs (or somewhere else) for the *struct device* that had its firmware > crash. Or maybe, if it's too complex to walk the entire hierarchy > checking for that, have a uevent, or add the ability for the kernel to > print out elsewhere in debugfs the list of devices that crashed at some I mean sysfs, oops. In addition, look what we have in iwl_trans_pcie_removal_wk(). If we detect that the device is really wedged enough that the only way we can still try to recover is by completely unbinding the driver from it, then we give userspace a uevent for that. I don't remember exactly how and where that gets used (ChromeOS) though, but it'd be nice to have that sort of thing as part of the infrastructure, in a sort of two-level notification? Level 1: firmware crashed, but we're recovering, at least mostly, and it's more informational Level 2: device is wedged, going to try to recover by some more forceful means (perhaps some devices can be power-cycled? etc.) but (more) state would be lost in these cases? Still don't think a kernel taint is appropriate for either of these. johannes ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-16 13:50 ` Johannes Berg @ 2020-05-18 16:56 ` Luis Chamberlain 2020-05-19 1:23 ` Brian Norris 1 sibling, 0 replies; 51+ messages in thread From: Luis Chamberlain @ 2020-05-18 16:56 UTC (permalink / raw) To: Johannes Berg Cc: jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On Sat, May 16, 2020 at 03:50:55PM +0200, Johannes Berg wrote: > On Sat, 2020-05-16 at 15:24 +0200, Johannes Berg wrote: > > > Instead of the kernel taint, IMHO you should provide an annotation in > > sysfs (or somewhere else) for the *struct device* that had its firmware > > crash. Or maybe, if it's too complex to walk the entire hierarchy > > checking for that, have a uevent, or add the ability for the kernel to > > print out elsewhere in debugfs the list of devices that crashed at some > > I mean sysfs, oops. > > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we > detect that the device is really wedged enough that the only way we can > still try to recover is by completely unbinding the driver from it, then > we give userspace a uevent for that. Nice! Indeed a uevent is in order for these sorts of things, and I'd argue that it begs the question if we should even uevent for any taint as well. Today these are silent. If the kernel crashes, today we only give userspace a log. > I don't remember exactly how and > where that gets used (ChromeOS) though, but it'd be nice to have that > sort of thing as part of the infrastructure, in a sort of two-level > notification? > > Level 1: firmware crashed, but we're recovering, at least mostly, and > it's more informational > > Level 2: device is wedged, going to try to recover by some more forceful > means (perhaps some devices can be power-cycled? etc.) but (more) state > would be lost in these cases? I agree that *all* this would be ideal. I don't see this as mutually exclusive with a taint on the kernel and module for the device. > Still don't think a kernel taint is appropriate for either of these. From a support perspective, I do think it is vital quick information. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-16 13:50 ` Johannes Berg 2020-05-18 16:56 ` Luis Chamberlain @ 2020-05-19 1:23 ` Brian Norris 2020-05-19 14:02 ` Luis Chamberlain 1 sibling, 1 reply; 51+ messages in thread From: Brian Norris @ 2020-05-19 1:23 UTC (permalink / raw) To: Johannes Berg Cc: Luis Chamberlain, jeyu, Andrew Morton, Arnd Bergmann, Steven Rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, Andy Shevchenko, Kees Cook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, <netdev@vger.kernel.org>, Linux Kernel, linux-wireless, ath10k On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote: > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we > detect that the device is really wedged enough that the only way we can > still try to recover is by completely unbinding the driver from it, then > we give userspace a uevent for that. I don't remember exactly how and > where that gets used (ChromeOS) though, but it'd be nice to have that > sort of thing as part of the infrastructure, in a sort of two-level > notification? <slight side track> We use this on certain devices where we know the underlying hardware has design issues that may lead to device failure -- then when we see this sort of unrecoverable "firmware-death", we remove the device[*]+driver, force-reset the PCI device (SBR), and try to reload/reattach the driver. This all happens by way of a udev rule. We also log this sort of stuff (and metrics around it) for bug reports and health statistics, since we really hope to not see this happen often. [*] "We" (user space) don't actually do this...it happens via the 'remove_when_gone' module parameter abomination found in iwlwifi. I'd personally rather see the EVENT=INACESSIBLE stuff on its own, and let user space deal with when and how to remove and reset the device. But I digress too much here ;) </slight side track> I really came to this thread to say that I also love the idea of a generic mechanism (a la $subject) to report firmware crashes, but I also have no interest in seeing a taint flag for it. For Chrome OS, I would readily (as in, we're already looking at more-hacky / non-generic ways to do this for drivers we care about) process these kinds of stats as they happen, logging metrics for bug reports and/or for automated crash statistics, when we see a firmware crash. A uevent would suit us very well I think, although it would be nice if drivers could also supply some small amount of informative text along with it (e.g., a sort of "reason code", in case we can possibly aggregate certain failure types). We already do this sort of thing for WARN() and friends (not via uevent, but via log parsing; at least it has nice "cut here" markers!). Perhaps devlink (as proposed down-thread) would also fit the bill. I don't think sysfs alone would fit our needs, as we'd like to process these things as they happen, not only when a user submits a bug report. > Level 1: firmware crashed, but we're recovering, at least mostly, and > it's more informational Chrome OS would love to track these things too, since we'd like to see these minimized, even if they're usually recoverable ;) > Level 2: device is wedged, going to try to recover by some more forceful > means (perhaps some devices can be power-cycled? etc.) but (more) state > would be lost in these cases? And we'd definitely want to know about these. We already get this for the iwlwifi case described above, in a non-generic way. In general, it's probably not that easy to tell the difference between 1 and 2, since even as you and Luis have noted, with the same driver (and the same driver location), you find the same crashes may or may not be recoverable. iwlwifi has extracted certain level 2 cases into iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the ways in which level 1 crashes actually lead to severe (non-recoverable) failure. Regards, Brian ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-19 1:23 ` Brian Norris @ 2020-05-19 14:02 ` Luis Chamberlain 2020-05-20 0:47 ` Brian Norris 0 siblings, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-19 14:02 UTC (permalink / raw) To: Brian Norris Cc: Johannes Berg, linux-wireless, aquini, peterz, daniel.vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai, mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo, <netdev@vger.kernel.org>, schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote: > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote: > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we > > detect that the device is really wedged enough that the only way we can > > still try to recover is by completely unbinding the driver from it, then > > we give userspace a uevent for that. I don't remember exactly how and > > where that gets used (ChromeOS) though, but it'd be nice to have that > > sort of thing as part of the infrastructure, in a sort of two-level > > notification? > > <slight side track> > We use this on certain devices where we know the underlying hardware > has design issues that may lead to device failure Ah, after reading below I see you meant for iwlwifi. If userspace can indeed grow to support this, that would be fantastic. I should note that I don't discourage hiding firmware or hardware issues. Quite the contrary, I suspect that taking pride in being trasnparent about it, and dealing with it fast can help lead the pack. I wrote about this long ago in 2015 [0], and stand by it. [0] https://www.do-not-panic.com/2015/04/god-complex-why-open-models-will-win.html > -- then when we see > this sort of unrecoverable "firmware-death", we remove the > device[*]+driver, force-reset the PCI device (SBR), and try to > reload/reattach the driver. This all happens by way of a udev rule. So you've sprikled your own udev event here as part of your kernel delta? > We > also log this sort of stuff (and metrics around it) for bug reports > and health statistics, since we really hope to not see this happen > often. Assuming perfection is ideal but silly. So, what infrastructure do you use for this sort of issue? > [*] "We" (user space) don't actually do this...it happens via the > 'remove_when_gone' module parameter abomination found in iwlwifi. Holy moly.. but hey, at least it may seem a bit more seemless than forcing a reboot / manual driver removal / addition to the user. BTW is this likely a place on iwlwifi where the firmware likely crashed? > I'd > personally rather see the EVENT=INACESSIBLE stuff on its own, and let > user space deal with when and how to remove and reset the device. But > I digress too much here ;) > </slight side track> This is all useful information. We are just touching the surface of the topic by addressing networking first. Imagine when we address other subsystems. > I really came to this thread to say that I also love the idea of a > generic mechanism (a la $subject) to report firmware crashes, but I > also have no interest in seeing a taint flag for it. For Chrome OS, I > would readily (as in, we're already looking at more-hacky / > non-generic ways to do this for drivers we care about) process these > kinds of stats as they happen, logging metrics for bug reports and/or > for automated crash statistics, when we see a firmware crash. Great! > A uevent > would suit us very well I think, although it would be nice if drivers > could also supply some small amount of informative text along with it A follow up to this series was to add a uevent to add_taint(), however since a *count* is not considered I think it is correct to seek alternatives at this point. The leaner the solution the better though. Do you have a pointer to what guys use so I can read? > (e.g., a sort of "reason code", in case we can possibly aggregate > certain failure types). We already do this sort of thing for WARN() > and friends (not via uevent, but via log parsing; at least it has nice > "cut here" markers!). Indeed, similar things can indeed be argued about WARN*()... this however can be non-device specific. With panic-on-warn becoming a "thing", the more important it becomes to really tally exactly *why* these WARN*()s may trigger. > Perhaps Note below. > devlink (as proposed down-thread) would also fit the bill. I > don't think sysfs alone would fit our needs, as we'd like to process > these things as they happen, not only when a user submits a bug > report. I think we've reached a point where using "*Perhaps*" does not suffice, and if there is already a *user* of similar desired infrastructure I think we should jump on the opportunity to replace what you have with something which could be used by other devices / subsystems which require firmware. And indeed, also even consider in the abstract sense, the possibility to leverage something like this for WARN*()s later too. > > Level 1: firmware crashed, but we're recovering, at least mostly, and > > it's more informational > > Chrome OS would love to track these things too, since we'd like to see > these minimized, even if they're usually recoverable ;) > > > Level 2: device is wedged, going to try to recover by some more forceful > > means (perhaps some devices can be power-cycled? etc.) but (more) state > > would be lost in these cases? > > And we'd definitely want to know about these. We already get this for > the iwlwifi case described above, in a non-generic way. > > In general, it's probably not that easy to tell the difference between > 1 and 2, since even as you and Luis have noted, with the same driver > (and the same driver location), you find the same crashes may or may > not be recoverable. iwlwifi has extracted certain level 2 cases into > iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the > ways in which level 1 crashes actually lead to severe > (non-recoverable) failure. And that is fine, accepting these for what they are will help. However, leaving the user in the *dark*, is what we should *not do*. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-19 14:02 ` Luis Chamberlain @ 2020-05-20 0:47 ` Brian Norris 2020-05-20 5:37 ` Emmanuel Grumbach 0 siblings, 1 reply; 51+ messages in thread From: Brian Norris @ 2020-05-20 0:47 UTC (permalink / raw) To: Luis Chamberlain Cc: Johannes Berg, linux-wireless, aquini, peterz, Daniel Vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai, mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo, <netdev@vger.kernel.org>, schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller Hi Luis, On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote: > > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote: > > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we > > > detect that the device is really wedged enough that the only way we can > > > still try to recover is by completely unbinding the driver from it, then > > > we give userspace a uevent for that. I don't remember exactly how and > > > where that gets used (ChromeOS) though, but it'd be nice to have that > > > sort of thing as part of the infrastructure, in a sort of two-level > > > notification? > > > > <slight side track> > > We use this on certain devices where we know the underlying hardware > > has design issues that may lead to device failure > > Ah, after reading below I see you meant for iwlwifi. Sorry, I was replying to Johannes, who I believe had his "we"="Intel" hat (as iwlwifi maintainer) on, and was pointing at iwl_trans_pcie_removal_wk(). > If userspace can indeed grow to support this, that would be fantastic. Well, Chrome OS tailors its user space a bit more to the hardware (and kernel/drivers in use) than the average distro might. We already do this (for some values of "this") today. Is that "fantastic" to you? :D > > -- then when we see > > this sort of unrecoverable "firmware-death", we remove the > > device[*]+driver, force-reset the PCI device (SBR), and try to > > reload/reattach the driver. This all happens by way of a udev rule. > > So you've sprikled your own udev event here as part of your kernel delta? No kernel delta -- the event is there already: iwl_trans_pcie_removal_wk() https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027 And you can see our udev rules and scripts, in all their ugly details here, if you really care: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/ > > We > > also log this sort of stuff (and metrics around it) for bug reports > > and health statistics, since we really hope to not see this happen > > often. > > Assuming perfection is ideal but silly. So, what infrastructure do you > use for this sort of issue? We don't yet log firmware crashes generally, but for all our current crash reports (including WARN()), they go through this: https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md For example, look for "cut here" in: https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc For other specific metrics (like counting "EVENT=INACCESSIBLE"), we use the Chrome UMA system: https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md I don't imagine the "infrastructure" side of any of that would be useful to you, but maybe the client-side gathering can at least show you what we do. > > [*] "We" (user space) don't actually do this...it happens via the > > 'remove_when_gone' module parameter abomination found in iwlwifi. > > BTW is this likely a place on iwlwifi where the firmware likely crashed? iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out in a way that is likely due to a dead PCIe endpoint. It's not directly a firmware crash, although there may be firmware crashes reported around the same time. Intel folks can probably give a more nuanced answer, but their firmware crashes usually land in something like iwl_mvm_nic_error() -> iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). If you can make your proposal less punishing (e.g., removing the "taint" as Johannes requested), maybe iwlwifi would be another good candidate for $subject. iwlwifi generally expects to recover seamlessly, but it's also good to know if you've seen a hundred of these in a row. > > A uevent > > would suit us very well I think, although it would be nice if drivers > > could also supply some small amount of informative text along with it > > A follow up to this series was to add a uevent to add_taint(), however > since a *count* is not considered I think it is correct to seek > alternatives at this point. The leaner the solution the better though. > > Do you have a pointer to what guys use so I can read? Hopefully the above pointers are useful to you. We don't yet have firmware-crash parsing implemented, so I don't have pointers for that piece yet. > > (e.g., a sort of "reason code", in case we can possibly aggregate > > certain failure types). We already do this sort of thing for WARN() > > and friends (not via uevent, but via log parsing; at least it has nice > > "cut here" markers!). > > Indeed, similar things can indeed be argued about WARN*()... this > however can be non-device specific. With panic-on-warn becoming a > "thing", the more important it becomes to really tally exactly *why* > these WARN*()s may trigger. panic-on-warn? Yikes. I guess such folks don't run mac80211... I'm probably not supposed to publicize information related to how many Chromebooks are out there, but we regularly see a scary number of non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users every day, and a scary number of those are in WiFi drivers... > > Perhaps > > Note below. (My use of "perhaps" is only because I'm not familiar with devlink at all.) > > devlink (as proposed down-thread) would also fit the bill. I > > don't think sysfs alone would fit our needs, as we'd like to process > > these things as they happen, not only when a user submits a bug > > report. > > I think we've reached a point where using "*Perhaps*" does not suffice, > and if there is already a *user* of similar desired infrastructure I > think we should jump on the opportunity to replace what you have with > something which could be used by other devices / subsystems which > require firmware. And indeed, also even consider in the abstract sense, > the possibility to leverage something like this for WARN*()s later too. To precisely lay out what Chrome OS does today: * WARN() and similar: implemented, see anomaly_detector.cc above. It's just log parsing, and that handy "cut here" stuff -- I'm nearly certain there are other distros using this already? A uevent would probably be nice, but log parsing is good enough for us today. * EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked above. It's a uevent, and we handle it via udev. Code is linked above. * Other firmware crashes: not yet implemented here, but we're looking at it. Currently, we plan to do something similar to WARN(), but it will be ugly and non-generic. A uevent would be a lovely replacement, if it has some basic metadata with it. Stats in sysfs might be icing on the cake, but mostly redundant for us, if we can grab it on the fly via uevent. HTH, Brian ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-20 0:47 ` Brian Norris @ 2020-05-20 5:37 ` Emmanuel Grumbach 2020-05-20 8:32 ` Andy Shevchenko 2020-05-21 19:01 ` Brian Norris 0 siblings, 2 replies; 51+ messages in thread From: Emmanuel Grumbach @ 2020-05-20 5:37 UTC (permalink / raw) To: Brian Norris Cc: Luis Chamberlain, Johannes Berg, linux-wireless, aquini, peterz, Daniel Vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai, mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo, <netdev@vger.kernel.org>, schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller Hi all, <top post intro> Since I have been involved quite a bit in the firmware debugging features in iwlwifi, I think I can give a few insights here. But before this, we need to understand that there are several sources of issues: 1) the firmware may crash but the bus is still alive, you can still use the bus to get the crash data 2) the bus is dead, when that happens, the firmware might even be in a good condition, but since the bus is dead, you stop getting any information about the firmware, and then, at some point, you get to the conclusion that the firmware is dead. You can't get the crash data that resides on the other side of the bus (you may have gathered data in the DRAM directly, but that's a different thing), and you don't have much recovery to do besides re-starting the PCI enumeration. At Intel, we have seen both unfortunately. The bus issues are the ones that are trickier obviously. Trickier to detect (because you just get garbage from any request you issue on the bus), and trickier to handle. One can argue that the kernel should *not* handle those and let this in userspace hands. I guess it all depends on what component you ship to your customer and what you customer asks from you :). </top post intro> > > Hi Luis, > > On Tue, May 19, 2020 at 7:02 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote: > > > On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote: > > > > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we > > > > detect that the device is really wedged enough that the only way we can > > > > still try to recover is by completely unbinding the driver from it, then > > > > we give userspace a uevent for that. I don't remember exactly how and > > > > where that gets used (ChromeOS) though, but it'd be nice to have that > > > > sort of thing as part of the infrastructure, in a sort of two-level > > > > notification? > > > > > > <slight side track> > > > We use this on certain devices where we know the underlying hardware > > > has design issues that may lead to device failure > > > > Ah, after reading below I see you meant for iwlwifi. > > Sorry, I was replying to Johannes, who I believe had his "we"="Intel" > hat (as iwlwifi maintainer) on, and was pointing at > iwl_trans_pcie_removal_wk(). > This pcie_removal thing is for the bus dead thing. My 2) above. > > If userspace can indeed grow to support this, that would be fantastic. > > Well, Chrome OS tailors its user space a bit more to the hardware (and > kernel/drivers in use) than the average distro might. We already do > this (for some values of "this") today. Is that "fantastic" to you? :D I guess it can be fantastic if other vendors also suffer from this. Or maybe that could be done as part of the PCI bus driver inside the kernel? > > > > -- then when we see > > > this sort of unrecoverable "firmware-death", we remove the > > > device[*]+driver, force-reset the PCI device (SBR), and try to > > > reload/reattach the driver. This all happens by way of a udev rule. > > > > So you've sprikled your own udev event here as part of your kernel delta? > > No kernel delta -- the event is there already: > iwl_trans_pcie_removal_wk() > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?h=v5.6#n2027 > > And you can see our udev rules and scripts, in all their ugly details > here, if you really care: > https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/net-wireless/iwlwifi_rescan/files/ > > > > We > > > also log this sort of stuff (and metrics around it) for bug reports > > > and health statistics, since we really hope to not see this happen > > > often. > > > > Assuming perfection is ideal but silly. So, what infrastructure do you > > use for this sort of issue? > > We don't yet log firmware crashes generally, but for all our current > crash reports (including WARN()), they go through this: > https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/README.md > > For example, look for "cut here" in: > https://chromium.googlesource.com/chromiumos/platform2/+/master/crash-reporter/anomaly_detector.cc > > For other specific metrics (like counting "EVENT=INACCESSIBLE"), we > use the Chrome UMA system: > https://chromium.googlesource.com/chromiumos/platform2/+/master/metrics/README.md > > I don't imagine the "infrastructure" side of any of that would be > useful to you, but maybe the client-side gathering can at least show > you what we do. > > > > [*] "We" (user space) don't actually do this...it happens via the > > > 'remove_when_gone' module parameter abomination found in iwlwifi. > > > > BTW is this likely a place on iwlwifi where the firmware likely crashed? > > iwl_trans_pcie_removal_wk() is triggered because HW accesses timed out > in a way that is likely due to a dead PCIe endpoint. It's not directly > a firmware crash, although there may be firmware crashes reported > around the same time. iwl_trans_pcie_removal_wk() is only because of a dead bus, not because of a firmware crash. > > Intel folks can probably give a more nuanced answer, but their > firmware crashes usually land in something like iwl_mvm_nic_error() -> > iwl_mvm_dump_nic_error_log() + iwl_mvm_nic_restart(). If you can make > your proposal less punishing (e.g., removing the "taint" as Johannes > requested), maybe iwlwifi would be another good candidate for > $subject. iwlwifi generally expects to recover seamlessly, but it's > also good to know if you've seen a hundred of these in a row. Yes, you are right, mvm_nic_error is the place. So here is the proposal. I think that a standard interface that can allow a driver to report a firmware crash along with a proprietary unique id that makes sense to the vendor can be useful. Then, yes, distros can listen to this, optionally open bugs (automatically?) when that happens. I even planned to do this long ago and integrated with a specific distro, but it never rolled out. The vendor supplied unique id is very important for the bug de-duplication part. For iwlwifi, we have the SYSASSERT number which is basically how the firmware tells us briefly what happened. Of course, there is always more data that can be useful, but for a first level of bug de-duplication this can be enough. Then, you have a standard way to track the firmware crashes. We need to remember that the firmware recovery path is expected to work, it is, yet, less tested. I specifically remember a bug where a crash because by a bad handling of a CSA frame caused a firmware crash in a flow that caused the driver not to re-add a station or something like that, and because of that, you get another firmware crash. So sometimes it is interesting to see not only the data on which firmware crash happened and how many times, but if there is a timing relationship between them. I guess that's for the next level though... Not really critical for now. The next problem here is that when you tell the firmware folks that they have a bug, the first they'll do is to ask for more data. Back then, I enabled our firmware debug infra on Linux, and from there devcoredump infra was born (thanks Johannes). What we do at Intel, is that we have a script that runs when a udev event reports the creation of a devcoredump. It parses the kernel log (dmesg) to determine the unique id I mentioned earlier (the SYSSASSERT number basically) and then it creates a file in /var/log/ whose name contain the SYSSASSERT number. It is then quite easy to match the firmware data with the firmware crash. So I believe the right way to do this is to create the devcoredump along with the id. Meaning that we basically don't need another interface, we just need to use the same one, but to provide the unique id of the crash. This unique id can then be part of the uevent that is thrown to the userspace and userspace can use it to name the dump file with the right id. This way, it is fairly easy (and standard across vendors) to track the firmware crashes, but the most important is that you already have the firmware data that goes along with it. It would look like this. driver_code.c: void my_vendor_error_interrupt() { u64 uid = get_the_unique_id_from_your_device(); void *firmware_data = get_the_data_you_need(); dev_coredumpsg(dev_struct, firmware_data, firmware_data_len, GFP_whatver, uid); } And then, this would cause a: /var/log/wifi-crash-$(KBUILD_MODNAME)-,uid.bin to appear our your filesystem. > > > > A uevent > > > would suit us very well I think, although it would be nice if drivers > > > could also supply some small amount of informative text along with it > > > > A follow up to this series was to add a uevent to add_taint(), however > > since a *count* is not considered I think it is correct to seek > > alternatives at this point. The leaner the solution the better though. > > > > Do you have a pointer to what guys use so I can read? > > Hopefully the above pointers are useful to you. We don't yet have > firmware-crash parsing implemented, so I don't have pointers for that > piece yet. See above. I don't think it is the device driver's role to count those. We can add this count this in userspace I think. Debatable though. > > > > (e.g., a sort of "reason code", in case we can possibly aggregate > > > certain failure types). We already do this sort of thing for WARN() > > > and friends (not via uevent, but via log parsing; at least it has nice > > > "cut here" markers!). > > > > Indeed, similar things can indeed be argued about WARN*()... this > > however can be non-device specific. With panic-on-warn becoming a > > "thing", the more important it becomes to really tally exactly *why* > > these WARN*()s may trigger. > > panic-on-warn? Yikes. I guess such folks don't run mac80211... I'm > probably not supposed to publicize information related to how many > Chromebooks are out there, but we regularly see a scary number of > non-fatal warnings (WARN(), WARN_ON(), etc.) logged by Chrome OS users > every day, and a scary number of those are in WiFi drivers... > > > > Perhaps > > > > Note below. > > (My use of "perhaps" is only because I'm not familiar with devlink at all.) > > > > devlink (as proposed down-thread) would also fit the bill. I > > > don't think sysfs alone would fit our needs, as we'd like to process > > > these things as they happen, not only when a user submits a bug > > > report. > > > > I think we've reached a point where using "*Perhaps*" does not suffice, > > and if there is already a *user* of similar desired infrastructure I > > think we should jump on the opportunity to replace what you have with > > something which could be used by other devices / subsystems which > > require firmware. And indeed, also even consider in the abstract sense, > > the possibility to leverage something like this for WARN*()s later too. > > To precisely lay out what Chrome OS does today: > > * WARN() and similar: implemented, see anomaly_detector.cc above. It's > just log parsing, and that handy "cut here" stuff -- I'm nearly > certain there are other distros using this already? A uevent would > probably be nice, but log parsing is good enough for us today. > * EVENT=INACCESSIBLE: iwlwifi-specific, but reference code is linked > above. It's a uevent, and we handle it via udev. Code is linked above. > * Other firmware crashes: not yet implemented here, but we're looking > at it. Currently, we plan to do something similar to WARN(), but it > will be ugly and non-generic. A uevent would be a lovely replacement, > if it has some basic metadata with it. Stats in sysfs might be icing > on the cake, but mostly redundant for us, if we can grab it on the fly > via uevent. So I believe we already have this uevent, it is the devcoredump. All we need is to add the unique id. Note that all this is good for firmware crashes, and not for bus dead scenarios, but those two are fundamentally different even if they look alike at the beginning of your error detection flow. > > HTH, > Brian ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-20 5:37 ` Emmanuel Grumbach @ 2020-05-20 8:32 ` Andy Shevchenko 2020-05-21 19:01 ` Brian Norris 1 sibling, 0 replies; 51+ messages in thread From: Andy Shevchenko @ 2020-05-20 8:32 UTC (permalink / raw) To: Emmanuel Grumbach Cc: Brian Norris, Luis Chamberlain, Johannes Berg, linux-wireless, aquini, Peter Zijlstra (Intel), Daniel Vetter, Mauro Carvalho Chehab, Will Deacon, Baoquan He, ath10k, Takashi Iwai, Ingo Molnar, Dave Young, Petr Mladek, Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai, Thomas Gleixner, Andy Shevchenko, Kalle Valo, <netdev@vger.kernel.org>, schlad, Linux Kernel, Jessica Yu, Andrew Morton, David S. Miller On Wed, May 20, 2020 at 8:40 AM Emmanuel Grumbach <egrumbach@gmail.com> wrote: > Since I have been involved quite a bit in the firmware debugging > features in iwlwifi, I think I can give a few insights here. > > But before this, we need to understand that there are several sources of issues: > 1) the firmware may crash but the bus is still alive, you can still > use the bus to get the crash data > 2) the bus is dead, when that happens, the firmware might even be in a > good condition, but since the bus is dead, you stop getting any > information about the firmware, and then, at some point, you get to > the conclusion that the firmware is dead. You can't get the crash data > that resides on the other side of the bus (you may have gathered data > in the DRAM directly, but that's a different thing), and you don't > have much recovery to do besides re-starting the PCI enumeration. > > At Intel, we have seen both unfortunately. The bus issues are the ones > that are trickier obviously. Trickier to detect (because you just get > garbage from any request you issue on the bus), and trickier to > handle. One can argue that the kernel should *not* handle those and > let this in userspace hands. I guess it all depends on what component > you ship to your customer and what you customer asks from you :). Or the two best approaches: 1) get rid of firmware completely; 2) make it OSS (like SOF). I think any of these is a right thing to do in long-term perspective. How many firmwares average computer has? 50? 100? Any of them is a burden and PITA. -- With Best Regards, Andy Shevchenko ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-20 5:37 ` Emmanuel Grumbach 2020-05-20 8:32 ` Andy Shevchenko @ 2020-05-21 19:01 ` Brian Norris 2020-05-22 5:12 ` Emmanuel Grumbach 1 sibling, 1 reply; 51+ messages in thread From: Brian Norris @ 2020-05-21 19:01 UTC (permalink / raw) To: Emmanuel Grumbach Cc: Luis Chamberlain, Johannes Berg, linux-wireless, aquini, peterz, Daniel Vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai, mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo, <netdev@vger.kernel.org>, schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote: > So I believe we already have this uevent, it is the devcoredump. All > we need is to add the unique id. I think there are a few reasons that devcoredump doesn't satisfy what either Luis or I want. 1) it can be disabled entirely [1], for good reasons (e.g., think of non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything with the opaque dumps provided by closed-source firmware) 2) not all drivers necessarily have a useful dump to provide when there's a crash; look at the rest of Luis's series to see the kinds of drivers-with-firmware that are crashing, some of which aren't dumping anything 3) for those that do support devcoredump, it may be used for purposes that are not "crashes" -- e.g., some provide debugfs or other knobs to initiate dumps, for diagnostic or debugging purposes Brian [1] devcd_disabled https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-21 19:01 ` Brian Norris @ 2020-05-22 5:12 ` Emmanuel Grumbach 2020-05-22 5:23 ` Luis Chamberlain 0 siblings, 1 reply; 51+ messages in thread From: Emmanuel Grumbach @ 2020-05-22 5:12 UTC (permalink / raw) To: Brian Norris Cc: Luis Chamberlain, Johannes Berg, linux-wireless, aquini, peterz, Daniel Vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai, mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo, <netdev@vger.kernel.org>, schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller > > On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote: > > So I believe we already have this uevent, it is the devcoredump. All > > we need is to add the unique id. > > I think there are a few reasons that devcoredump doesn't satisfy what > either Luis or I want. > > 1) it can be disabled entirely [1], for good reasons (e.g., think of > non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything > with the opaque dumps provided by closed-source firmware) Ok, if all you're interested into is the information that this event happen (as opposed to report a bug and providing the data), then I agree. True, not everybody want or can enable devcoredump. I am just a bit concerned that we may end up with two interface that notify the same event basically. The ideal maybe would be to be able to optionally reduce the content of the devoredump to nothing more that is already in the dmesg output. But then, it is not what it is meant to be: namely, a core dump.. > 2) not all drivers necessarily have a useful dump to provide when > there's a crash; look at the rest of Luis's series to see the kinds of > drivers-with-firmware that are crashing, some of which aren't dumping > anything Fair enouh. > 3) for those that do support devcoredump, it may be used for purposes > that are not "crashes" -- e.g., some provide debugfs or other knobs to > initiate dumps, for diagnostic or debugging purposes Not sure I really think we need to care about those cases, but you already have 2 good arguments :) > > Brian > > [1] devcd_disabled > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/base/devcoredump.c?h=v5.6#n22 ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-22 5:12 ` Emmanuel Grumbach @ 2020-05-22 5:23 ` Luis Chamberlain 0 siblings, 0 replies; 51+ messages in thread From: Luis Chamberlain @ 2020-05-22 5:23 UTC (permalink / raw) To: Emmanuel Grumbach Cc: Brian Norris, Johannes Berg, linux-wireless, aquini, peterz, Daniel Vetter, mchehab+samsung, will, bhe, ath10k, Takashi Iwai, mingo, dyoung, pmladek, Kees Cook, Arnd Bergmann, gpiccoli, Steven Rostedt, cai, tglx, Andy Shevchenko, Kalle Valo, <netdev@vger.kernel.org>, schlad, Linux Kernel, jeyu, Andrew Morton, David S. Miller On Fri, May 22, 2020 at 08:12:59AM +0300, Emmanuel Grumbach wrote: > > > > On Tue, May 19, 2020 at 10:37 PM Emmanuel Grumbach <egrumbach@gmail.com> wrote: > > > So I believe we already have this uevent, it is the devcoredump. All > > > we need is to add the unique id. > > > > I think there are a few reasons that devcoredump doesn't satisfy what > > either Luis or I want. > > > > 1) it can be disabled entirely [1], for good reasons (e.g., think of > > non-${CHIP_VENDOR} folks, who can't (and don't want to) do anything > > with the opaque dumps provided by closed-source firmware) > > Ok, if all you're interested into is the information that this event > happen (as opposed to report a bug and providing the data), then I > agree. I've now hit again a firmware crash with ath10k with the latest firwmare and kernel and the *only* thing that helped recovery was a full reboot, so that is a crystal clear case that this needs to taint the kernel, and yes we do want to inform users too, so I've just added uevent support for a few panic / taint events in the kernel now and rolled into my series. I'll run some final tests and then post this as a follow up. devlink didn't cut it, its networking specific. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-16 13:24 ` Johannes Berg 2020-05-16 13:50 ` Johannes Berg @ 2020-05-18 16:51 ` Luis Chamberlain 2020-05-18 16:58 ` Ben Greear 1 sibling, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-18 16:51 UTC (permalink / raw) To: Johannes Berg Cc: jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote: > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed > > You didn't CC me or the wireless list on the rest of the patches, so I'm > replying to a random one, but ... > > What is the point here? > > This should in no way affect the integrity of the system/kernel, for > most devices anyway. Keyword you used here is "most device". And in the worst case, *who* knows what other odd things may happen afterwards. > So what if ath10k's firmware crashes? If there's a driver bug it will > not handle it right (and probably crash, WARN_ON, or something else), > but if the driver is working right then that will not affect the kernel > at all. Sometimes the device can go into a state which requires driver removal and addition to get things back up. > So maybe I can understand that maybe you want an easy way to discover - > per device - that the firmware crashed, but that still doesn't warrant a > complete kernel taint. That is one reason, another is that a taint helps support cases *fast* easily detect if the issue was a firmware crash, instead of scraping logs for driver specific ways to say the firmware has crashed. > Instead of the kernel taint, IMHO you should provide an annotation in > sysfs (or somewhere else) for the *struct device* that had its firmware > crash. It would seem the way some folks are thinking about getting more details would be through devlink. > Or maybe, if it's too complex to walk the entire hierarchy > checking for that, have a uevent, or add the ability for the kernel to > print out elsewhere in debugfs the list of devices that crashed at some > point... All of that is fine, but a kernel taint? debugfs is optional, a taint is simple, and device agnostic. From a support perspective it is very easy to see if a possible issue may be device firmware specific. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 16:51 ` Luis Chamberlain @ 2020-05-18 16:58 ` Ben Greear 2020-05-18 17:09 ` Luis Chamberlain 0 siblings, 1 reply; 51+ messages in thread From: Ben Greear @ 2020-05-18 16:58 UTC (permalink / raw) To: Luis Chamberlain, Johannes Berg Cc: jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On 05/18/2020 09:51 AM, Luis Chamberlain wrote: > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote: >> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed >> >> You didn't CC me or the wireless list on the rest of the patches, so I'm >> replying to a random one, but ... >> >> What is the point here? >> >> This should in no way affect the integrity of the system/kernel, for >> most devices anyway. > > Keyword you used here is "most device". And in the worst case, *who* > knows what other odd things may happen afterwards. > >> So what if ath10k's firmware crashes? If there's a driver bug it will >> not handle it right (and probably crash, WARN_ON, or something else), >> but if the driver is working right then that will not affect the kernel >> at all. > > Sometimes the device can go into a state which requires driver removal > and addition to get things back up. It would be lovely to be able to detect this case in the driver/system somehow! I haven't seen any such cases recently, but in case there is some common case you see, maybe we can think of a way to detect it? > >> So maybe I can understand that maybe you want an easy way to discover - >> per device - that the firmware crashed, but that still doesn't warrant a >> complete kernel taint. > > That is one reason, another is that a taint helps support cases *fast* > easily detect if the issue was a firmware crash, instead of scraping > logs for driver specific ways to say the firmware has crashed. You can listen for udev events (I think that is the right term), and find crashes that way. You get the actual crash info as well. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 16:58 ` Ben Greear @ 2020-05-18 17:09 ` Luis Chamberlain 2020-05-18 17:15 ` Ben Greear 0 siblings, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-18 17:09 UTC (permalink / raw) To: Ben Greear Cc: Johannes Berg, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote: > > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote: > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote: > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed > > > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm > > > replying to a random one, but ... > > > > > > What is the point here? > > > > > > This should in no way affect the integrity of the system/kernel, for > > > most devices anyway. > > > > Keyword you used here is "most device". And in the worst case, *who* > > knows what other odd things may happen afterwards. > > > > > So what if ath10k's firmware crashes? If there's a driver bug it will > > > not handle it right (and probably crash, WARN_ON, or something else), > > > but if the driver is working right then that will not affect the kernel > > > at all. > > > > Sometimes the device can go into a state which requires driver removal > > and addition to get things back up. > > It would be lovely to be able to detect this case in the driver/system > somehow! I haven't seen any such cases recently, I assure you that I have run into it. Once it does again I'll report the crash, but the problem with some of this is that unless you scrape the log you won't know. Eventually, a uevent would indeed tell inform me. > but in case there is > some common case you see, maybe we can think of a way to detect it? ath10k is just one case, this patch series addresses a simple way to annotate this tree-wide. > > > So maybe I can understand that maybe you want an easy way to discover - > > > per device - that the firmware crashed, but that still doesn't warrant a > > > complete kernel taint. > > > > That is one reason, another is that a taint helps support cases *fast* > > easily detect if the issue was a firmware crash, instead of scraping > > logs for driver specific ways to say the firmware has crashed. > > You can listen for udev events (I think that is the right term), > and find crashes that way. You get the actual crash info as well. My follow up to this was to add uevent to add_taint() as well, this way these could generically be processed by userspace. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 17:09 ` Luis Chamberlain @ 2020-05-18 17:15 ` Ben Greear 2020-05-18 17:18 ` Luis Chamberlain 0 siblings, 1 reply; 51+ messages in thread From: Ben Greear @ 2020-05-18 17:15 UTC (permalink / raw) To: Luis Chamberlain Cc: Johannes Berg, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On 05/18/2020 10:09 AM, Luis Chamberlain wrote: > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote: >> >> >> On 05/18/2020 09:51 AM, Luis Chamberlain wrote: >>> On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote: >>>> On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed >>>> >>>> You didn't CC me or the wireless list on the rest of the patches, so I'm >>>> replying to a random one, but ... >>>> >>>> What is the point here? >>>> >>>> This should in no way affect the integrity of the system/kernel, for >>>> most devices anyway. >>> >>> Keyword you used here is "most device". And in the worst case, *who* >>> knows what other odd things may happen afterwards. >>> >>>> So what if ath10k's firmware crashes? If there's a driver bug it will >>>> not handle it right (and probably crash, WARN_ON, or something else), >>>> but if the driver is working right then that will not affect the kernel >>>> at all. >>> >>> Sometimes the device can go into a state which requires driver removal >>> and addition to get things back up. >> >> It would be lovely to be able to detect this case in the driver/system >> somehow! I haven't seen any such cases recently, > > I assure you that I have run into it. Once it does again I'll report > the crash, but the problem with some of this is that unless you scrape > the log you won't know. Eventually, a uevent would indeed tell inform > me. > >> but in case there is >> some common case you see, maybe we can think of a way to detect it? > > ath10k is just one case, this patch series addresses a simple way to > annotate this tree-wide. > >>>> So maybe I can understand that maybe you want an easy way to discover - >>>> per device - that the firmware crashed, but that still doesn't warrant a >>>> complete kernel taint. >>> >>> That is one reason, another is that a taint helps support cases *fast* >>> easily detect if the issue was a firmware crash, instead of scraping >>> logs for driver specific ways to say the firmware has crashed. >> >> You can listen for udev events (I think that is the right term), >> and find crashes that way. You get the actual crash info as well. > > My follow up to this was to add uevent to add_taint() as well, this way > these could generically be processed by userspace. I'm not opposed to the taint, though I have not thought much on it. But, if you can already get the crash info from uevent, and it automatically comes without polling or scraping logs, then what benefit beyond that does the taint give you? Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 17:15 ` Ben Greear @ 2020-05-18 17:18 ` Luis Chamberlain 2020-05-18 18:06 ` Steve deRosier 0 siblings, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-18 17:18 UTC (permalink / raw) To: Ben Greear Cc: Johannes Berg, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On Mon, May 18, 2020 at 10:15:45AM -0700, Ben Greear wrote: > > > On 05/18/2020 10:09 AM, Luis Chamberlain wrote: > > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote: > > > > > > > > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote: > > > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote: > > > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed > > > > > > > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm > > > > > replying to a random one, but ... > > > > > > > > > > What is the point here? > > > > > > > > > > This should in no way affect the integrity of the system/kernel, for > > > > > most devices anyway. > > > > > > > > Keyword you used here is "most device". And in the worst case, *who* > > > > knows what other odd things may happen afterwards. > > > > > > > > > So what if ath10k's firmware crashes? If there's a driver bug it will > > > > > not handle it right (and probably crash, WARN_ON, or something else), > > > > > but if the driver is working right then that will not affect the kernel > > > > > at all. > > > > > > > > Sometimes the device can go into a state which requires driver removal > > > > and addition to get things back up. > > > > > > It would be lovely to be able to detect this case in the driver/system > > > somehow! I haven't seen any such cases recently, > > > > I assure you that I have run into it. Once it does again I'll report > > the crash, but the problem with some of this is that unless you scrape > > the log you won't know. Eventually, a uevent would indeed tell inform > > me. > > > > > but in case there is > > > some common case you see, maybe we can think of a way to detect it? > > > > ath10k is just one case, this patch series addresses a simple way to > > annotate this tree-wide. > > > > > > > So maybe I can understand that maybe you want an easy way to discover - > > > > > per device - that the firmware crashed, but that still doesn't warrant a > > > > > complete kernel taint. > > > > > > > > That is one reason, another is that a taint helps support cases *fast* > > > > easily detect if the issue was a firmware crash, instead of scraping > > > > logs for driver specific ways to say the firmware has crashed. > > > > > > You can listen for udev events (I think that is the right term), > > > and find crashes that way. You get the actual crash info as well. > > > > My follow up to this was to add uevent to add_taint() as well, this way > > these could generically be processed by userspace. > > I'm not opposed to the taint, though I have not thought much on it. > > But, if you can already get the crash info from uevent, and it automatically > comes without polling or scraping logs, then what benefit beyond that does > the taint give you? From a support perspective it is a *crystal* clear sign that the device and / or device driver may be in a very bad state, in a generic way. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 17:18 ` Luis Chamberlain @ 2020-05-18 18:06 ` Steve deRosier 2020-05-18 19:09 ` Luis Chamberlain 0 siblings, 1 reply; 51+ messages in thread From: Steve deRosier @ 2020-05-18 18:06 UTC (permalink / raw) To: Luis Chamberlain Cc: Ben Greear, Johannes Berg, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, May 18, 2020 at 10:19 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > On Mon, May 18, 2020 at 10:15:45AM -0700, Ben Greear wrote: > > > > > > On 05/18/2020 10:09 AM, Luis Chamberlain wrote: > > > On Mon, May 18, 2020 at 09:58:53AM -0700, Ben Greear wrote: > > > > > > > > > > > > On 05/18/2020 09:51 AM, Luis Chamberlain wrote: > > > > > On Sat, May 16, 2020 at 03:24:01PM +0200, Johannes Berg wrote: > > > > > > On Fri, 2020-05-15 at 21:28 +0000, Luis Chamberlain wrote:> module_firmware_crashed > > > > > > > > > > > > You didn't CC me or the wireless list on the rest of the patches, so I'm > > > > > > replying to a random one, but ... > > > > > > > > > > > > What is the point here? > > > > > > > > > > > > This should in no way affect the integrity of the system/kernel, for > > > > > > most devices anyway. > > > > > > > > > > Keyword you used here is "most device". And in the worst case, *who* > > > > > knows what other odd things may happen afterwards. > > > > > > > > > > > So what if ath10k's firmware crashes? If there's a driver bug it will > > > > > > not handle it right (and probably crash, WARN_ON, or something else), > > > > > > but if the driver is working right then that will not affect the kernel > > > > > > at all. > > > > > > > > > > Sometimes the device can go into a state which requires driver removal > > > > > and addition to get things back up. > > > > > > > > It would be lovely to be able to detect this case in the driver/system > > > > somehow! I haven't seen any such cases recently, > > > > > > I assure you that I have run into it. Once it does again I'll report > > > the crash, but the problem with some of this is that unless you scrape > > > the log you won't know. Eventually, a uevent would indeed tell inform > > > me. > > > > > > > but in case there is > > > > some common case you see, maybe we can think of a way to detect it? > > > > > > ath10k is just one case, this patch series addresses a simple way to > > > annotate this tree-wide. > > > > > > > > > So maybe I can understand that maybe you want an easy way to discover - > > > > > > per device - that the firmware crashed, but that still doesn't warrant a > > > > > > complete kernel taint. > > > > > > > > > > That is one reason, another is that a taint helps support cases *fast* > > > > > easily detect if the issue was a firmware crash, instead of scraping > > > > > logs for driver specific ways to say the firmware has crashed. > > > > > > > > You can listen for udev events (I think that is the right term), > > > > and find crashes that way. You get the actual crash info as well. > > > > > > My follow up to this was to add uevent to add_taint() as well, this way > > > these could generically be processed by userspace. > > > > I'm not opposed to the taint, though I have not thought much on it. > > > > But, if you can already get the crash info from uevent, and it automatically > > comes without polling or scraping logs, then what benefit beyond that does > > the taint give you? > > From a support perspective it is a *crystal* clear sign that the device > and / or device driver may be in a very bad state, in a generic way. > Unfortunately a "taint" is interpreted by many users as: "your kernel is really F#*D up, you better do something about it right now." Assuming they're paying attention at all in the first place of course. The fact is, WiFi chip firmware crashes, and in most cases the driver is able to recover seamlessly. At least that is the case with most QCA chipsets I work with. And the users or our ability to do anything about it is minimal to none as we don't have access to firmware source. It's too bad and I wish it weren't the case, but we have embraced reality and most drivers have a recovery mechanism built in for this case. In short, it's a non-event. I fear that elevating this to a kernel taint will significantly increase "support" requests that really are nothing but noise; similar to how the firmware load failure messages (fail to load fw-2.bin, fail to load fw-1.bin, yay loaded fw-0.bin) cause a lot of noise. Not specifically opposed, but I wonder what it really accomplishes in a world where the firmware crashing is pretty much a normal occurrence. If it goes in, I think that the drivers shouldn't trigger the taint if they're able to recover normally. Only trigger on failure to come back up. In other words, the ideal place in the ath10k driver isn't where you have proposed as at that point operation is normal and we're doing a routine recovery. - Steve > Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 18:06 ` Steve deRosier @ 2020-05-18 19:09 ` Luis Chamberlain 2020-05-18 19:25 ` Johannes Berg 0 siblings, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-18 19:09 UTC (permalink / raw) To: Steve deRosier Cc: Ben Greear, Johannes Berg, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, May 18, 2020 at 11:06:27AM -0700, Steve deRosier wrote: > On Mon, May 18, 2020 at 10:19 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > From a support perspective it is a *crystal* clear sign that the device > > and / or device driver may be in a very bad state, in a generic way. > > > > Unfortunately a "taint" is interpreted by many users as: "your kernel > is really F#*D up, you better do something about it right now." > Assuming they're paying attention at all in the first place of course. Taint historically has been used and still is today to help rule out whether or not you get support, or how you get support. For instance, a staging driver is not supported by some upstream developers, but it will be by those who help staging and Greg. TAINT_CRAP cannot be even more clear. So, no, it is not just about "hey your kernel is messed up", there are clear support boundaries being drawn. > The fact is, WiFi chip firmware crashes, and in most cases the driver > is able to recover seamlessly. At least that is the case with most QCA > chipsets I work with. That has not been my exerience with the same driver, and so how do we know? And this patch set is not about ath10k alone, I want you to think about *all* device drivers with firmware. In my journey to scrape the kernel for these cases I was very surprised by the amount of code which clearly annotates these situations. > And the users or our ability to do anything > about it is minimal to none as we don't have access to firmware > source. This is not true, we have open firmware in WiFi. Some vendors choose to not open source their firmware, that is their decision. These days though, I think we all admit, that firmware crashes can use a better generic infrastructure for ensuring that clearly affecting-user experience issues. This patch is about that *when and if these happen*, we annotate it in the kernel for support pursposes. > It's too bad and I wish it weren't the case, but we have > embraced reality and most drivers have a recovery mechanism built in > for this case. The mentality about firmware crashes being the end of the world is certainly what will lead developers to often hide these. Where this is openly clear, and not obfucscated I'd argue that firmware issues get fixed likely more common. So what you describe is not bad, its just accepting evolution. > In short, it's a non-event. I fear that elevating this > to a kernel taint will significantly increase "support" requests that > really are nothing but noise; That will depend on where you put this on the driver, and that is why it is important to place it in the right place, if any. > similar to how the firmware load failure > messages (fail to load fw-2.bin, fail to load fw-1.bin, yay loaded > fw-0.bin) cause a lot of noise. That can be fixed, the developers behind this series gave up on it. It has to do with a range version of supported firmwares, and all being optional, but at least one is required. > Not specifically opposed, but I wonder what it really accomplishes in > a world where the firmware crashing is pretty much a normal > occurrence. Recovery without affecting user experience would be great, the taint is *not* for those cases. The taint definition has: + 18) ``Q`` used by device drivers to annotate that the device driver's firmware + has crashed and the device's operation has been severely affected. The + device may be left in a crippled state, requiring full driver removal / + addition, system reboot, or it is unclear how long recovery will take. Let me know if this is not clear. > If it goes in, I think that the drivers shouldn't trigger the taint if > they're able to recover normally. Only trigger on failure to come back > up. In other words, the ideal place in the ath10k driver isn't where > you have proposed as at that point operation is normal and we're doing > a routine recovery. Sure, happy to remove it if indeed it is the case that the firwmare crash is not happening to cripple the device, but I can vouch for the fact that the exact place where I placed it left my device driver in a state where I had to remove / add again. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 19:09 ` Luis Chamberlain @ 2020-05-18 19:25 ` Johannes Berg 2020-05-18 19:59 ` Luis Chamberlain 2020-05-18 20:28 ` Jakub Kicinski 0 siblings, 2 replies; 51+ messages in thread From: Johannes Berg @ 2020-05-18 19:25 UTC (permalink / raw) To: Luis Chamberlain, Steve deRosier Cc: Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, 2020-05-18 at 19:09 +0000, Luis Chamberlain wrote: > > Unfortunately a "taint" is interpreted by many users as: "your kernel > > is really F#*D up, you better do something about it right now." > > Assuming they're paying attention at all in the first place of course. > > Taint historically has been used and still is today to help rule out > whether or not you get support, or how you get support. > > For instance, a staging driver is not supported by some upstream > developers, but it will be by those who help staging and Greg. TAINT_CRAP > cannot be even more clear. > > So, no, it is not just about "hey your kernel is messed up", there are > clear support boundaries being drawn. Err, no. Those two are most definitely related. Have you looked at (most or some or whatever) staging drivers recently? Those contain all kinds of garbage that might do whatever with your kernel. Of course that's not a completely clear boundary, maybe you can find a driver in staging that's perfect code just not written to kernel style? But I find that hard to believe, in most cases. So no, it's really not about "[a] staging driver is not supported" vs. "your kernel is messed up". The very fact that you loaded one of those things might very well have messed up your kernel entirely. > These days though, I think we all admit, that firmware crashes can use > a better generic infrastructure for ensuring that clearly affecting-user > experience issues. This patch is about that *when and if these happen*, > we annotate it in the kernel for support pursposes. That's all fine, I just don't think it's appropriate to pretend that your kernel is now 'tainted' (think about the meaning of that word) when the firmware of some random device crashed. Heck, that could have been a USB device that was since unplugged. Unless the driver is complete garbage (hello staging again?) that really should have no lasting effect on the system itself. > Recovery without affecting user experience would be great, the taint is > *not* for those cases. The taint definition has: > > + 18) ``Q`` used by device drivers to annotate that the device driver's firmware > + has crashed and the device's operation has been severely affected. The > + device may be left in a crippled state, requiring full driver removal / > + addition, system reboot, or it is unclear how long recovery will take. > > Let me know if this is not clear. It's pretty clear, but even then, first of all I doubt this is the case for many of the places that you've sprinkled the annotation on, and secondly it actually hides useful information. Regardless of the support issue, I think this hiding of information is also problematic. I really think we'd all be better off if you just made a sysfs file (I mistyped debugfs in some other email, sorry, apparently you didn't see the correction in time) that listed which device(s) crashed and how many times. That would actually be useful. Because honestly, if a random device crashed for some random reason, that's pretty much a non-event. If it keeps happening, then we might even want to know about it. You can obviously save the contents of this file into your bug reports automatically and act accordingly, but I think you'll find that this is far more useful than saying "TAINT_FIRMWARE_CRASHED" so I'll ignore this report. Yeah, that might be reasonable thing if the bug report is about slow wifi *and* you see that ath10k firmware crashed every 10 seconds, but if it just crashed once a few days earlier it's of no importance to the system anymore ... And certainly a reasonable driver (which I believe ath10k to be) would _not_ randomly start corrupting memory because its firmware crashed. Which really is what tainting the kernel is about. So no, even with all that, I still really believe you're solving the wrong problem. Having information about firmware crashes, preferably with some kind of frequency information attached, and *clearly* with information about which device attached would be _great_. Munging it all into one bit is actively harmful, IMO. johannes ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 19:25 ` Johannes Berg @ 2020-05-18 19:59 ` Luis Chamberlain 2020-05-18 20:07 ` Johannes Berg 2020-05-18 20:28 ` Jakub Kicinski 1 sibling, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-18 19:59 UTC (permalink / raw) To: Johannes Berg Cc: Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, May 18, 2020 at 09:25:09PM +0200, Johannes Berg wrote: > On Mon, 2020-05-18 at 19:09 +0000, Luis Chamberlain wrote: > > > > Unfortunately a "taint" is interpreted by many users as: "your kernel > > > is really F#*D up, you better do something about it right now." > > > Assuming they're paying attention at all in the first place of course. > > > > Taint historically has been used and still is today to help rule out > > whether or not you get support, or how you get support. > > > > For instance, a staging driver is not supported by some upstream > > developers, but it will be by those who help staging and Greg. TAINT_CRAP > > cannot be even more clear. > > > > So, no, it is not just about "hey your kernel is messed up", there are > > clear support boundaries being drawn. > > Err, no. Those two are most definitely related. Have you looked at (most > or some or whatever) staging drivers recently? Those contain all kinds > of garbage that might do whatever with your kernel. No, I stay away :) > Of course that's not a completely clear boundary, maybe you can find a > driver in staging that's perfect code just not written to kernel style? > But I find that hard to believe, in most cases. > > So no, it's really not about "[a] staging driver is not supported" vs. > "your kernel is messed up". The very fact that you loaded one of those > things might very well have messed up your kernel entirely. > > > These days though, I think we all admit, that firmware crashes can use > > a better generic infrastructure for ensuring that clearly affecting-user > > experience issues. This patch is about that *when and if these happen*, > > we annotate it in the kernel for support pursposes. > > That's all fine, I just don't think it's appropriate to pretend that > your kernel is now 'tainted' (think about the meaning of that word) when > the firmware of some random device crashed. If the firmware crash *does* require driver remove / addition again, or a reboot, would you think that this is a situation that merits a taint? > > Recovery without affecting user experience would be great, the taint is > > *not* for those cases. The taint definition has: > > > > + 18) ``Q`` used by device drivers to annotate that the device driver's firmware > > + has crashed and the device's operation has been severely affected. The > > + device may be left in a crippled state, requiring full driver removal / > > + addition, system reboot, or it is unclear how long recovery will take. > > > > Let me know if this is not clear. > > It's pretty clear, but even then, first of all I doubt this is the case > for many of the places that you've sprinkled the annotation on, We can remove it, for this driver I can vouch for its location as it did reach a state where I required a reboot. And its not the first time this has happened. This got me thinking about the bigger picture of the lack of proper way to address these cases in the kernel, and how the user is left dumbfounded. > and secondly it actually hides useful information. What is it hiding? > Regardless of the support issue, I think this hiding of information is > also problematic. > > I really think we'd all be better off if you just made a sysfs file (I > mistyped debugfs in some other email, sorry, apparently you didn't see > the correction in time) that listed which device(s) crashed and how many > times. Ah yes, count. The taint does not address count. > That would actually be useful. Because honestly, if a random > device crashed for some random reason, that's pretty much a non-event. > If it keeps happening, then we might even want to know about it. True. > You can obviously save the contents of this file into your bug reports > automatically and act accordingly, but I think you'll find that this is > far more useful than saying "TAINT_FIRMWARE_CRASHED" so I'll ignore this > report. Absolutely. > Yeah, that might be reasonable thing if the bug report is about > slow wifi *and* you see that ath10k firmware crashed every 10 seconds, > but if it just crashed once a few days earlier it's of no importance to > the system anymore ... And certainly a reasonable driver (which I > believe ath10k to be) would _not_ randomly start corrupting memory > because its firmware crashed. Which really is what tainting the kernel > is about. I still see it as a support thing too. But discussing this further is pointless as I agree that taint does not cover count and that it is important. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 19:59 ` Luis Chamberlain @ 2020-05-18 20:07 ` Johannes Berg 2020-05-18 21:18 ` Luis Chamberlain 0 siblings, 1 reply; 51+ messages in thread From: Johannes Berg @ 2020-05-18 20:07 UTC (permalink / raw) To: Luis Chamberlain Cc: Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, 2020-05-18 at 19:59 +0000, Luis Chamberlain wrote: > > Err, no. Those two are most definitely related. Have you looked at (most > > or some or whatever) staging drivers recently? Those contain all kinds > > of garbage that might do whatever with your kernel. > > No, I stay away :) :) > > That's all fine, I just don't think it's appropriate to pretend that > > your kernel is now 'tainted' (think about the meaning of that word) when > > the firmware of some random device crashed. > > If the firmware crash *does* require driver remove / addition again, > or a reboot, would you think that this is a situation that merits a taint? Not really. In my experience, that's more likely a hardware issue (card not properly seated, for example) that a bus reset happens to "fix". > > It's pretty clear, but even then, first of all I doubt this is the case > > for many of the places that you've sprinkled the annotation on, > > We can remove it, for this driver I can vouch for its location as it did > reach a state where I required a reboot. And its not the first time this > has happened. This got me thinking about the bigger picture of the lack > of proper way to address these cases in the kernel, and how the user is > left dumbfounded. Fair, so the driver is still broken wrt. recovery here. I still don't think that's a situation where e.g. the system should say "hey you have a taint here, if your graphics go bad now you should not report that bug" (which is effectively what the single taint bit does). > > and secondly it actually hides useful information. > > What is it hiding? Most importantly, which device crashed. Secondarily I'd say how many times (*). The information "firmware crashed" is really only useful in relation to the device. If your graphics firmware crashed, yeah, well, you probably won't even see this. If your USB wifi firmware crashed? Not really interesting, you'll anyway just unplug. In fact it's very hard for a USB driver (short of arbitrary memory corruption) to significantly mess up the system. johannes (*) though if it crashed only once, was that because it was wedged enough to be unusable afterwards, or because everything was fine? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 20:07 ` Johannes Berg @ 2020-05-18 21:18 ` Luis Chamberlain 0 siblings, 0 replies; 51+ messages in thread From: Luis Chamberlain @ 2020-05-18 21:18 UTC (permalink / raw) To: Johannes Berg Cc: Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, May 18, 2020 at 10:07:49PM +0200, Johannes Berg wrote: > On Mon, 2020-05-18 at 19:59 +0000, Luis Chamberlain wrote: > > > > Err, no. Those two are most definitely related. Have you looked at (most > > > or some or whatever) staging drivers recently? Those contain all kinds > > > of garbage that might do whatever with your kernel. > > > > No, I stay away :) > > :) > > > > That's all fine, I just don't think it's appropriate to pretend that > > > your kernel is now 'tainted' (think about the meaning of that word) when > > > the firmware of some random device crashed. > > > > If the firmware crash *does* require driver remove / addition again, > > or a reboot, would you think that this is a situation that merits a taint? > > Not really. In my experience, that's more likely a hardware issue (card > not properly seated, for example) that a bus reset happens to "fix". > > > > It's pretty clear, but even then, first of all I doubt this is the case > > > for many of the places that you've sprinkled the annotation on, > > > > We can remove it, for this driver I can vouch for its location as it did > > reach a state where I required a reboot. And its not the first time this > > has happened. This got me thinking about the bigger picture of the lack > > of proper way to address these cases in the kernel, and how the user is > > left dumbfounded. > > Fair, so the driver is still broken wrt. recovery here. I still don't > think that's a situation where e.g. the system should say "hey you have > a taint here, if your graphics go bad now you should not report that > bug" (which is effectively what the single taint bit does). But again, let's think about the generic type of issue, and the unexpected type of state that can be reached. The circumstance here *does* lead to a case which is not recoverable. Now, consider how many cases in the kernel where similar situations can happen and leave the device or driver in a non-functional state. > > > and secondly it actually hides useful information. > > > > What is it hiding? > > Most importantly, which device crashed. Secondarily I'd say how many > times (*). The device is implied by the module, the taint is applied to both. If you had multiple devices, however, yes, it would not be possible to distinguish from the taint which exact device it happened on. So the only thing *generic* which would be left out is count. > The information "firmware crashed" is really only useful in relation to > the device. If you have to reboot to get a functional network again then the device is quite useless for many people, regardless of which device that happened on. But from a support perspective a sysfs interface which provides a tiny bit more generic information indeed provides more value than a taint. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 19:25 ` Johannes Berg 2020-05-18 19:59 ` Luis Chamberlain @ 2020-05-18 20:28 ` Jakub Kicinski 2020-05-18 20:29 ` Johannes Berg 1 sibling, 1 reply; 51+ messages in thread From: Jakub Kicinski @ 2020-05-18 20:28 UTC (permalink / raw) To: Johannes Berg Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote: > It's pretty clear, but even then, first of all I doubt this is the case > for many of the places that you've sprinkled the annotation on, and > secondly it actually hides useful information. > > Regardless of the support issue, I think this hiding of information is > also problematic. > > I really think we'd all be better off if you just made a sysfs file (I > mistyped debugfs in some other email, sorry, apparently you didn't see > the correction in time) that listed which device(s) crashed and how many > times. That would actually be useful. Because honestly, if a random > device crashed for some random reason, that's pretty much a non-event. > If it keeps happening, then we might even want to know about it. Johannes - have you seen devlink health? I think we should just use that interface, since it supports all the things you're requesting, rather than duplicate it in sysfs. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 20:28 ` Jakub Kicinski @ 2020-05-18 20:29 ` Johannes Berg 2020-05-18 20:35 ` Jakub Kicinski 0 siblings, 1 reply; 51+ messages in thread From: Johannes Berg @ 2020-05-18 20:29 UTC (permalink / raw) To: Jakub Kicinski Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, 2020-05-18 at 13:28 -0700, Jakub Kicinski wrote: > On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote: > > It's pretty clear, but even then, first of all I doubt this is the case > > for many of the places that you've sprinkled the annotation on, and > > secondly it actually hides useful information. > > > > Regardless of the support issue, I think this hiding of information is > > also problematic. > > > > I really think we'd all be better off if you just made a sysfs file (I > > mistyped debugfs in some other email, sorry, apparently you didn't see > > the correction in time) that listed which device(s) crashed and how many > > times. That would actually be useful. Because honestly, if a random > > device crashed for some random reason, that's pretty much a non-event. > > If it keeps happening, then we might even want to know about it. > > Johannes - have you seen devlink health? I think we should just use > that interface, since it supports all the things you're requesting, > rather than duplicate it in sysfs. I haven't, and I'm glad to hear that's there, sounds good! I suspect that Luis wants something more generic though, that isn't just applicable to netdevices, unless devlink grew some kind of non-netdev stuff while I wasn't looking? :) johannes ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 20:29 ` Johannes Berg @ 2020-05-18 20:35 ` Jakub Kicinski 2020-05-18 20:41 ` Johannes Berg 0 siblings, 1 reply; 51+ messages in thread From: Jakub Kicinski @ 2020-05-18 20:35 UTC (permalink / raw) To: Johannes Berg Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, 18 May 2020 22:29:53 +0200 Johannes Berg wrote: > On Mon, 2020-05-18 at 13:28 -0700, Jakub Kicinski wrote: > > On Mon, 18 May 2020 21:25:09 +0200 Johannes Berg wrote: > > > It's pretty clear, but even then, first of all I doubt this is the case > > > for many of the places that you've sprinkled the annotation on, and > > > secondly it actually hides useful information. > > > > > > Regardless of the support issue, I think this hiding of information is > > > also problematic. > > > > > > I really think we'd all be better off if you just made a sysfs file (I > > > mistyped debugfs in some other email, sorry, apparently you didn't see > > > the correction in time) that listed which device(s) crashed and how many > > > times. That would actually be useful. Because honestly, if a random > > > device crashed for some random reason, that's pretty much a non-event. > > > If it keeps happening, then we might even want to know about it. > > > > Johannes - have you seen devlink health? I think we should just use > > that interface, since it supports all the things you're requesting, > > rather than duplicate it in sysfs. > > I haven't, and I'm glad to hear that's there, sounds good! > > I suspect that Luis wants something more generic though, that isn't just > applicable to netdevices, unless devlink grew some kind of non-netdev > stuff while I wasn't looking? :) It's intended to be a generic netlink channel for configuring devices. All the firmware-related interfaces have no dependencies on netdevs, in fact that's one of the reasons we moved to devlink - we don't want to hold rtnl lock just for talking to device firmware. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 20:35 ` Jakub Kicinski @ 2020-05-18 20:41 ` Johannes Berg 2020-05-18 20:46 ` Jakub Kicinski 0 siblings, 1 reply; 51+ messages in thread From: Johannes Berg @ 2020-05-18 20:41 UTC (permalink / raw) To: Jakub Kicinski Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote: > > It's intended to be a generic netlink channel for configuring devices. > > All the firmware-related interfaces have no dependencies on netdevs, > in fact that's one of the reasons we moved to devlink - we don't want > to hold rtnl lock just for talking to device firmware. Sounds good :) So I guess Luis just has to add some way in devlink to hook up devlink health in a simple way to drivers, perhaps? I mean, many drivers won't really want to use devlink for anything else, so I guess it should be as simple as the API that Luis proposed ("firmware crashed for this struct device"), if nothing more interesting is done with devlink? Dunno. But anyway sounds like it should somehow integrate there rather than the way this patchset proposed? johannes ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 20:41 ` Johannes Berg @ 2020-05-18 20:46 ` Jakub Kicinski 2020-05-18 21:22 ` Luis Chamberlain 0 siblings, 1 reply; 51+ messages in thread From: Jakub Kicinski @ 2020-05-18 20:46 UTC (permalink / raw) To: Johannes Berg Cc: Luis Chamberlain, Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, 18 May 2020 22:41:48 +0200 Johannes Berg wrote: > On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote: > > It's intended to be a generic netlink channel for configuring devices. > > > > All the firmware-related interfaces have no dependencies on netdevs, > > in fact that's one of the reasons we moved to devlink - we don't want > > to hold rtnl lock just for talking to device firmware. > > Sounds good :) > > So I guess Luis just has to add some way in devlink to hook up devlink > health in a simple way to drivers, perhaps? I mean, many drivers won't > really want to use devlink for anything else, so I guess it should be as > simple as the API that Luis proposed ("firmware crashed for this struct > device"), if nothing more interesting is done with devlink? > > Dunno. But anyway sounds like it should somehow integrate there rather > than the way this patchset proposed? Right, that'd be great. Simple API to register a devlink instance with whatever number of reporters the device would need. I'm happy to help. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 20:46 ` Jakub Kicinski @ 2020-05-18 21:22 ` Luis Chamberlain 2020-05-18 22:16 ` Jakub Kicinski 0 siblings, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-18 21:22 UTC (permalink / raw) To: Jakub Kicinski Cc: Johannes Berg, Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, May 18, 2020 at 01:46:43PM -0700, Jakub Kicinski wrote: > On Mon, 18 May 2020 22:41:48 +0200 Johannes Berg wrote: > > On Mon, 2020-05-18 at 13:35 -0700, Jakub Kicinski wrote: > > > It's intended to be a generic netlink channel for configuring devices. > > > > > > All the firmware-related interfaces have no dependencies on netdevs, > > > in fact that's one of the reasons we moved to devlink - we don't want > > > to hold rtnl lock just for talking to device firmware. > > > > Sounds good :) > > > > So I guess Luis just has to add some way in devlink to hook up devlink > > health in a simple way to drivers, perhaps? I mean, many drivers won't > > really want to use devlink for anything else, so I guess it should be as > > simple as the API that Luis proposed ("firmware crashed for this struct > > device"), if nothing more interesting is done with devlink? > > > > Dunno. But anyway sounds like it should somehow integrate there rather > > than the way this patchset proposed? > > Right, that'd be great. Simple API to register a devlink instance with > whatever number of reporters the device would need. I'm happy to help. Indeed my issue with devlink is that it did not seem generic enough for all devices which use firmware and for which firmware can crash. Support should not have to be "add devlink support" + "now use this new hook", but rather a very lighweight devlink_crash(device) call we can sprinkly with only the device as a functional requirement. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 21:22 ` Luis Chamberlain @ 2020-05-18 22:16 ` Jakub Kicinski 2020-05-19 1:05 ` Luis Chamberlain 0 siblings, 1 reply; 51+ messages in thread From: Jakub Kicinski @ 2020-05-18 22:16 UTC (permalink / raw) To: Luis Chamberlain Cc: Johannes Berg, Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote: > Indeed my issue with devlink is that it did not seem generic enough for > all devices which use firmware and for which firmware can crash. Support > should not have to be "add devlink support" + "now use this new hook", > but rather a very lighweight devlink_crash(device) call we can sprinkly > with only the device as a functional requirement. We can provide a lightweight devlink_crash(device) which only generates the notification, without the need to register the health reporter or a devlink instance upfront. But then we loose the ability to control the recovery, count errors, etc. So I'd think that's not the direction we want to go in. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed() 2020-05-18 22:16 ` Jakub Kicinski @ 2020-05-19 1:05 ` Luis Chamberlain 2020-05-19 21:15 ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski 2020-05-19 21:15 ` [RFC 2/2] i2400m: use devlink health reporter Jakub Kicinski 0 siblings, 2 replies; 51+ messages in thread From: Luis Chamberlain @ 2020-05-19 1:05 UTC (permalink / raw) To: Jakub Kicinski Cc: Johannes Berg, Steve deRosier, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k On Mon, May 18, 2020 at 03:16:45PM -0700, Jakub Kicinski wrote: > On Mon, 18 May 2020 21:22:02 +0000 Luis Chamberlain wrote: > > Indeed my issue with devlink is that it did not seem generic enough for > > all devices which use firmware and for which firmware can crash. Support > > should not have to be "add devlink support" + "now use this new hook", > > but rather a very lighweight devlink_crash(device) call we can sprinkly > > with only the device as a functional requirement. > > We can provide a lightweight devlink_crash(device) which only generates > the notification, without the need to register the health reporter or a > devlink instance upfront. But then we loose the ability to control the > recovery, count errors, etc. So I'd think that's not the direction we > want to go in. Care to show me what the diff stat for a non devlink driver would look like? Just keep in mind this taint is 1 line addition. Granted, if we can use SmPL grammar to automate addition of an initial framework to a driver that'd be great, but since the device addition is subsystem specific (device_add() and friends), I don't suspect this will be easily automated. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* [RFC 1/2] devlink: add simple fw crash helpers 2020-05-19 1:05 ` Luis Chamberlain @ 2020-05-19 21:15 ` Jakub Kicinski 2020-05-22 5:20 ` Luis Chamberlain 2020-05-19 21:15 ` [RFC 2/2] i2400m: use devlink health reporter Jakub Kicinski 1 sibling, 1 reply; 51+ messages in thread From: Jakub Kicinski @ 2020-05-19 21:15 UTC (permalink / raw) To: mcgrof Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k, jiri, briannorris, Jakub Kicinski Add infra for creating devlink instances for a device to report fw crashes. This patch expects the devlink instance to be registered at probe time. I belive to be the cleanest. We can also add a devm version of the helpers, so that we don't have to do the clean up. Or we can go even further and register the devlink instance only once error has happened (for the first time, then we can just find out if already registered by traversing the list like we do here). With the patch applied and a sample driver converted we get: $ devlink dev pci/0000:07:00.0 Then monitor for errors: $ devlink mon health [health,status] pci/0000:07:00.0: reporter fw state error error 1 recover 0 [health,status] pci/0000:07:00.0: reporter fw state error error 2 recover 0 These are the events I triggered on purpose. One can also inspect the health of all devices capable of reporting fw errors: $ devlink health pci/0000:07:00.0: reporter fw state error error 7 recover 0 Obviously drivers may upgrade to the full devlink health API which includes state dump, state dump auto-collect and automatic error recovery control. Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- include/linux/devlink.h | 11 +++ net/core/Makefile | 2 +- net/core/devlink_simple_fw_reporter.c | 101 ++++++++++++++++++++++++++ 3 files changed, 113 insertions(+), 1 deletion(-) create mode 100644 include/linux/devlink.h create mode 100644 net/core/devlink_simple_fw_reporter.c diff --git a/include/linux/devlink.h b/include/linux/devlink.h new file mode 100644 index 000000000000..2b73987eefca --- /dev/null +++ b/include/linux/devlink.h @@ -0,0 +1,11 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _LINUX_DEVLINK_H_ +#define _LINUX_DEVLINK_H_ + +struct device; + +void devlink_simple_fw_reporter_prepare(struct device *dev); +void devlink_simple_fw_reporter_cleanup(struct device *dev); +void devlink_simple_fw_reporter_report_crash(struct device *dev); + +#endif diff --git a/net/core/Makefile b/net/core/Makefile index 3e2c378e5f31..6f1513781c17 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o obj-$(CONFIG_DST_CACHE) += dst_cache.o obj-$(CONFIG_HWBM) += hwbm.o -obj-$(CONFIG_NET_DEVLINK) += devlink.o +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o obj-$(CONFIG_GRO_CELLS) += gro_cells.o obj-$(CONFIG_FAILOVER) += failover.o obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o diff --git a/net/core/devlink_simple_fw_reporter.c b/net/core/devlink_simple_fw_reporter.c new file mode 100644 index 000000000000..48dde9123c3c --- /dev/null +++ b/net/core/devlink_simple_fw_reporter.c @@ -0,0 +1,101 @@ +#include <linux/devlink.h> +#include <linux/list.h> +#include <linux/mutex.h> +#include <net/devlink.h> + +struct devlink_simple_fw_reporter { + struct list_head list; + struct devlink_health_reporter *reporter; +}; + + +static LIST_HEAD(devlink_simple_fw_reporters); +static DEFINE_MUTEX(devlink_simple_fw_reporters_mutex); + +static const struct devlink_health_reporter_ops simple_devlink_health = { + .name = "fw", +}; + +static const struct devlink_ops simple_devlink_ops = { +}; + +static struct devlink_simple_fw_reporter * +devlink_simple_fw_reporter_find_for_dev(struct device *dev) +{ + struct devlink_simple_fw_reporter *simple_devlink, *ret = NULL; + struct devlink *devlink; + + mutex_lock(&devlink_simple_fw_reporters_mutex); + list_for_each_entry(simple_devlink, &devlink_simple_fw_reporters, + list) { + devlink = priv_to_devlink(simple_devlink); + if (devlink->dev == dev) { + ret = simple_devlink; + break; + } + } + mutex_unlock(&devlink_simple_fw_reporters_mutex); + + return ret; +} + +void devlink_simple_fw_reporter_report_crash(struct device *dev) +{ + struct devlink_simple_fw_reporter *simple_devlink; + + simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev); + if (!simple_devlink) + return; + + devlink_health_report(simple_devlink->reporter, "firmware crash", NULL); +} +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_report_crash); + +void devlink_simple_fw_reporter_prepare(struct device *dev) +{ + struct devlink_simple_fw_reporter *simple_devlink; + struct devlink *devlink; + + devlink = devlink_alloc(&simple_devlink_ops, + sizeof(struct devlink_simple_fw_reporter)); + if (!devlink) + return; + + if (devlink_register(devlink, dev)) + goto err_free; + + simple_devlink = devlink_priv(devlink); + simple_devlink->reporter = + devlink_health_reporter_create(devlink, &simple_devlink_health, + 0, NULL); + if (IS_ERR(simple_devlink->reporter)) + goto err_unregister; + + mutex_lock(&devlink_simple_fw_reporters_mutex); + list_add_tail(&simple_devlink->list, &devlink_simple_fw_reporters); + mutex_unlock(&devlink_simple_fw_reporters_mutex); + + return; + +err_unregister: + devlink_unregister(devlink); +err_free: + devlink_free(devlink); +} +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_prepare); + +void devlink_simple_fw_reporter_cleanup(struct device *dev) +{ + struct devlink_simple_fw_reporter *simple_devlink; + struct devlink *devlink; + + simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev); + if (!simple_devlink) + return; + + devlink = priv_to_devlink(simple_devlink); + devlink_health_reporter_destroy(simple_devlink->reporter); + devlink_unregister(devlink); + devlink_free(devlink); +} +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_cleanup); -- 2.25.4 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-19 21:15 ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski @ 2020-05-22 5:20 ` Luis Chamberlain 2020-05-22 17:17 ` Jakub Kicinski 0 siblings, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-22 5:20 UTC (permalink / raw) To: Jakub Kicinski Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k, jiri, briannorris On Tue, May 19, 2020 at 02:15:30PM -0700, Jakub Kicinski wrote: > Add infra for creating devlink instances for a device to report Thanks for doing this series as a PoC, counter to the module_firmware_crash() which I proposed to taint the kernel with a firmware crash flag to the kernel and module. For those not famliar about devlink: https://lwn.net/Articles/677967/ https://www.kernel.org/doc/html/latest/networking/devlink/index.html The github page also is now 404 as Jiri merged that stuff into iproute2: git://git.kernel.org/pub/scm/network/iproute2/iproute2.git > fw crashes. This patch expects the devlink instance to be registered > at probe time. I belive to be the cleanest. We can also add a devm > version of the helpers, so that we don't have to do the clean up. > Or we can go even further and register the devlink instance only > once error has happened (for the first time, then we can just > find out if already registered by traversing the list like we > do here). > > With the patch applied and a sample driver converted we get: > > $ devlink dev > pci/0000:07:00.0 > > Then monitor for errors: > > $ devlink mon health > [health,status] pci/0000:07:00.0: > reporter fw > state error error 1 recover 0 > [health,status] pci/0000:07:00.0: > reporter fw > state error error 2 recover 0 > > These are the events I triggered on purpose. One can also inspect > the health of all devices capable of reporting fw errors: > > $ devlink health > pci/0000:07:00.0: > reporter fw > state error error 7 recover 0 > > Obviously drivers may upgrade to the full devlink health API > which includes state dump, state dump auto-collect and automatic > error recovery control. > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > --- > include/linux/devlink.h | 11 +++ > net/core/Makefile | 2 +- > net/core/devlink_simple_fw_reporter.c | 101 ++++++++++++++++++++++++++ > 3 files changed, 113 insertions(+), 1 deletion(-) > create mode 100644 include/linux/devlink.h > create mode 100644 net/core/devlink_simple_fw_reporter.c > > diff --git a/include/linux/devlink.h b/include/linux/devlink.h > new file mode 100644 > index 000000000000..2b73987eefca > --- /dev/null > +++ b/include/linux/devlink.h > @@ -0,0 +1,11 @@ > +/* SPDX-License-Identifier: GPL-2.0-or-later */ > +#ifndef _LINUX_DEVLINK_H_ > +#define _LINUX_DEVLINK_H_ > + > +struct device; > + > +void devlink_simple_fw_reporter_prepare(struct device *dev); > +void devlink_simple_fw_reporter_cleanup(struct device *dev); > +void devlink_simple_fw_reporter_report_crash(struct device *dev); > + > +#endif > diff --git a/net/core/Makefile b/net/core/Makefile > index 3e2c378e5f31..6f1513781c17 100644 > --- a/net/core/Makefile > +++ b/net/core/Makefile > @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o > obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o > obj-$(CONFIG_DST_CACHE) += dst_cache.o > obj-$(CONFIG_HWBM) += hwbm.o > -obj-$(CONFIG_NET_DEVLINK) += devlink.o > +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o This was looking super sexy up to here. This is networking specific. We want something generic for *anything* that requests firmware. I'm afraid this won't work for something generic. I don't think its throw-away work though, the idea to provide a generic interface to dump firmware through netlink might be nice for networking, or other things. But I have a feeling we'll want something still more generic than this. So networking may want to be aware that a firmware crash happened as part of this network device health thing, but firmware crashing is a generic thing. I have now extended my patch set to include uvents and I am more set on that we need the taint now more than ever. Luis > obj-$(CONFIG_GRO_CELLS) += gro_cells.o > obj-$(CONFIG_FAILOVER) += failover.o > obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o > diff --git a/net/core/devlink_simple_fw_reporter.c b/net/core/devlink_simple_fw_reporter.c > new file mode 100644 > index 000000000000..48dde9123c3c > --- /dev/null > +++ b/net/core/devlink_simple_fw_reporter.c > @@ -0,0 +1,101 @@ > +#include <linux/devlink.h> > +#include <linux/list.h> > +#include <linux/mutex.h> > +#include <net/devlink.h> > + > +struct devlink_simple_fw_reporter { > + struct list_head list; > + struct devlink_health_reporter *reporter; > +}; > + > + > +static LIST_HEAD(devlink_simple_fw_reporters); > +static DEFINE_MUTEX(devlink_simple_fw_reporters_mutex); > + > +static const struct devlink_health_reporter_ops simple_devlink_health = { > + .name = "fw", > +}; > + > +static const struct devlink_ops simple_devlink_ops = { > +}; > + > +static struct devlink_simple_fw_reporter * > +devlink_simple_fw_reporter_find_for_dev(struct device *dev) > +{ > + struct devlink_simple_fw_reporter *simple_devlink, *ret = NULL; > + struct devlink *devlink; > + > + mutex_lock(&devlink_simple_fw_reporters_mutex); > + list_for_each_entry(simple_devlink, &devlink_simple_fw_reporters, > + list) { > + devlink = priv_to_devlink(simple_devlink); > + if (devlink->dev == dev) { > + ret = simple_devlink; > + break; > + } > + } > + mutex_unlock(&devlink_simple_fw_reporters_mutex); > + > + return ret; > +} > + > +void devlink_simple_fw_reporter_report_crash(struct device *dev) > +{ > + struct devlink_simple_fw_reporter *simple_devlink; > + > + simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev); > + if (!simple_devlink) > + return; > + > + devlink_health_report(simple_devlink->reporter, "firmware crash", NULL); > +} > +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_report_crash); > + > +void devlink_simple_fw_reporter_prepare(struct device *dev) > +{ > + struct devlink_simple_fw_reporter *simple_devlink; > + struct devlink *devlink; > + > + devlink = devlink_alloc(&simple_devlink_ops, > + sizeof(struct devlink_simple_fw_reporter)); > + if (!devlink) > + return; > + > + if (devlink_register(devlink, dev)) > + goto err_free; > + > + simple_devlink = devlink_priv(devlink); > + simple_devlink->reporter = > + devlink_health_reporter_create(devlink, &simple_devlink_health, > + 0, NULL); > + if (IS_ERR(simple_devlink->reporter)) > + goto err_unregister; > + > + mutex_lock(&devlink_simple_fw_reporters_mutex); > + list_add_tail(&simple_devlink->list, &devlink_simple_fw_reporters); > + mutex_unlock(&devlink_simple_fw_reporters_mutex); > + > + return; > + > +err_unregister: > + devlink_unregister(devlink); > +err_free: > + devlink_free(devlink); > +} > +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_prepare); > + > +void devlink_simple_fw_reporter_cleanup(struct device *dev) > +{ > + struct devlink_simple_fw_reporter *simple_devlink; > + struct devlink *devlink; > + > + simple_devlink = devlink_simple_fw_reporter_find_for_dev(dev); > + if (!simple_devlink) > + return; > + > + devlink = priv_to_devlink(simple_devlink); > + devlink_health_reporter_destroy(simple_devlink->reporter); > + devlink_unregister(devlink); > + devlink_free(devlink); > +} > +EXPORT_SYMBOL_GPL(devlink_simple_fw_reporter_cleanup); > -- > 2.25.4 > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-22 5:20 ` Luis Chamberlain @ 2020-05-22 17:17 ` Jakub Kicinski 2020-05-22 20:46 ` Johannes Berg 2020-05-22 21:49 ` Luis Chamberlain 0 siblings, 2 replies; 51+ messages in thread From: Jakub Kicinski @ 2020-05-22 17:17 UTC (permalink / raw) To: Luis Chamberlain Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k, jiri, briannorris On Fri, 22 May 2020 05:20:46 +0000 Luis Chamberlain wrote: > > diff --git a/net/core/Makefile b/net/core/Makefile > > index 3e2c378e5f31..6f1513781c17 100644 > > --- a/net/core/Makefile > > +++ b/net/core/Makefile > > @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o > > obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o > > obj-$(CONFIG_DST_CACHE) += dst_cache.o > > obj-$(CONFIG_HWBM) += hwbm.o > > -obj-$(CONFIG_NET_DEVLINK) += devlink.o > > +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o > > This was looking super sexy up to here. This is networking specific. > We want something generic for *anything* that requests firmware. You can't be serious. It's network specific because of how the Kconfig is named? Working for a company operating large data centers I would strongly prefer if we didn't have ten different ways of reporting firmware problems in the fleet. > I'm afraid this won't work for something generic. I don't think its > throw-away work though, the idea to provide a generic interface to > dump firmware through netlink might be nice for networking, or other > things. > > But I have a feeling we'll want something still more generic than this. Please be specific. Saying generic a lot is not helpful. The code (as you can see in this patch) is in no way network specific. Or are you saying there are machines out there running without netlink sockets? > So networking may want to be aware that a firmware crash happened as > part of this network device health thing, but firmware crashing is a > generic thing. > > I have now extended my patch set to include uvents and I am more set on > that we need the taint now more than ever. Please expect my nack if you're trying to add this to networking drivers. The irony is you have a problem with a networking device and all the devices your initial set touched are networking. Two of the drivers you touched either have or will soon have devlink health reporters implemented. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-22 17:17 ` Jakub Kicinski @ 2020-05-22 20:46 ` Johannes Berg 2020-05-22 21:51 ` Luis Chamberlain 2020-05-25 20:57 ` Jakub Kicinski 2020-05-22 21:49 ` Luis Chamberlain 1 sibling, 2 replies; 51+ messages in thread From: Johannes Berg @ 2020-05-22 20:46 UTC (permalink / raw) To: Jakub Kicinski, Luis Chamberlain Cc: derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k, jiri, briannorris On Fri, 2020-05-22 at 10:17 -0700, Jakub Kicinski wrote: > > > --- a/net/core/Makefile > > > +++ b/net/core/Makefile > > > @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o > > > obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o > > > obj-$(CONFIG_DST_CACHE) += dst_cache.o > > > obj-$(CONFIG_HWBM) += hwbm.o > > > -obj-$(CONFIG_NET_DEVLINK) += devlink.o > > > +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o > > > > This was looking super sexy up to here. This is networking specific. > > We want something generic for *anything* that requests firmware. > > You can't be serious. It's network specific because of how the Kconfig > is named? Wait, yeah, what? > Working for a company operating large data centers I would strongly > prefer if we didn't have ten different ways of reporting firmware > problems in the fleet. Agree. I don't actually operate anything, but still ... Thinking about this - maybe there's a way to still combine devcoredump and devlink somehow? Or (optionally) make devlink trigger devcoredump while userspace migrates? > > So networking may want to be aware that a firmware crash happened as > > part of this network device health thing, but firmware crashing is a > > generic thing. > > > > I have now extended my patch set to include uvents and I am more set on > > that we need the taint now more than ever. FWIW, I still completely disagree on that taint. You (Luis) obviously have been running into a bug in that driver, I doubt the firmware actually managed to wedge the hardware. But even if it did, that's still not really a kernel taint. The kernel itself isn't in any way affected by this. Yes, the system is in a weird state now. But that's *not* equivalent to "kernel tainted". > The irony is you have a problem with a networking device and all the > devices your initial set touched are networking. Two of the drivers > you touched either have or will soon have devlink health reporters > implemented. Like I said above, do you think it'd be feasible to make a devcoredump out of devlink health reports? And can the report be in a way that we control the file format, or are there limits? I guess I should read the code to find out, but I figure you probably just know. But feel free to tell me to read it :) The reason I'm asking is that it's starting to sound like we really ought to be implementing devlink, but we've got a bunch of infrastructure that uses the devcoredump, and it'll take time (significantly so) to change all that... johannes ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-22 20:46 ` Johannes Berg @ 2020-05-22 21:51 ` Luis Chamberlain 2020-05-22 23:23 ` Steve deRosier 2020-05-25 20:57 ` Jakub Kicinski 1 sibling, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-22 21:51 UTC (permalink / raw) To: Johannes Berg Cc: Jakub Kicinski, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k, jiri, briannorris On Fri, May 22, 2020 at 10:46:07PM +0200, Johannes Berg wrote: > FWIW, I still completely disagree on that taint. You (Luis) obviously > have been running into a bug in that driver, I doubt the firmware > actually managed to wedge the hardware. This hasn't happened just once, its happed many times sporadically now, once a week or two weeks I'd say. And the system isn't being moved around. > But even if it did, that's still not really a kernel taint. The kernel > itself isn't in any way affected by this. Of course it is, a full reboot is required. > Yes, the system is in a weird state now. But that's *not* equivalent to > "kernel tainted". Requiring a full reboot is a dire situation to be in, and loosing connectivity to the point this is not recoverable likewise. You guys are making out a taint to be the end of the world. We have a taint even for a kernel warning, and as others have mentioned mac80211 already produces these. What exactly is the opposition to a taint to clarify that a device firmware has crashed and your system requires a full reboot? Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-22 21:51 ` Luis Chamberlain @ 2020-05-22 23:23 ` Steve deRosier 2020-05-22 23:44 ` Luis Chamberlain 2020-05-25 9:07 ` Andy Shevchenko 0 siblings, 2 replies; 51+ messages in thread From: Steve deRosier @ 2020-05-22 23:23 UTC (permalink / raw) To: Luis Chamberlain Cc: Johannes Berg, Jakub Kicinski, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, Kees Cook, Daniel Vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k, jiri, briannorris On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > On Fri, May 22, 2020 at 10:46:07PM +0200, Johannes Berg wrote: > > FWIW, I still completely disagree on that taint. You (Luis) obviously > > have been running into a bug in that driver, I doubt the firmware > > actually managed to wedge the hardware. > > This hasn't happened just once, its happed many times sporadically now, > once a week or two weeks I'd say. And the system isn't being moved > around. > > > But even if it did, that's still not really a kernel taint. The kernel > > itself isn't in any way affected by this. > > Of course it is, a full reboot is required. > > > Yes, the system is in a weird state now. But that's *not* equivalent to > > "kernel tainted". > > Requiring a full reboot is a dire situation to be in, and loosing > connectivity to the point this is not recoverable likewise. > > You guys are making out a taint to be the end of the world. We have a > taint even for a kernel warning, and as others have mentioned mac80211 > already produces these. > I had to go RTFM re: kernel taints because it has been a very long time since I looked at them. It had always seemed to me that most were caused by "kernel-unfriendly" user actions. The most famous of course is loading proprietary modules, out-of-tree modules, forced module loads, etc... Honestly, I had forgotten the large variety of uses of the taint flags. For anyone who hasn't looked at taints recently, I recommend: https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html In light of this I don't object to setting a taint on this anymore. I'm a little uneasy, but I've softened on it now, and now I feel it depends on implementation. Specifically, I don't think we should set a taint flag when a driver easily handles a routine firmware crash and is confident that things have come up just fine again. In other words, triggering the taint in every driver module where it spits out a log comment that it had a firmware crash and had to recover seems too much. Sure, firmware shouldn't crash, sure it should be open source so we can fix it, whatever... those sort of wishful comments simply ignore reality and our ability to affect effective change. A lot of WiFi firmware crashes and for well-known cases the drivers handle them well. And in some cases, not so well and that should be a place the driver should detect and thus raise a red flag. If a WiFi firmware crash can bring down the kernel, there's either a major driver bug or some very funky hardware crap going on. That sort of thing we should be able to detect, mark with a taint (or something), and fix if within our sphere of influence. I guess what it comes down to me is how aggressive we are about setting the flag. I would like there to be a single solution, or a minimized set depending on what makes sense for the requirements. I haven't had time to look into the alternatives mentioned here so I don't have an informed opinion about the solution. I do think Luis is trying to solve a real problem though. Can we look at this from the point of view of what are the requirements? What is it we're trying to solve? I _think_ that the goal of Luis's original proposal is to report up to the user, at some future point when the user is interested (because something super drastic just occured, but long after the fw crash), that there was a firmware crash without the user having to grep through all logs on the machine. And then if the user sees that flag and suspects it, then they can bother to find it in the logs or do more drastic debugging steps like finding the fw crash in the log and pulling firmware crash dumps, etc. I think the various alternate solutions are great but perhaps solving a superset of features (like adding in user-space notifications etc)? Perhaps different people on these related threads are trying to solve different problems? - Steve ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-22 23:23 ` Steve deRosier @ 2020-05-22 23:44 ` Luis Chamberlain 2020-05-25 9:07 ` Andy Shevchenko 1 sibling, 0 replies; 51+ messages in thread From: Luis Chamberlain @ 2020-05-22 23:44 UTC (permalink / raw) To: Steve deRosier Cc: Johannes Berg, Jakub Kicinski, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, andriy.shevchenko, Kees Cook, Daniel Vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k, jiri, briannorris On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote: > Specifically, I don't think we should set a taint flag when a driver > easily handles a routine firmware crash and is confident that things > have come up just fine again. In other words, triggering the taint in > every driver module where it spits out a log comment that it had a > firmware crash and had to recover seems too much. Sure, firmware > shouldn't crash, sure it should be open source so we can fix it, > whatever... those sort of wishful comments simply ignore reality and > our ability to affect effective change. A lot of WiFi firmware crashes > and for well-known cases the drivers handle them well. And in some > cases, not so well and that should be a place the driver should detect > and thus raise a red flag. If a WiFi firmware crash can bring down > the kernel, there's either a major driver bug or some very funky > hardware crap going on. That sort of thing we should be able to > detect, mark with a taint (or something), and fix if within our sphere > of influence. I guess what it comes down to me is how aggressive we > are about setting the flag. Exactly the crux of the issue. I hope that by now we should all be in agreement that at least a firmware crash requiring a reboot is something we should record and inform the user of. A taint seems like a reasonable standard practice for these sorts of things. > I would like there to be a single solution, or a minimized set > depending on what makes sense for the requirements. I haven't had time > to look into the alternatives mentioned here so I don't have an > informed opinion about the solution. I do think Luis is trying to > solve a real problem though. Can we look at this from the point of > view of what are the requirements? What is it we're trying to solve? > > I _think_ that the goal of Luis's original proposal is to report up to > the user, at some future point when the user is interested (because > something super drastic just occured, but long after the fw crash), > that there was a firmware crash without the user having to grep > through all logs on the machine. And then if the user sees that flag > and suspects it, then they can bother to find it in the logs or do > more drastic debugging steps like finding the fw crash in the log and > pulling firmware crash dumps, etc. Yes, that's exactly it. Not all users are clueful to inspect logs. I now have a generic uevent mechanism drafted which sends a uevent for *any* taint. So that is, it does not even depend on this series. But it accomplishes the goal of informing the user of taints. > I think the various alternate solutions are great but perhaps solving > a superset of features (like adding in user-space notifications etc)? > Perhaps different people on these related threads are trying to solve > different problems? The uevent mechanism I implemented (but not yet posted for review) at least sends out a smoke signal. I think that if each subsystem wants to expand on this with dumping facilities that is great too! Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-22 23:23 ` Steve deRosier 2020-05-22 23:44 ` Luis Chamberlain @ 2020-05-25 9:07 ` Andy Shevchenko 2020-05-25 17:08 ` Ben Greear 1 sibling, 1 reply; 51+ messages in thread From: Andy Shevchenko @ 2020-05-25 9:07 UTC (permalink / raw) To: Steve deRosier Cc: Luis Chamberlain, Johannes Berg, Jakub Kicinski, Ben Greear, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, Kees Cook, Daniel Vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k, jiri, briannorris On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote: > On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > I had to go RTFM re: kernel taints because it has been a very long > time since I looked at them. It had always seemed to me that most were > caused by "kernel-unfriendly" user actions. The most famous of course > is loading proprietary modules, out-of-tree modules, forced module > loads, etc... Honestly, I had forgotten the large variety of uses of > the taint flags. For anyone who hasn't looked at taints recently, I > recommend: https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html > > In light of this I don't object to setting a taint on this anymore. > I'm a little uneasy, but I've softened on it now, and now I feel it > depends on implementation. > > Specifically, I don't think we should set a taint flag when a driver > easily handles a routine firmware crash and is confident that things > have come up just fine again. In other words, triggering the taint in > every driver module where it spits out a log comment that it had a > firmware crash and had to recover seems too much. Sure, firmware > shouldn't crash, sure it should be open source so we can fix it, > whatever... While it may sound idealistic the firmware for the end-user, and even for mere kernel developer like me, is a complete blackbox which has more access than root user in the kernel. We have tons of firmwares and each of them potentially dangerous beast. As a user I really care about my data and privacy (hacker can oops a firmware in order to set a specific vector attack). So, tainting kernel is _a least_ we can do there, the strict rules would be to reboot immediately. > those sort of wishful comments simply ignore reality and > our ability to affect effective change. We can encourage users not to buy cheap crap for the starter. > A lot of WiFi firmware crashes > and for well-known cases the drivers handle them well. And in some > cases, not so well and that should be a place the driver should detect > and thus raise a red flag. If a WiFi firmware crash can bring down > the kernel, there's either a major driver bug or some very funky > hardware crap going on. That sort of thing we should be able to > detect, mark with a taint (or something), and fix if within our sphere > of influence. I guess what it comes down to me is how aggressive we > are about setting the flag. -- With Best Regards, Andy Shevchenko ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-25 9:07 ` Andy Shevchenko @ 2020-05-25 17:08 ` Ben Greear 0 siblings, 0 replies; 51+ messages in thread From: Ben Greear @ 2020-05-25 17:08 UTC (permalink / raw) To: Andy Shevchenko, Steve deRosier Cc: Luis Chamberlain, Johannes Berg, Jakub Kicinski, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, Takashi Iwai, schlad, Kees Cook, Daniel Vetter, will, mchehab+samsung, Kalle Valo, David S. Miller, Network Development, LKML, linux-wireless, ath10k, jiri, briannorris On 05/25/2020 02:07 AM, Andy Shevchenko wrote: > On Fri, May 22, 2020 at 04:23:55PM -0700, Steve deRosier wrote: >> On Fri, May 22, 2020 at 2:51 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > >> I had to go RTFM re: kernel taints because it has been a very long >> time since I looked at them. It had always seemed to me that most were >> caused by "kernel-unfriendly" user actions. The most famous of course >> is loading proprietary modules, out-of-tree modules, forced module >> loads, etc... Honestly, I had forgotten the large variety of uses of >> the taint flags. For anyone who hasn't looked at taints recently, I >> recommend: https://www.kernel.org/doc/html/latest/admin-guide/tainted-kernels.html >> >> In light of this I don't object to setting a taint on this anymore. >> I'm a little uneasy, but I've softened on it now, and now I feel it >> depends on implementation. >> >> Specifically, I don't think we should set a taint flag when a driver >> easily handles a routine firmware crash and is confident that things >> have come up just fine again. In other words, triggering the taint in >> every driver module where it spits out a log comment that it had a >> firmware crash and had to recover seems too much. Sure, firmware >> shouldn't crash, sure it should be open source so we can fix it, >> whatever... > > While it may sound idealistic the firmware for the end-user, and even for mere > kernel developer like me, is a complete blackbox which has more access than > root user in the kernel. We have tons of firmwares and each of them potentially > dangerous beast. As a user I really care about my data and privacy (hacker can > oops a firmware in order to set a specific vector attack). So, tainting kernel > is _a least_ we can do there, the strict rules would be to reboot immediately. > >> those sort of wishful comments simply ignore reality and >> our ability to affect effective change. > > We can encourage users not to buy cheap crap for the starter. There is no stable wifi firmware for any price. There is also no obvious feedback from even name-brand NICs like ath10k or AX200 when you report a crash. That said, at least in my experience with ath10k-ct, the OS normally recovers fine from firmware crashes. ath10k already reports full crash reports on udev, so easy for user-space to notice and report bug reports upstream if it cares to. Probably other NICs do the same, and if not, they certainly could. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-22 20:46 ` Johannes Berg 2020-05-22 21:51 ` Luis Chamberlain @ 2020-05-25 20:57 ` Jakub Kicinski 2020-07-30 13:56 ` Johannes Berg 1 sibling, 1 reply; 51+ messages in thread From: Jakub Kicinski @ 2020-05-25 20:57 UTC (permalink / raw) To: Johannes Berg Cc: Luis Chamberlain, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k, jiri, briannorris On Fri, 22 May 2020 22:46:07 +0200 Johannes Berg wrote: > > The irony is you have a problem with a networking device and all the > > devices your initial set touched are networking. Two of the drivers > > you touched either have or will soon have devlink health reporters > > implemented. > > Like I said above, do you think it'd be feasible to make a devcoredump > out of devlink health reports? And can the report be in a way that we > control the file format, or are there limits? I guess I should read the > code to find out, but I figure you probably just know. But feel free to > tell me to read it :) > > The reason I'm asking is that it's starting to sound like we really > ought to be implementing devlink, but we've got a bunch of > infrastructure that uses the devcoredump, and it'll take time > (significantly so) to change all that... In devlink world pure FW core dumps are exposed by devlink regions. An API allowing reading device memory, registers, etc., but also creating dumps of memory regions when things go wrong. It should be a fairly straightforward migration. Devlink health is more targeted, the dump is supposed to contain only relevant information, selected and formatted by the driver. When device misbehaves driver reads the relevant registers and FW state and creates a formatted state dump. I believe each element of the dump must fit into a netlink message (but there may be multiple elements, see devlink_fmsg_prepare_skb()). We should be able to convert dl-regions dumps and dl-health dumps into devcoredumps, but since health reporters are supposed to be more targeted there's usually multiple of them per device. Conversely devcoredumps can be trivially exposed as dl-region dumps, but I believe dl-health would require driver changes to format the information appropriately. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-25 20:57 ` Jakub Kicinski @ 2020-07-30 13:56 ` Johannes Berg 0 siblings, 0 replies; 51+ messages in thread From: Johannes Berg @ 2020-07-30 13:56 UTC (permalink / raw) To: Jakub Kicinski Cc: Luis Chamberlain, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k, jiri, briannorris Hi, Sorry ... long delay. > > The reason I'm asking is that it's starting to sound like we really > > ought to be implementing devlink, but we've got a bunch of > > infrastructure that uses the devcoredump, and it'll take time > > (significantly so) to change all that... > > In devlink world pure FW core dumps are exposed by devlink regions. > An API allowing reading device memory, registers, etc., but also > creating dumps of memory regions when things go wrong. It should be > a fairly straightforward migration. Right. We also have regions (various memory pieces, registers, ...). One issue might be that for devlink we wouldn't want to expose these as a single blob, I guess, but for devcoredump we already have a custom format to glue all the things together. Since it seems unlikely that anyone else would want to use the *iwlwifi* format to glue things together, we'd have to do something there. But perhaps that could be a matter of providing a "glue things into a devcoredump" function that would have a reasonable default but could be overridden by the driver for those migration cases. > Devlink health is more targeted, the dump is supposed to contain only > relevant information, selected and formatted by the driver. When device > misbehaves driver reads the relevant registers and FW state and creates > a formatted state dump. I believe each element of the dump must fit > into a netlink message (but there may be multiple elements, see > devlink_fmsg_prepare_skb()). That wouldn't help for our big memory dumps, but OK. > We should be able to convert dl-regions dumps and dl-health dumps into > devcoredumps, but since health reporters are supposed to be more > targeted there's usually multiple of them per device. Right. > Conversely devcoredumps can be trivially exposed as dl-region dumps, > but I believe dl-health would require driver changes to format the > information appropriately. Agree. Anyway, thanks. I'll put it on my list of things to look at ... not too hopeful that will be soon, given how long it even took me to get back to this email :) johannes ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC 1/2] devlink: add simple fw crash helpers 2020-05-22 17:17 ` Jakub Kicinski 2020-05-22 20:46 ` Johannes Berg @ 2020-05-22 21:49 ` Luis Chamberlain 1 sibling, 0 replies; 51+ messages in thread From: Luis Chamberlain @ 2020-05-22 21:49 UTC (permalink / raw) To: Jakub Kicinski Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k, jiri, briannorris On Fri, May 22, 2020 at 10:17:38AM -0700, Jakub Kicinski wrote: > On Fri, 22 May 2020 05:20:46 +0000 Luis Chamberlain wrote: > > > diff --git a/net/core/Makefile b/net/core/Makefile > > > index 3e2c378e5f31..6f1513781c17 100644 > > > --- a/net/core/Makefile > > > +++ b/net/core/Makefile > > > @@ -31,7 +31,7 @@ obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o > > > obj-$(CONFIG_BPF_STREAM_PARSER) += sock_map.o > > > obj-$(CONFIG_DST_CACHE) += dst_cache.o > > > obj-$(CONFIG_HWBM) += hwbm.o > > > -obj-$(CONFIG_NET_DEVLINK) += devlink.o > > > +obj-$(CONFIG_NET_DEVLINK) += devlink.o devlink_simple_fw_reporter.o > > > > This was looking super sexy up to here. This is networking specific. > > We want something generic for *anything* that requests firmware. > > You can't be serious. It's network specific because of how the Kconfig > is named? Kconfig? What has that to do with anything? The issue I have is that the solution I am looking for is for it to be agnostic to the subsystem. I have found similar firmware crashes on gpu, media, scsci. > Working for a company operating large data centers I would strongly > prefer if we didn't have ten different ways of reporting firmware > problems in the fleet. Indeed. > > I'm afraid this won't work for something generic. I don't think its > > throw-away work though, the idea to provide a generic interface to > > dump firmware through netlink might be nice for networking, or other > > things. > > > > But I have a feeling we'll want something still more generic than this. > > Please be specific. Saying generic a lot is not helpful. The code (as > you can see in this patch) is in no way network specific. Or are you > saying there are machines out there running without netlink sockets? No, I am saying I want something to work with any struct device. > > So networking may want to be aware that a firmware crash happened as > > part of this network device health thing, but firmware crashing is a > > generic thing. > > > > I have now extended my patch set to include uvents and I am more set on > > that we need the taint now more than ever. > > Please expect my nack if you're trying to add this to networking > drivers. The uevent mechanism is not for networking. The taint however is, and I'd like to undertand how it is you do not see that an undesirable requirement for a reboot is a clear case for a taint. > The irony is you have a problem with a networking device and all the > devices your initial set touched are networking. Two of the drivers > you touched either have or will soon have devlink health reporters > implemented. That is all great, and I don't think its a bad idea to add infrastructure / extend it to get more information about a firmware crash dump. However, suggesting that devlink is the only solution we need in the kernel without considering other subsystems is what I am suggesting doesn't suit my needs. Networking was just the first subsystem I am taclking now but I have patches where similar situations happen across the kernel. Luis ^ permalink raw reply [flat|nested] 51+ messages in thread
* [RFC 2/2] i2400m: use devlink health reporter 2020-05-19 1:05 ` Luis Chamberlain 2020-05-19 21:15 ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski @ 2020-05-19 21:15 ` Jakub Kicinski 1 sibling, 0 replies; 51+ messages in thread From: Jakub Kicinski @ 2020-05-19 21:15 UTC (permalink / raw) To: mcgrof Cc: johannes, derosier, greearb, jeyu, akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k, jiri, briannorris, Jakub Kicinski It builds. Signed-off-by: Jakub Kicinski <kuba@kernel.org> --- drivers/net/wimax/i2400m/rx.c | 2 ++ drivers/net/wimax/i2400m/usb.c | 5 +++++ 2 files changed, 7 insertions(+) diff --git a/drivers/net/wimax/i2400m/rx.c b/drivers/net/wimax/i2400m/rx.c index c9fb619a9e01..cc7fe78f2df0 100644 --- a/drivers/net/wimax/i2400m/rx.c +++ b/drivers/net/wimax/i2400m/rx.c @@ -144,6 +144,7 @@ * i2400m_msg_size_check * wimax_msg */ +#include <linux/devlink.h> #include <linux/slab.h> #include <linux/kernel.h> #include <linux/if_arp.h> @@ -712,6 +713,7 @@ void __i2400m_roq_queue(struct i2400m *i2400m, struct i2400m_roq *roq, dev_err(dev, "SW BUG? failed to insert packet\n"); dev_err(dev, "ERX: roq %p [ws %u] skb %p nsn %d sn %u\n", roq, roq->ws, skb, nsn, roq_data->sn); + devlink_simple_fw_reporter_report_crash(dev); skb_queue_walk(&roq->queue, skb_itr) { roq_data_itr = (struct i2400m_roq_data *) &skb_itr->cb; nsn_itr = __i2400m_roq_nsn(roq, roq_data_itr->sn); diff --git a/drivers/net/wimax/i2400m/usb.c b/drivers/net/wimax/i2400m/usb.c index 9659f9e1aaa6..5c811dccbf1d 100644 --- a/drivers/net/wimax/i2400m/usb.c +++ b/drivers/net/wimax/i2400m/usb.c @@ -49,6 +49,7 @@ * usb_reset_device() */ #include "i2400m-usb.h" +#include <linux/devlink.h> #include <linux/wimax/i2400m.h> #include <linux/debugfs.h> #include <linux/slab.h> @@ -423,6 +424,8 @@ int i2400mu_probe(struct usb_interface *iface, if (usb_dev->speed != USB_SPEED_HIGH) dev_err(dev, "device not connected as high speed\n"); + devlink_simple_fw_reporter_prepare(dev); + /* Allocate instance [calls i2400m_netdev_setup() on it]. */ result = -ENOMEM; net_dev = alloc_netdev(sizeof(*i2400mu), "wmx%d", NET_NAME_UNKNOWN, @@ -506,6 +509,7 @@ int i2400mu_probe(struct usb_interface *iface, usb_put_dev(i2400mu->usb_dev); free_netdev(net_dev); error_alloc_netdev: + devlink_simple_fw_reporter_cleanup(dev); return result; } @@ -532,6 +536,7 @@ void i2400mu_disconnect(struct usb_interface *iface) usb_set_intfdata(iface, NULL); usb_put_dev(i2400mu->usb_dev); free_netdev(net_dev); + devlink_simple_fw_reporter_cleanup(dev); d_fnend(3, dev, "(iface %p i2400m %p) = void\n", iface, i2400m); } -- 2.25.4 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() [not found] <20200515212846.1347-1-mcgrof@kernel.org> 2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain @ 2020-05-15 21:28 ` Luis Chamberlain 2020-05-16 4:12 ` Rafael Aquini 2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain 2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain 3 siblings, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-15 21:28 UTC (permalink / raw) To: jeyu Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, Luis Chamberlain, linux-wireless, ath10k This makes use of the new module_firmware_crashed() to help annotate when firmware for device drivers crash. When firmware crashes devices can sometimes become unresponsive, and recovery sometimes requires a driver unload / reload and in the worst cases a reboot. Using a taint flag allows us to annotate when this happens clearly. Cc: linux-wireless@vger.kernel.org Cc: ath10k@lists.infradead.org Cc: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- drivers/net/wireless/ath/ath6kl/hif.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/wireless/ath/ath6kl/hif.c b/drivers/net/wireless/ath/ath6kl/hif.c index d1942537ea10..cfd838607544 100644 --- a/drivers/net/wireless/ath/ath6kl/hif.c +++ b/drivers/net/wireless/ath/ath6kl/hif.c @@ -120,6 +120,7 @@ static int ath6kl_hif_proc_dbg_intr(struct ath6kl_device *dev) int ret; ath6kl_warn("firmware crashed\n"); + module_firmware_crashed(); /* * read counter to clear the interrupt, the debug error interrupt is -- 2.26.2 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() 2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain @ 2020-05-16 4:12 ` Rafael Aquini 0 siblings, 0 replies; 51+ messages in thread From: Rafael Aquini @ 2020-05-16 4:12 UTC (permalink / raw) To: Luis Chamberlain Cc: jeyu, akpm, arnd, rostedt, mingo, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, ath10k On Fri, May 15, 2020 at 09:28:44PM +0000, Luis Chamberlain wrote: > This makes use of the new module_firmware_crashed() to help > annotate when firmware for device drivers crash. When firmware > crashes devices can sometimes become unresponsive, and recovery > sometimes requires a driver unload / reload and in the worst cases > a reboot. > > Using a taint flag allows us to annotate when this happens clearly. > > Cc: linux-wireless@vger.kernel.org > Cc: ath10k@lists.infradead.org > Cc: Kalle Valo <kvalo@codeaurora.org> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > drivers/net/wireless/ath/ath6kl/hif.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/wireless/ath/ath6kl/hif.c b/drivers/net/wireless/ath/ath6kl/hif.c > index d1942537ea10..cfd838607544 100644 > --- a/drivers/net/wireless/ath/ath6kl/hif.c > +++ b/drivers/net/wireless/ath/ath6kl/hif.c > @@ -120,6 +120,7 @@ static int ath6kl_hif_proc_dbg_intr(struct ath6kl_device *dev) > int ret; > > ath6kl_warn("firmware crashed\n"); > + module_firmware_crashed(); > > /* > * read counter to clear the interrupt, the debug error interrupt is > -- > 2.26.2 > Acked-by: Rafael Aquini <aquini@redhat.com> ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v2 14/15] brcm80211: use new module_firmware_crashed() [not found] <20200515212846.1347-1-mcgrof@kernel.org> 2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain 2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain @ 2020-05-15 21:28 ` Luis Chamberlain 2020-05-16 4:13 ` Rafael Aquini 2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain 3 siblings, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-15 21:28 UTC (permalink / raw) To: jeyu Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, Luis Chamberlain, linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list, Arend van Spriel, Franky Lin, Hante Meuleman, Chi-Hsien Lin, Wright Feng, Rafał Miłecki, Pieter-Paul Giesberts This makes use of the new module_firmware_crashed() to help annotate when firmware for device drivers crash. When firmware crashes devices can sometimes become unresponsive, and recovery sometimes requires a driver unload / reload and in the worst cases a reboot. Using a taint flag allows us to annotate when this happens clearly. Cc: linux-wireless@vger.kernel.org Cc: brcm80211-dev-list.pdl@broadcom.com Cc: brcm80211-dev-list@cypress.com Cc: Arend van Spriel <arend.vanspriel@broadcom.com> Cc: Franky Lin <franky.lin@broadcom.com> Cc: Hante Meuleman <hante.meuleman@broadcom.com> Cc: Chi-Hsien Lin <chi-hsien.lin@cypress.com> Cc: Wright Feng <wright.feng@cypress.com> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: "Rafał Miłecki" <rafal@milecki.pl> Cc: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c index c88655acc78c..d623f83568b3 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c @@ -1393,6 +1393,7 @@ void brcmf_fw_crashed(struct device *dev) struct brcmf_pub *drvr = bus_if->drvr; bphy_err(drvr, "Firmware has halted or crashed\n"); + module_firmware_crashed(); brcmf_dev_coredump(dev); -- 2.26.2 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v2 14/15] brcm80211: use new module_firmware_crashed() 2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain @ 2020-05-16 4:13 ` Rafael Aquini 0 siblings, 0 replies; 51+ messages in thread From: Rafael Aquini @ 2020-05-16 4:13 UTC (permalink / raw) To: Luis Chamberlain Cc: jeyu, akpm, arnd, rostedt, mingo, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, brcm80211-dev-list.pdl, brcm80211-dev-list, Arend van Spriel, Franky Lin, Hante Meuleman, Chi-Hsien Lin, Wright Feng, Rafał Miłecki, Pieter-Paul Giesberts On Fri, May 15, 2020 at 09:28:45PM +0000, Luis Chamberlain wrote: > This makes use of the new module_firmware_crashed() to help > annotate when firmware for device drivers crash. When firmware > crashes devices can sometimes become unresponsive, and recovery > sometimes requires a driver unload / reload and in the worst cases > a reboot. > > Using a taint flag allows us to annotate when this happens clearly. > > Cc: linux-wireless@vger.kernel.org > Cc: brcm80211-dev-list.pdl@broadcom.com > Cc: brcm80211-dev-list@cypress.com > Cc: Arend van Spriel <arend.vanspriel@broadcom.com> > Cc: Franky Lin <franky.lin@broadcom.com> > Cc: Hante Meuleman <hante.meuleman@broadcom.com> > Cc: Chi-Hsien Lin <chi-hsien.lin@cypress.com> > Cc: Wright Feng <wright.feng@cypress.com> > Cc: Kalle Valo <kvalo@codeaurora.org> > Cc: "Rafał Miłecki" <rafal@milecki.pl> > Cc: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c > index c88655acc78c..d623f83568b3 100644 > --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c > +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c > @@ -1393,6 +1393,7 @@ void brcmf_fw_crashed(struct device *dev) > struct brcmf_pub *drvr = bus_if->drvr; > > bphy_err(drvr, "Firmware has halted or crashed\n"); > + module_firmware_crashed(); > > brcmf_dev_coredump(dev); > > -- > 2.26.2 > Acked-by: Rafael Aquini <aquini@redhat.com> ^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v2 15/15] mwl8k: use new module_firmware_crashed() [not found] <20200515212846.1347-1-mcgrof@kernel.org> ` (2 preceding siblings ...) 2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain @ 2020-05-15 21:28 ` Luis Chamberlain 2020-05-16 4:13 ` Rafael Aquini 3 siblings, 1 reply; 51+ messages in thread From: Luis Chamberlain @ 2020-05-15 21:28 UTC (permalink / raw) To: jeyu Cc: akpm, arnd, rostedt, mingo, aquini, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, Luis Chamberlain, linux-wireless, Lennert Buytenhek, Gustavo A. R. Silva, Johannes Berg, Ganapathi Bhat This makes use of the new module_firmware_crashed() to help annotate when firmware for device drivers crash. When firmware crashes devices can sometimes become unresponsive, and recovery sometimes requires a driver unload / reload and in the worst cases a reboot. Using a taint flag allows us to annotate when this happens clearly. Cc: linux-wireless@vger.kernel.org Cc: Lennert Buytenhek <buytenh@wantstofly.org> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Johannes Berg <johannes.berg@intel.com> Cc: Ganapathi Bhat <ganapathi.bhat@nxp.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- drivers/net/wireless/marvell/mwl8k.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/wireless/marvell/mwl8k.c b/drivers/net/wireless/marvell/mwl8k.c index 97f23f93f6e7..d609ef1bb879 100644 --- a/drivers/net/wireless/marvell/mwl8k.c +++ b/drivers/net/wireless/marvell/mwl8k.c @@ -1551,6 +1551,7 @@ static int mwl8k_tx_wait_empty(struct ieee80211_hw *hw) * the firmware has crashed */ if (priv->hw_restart_in_progress) { + module_firmware_crashed(); if (priv->hw_restart_owner == current) return 0; else -- 2.26.2 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v2 15/15] mwl8k: use new module_firmware_crashed() 2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain @ 2020-05-16 4:13 ` Rafael Aquini 0 siblings, 0 replies; 51+ messages in thread From: Rafael Aquini @ 2020-05-16 4:13 UTC (permalink / raw) To: Luis Chamberlain Cc: jeyu, akpm, arnd, rostedt, mingo, cai, dyoung, bhe, peterz, tglx, gpiccoli, pmladek, tiwai, schlad, andriy.shevchenko, keescook, daniel.vetter, will, mchehab+samsung, kvalo, davem, netdev, linux-kernel, linux-wireless, Lennert Buytenhek, Gustavo A. R. Silva, Johannes Berg, Ganapathi Bhat On Fri, May 15, 2020 at 09:28:46PM +0000, Luis Chamberlain wrote: > This makes use of the new module_firmware_crashed() to help > annotate when firmware for device drivers crash. When firmware > crashes devices can sometimes become unresponsive, and recovery > sometimes requires a driver unload / reload and in the worst cases > a reboot. > > Using a taint flag allows us to annotate when this happens clearly. > > Cc: linux-wireless@vger.kernel.org > Cc: Lennert Buytenhek <buytenh@wantstofly.org> > Cc: Kalle Valo <kvalo@codeaurora.org> > Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> > Cc: Johannes Berg <johannes.berg@intel.com> > Cc: Ganapathi Bhat <ganapathi.bhat@nxp.com> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > drivers/net/wireless/marvell/mwl8k.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/wireless/marvell/mwl8k.c b/drivers/net/wireless/marvell/mwl8k.c > index 97f23f93f6e7..d609ef1bb879 100644 > --- a/drivers/net/wireless/marvell/mwl8k.c > +++ b/drivers/net/wireless/marvell/mwl8k.c > @@ -1551,6 +1551,7 @@ static int mwl8k_tx_wait_empty(struct ieee80211_hw *hw) > * the firmware has crashed > */ > if (priv->hw_restart_in_progress) { > + module_firmware_crashed(); > if (priv->hw_restart_owner == current) > return 0; > else > -- > 2.26.2 > Acked-by: Rafael Aquini <aquini@redhat.com> ^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2020-07-30 13:57 UTC | newest] Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20200515212846.1347-1-mcgrof@kernel.org> 2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain 2020-05-16 4:11 ` Rafael Aquini 2020-05-16 13:24 ` Johannes Berg 2020-05-16 13:50 ` Johannes Berg 2020-05-18 16:56 ` Luis Chamberlain 2020-05-19 1:23 ` Brian Norris 2020-05-19 14:02 ` Luis Chamberlain 2020-05-20 0:47 ` Brian Norris 2020-05-20 5:37 ` Emmanuel Grumbach 2020-05-20 8:32 ` Andy Shevchenko 2020-05-21 19:01 ` Brian Norris 2020-05-22 5:12 ` Emmanuel Grumbach 2020-05-22 5:23 ` Luis Chamberlain 2020-05-18 16:51 ` Luis Chamberlain 2020-05-18 16:58 ` Ben Greear 2020-05-18 17:09 ` Luis Chamberlain 2020-05-18 17:15 ` Ben Greear 2020-05-18 17:18 ` Luis Chamberlain 2020-05-18 18:06 ` Steve deRosier 2020-05-18 19:09 ` Luis Chamberlain 2020-05-18 19:25 ` Johannes Berg 2020-05-18 19:59 ` Luis Chamberlain 2020-05-18 20:07 ` Johannes Berg 2020-05-18 21:18 ` Luis Chamberlain 2020-05-18 20:28 ` Jakub Kicinski 2020-05-18 20:29 ` Johannes Berg 2020-05-18 20:35 ` Jakub Kicinski 2020-05-18 20:41 ` Johannes Berg 2020-05-18 20:46 ` Jakub Kicinski 2020-05-18 21:22 ` Luis Chamberlain 2020-05-18 22:16 ` Jakub Kicinski 2020-05-19 1:05 ` Luis Chamberlain 2020-05-19 21:15 ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski 2020-05-22 5:20 ` Luis Chamberlain 2020-05-22 17:17 ` Jakub Kicinski 2020-05-22 20:46 ` Johannes Berg 2020-05-22 21:51 ` Luis Chamberlain 2020-05-22 23:23 ` Steve deRosier 2020-05-22 23:44 ` Luis Chamberlain 2020-05-25 9:07 ` Andy Shevchenko 2020-05-25 17:08 ` Ben Greear 2020-05-25 20:57 ` Jakub Kicinski 2020-07-30 13:56 ` Johannes Berg 2020-05-22 21:49 ` Luis Chamberlain 2020-05-19 21:15 ` [RFC 2/2] i2400m: use devlink health reporter Jakub Kicinski 2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain 2020-05-16 4:12 ` Rafael Aquini 2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain 2020-05-16 4:13 ` Rafael Aquini 2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain 2020-05-16 4:13 ` Rafael Aquini
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).