From: Luis Chamberlain <mcgrof@kernel.org>
To: Brian Norris <briannorris@chromium.org>
Cc: Johannes Berg <johannes@sipsolutions.net>,
linux-wireless <linux-wireless@vger.kernel.org>,
aquini@redhat.com, peterz@infradead.org, daniel.vetter@ffwll.ch,
mchehab+samsung@kernel.org, will@kernel.org, bhe@redhat.com,
ath10k@lists.infradead.org, Takashi Iwai <tiwai@suse.de>,
mingo@redhat.com, dyoung@redhat.com, pmladek@suse.com,
Kees Cook <keescook@chromium.org>, Arnd Bergmann <arnd@arndb.de>,
gpiccoli@canonical.com, Steven Rostedt <rostedt@goodmis.org>,
cai@lca.pw, tglx@linutronix.de,
Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
Kalle Valo <kvalo@codeaurora.org>,
"<netdev@vger.kernel.org>" <netdev@vger.kernel.org>,
schlad@suse.de, Linux Kernel <linux-kernel@vger.kernel.org>,
jeyu@kernel.org, Andrew Morton <akpm@linux-foundation.org>,
"David S. Miller" <davem@davemloft.net>
Subject: Re: [PATCH v2 12/15] ath10k: use new module_firmware_crashed()
Date: Tue, 19 May 2020 14:02:12 +0000 [thread overview]
Message-ID: <20200519140212.GT11244@42.do-not-panic.com> (raw)
In-Reply-To: <CA+ASDXOg9oKeMJP1Mf42oCMMM3sVe0jniaWowbXVuaYZ=ZpDjQ@mail.gmail.com>
On Mon, May 18, 2020 at 06:23:33PM -0700, Brian Norris wrote:
> On Sat, May 16, 2020 at 6:51 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > In addition, look what we have in iwl_trans_pcie_removal_wk(). If we
> > detect that the device is really wedged enough that the only way we can
> > still try to recover is by completely unbinding the driver from it, then
> > we give userspace a uevent for that. I don't remember exactly how and
> > where that gets used (ChromeOS) though, but it'd be nice to have that
> > sort of thing as part of the infrastructure, in a sort of two-level
> > notification?
>
> <slight side track>
> We use this on certain devices where we know the underlying hardware
> has design issues that may lead to device failure
Ah, after reading below I see you meant for iwlwifi.
If userspace can indeed grow to support this, that would be fantastic.
I should note that I don't discourage hiding firmware or hardware
issues. Quite the contrary, I suspect that taking pride in being
trasnparent about it, and dealing with it fast can help lead the pack.
I wrote about this long ago in 2015 [0], and stand by it.
[0] https://www.do-not-panic.com/2015/04/god-complex-why-open-models-will-win.html
> -- then when we see
> this sort of unrecoverable "firmware-death", we remove the
> device[*]+driver, force-reset the PCI device (SBR), and try to
> reload/reattach the driver. This all happens by way of a udev rule.
So you've sprikled your own udev event here as part of your kernel delta?
> We
> also log this sort of stuff (and metrics around it) for bug reports
> and health statistics, since we really hope to not see this happen
> often.
Assuming perfection is ideal but silly. So, what infrastructure do you
use for this sort of issue?
> [*] "We" (user space) don't actually do this...it happens via the
> 'remove_when_gone' module parameter abomination found in iwlwifi.
Holy moly.. but hey, at least it may seem a bit more seemless than forcing
a reboot / manual driver removal / addition to the user.
BTW is this likely a place on iwlwifi where the firmware likely crashed?
> I'd
> personally rather see the EVENT=INACESSIBLE stuff on its own, and let
> user space deal with when and how to remove and reset the device. But
> I digress too much here ;)
> </slight side track>
This is all useful information. We are just touching the surface of the
topic by addressing networking first. Imagine when we address other
subsystems.
> I really came to this thread to say that I also love the idea of a
> generic mechanism (a la $subject) to report firmware crashes, but I
> also have no interest in seeing a taint flag for it. For Chrome OS, I
> would readily (as in, we're already looking at more-hacky /
> non-generic ways to do this for drivers we care about) process these
> kinds of stats as they happen, logging metrics for bug reports and/or
> for automated crash statistics, when we see a firmware crash.
Great!
> A uevent
> would suit us very well I think, although it would be nice if drivers
> could also supply some small amount of informative text along with it
A follow up to this series was to add a uevent to add_taint(), however
since a *count* is not considered I think it is correct to seek
alternatives at this point. The leaner the solution the better though.
Do you have a pointer to what guys use so I can read?
> (e.g., a sort of "reason code", in case we can possibly aggregate
> certain failure types). We already do this sort of thing for WARN()
> and friends (not via uevent, but via log parsing; at least it has nice
> "cut here" markers!).
Indeed, similar things can indeed be argued about WARN*()... this
however can be non-device specific. With panic-on-warn becoming a
"thing", the more important it becomes to really tally exactly *why*
these WARN*()s may trigger.
> Perhaps
Note below.
> devlink (as proposed down-thread) would also fit the bill. I
> don't think sysfs alone would fit our needs, as we'd like to process
> these things as they happen, not only when a user submits a bug
> report.
I think we've reached a point where using "*Perhaps*" does not suffice,
and if there is already a *user* of similar desired infrastructure I
think we should jump on the opportunity to replace what you have with
something which could be used by other devices / subsystems which
require firmware. And indeed, also even consider in the abstract sense,
the possibility to leverage something like this for WARN*()s later too.
> > Level 1: firmware crashed, but we're recovering, at least mostly, and
> > it's more informational
>
> Chrome OS would love to track these things too, since we'd like to see
> these minimized, even if they're usually recoverable ;)
>
> > Level 2: device is wedged, going to try to recover by some more forceful
> > means (perhaps some devices can be power-cycled? etc.) but (more) state
> > would be lost in these cases?
>
> And we'd definitely want to know about these. We already get this for
> the iwlwifi case described above, in a non-generic way.
>
> In general, it's probably not that easy to tell the difference between
> 1 and 2, since even as you and Luis have noted, with the same driver
> (and the same driver location), you find the same crashes may or may
> not be recoverable. iwlwifi has extracted certain level 2 cases into
> iwl_trans_pcie_removal_wk(), but even iwlwifi doesn't know all the
> ways in which level 1 crashes actually lead to severe
> (non-recoverable) failure.
And that is fine, accepting these for what they are will help. However,
leaving the user in the *dark*, is what we should *not do*.
Luis
next prev parent reply other threads:[~2020-05-19 14:02 UTC|newest]
Thread overview: 51+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20200515212846.1347-1-mcgrof@kernel.org>
2020-05-15 21:28 ` [PATCH v2 12/15] ath10k: use new module_firmware_crashed() Luis Chamberlain
2020-05-16 4:11 ` Rafael Aquini
2020-05-16 13:24 ` Johannes Berg
2020-05-16 13:50 ` Johannes Berg
2020-05-18 16:56 ` Luis Chamberlain
2020-05-19 1:23 ` Brian Norris
2020-05-19 14:02 ` Luis Chamberlain [this message]
2020-05-20 0:47 ` Brian Norris
2020-05-20 5:37 ` Emmanuel Grumbach
2020-05-20 8:32 ` Andy Shevchenko
2020-05-21 19:01 ` Brian Norris
2020-05-22 5:12 ` Emmanuel Grumbach
2020-05-22 5:23 ` Luis Chamberlain
2020-05-18 16:51 ` Luis Chamberlain
2020-05-18 16:58 ` Ben Greear
2020-05-18 17:09 ` Luis Chamberlain
2020-05-18 17:15 ` Ben Greear
2020-05-18 17:18 ` Luis Chamberlain
2020-05-18 18:06 ` Steve deRosier
2020-05-18 19:09 ` Luis Chamberlain
2020-05-18 19:25 ` Johannes Berg
2020-05-18 19:59 ` Luis Chamberlain
2020-05-18 20:07 ` Johannes Berg
2020-05-18 21:18 ` Luis Chamberlain
2020-05-18 20:28 ` Jakub Kicinski
2020-05-18 20:29 ` Johannes Berg
2020-05-18 20:35 ` Jakub Kicinski
2020-05-18 20:41 ` Johannes Berg
2020-05-18 20:46 ` Jakub Kicinski
2020-05-18 21:22 ` Luis Chamberlain
2020-05-18 22:16 ` Jakub Kicinski
2020-05-19 1:05 ` Luis Chamberlain
2020-05-19 21:15 ` [RFC 1/2] devlink: add simple fw crash helpers Jakub Kicinski
2020-05-22 5:20 ` Luis Chamberlain
2020-05-22 17:17 ` Jakub Kicinski
2020-05-22 20:46 ` Johannes Berg
2020-05-22 21:51 ` Luis Chamberlain
2020-05-22 23:23 ` Steve deRosier
2020-05-22 23:44 ` Luis Chamberlain
2020-05-25 9:07 ` Andy Shevchenko
2020-05-25 17:08 ` Ben Greear
2020-05-25 20:57 ` Jakub Kicinski
2020-07-30 13:56 ` Johannes Berg
2020-05-22 21:49 ` Luis Chamberlain
2020-05-19 21:15 ` [RFC 2/2] i2400m: use devlink health reporter Jakub Kicinski
2020-05-15 21:28 ` [PATCH v2 13/15] ath6kl: use new module_firmware_crashed() Luis Chamberlain
2020-05-16 4:12 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 14/15] brcm80211: " Luis Chamberlain
2020-05-16 4:13 ` Rafael Aquini
2020-05-15 21:28 ` [PATCH v2 15/15] mwl8k: " Luis Chamberlain
2020-05-16 4:13 ` Rafael Aquini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200519140212.GT11244@42.do-not-panic.com \
--to=mcgrof@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=andriy.shevchenko@linux.intel.com \
--cc=aquini@redhat.com \
--cc=arnd@arndb.de \
--cc=ath10k@lists.infradead.org \
--cc=bhe@redhat.com \
--cc=briannorris@chromium.org \
--cc=cai@lca.pw \
--cc=daniel.vetter@ffwll.ch \
--cc=davem@davemloft.net \
--cc=dyoung@redhat.com \
--cc=gpiccoli@canonical.com \
--cc=jeyu@kernel.org \
--cc=johannes@sipsolutions.net \
--cc=keescook@chromium.org \
--cc=kvalo@codeaurora.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-wireless@vger.kernel.org \
--cc=mchehab+samsung@kernel.org \
--cc=mingo@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=pmladek@suse.com \
--cc=rostedt@goodmis.org \
--cc=schlad@suse.de \
--cc=tglx@linutronix.de \
--cc=tiwai@suse.de \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).