* ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
@ 2020-12-16  8:47 Kalle Valo
  2020-12-17  8:41 ` Kalle Valo
  2020-12-17  9:53 ` Manivannan Sadhasivam
  0 siblings, 2 replies; 9+ messages in thread
From: Kalle Valo @ 2020-12-16  8:47 UTC (permalink / raw)
  To: Manivannan Sadhasivam, Hemant Kumar, Jeffrey Hugo, Bhaumik Bhatt,
	Bjorn Andersson
  Cc: Stephen Liang, Carl Huang, ath11k, wink, Mitchell Nordine

Hi MHI devs,

To keep the discussion organised I'll start a new thread about weird
kernel crashes we are seeing on ath11k, and include MHI folks as well in
case they have any ideas. This is a long story, but I'll try to summarise
it as briefly as I can :)

Recently Dell released laptops with QCA6390. Unfortunately there's a
BIOS bug[1] and ath11k only receives 1 MSI vector, as opposed to the 32
vectors it needs. Carl implemented a proof-of-concept patch[2] which
worked fine on some platforms; for example, I didn't see any issues on
my Intel NUC with QCA6390.

But once people with a Dell XPS 13 9310 started testing Carl's patches,
they started reporting weird kernel crashes. This is what wink reported[3]:

----------------------------------------------------------------------
So up until this point, everything is working without issues.
Everything seems to spiral out of control a couple of seconds later
when my system attempts to actually bring up the adapter.  In most of
the crash states I will see this:

[   31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
[   31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
[   31.391928] wlp85s0: authenticated
[   31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
[   31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
(capab=0x411 status=0 aid=6)
[   31.407730] wlp85s0: associated
[   31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready

And then either somewhere in that pile of messages, or a second or two
after this my machine will start to stutter as I mentioned before, and
then it either hangs, or I see this message (I'm truncating the
timestamp):

[   35.xxxx ] sched: RT throttling activated

After that moment, the machine is unresponsive.  Sorry I can't seem to
extract this data other than screenshots from my phone at the moment,
you can see the dmesg output from 6 different hangs here:

https://github.com/w1nk/ath11k-debug
----------------------------------------------------------------------

Wink even made videos available[3].

After extensive debugging, wink found out that disabling the M2 state
makes all the problems go away:

--- a/drivers/bus/mhi/core/pm.c
+++ b/drivers/bus/mhi/core/pm.c
@@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
        },
        {
                MHI_PM_M0,
-               MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
+               MHI_PM_M0 | MHI_PM_M3_ENTER |
                MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
                MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
        },
        {
-               MHI_PM_M2,
+               MHI_PM_M0,
                MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
                MHI_PM_LD_ERR_FATAL_DETECT
        },

And indeed now we have numerous people reporting that with this
workaround ath11k is stable on their Dell XPS 13 9310 laptops. What on
earth could cause these kernel crashes/interrupt storms? And why is it
visible only on Dell laptops? Why does disabling M2 state fix it?
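
For readers unfamiliar with the table being patched, here is a minimal
userspace sketch of how a mask-based transition table blocks M0 -> M2
once the bit is removed. The enum values, the table contents and the
`transition_allowed()` helper are assumptions made for illustration;
they are not the kernel's definitions, and the real lookup in
drivers/bus/mhi/core/pm.c may work differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative state bits; these values are NOT the kernel's. */
enum pm_state {
    PM_M0       = 1u << 0,
    PM_M2       = 1u << 1,
    PM_M3_ENTER = 1u << 2,
    PM_SYS_ERR  = 1u << 3,
};

struct pm_transition {
    unsigned int from;    /* current state (a single bit) */
    unsigned int to_mask; /* OR of the states reachable from `from` */
};

/* Table with M2 allowed, loosely mirroring the unpatched entries. */
static const struct pm_transition stock_table[] = {
    { PM_M0, PM_M0 | PM_M2 | PM_M3_ENTER | PM_SYS_ERR },
    { PM_M2, PM_M0 | PM_SYS_ERR },
};

/* Same table with the workaround applied: M2 is dropped from the M0
 * mask and the former M2 row now claims to start from M0. */
static const struct pm_transition workaround_table[] = {
    { PM_M0, PM_M0 | PM_M3_ENTER | PM_SYS_ERR },
    { PM_M0, PM_M0 | PM_SYS_ERR },
};

/* Return true if `table` permits moving from `from` to `to`.
 * Simplification: the first row whose `from` matches decides. */
static bool transition_allowed(const struct pm_transition *table, size_t n,
                               unsigned int from, unsigned int to)
{
    assert(table != NULL);
    for (size_t i = 0; i < n; i++) {
        if (table[i].from == from)
            return (table[i].to_mask & to) != 0;
    }
    return false; /* no row for this state: nothing is allowed */
}
```

With `stock_table`, `transition_allowed(stock_table, 2, PM_M0, PM_M2)`
is true; with `workaround_table` it is false, which is the effect the
workaround relies on.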

Something else to investigate: does AC power vs battery power have
something to do with this? Can that affect M2 states somehow?

Any other ideas how to debug this? This is a very weird problem.

Wink and others, in case I missed something please do fill in.

Kalle

[1] https://lore.kernel.org/ath11k/87mtzxkus5.fsf@nanos.tec.linutronix.de/

[2] https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=ath11k-qca6390-bringup&id=742b5de85acf7f25ca327c66c2b71d4f2cb6c245

[3] https://drive.google.com/drive/folders/1wvxZI5XtwPSrm0-6-Ov50cUfqBXSXeNz?usp=sharing

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

-- 
ath11k mailing list
ath11k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath11k


* Re: ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
  2020-12-16  8:47 ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state Kalle Valo
@ 2020-12-17  8:41 ` Kalle Valo
  2020-12-17  9:53 ` Manivannan Sadhasivam
  1 sibling, 0 replies; 9+ messages in thread
From: Kalle Valo @ 2020-12-17  8:41 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: Stephen Liang, wink, Jeffrey Hugo, Carl Huang, Bhaumik Bhatt,
	Bjorn Andersson, Hemant Kumar, ath11k, Mitchell Nordine

Kalle Valo <kvalo@codeaurora.org> writes:

> To keep the discussion organised I'll start a new thread about weird
> kernel crashes we are seeing on ath11k, and include MHI folks as well in
> case they have any ideas. This is a long story, but I'll try to summarise
> it as briefly as I can :)
>
> Recently Dell released laptops with QCA6390. Unfortunately there's a
> BIOS bug[1] and ath11k only receives 1 MSI vector, as opposed to the 32
> vectors it needs. Carl implemented a proof-of-concept patch[2] which
> worked fine on some platforms; for example, I didn't see any issues on
> my Intel NUC with QCA6390.
>
> But once people with a Dell XPS 13 9310 started testing Carl's patches,
> they started reporting weird kernel crashes. This is what wink reported[3]:
>
> ----------------------------------------------------------------------
> So up until this point, everything is working without issues.
> Everything seems to spiral out of control a couple of seconds later
> when my system attempts to actually bring up the adapter.  In most of
> the crash states I will see this:
>
> [   31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> [   31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> [   31.391928] wlp85s0: authenticated
> [   31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> [   31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> (capab=0x411 status=0 aid=6)
> [   31.407730] wlp85s0: associated
> [   31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
>
> And then either somewhere in that pile of messages, or a second or two
> after this my machine will start to stutter as I mentioned before, and
> then it either hangs, or I see this message (I'm truncating the
> timestamp):
>
> [   35.xxxx ] sched: RT throttling activated
>
> After that moment, the machine is unresponsive.  Sorry I can't seem to
> extract this data other than screenshots from my phone at the moment,
> you can see the dmesg output from 6 different hangs here:
>
> https://github.com/w1nk/ath11k-debug
> ----------------------------------------------------------------------
>
> Wink even made videos available[3].
>
> After extensive debugging, wink found out that disabling the M2 state
> makes all the problems go away:
>
> --- a/drivers/bus/mhi/core/pm.c
> +++ b/drivers/bus/mhi/core/pm.c
> @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
>         },
>         {
>                 MHI_PM_M0,
> -               MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> +               MHI_PM_M0 | MHI_PM_M3_ENTER |
>                 MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
>                 MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
>         },
>         {
> -               MHI_PM_M2,
> +               MHI_PM_M0,
>                 MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
>                 MHI_PM_LD_ERR_FATAL_DETECT
>         },
>
> And indeed now we have numerous people reporting that with this
> workaround ath11k is stable on their Dell XPS 13 9310 laptops. What on
> earth could cause these kernel crashes/interrupt storms? And why is it
> visible only on Dell laptops? Why does disabling M2 state fix it?
>
> Something else to investigate: does AC power vs battery power have
> something to do with this? Can that affect M2 states somehow?
>
> Any other ideas how to debug this? This is a very weird problem.

I was told that some registers are not allowed to be accessed during M2
state, so it looks like wink was spot on with his workaround. And ASPM
is also related, which might explain why not everyone sees these
problems.

This is all still very sketchy and I'm trying to get more information.


* Re: ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
  2020-12-16  8:47 ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state Kalle Valo
  2020-12-17  8:41 ` Kalle Valo
@ 2020-12-17  9:53 ` Manivannan Sadhasivam
  2020-12-17 19:01   ` Stephen Liang
  2020-12-19 21:34   ` wi nk
  1 sibling, 2 replies; 9+ messages in thread
From: Manivannan Sadhasivam @ 2020-12-17  9:53 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Stephen Liang, wink, Jeffrey Hugo, Carl Huang, Bhaumik Bhatt,
	Bjorn Andersson, Hemant Kumar, ath11k, Mitchell Nordine

Hi Kalle,

On Wed, Dec 16, 2020 at 10:47:18AM +0200, Kalle Valo wrote:
> Hi MHI devs,
> 

[...]

> After extensive debugging, wink found out that disabling the M2 state
> makes all the problems go away:
> 
> --- a/drivers/bus/mhi/core/pm.c
> +++ b/drivers/bus/mhi/core/pm.c
> @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
>         },
>         {
>                 MHI_PM_M0,
> -               MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> +               MHI_PM_M0 | MHI_PM_M3_ENTER |
>                 MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
>                 MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
>         },
>         {
> -               MHI_PM_M2,
> +               MHI_PM_M0,
>                 MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
>                 MHI_PM_LD_ERR_FATAL_DETECT
>         },
> 
> And indeed now we have numerous people reporting that with this
> workaround ath11k is stable on their Dell XPS 13 9310 laptops. What on
> earth could cause these kernel crashes/interrupt storms? And why is it
> visible only on Dell laptops? Why does disabling M2 state fix it?
> 

This is related to the ASPM state of the PCIe bus. In the meantime, I'd
suggest turning off ASPM using "pcie_aspm=off" in the kernel command
line so that the MHI bus stays in M0.

For debugging this issue, can someone enable debug logs for MHI and share
the dmesg output (with ASPM enabled ofc)?
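
To confirm which ASPM policy the kernel is actually using before and
after changing the command line, something like the following sketch
may help. It assumes the policy is exposed in
/sys/module/pcie_aspm/parameters/policy with the active entry in square
brackets (the file is absent on kernels built without ASPM support);
the helper names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed (active) entry of a policy string such as
 * "default performance [powersave] powersupersave" into `out`.
 * Returns 0 on success, -1 if there is no bracketed entry or it
 * does not fit into `out`. */
static int aspm_active_policy(const char *line, char *out, size_t out_len)
{
    assert(line != NULL && out != NULL);

    const char *open = strchr(line, '[');
    if (!open)
        return -1;
    const char *close = strchr(open, ']');
    if (!close)
        return -1;

    size_t len = (size_t)(close - open - 1);
    if (len + 1 > out_len)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the first line of `path` and extract the active policy. */
static int aspm_read_policy(const char *path, char *out, size_t out_len)
{
    char line[128];
    FILE *f = fopen(path, "r");

    if (!f)
        return -1; /* e.g. kernel built without ASPM support */
    if (!fgets(line, sizeof line, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return aspm_active_policy(line, out, out_len);
}
```

This only reports the global policy; the per-device link state still
needs to be checked separately, for example in the LnkCtl line of
`lspci -vv`.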

Thanks,
Mani


* Re: ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
  2020-12-17  9:53 ` Manivannan Sadhasivam
@ 2020-12-17 19:01   ` Stephen Liang
  2020-12-19 21:34   ` wi nk
  1 sibling, 0 replies; 9+ messages in thread
From: Stephen Liang @ 2020-12-17 19:01 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: wi nk, Jeffrey Hugo, Carl Huang, Bhaumik Bhatt, Bjorn Andersson,
	Hemant Kumar, ath11k, Kalle Valo, Mitchell Nordine

In my last test run, the system hangs were only occasionally
reproducible during WiFi scanning (opening Gnome WiFi settings to see
a list of networks, or looking at the network list dropdown will
trigger the hang). If you do this, one of two things happens, usually
within a minute.

1. The system hangs
2. The firmware crashes

Please find below the debug MHI logs that were generated via echo -n
'module mhi +p' > /sys/kernel/debug/dynamic_debug/control

Firmware crash logs (no hang): https://pastebin.com/raw/E0y49evA
Lockup: https://i.imgur.com/0XExack.jpg
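
The shell one-liner above is just a single write of a query string to
the dynamic debug control file. A rough C equivalent, assuming debugfs
is mounted at /sys/kernel/debug and the process runs as root;
`dyndbg_enable_module()` is a hypothetical helper for this sketch, not
an existing API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "module <name> +p" to a dynamic debug control file.
 * Returns 0 on success, -1 on any error. */
static int dyndbg_enable_module(const char *control_path, const char *module)
{
    assert(control_path != NULL && module != NULL);

    if (strlen(module) == 0)
        return -1;

    FILE *f = fopen(control_path, "w");
    if (!f)
        return -1; /* debugfs not mounted, or not root */

    int ok = fprintf(f, "module %s +p", module) > 0;

    /* The write may only reach the file on close, so check that too. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling `dyndbg_enable_module("/sys/kernel/debug/dynamic_debug/control",
"mhi")` should have the same effect as the echo command.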

On Thu, Dec 17, 2020 at 1:53 AM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> Hi Kalle,
>
> On Wed, Dec 16, 2020 at 10:47:18AM +0200, Kalle Valo wrote:
> > Hi MHI devs,
> >
>
> [...]
>
> > After extensive debugging, wink found out that disabling the M2 state
> > makes all the problems go away:
> >
> > --- a/drivers/bus/mhi/core/pm.c
> > +++ b/drivers/bus/mhi/core/pm.c
> > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
> >         },
> >         {
> >                 MHI_PM_M0,
> > -               MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > +               MHI_PM_M0 | MHI_PM_M3_ENTER |
> >                 MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >                 MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> >         },
> >         {
> > -               MHI_PM_M2,
> > +               MHI_PM_M0,
> >                 MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >                 MHI_PM_LD_ERR_FATAL_DETECT
> >         },
> >
> > And indeed now we have numerous people reporting that with this
> > workaround ath11k is stable on their Dell XPS 13 9310 laptops. What on
> > earth could cause these kernel crashes/interrupt storms? And why is it
> > visible only on Dell laptops? Why does disabling M2 state fix it?
> >
>
> This is related to the ASPM state of the PCIe bus. In the meantime, I'd
> suggest turning off ASPM using "pcie_aspm=off" in the kernel command
> line so that the MHI bus stays in M0.
>
> For debugging this issue, can someone enable debug logs for MHI and share
> the dmesg output (with ASPM enabled ofc)?
>
> Thanks,
> Mani


* Re: ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
  2020-12-17  9:53 ` Manivannan Sadhasivam
  2020-12-17 19:01   ` Stephen Liang
@ 2020-12-19 21:34   ` wi nk
  2020-12-20 15:05     ` Manivannan Sadhasivam
  1 sibling, 1 reply; 9+ messages in thread
From: wi nk @ 2020-12-19 21:34 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: Stephen Liang, Jeffrey Hugo, Carl Huang, Bhaumik Bhatt,
	Bjorn Andersson, Hemant Kumar, ath11k, Kalle Valo,
	Mitchell Nordine

On Thu, Dec 17, 2020 at 10:53 AM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> Hi Kalle,
>
> On Wed, Dec 16, 2020 at 10:47:18AM +0200, Kalle Valo wrote:
> > Hi MHI devs,
> >
>
> [...]
>
> > After extensive debugging, wink found out that disabling the M2 state
> > makes all the problems go away:
> >
> > --- a/drivers/bus/mhi/core/pm.c
> > +++ b/drivers/bus/mhi/core/pm.c
> > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
> >         },
> >         {
> >                 MHI_PM_M0,
> > -               MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > +               MHI_PM_M0 | MHI_PM_M3_ENTER |
> >                 MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >                 MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> >         },
> >         {
> > -               MHI_PM_M2,
> > +               MHI_PM_M0,
> >                 MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >                 MHI_PM_LD_ERR_FATAL_DETECT
> >         },
> >
> > And indeed now we have numerous people reporting that with this
> > workaround ath11k is stable on their Dell XPS 13 9310 laptops. What on
> > earth could cause these kernel crashes/interrupt storms? And why is it
> > visible only on Dell laptops? Why does disabling M2 state fix it?
> >
>
> This is related to the ASPM state of the PCIe bus. In the meantime, I'd
> suggest turning off ASPM using "pcie_aspm=off" in the kernel command
> line so that the MHI bus stays in M0.
>
> For debugging this issue, can someone enable debug logs for MHI and share
> the dmesg output (with ASPM enabled ofc)?
>
> Thanks,
> Mani

Hi Mani,

  Thanks for the information and ideas.  I tried to disable ASPM with
the kernel parameter you mentioned, that didn't seem to work, so I
removed ASPM support from my kernel altogether.  I still see the
adapter in the M1 state, which with my patch would've gone to M2 had
it not been disabled.  Is ASPM the only thing that will trigger the M*
transitions?  Would it require a transition to M2 regardless of
settings (maybe that's why it tried)?  The MHI dmesg output is pretty
consistent when it fails, it looks like this:
https://i.imgur.com/0XExack.jpg .  You can also see it in the mp4's
I've placed here:
https://drive.google.com/drive/folders/1wvxZI5XtwPSrm0-6-Ov50cUfqBXSXeNz?usp=sharing
.  Also note that the failure isn't deterministic, sometimes the
transition to M2 will succeed and everything works.

Thanks!


* Re: ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
  2020-12-19 21:34   ` wi nk
@ 2020-12-20 15:05     ` Manivannan Sadhasivam
  2020-12-20 15:39       ` wi nk
  0 siblings, 1 reply; 9+ messages in thread
From: Manivannan Sadhasivam @ 2020-12-20 15:05 UTC (permalink / raw)
  To: wi nk
  Cc: Stephen Liang, Jeffrey Hugo, Carl Huang, Bhaumik Bhatt,
	Bjorn Andersson, Hemant Kumar, ath11k, Kalle Valo,
	Mitchell Nordine

Hi,

On Sat, Dec 19, 2020 at 10:34:23PM +0100, wi nk wrote:
> On Thu, Dec 17, 2020 at 10:53 AM Manivannan Sadhasivam
> <manivannan.sadhasivam@linaro.org> wrote:
> >
> > Hi Kalle,
> >
> > On Wed, Dec 16, 2020 at 10:47:18AM +0200, Kalle Valo wrote:
> > > Hi MHI devs,
> > >
> >
> > [...]
> >
> > > After extensive debugging, wink found out that disabling the M2 state
> > > makes all the problems go away:
> > >
> > > --- a/drivers/bus/mhi/core/pm.c
> > > +++ b/drivers/bus/mhi/core/pm.c
> > > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
> > >         },
> > >         {
> > >                 MHI_PM_M0,
> > > -               MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > > +               MHI_PM_M0 | MHI_PM_M3_ENTER |
> > >                 MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > >                 MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> > >         },
> > >         {
> > > -               MHI_PM_M2,
> > > +               MHI_PM_M0,
> > >                 MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > >                 MHI_PM_LD_ERR_FATAL_DETECT
> > >         },
> > >
> > > And indeed now we have numerous people reporting that with this
> > > workaround ath11k is stable on their Dell XPS 13 9310 laptops. What on
> > > earth could cause these kernel crashes/interrupt storms? And why is it
> > > visible only on Dell laptops? Why does disabling M2 state fix it?
> > >
> >
> > This is related to the ASPM state of the PCIe bus. In the meantime, I'd
> > suggest turning off ASPM using "pcie_aspm=off" in the kernel command
> > line so that the MHI bus stays in M0.
> >
> > For debugging this issue, can someone enable debug logs for MHI and share
> > the dmesg output (with ASPM enabled ofc)?
> >
> > Thanks,
> > Mani
> 
> Hi Mani,
> 
>   Thanks for the information and ideas.  I tried to disable ASPM with
> the kernel parameter you mentioned, that didn't seem to work, so I
> removed ASPM support from my kernel altogether.  I still see the
> adapter in the M1 state, which with my patch would've gone to M2 had
> it not been disabled.  Is ASPM the only thing that will trigger the M*
> transitions?  Would it require a transition to M2 regardless of
> settings (maybe that's why it tried)?  

That's what I suspected, but it looks like the QCA6390 enters the M1 state (which
will in turn cause host MHI to transition to M2) when it detects link inactivity
using a timer. But with my NUC, I can't get the QCA6390 to enter the M1 state
regardless of ASPM support in the BIOS. In both conditions (with/without ASPM)
the device just stays in M0. My guess is that the device depends on the WAKE
sideband signal going low to enter the M1 state even when it detects link
inactivity on the PCIe bus. And this signal might be low on Dell laptops. But I
need Hemant/Bhaumik to confirm this!

In spite of that, I got plenty of the messages below in the dmesg log when MHI
debugging is enabled:

local ee:AMSS device ee:AMSS dev_state:M0
local ee:AMSS device ee:AMSS dev_state:M0
local ee:AMSS device ee:AMSS dev_state:M0
local ee:AMSS device ee:AMSS dev_state:M0
local ee:AMSS device ee:AMSS dev_state:M0

And this only happens when one MSI vector is used and shared by all IRQs. This
is because for shared IRQs, the kernel calls all of the registered ISRs of the
interrupt line when an interrupt occurs. So I cooked up a patch which checks for
the device state before proceeding through the ISR:

diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
index 2cff5ddff225..520948f3051d 100644
--- a/drivers/bus/mhi/core/main.c
+++ b/drivers/bus/mhi/core/main.c
@@ -386,6 +386,13 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
        state = mhi_get_mhi_state(mhi_cntrl);
        ee = mhi_cntrl->ee;
        mhi_cntrl->ee = mhi_get_exec_env(mhi_cntrl);
+
+       /* Only proceed if the device state is different */
+       if (mhi_cntrl->dev_state == state) {
+               write_unlock_irq(&mhi_cntrl->pm_lock);
+               goto exit_intvec;
+       }
+
        dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
                TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
                TO_MHI_STATE_STR(state));

This prevents the spike of debug messages, but I'm not sure whether the issue
you're seeing is related to this. Can you give this patch a try on your setup?
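
To illustrate the shared-vector behaviour described above, here is a
small userspace model, not kernel code. Every handler registered on the
line runs for each interrupt, and a handler that compares the device
state with its cached copy can bail out early, which is the idea behind
the patch. All names (`shared_line`, `fake_dev`, `line_fire()` and so
on) are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLERS 8

enum handler_ret { HANDLER_SKIPPED, HANDLER_RAN };

/* Hypothetical stand-in for a device with a pollable state register. */
struct fake_dev {
    int hw_state;     /* state the "hardware" currently reports */
    int cached_state; /* last state the handler acted on */
    int work_done;    /* number of times the handler did real work */
};

typedef enum handler_ret (*handler_fn)(void *priv);

/* One interrupt line shared by several handlers. */
struct shared_line {
    handler_fn fn[MAX_HANDLERS];
    void *priv[MAX_HANDLERS];
    size_t count;
};

/* Attach a handler to the line; returns -1 if the line is full. */
static int line_register(struct shared_line *l, handler_fn fn, void *priv)
{
    assert(l != NULL && fn != NULL);
    if (l->count == MAX_HANDLERS)
        return -1;
    l->fn[l->count] = fn;
    l->priv[l->count] = priv;
    l->count++;
    return 0;
}

/* Deliver one interrupt: every registered handler is called.
 * Returns how many handlers reported that they did work. */
static size_t line_fire(struct shared_line *l)
{
    size_t ran = 0;

    for (size_t i = 0; i < l->count; i++)
        if (l->fn[i](l->priv[i]) == HANDLER_RAN)
            ran++;
    return ran;
}

/* Handler without a filter: does its work on every interrupt. */
static enum handler_ret unfiltered_handler(void *priv)
{
    struct fake_dev *d = priv;

    d->cached_state = d->hw_state;
    d->work_done++;
    return HANDLER_RAN;
}

/* Handler with the state check: only proceeds on a state change. */
static enum handler_ret filtered_handler(void *priv)
{
    struct fake_dev *d = priv;

    if (d->cached_state == d->hw_state)
        return HANDLER_SKIPPED;
    d->cached_state = d->hw_state;
    d->work_done++;
    return HANDLER_RAN;
}
```

Registering both handlers on one `shared_line` and firing it repeatedly
shows the unfiltered handler doing work on every interrupt, while the
filtered one stays idle until `hw_state` changes.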

Thanks,
Mani

> The MHI dmesg output is pretty
> consistent when it fails, it looks like this:
> https://i.imgur.com/0XExack.jpg .  You can also see it in the mp4's
> I've placed here:
> https://drive.google.com/drive/folders/1wvxZI5XtwPSrm0-6-Ov50cUfqBXSXeNz?usp=sharing
> .  Also note that the failure isn't deterministic, sometimes the
> transition to M2 will succeed and everything works.
> 
> Thanks!


* Re: ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
  2020-12-20 15:05     ` Manivannan Sadhasivam
@ 2020-12-20 15:39       ` wi nk
  2020-12-21 17:15         ` Kalle Valo
  0 siblings, 1 reply; 9+ messages in thread
From: wi nk @ 2020-12-20 15:39 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: Stephen Liang, Jeffrey Hugo, Carl Huang, Bhaumik Bhatt,
	Bjorn Andersson, Hemant Kumar, ath11k, Kalle Valo,
	Mitchell Nordine

On Sun, Dec 20, 2020 at 4:05 PM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> Hi,
>
> On Sat, Dec 19, 2020 at 10:34:23PM +0100, wi nk wrote:
> > On Thu, Dec 17, 2020 at 10:53 AM Manivannan Sadhasivam
> > <manivannan.sadhasivam@linaro.org> wrote:
> > >
> > > Hi Kalle,
> > >
> > > On Wed, Dec 16, 2020 at 10:47:18AM +0200, Kalle Valo wrote:
> > > > Hi MHI devs,
> > > >
> > >
> > > [...]
> > >
> > > > After extensive debugging, wink found out that disabling the M2 state
> > > > makes all the problems go away:
> > > >
> > > > --- a/drivers/bus/mhi/core/pm.c
> > > > +++ b/drivers/bus/mhi/core/pm.c
> > > > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
> > > >         },
> > > >         {
> > > >                 MHI_PM_M0,
> > > > -               MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > > > +               MHI_PM_M0 | MHI_PM_M3_ENTER |
> > > >                 MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > >                 MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> > > >         },
> > > >         {
> > > > -               MHI_PM_M2,
> > > > +               MHI_PM_M0,
> > > >                 MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > >                 MHI_PM_LD_ERR_FATAL_DETECT
> > > >         },
> > > >
> > > > And indeed now we have numerous people reporting that with this
> > > > workaround ath11k is stable on their Dell XPS 13 9310 laptops. What on
> > > > earth could cause these kernel crashes/interrupt storms? And why is it
> > > > visible only on Dell laptops? Why does disabling M2 state fix it?
> > > >
> > >
> > > This is related to the ASPM state of the PCIe bus. In the meantime, I'd
> > > suggest turning off ASPM using "pcie_aspm=off" in the kernel command
> > > line so that the MHI bus stays in M0.
> > >
> > > For debugging this issue, can someone enable debug logs for MHI and share
> > > the dmesg output (with ASPM enabled ofc)?
> > >
> > > Thanks,
> > > Mani
> >
> > Hi Mani,
> >
> >   Thanks for the information and ideas.  I tried to disable ASPM with
> > the kernel parameter you mentioned, that didn't seem to work, so I
> > removed ASPM support from my kernel altogether.  I still see the
> > adapter in the M1 state, which with my patch would've gone to M2 had
> > it not been disabled.  Is ASPM the only thing that will trigger the M*
> > transitions?  Would it require a transition to M2 regardless of
> > settings (maybe that's why it tried)?
>
> That's what I suspected, but it looks like the QCA6390 enters the M1 state (which
> will in turn cause host MHI to transition to M2) when it detects link inactivity
> using a timer. But with my NUC, I can't get the QCA6390 to enter the M1 state
> regardless of ASPM support in the BIOS. In both conditions (with/without ASPM)
> the device just stays in M0. My guess is that the device depends on the WAKE
> sideband signal going low to enter the M1 state even when it detects link
> inactivity on the PCIe bus. And this signal might be low on Dell laptops. But I
> need Hemant/Bhaumik to confirm this!
>
> In spite of that, I got plenty of the messages below in the dmesg log when MHI
> debugging is enabled:
>
> local ee:AMSS device ee:AMSS dev_state:M0
> local ee:AMSS device ee:AMSS dev_state:M0
> local ee:AMSS device ee:AMSS dev_state:M0
> local ee:AMSS device ee:AMSS dev_state:M0
> local ee:AMSS device ee:AMSS dev_state:M0
>
> And this only happens when one MSI vector is used and shared by all IRQs. This
> is because for shared IRQs, the kernel calls all of the registered ISRs of the
> interrupt line when an interrupt occurs. So I cooked up a patch which checks for
> the device state before proceeding through the ISR:
>
> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> index 2cff5ddff225..520948f3051d 100644
> --- a/drivers/bus/mhi/core/main.c
> +++ b/drivers/bus/mhi/core/main.c
> @@ -386,6 +386,13 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
>         state = mhi_get_mhi_state(mhi_cntrl);
>         ee = mhi_cntrl->ee;
>         mhi_cntrl->ee = mhi_get_exec_env(mhi_cntrl);
> +
> +       /* Only proceed if the device state is different */
> +       if (mhi_cntrl->dev_state == state) {
> +               write_unlock_irq(&mhi_cntrl->pm_lock);
> +               goto exit_intvec;
> +       }
> +
>         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
>                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
>                 TO_MHI_STATE_STR(state));
>
> This prevents the spike of debug messages, but I'm not sure whether the issue
> you're seeing is related to this. Can you give this patch a try on your setup?
>
> Thanks,
> Mani
>
> > The MHI dmesg output is pretty
> > consistent when it fails, it looks like this:
> > https://i.imgur.com/0XExack.jpg .  You can also see it in the mp4's
> > I've placed here:
> > https://drive.google.com/drive/folders/1wvxZI5XtwPSrm0-6-Ov50cUfqBXSXeNz?usp=sharing
> > .  Also note that the failure isn't deterministic, sometimes the
> > transition to M2 will succeed and everything works.
> >
> > Thanks!

Mani,

  I'll give the patch a try once I have a moment.  For some extra
information, I also patched the interrupt handlers so they'd bail
early if the signal wasn't for them (since there are ~20 ISRs attached
between MHI and the ath11k driver).  That didn't seem to have a large
effect on anything other than potentially changing some of the timing
of things.  Another interesting note is that the laptop only seems to
attempt the transition when it's plugged into the charger (see my mp4s
for more info).  If I boot the laptop on battery and bring up the
adapter, it never attempts the transition beyond m0 either.

Thanks!


* Re: ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
  2020-12-20 15:39       ` wi nk
@ 2020-12-21 17:15         ` Kalle Valo
  2020-12-21 17:26           ` wi nk
  0 siblings, 1 reply; 9+ messages in thread
From: Kalle Valo @ 2020-12-21 17:15 UTC (permalink / raw)
  To: wi nk
  Cc: Stephen Liang, Jeffrey Hugo, Carl Huang, Bhaumik Bhatt,
	Bjorn Andersson, Manivannan Sadhasivam, Hemant Kumar, ath11k,
	Mitchell Nordine

wi nk <wink@technolu.st> writes:

> Another interesting note is that the laptop only seems to attempt the
> transition when it's plugged into the charger (see my mp4s for more
> info). If I boot the laptop on battery and bring up the adapter, it
> never attempts the transition beyond m0 either.

This is so weird to me, what could cause it?

If you boot the laptop on battery but after the boot connect it to the
charger, what does it do? I'm wondering if this is some boot state
problem/feature or whether the behaviour changes at runtime.


* Re: ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
  2020-12-21 17:15         ` Kalle Valo
@ 2020-12-21 17:26           ` wi nk
  0 siblings, 0 replies; 9+ messages in thread
From: wi nk @ 2020-12-21 17:26 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Stephen Liang, Jeffrey Hugo, Carl Huang, Bhaumik Bhatt,
	Bjorn Andersson, Manivannan Sadhasivam, Hemant Kumar, ath11k,
	Mitchell Nordine

On Mon, Dec 21, 2020 at 6:16 PM Kalle Valo <kvalo@codeaurora.org> wrote:
>
> wi nk <wink@technolu.st> writes:
>
> > Another interesting note is that the laptop only seems to attempt the
> > transition when it's plugged into the charger (see my mp4s for more
> > info). If I boot the laptop on battery and bring up the adapter, it
> > never attempts the transition beyond m0 either.
>
> This is so weird to me, what could cause it?
>
> If you boot the laptop on battery but after the boot connect it to the
> charger, what does it do? I'm wondering if this is some boot state
> problem/feature or whether the behaviour changes at runtime.

The behavior is changing during runtime as best I can tell.  In some
of my earlier videos you can see the behavior directly:

Boot 1: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_004643171.mp4
- The machine and driver have been online and stable for 5 minutes (as
seen in htop/ping); plugging in the charger causes the state
transition, which results in a hang after a few seconds

Boot 2: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005346443.mp4
- Same setup (although the machine had been up for 6 minutes at that
point) and same failure as boot 1. The machine hard locks instantly this
time, as opposed to the stuttering you can see in boot 1.

(video mirror is the Google Drive link earlier in the thread).  If you
watch those first 2 boots, the MHI state changes as a direct result of
me plugging the USB-C cable into the laptop.  What I thought was
Bluetooth interfering early on in my debugging was actually an
artifact of which room I was working in and whether my
charger/headphones were available.  In the boot videos where the driver
crashes immediately, the charger is already plugged in, so the adapter
attempts the state transition right after the driver initializes.

