* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
@ 2020-12-06 17:38 Mitchell Nordine
  2020-12-06 17:53 ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: Mitchell Nordine @ 2020-12-06 17:38 UTC
  To: ath11k

I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.

FWIW, I still haven't managed to enable Bluetooth in my kernel, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that Bluetooth's apparent effect on the raciness was just a coincidence.
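
As a rough cross-check on the boot counts quoted below (freezes on 10 of 10 boots with Bluetooth on, 2 of 10 with it off), here is a minimal sketch of a two-sided Fisher exact test using only the Python standard library. The function name and table layout are my own choices, and the test assumes every boot is an independent trial, which the later run of six failed boots in a row with Bluetooth disabled suggests may not hold.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x first-column outcomes in row 1,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Sum every table that is at least as unlikely as the observed one
    # (the small factor guards against floating-point ties).
    total = 0.0
    for x in range(lowest, highest + 1):
        p = prob(x)
        if p <= observed * (1 + 1e-9):
            total += p
    return total


# Rows: Bluetooth on / off. Columns: boots that froze / boots that did not.
p_value = fisher_exact_two_sided(10, 0, 2, 8)
print(f"p = {p_value:.6f}")
```

On those counts the p-value comes out at roughly 0.0007, so the numbers alone would not look like chance. Given the later failures with Bluetooth disabled, twenty boots are probably better read as a hint than as evidence either way.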

Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
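
In case it helps anyone compare saved logs from different hangs, here is a small sketch that scans kernel log text for the messages mentioned in this thread and reports the time between association and RT throttling. The marker substrings are copied from the reports quoted below. The exact wording of the MHI debug line is not shown in the thread, so matching on `INVALID_EE` is an assumption, and the function names and the sample timestamps are made up for illustration.

```python
import re

# Substrings taken from the reports in this thread. "INVALID_EE" is an
# assumption about how the MHI debug line appears in the log.
MARKERS = {
    "dmar_bios_bug": "DMAR: [Firmware Bug]",
    "associated": ": associated",
    "rt_throttling": "sched: RT throttling activated",
    "mhi_invalid_ee": "INVALID_EE",
}

# Matches dmesg-style lines such as "[ 31.407730] wlp85s0: associated".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def scan_kernel_log(text):
    """Return {marker_name: first timestamp in seconds} for markers found."""
    first_seen = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without a numeric timestamp
        timestamp, message = float(match.group(1)), match.group(2)
        for name, needle in MARKERS.items():
            if needle in message and name not in first_seen:
                first_seen[name] = timestamp
    return first_seen


def seconds_from_association_to_throttling(first_seen):
    """Gap between 'associated' and RT throttling, or None if either is missing."""
    if "associated" in first_seen and "rt_throttling" in first_seen:
        return first_seen["rt_throttling"] - first_seen["associated"]
    return None


# Hypothetical sample; the timestamps are invented, not taken from a real log.
SAMPLE = """\
[ 10.000000] wlp85s0: associated
[ 12.500000] sched: RT throttling activated
"""
found = scan_kernel_log(SAMPLE)
gap = seconds_from_association_to_throttling(found)
print(found, gap)
```

Lines with a truncated or missing timestamp are skipped, so this only works on logs saved as text (not on phone screenshots).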


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Sunday, December 6, 2020 6:00 PM, <ath11k-request@lists.infradead.org> wrote:

> Today's Topics:
>
> 1.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> 2.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
>
>
> Message: 1
> Date: Sat, 5 Dec 2020 20:17:10 +0100
> From: wi nk wink@technolu.st
> To: Kalle Valo kvalo@codeaurora.org
> Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> Message-ID:
> CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A@mail.gmail.com
> Content-Type: text/plain; charset="UTF-8"
>
> On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
>
> > On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> >
> > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > >
> > > > Hi Wi and Thomas,
> > > > I'll start a new thread about problems on XPS 13. The information is
> > > > scattered to different threads and hard to find everything, it's much
> > > > easier to have everything in one place. So let's continue the discussion
> > > > about the kernel crashes on this thread.
> > > > Here's what I have understood so far:
> > > >
> > > > -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > >     with 32 MSI vectors.
> > > >
> > > > -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > >
> > > >
> > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > >
> > > > -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > >     13. We added a hack to ath11k to make it work with only one vector and after
> > > >     that it's possible to boot the firmware, connect to the AP and use the
> > > >     device for a while.
> > > >
> > > > -   But the problem now is that the kernel is crashing almost immediately
> > > >     and almost every time(?). And these crashes only happen on Dell XPS
> > > >     13, all other systems (including Dell XPS 15) seem to work without
> > > >     issues.
> > > >
> > > >
> > > > Is my understanding correct? Did I miss anything?
> > > > About the symptoms Wi reports:
> > > >
> > > > So up until this point, everything is working without issues.
> > > > Everything seems to spiral out of control a couple of seconds later
> > > > when my system attempts to actually bring up the adapter. In most of
> > > > the crash states I will see this:
> > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > [ 31.391928] wlp85s0: authenticated
> > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > > (capab=0x411 status=0 aid=6)
> > > > [ 31.407730] wlp85s0: associated
> > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > And then either somewhere in that pile of messages, or a second or two
> > > > after this my machine will start to stutter as I mentioned before, and
> > > > then it either hangs, or I see this message (I'm truncating the
> > > > timestamp):
> > > > [ 35.xxxx ] sched: RT throttling activated
> > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > extract this data other than screenshots from my phone at the moment,
> > > > you can see the dmesg output from 6 different hangs here:
> > > >
> > > > https://github.com/w1nk/ath11k-debug
> > > >
> > > > -------------------------------------
> > > >
> > > > And Thomas Krause reports:
> > > >
> > > > I can confirm this behavior on my configuration. I managed to login
> > > > once and select the Wifi and connect to it. It seemed curiously enough
> > > > > to be stable long enough to enter the Wifi passphrase. After the
> > > > > connection was established, the system hung and on each attempt to
> > > > reboot into the graphical system it would freeze at some point
> > > > (sometimes even before showing the login screen).
> > > >
> > > > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > >
> > > > --
> > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > >
> > > Hi Kalle,
> > > Again, thanks much for your work. I think you've summarized
> > > > everything up until this point. On my XPS 13 9310 the behavior of the
> > > RT throttling still exists for me occasionally on loading the
> > > driver/associating with an AP. The throttling consistently occurs
> > > after a few sets of the MHI debug printing showing the EE entering an
> > > invalid state ( AMSS -> INVALID_EE ). I'm now building the latest tag
> > > to see if there are any differences.
> > > Thanks!
> >
> > Just to follow up, the first boot resulted in the RT throttling
> > message as the adapter was coming up/associating, shortly after the
> > > firmware crashed and the kernel didn't fully freeze, but I needed to
> > reboot to bring the adapter back.
>
> Kalle -
>
> I've noticed one additional behavior that may give someone with
> familiarity with the QCA hardware a clue. I'm running
> ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> whatever reason, having the bluetooth subsystem enabled (with a paired
> device) on this dell basically guarantees I'll hit the scheduler
> throttling issue as the ath11k driver is initializing / associating.
> The bluetooth system is using the btqca driver. I don't have any
> useful debugging (I'll gladly collect some if there is a way to do it)
> other than tracking some simple statistics. I booted my system 20
> > times, 10 times with bluetooth enabled (and some headphones turned on
> ready to pair), and 10 times without. In both scenarios, I'm booting
> into X and manually modprobing the ath11k driver. The difference is
> that with bluetooth on and by the time I modprobe the driver, the
> headphones are paired and I received the throttling message and
> subsequent freezing 10/10 times. With bluetooth off / my headphones
> not paired, I only saw it 2/10. I know it's not much hard information
> but it's reliably reproducible for me, is there anything useful I can
> collect?
>
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Message: 2
> Date: Sun, 6 Dec 2020 09:05:57 +0100
> From: wi nk wink@technolu.st
> To: Kalle Valo kvalo@codeaurora.org
> Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> Message-ID:
> CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q@mail.gmail.com
> Content-Type: text/plain; charset="UTF-8"
>
> Well unfortunately I think the bluetooth was just a red herring in the
> racing. To chase that, I disabled all bluetooth and was able to get
> into a state where I had 6 failed boots in a row. To further poke
> around, I rebuilt the kernel with localmodconfig to disable building
> big chunks of things. This kernel is way less stable and seems to
> freeze most of the time (but does occasionally remain stable), I'm not
> sure what else got disabled in there, but it seems to have had a
> negative impact on the crash racing.
>
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Subject: Digest Footer
>
> ath11k mailing list
> ath11k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath11k
>
>
> -----------------------------------------------------------------------------------------------------------------------------
>
> End of ath11k Digest, Vol 7, Issue 5



-- 
ath11k mailing list
ath11k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath11k


* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-06 17:38 ath11k: QCA6390 on Dell XPS 13 and kernel crashes Mitchell Nordine
@ 2020-12-06 17:53 ` wi nk
  2020-12-06 21:45   ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-06 17:53 UTC
  To: Mitchell Nordine; +Cc: ath11k

On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
<mail@mitchellnordine.com> wrote:
>
> I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
>
> > FWIW, I still haven't managed to enable Bluetooth in my kernel, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that Bluetooth's apparent effect on the raciness was just a coincidence.
>
> Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
>
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Sunday, December 6, 2020 6:00 PM, <ath11k-request@lists.infradead.org> wrote:
>
> > Send ath11k mailing list submissions to
> > ath11k@lists.infradead.org
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> > http://lists.infradead.org/mailman/listinfo/ath11k
> > or, via email, send a message with subject or body 'help' to
> > ath11k-request@lists.infradead.org
> >
> > You can reach the person managing the list at
> > ath11k-owner@lists.infradead.org
> >
> > When replying, please edit your Subject line so it is more specific
> > than "Re: Contents of ath11k digest..."
> >
> > Today's Topics:
> >
> > 1.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > 2.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> >
> >
> > Message: 1
> > Date: Sat, 5 Dec 2020 20:17:10 +0100
> > From: wi nk wink@technolu.st
> > To: Kalle Valo kvalo@codeaurora.org
> > Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > Message-ID:
> > CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A@mail.gmail.com
> > Content-Type: text/plain; charset="UTF-8"
> >
> > On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> >
> > > On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > >
> > > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > > >
> > > > > Hi Wi and Thomas,
> > > > > I'll start a new thread about problems on XPS 13. The information is
> > > > > scattered to different threads and hard to find everything, it's much
> > > > > easier to have everything in one place. So let's continue the discussion
> > > > > about the kernel crashes on this thread.
> > > > > Here's what I have understood so far:
> > > > >
> > > > > -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > >     with 32 MSI vectors.
> > > > >
> > > > > -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > >
> > > > >
> > > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > >
> > > > > -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > >     13. We added a hack to ath11k make it work with only vector and after
> > > > >     that it's possible to boot the firmware, connect to the AP and use the
> > > > >     device for a while.
> > > > >
> > > > > -   But the problem now is that the kernel is crashing almost immediately
> > > > >     and almost every time(?). And these crashes only happen on Dell XPS
> > > > >     13, all other systems (including Dell XPS 15) seem to work without
> > > > >     issues.
> > > > >
> > > > >
> > > > > Is my understanding correct? Did I miss anything?
> > > > > About the symptoms Wi reports:
> > > > >
> > > > > So up until this point, everything is working without issues.
> > > > > Everything seems to spiral out of control a couple of seconds later
> > > > > when my system attempts to actually bring up the adapter. In most of
> > > > > the crash states I will see this:
> > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > [ 31.391928] wlp85s0: authenticated
> > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > > > (capab=0x411 status=0 aid=6)
> > > > > [ 31.407730] wlp85s0: associated
> > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > timestamp):
> > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > extract this data other than screenshots from my phone at the moment,
> > > > > you can see the dmesg output from 6 different hangs here:
> > > > >
> > > > > https://github.com/w1nk/ath11k-debug
> > > > >
> > > > > -------------------------------------
> > > > >
> > > > > And Thomas Krause reports:
> > > > >
> > > > > I can confirm this behavior on my configuration. I managed to login
> > > > > once and select the Wifi and connect to it. It seemed curiously enough
> > > > > be stable long enough to enter the Wifi passphrase. After the
> > > > > connection was established, the system hang and on each attempt to
> > > > > reboot into the graphical system it would freeze at some point
> > > > > (sometimes even before showing the login screen).
> > > > >
> > > > > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > > >
> > > > > --
> > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > >
> > > > Hi Kalle,
> > > > Again, thanks much for your work. I think you've summarized
> > > > everything up until this point. On my XPS 13 9310 The behavior of the
> > > > RT throttling still exists for me occasionally on loading the
> > > > driver/associating with an AP. The throttling consistently occurs
> > > > after a few sets of the MHI debug printing showing the EE entering an
> > > > invalid state ( AMSS -> INVALID_EE ). I'm now building the latest tag
> > > > to see if there are any differences.
> > > > Thanks!
> > >
> > > Just to follow up, the first boot resulted in the RT throttling
> > > message as the adapter was coming up/associating, shortly after the
> > > firmware crashed and the kernel didn't fully freeze, but I needed to(
> > > reboot to bring the adapter back.
> >
> > Kalle -
> >
> > I've noticed one additional behavior that may give someone with
> > familiarity with the QCA hardware a clue. I'm running
> > ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > whatever reason, having the bluetooth subsystem enabled (with a paired
> > device) on this dell basically guarantees I'll hit the scheduler
> > throttling issue as the ath11k driver is initializing / associating.
> > The bluetooth system is using the btqca driver. I don't have any
> > useful debugging (I'll gladly collect some if there is a way to do it)
> > other than tracking some simple statistics. I booted my system 20
> > times, 10 times with bluetooth enabled ((and some headphones turned on
> > ready to pair), and 10 times without. In both scenarios, I'm booting
> > into X and manually modprobing the ath11k driver. The difference is
> > that with bluetooth on and by the time I modprobe the driver, the
> > headphones are paired and I received the throttling message and
> > subsequent freezing 10/10 times. With bluetooth off / my headphones
> > not paired, I only saw it 2/10. I know it's not much hard information
> > but it's reliably reproducible for me, is there anything useful I can
> > collect?
> >
> >
> > -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >
> > Message: 2
> > Date: Sun, 6 Dec 2020 09:05:57 +0100
> > From: wi nk wink@technolu.st
> > To: Kalle Valo kvalo@codeaurora.org
> > Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > Message-ID:
> > CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q@mail.gmail.com
> > Content-Type: text/plain; charset="UTF-8"
> >
> > On Sat, Dec 5, 2020 at 8:17 PM wi nk wink@technolu.st wrote:
> >
> > > On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > >
> > > > On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > > >
> > > > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > > > >
> > > > > > Hi Wi and Thomas,
> > > > > > I'll start a new thread about problems on XPS 13. The information is
> > > > > > scattered to different threads and hard to find everything, it's much
> > > > > > easier to have everything in one place. So let's continue the discussion
> > > > > > about the kernel crashes on this thread.
> > > > > > Here's what I have understood so far:
> > > > > >
> > > > > > -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > > >     with 32 MSI vectors.
> > > > > >
> > > > > > -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > > >
> > > > > >
> > > > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > > >
> > > > > > -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > > >     13. We added a hack to ath11k make it work with only vector and after
> > > > > >     that it's possible to boot the firmware, connect to the AP and use the
> > > > > >     device for a while.
> > > > > >
> > > > > > -   But the problem now is that the kernel is crashing almost immediately
> > > > > >     and almost every time(?). And these crashes only happen on Dell XPS
> > > > > >     13, all other systems (including Dell XPS 15) seem to work without
> > > > > >     issues.
> > > > > >
> > > > > >
> > > > > > Is my understanding correct? Did I miss anything?
> > > > > > About the symptoms Wi reports:
> > > > > >
> > > > > > So up until this point, everything is working without issues.
> > > > > > Everything seems to spiral out of control a couple of seconds later
> > > > > > when my system attempts to actually bring up the adapter. In most of
> > > > > > the crash states I will see this:
> > > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > > [ 31.391928] wlp85s0: authenticated
> > > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > > > > (capab=0x411 status=0 aid=6)
> > > > > > [ 31.407730] wlp85s0: associated
> > > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > > timestamp):
> > > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > > extract this data other than screenshots from my phone at the moment,
> > > > > > you can see the dmesg output from 6 different hangs here:
> > > > > >
> > > > > > https://github.com/w1nk/ath11k-debug
> > > > > >
> > > > > > -------------------------------------
> > > > > >
> > > > > > And Thomas Krause reports:
> > > > > >
> > > > > > > I can confirm this behavior on my configuration. I managed to log in
> > > > > > > once, select the Wifi and connect to it. Curiously enough, it seemed
> > > > > > > to be stable long enough to enter the Wifi passphrase. After the
> > > > > > > connection was established, the system hung, and on each attempt to
> > > > > > > reboot into the graphical system it would freeze at some point
> > > > > > > (sometimes even before showing the login screen).
> > > > > >
> > > > > > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > > > >
> > > > > > --
> > > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > > >
> > > > > Hi Kalle,
> > > > > > Again, thanks much for your work. I think you've summarized
> > > > > > everything up until this point. On my XPS 13 9310 the RT throttling
> > > > > > still occurs occasionally when loading the driver/associating with
> > > > > > an AP. The throttling consistently occurs after a few sets of the
> > > > > > MHI debug printing showing the EE entering an invalid state
> > > > > > ( AMSS -> INVALID_EE ). I'm now building the latest tag to see if
> > > > > > there are any differences.
> > > > > Thanks!
> > > >
> > > > > Just to follow up, the first boot resulted in the RT throttling
> > > > > message as the adapter was coming up/associating. Shortly after, the
> > > > > firmware crashed; the kernel didn't fully freeze, but I needed to
> > > > > reboot to bring the adapter back.
> > >
> > > Kalle -
> > > I've noticed one additional behavior that may give someone with
> > > familiarity with the QCA hardware a clue. I'm running
> > > ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > whatever reason, having the bluetooth subsystem enabled (with a paired
> > > device) on this dell basically guarantees I'll hit the scheduler
> > > throttling issue as the ath11k driver is initializing / associating.
> > > The bluetooth system is using the btqca driver. I don't have any
> > > useful debugging (I'll gladly collect some if there is a way to do it)
> > > other than tracking some simple statistics. I booted my system 20
> > > > times, 10 times with bluetooth enabled (and some headphones turned on
> > > ready to pair), and 10 times without. In both scenarios, I'm booting
> > > into X and manually modprobing the ath11k driver. The difference is
> > > that with bluetooth on and by the time I modprobe the driver, the
> > > headphones are paired and I received the throttling message and
> > > subsequent freezing 10/10 times. With bluetooth off / my headphones
> > > not paired, I only saw it 2/10. I know it's not much hard information
> > > but it's reliably reproducible for me, is there anything useful I can
> > > collect?
> >
> > Well unfortunately I think the bluetooth was just a red herring in the
> > racing. To chase that, I disabled all bluetooth and was able to get
> > into a state where I had 6 failed boots in a row. To further poke
> > around, I rebuilt the kernel with localmodconfig to disable building
> > big chunks of things. This kernel is way less stable and seems to
> > freeze most of the time (but does occasionally remain stable), I'm not
> > sure what else got disabled in there, but it seems to have had a
> > negative impact on the crash racing.
> >
> >
> > -------------------------------------
> >
> > Subject: Digest Footer
> >
> > ath11k mailing list
> > ath11k@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/ath11k
> >
> >
> > -----------------------------------------------------------------------------------------------------------------------------
> >
> > End of ath11k Digest, Vol 7, Issue 5
>
>
>
> --
> ath11k mailing list
> ath11k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath11k

Hey Mitchell,

   One more thing to try that may help us get a little bit of extra
info.  Out of everything I've done, something that has remained
consistent is to enable the MHI debugging as Kalle suggested:

sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"

  Before any crash/spinlock, I see the MHI debug print (from
drivers/bus/mhi/core/main.c L389) show the EE entering an invalid
state.  After a number of further iterations through this function,
things finally go out of control.  So from

        dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
                TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
                TO_MHI_STATE_STR(state));

I'll see something like this:

[  312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
[  313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device ee:INVALID_EE dev_state:SYS_ERR

Then, after a few of those prints showing SYS_ERR, I get either a
spinlock or a firmware crash.  I'm not sure what causes the EE state
to go invalid, but maybe that's worth looking into.  Can you confirm
the same behavior?  To make this a little easier to see, I run
dmesg -wH in two windows, with one of them piped through grep -v mhi
(to filter out the MHI debugging).
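
For anyone who would rather not eyeball the MHI output, here is a
minimal sketch of a filter that picks out the point where the device
EE first turns INVALID_EE.  It assumes each dmesg line is unwrapped
and follows the dev_dbg() format quoted above; the function names
(parse_mhi_line, find_invalid_transitions) are made up for this
example and are not part of any existing tool.

```python
import re

# Matches the dev_dbg() format quoted above:
#   "local ee:%s device ee:%s dev_state:%s"
# The bracketed timestamp and the "mhi <device>:" prefix are assumed to
# look like the sample lines in this thread.
MHI_LINE = re.compile(
    r"\[\s*(?P<ts>[^\]]+?)\s*\]\s+mhi\s+(?P<dev>\S+):\s+"
    r"local ee:(?P<local_ee>\S+)\s+device ee:(?P<device_ee>\S+)\s+"
    r"dev_state:(?P<state>\S+)"
)


def parse_mhi_line(line):
    """Return the fields of one MHI state line as a dict, or None."""
    match = MHI_LINE.search(line)
    return match.groupdict() if match else None


def find_invalid_transitions(lines):
    """Yield (previous, current) MHI records each time the device EE
    changes from something else to INVALID_EE.

    `previous` is None if the very first MHI line is already invalid.
    Lines that are not MHI state prints are skipped.
    """
    prev = None
    for line in lines:
        cur = parse_mhi_line(line)
        if cur is None:
            continue
        was_invalid = prev is not None and prev["device_ee"] == "INVALID_EE"
        if cur["device_ee"] == "INVALID_EE" and not was_invalid:
            yield prev, cur
        prev = cur
```

Feeding it the lines of a saved dmesg capture (for example by
iterating over an open file) gives the last good state and the first
bad one, which are the two timestamps worth comparing against the
ath11k messages around them.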

Thanks!


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-06 17:53 ` wi nk
@ 2020-12-06 21:45   ` wi nk
  2020-12-07  1:17     ` wi nk
  2020-12-09 15:35     ` Kalle Valo
  0 siblings, 2 replies; 31+ messages in thread
From: wi nk @ 2020-12-06 21:45 UTC (permalink / raw)
  To: Mitchell Nordine; +Cc: ath11k

On Sun, Dec 6, 2020 at 6:53 PM wi nk <wink@technolu.st> wrote:
>
> On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> <mail@mitchellnordine.com> wrote:
> >
> > I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> >
> > FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that the Bluetooth impacting raciness they observed was just a coincidence.
> >
> > Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> >
>
> Hey Mitchell,
>
>    One more thing to try that may help us get a little bit of extra
> info.  Out of everything I've done, something that has remained
> consistent is to enable the MHI debugging as Kalle suggested:
>
> sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
>
>   Before any crash/spinlock, I see the MHI printing (from
> drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
> and then after a number more iterations through this function, things
> finally go out of control.  So from
>
>         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
>                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
>                 TO_MHI_STATE_STR(state));
>
> I'll see something like this:
>
> [  312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
> [  313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device
> ee:INVALID_EE dev_state:SYS_ERR
>
> Then after a few of those prints showing SYS_ERR, either a spinlock or
> a firmware crash.  I'm not sure what causes this ee state to go
> invalid, but maybe that's worth looking into.  Can you confirm the
> same behavior?  To see this a little easier, I also run dmesg -wH in
> two windows, one piping to | grep -v mhi (to filter out the mhi
> debugging).
>
> Thanks!

I've managed to stabilise my system now, so either the race is gone,
or I've done something to win it every time.  One of the avenues of
racing I was chasing at first was in the ath11k driver itself.  There
are a couple of areas where the single/shared IRQ is being forcibly
toggled in ways that the documentation says are not great (and that
the original patch was trying to avoid).  Fixing those didn't seem to
have much impact on stability (I've included those changes in my
patch though).

After the last email I thought about the MHI side of things a bit
more and found a number of call sites that my naive grepping had
missed.  They do the same thing, but while acquiring a lock at the
same time.  I changed all the *_lock_irq and *_unlock_irq calls to
the *_lock_irqsave / *_unlock_irqrestore variants, which take a flags
parameter to capture and restore the interrupt state.  I've now
booted and loaded the driver 10+ times without a single freeze or
crash.  I'm not sure all of those modifications are necessary (i.e.
which paths are re-entrant in this single-interrupt operating mode
and which ones can use the simpler lock/unlock mechanisms), so I
could use some advice/guidance there.
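
To illustrate the difference between the two families of calls, here
is a toy userspace model, not kernel code and not taken from the
patch.  It tracks only a single "local interrupts enabled" flag and
ignores the lock itself; ToyCpu and the helper names are invented for
this sketch.  What it models is the documented behaviour: the
*_unlock_irq calls re-enable interrupts unconditionally, while the
*_unlock_irqrestore calls put back whatever state was saved.

```python
class ToyCpu:
    """Toy model of one CPU's local interrupt-enable flag (not kernel code)."""

    def __init__(self):
        self.irqs_enabled = True

    # Models *_lock_irq() / *_unlock_irq(): disable interrupts on lock,
    # unconditionally enable them again on unlock.
    def lock_irq(self):
        self.irqs_enabled = False

    def unlock_irq(self):
        self.irqs_enabled = True

    # Models *_lock_irqsave() / *_unlock_irqrestore(): remember the
    # previous state in `flags` and put it back on unlock.
    def lock_irqsave(self):
        flags = self.irqs_enabled
        self.irqs_enabled = False
        return flags

    def unlock_irqrestore(self, flags):
        self.irqs_enabled = flags


def inner_with_plain_irq(cpu):
    """Hypothetical callee using the plain _irq variants."""
    cpu.lock_irq()
    # ... critical section would go here ...
    cpu.unlock_irq()


def inner_with_irqsave(cpu):
    """Hypothetical callee using the save/restore variants."""
    flags = cpu.lock_irqsave()
    # ... critical section would go here ...
    cpu.unlock_irqrestore(flags)


def irqs_enabled_after_inner(inner):
    """Call `inner` from a caller that already disabled interrupts and
    report whether interrupts are enabled when `inner` returns."""
    cpu = ToyCpu()
    outer_flags = cpu.lock_irqsave()      # caller disables interrupts
    inner(cpu)
    state_seen_by_caller = cpu.irqs_enabled
    cpu.unlock_irqrestore(outer_flags)
    return state_seen_by_caller
```

In this model irqs_enabled_after_inner(inner_with_plain_irq) reports
that interrupts came back on while the caller still expected them to
be off, whereas the irqsave version leaves the caller's state alone.
Whether that is what actually goes wrong in the single-MSI-vector
path is exactly the open question above.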

Mitchell - if you want to grab this patch and try it, let me know how
it goes and I can clean it up for the mailing list:
https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
(apply to ath11k-qca6390-bringup-202011301608)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-06 21:45   ` wi nk
@ 2020-12-07  1:17     ` wi nk
  2020-12-07 14:45       ` Mitchell Nordine
  2020-12-09 15:35     ` Kalle Valo
  1 sibling, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-07  1:17 UTC (permalink / raw)
  To: Mitchell Nordine, Kalle Valo; +Cc: ath11k

On Sun, Dec 6, 2020 at 10:45 PM wi nk <wink@technolu.st> wrote:
>
> On Sun, Dec 6, 2020 at 6:53 PM wi nk <wink@technolu.st> wrote:
> >
> > On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> > <mail@mitchellnordine.com> wrote:
> > >
> > > I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> > >
> > > FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that the Bluetooth impacting raciness they observed was just a coincidence.
> > >
> > > Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> > >
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Sunday, December 6, 2020 6:00 PM, <ath11k-request@lists.infradead.org> wrote:
> > >
> > > > Send ath11k mailing list submissions to
> > > > ath11k@lists.infradead.org
> > > >
> > > > To subscribe or unsubscribe via the World Wide Web, visit
> > > > http://lists.infradead.org/mailman/listinfo/ath11k
> > > > or, via email, send a message with subject or body 'help' to
> > > > ath11k-request@lists.infradead.org
> > > >
> > > > You can reach the person managing the list at
> > > > ath11k-owner@lists.infradead.org
> > > >
> > > > When replying, please edit your Subject line so it is more specific
> > > > than "Re: Contents of ath11k digest..."
> > > >
> > > > Today's Topics:
> > > >
> > > > 1.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > > 2.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > >
> > > >
> > > > Message: 1
> > > > Date: Sat, 5 Dec 2020 20:17:10 +0100
> > > > From: wi nk wink@technolu.st
> > > > To: Kalle Valo kvalo@codeaurora.org
> > > > Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > > Message-ID:
> > > > CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A@mail.gmail.com
> > > > Content-Type: text/plain; charset="UTF-8"
> > > >
> > > > On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > > >
> > > > > On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > > > >
> > > > > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > > > > >
> > > > > > > Hi Wi and Thomas,
> > > > > > > I'll start a new thread about problems on XPS 13. The information is
> > > > > > > scattered to different threads and hard to find everything, it's much
> > > > > > > easier to have everything in one place. So let's continue the discussion
> > > > > > > about the kernel crashes on this thread.
> > > > > > > Here's what I have understood so far:
> > > > > > >
> > > > > > > -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > > > >     with 32 MSI vectors.
> > > > > > >
> > > > > > > -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > > > >
> > > > > > >
> > > > > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > > > >
> > > > > > > -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > > > >     13. We added a hack to ath11k make it work with only vector and after
> > > > > > >     that it's possible to boot the firmware, connect to the AP and use the
> > > > > > >     device for a while.
> > > > > > >
> > > > > > > -   But the problem now is that the kernel is crashing almost immediately
> > > > > > >     and almost every time(?). And these crashes only happen on Dell XPS
> > > > > > >     13, all other systems (including Dell XPS 15) seem to work without
> > > > > > >     issues.
> > > > > > >
> > > > > > >
> > > > > > > Is my understanding correct? Did I miss anything?
> > > > > > > About the symptoms Wi reports:
> > > > > > >
> > > > > > > So up until this point, everything is working without issues.
> > > > > > > Everything seems to spiral out of control a couple of seconds later
> > > > > > > when my system attempts to actually bring up the adapter. In most of
> > > > > > > the crash states I will see this:
> > > > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > > > [ 31.391928] wlp85s0: authenticated
> > > > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > > > > > (capab=0x411 status=0 aid=6)
> > > > > > > [ 31.407730] wlp85s0: associated
> > > > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > > > timestamp):
> > > > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > > > extract this data other than screenshots from my phone at the moment,
> > > > > > > you can see the dmesg output from 6 different hangs here:
> > > > > > >
> > > > > > > https://github.com/w1nk/ath11k-debug
> > > > > > >
> > > > > > > -------------------------------------
> > > > > > >
> > > > > > > And Thomas Krause reports:
> > > > > > >
> > > > > > > I can confirm this behavior on my configuration. I managed to login
> > > > > > > once and select the Wifi and connect to it. It seemed curiously enough
> > > > > > > be stable long enough to enter the Wifi passphrase. After the
> > > > > > > connection was established, the system hang and on each attempt to
> > > > > > > reboot into the graphical system it would freeze at some point
> > > > > > > (sometimes even before showing the login screen).
> > > > > > >
> > > > > > > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > > > > >
> > > > > > > --
> > > > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > > > >
> > > > > > Hi Kalle,
> > > > > > Again, thanks much for your work. I think you've summarized
> > > > > > everything up until this point. On my XPS 13 9310 The behavior of the
> > > > > > RT throttling still exists for me occasionally on loading the
> > > > > > driver/associating with an AP. The throttling consistently occurs
> > > > > > after a few sets of the MHI debug printing showing the EE entering an
> > > > > > invalid state ( AMSS -> INVALID_EE ). I'm now building the latest tag
> > > > > > to see if there are any differences.
> > > > > > Thanks!
> > > > >
> > > > > Just to follow up, the first boot resulted in the RT throttling
> > > > > message as the adapter was coming up/associating, shortly after the
> > > > > firmware crashed and the kernel didn't fully freeze, but I needed to(
> > > > > reboot to bring the adapter back.
> > > >
> > > > Kalle -
> > > >
> > > > I've noticed one additional behavior that may give someone with
> > > > familiarity with the QCA hardware a clue. I'm running
> > > > ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > > whatever reason, having the bluetooth subsystem enabled (with a paired
> > > > device) on this dell basically guarantees I'll hit the scheduler
> > > > throttling issue as the ath11k driver is initializing / associating.
> > > > The bluetooth system is using the btqca driver. I don't have any
> > > > useful debugging (I'll gladly collect some if there is a way to do it)
> > > > other than tracking some simple statistics. I booted my system 20
> > > > times, 10 times with bluetooth enabled ((and some headphones turned on
> > > > ready to pair), and 10 times without. In both scenarios, I'm booting
> > > > into X and manually modprobing the ath11k driver. The difference is
> > > > that with bluetooth on and by the time I modprobe the driver, the
> > > > headphones are paired and I received the throttling message and
> > > > subsequent freezing 10/10 times. With bluetooth off / my headphones
> > > > not paired, I only saw it 2/10. I know it's not much hard information
> > > > but it's reliably reproducible for me, is there anything useful I can
> > > > collect?
> > > >
> > > >
> > > > -------------------------------------
> > > >
> > > > Message: 2
> > > > Date: Sun, 6 Dec 2020 09:05:57 +0100
> > > > From: wi nk wink@technolu.st
> > > > To: Kalle Valo kvalo@codeaurora.org
> > > > Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > > Message-ID:
> > > > CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q@mail.gmail.com
> > > > Content-Type: text/plain; charset="UTF-8"
> > > >
> > > > On Sat, Dec 5, 2020 at 8:17 PM wi nk wink@technolu.st wrote:
> > > >
> > > > > On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > > > >
> > > > > > On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > > > > >
> > > > > > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > > > > > >
> > > > > > > > Hi Wi and Thomas,
> > > > > > > > I'll start a new thread about problems on XPS 13. The information is
> > > > > > > > scattered to different threads and hard to find everything, it's much
> > > > > > > > easier to have everything in one place. So let's continue the discussion
> > > > > > > > about the kernel crashes on this thread.
> > > > > > > > Here's what I have understood so far:
> > > > > > > >
> > > > > > > > -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > > > > >     with 32 MSI vectors.
> > > > > > > >
> > > > > > > > -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > > > > >
> > > > > > > >
> > > > > > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > > > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > > > > >
> > > > > > > > -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > > > > >     13. We added a hack to ath11k make it work with only vector and after
> > > > > > > >     that it's possible to boot the firmware, connect to the AP and use the
> > > > > > > >     device for a while.
> > > > > > > >
> > > > > > > > -   But the problem now is that the kernel is crashing almost immediately
> > > > > > > >     and almost every time(?). And these crashes only happen on Dell XPS
> > > > > > > >     13, all other systems (including Dell XPS 15) seem to work without
> > > > > > > >     issues.
> > > > > > > >
> > > > > > > >
> > > > > > > > Is my understanding correct? Did I miss anything?
> > > > > > > > About the symptoms Wi reports:
> > > > > > > >
> > > > > > > > So up until this point, everything is working without issues.
> > > > > > > > Everything seems to spiral out of control a couple of seconds later
> > > > > > > > when my system attempts to actually bring up the adapter. In most of
> > > > > > > > the crash states I will see this:
> > > > > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > > > > [ 31.391928] wlp85s0: authenticated
> > > > > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > > > > > > (capab=0x411 status=0 aid=6)
> > > > > > > > [ 31.407730] wlp85s0: associated
> > > > > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > > > > timestamp):
> > > > > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > > > > extract this data other than screenshots from my phone at the moment,
> > > > > > > > you can see the dmesg output from 6 different hangs here:
> > > > > > > >
> > > > > > > > https://github.com/w1nk/ath11k-debug
> > > > > > > >
> > > > > > > > -------------------------------------
> > > > > > > >
> > > > > > > > And Thomas Krause reports:
> > > > > > > >
> > > > > > > > I can confirm this behavior on my configuration. I managed to login
> > > > > > > > once and select the Wifi and connect to it. It seemed curiously enough
> > > > > > > > be stable long enough to enter the Wifi passphrase. After the
> > > > > > > > connection was established, the system hang and on each attempt to
> > > > > > > > reboot into the graphical system it would freeze at some point
> > > > > > > > (sometimes even before showing the login screen).
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > > > > > >
> > > > > > > > --
> > > > > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > > > > >
> > > > > > > Hi Kalle,
> > > > > > > Again, thanks much for your work. I think you've summarized
> > > > > > > everything up until this point. On my XPS 13 9310 The behavior of the
> > > > > > > RT throttling still exists for me occasionally on loading the
> > > > > > > driver/associating with an AP. The throttling consistently occurs
> > > > > > > after a few sets of the MHI debug printing showing the EE entering an
> > > > > > > invalid state ( AMSS -> INVALID_EE ). I'm now building the latest tag
> > > > > > > to see if there are any differences.
> > > > > > > Thanks!
> > > > > >
> > > > > > Just to follow up, the first boot resulted in the RT throttling
> > > > > > message as the adapter was coming up/associating, shortly after the
> > > > > > firmware crashed and the kernel didn't fully freeze, but I needed to(
> > > > > > reboot to bring the adapter back.
> > > > >
> > > > > Kalle -
> > > > > I've noticed one additional behavior that may give someone with
> > > > > familiarity with the QCA hardware a clue. I'm running
> > > > > ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > > > whatever reason, having the bluetooth subsystem enabled (with a paired
> > > > > device) on this dell basically guarantees I'll hit the scheduler
> > > > > throttling issue as the ath11k driver is initializing / associating.
> > > > > The bluetooth system is using the btqca driver. I don't have any
> > > > > useful debugging (I'll gladly collect some if there is a way to do it)
> > > > > other than tracking some simple statistics. I booted my system 20
> > > > > times, 10 times with bluetooth enabled ((and some headphones turned on
> > > > > ready to pair), and 10 times without. In both scenarios, I'm booting
> > > > > into X and manually modprobing the ath11k driver. The difference is
> > > > > that with bluetooth on and by the time I modprobe the driver, the
> > > > > headphones are paired and I received the throttling message and
> > > > > subsequent freezing 10/10 times. With bluetooth off / my headphones
> > > > > not paired, I only saw it 2/10. I know it's not much hard information
> > > > > but it's reliably reproducible for me, is there anything useful I can
> > > > > collect?
> > > >
> > > > Well unfortunately I think the bluetooth was just a red herring in the
> > > > racing. To chase that, I disabled all bluetooth and was able to get
> > > > into a state where I had 6 failed boots in a row. To further poke
> > > > around, I rebuilt the kernel with localmodconfig to disable building
> > > > big chunks of things. This kernel is way less stable and seems to
> > > > freeze most of the time (but does occasionally remain stable), I'm not
> > > > sure what else got disabled in there, but it seems to have had a
> > > > negative impact on the crash racing.
> > > >
> > > >
> > > > -------------------------------------
> > > >
> > > > Subject: Digest Footer
> > > >
> > > > ath11k mailing list
> > > > ath11k@lists.infradead.org
> > > > http://lists.infradead.org/mailman/listinfo/ath11k
> > > >
> > > >
> > > > -------------------------------------
> > > >
> > > > End of ath11k Digest, Vol 7, Issue 5
> > >
> > >
> > >
> > > --
> > > ath11k mailing list
> > > ath11k@lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/ath11k
> >
> > Hey Mitchell,
> >
> >    One more thing to try that may help us get a little bit of extra
> > info.  Out of everything I've done, something that has remained
> > consistent is to enable the MHI debugging as Kalle suggested:
> >
> > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> >
> >   Before any crash/spinlock, I see the MHI printing (from
> > drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
> > and then after a number more iterations through this function, things
> > finally go out of control.  So from
> >
> >         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
> >                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
> >                 TO_MHI_STATE_STR(state));
> >
> > I'll see something like this:
> >
> > [  312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
> > [  313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device
> > ee:INVALID_EE dev_state:SYS_ERR
> >
> > Then after a few of those prints showing SYS_ERR, either a spinlock or
> > a firmware crash.  I'm not sure what causes this ee state to go
> > invalid, but maybe that's worth looking into.  Can you confirm the
> > same behavior?  To see this a little easier, I also run dmesg -wH in
> > two windows, one piping to | grep -v mhi (to filter out the mhi
> > debugging).
> >
> > Thanks!
>
> So I've managed to stabilise my system now, so either the race is
> gone, or I've done something to win it all the time.  So one of the
> avenues of racing I was chasing at first was in the ath11k driver
> itself.  There are a couple areas where the single/shared IRQ is being
> forcibly toggled in ways that the documentation says are not great
> (and the original patch was trying to avoid).  Fixing those didn't
> seem to have much impact on the stability of things (I've included
> those changes in my patch though).  After the last email I was
> thinking about the MHI side of things a bit more and found a number of
> call sites that my naive grepping had missed that do the same thing,
> but via acquiring a lock at the same time.  I modified all the calls
> to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> variants that accept the flags parameter to capture state.  I've now
> booted and loaded the driver 10+ times without a single freeze or
> crash.  I'm not sure all of those modifications are necessary (ie:
> which things are re-entrant in this single interrupt operating mode vs
> which ones can use the simpler lock/unlock mechanisms), so I could use
> some advice/guidance there.
>
> Mitchell - if you want to grab this patch and try it, let me know how
> it goes and I can clean it up for the mailing list:
> https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> (apply to ath11k-qca6390-bringup-202011301608)

Blindly chasing the crashing, I've found one more probably relevant
piece of information.  As I was playing around trying to see if I had
actually stopped the racing, I noticed my battery was low.  I plugged
it in and immediately got the RT throttling crash.  I've now tried this
quite a few times, and on battery I don't see the crashing.  My guess is
that dynamic CPU clocking changes some of the racing properties.
When I bring everything up on battery and wait around a bit, plugging
in the USB-C cable will often trigger the RT throttling message within
a few seconds.  I poked a little at some of the wifi power management
settings, specifically trying to disable them, but nothing I changed
seemed to make a difference yet.  I can essentially use the power cable
as a trigger for this race though.

Kalle - are you aware of anything that happens to the driver/adapter
when AC power is connected?  I think I see some power saving stuff in
wmi.c but I haven't gotten deep enough to know...



* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-07  1:17     ` wi nk
@ 2020-12-07 14:45       ` Mitchell Nordine
  2020-12-07 17:01         ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: Mitchell Nordine @ 2020-12-07 14:45 UTC (permalink / raw)
  To: wi nk, ath11k

Thanks for sending through this patch Wink.

I built and installed the ath11k-qca6390-bringup branch with your patch last night on my Dell XPS 13 9310 running NixOS. I have only tested the patch 6 times so far. The startup sequence seems more reliable: I was able to enable the adapter and connect to my router every time. However, each time my system would still freeze a few minutes later, with mouse input stuttering for a moment before freezing completely.

I tested on battery twice to check your theory w.r.t. power management, but did not notice any difference in behaviour.

> > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"

I tried running this but haven't noticed any difference in the output I'm observing in `dmesg` or `journalctl`. There's a chance that I need to do this another way on NixOS, as most things, including the kernel and its configuration, are built and configured declaratively. I'll try to work this out next time I get the chance to have a longer testing session.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, December 7, 2020 2:17 AM, wi nk <wink@technolu.st> wrote:

> On Sun, Dec 6, 2020 at 10:45 PM wi nk wink@technolu.st wrote:
>
> > On Sun, Dec 6, 2020 at 6:53 PM wi nk wink@technolu.st wrote:
> >
> > > On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> > > mail@mitchellnordine.com wrote:
> > >
> > > > I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> > > > FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that the Bluetooth impacting raciness they observed was just a coincidence.
> > > > Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > On Sunday, December 6, 2020 6:00 PM, ath11k-request@lists.infradead.org wrote:
> > > >
> > > > > Send ath11k mailing list submissions to
> > > > > ath11k@lists.infradead.org
> > > > > To subscribe or unsubscribe via the World Wide Web, visit
> > > > > http://lists.infradead.org/mailman/listinfo/ath11k
> > > > > or, via email, send a message with subject or body 'help' to
> > > > > ath11k-request@lists.infradead.org
> > > > > You can reach the person managing the list at
> > > > > ath11k-owner@lists.infradead.org
> > > > > When replying, please edit your Subject line so it is more specific
> > > > > than "Re: Contents of ath11k digest..."
> > > > > Today's Topics:
> > > > >
> > > > > 1.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > > > 2.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > > >
> > > > > Message: 1
> > > > > Date: Sat, 5 Dec 2020 20:17:10 +0100
> > > > > From: wi nk wink@technolu.st
> > > > > To: Kalle Valo kvalo@codeaurora.org
> > > > > Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > > > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > > > Message-ID:
> > > > > CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A@mail.gmail.com
> > > > > Content-Type: text/plain; charset="UTF-8"
> > > > > On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > > > >
> > > > > > On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > > > > >
> > > > > > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > > > > > >
> > > > > > > > Hi Wi and Thomas,
> > > > > > > > I'll start a new thread about problems on XPS 13. The information is
> > > > > > > > scattered to different threads and hard to find everything, it's much
> > > > > > > > easier to have everything in one place. So let's continue the discussion
> > > > > > > > about the kernel crashes on this thread.
> > > > > > > > Here's what I have understood so far:
> > > > > > > >
> > > > > > > > -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > > > > >     with 32 MSI vectors.
> > > > > > > >
> > > > > > > > -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > > > > >
> > > > > > > >
> > > > > > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > > > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > > > > >
> > > > > > > > -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > > > > >     13. We added a hack to ath11k make it work with only vector and after
> > > > > > > >     that it's possible to boot the firmware, connect to the AP and use the
> > > > > > > >     device for a while.
> > > > > > > >
> > > > > > > > -   But the problem now is that the kernel is crashing almost immediately
> > > > > > > >     and almost every time(?). And these crashes only happen on Dell XPS
> > > > > > > >     13, all other systems (including Dell XPS 15) seem to work without
> > > > > > > >     issues.
> > > > > > > >
> > > > > > > >
> > > > > > > > Is my understanding correct? Did I miss anything?
> > > > > > > > About the symptoms Wi reports:
> > > > > > > > So up until this point, everything is working without issues.
> > > > > > > > Everything seems to spiral out of control a couple of seconds later
> > > > > > > > when my system attempts to actually bring up the adapter. In most of
> > > > > > > > the crash states I will see this:
> > > > > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > > > > [ 31.391928] wlp85s0: authenticated
> > > > > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > > > > > > (capab=0x411 status=0 aid=6)
> > > > > > > > [ 31.407730] wlp85s0: associated
> > > > > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > > > > timestamp):
> > > > > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > > > > extract this data other than screenshots from my phone at the moment,
> > > > > > > > you can see the dmesg output from 6 different hangs here:
> > > > > > > > https://github.com/w1nk/ath11k-debug
> > > > > > > >
> > > > > > > > And Thomas Krause reports:
> > > > > > > > I can confirm this behavior on my configuration. I managed to login
> > > > > > > > once and select the Wifi and connect to it. It seemed curiously enough
> > > > > > > > be stable long enough to enter the Wifi passphrase. After the
> > > > > > > > connection was established, the system hang and on each attempt to
> > > > > > > > reboot into the graphical system it would freeze at some point
> > > > > > > > (sometimes even before showing the login screen).
> > > > > > > >
> > > > > > > > --
> > > > > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > > > > >
&gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Kalle,
&gt; &gt; &gt; &gt; &gt; &gt; &gt; Again, thanks much for your work. I think you've summarized
&gt; &gt; &gt; &gt; &gt; &gt; &gt; everything up until this point. On my XPS 13 9310 The behavior of the
&gt; &gt; &gt; &gt; &gt; &gt; &gt; RT throttling still exists for me occasionally on loading the
&gt; &gt; &gt; &gt; &gt; &gt; &gt; driver/associating with an AP. The throttling consistently occurs
&gt; &gt; &gt; &gt; &gt; &gt; &gt; after a few sets of the MHI debug printing showing the EE entering an
&gt; &gt; &gt; &gt; &gt; &gt; &gt; invalid state ( AMSS -&gt; INVALID_EE ). I'm now building the latest tag
&gt; &gt; &gt; &gt; &gt; &gt; &gt; to see if there are any differences.
&gt; &gt; &gt; &gt; &gt; &gt; &gt; Thanks!
&gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; Just to follow up, the first boot resulted in the RT throttling
&gt; &gt; &gt; &gt; &gt; &gt; message as the adapter was coming up/associating, shortly after the
&gt; &gt; &gt; &gt; &gt; &gt; firmware crashed and the kernel didn't fully freeze, but I needed to(
&gt; &gt; &gt; &gt; &gt; &gt; reboot to bring the adapter back.
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; Kalle -
&gt; &gt; &gt; &gt; &gt; I've noticed one additional behavior that may give someone with
&gt; &gt; &gt; &gt; &gt; familiarity with the QCA hardware a clue. I'm running
&gt; &gt; &gt; &gt; &gt; ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
&gt; &gt; &gt; &gt; &gt; whatever reason, having the bluetooth subsystem enabled (with a paired
&gt; &gt; &gt; &gt; &gt; device) on this dell basically guarantees I'll hit the scheduler
&gt; &gt; &gt; &gt; &gt; throttling issue as the ath11k driver is initializing / associating.
&gt; &gt; &gt; &gt; &gt; The bluetooth system is using the btqca driver. I don't have any
&gt; &gt; &gt; &gt; &gt; useful debugging (I'll gladly collect some if there is a way to do it)
&gt; &gt; &gt; &gt; &gt; other than tracking some simple statistics. I booted my system 20
&gt; &gt; &gt; &gt; &gt; times, 10 times with bluetooth enabled (and some headphones turned on
&gt; &gt; &gt; &gt; &gt; ready to pair), and 10 times without. In both scenarios, I'm booting
&gt; &gt; &gt; &gt; &gt; into X and manually modprobing the ath11k driver. The difference is
&gt; &gt; &gt; &gt; &gt; that with bluetooth on and by the time I modprobe the driver, the
&gt; &gt; &gt; &gt; &gt; headphones are paired and I received the throttling message and
&gt; &gt; &gt; &gt; &gt; subsequent freezing 10/10 times. With bluetooth off / my headphones
&gt; &gt; &gt; &gt; &gt; not paired, I only saw it 2/10. I know it's not much hard information
&gt; &gt; &gt; &gt; &gt; but it's reliably reproducible for me, is there anything useful I can
&gt; &gt; &gt; &gt; &gt; collect?
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; Message: 2
&gt; &gt; &gt; &gt; &gt; Date: Sun, 6 Dec 2020 09:05:57 +0100
&gt; &gt; &gt; &gt; &gt; From: wi nk wink@technolu.st
&gt; &gt; &gt; &gt; &gt; To: Kalle Valo kvalo@codeaurora.org
&gt; &gt; &gt; &gt; &gt; Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
&gt; &gt; &gt; &gt; &gt; Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
&gt; &gt; &gt; &gt; &gt; Message-ID:
&gt; &gt; &gt; &gt; &gt; CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q@mail.gmail.com
&gt; &gt; &gt; &gt; &gt; Content-Type: text/plain; charset="UTF-8"
&gt; &gt; &gt; &gt; &gt; On Sat, Dec 5, 2020 at 8:17 PM wi nk wink@technolu.st wrote:
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
&gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
&gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Wi and Thomas,
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I'll start a new thread about problems on XPS 13. The information is
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; scattered to different threads and hard to find everything, it's much
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; easier to have everything in one place. So let's continue the discussion
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; about the kernel crashes on this thread.
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Here's what I have understood so far:
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     with 32 MSI vectors.
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 13 there's a BIOS bug and kernel prints:
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13. We added a hack to ath11k to make it work with only one vector and after
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     that it's possible to boot the firmware, connect to the AP and use the
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     device for a while.
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   But the problem now is that the kernel is crashing almost immediately
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     and almost every time(?). And these crashes only happen on Dell XPS
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13, all other systems (including Dell XPS 15) seem to work without
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     issues.
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Is my understanding correct? Did I miss anything?
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; About the symptoms Wi reports:
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; So up until this point, everything is working without issues.
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Everything seems to spiral out of control a couple of seconds later
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; when my system attempts to actually bring up the adapter. In most of
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; the crash states I will see this:
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.391928] wlp85s0: authenticated
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (capab=0x411 status=0 aid=6)
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.407730] wlp85s0: associated
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And then either somewhere in that pile of messages, or a second or two
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; after this my machine will start to stutter as I mentioned before, and
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; then it either hangs, or I see this message (I'm truncating the
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; timestamp):
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 35.xxxx ] sched: RT throttling activated
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; After that moment, the machine is unresponsive. Sorry I can't seem to
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; extract this data other than screenshots from my phone at the moment,
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; you can see the dmesg output from 6 different hangs here:
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://github.com/w1nk/ath11k-debug
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And Thomas Krause reports:
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I can confirm this behavior on my configuration. I managed to login
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; once and select the Wifi and connect to it. It seemed curiously enough
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; be stable long enough to enter the Wifi passphrase. After the
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; connection was established, the system hung and on each attempt to
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; reboot into the graphical system it would freeze at some point
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (sometimes even before showing the login screen).
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; --
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://patchwork.kernel.org/project/linux-wireless/list/
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Kalle,
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Again, thanks much for your work. I think you've summarized
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; everything up until this point. On my XPS 13 9310 the behavior of the
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; RT throttling still exists for me occasionally on loading the
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; driver/associating with an AP. The throttling consistently occurs
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; after a few sets of the MHI debug printing showing the EE entering an
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; invalid state (AMSS -> INVALID_EE). I'm now building the latest tag
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; to see if there are any differences.
&gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Thanks!
&gt; &gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; &gt; Just to follow up, the first boot resulted in the RT throttling
&gt; &gt; &gt; &gt; &gt; &gt; &gt; message as the adapter was coming up/associating. Shortly after, the
&gt; &gt; &gt; &gt; &gt; &gt; &gt; firmware crashed; the kernel didn't fully freeze, but I needed to
&gt; &gt; &gt; &gt; &gt; &gt; &gt; reboot to bring the adapter back.
&gt; &gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; &gt; Kalle -
&gt; &gt; &gt; &gt; &gt; &gt; I've noticed one additional behavior that may give someone with
&gt; &gt; &gt; &gt; &gt; &gt; familiarity with the QCA hardware a clue. I'm running
&gt; &gt; &gt; &gt; &gt; &gt; ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
&gt; &gt; &gt; &gt; &gt; &gt; whatever reason, having the bluetooth subsystem enabled (with a paired
&gt; &gt; &gt; &gt; &gt; &gt; device) on this dell basically guarantees I'll hit the scheduler
&gt; &gt; &gt; &gt; &gt; &gt; throttling issue as the ath11k driver is initializing / associating.
&gt; &gt; &gt; &gt; &gt; &gt; The bluetooth system is using the btqca driver. I don't have any
&gt; &gt; &gt; &gt; &gt; &gt; useful debugging (I'll gladly collect some if there is a way to do it)
&gt; &gt; &gt; &gt; &gt; &gt; other than tracking some simple statistics. I booted my system 20
&gt; &gt; &gt; &gt; &gt; &gt; times, 10 times with bluetooth enabled (and some headphones turned on
&gt; &gt; &gt; &gt; &gt; &gt; ready to pair), and 10 times without. In both scenarios, I'm booting
&gt; &gt; &gt; &gt; &gt; &gt; into X and manually modprobing the ath11k driver. The difference is
&gt; &gt; &gt; &gt; &gt; &gt; that with bluetooth on and by the time I modprobe the driver, the
&gt; &gt; &gt; &gt; &gt; &gt; headphones are paired and I received the throttling message and
&gt; &gt; &gt; &gt; &gt; &gt; subsequent freezing 10/10 times. With bluetooth off / my headphones
&gt; &gt; &gt; &gt; &gt; &gt; not paired, I only saw it 2/10. I know it's not much hard information
&gt; &gt; &gt; &gt; &gt; &gt; but it's reliably reproducible for me, is there anything useful I can
&gt; &gt; &gt; &gt; &gt; &gt; collect?
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; Well unfortunately I think the bluetooth was just a red herring in the
&gt; &gt; &gt; &gt; &gt; racing. To chase that, I disabled all bluetooth and was able to get
&gt; &gt; &gt; &gt; &gt; into a state where I had 6 failed boots in a row. To further poke
&gt; &gt; &gt; &gt; &gt; around, I rebuilt the kernel with localmodconfig to disable building
&gt; &gt; &gt; &gt; &gt; big chunks of things. This kernel is way less stable and seems to
&gt; &gt; &gt; &gt; &gt; freeze most of the time (but does occasionally remain stable), I'm not
&gt; &gt; &gt; &gt; &gt; sure what else got disabled in there, but it seems to have had a
&gt; &gt; &gt; &gt; &gt; negative impact on the crash racing.
&gt; &gt; &gt; &gt; &gt;
&gt; &gt; &gt; &gt; &gt; End of ath11k Digest, Vol 7, Issue 5
&gt; &gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; Hey Mitchell,
&gt; &gt; &gt; One more thing to try that may help us get a little bit of extra
&gt; &gt; &gt; info. Out of everything I've done, something that has remained
&gt; &gt; &gt; consistent is to enable the MHI debugging as Kalle suggested:
&gt; &gt; &gt; sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
&gt; &gt; &gt; Before any crash/spinlock, I see the MHI printing (from
&gt; &gt; &gt; drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
&gt; &gt; &gt; and then after a number more iterations through this function, things
&gt; &gt; &gt; finally go out of control. So from
&gt; &gt; &gt;
&gt; &gt; &gt;         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
&gt; &gt; &gt;                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
&gt; &gt; &gt;                 TO_MHI_STATE_STR(state));
&gt; &gt; &gt;
&gt; &gt; &gt;
&gt; &gt; &gt; I'll see something like this:
&gt; &gt; &gt; [ 312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
&gt; &gt; &gt; [ 313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device ee:INVALID_EE dev_state:SYS_ERR
&gt; &gt; &gt; Then after a few of those prints showing SYS_ERR, either a spinlock or
&gt; &gt; &gt; a firmware crash. I'm not sure what causes this ee state to go
&gt; &gt; &gt; invalid, but maybe that's worth looking into. Can you confirm the
&gt; &gt; &gt; same behavior? To see this a little easier, I also run dmesg -wH in
&gt; &gt; &gt; two windows, one piping to | grep -v mhi (to filter out the mhi
&gt; &gt; &gt; debugging).
&gt; &gt; &gt; Thanks!
&gt; &gt;
&gt; &gt; So I've managed to stabilise my system now, so either the race is
&gt; &gt; gone, or I've done something to win it all the time. So one of the
&gt; &gt; avenues of racing I was chasing at first was in the ath11k driver
&gt; &gt; itself. There are a couple areas where the single/shared IRQ is being
&gt; &gt; forcibly toggled in ways that the documentation says are not great
&gt; &gt; (and the original patch was trying to avoid). Fixing those didn't
&gt; &gt; seem to have much impact on the stability of things (I've included
&gt; &gt; those changes in my patch though). After the last email I was
&gt; &gt; thinking about the MHI side of things a bit more and found a number of
&gt; &gt; call sites that my naive grepping had missed that do the same thing,
&gt; &gt; but via acquiring a lock at the same time. I modified all the calls
&gt; &gt; to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
&gt; &gt; variants that accept the flags parameter to capture state. I've now
&gt; &gt; booted and loaded the driver 10+ times without a single freeze or
&gt; &gt; crash. I'm not sure all of those modifications are necessary (ie:
&gt; &gt; which things are re-entrant in this single interrupt operating mode vs
&gt; &gt; which ones can use the simpler lock/unlock mechanisms), so I could use
&gt; &gt; some advice/guidance there.
&gt; &gt; Mitchell - if you want to grab this patch and try it, let me know how
&gt; &gt; it goes and I can clean it up for the mailing list:
&gt; &gt; https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
&gt; &gt; (apply to ath11k-qca6390-bringup-202011301608)
&gt;
&gt; Blindly chasing the crashing, I've found one more probably relevant
&gt; piece of information. As I was playing around trying to see if I had
&gt; actually stopped the racing, I noticed my battery was low. I plugged
&gt; it in and immediately received the RT throttling crash. I've now tried
&gt; quite a bit, and on the battery I don't see the crashing. I thought
&gt; maybe dynamic CPU clocking is changing some of the racing properties.
&gt; When I bring everything up on the battery and wait around a bit, once
&gt; I plug in the usb-c cable, within a few seconds it will often trigger
&gt; the RT throttling message. I poked a little bit at some of the wifi
&gt; power management settings, specifically trying to disable them, but I
&gt; didn't seem to kick anything relevant yet. I can essentially use the
&gt; power cable as a trigger for this race though..
&gt;
&gt; Kalle - are you aware of anything that happens to the driver/adapter
&gt; when ac power shows up? I think I see some power saving stuff in
&gt; wmi.c but I haven't gotten deep enough to know...
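
For reference, the MHI check described in the quoted messages above can be sketched as a small shell filter. This is only a sketch under assumptions: the function name `mhi_ee_errors` and the file name `boot.log` are made up for illustration, MHI dynamic debug is assumed to be enabled already, and each state print is assumed to arrive as a single log line in the `local ee:... device ee:... dev_state:...` format quoted above.

```shell
#!/bin/sh
# Hypothetical helper: keep only MHI debug lines that report an invalid
# execution environment (INVALID_EE) or a SYS_ERR device state.
# Reads kernel log text on stdin and writes matching lines to stdout.
mhi_ee_errors() {
    grep -E 'mhi .*(ee:INVALID_EE|dev_state:SYS_ERR)'
}

# Example usage (not run here; boot.log is a hypothetical file name):
#   dmesg > boot.log && mhi_ee_errors < boot.log
#   sudo dmesg -w | mhi_ee_errors
```

If the filter prints nothing while the system still freezes, it may be worth checking first that the MHI debug prints show up in dmesg at all, since the filter can only match lines that dynamic debug actually emits.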


-- 
ath11k mailing list
ath11k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath11k

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-07 14:45       ` Mitchell Nordine
@ 2020-12-07 17:01         ` wi nk
  2020-12-09  1:52           ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-07 17:01 UTC (permalink / raw)
  To: Mitchell Nordine; +Cc: ath11k

On Mon, Dec 7, 2020 at 3:45 PM Mitchell Nordine
<mail@mitchellnordine.com> wrote:
>
> Thanks for sending through this patch Wink.
>
> I built and installed the ath11k-qca6390-bringup branch with your patch last night on my Dell XPS 13 9310 running NixOS. I have only run the patch 6 times. The startup sequence seems more reliable. I was able to successfully enable the adapter and connect to my router each time, however each time my system would eventually freeze a few minutes after. I noticed that mouse input would stutter for a moment before completely freezing.
>
> I tested on battery twice to check your theory w.r.t. power management, but did not notice any difference in behaviour.
>
> > > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
>
> I tried running this but haven't noticed any difference to the output I'm observing in `dmesg` or `journalctl`. There's a chance that there's another way I should be doing this on NixOS as most things including the kernel and its configuration are built and configured declaratively. I'll try and work this out next time I get the chance to have a longer testing session.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, December 7, 2020 2:17 AM, wi nk <wink@technolu.st> wrote:
>
> &gt; On Sun, Dec 6, 2020 at 10:45 PM wi nk wink@technolu.st wrote:
> &gt;
> &gt; &gt; On Sun, Dec 6, 2020 at 6:53 PM wi nk wink@technolu.st wrote:
> &gt; &gt;
> &gt; &gt; &gt; On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> &gt; &gt; &gt; mail@mitchellnordine.com wrote:
> &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> &gt; &gt; &gt; &gt; FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that the Bluetooth impacting raciness they observed was just a coincidence.
> &gt; &gt; &gt; &gt; Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> &gt; &gt; &gt; &gt; ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> &gt; &gt; &gt; &gt; On Sunday, December 6, 2020 6:00 PM, ath11k-request@lists.infradead.org wrote:
> &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; Today's Topics:
> &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; 1.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> &gt; &gt; &gt; &gt; &gt; 2.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; Message: 1
> &gt; &gt; &gt; &gt; &gt; Date: Sat, 5 Dec 2020 20:17:10 +0100
> &gt; &gt; &gt; &gt; &gt; From: wi nk wink@technolu.st
> &gt; &gt; &gt; &gt; &gt; To: Kalle Valo kvalo@codeaurora.org
> &gt; &gt; &gt; &gt; &gt; Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> &gt; &gt; &gt; &gt; &gt; Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> &gt; &gt; &gt; &gt; &gt; Message-ID:
> &gt; &gt; &gt; &gt; &gt; CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A@mail.gmail.com
> &gt; &gt; &gt; &gt; &gt; Content-Type: text/plain; charset="UTF-8"
> &gt; &gt; &gt; &gt; &gt; On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Wi and Thomas,
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I'll start a new thread about problems on XPS 13. The information is
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; scattered to different threads and hard to find everything, it's much
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; easier to have everything in one place. So let's continue the discussion
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; about the kernel crashes on this thread.
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Here's what I have understood so far:
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     with 32 MSI vectors.
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13. We added a hack to ath11k to make it work with only one vector and after
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     that it's possible to boot the firmware, connect to the AP and use the
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     device for a while.
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   But the problem now is that the kernel is crashing almost immediately
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     and almost every time(?). And these crashes only happen on Dell XPS
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13, all other systems (including Dell XPS 15) seem to work without
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     issues.
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Is my understanding correct? Did I miss anything?
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; About the symptoms Wi reports:
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; So up until this point, everything is working without issues.
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Everything seems to spiral out of control a couple of seconds later
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; when my system attempts to actually bring up the adapter. In most of
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; the crash states I will see this:
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.391928] wlp85s0: authenticated
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (capab=0x411 status=0 aid=6)
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.407730] wlp85s0: associated
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And then either somewhere in that pile of messages, or a second or two
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; after this my machine will start to stutter as I mentioned before, and
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; then it either hangs, or I see this message (I'm truncating the
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; timestamp):
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 35.xxxx ] sched: RT throttling activated
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; After that moment, the machine is unresponsive. Sorry I can't seem to
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; extract this data other than screenshots from my phone at the moment,
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; you can see the dmesg output from 6 different hangs here:
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://github.com/w1nk/ath11k-debug
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And Thomas Krause reports:
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I can confirm this behavior on my configuration. I managed to login
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; once and select the Wifi and connect to it. It seemed curiously enough
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; be stable long enough to enter the Wifi passphrase. After the
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; connection was established, the system hung and on each attempt to
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; reboot into the graphical system it would freeze at some point
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (sometimes even before showing the login screen).
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; --
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://patchwork.kernel.org/project/linux-wireless/list/
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Kalle,
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; Again, thanks much for your work. I think you've summarized
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; everything up until this point. On my XPS 13 9310 the behavior of the
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; RT throttling still exists for me occasionally on loading the
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; driver/associating with an AP. The throttling consistently occurs
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; after a few sets of the MHI debug printing showing the EE entering an
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; invalid state (AMSS -> INVALID_EE). I'm now building the latest tag
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; to see if there are any differences.
> &gt; &gt; &gt; &gt; &gt; &gt; &gt; Thanks!
> &gt; &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; &gt; Just to follow up, the first boot resulted in the RT throttling
> &gt; &gt; &gt; &gt; &gt; &gt; message as the adapter was coming up/associating. Shortly after, the
> &gt; &gt; &gt; &gt; &gt; &gt; firmware crashed; the kernel didn't fully freeze, but I needed to
> &gt; &gt; &gt; &gt; &gt; &gt; reboot to bring the adapter back.
> &gt; &gt; &gt; &gt; &gt;
> &gt; &gt; &gt; &gt; &gt; Kalle -
> &gt; &gt; &gt; &gt; &gt; I've noticed one additional behavior that may give someone with
> &gt; &gt; &gt; &gt; &gt; familiarity with the QCA hardware a clue. I'm running
> &gt; &gt; &gt; &gt; &gt; ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> &gt; &gt; &gt; &gt; &gt; whatever reason, having the bluetooth subsystem enabled (with a paired
> &gt; &gt; &gt; &gt; &gt; device) on this dell basically guarantees I'll hit the scheduler
> &gt; &gt; &gt; &gt; &gt; throttling issue as the ath11k driver is initializing / associating.
> &gt; &gt; &gt; &gt; &gt; The bluetooth system is using the btqca driver. I don't have any
> &gt; &gt; &gt; &gt; &gt; useful debugging (I'll gladly collect some if there is a way to do it)
> &gt; &gt; &gt; &gt; &gt; other than tracking some simple statistics. I booted my system 20
> &gt; &gt; &gt; &gt; &gt; times, 10 times with bluetooth enabled (and some headphones turned on
> &gt; &gt; &gt; &gt; &gt; ready to pair), and 10 times without. In both scenarios, I'm booting
> &gt; &gt; &gt; &gt; &gt; into X and manually modprobing the ath11k driver. The difference is
> &gt; &gt; &gt; &gt; &gt; that with bluetooth on and by the time I modprobe the driver, the
> &gt; &gt; &gt; &gt; &gt; headphones are paired and I received the throttling message and
> &gt; &gt; &gt; &gt; &gt; subsequent freezing 10/10 times. With bluetooth off / my headphones
> &gt; &gt; &gt; &gt; &gt; not paired, I only saw it 2/10. I know it's not much hard information
> > > > Hey Mitchell,
> > > > One more thing to try that may help us get a little bit of extra
> > > > info. Out of everything I've done, something that has remained
> > > > consistent is to enable the MHI debugging as Kalle suggested:
> > > > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> > > > Before any crash/spinlock, I see the MHI printing (from
> > > > drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
> > > > and then after a number more iterations through this function, things
> > > > finally go out of control. So from
> > > >
> > > >         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
> > > >                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
> > > >                 TO_MHI_STATE_STR(state));
> > > >
> > > > I'll see something like this:
> > > > [ 312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
> > > > [ 313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device
> > > > ee:INVALID_EE dev_state:SYS_ERR
> > > > Then after a few of those prints showing SYS_ERR, either a spinlock or
> > > > a firmware crash. I'm not sure what causes this ee state to go
> > > > invalid, but maybe that's worth looking into. Can you confirm the
> > > > same behavior? To see this a little easier, I also run dmesg -wH in
> > > > two windows, one piping to | grep -v mhi (to filter out the mhi
> > > > debugging).
> > > > Thanks!
> > >
> > > So I've managed to stabilise my system now, so either the race is
> > > gone, or I've done something to win it all the time. So one of the
> > > avenues of racing I was chasing at first was in the ath11k driver
> > > itself. There are a couple areas where the single/shared IRQ is being
> > > forcibly toggled in ways that the documentation says are not great
> > > (and the original patch was trying to avoid). Fixing those didn't
> > > seem to have much impact on the stability of things (I've included
> > > those changes in my patch though). After the last email I was
> > > thinking about the MHI side of things a bit more and found a number of
> > > call sites that my naive grepping had missed that do the same thing,
> > > but via acquiring a lock at the same time. I modified all the calls
> > > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > > variants that accept the flags parameter to capture state. I've now
> > > booted and loaded the driver 10+ times without a single freeze or
> > > crash. I'm not sure all of those modifications are necessary (ie:
> > > which things are re-entrant in this single interrupt operating mode vs
> > > which ones can use the simpler lock/unlock mechanisms), so I could use
> > > some advice/guidance there.
> > > Mitchell - if you want to grab this patch and try it, let me know how
> > > it goes and I can clean it up for the mailing list:
> > > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > > (apply to ath11k-qca6390-bringup-202011301608)
> >
> > Blindly chasing the crashing, I've found one more probably relevant
> > piece of information. As I was playing around trying to see if I had
> > actually stopped the racing, I noticed my battery was low. I plugged
> > it in and immediately received the RT throttling crash. I've now tried
> > quite a bit, and on the battery I don't see the crashing. I thought
> > maybe dynamic CPU clocking is changing some of the racing properties.
> > When I bring everything up on the battery and wait around a bit, once
> > I plug in the usb-c cable, within a few seconds it will often trigger
> > the RT throttling message. I poked a little bit at some of the wifi
> > power management settings, specifically trying to disable them, but I
> > didn't seem to kick anything relevant yet. I can essentially use the
> > power cable as a trigger for this race though..
> >
> > Kalle - are you aware of anything that happens to the driver/adapter
> > when ac power shows up? I think I see some power saving stuff in
> > wmi.c but I haven't gotten deep enough to know...
>

Mitchell - one thing to note re the mhi debugging, the module needs to
be in place first.  Here's how I've been doing it:

modprobe ath11k_pci; echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control; dmesg -wH
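
Once the MHI debug printing is on, the dmesg output can be scanned for
the point where the EE goes invalid.  The following is only a rough
sketch, not part of any patch: it assumes the log lines follow the
dev_dbg() format quoted earlier in this thread ("local ee:%s device
ee:%s dev_state:%s"), and the names MHI_LINE and find_bad_transitions
are made up for illustration.  The sample lines are the ones quoted
earlier, with the wrapped line rejoined.

```python
import re

# Assumed line shape, taken from the dev_dbg() format string quoted in
# this thread: "local ee:%s device ee:%s dev_state:%s".
MHI_LINE = re.compile(
    r"local ee:(?P<local>\S+) device ee:(?P<device>\S+) dev_state:(?P<state>\S+)"
)


def find_bad_transitions(lines):
    """Return (line_number, local_ee, device_ee, dev_state) for each line
    where either EE is INVALID_EE or the device state is SYS_ERR."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = MHI_LINE.search(line)
        if match is None:
            continue  # not an MHI state line (e.g. wlp85s0 messages)
        local, device, state = match.group("local", "device", "state")
        if "INVALID_EE" in (local, device) or state == "SYS_ERR":
            hits.append((number, local, device, state))
    return hits


# Sample input based on the log lines quoted earlier in the thread.
sample = [
    "[ 312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2",
    "[ 313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device ee:INVALID_EE dev_state:SYS_ERR",
    "[ 31.407730] wlp85s0: associated",
]

for number, local, device, state in find_bad_transitions(sample):
    print(f"line {number}: local ee={local} device ee={device} dev_state={state}")
```

The regex only looks for the message body, so it should not matter
whether dmesg prints raw or human-readable (-H) timestamps.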

In the previously linked git repo, I've added my kernel build config,
which may be worth trying.  Another change I've made that seems to help
is to completely disable power management for 80211 in the kernel.
Between that and setting Ubuntu to leave the iwconfig things alone, it
seems to have resolved the power plugging stuff.  I'm guessing the
real racing is related to attempting to configure/reconfigure
settings in the adapter (which is why we're seeing crashes when it
actually tries to 'do things', like associate or modify
operational configs, before it goes nuts).  What's weird is
that I'm assuming the instability was introduced by the
shared IRQ, since presumably this driver works for the previous pieces
of hardware the chipset was put into, but in those specific
codepaths there's nothing obviously related to the single IRQ.  Which
leads my thoughts back to timing/synchronization issues...


* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
From: wi nk @ 2020-12-09  1:52 UTC
  To: Mitchell Nordine, Kalle Valo, Carl Huang; +Cc: ath11k

On Mon, Dec 7, 2020 at 6:01 PM wi nk <wink@technolu.st> wrote:
>
> On Mon, Dec 7, 2020 at 3:45 PM Mitchell Nordine
> <mail@mitchellnordine.com> wrote:
> >
> > Thanks for sending through this patch Wink.
> >
> > I built and installed the ath11k-qca6390-bringup branch with your patch last night on my Dell XPS 13 9310 running NixOS. I have only run the patch 6 times. The startup sequence seems more reliable. I was able to successfully enable the adapter and connect to my router each time, however each time my system would eventually freeze a few minutes after. I noticed that mouse input would stutter for a moment before completely freezing.
> >
> > I tested on battery twice to check your theory w.r.t. power management, but did not notice any difference in behaviour.
> >
> > > > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> >
> > I tried running this but haven't noticed any difference to the output I'm observing in `dmesg` or `journalctl`. There's a chance that there's another way I should be doing this on NixOS as most things including the kernel and its configuration are built and configured declaratively. I'll try and work this out next time I get the chance to have a longer testing session.
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Monday, December 7, 2020 2:17 AM, wi nk <wink@technolu.st> wrote:
> >
> > > On Sun, Dec 6, 2020 at 10:45 PM wi nk <wink@technolu.st> wrote:
> > >
> > > > On Sun, Dec 6, 2020 at 6:53 PM wi nk <wink@technolu.st> wrote:
> > > >
> > > > > On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> > > > > <mail@mitchellnordine.com> wrote:
> > > > >
> > &gt; &gt; &gt; &gt; I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> > &gt; &gt; &gt; &gt; FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that the Bluetooth impacting raciness they observed was just a coincidence.
> > &gt; &gt; &gt; &gt; Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> > &gt; &gt; &gt; &gt; ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > &gt; &gt; &gt; &gt; On Sunday, December 6, 2020 6:00 PM, ath11k-request@lists.infradead.org wrote:
> > &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; Send ath11k mailing list submissions to
> > &gt; &gt; &gt; &gt; &gt; ath11k@lists.infradead.org
> > &gt; &gt; &gt; &gt; &gt; To subscribe or unsubscribe via the World Wide Web, visit
> > &gt; &gt; &gt; &gt; &gt; http://lists.infradead.org/mailman/listinfo/ath11k
> > &gt; &gt; &gt; &gt; &gt; or, via email, send a message with subject or body 'help' to
> > &gt; &gt; &gt; &gt; &gt; ath11k-request@lists.infradead.org
> > &gt; &gt; &gt; &gt; &gt; You can reach the person managing the list at
> > &gt; &gt; &gt; &gt; &gt; ath11k-owner@lists.infradead.org
> > &gt; &gt; &gt; &gt; &gt; When replying, please edit your Subject line so it is more specific
> > &gt; &gt; &gt; &gt; &gt; than "Re: Contents of ath11k digest..."
> > &gt; &gt; &gt; &gt; &gt; Today's Topics:
> > &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; 1.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > &gt; &gt; &gt; &gt; &gt; 2.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; Message: 1
> > &gt; &gt; &gt; &gt; &gt; Date: Sat, 5 Dec 2020 20:17:10 +0100
> > &gt; &gt; &gt; &gt; &gt; From: wi nk wink@technolu.st
> > &gt; &gt; &gt; &gt; &gt; To: Kalle Valo kvalo@codeaurora.org
> > &gt; &gt; &gt; &gt; &gt; Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > &gt; &gt; &gt; &gt; &gt; Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > &gt; &gt; &gt; &gt; &gt; Message-ID:
> > &gt; &gt; &gt; &gt; &gt; CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A@mail.gmail.com
> > &gt; &gt; &gt; &gt; &gt; Content-Type: text/plain; charset="UTF-8"
> > &gt; &gt; &gt; &gt; &gt; On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Wi and Thomas,
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I'll start a new thread about problems on XPS 13. The information is
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; scattered to different threads and hard to find everything, it's much
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; easier to have everything in one place. So let's continue the discussion
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; about the kernel crashes on this thread.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Here's what I have understood so far:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     with 32 MSI vectors.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13. We added a hack to ath11k make it work with only vector and after
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     that it's possible to boot the firmware, connect to the AP and use the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     device for a while.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   But the problem now is that the kernel is crashing almost immediately
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     and almost every time(?). And these crashes only happen on Dell XPS
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13, all other systems (including Dell XPS 15) seem to work without
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     issues.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Is my understanding correct? Did I miss anything?
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; About the symptoms Wi reports:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; So up until this point, everything is working without issues.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Everything seems to spiral out of control a couple of seconds later
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; when my system attempts to actually bring up the adapter. In most of
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; the crash states I will see this:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.391928] wlp85s0: authenticated
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (capab=0x411 status=0 aid=6)
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.407730] wlp85s0: associated
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And then either somewhere in that pile of messages, or a second or two
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; after this my machine will start to stutter as I mentioned before, and
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; then it either hangs, or I see this message (I'm truncating the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; timestamp):
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 35.xxxx ] sched: RT throttling activated
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; After that moment, the machine is unresponsive. Sorry I can't seem to
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; extract this data other than screenshots from my phone at the moment,
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; you can see the dmesg output from 6 different hangs here:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://github.com/w1nk/ath11k-debug
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And Thomas Krause reports:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I can confirm this behavior on my configuration. I managed to login
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; once and select the Wifi and connect to it. It seemed curiously enough
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; be stable long enough to enter the Wifi passphrase. After the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; connection was established, the system hang and on each attempt to
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; reboot into the graphical system it would freeze at some point
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (sometimes even before showing the login screen).
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; --
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://patchwork.kernel.org/project/linux-wireless/list/
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Kalle,
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; Again, thanks much for your work. I think you've summarized
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; everything up until this point. On my XPS 13 9310 The behavior of the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; RT throttling still exists for me occasionally on loading the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; driver/associating with an AP. The throttling consistently occurs
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; after a few sets of the MHI debug printing showing the EE entering an
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; invalid state ( AMSS -&gt; INVALID_EE ). I'm now building the latest tag
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; to see if there are any differences.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; Thanks!
> > &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; Just to follow up, the first boot resulted in the RT throttling
> > &gt; &gt; &gt; &gt; &gt; &gt; message as the adapter was coming up/associating, shortly after the
> > &gt; &gt; &gt; &gt; &gt; &gt; firmware crashed and the kernel didn't fully freeze, but I needed to(
> > &gt; &gt; &gt; &gt; &gt; &gt; reboot to bring the adapter back.
> > &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; Kalle -
> > &gt; &gt; &gt; &gt; &gt; I've noticed one additional behavior that may give someone with
> > &gt; &gt; &gt; &gt; &gt; familiarity with the QCA hardware a clue. I'm running
> > &gt; &gt; &gt; &gt; &gt; ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > &gt; &gt; &gt; &gt; &gt; whatever reason, having the bluetooth subsystem enabled (with a paired
> > &gt; &gt; &gt; &gt; &gt; device) on this dell basically guarantees I'll hit the scheduler
> > &gt; &gt; &gt; &gt; &gt; throttling issue as the ath11k driver is initializing / associating.
> > &gt; &gt; &gt; &gt; &gt; The bluetooth system is using the btqca driver. I don't have any
> > &gt; &gt; &gt; &gt; &gt; useful debugging (I'll gladly collect some if there is a way to do it)
> > &gt; &gt; &gt; &gt; &gt; other than tracking some simple statistics. I booted my system 20
> > &gt; &gt; &gt; &gt; &gt; times, 10 times with bluetooth enabled ((and some headphones turned on
> > &gt; &gt; &gt; &gt; &gt; ready to pair), and 10 times without. In both scenarios, I'm booting
> > &gt; &gt; &gt; &gt; &gt; into X and manually modprobing the ath11k driver. The difference is
> > &gt; &gt; &gt; &gt; &gt; that with bluetooth on and by the time I modprobe the driver, the
> > &gt; &gt; &gt; &gt; &gt; headphones are paired and I received the throttling message and
> > &gt; &gt; &gt; &gt; &gt; subsequent freezing 10/10 times. With bluetooth off / my headphones
> > &gt; &gt; &gt; &gt; &gt; not paired, I only saw it 2/10. I know it's not much hard information
> > &gt; &gt; &gt; &gt; &gt; but it's reliably reproducible for me, is there anything useful I can
> > &gt; &gt; &gt; &gt; &gt; collect?
> > &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; Message: 2
> > &gt; &gt; &gt; &gt; &gt; Date: Sun, 6 Dec 2020 09:05:57 +0100
> > &gt; &gt; &gt; &gt; &gt; From: wi nk wink@technolu.st
> > &gt; &gt; &gt; &gt; &gt; To: Kalle Valo kvalo@codeaurora.org
> > &gt; &gt; &gt; &gt; &gt; Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > &gt; &gt; &gt; &gt; &gt; Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > &gt; &gt; &gt; &gt; &gt; Message-ID:
> > &gt; &gt; &gt; &gt; &gt; CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q@mail.gmail.com
> > &gt; &gt; &gt; &gt; &gt; Content-Type: text/plain; charset="UTF-8"
> > &gt; &gt; &gt; &gt; &gt; On Sat, Dec 5, 2020 at 8:17 PM wi nk wink@technolu.st wrote:
> > &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Wi and Thomas,
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I'll start a new thread about problems on XPS 13. The information is
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; scattered to different threads and hard to find everything, it's much
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; easier to have everything in one place. So let's continue the discussion
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; about the kernel crashes on this thread.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Here's what I have understood so far:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     with 32 MSI vectors.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13. We added a hack to ath11k make it work with only vector and after
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     that it's possible to boot the firmware, connect to the AP and use the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     device for a while.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   But the problem now is that the kernel is crashing almost immediately
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     and almost every time(?). And these crashes only happen on Dell XPS
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13, all other systems (including Dell XPS 15) seem to work without
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     issues.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Is my understanding correct? Did I miss anything?
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; About the symptoms Wi reports:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; So up until this point, everything is working without issues.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Everything seems to spiral out of control a couple of seconds later
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; when my system attempts to actually bring up the adapter. In most of
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; the crash states I will see this:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.391928] wlp85s0: authenticated
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea (capab=0x411 status=0 aid=6)
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.407730] wlp85s0: associated
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And then either somewhere in that pile of messages, or a second or two
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; after this my machine will start to stutter as I mentioned before, and
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; then it either hangs, or I see this message (I'm truncating the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; timestamp):
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 35.xxxx ] sched: RT throttling activated
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; After that moment, the machine is unresponsive. Sorry I can't seem to
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; extract this data other than screenshots from my phone at the moment,
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; you can see the dmesg output from 6 different hangs here:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://github.com/w1nk/ath11k-debug
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And Thomas Krause reports:
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I can confirm this behavior on my configuration. I managed to login
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; once and select the Wifi and connect to it. It seemed curiously enough
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; be stable long enough to enter the Wifi passphrase. After the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; connection was established, the system hung and on each attempt to
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; reboot into the graphical system it would freeze at some point
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (sometimes even before showing the login screen).
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; --
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://patchwork.kernel.org/project/linux-wireless/list/
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Kalle,
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Again, thanks much for your work. I think you've summarized
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; everything up until this point. On my XPS 13 9310 The behavior of the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; RT throttling still exists for me occasionally on loading the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; driver/associating with an AP. The throttling consistently occurs
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; after a few sets of the MHI debug printing showing the EE entering an
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; invalid state (AMSS -> INVALID_EE). I'm now building the latest tag
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; to see if there are any differences.
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Thanks!
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; Just to follow up, the first boot resulted in the RT throttling
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; message as the adapter was coming up/associating, shortly after the
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; firmware crashed and the kernel didn't fully freeze, but I needed to
> > &gt; &gt; &gt; &gt; &gt; &gt; &gt; reboot to bring the adapter back.
> > &gt; &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; &gt; Kalle -
> > &gt; &gt; &gt; &gt; &gt; &gt; I've noticed one additional behavior that may give someone with
> > &gt; &gt; &gt; &gt; &gt; &gt; familiarity with the QCA hardware a clue. I'm running
> > &gt; &gt; &gt; &gt; &gt; &gt; ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > &gt; &gt; &gt; &gt; &gt; &gt; whatever reason, having the bluetooth subsystem enabled (with a paired
> > &gt; &gt; &gt; &gt; &gt; &gt; device) on this dell basically guarantees I'll hit the scheduler
> > &gt; &gt; &gt; &gt; &gt; &gt; throttling issue as the ath11k driver is initializing / associating.
> > &gt; &gt; &gt; &gt; &gt; &gt; The bluetooth system is using the btqca driver. I don't have any
> > &gt; &gt; &gt; &gt; &gt; &gt; useful debugging (I'll gladly collect some if there is a way to do it)
> > &gt; &gt; &gt; &gt; &gt; &gt; other than tracking some simple statistics. I booted my system 20
> > &gt; &gt; &gt; &gt; &gt; &gt; times, 10 times with bluetooth enabled (and some headphones turned on
> > &gt; &gt; &gt; &gt; &gt; &gt; ready to pair), and 10 times without. In both scenarios, I'm booting
> > &gt; &gt; &gt; &gt; &gt; &gt; into X and manually modprobing the ath11k driver. The difference is
> > &gt; &gt; &gt; &gt; &gt; &gt; that with bluetooth on and by the time I modprobe the driver, the
> > &gt; &gt; &gt; &gt; &gt; &gt; headphones are paired and I received the throttling message and
> > &gt; &gt; &gt; &gt; &gt; &gt; subsequent freezing 10/10 times. With bluetooth off / my headphones
> > &gt; &gt; &gt; &gt; &gt; &gt; not paired, I only saw it 2/10. I know it's not much hard information
> > &gt; &gt; &gt; &gt; &gt; &gt; but it's reliably reproducible for me, is there anything useful I can
> > &gt; &gt; &gt; &gt; &gt; &gt; collect?
> > &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; Well unfortunately I think the bluetooth was just a red herring in the
> > &gt; &gt; &gt; &gt; &gt; racing. To chase that, I disabled all bluetooth and was able to get
> > &gt; &gt; &gt; &gt; &gt; into a state where I had 6 failed boots in a row. To further poke
> > &gt; &gt; &gt; &gt; &gt; around, I rebuilt the kernel with localmodconfig to disable building
> > &gt; &gt; &gt; &gt; &gt; big chunks of things. This kernel is way less stable and seems to
> > &gt; &gt; &gt; &gt; &gt; freeze most of the time (but does occasionally remain stable), I'm not
> > &gt; &gt; &gt; &gt; &gt; sure what else got disabled in there, but it seems to have had a
> > &gt; &gt; &gt; &gt; &gt; negative impact on the crash racing.
> > &gt; &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt; &gt; &gt; End of ath11k Digest, Vol 7, Issue 5
> > &gt; &gt; &gt; &gt;
> > &gt; &gt; &gt;
> > &gt; &gt; &gt; Hey Mitchell,
> > &gt; &gt; &gt; One more thing to try that may help us get a little bit of extra
> > &gt; &gt; &gt; info. Out of everything I've done, something that has remained
> > &gt; &gt; &gt; consistent is to enable the MHI debugging as Kalle suggested:
> > &gt; &gt; &gt; sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> > &gt; &gt; &gt; Before any crash/spinlock, I see the MHI printing (from
> > &gt; &gt; &gt; drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
> > &gt; &gt; &gt; and then after a number more iterations through this function, things
> > &gt; &gt; &gt; finally go out of control. So from
> > &gt; &gt; &gt;
> > &gt; &gt; &gt;         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
> > &gt; &gt; &gt;                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
> > &gt; &gt; &gt;                 TO_MHI_STATE_STR(state));
> > &gt; &gt; &gt;
> > &gt; &gt; &gt;
> > &gt; &gt; &gt; I'll see something like this:
> > &gt; &gt; &gt; [ 312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
> > &gt; &gt; &gt; [ 313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device ee:INVALID_EE dev_state:SYS_ERR
> > &gt; &gt; &gt; Then after a few of those prints showing SYS_ERR, either a spinlock or
> > &gt; &gt; &gt; a firmware crash. I'm not sure what causes this ee state to go
> > &gt; &gt; &gt; invalid, but maybe that's worth looking into. Can you confirm the
> > &gt; &gt; &gt; same behavior? To see this a little easier, I also run dmesg -wH in
> > &gt; &gt; &gt; two windows, one piping to | grep -v mhi (to filter out the mhi
> > &gt; &gt; &gt; debugging).
> > &gt; &gt; &gt; Thanks!
> > &gt; &gt;
> > &gt; &gt; So I've managed to stabilise my system now, so either the race is
> > &gt; &gt; gone, or I've done something to win it all the time. So one of the
> > &gt; &gt; avenues of racing I was chasing at first was in the ath11k driver
> > &gt; &gt; itself. There are a couple areas where the single/shared IRQ is being
> > &gt; &gt; forcibly toggled in ways that the documentation says are not great
> > &gt; &gt; (and the original patch was trying to avoid). Fixing those didn't
> > &gt; &gt; seem to have much impact on the stability of things (I've included
> > &gt; &gt; those changes in my patch though). After the last email I was
> > &gt; &gt; thinking about the MHI side of things a bit more and found a number of
> > &gt; &gt; call sites that my naive grepping had missed that do the same thing,
> > &gt; &gt; but via acquiring a lock at the same time. I modified all the calls
> > &gt; &gt; to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > &gt; &gt; variants that accept the flags parameter to capture state. I've now
> > &gt; &gt; booted and loaded the driver 10+ times without a single freeze or
> > &gt; &gt; crash. I'm not sure all of those modifications are necessary (ie:
> > &gt; &gt; which things are re-entrant in this single interrupt operating mode vs
> > &gt; &gt; which ones can use the simpler lock/unlock mechanisms), so I could use
> > &gt; &gt; some advice/guidance there.
> > &gt; &gt; Mitchell - if you want to grab this patch and try it, let me know how
> > &gt; &gt; it goes and I can clean it up for the mailing list:
> > &gt; &gt; https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > &gt; &gt; (apply to ath11k-qca6390-bringup-202011301608)
> > &gt;
> > &gt; Blindly chasing the crashing, I've found one more probably relevant
> > &gt; piece of information. As I was playing around trying to see if I had
> > &gt; actually stopped the racing, I noticed my battery was low. I plugged
> > &gt; it in and immediately received the RT throttling crash. I've now tried
> > &gt; quite a bit, and on the battery I don't see the crashing. I thought
> > &gt; maybe dynamic CPU clocking is changing some of the racing properties.
> > &gt; When I bring everything up on the battery and wait around a bit, once
> > &gt; I plug in the usb-c cable, within a few seconds it will often trigger
> > &gt; the RT throttling message. I poked a little bit at some of the wifi
> > &gt; power management settings, specifically trying to disable them, but I
> > &gt; didn't seem to kick anything relevant yet. I can essentially use the
> > &gt; power cable as a trigger for this race though..
> > &gt;
> > &gt; Kalle - are you aware of anything that happens to the driver/adapter
> > &gt; when ac power shows up? I think I see some power saving stuff in
> > &gt; wmi.c but I haven't gotten deep enough to know...
> >
>
> Mitchell - one thing to note re the mhi debugging, the module needs to
> be in place first.  Here's how I've been doing it:
>
> modprobe ath11k_pci; echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control; dmesg -wH
>
> In the previously linked git repo, I've added my kernel build config,
> that may be worth trying.  Another change I've made that seems to help
> is to completely disable power management for 80211 in the kernel.
> Between that and setting ubuntu to leave the iwconfig things alone, it
> seems to have resolved the power plugging stuff.  I'm guessing the
> real racing is related to just attempting to configure/reconfigure
> settings in the adapter (which is why we're seeing crashing when it
> tries to actually attempt to 'do things', like associate or modify
> operational configs, before it goes nuts).  The thing that's weird is
> that I'm assuming the instability has been introduced due to the
> shared IRQ since presumably this driver works for the previous pieces
> of hardware the chipset was put into, but specifically in those
> codepaths, there's nothing obviously related to the single IRQ.  Which
> leads my thoughts back to timing/synchronization issues...
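
For anyone repeating this, the ordering described above (load the driver
first, then turn on the MHI prints) can be wrapped in a small helper.  This
is only a sketch: the function name `enable_mhi_debug` and its two path
arguments are made up for illustration (they default to the real /proc and
debugfs paths and exist only so the logic can be tried against ordinary
files), and it assumes MHI is built as a module named `mhi`.  With MHI built
into the kernel it will not show up in /proc/modules and the check would
refuse to proceed.

```shell
#!/bin/sh
# Sketch: enable MHI dynamic debug only once the mhi module is loaded.
# Both arguments are optional and hypothetical conveniences for testing.

enable_mhi_debug() {
    control="${1:-/sys/kernel/debug/dynamic_debug/control}"
    modules="${2:-/proc/modules}"

    # /proc/modules lists one loaded module per line, module name first.
    if ! grep -q '^mhi ' "$modules"; then
        echo "mhi is not loaded yet; run 'modprobe ath11k_pci' first" >&2
        return 1
    fi

    # The control file only exists with debugfs mounted and
    # CONFIG_DYNAMIC_DEBUG enabled, and is normally writable by root only.
    if [ ! -w "$control" ]; then
        echo "cannot write $control" >&2
        return 1
    fi

    # Same query string as in the one-liner above.
    printf '%s' 'module mhi +p' > "$control"
}
```

Typical use as root would be `modprobe ath11k_pci && enable_mhi_debug && dmesg -wH`.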

While I'm semi-randomly poking at things, I decided to capture some
information in a structured way that could be useful to Kalle and
team.  I'm running the latest bringup branch without any
modifications.  I booted my machine 6 consecutive times to demonstrate
how plugging in power triggers the freezing I was referring to.  In
each video, you'll see the dmesg output, and in the cases I can
control, you'll also see it with MHI debugging.

The first 2 boots, I'm intentionally booting / initializing the driver
on battery power and then waiting 5+ minutes to plug in the charger.
Note:  the system always comes online and remains stable when I start
in this configuration; it only crashes once I plug the charger in.

Boot 1: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_004643171.mp4
- The machine and driver have been online and stable for 5 minutes (as
seen in htop/ping).  Within a few seconds of plugging in the USB
charger, the MHI debugging shows a failure and the machine crashes.

Boot 2: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005346443.mp4
- Same setup (although the machine had been up for 6 minutes at that
point) and same failure as boot 1. The machine hard locks instantly this
time, as opposed to the stuttering you can see in boot 1.

For the next boots, I'm booting / initializing the driver with the
charger plugged in ahead of time:

Boot 3: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005642416.mp4
- Within a few seconds of the driver initializing, the machine
crashes.

Boot 4: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005800378.mp4
- Same setup as boot 3, but this time the system survives a bit longer
(15 seconds or so).

Boot 5: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005938734.mp4
- Same setup as 3/4, similar crash to boot 4.  The driver survives ~15
seconds and then the machine hangs.

After this I went back to the setup for boot 1/2 where I brought
everything online, waited a bit over 5 minutes and plugged in the
charger.

Boot 6: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_010537553.mp4
- This boot was successful and has remained stable.  I'm composing
this email from it.  If this follows previous behavior, it should stay
online for at least 24h (I have always ended up fiddling with it beyond
that point).

So in conclusion, I wanted to demonstrate that being on battery power
clearly does something that keeps my system stable, and that this
stability goes away when I plug in my charger (both up front, and
after initialization).  I don't have any great ideas about what could
be going on.  I'm not entirely sure it's directly power related, but
when I toggle it, clearly something is linked (maybe back to the ACPI
tables being borked?).  I'll leave this boot running as long as I can
to see if it randomly crashes after an hour...
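
To line up the moment the charger state changes with the MHI prints, one
option is to drop a marker into the kernel log whenever AC comes or goes.
The following is a rough sketch, not something used for the videos above.
It assumes the standard /sys/class/power_supply layout (each supply has a
`type` attribute, and non-battery supplies have `online`); the names
`ac_online` and `watch_ac` are made up for illustration, and writing to
/dev/kmsg needs root.

```shell
#!/bin/sh
# Sketch: report whether any non-battery power supply is online, and log
# every change to the kernel ring buffer so it appears in dmesg output.

ac_online() {
    base="${1:-/sys/class/power_supply}"
    for supply in "$base"/*; do
        # Skip entries that lack the attributes we need.
        [ -r "$supply/type" ] && [ -r "$supply/online" ] || continue
        # Only chargers are interesting (type Mains, USB, ...).
        [ "$(cat "$supply/type")" = "Battery" ] && continue
        if [ "$(cat "$supply/online")" = "1" ]; then
            echo 1
            return 0
        fi
    done
    echo 0
}

watch_ac() {
    base="${1:-/sys/class/power_supply}"
    last=""
    while :; do
        now=$(ac_online "$base")
        if [ "$now" != "$last" ]; then
            # Lines written to /dev/kmsg get a kernel timestamp in dmesg.
            echo "ac-watch: charger online=$now" > /dev/kmsg
            last=$now
        fi
        sleep 1
    done
}
```

Running `watch_ac` as root in a spare terminal next to `dmesg -wH` would put
an "ac-watch" line in the same log as the MHI state prints, polled once per
second.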

-- 
ath11k mailing list
ath11k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath11k

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-09  1:52           ` wi nk
@ 2020-12-09  9:43             ` wi nk
  2020-12-09 15:28               ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-09  9:43 UTC (permalink / raw)
  To: Mitchell Nordine, Kalle Valo, Carl Huang; +Cc: ath11k

On Wed, Dec 9, 2020 at 2:52 AM wi nk <wink@technolu.st> wrote:
>
> On Mon, Dec 7, 2020 at 6:01 PM wi nk <wink@technolu.st> wrote:
> >
> > On Mon, Dec 7, 2020 at 3:45 PM Mitchell Nordine
> > <mail@mitchellnordine.com> wrote:
> > >
> > > Thanks for sending through this patch Wink.
> > >
> > > I built and installed the ath11k-qca6390-bringup branch with your patch last night on my Dell XPS 13 9310 running NixOS. I have only run the patch 6 times. The startup sequence seems more reliable. I was able to successfully enable the adapter and connect to my router each time, however each time my system would eventually freeze a few minutes after. I noticed that mouse input would stutter for a moment before completely freezing.
> > >
> > > I tested on battery twice to check your theory w.r.t. power management, but did not notice any difference in behaviour.
> > >
> > > > > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> > >
> > > I tried running this but haven't noticed any difference to the output I'm observing in `dmesg` or `journalctl`. There's a chance that there's another way I should be doing this on NixOS as most things including the kernel and its configuration are built and configured declaratively. I'll try and work this out next time I get the chance to have a longer testing session.
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Monday, December 7, 2020 2:17 AM, wi nk <wink@technolu.st> wrote:
> > >
> > > &gt; On Sun, Dec 6, 2020 at 10:45 PM wi nk wink@technolu.st wrote:
> > > &gt;
> > > &gt; &gt; On Sun, Dec 6, 2020 at 6:53 PM wi nk wink@technolu.st wrote:
> > > &gt; &gt;
> > > &gt; &gt; &gt; On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> > > &gt; &gt; &gt; mail@mitchellnordine.com wrote:
> > > &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> > > &gt; &gt; &gt; &gt; FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that the Bluetooth impacting raciness they observed was just a coincidence.
> > > &gt; &gt; &gt; &gt; Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> > > &gt; &gt; &gt; &gt; ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > &gt; &gt; &gt; &gt; On Sunday, December 6, 2020 6:00 PM, ath11k-request@lists.infradead.org wrote:
> > > &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; Today's Topics:
> > > &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; 1.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > &gt; &gt; &gt; &gt; &gt; 2.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; Message: 1
> > > &gt; &gt; &gt; &gt; &gt; Date: Sat, 5 Dec 2020 20:17:10 +0100
> > > &gt; &gt; &gt; &gt; &gt; From: wi nk wink@technolu.st
> > > &gt; &gt; &gt; &gt; &gt; To: Kalle Valo kvalo@codeaurora.org
> > > &gt; &gt; &gt; &gt; &gt; Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > > &gt; &gt; &gt; &gt; &gt; Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > &gt; &gt; &gt; &gt; &gt; Message-ID:
> > > &gt; &gt; &gt; &gt; &gt; CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A@mail.gmail.com
> > > &gt; &gt; &gt; &gt; &gt; Content-Type: text/plain; charset="UTF-8"
> > > &gt; &gt; &gt; &gt; &gt; On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > > &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > > &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Wi and Thomas,
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I'll start a new thread about problems on XPS 13. The information is
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; scattered to different threads and hard to find everything, it's much
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; easier to have everything in one place. So let's continue the discussion
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; about the kernel crashes on this thread.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Here's what I have understood so far:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     with 32 MSI vectors.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13. We added a hack to ath11k to make it work with only one vector and after
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     that it's possible to boot the firmware, connect to the AP and use the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     device for a while.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   But the problem now is that the kernel is crashing almost immediately
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     and almost every time(?). And these crashes only happen on Dell XPS
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13, all other systems (including Dell XPS 15) seem to work without
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     issues.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Is my understanding correct? Did I miss anything?
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; About the symptoms Wi reports:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; So up until this point, everything is working without issues.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Everything seems to spiral out of control a couple of seconds later
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; when my system attempts to actually bring up the adapter. In most of
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; the crash states I will see this:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.391928] wlp85s0: authenticated
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea (capab=0x411 status=0 aid=6)
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.407730] wlp85s0: associated
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And then either somewhere in that pile of messages, or a second or two
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; after this my machine will start to stutter as I mentioned before, and
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; then it either hangs, or I see this message (I'm truncating the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; timestamp):
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 35.xxxx ] sched: RT throttling activated
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; After that moment, the machine is unresponsive. Sorry I can't seem to
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; extract this data other than screenshots from my phone at the moment,
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; you can see the dmesg output from 6 different hangs here:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://github.com/w1nk/ath11k-debug
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And Thomas Krause reports:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I can confirm this behavior on my configuration. I managed to login
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; once and select the Wifi and connect to it. It seemed curiously enough
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; be stable long enough to enter the Wifi passphrase. After the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; connection was established, the system hung and on each attempt to
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; reboot into the graphical system it would freeze at some point
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (sometimes even before showing the login screen).
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; --
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://patchwork.kernel.org/project/linux-wireless/list/
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Kalle,
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; Again, thanks much for your work. I think you've summarized
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; everything up until this point. On my XPS 13 9310 The behavior of the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; RT throttling still exists for me occasionally on loading the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; driver/associating with an AP. The throttling consistently occurs
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; after a few sets of the MHI debug printing showing the EE entering an
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; invalid state (AMSS -> INVALID_EE). I'm now building the latest tag
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; to see if there are any differences.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; Thanks!
> > > &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; Just to follow up, the first boot resulted in the RT throttling
> > > &gt; &gt; &gt; &gt; &gt; &gt; message as the adapter was coming up/associating, shortly after the
> > > &gt; &gt; &gt; &gt; &gt; &gt; firmware crashed and the kernel didn't fully freeze, but I needed to
> > > &gt; &gt; &gt; &gt; &gt; &gt; reboot to bring the adapter back.
> > > &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; Kalle -
> > > &gt; &gt; &gt; &gt; &gt; I've noticed one additional behavior that may give someone with
> > > &gt; &gt; &gt; &gt; &gt; familiarity with the QCA hardware a clue. I'm running
> > > &gt; &gt; &gt; &gt; &gt; ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > &gt; &gt; &gt; &gt; &gt; whatever reason, having the bluetooth subsystem enabled (with a paired
> > > &gt; &gt; &gt; &gt; &gt; device) on this dell basically guarantees I'll hit the scheduler
> > > &gt; &gt; &gt; &gt; &gt; throttling issue as the ath11k driver is initializing / associating.
> > > &gt; &gt; &gt; &gt; &gt; The bluetooth system is using the btqca driver. I don't have any
> > > &gt; &gt; &gt; &gt; &gt; useful debugging (I'll gladly collect some if there is a way to do it)
> > > &gt; &gt; &gt; &gt; &gt; other than tracking some simple statistics. I booted my system 20
> > > &gt; &gt; &gt; &gt; &gt; times, 10 times with bluetooth enabled (and some headphones turned on
> > > &gt; &gt; &gt; &gt; &gt; ready to pair), and 10 times without. In both scenarios, I'm booting
> > > &gt; &gt; &gt; &gt; &gt; into X and manually modprobing the ath11k driver. The difference is
> > > &gt; &gt; &gt; &gt; &gt; that with bluetooth on and by the time I modprobe the driver, the
> > > &gt; &gt; &gt; &gt; &gt; headphones are paired and I received the throttling message and
> > > &gt; &gt; &gt; &gt; &gt; subsequent freezing 10/10 times. With bluetooth off / my headphones
> > > &gt; &gt; &gt; &gt; &gt; not paired, I only saw it 2/10. I know it's not much hard information
> > > &gt; &gt; &gt; &gt; &gt; but it's reliably reproducible for me, is there anything useful I can
> > > &gt; &gt; &gt; &gt; &gt; collect?
> > > &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; Message: 2
> > > &gt; &gt; &gt; &gt; &gt; Date: Sun, 6 Dec 2020 09:05:57 +0100
> > > &gt; &gt; &gt; &gt; &gt; From: wi nk wink@technolu.st
> > > &gt; &gt; &gt; &gt; &gt; To: Kalle Valo kvalo@codeaurora.org
> > > &gt; &gt; &gt; &gt; &gt; Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > > &gt; &gt; &gt; &gt; &gt; Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > &gt; &gt; &gt; &gt; &gt; Message-ID:
> > > &gt; &gt; &gt; &gt; &gt; CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q@mail.gmail.com
> > > &gt; &gt; &gt; &gt; &gt; Content-Type: text/plain; charset="UTF-8"
> > > &gt; &gt; &gt; &gt; &gt; On Sat, Dec 5, 2020 at 8:17 PM wi nk wink@technolu.st wrote:
> > > &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > > &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Wi and Thomas,
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I'll start a new thread about problems on XPS 13. The information is
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; scattered to different threads and hard to find everything, it's much
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; easier to have everything in one place. So let's continue the discussion
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; about the kernel crashes on this thread.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Here's what I have understood so far:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     with 32 MSI vectors.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13. We added a hack to ath11k to make it work with only one vector and after
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     that it's possible to boot the firmware, connect to the AP and use the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     device for a while.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   But the problem now is that the kernel is crashing almost immediately
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     and almost every time(?). And these crashes only happen on Dell XPS
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13, all other systems (including Dell XPS 15) seem to work without
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     issues.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Is my understanding correct? Did I miss anything?
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; About the symptoms Wi reports:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; So up until this point, everything is working without issues.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Everything seems to spiral out of control a couple of seconds later
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; when my system attempts to actually bring up the adapter. In most of
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; the crash states I will see this:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.391928] wlp85s0: authenticated
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (capab=0x411 status=0 aid=6)
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.407730] wlp85s0: associated
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And then either somewhere in that pile of messages, or a second or two
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; after this my machine will start to stutter as I mentioned before, and
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; then it either hangs, or I see this message (I'm truncating the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; timestamp):
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 35.xxxx ] sched: RT throttling activated
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; After that moment, the machine is unresponsive. Sorry I can't seem to
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; extract this data other than screenshots from my phone at the moment,
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; you can see the dmesg output from 6 different hangs here:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://github.com/w1nk/ath11k-debug
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; And Thomas Krause reports:
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I can confirm this behavior on my configuration. I managed to login
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; once and select the Wifi and connect to it. It seemed curiously enough
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; be stable long enough to enter the Wifi passphrase. After the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; connection was established, the system hang and on each attempt to
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; reboot into the graphical system it would freeze at some point
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; (sometimes even before showing the login screen).
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; --
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://patchwork.kernel.org/project/linux-wireless/list/
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Kalle,
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Again, thanks much for your work. I think you've summarized
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; everything up until this point. On my XPS 13 9310 The behavior of the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; RT throttling still exists for me occasionally on loading the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; driver/associating with an AP. The throttling consistently occurs
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; after a few sets of the MHI debug printing showing the EE entering an
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; invalid state (AMSS -> INVALID_EE). I'm now building the latest tag
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; to see if there are any differences.
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Thanks!
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; Just to follow up, the first boot resulted in the RT throttling
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; message as the adapter was coming up/associating, shortly after the
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; firmware crashed and the kernel didn't fully freeze, but I needed to
> > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; reboot to bring the adapter back.
> > > &gt; &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; &gt; Kalle -
> > > &gt; &gt; &gt; &gt; &gt; &gt; I've noticed one additional behavior that may give someone with
> > > &gt; &gt; &gt; &gt; &gt; &gt; familiarity with the QCA hardware a clue. I'm running
> > > &gt; &gt; &gt; &gt; &gt; &gt; ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > &gt; &gt; &gt; &gt; &gt; &gt; whatever reason, having the bluetooth subsystem enabled (with a paired
> > > &gt; &gt; &gt; &gt; &gt; &gt; device) on this dell basically guarantees I'll hit the scheduler
> > > &gt; &gt; &gt; &gt; &gt; &gt; throttling issue as the ath11k driver is initializing / associating.
> > > &gt; &gt; &gt; &gt; &gt; &gt; The bluetooth system is using the btqca driver. I don't have any
> > > &gt; &gt; &gt; &gt; &gt; &gt; useful debugging (I'll gladly collect some if there is a way to do it)
> > > &gt; &gt; &gt; &gt; &gt; &gt; other than tracking some simple statistics. I booted my system 20
> > > &gt; &gt; &gt; &gt; &gt; &gt; times, 10 times with bluetooth enabled (and some headphones turned on
> > > &gt; &gt; &gt; &gt; &gt; &gt; ready to pair), and 10 times without. In both scenarios, I'm booting
> > > &gt; &gt; &gt; &gt; &gt; &gt; into X and manually modprobing the ath11k driver. The difference is
> > > &gt; &gt; &gt; &gt; &gt; &gt; that with bluetooth on and by the time I modprobe the driver, the
> > > &gt; &gt; &gt; &gt; &gt; &gt; headphones are paired and I received the throttling message and
> > > &gt; &gt; &gt; &gt; &gt; &gt; subsequent freezing 10/10 times. With bluetooth off / my headphones
> > > &gt; &gt; &gt; &gt; &gt; &gt; not paired, I only saw it 2/10. I know it's not much hard information
> > > &gt; &gt; &gt; &gt; &gt; &gt; but it's reliably reproducible for me, is there anything useful I can
> > > &gt; &gt; &gt; &gt; &gt; &gt; collect?
> > > &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; Well unfortunately I think the bluetooth was just a red herring in the
> > > &gt; &gt; &gt; &gt; &gt; racing. To chase that, I disabled all bluetooth and was able to get
> > > &gt; &gt; &gt; &gt; &gt; into a state where I had 6 failed boots in a row. To further poke
> > > &gt; &gt; &gt; &gt; &gt; around, I rebuilt the kernel with localmodconfig to disable building
> > > &gt; &gt; &gt; &gt; &gt; big chunks of things. This kernel is way less stable and seems to
> > > &gt; &gt; &gt; &gt; &gt; freeze most of the time (but does occasionally remain stable), I'm not
> > > &gt; &gt; &gt; &gt; &gt; sure what else got disabled in there, but it seems to have had a
> > > &gt; &gt; &gt; &gt; &gt; negative impact on the crash racing.
> > > &gt; &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt; &gt; &gt; End of ath11k Digest, Vol 7, Issue 5
> > > &gt; &gt; &gt; &gt;
> > > &gt; &gt; &gt;
> > > &gt; &gt; &gt; Hey Mitchell,
> > > &gt; &gt; &gt; One more thing to try that may help us get a little bit of extra
> > > &gt; &gt; &gt; info. Out of everything I've done, something that has remained
> > > &gt; &gt; &gt; consistent is to enable the MHI debugging as Kalle suggested:
> > > &gt; &gt; &gt; sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> > > &gt; &gt; &gt; Before any crash/spinlock, I see the MHI printing (from
> > > &gt; &gt; &gt; drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
> > > &gt; &gt; &gt; and then after a number more iterations through this function, things
> > > &gt; &gt; &gt; finally go out of control. So from
> > > &gt; &gt; &gt;
> > > &gt; &gt; &gt;         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
> > > &gt; &gt; &gt;                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
> > > &gt; &gt; &gt;                 TO_MHI_STATE_STR(state));
> > > &gt; &gt; &gt;
> > > &gt; &gt; &gt;
> > > &gt; &gt; &gt; I'll see something like this:
> > > &gt; &gt; &gt; [ 312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
> > > &gt; &gt; &gt; [ 313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device
> > > &gt; &gt; &gt; ee:INVALID_EE dev_state:SYS_ERR
> > > &gt; &gt; &gt; Then after a few of those prints showing SYS_ERR, either a spinlock or
> > > &gt; &gt; &gt; a firmware crash. I'm not sure what causes this ee state to go
> > > &gt; &gt; &gt; invalid, but maybe that's worth looking into. Can you confirm the
> > > &gt; &gt; &gt; same behavior? To see this a little easier, I also run dmesg -wH in
> > > &gt; &gt; &gt; two windows, one piping to | grep -v mhi (to filter out the mhi
> > > &gt; &gt; &gt; debugging).
> > > &gt; &gt; &gt; Thanks!
> > > &gt; &gt;
> > > &gt; &gt; So I've managed to stabilise my system now, so either the race is
> > > &gt; &gt; gone, or I've done something to win it all the time. So one of the
> > > &gt; &gt; avenues of racing I was chasing at first was in the ath11k driver
> > > &gt; &gt; itself. There are a couple areas where the single/shared IRQ is being
> > > &gt; &gt; forcibly toggled in ways that the documentation says are not great
> > > &gt; &gt; (and the original patch was trying to avoid). Fixing those didn't
> > > &gt; &gt; seem to have much impact on the stability of things (I've included
> > > &gt; &gt; those changes in my patch though). After the last email I was
> > > &gt; &gt; thinking about the MHI side of things a bit more and found a number of
> > > &gt; &gt; call sites that my naive grepping had missed that do the same thing,
> > > &gt; &gt; but via acquiring a lock at the same time. I modified all the calls
> > > &gt; &gt; to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > > &gt; &gt; variants that accept the flags parameter to capture state. I've now
> > > &gt; &gt; booted and loaded the driver 10+ times without a single freeze or
> > > &gt; &gt; crash. I'm not sure all of those modifications are necessary (ie:
> > > &gt; &gt; which things are re-entrant in this single interrupt operating mode vs
> > > &gt; &gt; which ones can use the simpler lock/unlock mechanisms), so I could use
> > > &gt; &gt; some advice/guidance there.
> > > &gt; &gt; Mitchell - if you want to grab this patch and try it, let me know how
> > > &gt; &gt; it goes and I can clean it up for the mailing list:
> > > &gt; &gt; https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > > &gt; &gt; (apply to ath11k-qca6390-bringup-202011301608)
> > > &gt;
> > > &gt; Blindly chasing the crashing, I've found one more probably relevant
> > > &gt; piece of information. As I was playing around trying to see if I had
> > > &gt; actually stopped the racing, I noticed my battery was low. I plugged
> > > &gt; it in and immediately received the RT throttling crash. I've now tried
> > > &gt; quite a bit, and on the battery I don't see the crashing. I thought
> > > &gt; maybe dynamic CPU clocking is changing some of the racing properties.
> > > &gt; When I bring everything up on the battery and wait around a bit, once
> > > &gt; I plug in the usb-c cable, within a few seconds it will often trigger
> > > &gt; the RT throttling message. I poked a little bit at some of the wifi
> > > &gt; power management settings, specifically trying to disable them, but I
> > > &gt; didn't seem to kick anything relevant yet. I can essentially use the
> > > &gt; power cable as a trigger for this race though..
> > > &gt;
> > > &gt; Kalle - are you aware of anything that happens to the driver/adapter
> > > &gt; when ac power shows up? I think I see some power saving stuff in
> > > &gt; wmi.c but I haven't gotten deep enough to know...
> > >
> >
> > Mitchell - one thing to note re the mhi debugging, the module needs to
> > be in place first.  Here's how I've been doing it:
> >
> > modprobe ath11k_pci; echo -n 'module mhi +p' >
> > /sys/kernel/debug/dynamic_debug/control; dmesg -wH
> >
> > In the previously linked git repo, I've added my kernel build config,
> > that may be worth trying.  Another change I've made that seems to help
> > is to completely disable power management for 80211 in the kernel.
> > Between that and setting ubuntu to leave the iwconfig things alone, it
> > seems to have resolved the power plugging stuff.  I'm guessing the
> > real racing is related to just attempting to configure/reconfigure
> > settings in the adapter (which is why we're seeing crashing when it
> > tries to actually attempt to 'do things', like associate or modify
> > operational configs, before it goes nuts).  The thing that's weird is
> > that I'm assuming the instability has been introduced due to the
> > shared IRQ since presumably this driver works for the previous pieces
> > of hardware the chipset was put into, but specifically in those
> > codepaths, there's nothing obviously related to the single IRQ.  Which
> > leads my thoughts back to timing/synchronization issues...
>
> While I'm semi-randomly poking things I decided to capture some
> information in a structured way that could be useful to Kalle and
> team.  I'm running the latest bringup branch without any
> modifications.  I booted my machine 6 consecutive times to demonstrate
> the power triggering the freezing I was referring to.  In each video,
> you'll see the dmesg output, and in the cases I can control, you'll
> also see it with MHI debugging.
>
> The first 2 boots, I'm intentionally booting / initializing the driver
> on battery power and then waiting 5+ minutes to plug in the charger.
> Note:  the system always comes online and remains stable when I start
> in this configuration, it's only when I plug the charger in that it
> crashes.
>
> Boot 1: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_004643171.mp4
> - The machine and driver have been online and stable for 5 minutes (as
> seen in htop/ping). Within a few seconds of plugging in the usb
> charger, the mhi debugging shows a failure and the machine crashes.
>
> Boot 2 : https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005346443.mp4
> - Same set up (although the machine had been up for 6 minutes at that
> point) and failure as boot 1. The machine hard locks instantly this
> time, as opposed to the stuttering you can see in boot 1.
>
> For the next boots, I'm booting / initializing the driver with the
> charger plugged in ahead of time:
>
> Boot 3 : https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005642416.mp4
> - Within a few seconds of the driver initializing, the machine
> crashes.
>
> Boot 4 : https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005800378.mp4
> - Same setup as boot 3, but this time the system survives a bit longer
> (15 seconds or so).
>
> Boot 5: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005938734.mp4
> - Same setup as 3/4, similar crash to boot 4.  The driver survives ~15
> seconds and then the machine hangs.
>
> After this I went back to the setup for boot 1/2 where I brought
> everything online, waited a bit over 5 minutes and plugged in the
> charger.
>
> Boot 6: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_010537553.mp4
> - This boot was successful and has remained stable.  I'm composing
> this email from it.  If this follows previous behavior, it should stay
> online for at least 24h (I always fiddled beyond that).
>
> So in conclusion, I wanted to demonstrate that being on battery power
> clearly keeps my system stable, and that this stability goes away
> when I plug in my charger (both up front, and after initialization).
> I don't have any great ideas about what could be going on. I'm not
> entirely sure it's directly power related, but when I toggle it,
> clearly something is linked (maybe back to the ACPI tables being
> borked?).  I'll leave this boot running as long as I can to see if it
> randomly crashes after an hour...

GitHub didn't seem to like hosting those mp4s, so I've re-uploaded
them here as well:
https://drive.google.com/drive/folders/1wvxZI5XtwPSrm0-6-Ov50cUfqBXSXeNz?usp=sharing
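
For anyone who would rather pick the EE transitions out of a saved log than
watch dmesg scroll by, here is a rough Python sketch. It assumes the log was
saved to a text file with one message per line (not wrapped as in the quoted
mail) and that the lines follow the "local ee:... device ee:... dev_state:..."
format of the MHI debug print quoted earlier in this thread. The function name
find_ee_faults and the pattern are made up for illustration; they are not part
of any kernel tooling.

```python
import re

# Assumed line format, taken from the MHI debug output quoted in this thread:
# "[ 312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2"
MHI_LINE = re.compile(
    r"mhi (?P<dev>\S+): local ee:(?P<local>\S+) "
    r"device ee:(?P<device>\S+) dev_state:(?P<state>\S+)"
)


def find_ee_faults(lines):
    """Return (line_number, device, dev_state) for each MHI debug line
    whose device EE is reported as INVALID_EE."""
    faults = []
    for number, line in enumerate(lines, start=1):
        match = MHI_LINE.search(line)
        if match and match.group("device") == "INVALID_EE":
            faults.append((number, match.group("dev"), match.group("state")))
    return faults


# Hypothetical sample, built from the two lines quoted in this thread.
sample = [
    "[ 312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2",
    "[ 313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device ee:INVALID_EE dev_state:SYS_ERR",
]

if __name__ == "__main__":
    print(find_ee_faults(sample))
```

Run against a log saved with something like "dmesg > boot.log", this should
list the lines where the device EE went invalid, which may make it easier to
compare one boot against another.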

-- 
ath11k mailing list
ath11k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath11k

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-09  9:43             ` wi nk
@ 2020-12-09 15:28               ` wi nk
  0 siblings, 0 replies; 31+ messages in thread
From: wi nk @ 2020-12-09 15:28 UTC (permalink / raw)
  To: Kalle Valo, Mitchell Nordine, Carl Huang; +Cc: ath11k

On Wed, Dec 9, 2020 at 10:43 AM wi nk <wink@technolu.st> wrote:
>
> On Wed, Dec 9, 2020 at 2:52 AM wi nk <wink@technolu.st> wrote:
> >
> > On Mon, Dec 7, 2020 at 6:01 PM wi nk <wink@technolu.st> wrote:
> > >
> > > On Mon, Dec 7, 2020 at 3:45 PM Mitchell Nordine
> > > <mail@mitchellnordine.com> wrote:
> > > >
> > > > Thanks for sending through this patch Wink.
> > > >
> > > > I built and installed the ath11k-qca6390-bringup branch with your patch last night on my Dell XPS 13 9310 running NixOS. I have only run the patch 6 times. The startup sequence seems more reliable. I was able to successfully enable the adapter and connect to my router each time, however each time my system would eventually freeze a few minutes after. I noticed that mouse input would stutter for a moment before completely freezing.
> > > >
> > > > I tested on battery twice to check your theory w.r.t. power management, but did not notice any difference in behaviour.
> > > >
> > > > > > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> > > >
> > > > I tried running this but haven't noticed any difference to the output I'm observing in `dmesg` or `journalctl`. There's a chance that there's another way I should be doing this on NixOS as most things including the kernel and its configuration are built and configured declaratively. I'll try and work this out next time I get the chance to have a longer testing session.
> > > >
> > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > On Monday, December 7, 2020 2:17 AM, wi nk <wink@technolu.st> wrote:
> > > >
> > > > &gt; On Sun, Dec 6, 2020 at 10:45 PM wi nk wink@technolu.st wrote:
> > > > &gt;
> > > > &gt; &gt; On Sun, Dec 6, 2020 at 6:53 PM wi nk wink@technolu.st wrote:
> > > > &gt; &gt;
> > > > &gt; &gt; &gt; On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> > > > &gt; &gt; &gt; mail@mitchellnordine.com wrote:
> > > > &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> > > > &gt; &gt; &gt; &gt; FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that the Bluetooth impacting raciness they observed was just a coincidence.
> > > > &gt; &gt; &gt; &gt; Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> > > > &gt; &gt; &gt; &gt; ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > &gt; &gt; &gt; &gt; On Sunday, December 6, 2020 6:00 PM, ath11k-request@lists.infradead.org wrote:
> > > > &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; Today's Topics:
> > > > &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; 1.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > > &gt; &gt; &gt; &gt; &gt; 2.  Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > > &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; Message: 1
> > > > &gt; &gt; &gt; &gt; &gt; Date: Sat, 5 Dec 2020 20:17:10 +0100
> > > > &gt; &gt; &gt; &gt; &gt; From: wi nk wink@technolu.st
> > > > &gt; &gt; &gt; &gt; &gt; To: Kalle Valo kvalo@codeaurora.org
> > > > &gt; &gt; &gt; &gt; &gt; Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > > > &gt; &gt; &gt; &gt; &gt; Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > > &gt; &gt; &gt; &gt; &gt; Message-ID:
> > > > &gt; &gt; &gt; &gt; &gt; CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A@mail.gmail.com
> > > > &gt; &gt; &gt; &gt; &gt; Content-Type: text/plain; charset="UTF-8"
> > > > &gt; &gt; &gt; &gt; &gt; On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > > > &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > > > &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Hi Wi and Thomas,
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; I'll start a new thread about problems on XPS 13. The information is
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; scattered to different threads and hard to find everything, it's much
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; easier to have everything in one place. So let's continue the discussion
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; about the kernel crashes on this thread.
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Here's what I have understood so far:
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     with 32 MSI vectors.
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13. We added a hack to ath11k to make it work with only one vector and after
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     that it's possible to boot the firmware, connect to the AP and use the
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     device for a while.
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; -   But the problem now is that the kernel is crashing almost immediately
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     and almost every time(?). And these crashes only happen on Dell XPS
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     13, all other systems (including Dell XPS 15) seem to work without
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;     issues.
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt;
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Is my understanding correct? Did I miss anything?
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; About the symptoms Wi reports:
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; So up until this point, everything is working without issues.
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; Everything seems to spiral out of control a couple of seconds later
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; when my system attempts to actually bring up the adapter. In most of
> > > > &gt; &gt; &gt; &gt; &gt; &gt; &gt; &gt; the crash states I will see this:
> > > > > > > > > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > > > > > > > > [ 31.391928] wlp85s0: authenticated
> > > > > > > > > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea (capab=0x411 status=0 aid=6)
> > > > > > > > > > > > [ 31.407730] wlp85s0: associated
> > > > > > > > > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > > > > > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > > > > > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > > > > > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > > > > > > > > timestamp):
> > > > > > > > > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > > > > > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > > > > > > > > extract this data other than screenshots from my phone at the moment,
> > > > > > > > > > > > you can see the dmesg output from 6 different hangs here:
> > > > > > > > > > > > https://github.com/w1nk/ath11k-debug
> > > > > > > > > > > >
> > > > > > > > > > > > And Thomas Krause reports:
> > > > > > > > > > > > I can confirm this behavior on my configuration. I managed to log in
> > > > > > > > > > > > once and select the Wifi and connect to it. It seemed, curiously enough,
> > > > > > > > > > > > to be stable long enough to enter the Wifi passphrase. After the
> > > > > > > > > > > > connection was established, the system hung and on each attempt to
> > > > > > > > > > > > reboot into the graphical system it would freeze at some point
> > > > > > > > > > > > (sometimes even before showing the login screen).
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > > > > > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > > > > > > > > >
> > > > > > > > > > > Hi Kalle,
> > > > > > > > > > > Again, thanks much for your work. I think you've summarized
> > > > > > > > > > > everything up until this point. On my XPS 13 9310 the behavior of the
> > > > > > > > > > > RT throttling still exists for me occasionally on loading the
> > > > > > > > > > > driver/associating with an AP. The throttling consistently occurs
> > > > > > > > > > > after a few sets of the MHI debug printing showing the EE entering an
> > > > > > > > > > > invalid state ( AMSS -> INVALID_EE ). I'm now building the latest tag
> > > > > > > > > > > to see if there are any differences.
> > > > > > > > > > > Thanks!
> > > > > > > > > >
> > > > > > > > > > Just to follow up, the first boot resulted in the RT throttling
> > > > > > > > > > message as the adapter was coming up/associating, shortly after the
> > > > > > > > > > firmware crashed and the kernel didn't fully freeze, but I needed to
> > > > > > > > > > reboot to bring the adapter back.
> > > > > > > > >
> > > > > > > > > Kalle -
> > > > > > > > > I've noticed one additional behavior that may give someone with
> > > > > > > > > familiarity with the QCA hardware a clue. I'm running
> > > > > > > > > ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > > > > > > > whatever reason, having the bluetooth subsystem enabled (with a paired
> > > > > > > > > device) on this dell basically guarantees I'll hit the scheduler
> > > > > > > > > throttling issue as the ath11k driver is initializing / associating.
> > > > > > > > > The bluetooth system is using the btqca driver. I don't have any
> > > > > > > > > useful debugging (I'll gladly collect some if there is a way to do it)
> > > > > > > > > other than tracking some simple statistics. I booted my system 20
> > > > > > > > > times, 10 times with bluetooth enabled (and some headphones turned on
> > > > > > > > > ready to pair), and 10 times without. In both scenarios, I'm booting
> > > > > > > > > into X and manually modprobing the ath11k driver. The difference is
> > > > > > > > > that with bluetooth on and by the time I modprobe the driver, the
> > > > > > > > > headphones are paired and I received the throttling message and
> > > > > > > > > subsequent freezing 10/10 times. With bluetooth off / my headphones
> > > > > > > > > not paired, I only saw it 2/10. I know it's not much hard information
> > > > > > > > > but it's reliably reproducible for me, is there anything useful I can
> > > > > > > > > collect?
> > > > > > > > >
> > > > > > > > > Message: 2
> > > > > > > > > Date: Sun, 6 Dec 2020 09:05:57 +0100
> > > > > > > > > From: wi nk <wink@technolu.st>
> > > > > > > > > To: Kalle Valo <kvalo@codeaurora.org>
> > > > > > > > > Cc: Thomas Krause <thomaskrause@posteo.de>, ath11k@lists.infradead.org
> > > > > > > > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > > > > > > > Message-ID: <CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q@mail.gmail.com>
> > > > > > > > > Content-Type: text/plain; charset="UTF-8"
> > > > > > > > >
> > > > > > > > > On Sat, Dec 5, 2020 at 8:17 PM wi nk <wink@technolu.st> wrote:
> > > > > > > > >
> > > > > > > > > > On Tue, Dec 1, 2020 at 11:17 AM wi nk <wink@technolu.st> wrote:
> > > > > > > > > >
> > > > > > > > > > > On Mon, Nov 30, 2020 at 6:02 PM wi nk <wink@technolu.st> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Wi and Thomas,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'll start a new thread about problems on XPS 13. The information is
> > > > > > > > > > > > > scattered to different threads and hard to find everything, it's much
> > > > > > > > > > > > > easier to have everything in one place. So let's continue the discussion
> > > > > > > > > > > > > about the kernel crashes on this thread.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Here's what I have understood so far:
> > > > > > > > > > > > >
> > > > > > > > > > > > > -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > > > > > > > > > >     with 32 MSI vectors.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > > > > > > > > > >
> > > > > > > > > > > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > > > > > > > > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > > > > > > > > > >
> > > > > > > > > > > > > -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > > > > > > > > > >     13. We added a hack to ath11k to make it work with only one vector and
> > > > > > > > > > > > >     after that it's possible to boot the firmware, connect to the AP and
> > > > > > > > > > > > >     use the device for a while.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -   But the problem now is that the kernel is crashing almost immediately
> > > > > > > > > > > > >     and almost every time(?). And these crashes only happen on Dell XPS
> > > > > > > > > > > > >     13, all other systems (including Dell XPS 15) seem to work without
> > > > > > > > > > > > >     issues.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is my understanding correct? Did I miss anything?
> > > > > > > > > > > > >
> > > > > > > > > > > > > About the symptoms Wi reports:
> > > > > > > > > > > > > So up until this point, everything is working without issues.
> > > > > > > > > > > > > Everything seems to spiral out of control a couple of seconds later
> > > > > > > > > > > > > when my system attempts to actually bring up the adapter. In most of
> > > > > > > > > > > > > the crash states I will see this:
> > > > > > > > > > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > > > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > > > > > > > > > [ 31.391928] wlp85s0: authenticated
> > > > > > > > > > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > > > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea (capab=0x411 status=0 aid=6)
> > > > > > > > > > > > > [ 31.407730] wlp85s0: associated
> > > > > > > > > > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > > > > > > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > > > > > > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > > > > > > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > > > > > > > > > timestamp):
> > > > > > > > > > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > > > > > > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > > > > > > > > > extract this data other than screenshots from my phone at the moment,
> > > > > > > > > > > > > you can see the dmesg output from 6 different hangs here:
> > > > > > > > > > > > > https://github.com/w1nk/ath11k-debug
> > > > > > > > > > > > >
> > > > > > > > > > > > > And Thomas Krause reports:
> > > > > > > > > > > > > I can confirm this behavior on my configuration. I managed to log in
> > > > > > > > > > > > > once and select the Wifi and connect to it. It seemed, curiously enough,
> > > > > > > > > > > > > to be stable long enough to enter the Wifi passphrase. After the
> > > > > > > > > > > > > connection was established, the system hung and on each attempt to
> > > > > > > > > > > > > reboot into the graphical system it would freeze at some point
> > > > > > > > > > > > > (sometimes even before showing the login screen).
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > > > > > > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Kalle,
> > > > > > > > > > > > Again, thanks much for your work. I think you've summarized
> > > > > > > > > > > > everything up until this point. On my XPS 13 9310 the behavior of the
> > > > > > > > > > > > RT throttling still exists for me occasionally on loading the
> > > > > > > > > > > > driver/associating with an AP. The throttling consistently occurs
> > > > > > > > > > > > after a few sets of the MHI debug printing showing the EE entering an
> > > > > > > > > > > > invalid state ( AMSS -> INVALID_EE ). I'm now building the latest tag
> > > > > > > > > > > > to see if there are any differences.
> > > > > > > > > > > > Thanks!
> > > > > > > > > > >
> > > > > > > > > > > Just to follow up, the first boot resulted in the RT throttling
> > > > > > > > > > > message as the adapter was coming up/associating, shortly after the
> > > > > > > > > > > firmware crashed and the kernel didn't fully freeze, but I needed to
> > > > > > > > > > > reboot to bring the adapter back.
> > > > > > > > > >
> > > > > > > > > > Kalle -
> > > > > > > > > > I've noticed one additional behavior that may give someone with
> > > > > > > > > > familiarity with the QCA hardware a clue. I'm running
> > > > > > > > > > ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > > > > > > > > whatever reason, having the bluetooth subsystem enabled (with a paired
> > > > > > > > > > device) on this dell basically guarantees I'll hit the scheduler
> > > > > > > > > > throttling issue as the ath11k driver is initializing / associating.
> > > > > > > > > > The bluetooth system is using the btqca driver. I don't have any
> > > > > > > > > > useful debugging (I'll gladly collect some if there is a way to do it)
> > > > > > > > > > other than tracking some simple statistics. I booted my system 20
> > > > > > > > > > times, 10 times with bluetooth enabled (and some headphones turned on
> > > > > > > > > > ready to pair), and 10 times without. In both scenarios, I'm booting
> > > > > > > > > > into X and manually modprobing the ath11k driver. The difference is
> > > > > > > > > > that with bluetooth on and by the time I modprobe the driver, the
> > > > > > > > > > headphones are paired and I received the throttling message and
> > > > > > > > > > subsequent freezing 10/10 times. With bluetooth off / my headphones
> > > > > > > > > > not paired, I only saw it 2/10. I know it's not much hard information
> > > > > > > > > > but it's reliably reproducible for me, is there anything useful I can
> > > > > > > > > > collect?
> > > > > > > > >
> > > > > > > > > Well unfortunately I think the bluetooth was just a red herring in the
> > > > > > > > > racing. To chase that, I disabled all bluetooth and was able to get
> > > > > > > > > into a state where I had 6 failed boots in a row. To further poke
> > > > > > > > > around, I rebuilt the kernel with localmodconfig to disable building
> > > > > > > > > big chunks of things. This kernel is way less stable and seems to
> > > > > > > > > freeze most of the time (but does occasionally remain stable), I'm not
> > > > > > > > > sure what else got disabled in there, but it seems to have had a
> > > > > > > > > negative impact on the crash racing.
> > > > > > > > >
> > > > > > >
> > > > > > > Hey Mitchell,
> > > > > > > One more thing to try that may help us get a little bit of extra
> > > > > > > info. Out of everything I've done, something that has remained
> > > > > > > consistent is to enable the MHI debugging as Kalle suggested:
> > > > > > > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> > > > > > > Before any crash/spinlock, I see the MHI printing (from
> > > > > > > drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
> > > > > > > and then after a number more iterations through this function, things
> > > > > > > finally go out of control. So from
> > > > > > >
> > > > > > >         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
> > > > > > >                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
> > > > > > >                 TO_MHI_STATE_STR(state));
> > > > > > >
> > > > > > > I'll see something like this:
> > > > > > > [ 312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
> > > > > > > [ 313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device ee:INVALID_EE dev_state:SYS_ERR
> > > > > > > Then after a few of those prints showing SYS_ERR, either a spinlock or
> > > > > > > a firmware crash. I'm not sure what causes this ee state to go
> > > > > > > invalid, but maybe that's worth looking into. Can you confirm the
> > > > > > > same behavior? To see this a little easier, I also run dmesg -wH in
> > > > > > > two windows, one piping to | grep -v mhi (to filter out the mhi
> > > > > > > debugging).
> > > > > > > Thanks!
> > > > > >
> > > > > > So I've managed to stabilise my system now, so either the race is
> > > > > > gone, or I've done something to win it all the time. So one of the
> > > > > > avenues of racing I was chasing at first was in the ath11k driver
> > > > > > itself. There are a couple areas where the single/shared IRQ is being
> > > > > > forcibly toggled in ways that the documentation says are not great
> > > > > > (and the original patch was trying to avoid). Fixing those didn't
> > > > > > seem to have much impact on the stability of things (I've included
> > > > > > those changes in my patch though). After the last email I was
> > > > > > thinking about the MHI side of things a bit more and found a number of
> > > > > > call sites that my naive grepping had missed that do the same thing,
> > > > > > but via acquiring a lock at the same time. I modified all the calls
> > > > > > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > > > > > variants that accept the flags parameter to capture state. I've now
> > > > > > booted and loaded the driver 10+ times without a single freeze or
> > > > > > crash. I'm not sure all of those modifications are necessary (ie:
> > > > > > which things are re-entrant in this single interrupt operating mode vs
> > > > > > which ones can use the simpler lock/unlock mechanisms), so I could use
> > > > > > some advice/guidance there.
> > > > > > Mitchell - if you want to grab this patch and try it, let me know how
> > > > > > it goes and I can clean it up for the mailing list:
> > > > > > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > > > > > (apply to ath11k-qca6390-bringup-202011301608)
> > > > >
> > > > > Blindly chasing the crashing, I've found one more probably relevant
> > > > > piece of information. As I was playing around trying to see if I had
> > > > > actually stopped the racing, I noticed my battery was low. I plugged
> > > > > it in and immediately received the RT throttling crash. I've now tried
> > > > > quite a bit, and on the battery I don't see the crashing. I thought
> > > > > maybe dynamic CPU clocking is changing some of the racing properties.
> > > > > When I bring everything up on the battery and wait around a bit, once
> > > > > I plug in the usb-c cable, within a few seconds it will often trigger
> > > > > the RT throttling message. I poked a little bit at some of the wifi
> > > > > power management settings, specifically trying to disable them, but I
> > > > > didn't seem to kick anything relevant yet. I can essentially use the
> > > > > power cable as a trigger for this race though..
> > > > >
> > > > > Kalle - are you aware of anything that happens to the driver/adapter
> > > > > when ac power shows up? I think I see some power saving stuff in
> > > > > wmi.c but I haven't gotten deep enough to know...
> > > >
> > >
> > > Mitchell - one thing to note re the mhi debugging, the module needs to
> > > be in place first.  Here's how I've been doing it:
> > >
> > > modprobe ath11k_pci; echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control; dmesg -wH
> > >
> > > In the previously linked git repo, I've added my kernel build config,
> > > that may be worth trying.  Another change I've made that seems to help
> > > is to completely disable power management for 80211 in the kernel.
> > > Between that and setting ubuntu to leave the iwconfig things alone, it
> > > seems to have resolved the power plugging stuff.  I'm guessing the
> > > real racing is related to just attempting to configure/reconfigure
> > > settings in the adapter (which is why we're seeing crashing when it
> > > tries to actually attempt to 'do things', like associate or modify
> > > operational configs, before it goes nuts).  The thing that's weird is
> > > that I'm assuming the instability has been introduced due to the
> > > shared IRQ since presumably this driver works for the previous pieces
> > > of hardware the chipset was put into, but specifically in those
> > > codepaths, there's nothing obviously related to the single IRQ.  Which
> > > leads my thoughts back to timing/synchronization issues...
> >
> > While I'm semi-randomly poking things I decided to capture some
> > information in a structured way that could be useful to Kalle and
> > team.  I'm running the latest bringup branch without any
> > modifications.  I booted my machine 6 consecutive times to demonstrate
> > the power triggering the freezing I was referring to.  In each video,
> > you'll see the dmesg output, and in the cases I can control, you'll
> > also see it with MHI debugging.
> >
> > The first 2 boots, I'm intentionally booting / initializing the driver
> > on battery power and then waiting 5+ minutes to plug in the charger.
> > Note:  the system always comes online and remains stable when I start
> > in this configuration, it's only when I plug the charger in that it
> > crashes.
> >
> > Boot 1: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_004643171.mp4
> > - The machine and driver have been online and stable for 5 minutes (as
> > seen in htop/ping), within a few seconds of plugging in the usb
> > charger, the mhi debugging shows a failure and the machine crashes.
> >
> > Boot 2 : https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005346443.mp4
> > - Same set up (although the machine had been up for 6 minutes at that
> > point) and failure as boot 1. The machine hard locks instantly this
> > time, as opposed to the stuttering you can see in boot 1.
> >
> > For the next boots, I'm booting / initializing the driver with the
> > charger plugged in ahead of time:
> >
> > Boot 3 : https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005642416.mp4
> > - Within a few seconds of the driver initializing, the machine
> > crashes.
> >
> > Boot 4 : https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005800378.mp4
> > - Same setup as boot 3, but this time the system survives a bit longer
> > (15 seconds or so).
> >
> > Boot 5: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005938734.mp4
> > - Same setup as 3/4, similar crash to boot 4.  The driver survives ~15
> > seconds and then the machine hangs.
> >
> > After this I went back to the setup for boot 1/2 where I brought
> > everything online, waited a bit over 5 minutes and plugged in the
> > charger.
> >
> > Boot 6: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_010537553.mp4
> > - This boot was successful and has remained stable.  I'm composing
> > this email from it.  If this follows previous behavior, it should stay
> > online for at least 24h (I always fiddled beyond that).
> >
> > So in conclusion, I wanted to demonstrate that being on battery power
> > clearly keeps my system stable in a way that goes away when I plug in
> > my charger (both up front, and after initialization).  I don't have any
> > great ideas about what could be going on. I'm not entirely sure it's
> > directly power related, but when I toggle it, clearly something is
> > linked (maybe back to the ACPI tables being borked?).  I'll leave this
> > boot running as long as I can to see if it randomly crashes after an
> > hour...
>
> Github didn't appreciate hosting those mp4s too much, I've reuploaded
> them here as well:
> https://drive.google.com/drive/folders/1wvxZI5XtwPSrm0-6-Ov50cUfqBXSXeNz?usp=sharing

So as expected, that final boot stayed online until I started
tinkering.  That said, I continued to follow the thought that somehow
this power stuff was related.  I rewatched my videos and noted that
the hanging is occurring after the MHI driver reports a transition to
the M2 state.  When on battery, it doesn't ever attempt that
transition, and when on a charger it attempts it immediately (hence
the behavior I was seeing).  I altered the MHI driver a little to just
ignore/prevent the transition to the M2 state and it has fixed at
least the immediate hanging.  I can reboot and initialize the driver
in any state (plugged/not) and it stays online without a crash/RT
throttle.  I'm guessing disabling this entirely isn't the correct
thing to do as I can see the interface reports itself oddly in my
systray occasionally, but it does prevent crashing for me and the
adapter seems to operate correctly.

diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
index 3de7b1639ec6..20f670c8b129 100644
--- a/drivers/bus/mhi/core/pm.c
+++ b/drivers/bus/mhi/core/pm.c
@@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
     },
     {
         MHI_PM_M0,
-        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
+        MHI_PM_M0 | MHI_PM_M3_ENTER |
         MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
         MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
     },
     {
-        MHI_PM_M2,
+        MHI_PM_M0,
         MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
         MHI_PM_LD_ERR_FATAL_DETECT
     },
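
To make it easier to see what the change above is meant to do, here is a
small user-space toy, not the MHI code. It assumes, as I read the table,
that each entry pairs a "from" state with a bitmask of states it may move
to, and that a request outside the mask is simply refused. All toy_* names
and bit values are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state bits, loosely named after the MHI_PM_* flags in the
 * diff. The real values and the full table live in pm.c. */
enum toy_pm_state {
    TOY_PM_M0       = 1u << 0,
    TOY_PM_M2       = 1u << 1,
    TOY_PM_M3_ENTER = 1u << 2,
    TOY_PM_SYS_ERR  = 1u << 3,
};

struct toy_pm_transition {
    unsigned int from_state; /* exactly one state bit */
    unsigned int to_states;  /* bitmask of permitted next states */
};

/* Table shaped like the unpatched one: M0 may enter M2. */
const struct toy_pm_transition toy_stock_table[] = {
    { TOY_PM_M0, TOY_PM_M0 | TOY_PM_M2 | TOY_PM_M3_ENTER | TOY_PM_SYS_ERR },
    { TOY_PM_M2, TOY_PM_M0 | TOY_PM_SYS_ERR },
};
const size_t toy_stock_len = sizeof toy_stock_table / sizeof toy_stock_table[0];

/* Table with M2 dropped from the M0 mask, as in the experiment above. */
const struct toy_pm_transition toy_no_m2_table[] = {
    { TOY_PM_M0, TOY_PM_M0 | TOY_PM_M3_ENTER | TOY_PM_SYS_ERR },
};
const size_t toy_no_m2_len = sizeof toy_no_m2_table / sizeof toy_no_m2_table[0];

/* Return the state after requesting `next` from `cur`: `next` if the
 * table permits it, otherwise `cur` unchanged. */
unsigned int toy_try_set_state(const struct toy_pm_transition *table,
                               size_t len, unsigned int cur,
                               unsigned int next)
{
    assert(next != 0); /* a request must name a state */

    for (size_t i = 0; i < len; i++) {
        if (table[i].from_state != cur)
            continue;
        return (table[i].to_states & next) ? next : cur;
    }
    return cur; /* unknown current state: refuse the transition */
}
```

With the stock-shaped table a request to go from M0 to M2 is granted; with
the reduced table the same request leaves the state at M0, while M0 to
M3_ENTER still goes through. Whether the real driver copes with the device
asking for M2 and never getting it is exactly the part I'm unsure about.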

-- 
ath11k mailing list
ath11k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath11k

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-06 21:45   ` wi nk
  2020-12-07  1:17     ` wi nk
@ 2020-12-09 15:35     ` Kalle Valo
  2020-12-09 15:39       ` wi nk
  1 sibling, 1 reply; 31+ messages in thread
From: Kalle Valo @ 2020-12-09 15:35 UTC (permalink / raw)
  To: wi nk; +Cc: ath11k, Mitchell Nordine

wi nk <wink@technolu.st> writes:

> So I've managed to stabilise my system now, so either the race is
> gone, or I've done something to win it all the time.  So one of the
> avenues of racing I was chasing at first was in the ath11k driver
> itself.  There are a couple areas where the single/shared IRQ is being
> forcibly toggled in ways that the documentation says are not great
> (and the original patch was trying to avoid).  Fixing those didn't
> seem to have much impact on the stability of things (I've included
> those changes in my patch though).  After the last email I was
> thinking about the MHI side of things a bit more and found a number of
> call sites that my naive grepping had missed that do the same thing,
> but via acquiring a lock at the same time.  I modified all the calls
> to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> variants that accept the flags parameter to capture state.  I've now
> booted and loaded the driver 10+ times without a single freeze or
> crash.  I'm not sure all of those modifications are necessary (ie:
> which things are re-entrant in this single interrupt operating mode vs
> which ones can use the simpler lock/unlock mechanisms), so I could use
> some advice/guidance there.
>
> Mitchell - if you want to grab this patch and try it, let me know how
> it goes and I can clean it up for the mailing list:
> https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> (apply to ath11k-qca6390-bringup-202011301608)

Wink, I want to ask more about the very interesting
one-irq-manage.patch you wrote. Have you seen the "sched: RT throttling
activated" crash with that patch? If yes, how many times, for example 5
out of 10 times or something like that?

Or is it that with one-irq-manage.patch the kernel doesn't crash at all? I
didn't quite understand the situation.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-09 15:35     ` Kalle Valo
@ 2020-12-09 15:39       ` wi nk
  2020-12-09 15:50         ` wi nk
  2020-12-09 15:50         ` Kalle Valo
  0 siblings, 2 replies; 31+ messages in thread
From: wi nk @ 2020-12-09 15:39 UTC (permalink / raw)
  To: Kalle Valo; +Cc: ath11k, Mitchell Nordine

On Wed, Dec 9, 2020 at 4:35 PM Kalle Valo <kvalo@codeaurora.org> wrote:
>
> wi nk <wink@technolu.st> writes:
>
> > So I've managed to stabilise my system now, so either the race is
> > gone, or I've done something to win it all the time.  So one of the
> > avenues of racing I was chasing at first was in the ath11k driver
> > itself.  There are a couple areas where the single/shared IRQ is being
> > forcibly toggled in ways that the documentation says are not great
> > (and the original patch was trying to avoid).  Fixing those didn't
> > seem to have much impact on the stability of things (I've included
> > those changes in my patch though).  After the last email I was
> > thinking about the MHI side of things a bit more and found a number of
> > call sites that my naive grepping had missed that do the same thing,
> > but via acquiring a lock at the same time.  I modified all the calls
> > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > variants that accept the flags parameter to capture state.  I've now
> > booted and loaded the driver 10+ times without a single freeze or
> > crash.  I'm not sure all of those modifications are necessary (ie:
> > which things are re-entrant in this single interrupt operating mode vs
> > which ones can use the simpler lock/unlock mechanisms), so I could use
> > some advice/guidance there.
> >
> > Mitchell - if you want to grab this patch and try it, let me know how
> > it goes and I can clean it up for the mailing list:
> > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > (apply to ath11k-qca6390-bringup-202011301608)
>
> Wink, I want to ask more about the very interesting
> one-irq-manage.patch you wrote. Have you seen the "sched: RT throttling
> activated" crash with that patch? If yes, how many times, for example 5
> out of 10 times or something like that?
>
> Or is it that with one-irq-manage.patch the kernel doesn't crash at all? I
> didn't quite understand the situation.
>
> --
> https://patchwork.kernel.org/project/linux-wireless/list/
>
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

Kalle,

   Sorry for moving the thread :).  I've tried 2 patches, with varying
degrees of success.  The single IRQ patch changed the failure mode from
an immediate hard lock to the stuttering / RT throttling message.  So
instead of hard locking 9/10 times and stuttering 1/10, the ratio was
inverted: stuttering 9/10 times and hard locking 1/10.

The second patch, which disables the M2 transition, seems to have
resolved the issues altogether (even without the single IRQ patch),
but at the expense of disabling the M2 state, and I don't know what
the consequences of that are.
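
For anyone following along, here is a rough userspace model of why the
plain *_lock_irq/*_unlock_irq variants can matter here.  This is only a
sketch: a single bool stands in for the CPU's local interrupt-enable
state, and every model_* name is made up for illustration (none of this
is kernel or ath11k code).  The behaviour it models is that the plain
unlock variant unconditionally re-enables interrupts, while the
irqsave/irqrestore pair puts back whatever state the caller had.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: one flag stands in for the local IRQ-enable state. */
static bool model_irqs_enabled = true;

/* Models spin_lock_irq(): disables interrupts, forgets the old state. */
static void model_lock_irq(void)
{
    model_irqs_enabled = false;
}

/* Models spin_unlock_irq(): unconditionally re-enables interrupts. */
static void model_unlock_irq(void)
{
    model_irqs_enabled = true;
}

/* Models spin_lock_irqsave(): records the previous state, then disables. */
static void model_lock_irqsave(bool *flags)
{
    *flags = model_irqs_enabled;
    model_irqs_enabled = false;
}

/* Models spin_unlock_irqrestore(): restores the recorded state. */
static void model_unlock_irqrestore(bool flags)
{
    model_irqs_enabled = flags;
}

/* Caller already has interrupts off, then runs a helper using plain _irq. */
static bool state_after_plain_helper(void)
{
    model_irqs_enabled = false;   /* caller context: interrupts off */
    model_lock_irq();
    model_unlock_irq();           /* side effect: interrupts are back on */
    return model_irqs_enabled;
}

/* Same caller context, but the helper uses the save/restore variants. */
static bool state_after_irqsave_helper(void)
{
    bool flags;

    model_irqs_enabled = false;   /* caller context: interrupts off */
    model_lock_irqsave(&flags);
    model_unlock_irqrestore(flags);
    return model_irqs_enabled;    /* caller's state is preserved */
}
```

In this model the plain helper leaves interrupts enabled behind the
caller's back, while the save/restore helper leaves them disabled as the
caller expected.  Whether any of the real call sites are reached with
interrupts already off is exactly the part I'm unsure about.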

-- 
ath11k mailing list
ath11k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath11k

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-09 15:39       ` wi nk
@ 2020-12-09 15:50         ` wi nk
  2020-12-09 15:50         ` Kalle Valo
  1 sibling, 0 replies; 31+ messages in thread
From: wi nk @ 2020-12-09 15:50 UTC (permalink / raw)
  To: Kalle Valo; +Cc: ath11k, Mitchell Nordine

On Wed, Dec 9, 2020 at 4:39 PM wi nk <wink@technolu.st> wrote:
>
> On Wed, Dec 9, 2020 at 4:35 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> >
> > wi nk <wink@technolu.st> writes:
> >
> > > So I've managed to stabilise my system now, so either the race is
> > > gone, or I've done something to win it all the time.  So one of the
> > > avenues of racing I was chasing at first was in the ath11k driver
> > > itself.  There are a couple areas where the single/shared IRQ is being
> > > forcibly toggled in ways that the documentation says are not great
> > > (and the original patch was trying to avoid).  Fixing those didn't
> > > seem to have much impact on the stability of things (I've included
> > > those changes in my patch though).  After the last email I was
> > > thinking about the MHI side of things a bit more and found a number of
> > > call sites that my naive grepping had missed that do the same thing,
> > > but via acquiring a lock at the same time.  I modified all the calls
> > > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > > variants that accept the flags parameter to capture state.  I've now
> > > booted and loaded the driver 10+ times without a single freeze or
> > > crash.  I'm not sure all of those modifications are necessary (ie:
> > > which things are re-entrant in this single interrupt operating mode vs
> > > which ones can use the simpler lock/unlock mechanisms), so I could use
> > > some advice/guidance there.
> > >
> > > Mitchell - if you want to grab this patch and try it, let me know how
> > > it goes and I can clean it up for the mailing list:
> > > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > > (apply to ath11k-qca6390-bringup-202011301608)
> >
> > Wink, I want to ask more about the very interesting
> > one-irq-manage.patch you wrote. Have you seen the "sched: RT throttling
> > activated" crash with that patch? If yes, how many times, for example 5
> > out of 10 times or something like that?
> >
> > Or is it so with one-irq-manage.patch the kernel doesn't crash at all? I
> > didn't quite understand the situation.
> >
> > --
> > https://patchwork.kernel.org/project/linux-wireless/list/
> >
> > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
>
> Kalle,
>
>    Sorry for moving the thread :).  So I've attempted 2 patches that
> seem to produce varying degrees of success.  The single IRQ patch took
> the crashing behaviour from hard locking immediately, to that
> stuttering / RT throttling message consistently.  So instead of hard
> locking 9/10 times and stuttering 1/10, it was inverted.
>
> The second patch disabling the m2 transition (even without the single
> IRQ patch) seems to have resolved the issues altogether, but at the
> expense of disabling this m2 state, which I don't have much idea of
> the consequences..

Sorry, one more point of clarification: after the first patch I made, I
was always able to bring the adapter up while not on the charger.  I
didn't test that mode on an unmodified bringup branch.  I suspect it
would eventually crash anyway and that I just changed some of the race
parameters; I can confirm that if it would be useful to you.
It seems the key is that sometimes something causes this M2 state
transition (in my original observation, plugging in the charger),
and most of the time that state transition causes the
EE to become invalid, and then everything goes sideways.  The adapter
does seem to survive the transition occasionally,
just not often.  Preventing the transition entirely seems to keep the
race from ever occurring, but doesn't really solve it.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-09 15:39       ` wi nk
  2020-12-09 15:50         ` wi nk
@ 2020-12-09 15:50         ` Kalle Valo
  2020-12-09 15:55           ` wi nk
  1 sibling, 1 reply; 31+ messages in thread
From: Kalle Valo @ 2020-12-09 15:50 UTC (permalink / raw)
  To: wi nk; +Cc: ath11k, Mitchell Nordine

wi nk <wink@technolu.st> writes:

> On Wed, Dec 9, 2020 at 4:35 PM Kalle Valo <kvalo@codeaurora.org> wrote:
>>
>> wi nk <wink@technolu.st> writes:
>>
>> > So I've managed to stabilise my system now, so either the race is
>> > gone, or I've done something to win it all the time.  So one of the
>> > avenues of racing I was chasing at first was in the ath11k driver
>> > itself.  There are a couple areas where the single/shared IRQ is being
>> > forcibly toggled in ways that the documentation says are not great
>> > (and the original patch was trying to avoid).  Fixing those didn't
>> > seem to have much impact on the stability of things (I've included
>> > those changes in my patch though).  After the last email I was
>> > thinking about the MHI side of things a bit more and found a number of
>> > call sites that my naive grepping had missed that do the same thing,
>> > but via acquiring a lock at the same time.  I modified all the calls
>> > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
>> > variants that accept the flags parameter to capture state.  I've now
>> > booted and loaded the driver 10+ times without a single freeze or
>> > crash.  I'm not sure all of those modifications are necessary (ie:
>> > which things are re-entrant in this single interrupt operating mode vs
>> > which ones can use the simpler lock/unlock mechanisms), so I could use
>> > some advice/guidance there.
>> >
>> > Mitchell - if you want to grab this patch and try it, let me know how
>> > it goes and I can clean it up for the mailing list:
>> > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
>> > (apply to ath11k-qca6390-bringup-202011301608)
>>
>> Wink, I want to ask more about the very interesting
>> one-irq-manage.patch you wrote. Have you seen the "sched: RT throttling
>> activated" crash with that patch? If yes, how many times, for example 5
>> out of 10 times or something like that?
>>
>> Or is it so with one-irq-manage.patch the kernel doesn't crash at all? I
>> didn't quite understand the situation.
>>
>> --
>> https://patchwork.kernel.org/project/linux-wireless/list/
>>
>> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
>
> Kalle,
>
>    Sorry for moving the thread :).

No problem, I'll just ask extra questions to make sure that I'm
understanding things correctly :)

> So I've attempted 2 patches that seem to produce varying degrees of
> success. The single IRQ patch took the crashing behaviour from hard
> locking immediately, to that stuttering / RT throttling message
> consistently. So instead of hard locking 9/10 times and stuttering
> 1/10, it was inverted.

Ok, got it now.

> The second patch disabling the m2 transition (even without the single
> IRQ patch) seems to have resolved the issues altogether, but at the
> expense of disabling this m2 state, which I don't have much idea of
> the consequences..

Sorry, I missed that. Which second patch are you talking about?

Also can you share your /proc/interrupts in full?

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-09 15:50         ` Kalle Valo
@ 2020-12-09 15:55           ` wi nk
  2020-12-09 21:46             ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-09 15:55 UTC (permalink / raw)
  To: Kalle Valo; +Cc: ath11k, Mitchell Nordine

On Wed, Dec 9, 2020 at 4:50 PM Kalle Valo <kvalo@codeaurora.org> wrote:
>
> wi nk <wink@technolu.st> writes:
>
> > On Wed, Dec 9, 2020 at 4:35 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> >>
> >> wi nk <wink@technolu.st> writes:
> >>
> >> > So I've managed to stabilise my system now, so either the race is
> >> > gone, or I've done something to win it all the time.  So one of the
> >> > avenues of racing I was chasing at first was in the ath11k driver
> >> > itself.  There are a couple areas where the single/shared IRQ is being
> >> > forcibly toggled in ways that the documentation says are not great
> >> > (and the original patch was trying to avoid).  Fixing those didn't
> >> > seem to have much impact on the stability of things (I've included
> >> > those changes in my patch though).  After the last email I was
> >> > thinking about the MHI side of things a bit more and found a number of
> >> > call sites that my naive grepping had missed that do the same thing,
> >> > but via acquiring a lock at the same time.  I modified all the calls
> >> > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> >> > variants that accept the flags parameter to capture state.  I've now
> >> > booted and loaded the driver 10+ times without a single freeze or
> >> > crash.  I'm not sure all of those modifications are necessary (ie:
> >> > which things are re-entrant in this single interrupt operating mode vs
> >> > which ones can use the simpler lock/unlock mechanisms), so I could use
> >> > some advice/guidance there.
> >> >
> >> > Mitchell - if you want to grab this patch and try it, let me know how
> >> > it goes and I can clean it up for the mailing list:
> >> > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> >> > (apply to ath11k-qca6390-bringup-202011301608)
> >>
> >> Wink, I want to ask more about the very interesting
> >> one-irq-manage.patch you wrote. Have you seen the "sched: RT throttling
> >> activated" crash with that patch? If yes, how many times, for example 5
> >> out of 10 times or something like that?
> >>
> >> Or is it so with one-irq-manage.patch the kernel doesn't crash at all? I
> >> didn't quite understand the situation.
> >>
> >> --
> >> https://patchwork.kernel.org/project/linux-wireless/list/
> >>
> >> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> >
> > Kalle,
> >
> >    Sorry for moving the thread :).
>
> No problem, I'll just make extra questions to make sure that I'm
> understanding things correctly :)
>
> > So I've attempted 2 patches that seem to produce varying degrees of
> > success. The single IRQ patch took the crashing behaviour from hard
> > locking immediately, to that stuttering / RT throttling message
> > consistently. So instead of hard locking 9/10 times and stuttering
> > 1/10, it was inverted.
>
> Ok, got it now.
>
> > The second patch disabling the m2 transition (even without the single
> > IRQ patch) seems to have resolved the issues altogether, but at the
> > expense of disabling this m2 state, which I don't have much idea of
> > the consequences..
>
> Sorry, I have missed that. What second patch are you talking about?
>
> Also can you share your /proc/interrupts in full?
>
> --
> https://patchwork.kernel.org/project/linux-wireless/list/
>
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
>
> --
> ath11k mailing list
> ath11k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath11k

Here's /proc/interrupts in full, with the short patch after it:

            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
   0:          7          0          0          0          0          0          0          0   IO-APIC    2-edge      timer
   1:          0          0          0          0          0          0          0       2923   IO-APIC    1-edge      i8042
   8:          0          0          0          0          0          0          0          0   IO-APIC    8-edge      rtc0
   9:          0       9290          0          0          0          0          0          0   IO-APIC    9-fasteoi   acpi
  12:          0          0          0          0          0          0         53          0   IO-APIC   12-edge      i8042
  14:          0      29816          0          0          0          0          0          0   IO-APIC   14-fasteoi   INT34C5:00
  16:          0          0          0          0          0      10376          0          0   IO-APIC   16-fasteoi   intel_ish_ipc, i801_smbus, idma64.4
  27:          0          0          0          0          0          0          0          0   IO-APIC   27-fasteoi   idma64.0, i2c_designware.0
  31:          0          0          0          0          0          0          0          0   IO-APIC   31-fasteoi   idma64.2, i2c_designware.2
  32:          0          0          0          0          0          0          0          0   IO-APIC   32-fasteoi   idma64.3, i2c_designware.3
  40:       9681     777197      27906          0          0          0          0          0   IO-APIC   40-fasteoi   idma64.1, i2c_designware.1
 120:          0          0          0          0          0          0          0          0   PCI-MSI 114688-edge      PCIe PME, pciehp
 121:          0          0          0          0          0          0          0          0   PCI-MSI 118784-edge      PCIe PME, pciehp
 122:          0          0          0          0          0          0          0          0   PCI-MSI 458752-edge      PCIe PME
 123:          0          0          0          0          0          0          0          0   PCI-MSI 475136-edge      PCIe PME
 124:          0          0          1          0          0          0          0          0   PCI-MSI 229376-edge      vmd
 125:          0          0          0         27          0          0          0          0   PCI-MSI 229377-edge      vmd
 126:          0          0          0          0       4303          0          0          0   PCI-MSI 229378-edge      vmd
 127:          0          0          0          0          0       2992          0        434   PCI-MSI 229379-edge      vmd
 128:          0          0          0          0          0        593       2504          0   PCI-MSI 229380-edge      vmd
 129:          0          0          0          0        699          0       1061       1873   PCI-MSI 229381-edge      vmd
 130:       2382        394          0        603          0          0          0          0   PCI-MSI 229382-edge      vmd
 131:          0       1670          0        406        646          0          0          0   PCI-MSI 229383-edge      vmd
 132:        692          0       2903          0          0          0          0          0   PCI-MSI 229384-edge      vmd
 133:          0        518        913       2198          0          0          0          0   PCI-MSI 229385-edge      vmd
 134:          0          0          0          0          0          0          0          0   PCI-MSI 229386-edge      vmd
 135:          0          0          0          0          0          0          0          0   PCI-MSI 229387-edge      vmd
 136:          0          0          0          0          0          0          0          0   PCI-MSI 229388-edge      vmd
 137:          0          0          0          0          0          0          0          0   PCI-MSI 229389-edge      vmd
 138:          0          0          0          0          0          0          0          0   PCI-MSI 229390-edge      vmd
 139:          0          0          0          0          0          0          0          0   PCI-MSI 229391-edge      vmd
 140:          0          0          0          0          0          0          0          0   PCI-MSI 229392-edge      vmd
 141:          0          0          0          0          0          0          0          0   PCI-MSI 229393-edge      vmd
 142:          0          0          0          0          0          0          0          0   PCI-MSI 229394-edge      vmd
 143:          0          0          0          0          0          0          0          0   VMD-MSI  124  PCIe PME, aerdrv, pcie-dpc
 144:          0          0          0          0          0          0          1          0   PCI-MSI 212992-edge      xhci_hcd
 145:          0          0          0          0          0          0          0         72   PCI-MSI 327680-edge      xhci_hcd
 146:          6          0          0          0          0          0          0          0   PCI-MSI 45088768-edge      rtsx_pci
 147:          0          0          0          0          0          0          0          0   VMD-MSI  125  nvme0q0
 148:          0          0          0       1859          0          0          0      38399   PCI-MSI 32768-edge      i915
 149:          0          0          0          0          0          0          0          0   VMD-MSI  126  nvme0q1
 150:          0          0          0          0          0          0          0          0   VMD-MSI  127  nvme0q2
 151:          0          0          0          0          0          0          0          0   VMD-MSI  128  nvme0q3
 152:          0          0          0          0          0          0          0          0   VMD-MSI  129  nvme0q4
 153:          0          0          0          0          0          0          0          0   VMD-MSI  130  nvme0q5
 154:          0          0          0          0          0          0          0          0   VMD-MSI  131  nvme0q6
 155:          0          0          0          0          0          0          0          0   VMD-MSI  132  nvme0q7
 156:          0          0          0          0          0          0          0          0   VMD-MSI  133  nvme0q8
 157:          0      29816          0          0          0          0          0          0  INT34C5:00  327  DLL0945:00
 158:          0          0          0          0          0          0         48          0   PCI-MSI 360448-edge      mei_me
 159:          0          0          0          0          0          0          0       1134   PCI-MSI 514048-edge      AudioDSP
 162:          0          0          0     108102          0          0          0          0   PCI-MSI 44564480-edge      ce0, ce1, ce2, ce3, ce5, ce7, ce8, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, bhi, mhi, mhi
 NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
 LOC:      64516      80387      54151      82574      64663     113373      58033      81555   Local timer interrupts
 SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
 PMI:          0          0          0          0          0          0          0          0   Performance monitoring interrupts
 IWI:          5          2          1        760          1          1          0      16078   IRQ work interrupts
 RTR:          6          0          0          0          0          0          0          0   APIC ICR read retries
 RES:       1834       7304       1432       1807       3015       1552       1417       1498   Rescheduling interrupts
 CAL:      21739      26798      28934      22211      22590      28622      22541      20023   Function call interrupts
 TLB:      51267      49182      59392      48384      46755      56491      48103      46560   TLB shootdowns
 TRM:          2          2          2          2          2          2          2          2   Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:          3          4          4          4          4          4          4          4   Machine check polls
 ERR:         16
 MIS:          0
 PIN:          0          0          0          0          0          0          0          0   Posted-interrupt notification event
 NPI:          0          0          0          0          0          0          0          0   Nested posted-interrupt event
 PIW:          0          0          0          0          0          0          0          0   Posted-interrupt wakeup event
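
The row to look at is IRQ 162, where all of the ce*, DP_EXT_IRQ, bhi
and mhi handlers sit on one MSI vector.  If it helps when comparing
dumps, a small helper along these lines can count how many handlers
share a row.  It is just a sketch: count_irq_actions is a made-up name,
and it assumes you pass it only the trailing comma-separated action list
of a /proc/interrupts row, not the whole row.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Counts the comma-separated handler names in the action list that ends a
 * /proc/interrupts row.  Names may contain spaces (e.g. "PCIe PME"), so
 * only a comma ends a name.  Returns 0 for NULL or an empty list.
 */
static size_t count_irq_actions(const char *actions)
{
    size_t count = 0;
    int in_name = 0;

    if (actions == NULL)
        return 0;

    for (const char *p = actions; *p != '\0'; p++) {
        if (*p == ',') {
            in_name = 0;          /* next non-blank starts a new name */
        } else if (*p != ' ' && *p != '\t' && *p != '\n') {
            if (!in_name) {
                count++;          /* first character of a new name */
                in_name = 1;
            }
        }
    }
    return count;
}
```

Fed the action list from the IRQ 162 row above, this gives the number of
handlers registered on that single vector.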

And the modification that disables the M2 state:

diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
index 3de7b1639ec6..20f670c8b129 100644
--- a/drivers/bus/mhi/core/pm.c
+++ b/drivers/bus/mhi/core/pm.c
@@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
     },
     {
         MHI_PM_M0,
-        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
+        MHI_PM_M0 | MHI_PM_M3_ENTER |
         MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
         MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
     },
     {
-        MHI_PM_M2,
+        MHI_PM_M0,
         MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
         MHI_PM_LD_ERR_FATAL_DETECT
     },
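
To show what removing MHI_PM_M2 from the M0 mask is meant to do, here is
a simplified, standalone model of a bitmask transition table.  It is a
sketch under assumptions, not the MHI code: the SKETCH_* bit values, the
struct and sketch_try_set_state are all invented for illustration, and
the "without M2" table simply drops the M2 row rather than relabelling
it as the diff above does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
enum sketch_pm_state {
    SKETCH_PM_M0       = 1u << 0,
    SKETCH_PM_M2       = 1u << 1,
    SKETCH_PM_M3_ENTER = 1u << 2,
};

struct sketch_transition {
    unsigned int from_state;  /* exactly one state bit */
    unsigned int to_states;   /* OR of the states reachable from it */
};

/* Table that still allows M0 -> M2. */
static const struct sketch_transition table_with_m2[] = {
    { SKETCH_PM_M0, SKETCH_PM_M0 | SKETCH_PM_M2 | SKETCH_PM_M3_ENTER },
    { SKETCH_PM_M2, SKETCH_PM_M0 },
};

/* Table with M2 removed from the M0 mask. */
static const struct sketch_transition table_without_m2[] = {
    { SKETCH_PM_M0, SKETCH_PM_M0 | SKETCH_PM_M3_ENTER },
};

/*
 * Returns the new state if the table permits cur -> next, otherwise
 * returns cur unchanged (the transition is refused).
 */
static unsigned int sketch_try_set_state(const struct sketch_transition *table,
                                         size_t n, unsigned int cur,
                                         unsigned int next)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].from_state == cur && (table[i].to_states & next))
            return next;
    }
    return cur;
}
```

With the first table a request to go from M0 to M2 is accepted; with the
second it is refused and the state stays at M0, while M0 to M3_ENTER
still works.  That matches what I see in practice: the transition never
happens, so the race never gets a chance to trigger.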


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-09 15:55           ` wi nk
@ 2020-12-09 21:46             ` wi nk
  2020-12-11 12:28               ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-09 21:46 UTC (permalink / raw)
  To: Kalle Valo; +Cc: ath11k, Mitchell Nordine

On Wed, Dec 9, 2020 at 4:55 PM wi nk <wink@technolu.st> wrote:
>
> On Wed, Dec 9, 2020 at 4:50 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> >
> > wi nk <wink@technolu.st> writes:
> >
> > > On Wed, Dec 9, 2020 at 4:35 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> > >>
> > >> wi nk <wink@technolu.st> writes:
> > >>
> > >> > So I've managed to stabilise my system now, so either the race is
> > >> > gone, or I've done something to win it all the time.  So one of the
> > >> > avenues of racing I was chasing at first was in the ath11k driver
> > >> > itself.  There are a couple areas where the single/shared IRQ is being
> > >> > forcibly toggled in ways that the documentation says are not great
> > >> > (and the original patch was trying to avoid).  Fixing those didn't
> > >> > seem to have much impact on the stability of things (I've included
> > >> > those changes in my patch though).  After the last email I was
> > >> > thinking about the MHI side of things a bit more and found a number of
> > >> > call sites that my naive grepping had missed that do the same thing,
> > >> > but via acquiring a lock at the same time.  I modified all the calls
> > >> > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > >> > variants that accept the flags parameter to capture state.  I've now
> > >> > booted and loaded the driver 10+ times without a single freeze or
> > >> > crash.  I'm not sure all of those modifications are necessary (ie:
> > >> > which things are re-entrant in this single interrupt operating mode vs
> > >> > which ones can use the simpler lock/unlock mechanisms), so I could use
> > >> > some advice/guidance there.
> > >> >
> > >> > Mitchell - if you want to grab this patch and try it, let me know how
> > >> > it goes and I can clean it up for the mailing list:
> > >> > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > >> > (apply to ath11k-qca6390-bringup-202011301608)
> > >>
> > >> Wink, I want to ask more about the very interesting
> > >> one-irq-manage.patch you wrote. Have you seen the "sched: RT throttling
> > >> activated" crash with that patch? If yes, how many times, for example 5
> > >> out of 10 times or something like that?
> > >>
> > >> Or is it so with one-irq-manage.patch the kernel doesn't crash at all? I
> > >> didn't quite understand the situation.
> > >>
> > >> --
> > >> https://patchwork.kernel.org/project/linux-wireless/list/
> > >>
> > >> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > >
> > > Kalle,
> > >
> > >    Sorry for moving the thread :).
> >
> > No problem, I'll just make extra questions to make sure that I'm
> > understanding things correctly :)
> >
> > > So I've attempted 2 patches that seem to produce varying degrees of
> > > success. The single IRQ patch took the crashing behaviour from hard
> > > locking immediately, to that stuttering / RT throttling message
> > > consistently. So instead of hard locking 9/10 times and stuttering
> > > 1/10, it was inverted.
> >
> > Ok, got it now.
> >
> > > The second patch disabling the m2 transition (even without the single
> > > IRQ patch) seems to have resolved the issues altogether, but at the
> > > expense of disabling this m2 state, which I don't have much idea of
> > > the consequences..
> >
> > Sorry, I have missed that. What second patch are you talking about?
> >
> > Also can you share your /proc/interrupts in full?
> >
> > --
> > https://patchwork.kernel.org/project/linux-wireless/list/
> >
> > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> >
> > --
> > ath11k mailing list
> > ath11k@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/ath11k
>
> Here's interrupts in full , and the short patch after:
>
>             CPU0       CPU1       CPU2       CPU3       CPU4
> CPU5       CPU6       CPU7
>    0:          7          0          0          0          0
> 0          0          0   IO-APIC    2-edge      timer
>    1:          0          0          0          0          0
> 0          0       2923   IO-APIC    1-edge      i8042
>    8:          0          0          0          0          0
> 0          0          0   IO-APIC    8-edge      rtc0
>    9:          0       9290          0          0          0
> 0          0          0   IO-APIC    9-fasteoi   acpi
>   12:          0          0          0          0          0
> 0         53          0   IO-APIC   12-edge      i8042
>   14:          0      29816          0          0          0
> 0          0          0   IO-APIC   14-fasteoi   INT34C5:00
>   16:          0          0          0          0          0
> 10376          0          0   IO-APIC   16-fasteoi   intel_ish_ipc,
> i801_smbus, idma64.4
>   27:          0          0          0          0          0
> 0          0          0   IO-APIC   27-fasteoi   idma64.0,
> i2c_designware.0
>   31:          0          0          0          0          0
> 0          0          0   IO-APIC   31-fasteoi   idma64.2,
> i2c_designware.2
>   32:          0          0          0          0          0
> 0          0          0   IO-APIC   32-fasteoi   idma64.3,
> i2c_designware.3
>   40:       9681     777197      27906          0          0
> 0          0          0   IO-APIC   40-fasteoi   idma64.1,
> i2c_designware.1
>  120:          0          0          0          0          0
> 0          0          0   PCI-MSI 114688-edge      PCIe PME, pciehp
>  121:          0          0          0          0          0
> 0          0          0   PCI-MSI 118784-edge      PCIe PME, pciehp
>  122:          0          0          0          0          0
> 0          0          0   PCI-MSI 458752-edge      PCIe PME
>  123:          0          0          0          0          0
> 0          0          0   PCI-MSI 475136-edge      PCIe PME
>  124:          0          0          1          0          0
> 0          0          0   PCI-MSI 229376-edge      vmd
>  125:          0          0          0         27          0
> 0          0          0   PCI-MSI 229377-edge      vmd
>  126:          0          0          0          0       4303
> 0          0          0   PCI-MSI 229378-edge      vmd
>  127:          0          0          0          0          0
> 2992          0        434   PCI-MSI 229379-edge      vmd
>  128:          0          0          0          0          0
> 593       2504          0   PCI-MSI 229380-edge      vmd
>  129:          0          0          0          0        699
> 0       1061       1873   PCI-MSI 229381-edge      vmd
>  130:       2382        394          0        603          0
> 0          0          0   PCI-MSI 229382-edge      vmd
>  131:          0       1670          0        406        646
> 0          0          0   PCI-MSI 229383-edge      vmd
>  132:        692          0       2903          0          0
> 0          0          0   PCI-MSI 229384-edge      vmd
>  133:          0        518        913       2198          0
> 0          0          0   PCI-MSI 229385-edge      vmd
>  134:          0          0          0          0          0
> 0          0          0   PCI-MSI 229386-edge      vmd
>  135:          0          0          0          0          0
> 0          0          0   PCI-MSI 229387-edge      vmd
>  136:          0          0          0          0          0
> 0          0          0   PCI-MSI 229388-edge      vmd
>  137:          0          0          0          0          0
> 0          0          0   PCI-MSI 229389-edge      vmd
>  138:          0          0          0          0          0
> 0          0          0   PCI-MSI 229390-edge      vmd
>  139:          0          0          0          0          0
> 0          0          0   PCI-MSI 229391-edge      vmd
>  140:          0          0          0          0          0
> 0          0          0   PCI-MSI 229392-edge      vmd
>  141:          0          0          0          0          0
> 0          0          0   PCI-MSI 229393-edge      vmd
>  142:          0          0          0          0          0
> 0          0          0   PCI-MSI 229394-edge      vmd
>  143:          0          0          0          0          0
> 0          0          0   VMD-MSI  124  PCIe PME, aerdrv, pcie-dpc
>  144:          0          0          0          0          0
> 0          1          0   PCI-MSI 212992-edge      xhci_hcd
>  145:          0          0          0          0          0
> 0          0         72   PCI-MSI 327680-edge      xhci_hcd
>  146:          6          0          0          0          0
> 0          0          0   PCI-MSI 45088768-edge      rtsx_pci
>  147:          0          0          0          0          0
> 0          0          0   VMD-MSI  125  nvme0q0
>  148:          0          0          0       1859          0
> 0          0      38399   PCI-MSI 32768-edge      i915
>  149:          0          0          0          0          0
> 0          0          0   VMD-MSI  126  nvme0q1
>  150:          0          0          0          0          0
> 0          0          0   VMD-MSI  127  nvme0q2
>  151:          0          0          0          0          0
> 0          0          0   VMD-MSI  128  nvme0q3
>  152:          0          0          0          0          0
> 0          0          0   VMD-MSI  129  nvme0q4
>  153:          0          0          0          0          0
> 0          0          0   VMD-MSI  130  nvme0q5
>  154:          0          0          0          0          0
> 0          0          0   VMD-MSI  131  nvme0q6
>  155:          0          0          0          0          0
> 0          0          0   VMD-MSI  132  nvme0q7
>  156:          0          0          0          0          0
> 0          0          0   VMD-MSI  133  nvme0q8
>  157:          0      29816          0          0          0
> 0          0          0  INT34C5:00  327  DLL0945:00
>  158:          0          0          0          0          0
> 0         48          0   PCI-MSI 360448-edge      mei_me
>  159:          0          0          0          0          0
> 0          0       1134   PCI-MSI 514048-edge      AudioDSP
>  162:          0          0          0     108102          0
> 0          0          0   PCI-MSI 44564480-edge      ce0, ce1, ce2,
> ce3, ce5, ce7, ce8, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ,
> DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ,
> DP_EXT_IRQ, bhi, mhi, mhi
>  NMI:          0          0          0          0          0
> 0          0          0   Non-maskable interrupts
>  LOC:      64516      80387      54151      82574      64663
> 113373      58033      81555   Local timer interrupts
>  SPU:          0          0          0          0          0
> 0          0          0   Spurious interrupts
>  PMI:          0          0          0          0          0
> 0          0          0   Performance monitoring interrupts
>  IWI:          5          2          1        760          1
> 1          0      16078   IRQ work interrupts
>  RTR:          6          0          0          0          0
> 0          0          0   APIC ICR read retries
>  RES:       1834       7304       1432       1807       3015
> 1552       1417       1498   Rescheduling interrupts
>  CAL:      21739      26798      28934      22211      22590
> 28622      22541      20023   Function call interrupts
>  TLB:      51267      49182      59392      48384      46755
> 56491      48103      46560   TLB shootdowns
>  TRM:          2          2          2          2          2
> 2          2          2   Thermal event interrupts
>  THR:          0          0          0          0          0
> 0          0          0   Threshold APIC interrupts
>  DFR:          0          0          0          0          0
> 0          0          0   Deferred Error APIC interrupts
>  MCE:          0          0          0          0          0
> 0          0          0   Machine check exceptions
>  MCP:          3          4          4          4          4
> 4          4          4   Machine check polls
>  ERR:         16
>  MIS:          0
>  PIN:          0          0          0          0          0
> 0          0          0   Posted-interrupt notification event
>  NPI:          0          0          0          0          0
> 0          0          0   Nested posted-interrupt event
>  PIW:          0          0          0          0          0
> 0          0          0   Posted-interrupt wakeup event
>
> and the modification that disables m2 state:
>
> diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> index 3de7b1639ec6..20f670c8b129 100644
> --- a/drivers/bus/mhi/core/pm.c
> +++ b/drivers/bus/mhi/core/pm.c
> @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
>      },
>      {
>          MHI_PM_M0,
> -        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> +        MHI_PM_M0 | MHI_PM_M3_ENTER |
>          MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
>          MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
>      },
>      {
> -        MHI_PM_M2,
> +        MHI_PM_M0,
>          MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
>          MHI_PM_LD_ERR_FATAL_DETECT
>      },

Adding one more data point.  Not only does the driver no longer crash
on initialization this way, but with the M2 state transition disabled
the system also survives suspend and wake, and the adapter
reassociates consistently.  As expected with my patch, the MHI driver
shows everything staying in the M1 state instead of ever attempting
to transition to M2.  It also doesn't return to M0 if I disconnect
the power / replug it.  I'm not sure what else is affected by me
hacking this state machine, but avoiding that M2 transition has
removed any obvious issues from my system.
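
For anyone following along, here is a minimal userspace sketch of why
dropping the M2 bit from the M0 entry blocks the transition.  It
assumes the table is consulted as "find the entry for the current
state, then test the requested state against its mask".  All sketch_*
names and bit values are hypothetical and are not the kernel's actual
definitions; it only models the host-side table check, not the
device's own M1 state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit-flag states, loosely modelled on the MHI_PM_* names
 * in the diff above. The values are illustrative only. */
enum sketch_pm_state {
	SKETCH_PM_M0       = 1u << 0,
	SKETCH_PM_M2       = 1u << 1,
	SKETCH_PM_M3_ENTER = 1u << 2,
	SKETCH_PM_SYS_ERR  = 1u << 3,
};

struct sketch_pm_transition {
	unsigned int from_state; /* state this entry describes */
	unsigned int to_states;  /* bitmask of states reachable from it */
};

/* Table before the change: M0 may move to M2. */
static const struct sketch_pm_transition sketch_stock_table[] = {
	{ SKETCH_PM_M0, SKETCH_PM_M0 | SKETCH_PM_M2 |
			SKETCH_PM_M3_ENTER | SKETCH_PM_SYS_ERR },
	{ SKETCH_PM_M2, SKETCH_PM_M0 | SKETCH_PM_SYS_ERR },
};

/* Table after the change: the M2 bit is gone from the M0 entry. */
static const struct sketch_pm_transition sketch_patched_table[] = {
	{ SKETCH_PM_M0, SKETCH_PM_M0 | SKETCH_PM_M3_ENTER |
			SKETCH_PM_SYS_ERR },
};

/* Return the requested state if the table allows it, otherwise the
 * current state unchanged. */
static unsigned int
sketch_tryset_state(const struct sketch_pm_transition *table, size_t n,
		    unsigned int cur, unsigned int next)
{
	for (size_t i = 0; i < n; i++) {
		if (table[i].from_state == cur)
			return (table[i].to_states & next) ? next : cur;
	}
	return cur; /* no entry for the current state: refuse */
}

/* Exercise both tables with the same requests. */
void sketch_self_check(void)
{
	/* Stock table: the M0 -> M2 request is granted. */
	assert(sketch_tryset_state(sketch_stock_table, 2,
				   SKETCH_PM_M0, SKETCH_PM_M2) == SKETCH_PM_M2);
	/* Patched table: the same request is refused, state stays M0. */
	assert(sketch_tryset_state(sketch_patched_table, 1,
				   SKETCH_PM_M0, SKETCH_PM_M2) == SKETCH_PM_M0);
	/* Suspend entry (M3_ENTER) is still allowed after the change. */
	assert(sketch_tryset_state(sketch_patched_table, 1,
				   SKETCH_PM_M0, SKETCH_PM_M3_ENTER) ==
	       SKETCH_PM_M3_ENTER);
}
```

In this model the only behavioural difference between the two tables
is that a request to enter M2 from M0 is silently refused, which would
be consistent with suspend (M3) still working while M2 is never
entered.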

-- 
ath11k mailing list
ath11k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath11k

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-09 21:46             ` wi nk
@ 2020-12-11 12:28               ` wi nk
  2020-12-12  5:37                 ` Kalle Valo
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-11 12:28 UTC (permalink / raw)
  To: Kalle Valo, Stephen Liang, Mitchell Nordine, Carl Huang; +Cc: ath11k

On Wed, Dec 9, 2020 at 10:46 PM wi nk <wink@technolu.st> wrote:
>
> On Wed, Dec 9, 2020 at 4:55 PM wi nk <wink@technolu.st> wrote:
> >
> > On Wed, Dec 9, 2020 at 4:50 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> > >
> > > wi nk <wink@technolu.st> writes:
> > >
> > > > On Wed, Dec 9, 2020 at 4:35 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> > > >>
> > > >> wi nk <wink@technolu.st> writes:
> > > >>
> > > >> > So I've managed to stabilise my system now, so either the race is
> > > >> > gone, or I've done something to win it all the time.  So one of the
> > > >> > avenues of racing I was chasing at first was in the ath11k driver
> > > >> > itself.  There are a couple areas where the single/shared IRQ is being
> > > >> > forcibly toggled in ways that the documentation says are not great
> > > >> > (and the original patch was trying to avoid).  Fixing those didn't
> > > >> > seem to have much impact on the stability of things (I've included
> > > >> > those changes in my patch though).  After the last email I was
> > > >> > thinking about the MHI side of things a bit more and found a number of
> > > >> > call sites that my naive grepping had missed that do the same thing,
> > > >> > but via acquiring a lock at the same time.  I modified all the calls
> > > >> > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > > >> > variants that accept the flags parameter to capture state.  I've now
> > > >> > booted and loaded the driver 10+ times without a single freeze or
> > > >> > crash.  I'm not sure all of those modifications are necessary (ie:
> > > >> > which things are re-entrant in this single interrupt operating mode vs
> > > >> > which ones can use the simpler lock/unlock mechanisms), so I could use
> > > >> > some advice/guidance there.
> > > >> >
> > > >> > Mitchell - if you want to grab this patch and try it, let me know how
> > > >> > it goes and I can clean it up for the mailing list:
> > > >> > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > > >> > (apply to ath11k-qca6390-bringup-202011301608)
> > > >>
> > > >> Wink, I want to ask more about your the very interesting
> > > >> one-irq-manage.patch you wrote. Have you seen the "sched: RT throttling
> > > >> activated" crash with that patch? If yes, how many times, for example 5
> > > >> out of 10 times or something like that?
> > > >>
> > > >> Or is it so with one-irq-manage.patch the kernel doesn't crash at all? I
> > > >> didn't quite understand the situation.
> > > >>
> > > >> --
> > > >> https://patchwork.kernel.org/project/linux-wireless/list/
> > > >>
> > > >> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > >
> > > > Kalle,
> > > >
> > > >    Sorry for moving the thread :).
> > >
> > > No problem, I'll just make extra questions to make sure that I'm
> > > understanding things correctly :)
> > >
> > > > So I've attempted 2 patches that seem to produce varying degrees of
> > > > success. The single IRQ patch took the crashing behaviour from hard
> > > > locking immediately, to that stuttering / RT throttling message
> > > > consistently. So instead of hard locking 9/10 times and stuttering
> > > > 1/10, it was inverted.
> > >
> > > Ok, got it now.
> > >
> > > > The second patch disabling the m2 transition (even without the single
> > > > IRQ patch) seems to have resolved the issues altogether, but at the
> > > > expense of disabling this m2 state, which I don't have much idea of
> > > > the consequences..
> > >
> > > Sorry, I have missed that. What second patch are you talking about?
> > >
> > > Also can you share your /proc/interrupts in full?
> > >
> > > --
> > > https://patchwork.kernel.org/project/linux-wireless/list/
> > >
> > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > >
> > > --
> > > ath11k mailing list
> > > ath11k@lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/ath11k
> >
> > Here's interrupts in full , and the short patch after:
> >
> >             CPU0       CPU1       CPU2       CPU3       CPU4
> > CPU5       CPU6       CPU7
> >    0:          7          0          0          0          0
> > 0          0          0   IO-APIC    2-edge      timer
> >    1:          0          0          0          0          0
> > 0          0       2923   IO-APIC    1-edge      i8042
> >    8:          0          0          0          0          0
> > 0          0          0   IO-APIC    8-edge      rtc0
> >    9:          0       9290          0          0          0
> > 0          0          0   IO-APIC    9-fasteoi   acpi
> >   12:          0          0          0          0          0
> > 0         53          0   IO-APIC   12-edge      i8042
> >   14:          0      29816          0          0          0
> > 0          0          0   IO-APIC   14-fasteoi   INT34C5:00
> >   16:          0          0          0          0          0
> > 10376          0          0   IO-APIC   16-fasteoi   intel_ish_ipc,
> > i801_smbus, idma64.4
> >   27:          0          0          0          0          0
> > 0          0          0   IO-APIC   27-fasteoi   idma64.0,
> > i2c_designware.0
> >   31:          0          0          0          0          0
> > 0          0          0   IO-APIC   31-fasteoi   idma64.2,
> > i2c_designware.2
> >   32:          0          0          0          0          0
> > 0          0          0   IO-APIC   32-fasteoi   idma64.3,
> > i2c_designware.3
> >   40:       9681     777197      27906          0          0
> > 0          0          0   IO-APIC   40-fasteoi   idma64.1,
> > i2c_designware.1
> >  120:          0          0          0          0          0
> > 0          0          0   PCI-MSI 114688-edge      PCIe PME, pciehp
> >  121:          0          0          0          0          0
> > 0          0          0   PCI-MSI 118784-edge      PCIe PME, pciehp
> >  122:          0          0          0          0          0
> > 0          0          0   PCI-MSI 458752-edge      PCIe PME
> >  123:          0          0          0          0          0
> > 0          0          0   PCI-MSI 475136-edge      PCIe PME
> >  124:          0          0          1          0          0
> > 0          0          0   PCI-MSI 229376-edge      vmd
> >  125:          0          0          0         27          0
> > 0          0          0   PCI-MSI 229377-edge      vmd
> >  126:          0          0          0          0       4303
> > 0          0          0   PCI-MSI 229378-edge      vmd
> >  127:          0          0          0          0          0
> > 2992          0        434   PCI-MSI 229379-edge      vmd
> >  128:          0          0          0          0          0
> > 593       2504          0   PCI-MSI 229380-edge      vmd
> >  129:          0          0          0          0        699
> > 0       1061       1873   PCI-MSI 229381-edge      vmd
> >  130:       2382        394          0        603          0
> > 0          0          0   PCI-MSI 229382-edge      vmd
> >  131:          0       1670          0        406        646
> > 0          0          0   PCI-MSI 229383-edge      vmd
> >  132:        692          0       2903          0          0
> > 0          0          0   PCI-MSI 229384-edge      vmd
> >  133:          0        518        913       2198          0
> > 0          0          0   PCI-MSI 229385-edge      vmd
> >  134:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229386-edge      vmd
> >  135:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229387-edge      vmd
> >  136:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229388-edge      vmd
> >  137:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229389-edge      vmd
> >  138:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229390-edge      vmd
> >  139:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229391-edge      vmd
> >  140:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229392-edge      vmd
> >  141:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229393-edge      vmd
> >  142:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229394-edge      vmd
> >  143:          0          0          0          0          0
> > 0          0          0   VMD-MSI  124  PCIe PME, aerdrv, pcie-dpc
> >  144:          0          0          0          0          0
> > 0          1          0   PCI-MSI 212992-edge      xhci_hcd
> >  145:          0          0          0          0          0
> > 0          0         72   PCI-MSI 327680-edge      xhci_hcd
> >  146:          6          0          0          0          0
> > 0          0          0   PCI-MSI 45088768-edge      rtsx_pci
> >  147:          0          0          0          0          0
> > 0          0          0   VMD-MSI  125  nvme0q0
> >  148:          0          0          0       1859          0
> > 0          0      38399   PCI-MSI 32768-edge      i915
> >  149:          0          0          0          0          0
> > 0          0          0   VMD-MSI  126  nvme0q1
> >  150:          0          0          0          0          0
> > 0          0          0   VMD-MSI  127  nvme0q2
> >  151:          0          0          0          0          0
> > 0          0          0   VMD-MSI  128  nvme0q3
> >  152:          0          0          0          0          0
> > 0          0          0   VMD-MSI  129  nvme0q4
> >  153:          0          0          0          0          0
> > 0          0          0   VMD-MSI  130  nvme0q5
> >  154:          0          0          0          0          0
> > 0          0          0   VMD-MSI  131  nvme0q6
> >  155:          0          0          0          0          0
> > 0          0          0   VMD-MSI  132  nvme0q7
> >  156:          0          0          0          0          0
> > 0          0          0   VMD-MSI  133  nvme0q8
> >  157:          0      29816          0          0          0
> > 0          0          0  INT34C5:00  327  DLL0945:00
> >  158:          0          0          0          0          0
> > 0         48          0   PCI-MSI 360448-edge      mei_me
> >  159:          0          0          0          0          0
> > 0          0       1134   PCI-MSI 514048-edge      AudioDSP
> >  162:          0          0          0     108102          0
> > 0          0          0   PCI-MSI 44564480-edge      ce0, ce1, ce2,
> > ce3, ce5, ce7, ce8, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ,
> > DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ,
> > DP_EXT_IRQ, bhi, mhi, mhi
> >  NMI:          0          0          0          0          0
> > 0          0          0   Non-maskable interrupts
> >  LOC:      64516      80387      54151      82574      64663
> > 113373      58033      81555   Local timer interrupts
> >  SPU:          0          0          0          0          0
> > 0          0          0   Spurious interrupts
> >  PMI:          0          0          0          0          0
> > 0          0          0   Performance monitoring interrupts
> >  IWI:          5          2          1        760          1
> > 1          0      16078   IRQ work interrupts
> >  RTR:          6          0          0          0          0
> > 0          0          0   APIC ICR read retries
> >  RES:       1834       7304       1432       1807       3015
> > 1552       1417       1498   Rescheduling interrupts
> >  CAL:      21739      26798      28934      22211      22590
> > 28622      22541      20023   Function call interrupts
> >  TLB:      51267      49182      59392      48384      46755
> > 56491      48103      46560   TLB shootdowns
> >  TRM:          2          2          2          2          2
> > 2          2          2   Thermal event interrupts
> >  THR:          0          0          0          0          0
> > 0          0          0   Threshold APIC interrupts
> >  DFR:          0          0          0          0          0
> > 0          0          0   Deferred Error APIC interrupts
> >  MCE:          0          0          0          0          0
> > 0          0          0   Machine check exceptions
> >  MCP:          3          4          4          4          4
> > 4          4          4   Machine check polls
> >  ERR:         16
> >  MIS:          0
> >  PIN:          0          0          0          0          0
> > 0          0          0   Posted-interrupt notification event
> >  NPI:          0          0          0          0          0
> > 0          0          0   Nested posted-interrupt event
> >  PIW:          0          0          0          0          0
> > 0          0          0   Posted-interrupt wakeup event
> >
> > and the modification that disables m2 state:
> >
> > diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> > index 3de7b1639ec6..20f670c8b129 100644
> > --- a/drivers/bus/mhi/core/pm.c
> > +++ b/drivers/bus/mhi/core/pm.c
> > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
> >      },
> >      {
> >          MHI_PM_M0,
> > -        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > +        MHI_PM_M0 | MHI_PM_M3_ENTER |
> >          MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >          MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> >      },
> >      {
> > -        MHI_PM_M2,
> > +        MHI_PM_M0,
> >          MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >          MHI_PM_LD_ERR_FATAL_DETECT
> >      },
>
> Adding one more data point.  The driver will not crash on
> initialization this way, but also with the M2 state transition
> disabled the system survives suspend and wake and the adapter
> successfully reassociates consistently.  As expected with my patch,
> the MHI driver shows everything stays in the M1 state instead of
> attempting to transition to M2 ever.  It also doesn't return back to
> M0 if I disconnect the power / replug it.  I'm not sure what things
> are affected by me hacking this state machine, but avoiding that M2
> transition has removed any obvious issues from my system.

While waiting for someone else to confirm, I can report that I've
still not seen any instability since this patch.  The laptop has been
stable through reboots, power cycling, suspension, etc.  I'd be happy
to continue to try to understand why this is this case.  It sounds
like Stephen isn't seeing these issues on 5.10 rc6 with the single msi
patch+reverting that one commit.  I can try to give that a shot if
it'd produce something useful.

Kalle - a couple quick questions, in the driver comments the M2 state
is loosely documented as a low power mode.  Why would it transition to
that while on charger/plugging in, but stay in M0 while on battery
(you can see this behavior in the videos I linked previously)?
Naively I would've expected the opposite behavior.  Also, is there any
way to prevent that transition other than my brute-force approach?  On
battery the 'nominal' state seems to be M0, and I'm not sure what the
effect of leaving it in this M1 state really is, even though nothing
observable goes wrong.  Lastly, any thoughts as to why that transition
seems to cause the EE state to become invalid?

Thanks!

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-11 12:28               ` wi nk
@ 2020-12-12  5:37                 ` Kalle Valo
  2020-12-12 11:46                   ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: Kalle Valo @ 2020-12-12  5:37 UTC (permalink / raw)
  To: wi nk; +Cc: Stephen Liang, Carl Huang, ath11k, Mitchell Nordine

wi nk <wink@technolu.st> writes:

>> > and the modification that disables m2 state:
>> >
>> > diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
>> > index 3de7b1639ec6..20f670c8b129 100644
>> > --- a/drivers/bus/mhi/core/pm.c
>> > +++ b/drivers/bus/mhi/core/pm.c
>> > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
>> >      },
>> >      {
>> >          MHI_PM_M0,
>> > -        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
>> > +        MHI_PM_M0 | MHI_PM_M3_ENTER |
>> >          MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
>> >          MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
>> >      },
>> >      {
>> > -        MHI_PM_M2,
>> > +        MHI_PM_M0,
>> >          MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
>> >          MHI_PM_LD_ERR_FATAL_DETECT
>> >      },
>>
>> Adding one more data point.  The driver will not crash on
>> initialization this way, but also with the M2 state transition
>> disabled the system survives suspend and wake and the adapter
>> successfully reassociates consistently.  As expected with my patch,
>> the MHI driver shows everything stays in the M1 state instead of
>> attempting to transition to M2 ever.  It also doesn't return back to
>> M0 if I disconnect the power / replug it.  I'm not sure what things
>> are affected by me hacking this state machine, but avoiding that M2
>> transition has removed any obvious issues from my system.
>
> While waiting for someone else to confirm, I can report that I've
> still not seen any instability since this patch.  The laptop has been
> stable through reboots, power cycling, suspension, etc.

Very interesting! Are you saying that with this patch the wireless
connection using QCA6390 works fine on your Dell XPS 9310, and that
you can connect to an AP and transfer data normally?

I would like to submit your patch to patchwork.kernel.org as an RFC
patch so that it's easier for everyone to download. But before I can do
that I need your Signed-off-by. Could you read the Developer's
Certificate of Origin:

https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin

If you agree with the DCO, please send your s-o-b by replying to this
email. You can also submit the RFC patch yourself; the instructions
are here:

https://wireless.wiki.kernel.org/en/users/drivers/ath11k/submittingpatches

> I'd be happy to continue to try to understand why this is this case.
> It sounds like Stephen isn't seeing these issues on 5.10 rc6 with the
> single msi patch+reverting that one commit. I can try to give that a
> shot if it'd produce something useful.

Yes, data points on what affects this bug are very helpful for
tracking it down.

> Kalle - a couple quick questions, in the driver comments the M2 state
> is loosely documented as a low power mode.  Why would it transition to
> that while on charger/plugging in, but stay in M0 while on battery
> (you can see this behavior in the videos I linked previously)?
> Naively I would've expected the opposite behavior.

I would have expected the same. It does sound strange, or we are
misunderstanding something. I'll try to find out why it is so. But if
you learn more, please do let me know.

> Also, is there any way to prevent that transition other than my brute
> force? It seems on battery the 'nominal' state for it is M0, I'm not
> sure what the effect of it being left in this M1 state really is even
> though there's nothing observable. Lastly, any thoughts as to why it
> seems that transition causes the EE state to become invalid?

TBH I'm not very familiar with MHI, you seem to already know it much
better than I do :) I'll add more folks to the thread later,
hopefully they can help.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-12  5:37                 ` Kalle Valo
@ 2020-12-12 11:46                   ` wi nk
  2020-12-12 23:29                     ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-12 11:46 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Stephen Liang, Carl Huang, ath11k, Mitchell Nordine

On Sat, Dec 12, 2020 at 6:37 AM Kalle Valo <kvalo@codeaurora.org> wrote:
>
> wi nk <wink@technolu.st> writes:
>
> >> > and the modification that disables m2 state:
> >> >
> >> > diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> >> > index 3de7b1639ec6..20f670c8b129 100644
> >> > --- a/drivers/bus/mhi/core/pm.c
> >> > +++ b/drivers/bus/mhi/core/pm.c
> >> > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
> >> >      },
> >> >      {
> >> >          MHI_PM_M0,
> >> > -        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> >> > +        MHI_PM_M0 | MHI_PM_M3_ENTER |
> >> >          MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >> >          MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> >> >      },
> >> >      {
> >> > -        MHI_PM_M2,
> >> > +        MHI_PM_M0,
> >> >          MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >> >          MHI_PM_LD_ERR_FATAL_DETECT
> >> >      },
> >>
> >> Adding one more data point.  The driver will not crash on
> >> initialization this way, but also with the M2 state transition
> >> disabled the system survives suspend and wake and the adapter
> >> successfully reassociates consistently.  As expected with my patch,
> >> the MHI driver shows everything stays in the M1 state instead of
> >> attempting to transition to M2 ever.  It also doesn't return back to
> >> M0 if I disconnect the power / replug it.  I'm not sure what things
> >> are affected by me hacking this state machine, but avoiding that M2
> >> transition has removed any obvious issues from my system.
> >
> > While waiting for someone else to confirm, I can report that I've
> > still not seen any instability since this patch.  The laptop has been
> > stable through reboots, power cycling, suspension, etc.
>
> Very interesting! Are you saying that with this patch the wireless
> connection using QCA6390 works fine on your Dell XPS 9310, you can
> connect to an AP and transfer data normally?
>

Precisely.  The machine now has over 24h of uptime, I can reboot/sleep
without any issues, and throughput seems to saturate my wifi link
(500-600 Mbps).


> I would like to submit your patch to patchwork.kernel.org as RFC patch
> so that it's easier for everyone to download. But before I can do that I
> need your Signed-off-by, can you read Developer's Certificate of Origin:
>
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
>
> And if you agree with the DCO please send your s-o-b by replying to this
> email. But you can also submit the RFC patch yourself, instructions
> here:
>
> https://wireless.wiki.kernel.org/en/users/drivers/ath11k/submittingpatches
>

Signed-off-by: Lee Smith <wink@technolu.st>

I'll get an email out later this afternoon; if you get there first,
please feel free :).

> > I'd be happy to continue to try to understand why this is this case.
> > It sounds like Stephen isn't seeing these issues on 5.10 rc6 with the
> > single msi patch+reverting that one commit. I can try to give that a
> > shot if it'd produce something useful.
>
> Yes, being able to give datapoints what affects this bug is very helpful
> to track down it.
>

Ok, I'll try to rebuild to that configuration later today and report back.

> > Kalle - a couple quick questions, in the driver comments the M2 state
> > is loosely documented as a low power mode.  Why would it transition to
> > that while on charger/plugging in, but stay in M0 while on battery
> > (you can see this behavior in the videos I linked previously)?
> > Naively I would've expected the opposite behavior.
>
> I would have expected the same as well, it does sound strange or we are
> misunderstanding something. I'll try to find out why it's so. But if you
> learn more, please do let me know.
>

Will do.

> > Also, is there any way to prevent that transition other than my brute
> > force? It seems on battery the 'nominal' state for it is M0, I'm not
> > sure what the effect of it being left in this M1 state really is even
> > though there's nothing observable. Lastly, any thoughts as to why it
> > seems that transition causes the EE state to become invalid?
>
> TBH I'm not very familiar with MHI, you seem to already know it much
> more better than I do :) I'll include more folks to the thread later,
> hopefully they can help.
>

Thanks!


> --
> https://patchwork.kernel.org/project/linux-wireless/list/
>
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-12 11:46                   ` wi nk
@ 2020-12-12 23:29                     ` wi nk
  2020-12-13  0:03                       ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-12 23:29 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Stephen Liang, Carl Huang, ath11k, Mitchell Nordine

On Sat, Dec 12, 2020 at 12:46 PM wi nk <wink@technolu.st> wrote:
>
> On Sat, Dec 12, 2020 at 6:37 AM Kalle Valo <kvalo@codeaurora.org> wrote:
> >
> > wi nk <wink@technolu.st> writes:
> >
> > >> > and the modification that disables m2 state:
> > >> >
> > >> > diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> > >> > index 3de7b1639ec6..20f670c8b129 100644
> > >> > --- a/drivers/bus/mhi/core/pm.c
> > >> > +++ b/drivers/bus/mhi/core/pm.c
> > >> > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
> > >> >      },
> > >> >      {
> > >> >          MHI_PM_M0,
> > >> > -        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > >> > +        MHI_PM_M0 | MHI_PM_M3_ENTER |
> > >> >          MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > >> >          MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> > >> >      },
> > >> >      {
> > >> > -        MHI_PM_M2,
> > >> > +        MHI_PM_M0,
> > >> >          MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > >> >          MHI_PM_LD_ERR_FATAL_DETECT
> > >> >      },
> > >>
> > >> Adding one more data point.  The driver will not crash on
> > >> initialization this way, but also with the M2 state transition
> > >> disabled the system survives suspend and wake and the adapter
> > >> successfully reassociates consistently.  As expected with my patch,
> > >> the MHI driver shows everything stays in the M1 state instead of
> > >> attempting to transition to M2 ever.  It also doesn't return back to
> > >> M0 if I disconnect the power / replug it.  I'm not sure what things
> > >> are affected by me hacking this state machine, but avoiding that M2
> > >> transition has removed any obvious issues from my system.
> > >
> > > While waiting for someone else to confirm, I can report that I've
> > > still not seen any instability since this patch.  The laptop has been
> > > stable through reboots, power cycling, suspension, etc.
> >
> > Very interesting! Are you saying that with this patch the wireless
> > connection using QCA6390 works fine on your Dell XPS 9310, you can
> > connect to an AP and transfer data normally?
> >
>
> Precisely.  The machine is now over 24h of uptime, I can reboot/sleep
> without any issues, and throughput seems to saturate my wifi link
> (500-600 Mbps).
>
>
> > I would like to submit your patch to patchwork.kernel.org as RFC patch
> > so that it's easier for everyone to download. But before I can do that I
> > need your Signed-off-by, can you read Developer's Certificate of Origin:
> >
> > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
> >
> > And if you agree with the DCO please send your s-o-b by replying to this
> > email. But you can also submit the RFC patch yourself, instructions
> > here:
> >
> > https://wireless.wiki.kernel.org/en/users/drivers/ath11k/submittingpatches
> >
>
> Signed-off-by: Lee Smith <wink@technolu.st>
>
> I'll get an email out later this afternoon, if you get there first,
> please feel free :).
>
> > > I'd be happy to continue to try to understand why this is this case.
> > > It sounds like Stephen isn't seeing these issues on 5.10 rc6 with the
> > > single msi patch+reverting that one commit. I can try to give that a
> > > shot if it'd produce something useful.
> >
> > Yes, being able to give datapoints what affects this bug is very helpful
> > to track down it.
> >
>
> Ok, I'll try to rebuild to that configuration later today and report back.
>
> > > Kalle - a couple quick questions, in the driver comments the M2 state
> > > is loosely documented as a low power mode.  Why would it transition to
> > > that while on charger/plugging in, but stay in M0 while on battery
> > > (you can see this behavior in the videos I linked previously)?
> > > Naively I would've expected the opposite behavior.
> >
> > I would have expected the same as well, it does sound strange or we are
> > misunderstanding something. I'll try to find out why it's so. But if you
> > learn more, please do let me know.
> >
>
> Will do.
>
> > > Also, is there any way to prevent that transition other than my brute
> > > force? It seems on battery the 'nominal' state for it is M0, I'm not
> > > sure what the effect of it being left in this M1 state really is even
> > > though there's nothing observable. Lastly, any thoughts as to why it
> > > seems that transition causes the EE state to become invalid?
> >
> > TBH I'm not very familiar with MHI, you seem to already know it much
> > more better than I do :) I'll include more folks to the thread later,
> > hopefully they can help.
> >
>
> Thanks!
>
>
> > --
> > https://patchwork.kernel.org/project/linux-wireless/list/
> >
> > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

OK, I tried booting 5.10-rc6 with
59c6d022df8efb450f82d33dd6a6812935bd022f (single MSI) applied and
7fef431be9c9 reverted.  With this kernel the wifi adapter never comes
up, but the system does not freeze.  I consistently get:

[   23.959920] mhi 0000:55:00.0: Requested to power ON
[   23.960058] mhi 0000:55:00.0: Power on setup success
[   24.362295] ath11k_pci 0000:55:00.0: Respond mem req failed, result: 1, err: 0
[   24.362303] ath11k_pci 0000:55:00.0: qmi failed to respond fw mem req:-22
[   24.374433] ath11k_pci 0000:55:00.0: chip_id 0x0 chip_family 0xb board_id 0xff soc_id 0xffffffff
[   24.374438] ath11k_pci 0000:55:00.0: fw_version 0x101c06cc fw_build_timestamp 2020-06-24 19:50 fw_build_id
[   25.450139] ath11k_pci 0000:55:00.0: failed to receive control response completion, polling..
[   26.474154] ath11k_pci 0000:55:00.0: Service connect timeout
[   26.474163] ath11k_pci 0000:55:00.0: failed to connect to HTT: -110
[   26.477247] ath11k_pci 0000:55:00.0: failed to start core: -110

With the latest bringup and my patch to disable M2, I'm still booting
and operating reliably.

-- 
ath11k mailing list
ath11k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath11k

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-12 23:29                     ` wi nk
@ 2020-12-13  0:03                       ` wi nk
  2020-12-13  0:59                         ` Mitchell Nordine
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-13  0:03 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Stephen Liang, Carl Huang, ath11k, Mitchell Nordine

On Sun, Dec 13, 2020 at 12:29 AM wi nk <wink@technolu.st> wrote:
>
> On Sat, Dec 12, 2020 at 12:46 PM wi nk <wink@technolu.st> wrote:
> >
> > On Sat, Dec 12, 2020 at 6:37 AM Kalle Valo <kvalo@codeaurora.org> wrote:
> > >
> > > wi nk <wink@technolu.st> writes:
> > >
> > > >> > and the modification that disables m2 state:
> > > >> >
> > > >> > diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> > > >> > index 3de7b1639ec6..20f670c8b129 100644
> > > >> > --- a/drivers/bus/mhi/core/pm.c
> > > >> > +++ b/drivers/bus/mhi/core/pm.c
> > > >> > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const
> > > >> > dev_state_transitions[] = {
> > > >> >      },
> > > >> >      {
> > > >> >          MHI_PM_M0,
> > > >> > -        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > > >> > +        MHI_PM_M0 | MHI_PM_M3_ENTER |
> > > >> >          MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > >> >          MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> > > >> >      },
> > > >> >      {
> > > >> > -        MHI_PM_M2,
> > > >> > +        MHI_PM_M0,
> > > >> >          MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > >> >          MHI_PM_LD_ERR_FATAL_DETECT
> > > >> >      },
> > > >>
> > > >> Adding one more data point.  The driver will not crash on
> > > >> initialization this way, but also with the M2 state transition
> > > >> disabled the system survives suspend and wake and the adapter
> > > >> successfully reassociates consistently.  As expected with my patch,
> > > >> the MHI driver shows everything stays in the M1 state instead of
> > > >> attempting to transition to M2 ever.  It also doesn't return back to
> > > >> M0 if I disconnect the power / replug it.  I'm not sure what things
> > > >> are affected by me hacking this state machine, but avoiding that M2
> > > >> transition has removed any obvious issues from my system.
> > > >
> > > > While waiting for someone else to confirm, I can report that I've
> > > > still not seen any instability since this patch.  The laptop has been
> > > > stable through reboots, power cycling, suspension, etc.
> > >
> > > Very interesting! Are you saying that with this patch the wireless
> > > connection using QCA6390 works fine on your Dell XPS 9310, you can
> > > connect to an AP and transfer data normally?
> > >
> >
> > Precisely.  The machine is now over 24h of uptime, I can reboot/sleep
> > without any issues, and throughput seems to saturate my wifi link
> > (500-600 Mbps).
> >
> >
> > > I would like to submit your patch to patchwork.kernel.org as RFC patch
> > > so that it's easier for everyone to download. But before I can do that I
> > > need your Signed-off-by, can you read Developer's Certificate of Origin:
> > >
> > > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
> > >
> > > And if you agree with the DCO please send your s-o-b by replying to this
> > > email. But you can also submit the RFC patch yourself, instructions
> > > here:
> > >
> > > https://wireless.wiki.kernel.org/en/users/drivers/ath11k/submittingpatches
> > >
> >
> > Signed-off-by: Lee Smith <wink@technolu.st>
> >
> > I'll get an email out later this afternoon, if you get there first,
> > please feel free :).
> >
> > > > I'd be happy to continue to try to understand why this is this case.
> > > > It sounds like Stephen isn't seeing these issues on 5.10 rc6 with the
> > > > single msi patch+reverting that one commit. I can try to give that a
> > > > shot if it'd produce something useful.
> > >
> > > Yes, being able to give datapoints what affects this bug is very helpful
> > > to track down it.
> > >
> >
> > Ok, I'll try to rebuild to that configuration later today and report back.
> >
> > > > Kalle - a couple quick questions, in the driver comments the M2 state
> > > > is loosely documented as a low power mode.  Why would it transition to
> > > > that while on charger/plugging in, but stay in M0 while on battery
> > > > (you can see this behavior in the videos I linked previously)?
> > > > Naively I would've expected the opposite behavior.
> > >
> > > I would have expected the same as well, it does sound strange or we are
> > > misunderstanding something. I'll try to find out why it's so. But if you
> > > learn more, please do let me know.
> > >
> >
> > Will do.
> >
> > > > Also, is there any way to prevent that transition other than my brute
> > > > force? It seems on battery the 'nominal' state for it is M0, I'm not
> > > > sure what the effect of it being left in this M1 state really is even
> > > > though there's nothing observable. Lastly, any thoughts as to why it
> > > > seems that transition causes the EE state to become invalid?
> > >
> > > TBH I'm not very familiar with MHI, you seem to already know it much
> > > more better than I do :) I'll include more folks to the thread later,
> > > hopefully they can help.
> > >
> >
> > Thanks!
> >
> >
> > > --
> > > https://patchwork.kernel.org/project/linux-wireless/list/
> > >
> > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
>
> Ok I tried to boot 5.10-rc6 with
> 59c6d022df8efb450f82d33dd6a6812935bd022f (single msi) and reverted
> 7fef431be9c9.  With this kernel, I can't get the wifi adapter to come
> up, but no freezing.  I receive this consistently:
>
> [   23.959920] mhi 0000:55:00.0: Requested to power ON
> [   23.960058] mhi 0000:55:00.0: Power on setup success
> [   24.362295] ath11k_pci 0000:55:00.0: Respond mem req failed, result: 1, err: 0
> [   24.362303] ath11k_pci 0000:55:00.0: qmi failed to respond fw mem req:-22
> [   24.374433] ath11k_pci 0000:55:00.0: chip_id 0x0 chip_family 0xb board_id 0xff soc_id 0xffffffff
> [   24.374438] ath11k_pci 0000:55:00.0: fw_version 0x101c06cc fw_build_timestamp 2020-06-24 19:50 fw_build_id
> [   25.450139] ath11k_pci 0000:55:00.0: failed to receive control response completion, polling..
> [   26.474154] ath11k_pci 0000:55:00.0: Service connect timeout
> [   26.474163] ath11k_pci 0000:55:00.0: failed to connect to HTT: -110
> [   26.477247] ath11k_pci 0000:55:00.0: failed to start core: -110
>
> With the latest bringup and my patch to disable M2, I'm still booting
> and operating reliably.

I took my bringup branch and merged 5.10-rc6 into it.  It merges fine,
and seems to be stable as well.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-13  0:03                       ` wi nk
@ 2020-12-13  0:59                         ` Mitchell Nordine
  2020-12-13 22:09                           ` Stephen Liang
  2020-12-16  8:50                           ` Kalle Valo
  0 siblings, 2 replies; 31+ messages in thread
From: Mitchell Nordine @ 2020-12-13  0:59 UTC (permalink / raw)
  To: wi nk; +Cc: Stephen Liang, Carl Huang, ath11k, Kalle Valo

On Sunday, December 13, 2020 1:03 AM, wi nk wink@technolu.st wrote:

> On Sun, Dec 13, 2020 at 12:29 AM wi nk wink@technolu.st wrote:
>
> > On Sat, Dec 12, 2020 at 12:46 PM wi nk wink@technolu.st wrote:
> >
> > > On Sat, Dec 12, 2020 at 6:37 AM Kalle Valo kvalo@codeaurora.org wrote:
> > >
> > > > wi nk wink@technolu.st writes:
> > > >
> > > > > > > > and the modification that disables m2 state:
> > > > > > > >
> > > > > > > > diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> > > > > > > > index 3de7b1639ec6..20f670c8b129 100644
> > > > > > > > --- a/drivers/bus/mhi/core/pm.c
> > > > > > > > +++ b/drivers/bus/mhi/core/pm.c
> > > > > > > > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const
> > > > > > > > dev_state_transitions[] = {
> > > > > > > >      },
> > > > > > > >      {
> > > > > > > >          MHI_PM_M0,
> > > > > > > > -        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > > > > > > > +        MHI_PM_M0 | MHI_PM_M3_ENTER |
> > > > > > > >          MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > > > > > >          MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> > > > > > > >      },
> > > > > > > >      {
> > > > > > > > -        MHI_PM_M2,
> > > > > > > > +        MHI_PM_M0,
> > > > > > > >          MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > > > > > >          MHI_PM_LD_ERR_FATAL_DETECT
> > > > > > > >      },
> > > > > >
> > > > > > Adding one more data point. The driver will not crash on
> > > > > > initialization this way, but also with the M2 state transition
> > > > > > disabled the system survives suspend and wake and the adapter
> > > > > > successfully reassociates consistently. As expected with my patch,
> > > > > > the MHI driver shows everything stays in the M1 state instead of
> > > > > > attempting to transition to M2 ever. It also doesn't return back to
> > > > > > M0 if I disconnect the power / replug it. I'm not sure what things
> > > > > > are affected by me hacking this state machine, but avoiding that M2
> > > > > > transition has removed any obvious issues from my system.
> > > > >
> > > > > While waiting for someone else to confirm, I can report that I've
> > > > > still not seen any instability since this patch. The laptop has been
> > > > > stable through reboots, power cycling, suspension, etc.
> > > >
> > > > Very interesting! Are you saying that with this patch the wireless
> > > > connection using QCA6390 works fine on your Dell XPS 9310, you can
> > > > connect to an AP and transfer data normally?
> > >
> > > Precisely. The machine is now over 24h of uptime, I can reboot/sleep
> > > without any issues, and throughput seems to saturate my wifi link
> > > (500-600 Mbps).
> > >
> > > > I would like to submit your patch to patchwork.kernel.org as RFC patch
> > > > so that it's easier for everyone to download. But before I can do that I
> > > > need your Signed-off-by, can you read Developer's Certificate of Origin:
> > > > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
> > > > And if you agree with the DCO please send your s-o-b by replying to this
> > > > email. But you can also submit the RFC patch yourself, instructions
> > > > here:
> > > > https://wireless.wiki.kernel.org/en/users/drivers/ath11k/submittingpatches
> > >
> > > Signed-off-by: Lee Smith wink@technolu.st
> > > I'll get an email out later this afternoon, if you get there first,
> > > please feel free :).
> > >
> > > > > I'd be happy to continue to try to understand why this is this case.
> > > > > It sounds like Stephen isn't seeing these issues on 5.10 rc6 with the
> > > > > single msi patch+reverting that one commit. I can try to give that a
> > > > > shot if it'd produce something useful.
> > > >
> > > > Yes, being able to give datapoints what affects this bug is very helpful
> > > > to track down it.
> > >
> > > Ok, I'll try to rebuild to that configuration later today and report back.
> > >
> > > > > Kalle - a couple quick questions, in the driver comments the M2 state
> > > > > is loosely documented as a low power mode. Why would it transition to
> > > > > that while on charger/plugging in, but stay in M0 while on battery
> > > > > (you can see this behavior in the videos I linked previously)?
> > > > > Naively I would've expected the opposite behavior.
> > > >
> > > > I would have expected the same as well, it does sound strange or we are
> > > > misunderstanding something. I'll try to find out why it's so. But if you
> > > > learn more, please do let me know.
> > >
> > > Will do.
> > >
> > > > > Also, is there any way to prevent that transition other than my brute
> > > > > force? It seems on battery the 'nominal' state for it is M0, I'm not
> > > > > sure what the effect of it being left in this M1 state really is even
> > > > > though there's nothing observable. Lastly, any thoughts as to why it
> > > > > seems that transition causes the EE state to become invalid?
> > > >
> > > > TBH I'm not very familiar with MHI, you seem to already know it much
> > > > more better than I do :) I'll include more folks to the thread later,
> > > > hopefully they can help.
> > >
> > > Thanks!
> > >
> > > > --
> > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> >
> > Ok I tried to boot 5.10-rc6 with
> > 59c6d022df8efb450f82d33dd6a6812935bd022f (single msi) and reverted
> > 7fef431be9c9. With this kernel, I can't get the wifi adapter to come
> > up, but no freezing. I receive this consistently:
> > [ 23.959920] mhi 0000:55:00.0: Requested to power ON
> > [ 23.960058] mhi 0000:55:00.0: Power on setup success
> > [ 24.362295] ath11k_pci 0000:55:00.0: Respond mem req failed, result: 1, err: 0
> > [ 24.362303] ath11k_pci 0000:55:00.0: qmi failed to respond fw mem req:-22
> > [ 24.374433] ath11k_pci 0000:55:00.0: chip_id 0x0 chip_family 0xb board_id 0xff soc_id 0xffffffff
> > [ 24.374438] ath11k_pci 0000:55:00.0: fw_version 0x101c06cc fw_build_timestamp 2020-06-24 19:50 fw_build_id
> > [ 25.450139] ath11k_pci 0000:55:00.0: failed to receive control response completion, polling..
> > [ 26.474154] ath11k_pci 0000:55:00.0: Service connect timeout
> > [ 26.474163] ath11k_pci 0000:55:00.0: failed to connect to HTT: -110
> > [ 26.477247] ath11k_pci 0000:55:00.0: failed to start core: -110
> > With the latest bringup and my patch to disable M2, I'm still booting
> > and operating reliably.
>
> I took my bringup branch and merged 5.10-rc6 into it. It merges fine,
> and seems to be stable as well.

Nice find wink, I've been running your patch that disables the MHI M2 state on my XPS 9310 for the past few hours and wifi appears to be running smoothly for the first time.

The wifi symbol in the top right menu (GNOME 3 desktop on NixOS) does show a question mark for some reason, but otherwise everything appears quite stable so far.

Perhaps it's worth running git blame on `pm.c` and seeing if the original author of the MHI state machine might be able to shed some light (if they remember)? Would it be inappropriate to cc them into this thread? I'm unsure of mailing list etiquette here.

I'll report back if I run into any other issues, otherwise will keep an eye on this mailing list in case of any updates or new patches that need testing.

And thanks again all, excited to finally discard my dangly USB-2 external wifi adapter :)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-13  0:59                         ` Mitchell Nordine
@ 2020-12-13 22:09                           ` Stephen Liang
  2020-12-16  8:50                           ` Kalle Valo
  1 sibling, 0 replies; 31+ messages in thread
From: Stephen Liang @ 2020-12-13 22:09 UTC (permalink / raw)
  To: Mitchell Nordine; +Cc: Carl Huang, Kalle Valo, wi nk, ath11k

I spent a bit of time this weekend attempting to bisect commits based
on the below reproduction steps, but received no conclusive results.
What I've discovered is that the lockups and the firmware crashing
typically occurred only during WiFi scanning. The reproduction steps
are as follows:

1. Open GNOME WiFi Settings to get a list of networks.
2. Wait a few moments, typically less than a minute.

If you do this, either the firmware will crash or the system will
lock up. I guess the reason why I was able to escape this and stayed
fairly stable is that I never need to look at other WiFi networks and
the system just auto-connects to the same one each time. On any of the
commits (including the tip of bringup), if I never look at the WiFi
networks list, it never locks up.

Firmware crash logs (no lockup): https://pastebin.com/raw/E0y49evA
Firmware crash with lockup: https://i.imgur.com/0XExack.jpg

I ended up going back to the bringup branch on rc6 and applied wi nk's
M2 patch on top. After performing the reproduction steps, I have not
had any lockups or firmware crashes, and it has been running for
several minutes now. MHI also reports dev_state:M1 rather than
dev_state:M2. Great find, wi nk!

Perhaps there's some sort of issue in the WiFi scanning process?


On Sat, Dec 12, 2020 at 5:00 PM Mitchell Nordine
<mail@mitchellnordine.com> wrote:
>
> On Sunday, December 13, 2020 1:03 AM, wi nk wink@technolu.st wrote:
>
> > On Sun, Dec 13, 2020 at 12:29 AM wi nk wink@technolu.st wrote:
> >
> > > On Sat, Dec 12, 2020 at 12:46 PM wi nk wink@technolu.st wrote:
> > >
> > > > On Sat, Dec 12, 2020 at 6:37 AM Kalle Valo kvalo@codeaurora.org wrote:
> > > >
> > > > > wi nk wink@technolu.st writes:
> > > > >
> > > > > > > > > and the modification that disables m2 state:
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> > > > > > > > > index 3de7b1639ec6..20f670c8b129 100644
> > > > > > > > > --- a/drivers/bus/mhi/core/pm.c
> > > > > > > > > +++ b/drivers/bus/mhi/core/pm.c
> > > > > > > > > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const
> > > > > > > > > dev_state_transitions[] = {
> > > > > > > > >      },
> > > > > > > > >      {
> > > > > > > > >          MHI_PM_M0,
> > > > > > > > > -        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > > > > > > > > +        MHI_PM_M0 | MHI_PM_M3_ENTER |
> > > > > > > > >          MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > > > > > > >          MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> > > > > > > > >      },
> > > > > > > > >      {
> > > > > > > > > -        MHI_PM_M2,
> > > > > > > > > +        MHI_PM_M0,
> > > > > > > > >          MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > > > > > > >          MHI_PM_LD_ERR_FATAL_DETECT
> > > > > > > > >      },
> > > > > > >
> > > > > > > Adding one more data point. The driver will not crash on
> > > > > > > initialization this way, but also with the M2 state transition
> > > > > > > disabled the system survives suspend and wake and the adapter
> > > > > > > successfully reassociates consistently. As expected with my patch,
> > > > > > > the MHI driver shows everything stays in the M1 state instead of
> > > > > > > attempting to transition to M2 ever. It also doesn't return back to
> > > > > > > M0 if I disconnect the power / replug it. I'm not sure what things
> > > > > > > are affected by me hacking this state machine, but avoiding that M2
> > > > > > > transition has removed any obvious issues from my system.
> > > > > >
> > > > > > While waiting for someone else to confirm, I can report that I've
> > > > > > still not seen any instability since this patch. The laptop has been
> > > > > > stable through reboots, power cycling, suspension, etc.
> > > > >
> > > > > Very interesting! Are you saying that with this patch the wireless
> > > > > connection using QCA6390 works fine on your Dell XPS 9310, you can
> > > > > connect to an AP and transfer data normally?
> > > >
> > > > Precisely. The machine is now over 24h of uptime, I can reboot/sleep
> > > > without any issues, and throughput seems to saturate my wifi link
> > > > (500-600 Mbps).
> > > >
> > > > > I would like to submit your patch to patchwork.kernel.org as RFC patch
> > > > > so that it's easier for everyone to download. But before I can do that I
> > > > > need your Signed-off-by, can you read Developer's Certificate of Origin:
> > > > > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
> > > > > And if you agree with the DCO please send your s-o-b by replying to this
> > > > > email. But you can also submit the RFC patch yourself, instructions
> > > > > here:
> > > > > https://wireless.wiki.kernel.org/en/users/drivers/ath11k/submittingpatches
> > > >
> > > > Signed-off-by: Lee Smith wink@technolu.st
> > > > I'll get an email out later this afternoon, if you get there first,
> > > > please feel free :).
> > > >
> > > > > > I'd be happy to continue to try to understand why this is this case.
> > > > > > It sounds like Stephen isn't seeing these issues on 5.10 rc6 with the
> > > > > > single msi patch+reverting that one commit. I can try to give that a
> > > > > > shot if it'd produce something useful.
> > > > >
> > > > > Yes, being able to give datapoints what affects this bug is very helpful
> > > > > to track down it.
> > > >
> > > > Ok, I'll try to rebuild to that configuration later today and report back.
> > > >
> > > > > > Kalle - a couple quick questions, in the driver comments the M2 state
> > > > > > is loosely documented as a low power mode. Why would it transition to
> > > > > > that while on charger/plugging in, but stay in M0 while on battery
> > > > > > (you can see this behavior in the videos I linked previously)?
> > > > > > Naively I would've expected the opposite behavior.
> > > > >
> > > > > I would have expected the same as well, it does sound strange or we are
> > > > > misunderstanding something. I'll try to find out why it's so. But if you
> > > > > learn more, please do let me know.
> > > >
> > > > Will do.
> > > >
> > > > > > Also, is there any way to prevent that transition other than my brute
> > > > > > force? It seems on battery the 'nominal' state for it is M0, I'm not
> > > > > > sure what the effect of it being left in this M1 state really is even
> > > > > > though there's nothing observable. Lastly, any thoughts as to why it
> > > > > > seems that transition causes the EE state to become invalid?
> > > > >
> > > > > TBH I'm not very familiar with MHI, you seem to already know it much
> > > > > more better than I do :) I'll include more folks to the thread later,
> > > > > hopefully they can help.
> > > >
> > > > Thanks!
> > > >
> > > > > --
> > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > >
> > > Ok I tried to boot 5.10-rc6 with
> > > 59c6d022df8efb450f82d33dd6a6812935bd022f (single msi) and reverted
> > > 7fef431be9c9. With this kernel, I can't get the wifi adapter to come
> > > up, but no freezing. I receive this consistently:
> > > [ 23.959920] mhi 0000:55:00.0: Requested to power ON
> > > [ 23.960058] mhi 0000:55:00.0: Power on setup success
> > > [ 24.362295] ath11k_pci 0000:55:00.0: Respond mem req failed, result: 1, err: 0
> > > [ 24.362303] ath11k_pci 0000:55:00.0: qmi failed to respond fw mem req:-22
> > > [ 24.374433] ath11k_pci 0000:55:00.0: chip_id 0x0 chip_family 0xb board_id 0xff soc_id 0xffffffff
> > > [ 24.374438] ath11k_pci 0000:55:00.0: fw_version 0x101c06cc fw_build_timestamp 2020-06-24 19:50 fw_build_id
> > > [ 25.450139] ath11k_pci 0000:55:00.0: failed to receive control response completion, polling..
> > > [ 26.474154] ath11k_pci 0000:55:00.0: Service connect timeout
> > > [ 26.474163] ath11k_pci 0000:55:00.0: failed to connect to HTT: -110
> > > [ 26.477247] ath11k_pci 0000:55:00.0: failed to start core: -110
> > > With the latest bringup and my patch to disable M2, I'm still booting
> > > and operating reliably.
> >
> > I took my bringup branch and merged 5.10-rc6 into it. It merges fine,
> > and seems to be stable as well.
>
> Nice find wink, I've been running your patch that disables the MHI M2 state on my XPS 9310 for the past few hours and wifi appears to be running smoothly for the first time.
>
> The wifi symbol in the top right menu (GNOME 3 desktop on NixOS) does show a question mark for some reason, but otherwise everything appears quite stable so far.
>
> Perhaps it's worth running git blame on `pm.c` and seeing if the original author of the MHI state machine might be able to shed some light (if they remember)? Would it be inappropriate to cc them into this thread? I'm unsure of mailing list etiquette here.
>
> I'll report back if I run into any other issues, otherwise will keep an eye on this mailing list in case of any updates or new patches that need testing.
>
> And thanks again all, excited to finally discard my dangly USB-2 external wifi adapter :)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-13  0:59                         ` Mitchell Nordine
  2020-12-13 22:09                           ` Stephen Liang
@ 2020-12-16  8:50                           ` Kalle Valo
  1 sibling, 0 replies; 31+ messages in thread
From: Kalle Valo @ 2020-12-16  8:50 UTC (permalink / raw)
  To: Mitchell Nordine; +Cc: Stephen Liang, Carl Huang, ath11k, wi nk

Mitchell Nordine <mail@mitchellnordine.com> writes:

>> > Ok I tried to boot 5.10-rc6 with
>> > 59c6d022df8efb450f82d33dd6a6812935bd022f (single msi) and reverted
>> > 7fef431be9c9. With this kernel, I can't get the wifi adapter to come
>> > up, but no freezing. I receive this consistently:
>> > [ 23.959920] mhi 0000:55:00.0: Requested to power ON
>> > [ 23.960058] mhi 0000:55:00.0: Power on setup success
>> > [ 24.362295] ath11k_pci 0000:55:00.0: Respond mem req failed, result: 1, err: 0
>> > [ 24.362303] ath11k_pci 0000:55:00.0: qmi failed to respond fw mem req:-22
>> > [ 24.374433] ath11k_pci 0000:55:00.0: chip_id 0x0 chip_family 0xb board_id 0xff soc_id 0xffffffff
>> > [ 24.374438] ath11k_pci 0000:55:00.0: fw_version 0x101c06cc fw_build_timestamp 2020-06-24 19:50 fw_build_id
>> > [ 25.450139] ath11k_pci 0000:55:00.0: failed to receive control response completion, polling..
>> > [ 26.474154] ath11k_pci 0000:55:00.0: Service connect timeout
>> > [ 26.474163] ath11k_pci 0000:55:00.0: failed to connect to HTT: -110
>> > [ 26.477247] ath11k_pci 0000:55:00.0: failed to start core: -110
>> > With the latest bringup and my patch to disable M2, I'm still booting
>> > and operating reliably.
>>
>> I took my bringup branch and merged 5.10-rc6 into it. It merges fine,
>> and seems to be stable as well.
>
> Nice find wink, I've been running your patch that disables the MHI M2
> state on my XPS 9310 for the past few hours and wifi appears to be
> running smoothly for the first time.
>
> The wifi symbol in the top right menu (GNOME 3 desktop on NixOS) does
> show a question mark for some reason, but otherwise everything appears
> quite stable so far.
>
> Perhaps it's worth running git blame on `pm.c` and seeing if the
> original author of the MHI state machine might be able to shed some
> light (if they remember)? Would it be inappropriate to cc them into
> this thread? I'm unsure of mailing list etiquette here.

I started a new thread and included the MHI developers, hopefully they
have more insight into what could cause this.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-10  3:07   ` Stephen Liang
@ 2020-12-10  7:37     ` Stephen Liang
  0 siblings, 0 replies; 31+ messages in thread
From: Stephen Liang @ 2020-12-10  7:37 UTC (permalink / raw)
  To: Kalle Valo; +Cc: ath11k

Okay, I installed ath11k-qca6390-bringup (6244933dd) and did notice
lockups occurring. In fact, I found a reproducible way to lock up the
system:

1. Turn off WiFi
2. Reboot system
3. Login
4. Turn on WiFi

If you perform those steps, the system will lock up on
ath11k-qca6390-bringup, dmesg: https://i.imgur.com/0XExack.jpg

On 59c6d02 + revert of 7fef431be9c9, the firmware will crash but the
system will not lock up. dmesg: https://pastebin.com/raw/hH0HyWVx

If you keep WiFi on, i.e. the system is automatically connected to
WiFi on boot, then on the ath11k-qca6390-bringup tag, it randomly
freezes. On 59c6d02, it seems to be fairly stable so long as I do not
turn off/turn on WiFi. I can bisect based on the reproducible steps in
the next couple days if nobody else comes up with something first.

On Wed, Dec 9, 2020 at 7:07 PM Stephen Liang <stephenliang7@gmail.com> wrote:
>
> > Thanks, good to know. On what platform is this, can you share more details? Is it the new Dell XPS 13 9310 or something else?
>
> Yes, this is the XPS 13 9310.
>
> > I recommend using ath11k-qca6390-bringup branch, it has quite a few
> > fixes:
>
> Let me give that a shot, perhaps if I can reproduce the same issues
> the others have been seeing I could start a bisect as well.
>
> My uptime is now at 2 days with Fedora Rawhide kernel 5.10.0-0.rc6
> with commit 59c6d022df8efb450f82d33dd6a6812935bd022f and a revert of
> 7fef431be9c9 patched in. I'm not experiencing any of the stalls or
> crashes that others have been mentioning and I have been able to
> suspend/resume consistently. No issues with power being plugged and
> unplugged either. Happy to help provide details if anybody needs any.
>
> > You mean that the reboot stalls and you need to turn off the laptop using power button?
>
> Actually, with the previously mentioned commit combination, I was able
> to shutdown and reboot successfully as well.

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-09 15:09 ` Kalle Valo
@ 2020-12-10  3:07   ` Stephen Liang
  2020-12-10  7:37     ` Stephen Liang
  0 siblings, 1 reply; 31+ messages in thread
From: Stephen Liang @ 2020-12-10  3:07 UTC (permalink / raw)
  To: Kalle Valo; +Cc: ath11k

> Thanks, good to know. On what platform is this, can you share more details? Is it the new Dell XPS 13 9310 or something else?

Yes, this is the XPS 13 9310.

> I recommend using ath11k-qca6390-bringup branch, it has quite a few
> fixes:

Let me give that a shot, perhaps if I can reproduce the same issues
the others have been seeing I could start a bisect as well.

My uptime is now at 2 days with Fedora Rawhide kernel 5.10.0-0.rc6
with commit 59c6d022df8efb450f82d33dd6a6812935bd022f and a revert of
7fef431be9c9 patched in. I'm not experiencing any of the stalls or
crashes that others have been mentioning and I have been able to
suspend/resume consistently. No issues with power being plugged and
unplugged either. Happy to help provide details if anybody needs any.

> You mean that the reboot stalls and you need to turn off the laptop using power button?

Actually, with the previously mentioned commit combination, I was able
to shutdown and reboot successfully as well.

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-02 23:49 Stephen Liang
@ 2020-12-09 15:09 ` Kalle Valo
  2020-12-10  3:07   ` Stephen Liang
  0 siblings, 1 reply; 31+ messages in thread
From: Kalle Valo @ 2020-12-09 15:09 UTC (permalink / raw)
  To: Stephen Liang; +Cc: ath11k

Stephen Liang <stephenliang7@gmail.com> writes:

> For reference, I am running Fedora Rawhide kernel 5.10.0-0.rc6 with
> commit 59c6d022df8efb450f82d33dd6a6812935bd022f and a revert of
> 7fef431be9c9 patched in.
>
> I am able to connect to WiFi and did not receive any immediate
> freezing, hard kernel panics, etc. Uptime is now going on 35 minutes
> and able to get 50 up/down throughput.

Thanks, good to know. On what platform is this, can you share more
details? Is it the new Dell XPS 13 9310 or something else?

> I do see a number of taints in dmesg while connecting to an access
> point but the taint doesn't seem to affect functionality, for example:
> https://pastebin.com/raw/gbzuvs3q

This commit should fix that:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/commit/?h=ath11k-qca6390-bringup&id=fa4eea695afb286ae38beb30dabf251335cb4a62

I recommend using ath11k-qca6390-bringup branch, it has quite a few
fixes:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git/log/?h=ath11k-qca6390-bringup

> Finally, rebooting the computer does require a hard power down.

You mean that the reboot stalls and you need to turn off the laptop
using power button?

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-05 19:17     ` wi nk
@ 2020-12-06  8:05       ` wi nk
  0 siblings, 0 replies; 31+ messages in thread
From: wi nk @ 2020-12-06  8:05 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Thomas Krause, ath11k

On Sat, Dec 5, 2020 at 8:17 PM wi nk <wink@technolu.st> wrote:
>
> On Tue, Dec 1, 2020 at 11:17 AM wi nk <wink@technolu.st> wrote:
> >
> > On Mon, Nov 30, 2020 at 6:02 PM wi nk <wink@technolu.st> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> > > >
> > > > Hi Wi and Thomas,
> > > >
> > > > I'll start a new thread about problems on XPS 13. The information is
> > > > scattered to different threads and hard to find everything, it's much
> > > > easier to have everything in one place. So let's continue the discussion
> > > > about the kernel crashes on this thread.
> > > >
> > > > Here's what I have understood so far:
> > > >
> > > > * On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > >   with 32 MSI vectors.
> > > >
> > > > * On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > >
> > > > [    0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > >                BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > >
> > > > * Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > >   13. We added a hack to ath11k to make it work with only one vector and after
> > > >   that it's possible to boot the firmware, connect to the AP and use the
> > > >   device for a while.
> > > >
> > > > * But the problem now is that the kernel is crashing almost immediately
> > > >   and almost every time(?). And these crashes only happen on Dell XPS
> > > >   13, all other systems (including Dell XPS 15) seem to work without
> > > >   issues.
> > > >
> > > > Is my understanding correct? Did I miss anything?
> > > >
> > > > About the symptoms Wi reports:
> > > >
> > > > ----------------------------------------------------------------------
> > > > So up until this point, everything is working without issues.
> > > > Everything seems to spiral out of control a couple of seconds later
> > > > when my system attempts to actually bring up the adapter.  In most of
> > > > the crash states I will see this:
> > > >
> > > > [   31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > [   31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > [   31.391928] wlp85s0: authenticated
> > > > [   31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > [   31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > > (capab=0x411 status=0 aid=6)
> > > > [   31.407730] wlp85s0: associated
> > > > [   31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > >
> > > > And then either somewhere in that pile of messages, or a second or two
> > > > after this my machine will start to stutter as I mentioned before, and
> > > > then it either hangs, or I see this message (I'm truncating the
> > > > timestamp):
> > > >
> > > > [   35.xxxx ] sched: RT throttling activated
> > > >
> > > > After that moment, the machine is unresponsive.  Sorry I can't seem to
> > > > extract this data other than screenshots from my phone at the moment,
> > > > you can see the dmesg output from 6 different hangs here:
> > > >
> > > > https://github.com/w1nk/ath11k-debug
> > > > ----------------------------------------------------------------------
> > > >
> > > > And Thomas Krause reports:
> > > >
> > > > --------------------------------------------------------------------------------
> > > > I can confirm this behavior on my configuration. I managed to login
> > > > once and select the Wifi and connect to it. It seemed curiously enough
> > > > be stable long enough to enter the Wifi passphrase. After the
> > > > connection was established, the system hang and on each attempt to
> > > > reboot into the graphical system it would freeze at some point
> > > > (sometimes even before showing the login screen).
> > > > ----------------------------------------------------------------------
> > > >
> > > > --
> > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > >
> > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > >
> > > Hi Kalle,
> > >
> > >   Again, thanks much for your work.  I think you've summarized
> > > everything up until this point.  On my XPS 13 9310 the behavior of the
> > > RT throttling still exists for me occasionally on loading the
> > > driver/associating with an AP.  The throttling consistently occurs
> > > after a few sets of the MHI debug printing showing the EE entering an
> > > invalid state ( AMSS -> INVALID_EE ).  I'm now building the latest tag
> > > to see if there are any differences.
> > >
> > > Thanks!
> >
> > Just to follow up, the first boot resulted in the RT throttling
> > message as the adapter was coming up/associating, shortly after the
> > firmware crashed and the kernel didn't fully freeze, but I needed to
> > reboot to bring the adapter back.
>
> Kalle -
>
>   I've noticed one additional behavior that may give someone with
> familiarity with the QCA hardware a clue.  I'm running
> ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310.  For
> whatever reason, having the bluetooth subsystem enabled (with a paired
> device) on this dell basically guarantees I'll hit the scheduler
> throttling issue as the ath11k driver is initializing / associating.
> The bluetooth system is using the btqca driver.  I don't have any
> useful debugging (I'll gladly collect some if there is a way to do it)
> other than tracking some simple statistics.  I booted my system 20
> times, 10 times with bluetooth enabled (and some headphones turned on
> ready to pair), and 10 times without.  In both scenarios, I'm booting
> into X and manually modprobing the ath11k driver.  The difference is
> that with bluetooth on and by the time I modprobe the driver, the
> headphones are paired and I received the throttling message and
> subsequent freezing 10/10 times.  With bluetooth off / my headphones
> not paired, I only saw it 2/10.  I know it's not much hard information
> but it's reliably reproducible for me, is there anything useful I can
> collect?

Well unfortunately I think the bluetooth was just a red herring in the
racing.  To chase that, I disabled all bluetooth and was able to get
into a state where I had 6 failed boots in a row.  To further poke
around, I rebuilt the kernel with localmodconfig to disable building
big chunks of things.  This kernel is way less stable and seems to
freeze most of the time (but does occasionally remain stable), I'm not
sure what else got disabled in there, but it seems to have had a
negative impact on the crash racing.

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-12-01 10:17   ` wi nk
@ 2020-12-05 19:17     ` wi nk
  2020-12-06  8:05       ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-05 19:17 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Thomas Krause, ath11k

On Tue, Dec 1, 2020 at 11:17 AM wi nk <wink@technolu.st> wrote:
>
> On Mon, Nov 30, 2020 at 6:02 PM wi nk <wink@technolu.st> wrote:
> >
> > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> > >
> > > Hi Wi and Thomas,
> > >
> > > I'll start a new thread about problems on XPS 13. The information is
> > > scattered to different threads and hard to find everything, it's much
> > > easier to have everything in one place. So let's continue the discussion
> > > about the kernel crashes on this thread.
> > >
> > > Here's what I have understood so far:
> > >
> > > * On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > >   with 32 MSI vectors.
> > >
> > > * On Dell XPS 13 there's a BIOS bug and kernel prints:
> > >
> > > [    0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > >                BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > >
> > > * Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > >   13. We added a hack to ath11k to make it work with only one vector and after
> > >   that it's possible to boot the firmware, connect to the AP and use the
> > >   device for a while.
> > >
> > > * But the problem now is that the kernel is crashing almost immediately
> > >   and almost every time(?). And these crashes only happen on Dell XPS
> > >   13, all other systems (including Dell XPS 15) seem to work without
> > >   issues.
> > >
> > > Is my understanding correct? Did I miss anything?
> > >
> > > About the symptoms Wi reports:
> > >
> > > ----------------------------------------------------------------------
> > > So up until this point, everything is working without issues.
> > > Everything seems to spiral out of control a couple of seconds later
> > > when my system attempts to actually bring up the adapter.  In most of
> > > the crash states I will see this:
> > >
> > > [   31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > [   31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > [   31.391928] wlp85s0: authenticated
> > > [   31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > [   31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > (capab=0x411 status=0 aid=6)
> > > [   31.407730] wlp85s0: associated
> > > [   31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > >
> > > And then either somewhere in that pile of messages, or a second or two
> > > after this my machine will start to stutter as I mentioned before, and
> > > then it either hangs, or I see this message (I'm truncating the
> > > timestamp):
> > >
> > > [   35.xxxx ] sched: RT throttling activated
> > >
> > > After that moment, the machine is unresponsive.  Sorry I can't seem to
> > > extract this data other than screenshots from my phone at the moment,
> > > you can see the dmesg output from 6 different hangs here:
> > >
> > > https://github.com/w1nk/ath11k-debug
> > > ----------------------------------------------------------------------
> > >
> > > And Thomas Krause reports:
> > >
> > > --------------------------------------------------------------------------------
> > > I can confirm this behavior on my configuration. I managed to login
> > > once and select the Wifi and connect to it. It seemed curiously enough
> > > be stable long enough to enter the Wifi passphrase. After the
> > > connection was established, the system hang and on each attempt to
> > > reboot into the graphical system it would freeze at some point
> > > (sometimes even before showing the login screen).
> > > ----------------------------------------------------------------------
> > >
> > > --
> > > https://patchwork.kernel.org/project/linux-wireless/list/
> > >
> > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> >
> > Hi Kalle,
> >
> >   Again, thanks much for your work.  I think you've summarized
> > everything up until this point.  On my XPS 13 9310 the behavior of the
> > RT throttling still exists for me occasionally on loading the
> > driver/associating with an AP.  The throttling consistently occurs
> > after a few sets of the MHI debug printing showing the EE entering an
> > invalid state ( AMSS -> INVALID_EE ).  I'm now building the latest tag
> > to see if there are any differences.
> >
> > Thanks!
>
> Just to follow up, the first boot resulted in the RT throttling
> message as the adapter was coming up/associating, shortly after the
> firmware crashed and the kernel didn't fully freeze, but I needed to
> reboot to bring the adapter back.

Kalle -

  I've noticed one additional behavior that may give someone with
familiarity with the QCA hardware a clue.  I'm running
ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310.  For
whatever reason, having the bluetooth subsystem enabled (with a paired
device) on this dell basically guarantees I'll hit the scheduler
throttling issue as the ath11k driver is initializing / associating.
The bluetooth system is using the btqca driver.  I don't have any
useful debugging (I'll gladly collect some if there is a way to do it)
other than tracking some simple statistics.  I booted my system 20
times, 10 times with bluetooth enabled (and some headphones turned on
ready to pair), and 10 times without.  In both scenarios, I'm booting
into X and manually modprobing the ath11k driver.  The difference is
that with bluetooth on and by the time I modprobe the driver, the
headphones are paired and I received the throttling message and
subsequent freezing 10/10 times.  With bluetooth off / my headphones
not paired, I only saw it 2/10.  I know it's not much hard information
but it's reliably reproducible for me, is there anything useful I can
collect?
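
As a rough sanity check on the counts above (10 of 10 freezes with bluetooth on, 2 of 10 with it off), a one-sided Fisher exact test can be computed by hand. This is only a sketch: the function name and the 2x2 table layout are illustrative choices, and 20 boots is a small sample. A later message in this thread also reports six failed boots in a row with bluetooth disabled, so a small p-value here does not establish a cause.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """Probability of seeing at least `a` events in group 1 by chance.

    2x2 table (illustrative layout):
        group 1: a events, b non-events   (bluetooth on:  freezes, clean boots)
        group 2: c events, d non-events   (bluetooth off: freezes, clean boots)
    The row and column totals are held fixed (hypergeometric null).
    """
    n1 = a + b                  # boots in group 1
    total_n = a + b + c + d     # all boots
    events = a + c              # all freezes
    denom = comb(total_n, n1)
    upper = min(n1, events)
    # Sum the hypergeometric probabilities for k = a .. upper events in group 1.
    numer = sum(
        comb(events, k) * comb(total_n - events, n1 - k)
        for k in range(a, upper + 1)
    )
    return numer / denom


# Counts taken from the message above.
p = fisher_one_sided(10, 0, 2, 8)
print(f"one-sided p = {p:.6f}")
```

With these counts only one table is as extreme as the observed one, so the sum has a single term.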

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-11-30 17:02 ` wi nk
@ 2020-12-01 10:17   ` wi nk
  2020-12-05 19:17     ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-12-01 10:17 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Thomas Krause, ath11k

On Mon, Nov 30, 2020 at 6:02 PM wi nk <wink@technolu.st> wrote:
>
> On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo <kvalo@codeaurora.org> wrote:
> >
> > Hi Wi and Thomas,
> >
> > I'll start a new thread about problems on XPS 13. The information is
> > scattered to different threads and hard to find everything, it's much
> > easier to have everything in one place. So let's continue the discussion
> > about the kernel crashes on this thread.
> >
> > Here's what I have understood so far:
> >
> > * On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> >   with 32 MSI vectors.
> >
> > * On Dell XPS 13 there's a BIOS bug and kernel prints:
> >
> > [    0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> >                BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> >
> > * Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > >   13. We added a hack to ath11k to make it work with only one vector and after
> >   that it's possible to boot the firmware, connect to the AP and use the
> >   device for a while.
> >
> > * But the problem now is that the kernel is crashing almost immediately
> >   and almost every time(?). And these crashes only happen on Dell XPS
> >   13, all other systems (including Dell XPS 15) seem to work without
> >   issues.
> >
> > Is my understanding correct? Did I miss anything?
> >
> > About the symptoms Wi reports:
> >
> > ----------------------------------------------------------------------
> > So up until this point, everything is working without issues.
> > Everything seems to spiral out of control a couple of seconds later
> > when my system attempts to actually bring up the adapter.  In most of
> > the crash states I will see this:
> >
> > [   31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > [   31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > [   31.391928] wlp85s0: authenticated
> > [   31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > [   31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > (capab=0x411 status=0 aid=6)
> > [   31.407730] wlp85s0: associated
> > [   31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> >
> > And then either somewhere in that pile of messages, or a second or two
> > after this my machine will start to stutter as I mentioned before, and
> > then it either hangs, or I see this message (I'm truncating the
> > timestamp):
> >
> > [   35.xxxx ] sched: RT throttling activated
> >
> > After that moment, the machine is unresponsive.  Sorry I can't seem to
> > extract this data other than screenshots from my phone at the moment,
> > you can see the dmesg output from 6 different hangs here:
> >
> > https://github.com/w1nk/ath11k-debug
> > ----------------------------------------------------------------------
> >
> > And Thomas Krause reports:
> >
> > --------------------------------------------------------------------------------
> > I can confirm this behavior on my configuration. I managed to login
> > once and select the Wifi and connect to it. It seemed curiously enough
> > be stable long enough to enter the Wifi passphrase. After the
> > connection was established, the system hang and on each attempt to
> > reboot into the graphical system it would freeze at some point
> > (sometimes even before showing the login screen).
> > ----------------------------------------------------------------------
> >
> > --
> > https://patchwork.kernel.org/project/linux-wireless/list/
> >
> > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
>
> Hi Kalle,
>
>   Again, thanks much for your work.  I think you've summarized
> everything up until this point.  On my XPS 13 9310 the behavior of the
> RT throttling still exists for me occasionally on loading the
> driver/associating with an AP.  The throttling consistently occurs
> after a few sets of the MHI debug printing showing the EE entering an
> invalid state ( AMSS -> INVALID_EE ).  I'm now building the latest tag
> to see if there are any differences.
>
> Thanks!

Just to follow up, the first boot resulted in the RT throttling
message as the adapter was coming up/associating, shortly after the
firmware crashed and the kernel didn't fully freeze, but I needed to
reboot to bring the adapter back.

* Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
  2020-11-30 16:55 Kalle Valo
@ 2020-11-30 17:02 ` wi nk
  2020-12-01 10:17   ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: wi nk @ 2020-11-30 17:02 UTC (permalink / raw)
  To: Kalle Valo; +Cc: Thomas Krause, ath11k

On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo <kvalo@codeaurora.org> wrote:
>
> Hi Wi and Thomas,
>
> I'll start a new thread about problems on XPS 13. The information is
> scattered to different threads and hard to find everything, it's much
> easier to have everything in one place. So let's continue the discussion
> about the kernel crashes on this thread.
>
> Here's what I have understood so far:
>
> * On Dell XPS 15 there are no issues with QCA6390 and it seems to work
>   with 32 MSI vectors.
>
> * On Dell XPS 13 there's a BIOS bug and kernel prints:
>
> [    0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
>                BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
>
> * Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
>   13. We added a hack to ath11k to make it work with only one vector and after
>   that it's possible to boot the firmware, connect to the AP and use the
>   device for a while.
>
> * But the problem now is that the kernel is crashing almost immediately
>   and almost every time(?). And these crashes only happen on Dell XPS
>   13, all other systems (including Dell XPS 15) seem to work without
>   issues.
>
> Is my understanding correct? Did I miss anything?
>
> About the symptoms Wi reports:
>
> ----------------------------------------------------------------------
> So up until this point, everything is working without issues.
> Everything seems to spiral out of control a couple of seconds later
> when my system attempts to actually bring up the adapter.  In most of
> the crash states I will see this:
>
> [   31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> [   31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> [   31.391928] wlp85s0: authenticated
> [   31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> [   31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> (capab=0x411 status=0 aid=6)
> [   31.407730] wlp85s0: associated
> [   31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
>
> And then either somewhere in that pile of messages, or a second or two
> after this my machine will start to stutter as I mentioned before, and
> then it either hangs, or I see this message (I'm truncating the
> timestamp):
>
> [   35.xxxx ] sched: RT throttling activated
>
> After that moment, the machine is unresponsive.  Sorry I can't seem to
> extract this data other than screenshots from my phone at the moment,
> you can see the dmesg output from 6 different hangs here:
>
> https://github.com/w1nk/ath11k-debug
> ----------------------------------------------------------------------
>
> And Thomas Krause reports:
>
> --------------------------------------------------------------------------------
> I can confirm this behavior on my configuration. I managed to login
> once and select the Wifi and connect to it. It seemed curiously enough
> be stable long enough to enter the Wifi passphrase. After the
> connection was established, the system hang and on each attempt to
> reboot into the graphical system it would freeze at some point
> (sometimes even before showing the login screen).
> ----------------------------------------------------------------------
>
> --
> https://patchwork.kernel.org/project/linux-wireless/list/
>
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

Hi Kalle,

  Again, thanks much for your work.  I think you've summarized
everything up until this point.  On my XPS 13 9310 the behavior of the
RT throttling still exists for me occasionally on loading the
driver/associating with an AP.  The throttling consistently occurs
after a few sets of the MHI debug printing showing the EE entering an
invalid state ( AMSS -> INVALID_EE ).  I'm now building the latest tag
to see if there are any differences.

Thanks!

* ath11k: QCA6390 on Dell XPS 13 and kernel crashes
@ 2020-11-30 16:55 Kalle Valo
  2020-11-30 17:02 ` wi nk
  0 siblings, 1 reply; 31+ messages in thread
From: Kalle Valo @ 2020-11-30 16:55 UTC (permalink / raw)
  To: wi nk, Thomas Krause; +Cc: ath11k

Hi Wi and Thomas,

I'll start a new thread about problems on XPS 13. The information is
scattered to different threads and hard to find everything, it's much
easier to have everything in one place. So let's continue the discussion
about the kernel crashes on this thread.

Here's what I have understood so far:

* On Dell XPS 15 there are no issues with QCA6390 and it seems to work
  with 32 MSI vectors.

* On Dell XPS 13 there's a BIOS bug and kernel prints:

[    0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
               BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:

* Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
  13. We added a hack to ath11k to make it work with only one vector and after
  that it's possible to boot the firmware, connect to the AP and use the
  device for a while.

* But the problem now is that the kernel is crashing almost immediately
  and almost every time(?). And these crashes only happen on Dell XPS
  13, all other systems (including Dell XPS 15) seem to work without
  issues.

Is my understanding correct? Did I miss anything?

About the symptoms Wi reports:

----------------------------------------------------------------------
So up until this point, everything is working without issues.
Everything seems to spiral out of control a couple of seconds later
when my system attempts to actually bring up the adapter.  In most of
the crash states I will see this:

[   31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
[   31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
[   31.391928] wlp85s0: authenticated
[   31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
[   31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
(capab=0x411 status=0 aid=6)
[   31.407730] wlp85s0: associated
[   31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready

And then either somewhere in that pile of messages, or a second or two
after this my machine will start to stutter as I mentioned before, and
then it either hangs, or I see this message (I'm truncating the
timestamp):

[   35.xxxx ] sched: RT throttling activated

After that moment, the machine is unresponsive.  Sorry I can't seem to
extract this data other than screenshots from my phone at the moment,
you can see the dmesg output from 6 different hangs here:

https://github.com/w1nk/ath11k-debug
----------------------------------------------------------------------
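
The ordering described above (association completes, then the scheduler message appears a few seconds later) can be pulled out of saved dmesg text with a short script. A minimal sketch, assuming the log lines keep the `[ seconds.fraction ]` prefix shown above; `parse_dmesg` and `delay_to_throttle` are hypothetical helper names, and the redacted `35.xxxx` timestamp is rounded down to whole seconds, so the result is only approximate.

```python
import re

# "[   31.407730] message" or "[   35.xxxx ] message" (redacted fraction).
LINE_RE = re.compile(r"^\[\s*(\d+)\.(\w+)\s*\]\s*(.*)$")


def parse_dmesg(text):
    """Return a list of (timestamp_seconds, message) tuples."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # wrapped continuation lines carry no timestamp
        secs, frac, msg = m.groups()
        # A redacted fraction such as "xxxx" is treated as .0 (assumption).
        ts = float(f"{secs}.{frac}") if frac.isdigit() else float(secs)
        events.append((ts, msg))
    return events


def delay_to_throttle(events):
    """Seconds between 'associated' and 'RT throttling activated', or None."""
    assoc = next((t for t, m in events if m.endswith(": associated")), None)
    throttle = next(
        (t for t, m in events if "RT throttling activated" in m), None
    )
    if assoc is None or throttle is None:
        return None
    return throttle - assoc


SAMPLE = """\
[   31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
[   31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
[   31.391928] wlp85s0: authenticated
[   31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
[   31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
(capab=0x411 status=0 aid=6)
[   31.407730] wlp85s0: associated
[   31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
[   35.xxxx ] sched: RT throttling activated
"""

events = parse_dmesg(SAMPLE)
delay = delay_to_throttle(events)
print(f"{len(events)} timestamped lines, throttle about {delay:.1f}s after association")
```

Running the same helper over each of the captured hangs would show whether the gap between association and throttling is consistent or random.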

And Thomas Krause reports:

--------------------------------------------------------------------------------
I can confirm this behavior on my configuration. I managed to login
once and select the Wifi and connect to it. It seemed curiously enough
be stable long enough to enter the Wifi passphrase. After the
connection was established, the system hang and on each attempt to
reboot into the graphical system it would freeze at some point
(sometimes even before showing the login screen).
----------------------------------------------------------------------

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

end of thread, other threads:[~2020-12-16  8:50 UTC | newest]

Thread overview: 31+ messages
2020-12-06 17:38 ath11k: QCA6390 on Dell XPS 13 and kernel crashes Mitchell Nordine
2020-12-06 17:53 ` wi nk
2020-12-06 21:45   ` wi nk
2020-12-07  1:17     ` wi nk
2020-12-07 14:45       ` Mitchell Nordine
2020-12-07 17:01         ` wi nk
2020-12-09  1:52           ` wi nk
2020-12-09  9:43             ` wi nk
2020-12-09 15:28               ` wi nk
2020-12-09 15:35     ` Kalle Valo
2020-12-09 15:39       ` wi nk
2020-12-09 15:50         ` wi nk
2020-12-09 15:50         ` Kalle Valo
2020-12-09 15:55           ` wi nk
2020-12-09 21:46             ` wi nk
2020-12-11 12:28               ` wi nk
2020-12-12  5:37                 ` Kalle Valo
2020-12-12 11:46                   ` wi nk
2020-12-12 23:29                     ` wi nk
2020-12-13  0:03                       ` wi nk
2020-12-13  0:59                         ` Mitchell Nordine
2020-12-13 22:09                           ` Stephen Liang
2020-12-16  8:50                           ` Kalle Valo
  -- strict thread matches above, loose matches on Subject: below --
2020-12-02 23:49 Stephen Liang
2020-12-09 15:09 ` Kalle Valo
2020-12-10  3:07   ` Stephen Liang
2020-12-10  7:37     ` Stephen Liang
2020-11-30 16:55 Kalle Valo
2020-11-30 17:02 ` wi nk
2020-12-01 10:17   ` wi nk
2020-12-05 19:17     ` wi nk
2020-12-06  8:05       ` wi nk
