From: wi nk <wink@technolu.st>
To: Mitchell Nordine <mail@mitchellnordine.com>
Cc: "ath11k@lists.infradead.org" <ath11k@lists.infradead.org>
Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
Date: Sun, 6 Dec 2020 22:45:29 +0100
Message-ID: <CAHUdJJUjts8xFOAuULO-j=vft5BS4ijz1Lk6A1YrBfR-=MPFRQ@mail.gmail.com>
In-Reply-To: <CAHUdJJW0wfXQdCJgew8PC+=mk3pBCUaRFXXWARmFNZhogCyABg@mail.gmail.com>

On Sun, Dec 6, 2020 at 6:53 PM wi nk <wink@technolu.st> wrote:
>
> On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> <mail@mitchellnordine.com> wrote:
> >
> > I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> >
> > FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi nk's later observation is correct: the apparent effect of Bluetooth on the race was just a coincidence.
> >
> > Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> >
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Sunday, December 6, 2020 6:00 PM, <ath11k-request@lists.infradead.org> wrote:
> >
> > > Message: 1
> > > Date: Sat, 5 Dec 2020 20:17:10 +0100
> > > From: wi nk wink@technolu.st
> > > To: Kalle Valo kvalo@codeaurora.org
> > > Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > Message-ID:
> > > CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A@mail.gmail.com
> > > Content-Type: text/plain; charset="UTF-8"
> > >
> > > On Tue, Dec 1, 2020 at 11:17 AM wi nk wink@technolu.st wrote:
> > >
> > > > On Mon, Nov 30, 2020 at 6:02 PM wi nk wink@technolu.st wrote:
> > > >
> > > > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo@codeaurora.org wrote:
> > > > >
> > > > > > Hi Wi and Thomas,
> > > > > > I'll start a new thread about problems on XPS 13. The information is
> > > > > > scattered to different threads and hard to find everything, it's much
> > > > > > easier to have everything in one place. So let's continue the discussion
> > > > > > about the kernel crashes on this thread.
> > > > > > Here's what I have understood so far:
> > > > > >
> > > > > > -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > > >     with 32 MSI vectors.
> > > > > >
> > > > > > -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > > >
> > > > > >
> > > > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > > >
> > > > > > > -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > > > >     13. We added a hack to ath11k to make it work with only one vector,
> > > > > > >     and after that it's possible to boot the firmware, connect to the AP
> > > > > > >     and use the device for a while.
> > > > > >
> > > > > > -   But the problem now is that the kernel is crashing almost immediately
> > > > > >     and almost every time(?). And these crashes only happen on Dell XPS
> > > > > >     13, all other systems (including Dell XPS 15) seem to work without
> > > > > >     issues.
> > > > > >
> > > > > >
> > > > > > Is my understanding correct? Did I miss anything?
> > > > > > About the symptoms Wi reports:
> > > > > >
> > > > > > So up until this point, everything is working without issues.
> > > > > > Everything seems to spiral out of control a couple of seconds later
> > > > > > when my system attempts to actually bring up the adapter. In most of
> > > > > > the crash states I will see this:
> > > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > > [ 31.391928] wlp85s0: authenticated
> > > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > > > > (capab=0x411 status=0 aid=6)
> > > > > > [ 31.407730] wlp85s0: associated
> > > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > > timestamp):
> > > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > > extract this data other than screenshots from my phone at the moment,
> > > > > > you can see the dmesg output from 6 different hangs here:
> > > > > >
> > > > > > https://github.com/w1nk/ath11k-debug
> > > > > >
> > > > > > -------------------------------------
> > > > > >
> > > > > > And Thomas Krause reports:
> > > > > >
> > > > > > > I can confirm this behavior on my configuration. I managed to log in
> > > > > > > once, select the Wifi and connect to it. Curiously enough, it seemed
> > > > > > > to be stable long enough to enter the Wifi passphrase. After the
> > > > > > > connection was established, the system hung, and on each attempt to
> > > > > > > reboot into the graphical system it would freeze at some point
> > > > > > > (sometimes even before showing the login screen).
> > > > > >
> > > > > > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > > > >
> > > > > > --
> > > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > > >
> > > > > Hi Kalle,
> > > > > > Again, thanks much for your work. I think you've summarized
> > > > > > everything up until this point. On my XPS 13 9310, the RT throttling
> > > > > > still occurs occasionally when loading the driver/associating with
> > > > > > an AP. The throttling consistently occurs after a few sets of the
> > > > > > MHI debug printing showing the EE entering an invalid state
> > > > > > ( AMSS -> INVALID_EE ). I'm now building the latest tag to see if
> > > > > > there are any differences.
> > > > > Thanks!
> > > >
> > > > > Just to follow up, the first boot resulted in the RT throttling
> > > > > message as the adapter was coming up/associating. Shortly after, the
> > > > > firmware crashed; the kernel didn't fully freeze, but I needed to
> > > > > reboot to bring the adapter back.
> > >
> > > Kalle -
> > >
> > > I've noticed one additional behavior that may give someone with
> > > familiarity with the QCA hardware a clue. I'm running
> > > ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > whatever reason, having the bluetooth subsystem enabled (with a paired
> > > device) on this dell basically guarantees I'll hit the scheduler
> > > throttling issue as the ath11k driver is initializing / associating.
> > > The bluetooth system is using the btqca driver. I don't have any
> > > useful debugging (I'll gladly collect some if there is a way to do it)
> > > > other than tracking some simple statistics. I booted my system 20
> > > > times: 10 times with bluetooth enabled (and some headphones turned on
> > > > ready to pair), and 10 times without. In both scenarios, I'm booting
> > > > into X and manually modprobing the ath11k driver. With bluetooth on,
> > > > the headphones are paired by the time I modprobe the driver, and I
> > > > received the throttling message and subsequent freezing 10/10 times.
> > > > With bluetooth off / my headphones not paired, I only saw it 2/10. I
> > > > know it's not much hard information, but it's reliably reproducible
> > > > for me. Is there anything useful I can collect?
> > >
> > >
> > > -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > >
> > > Message: 2
> > > Date: Sun, 6 Dec 2020 09:05:57 +0100
> > > From: wi nk wink@technolu.st
> > > To: Kalle Valo kvalo@codeaurora.org
> > > Cc: Thomas Krause thomaskrause@posteo.de, ath11k@lists.infradead.org
> > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > Message-ID:
> > > CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q@mail.gmail.com
> > > Content-Type: text/plain; charset="UTF-8"
> > >
> > > Well unfortunately I think the bluetooth was just a red herring in the
> > > racing. To chase that, I disabled all bluetooth and was able to get
> > > into a state where I had 6 failed boots in a row. To further poke
> > > around, I rebuilt the kernel with localmodconfig to disable building
> > > big chunks of things. This kernel is way less stable and seems to
> > > freeze most of the time (but does occasionally remain stable), I'm not
> > > sure what else got disabled in there, but it seems to have had a
> > > negative impact on the crash racing.
> > >
> > >
> > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > >
> > > Subject: Digest Footer
> > >
> > > ath11k mailing list
> > > ath11k@lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/ath11k
> > >
> > >
> > > -----------------------------------------------------------------------------------------------------------------------------
> > >
> > > End of ath11k Digest, Vol 7, Issue 5
> >
>
> Hey Mitchell,
>
>    One more thing to try that may help us get a little bit of extra
> info.  Out of everything I've done, something that has remained
> consistent is to enable the MHI debugging as Kalle suggested:
>
> sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
>
>   Before any crash/spinlock, I see the MHI printing (from
> drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
> and then after a number more iterations through this function, things
> finally go out of control.  So from
>
>         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
>                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
>                 TO_MHI_STATE_STR(state));
>
> I'll see something like this:
>
> [  312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
> [  313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device ee:INVALID_EE dev_state:SYS_ERR
>
> Then, after a few of those prints showing SYS_ERR, I get either a
> spinlock or a firmware crash.  I'm not sure what causes this ee state
> to go invalid, but maybe that's worth looking into.  Can you confirm
> the same behavior?  To see this a little easier, I also run dmesg -wH
> in two windows, one of them piped through grep -v mhi (to filter out
> the mhi debugging).
>
> Thanks!

I've managed to stabilise my system now, so either the race is gone
or I've done something to win it every time.  One of the avenues I
was chasing at first was in the ath11k driver itself.  There are a
couple of areas where the single/shared IRQ is being forcibly toggled
in ways that the documentation says are not great (and that the
original patch was trying to avoid).  Fixing those didn't seem to have
much impact on stability (I've included those changes in my patch
anyway).  After the last email I thought about the MHI side of things
a bit more and found a number of call sites that my naive grepping had
missed; they do the same thing, but as a side effect of acquiring a
lock.  I changed all the calls to *_lock_irq and *_unlock_irq to the
*_lock_irqsave and *_unlock_irqrestore variants, which take a flags
parameter to capture and restore the interrupt state.  I've now
booted and loaded the driver 10+ times without a single freeze or
crash.  I'm not sure all of those modifications are necessary (i.e.
which paths are re-entrant in this single-interrupt operating mode and
which can use the simpler lock/unlock mechanisms), so I could use
some advice/guidance there.
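
To illustrate why the two lock families behave differently when the
caller already has interrupts disabled, here is a rough userspace
sketch.  It is not taken from the patch and does not use any kernel
API: the sim_* functions, the helper_* functions and the boolean
"interrupts enabled" flag are all made up for illustration, and only
model the general rule that the *_unlock_irq variants enable
interrupts unconditionally while the save/restore variants put back
whatever state they found.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated "local interrupts enabled" flag; a stand-in for real CPU state. */
static bool sim_irqs_enabled = true;
/* Simulated non-recursive lock. */
static bool sim_lock_held = false;

/* Models *_lock_irq: disable interrupts, then take the lock. */
static void sim_lock_irq(void)
{
    sim_irqs_enabled = false;
    assert(!sim_lock_held);
    sim_lock_held = true;
}

/* Models *_unlock_irq: drop the lock, then enable interrupts unconditionally. */
static void sim_unlock_irq(void)
{
    assert(sim_lock_held);
    sim_lock_held = false;
    sim_irqs_enabled = true;
}

/* Models *_lock_irqsave: remember the current state, disable, take the lock. */
static void sim_lock_irqsave(bool *flags)
{
    *flags = sim_irqs_enabled;
    sim_irqs_enabled = false;
    assert(!sim_lock_held);
    sim_lock_held = true;
}

/* Models *_unlock_irqrestore: drop the lock, then restore the saved state. */
static void sim_unlock_irqrestore(bool flags)
{
    assert(sim_lock_held);
    sim_lock_held = false;
    sim_irqs_enabled = flags;
}

/* Hypothetical helper written with the plain irq variants. */
static void helper_plain(void)
{
    sim_lock_irq();
    sim_unlock_irq();
}

/* The same hypothetical helper written with the save/restore variants. */
static void helper_saved(void)
{
    bool flags;

    sim_lock_irqsave(&flags);
    sim_unlock_irqrestore(flags);
}

/*
 * Call a helper from a context that already has interrupts disabled and
 * report whether it turned them back on behind the caller's back.
 */
static bool irqs_reenabled_by(void (*helper)(void))
{
    sim_irqs_enabled = false;          /* caller has interrupts off */
    helper();
    bool reenabled = sim_irqs_enabled; /* what the caller sees afterwards */
    sim_irqs_enabled = true;           /* reset for the next experiment */
    return reenabled;
}
```

With this model, irqs_reenabled_by(helper_plain) returns true and
irqs_reenabled_by(helper_saved) returns false.  Whether that is the
actual mechanism behind the freezes here is only my guess.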

Mitchell - if you want to grab this patch and try it, let me know how
it goes and I can clean it up for the mailing list:
https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
(apply to ath11k-qca6390-bringup-202011301608)

