linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected)
@ 2020-01-01  0:49 Stephen Oberholtzer
  2020-01-02 13:28 ` Johannes Berg
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Oberholtzer @ 2020-01-01  0:49 UTC (permalink / raw)
  To: toke; +Cc: Johannes Berg, David S. Miller, linux-wireless, netdev, linux-kernel

Wireless networking goes down on Acer C720P Chromebook (bisected)

Culprit: 7a89233a ("mac80211: Use Airtime-based Queue Limits (AQL) on
packet dequeue")

I found that the newest kernel (5.4) displayed a curious issue on my
Acer C720P Chromebook: shortly after bringing networking up, all
connections would suddenly fail.  I discovered that I could
consistently reproduce the issue by ssh'ing into the machine and
running 'dmesg' -- on a non-working kernel; I would get partial
output, and then the connection would completely hang. This was so
consistent, in fact, that I was able to leverage it to automate the
process from 'git bisect run'.

KEYWORDS: c720p, chromebook, wireless, networking, mac80211

KERNEL: any kernel containing commit 7a89233a ("mac80211: Use
Airtime-based Queue Limits (AQL) on packet dequeue")

I find this bit in the offending commit's message suspicious:

> This patch does *not* include any mechanism to wake a throttled TXQ again,
> on the assumption that this will happen anyway as a side effect of whatever
>  freed the skb (most commonly a TX completion).

Methinks this assumption is not a fully valid one.

I'll be happy to test any patches. If you need some printk calls, just
tell me where to put 'em.

This is the card:

01:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network
Adapter (rev 01)
        Subsystem: Foxconn International, Inc. AR9462 Wireless Network Adapter
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at e0400000 (64-bit, non-prefetchable) [size=512K]
        Expansion ROM at e0480000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/4 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s
unlimited, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+
FLReset- SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq-
AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1,
Exit Latency L0s <4us, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported,
TimeoutDis+, LTR-, OBFF Not Supported
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms,
TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-,
LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap-
ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [160 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Kernel driver in use: ath9k
        Kernel modules: ath9k



-- 
-- Stevie-O
Real programmers use COPY CON PROGRAM.EXE

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected)
  2020-01-01  0:49 PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected) Stephen Oberholtzer
@ 2020-01-02 13:28 ` Johannes Berg
  2020-01-02 23:05   ` Stephen Oberholtzer
  0 siblings, 1 reply; 7+ messages in thread
From: Johannes Berg @ 2020-01-02 13:28 UTC (permalink / raw)
  To: Stephen Oberholtzer, toke
  Cc: David S. Miller, linux-wireless, netdev, linux-kernel

On Tue, 2019-12-31 at 19:49 -0500, Stephen Oberholtzer wrote:
> Wireless networking goes down on Acer C720P Chromebook (bisected)
> 
> Culprit: 7a89233a ("mac80211: Use Airtime-based Queue Limits (AQL) on
> packet dequeue")
> 
> I found that the newest kernel (5.4) displayed a curious issue on my
> Acer C720P Chromebook: shortly after bringing networking up, all
> connections would suddenly fail.  I discovered that I could
> consistently reproduce the issue by ssh'ing into the machine and
> running 'dmesg' -- on a non-working kernel; I would get partial
> output, and then the connection would completely hang. This was so
> consistent, in fact, that I was able to leverage it to automate the
> process from 'git bisect run'.
> 
> KEYWORDS: c720p, chromebook, wireless, networking, mac80211
> 
> KERNEL: any kernel containing commit 7a89233a ("mac80211: Use
> Airtime-based Queue Limits (AQL) on packet dequeue")
> 
> I find this bit in the offending commit's message suspicious:
> 
> > This patch does *not* include any mechanism to wake a throttled TXQ again,
> > on the assumption that this will happen anyway as a side effect of whatever
> >  freed the skb (most commonly a TX completion).
> 
> Methinks this assumption is not a fully valid one.

I think I found at least one hole in this, but IIRC (it was before my
vacation, sorry) it was pretty unlikely to actually happen. Perhaps
there are more though.

https://lore.kernel.org/r/b14519e81b6d2335bd0cb7dcf074f0d1a4eec707.camel@sipsolutions.net


> I'll be happy to test any patches. If you need some printk calls, just
> tell me where to put 'em.

Do you get any output at all? Like a WARN_ON() for an underflow, or
something?

johannes


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected)
  2020-01-02 13:28 ` Johannes Berg
@ 2020-01-02 23:05   ` Stephen Oberholtzer
  2020-01-03  6:31     ` Johannes Berg
  2020-01-03  7:21     ` Justin Capella
  0 siblings, 2 replies; 7+ messages in thread
From: Stephen Oberholtzer @ 2020-01-02 23:05 UTC (permalink / raw)
  To: Johannes Berg; +Cc: toke, David S. Miller, linux-wireless, netdev, linux-kernel

On Thu, Jan 2, 2020 at 8:28 AM Johannes Berg <johannes@sipsolutions.net> wrote:
>
> On Tue, 2019-12-31 at 19:49 -0500, Stephen Oberholtzer wrote:
> > Wireless networking goes down on Acer C720P Chromebook (bisected)
> >
> > Culprit: 7a89233a ("mac80211: Use Airtime-based Queue Limits (AQL) on
> > packet dequeue")
> >

<snip>

> I think I found at least one hole in this, but IIRC (it was before my
> vacation, sorry) it was pretty unlikely to actually happen. Perhaps
> there are more though.
>
> https://lore.kernel.org/r/b14519e81b6d2335bd0cb7dcf074f0d1a4eec707.camel@sipsolutions.net

<snippety-snip>

> Do you get any output at all? Like a WARN_ON() for an underflow, or
> something?
>
> johannes
>

Johannes,

To answer your immediate question, no, I don't get any dmesg output at
all. Nothing about underruns.
However, while pursuing other avenues -- specifically, enabling
mac80211 debugfs and log messages -- I realized that my 'master' was
out-of-date from linux-stable and did a git pull.  Imagine my surprise
when the resulting kernel did not exhibit the problem!

Apparently, I had been a bit too pessimistic; since the problem
existed in 5.5-rc1 release, I'd assumed that the problem wouldn't get
rectified before 5.5.

However, I decided to bisect the fix, and ended up with: 911bde0f
("mac80211: Turn AQL into an NL80211_EXT_FEATURE"), which appears to
have "solved" the problem by just disabling the feature (this is
ath9k, by the way.)

This AQL stuff sounds pretty nifty, and I'd love to try my hand at
making it work for ath9k (also, since I put so much effort into an
automated build-and-test framework, it'd be a shame to just abandon
it.)  However, the ath9k code is rather lacking for comments, so I
don't even know where I should start, except for (I suspect) a call to
`wiphy_ext_feature_set(whatever, NL80211_EXT_FEATURE_AQL);` from
inside ath9k_set_hw_capab()?

In the meantime, I went back to e548f749b096 -- the commit prior to
the one making AQL support opt-in -- and cranked up the debugging.

I'm not sure how to interpret any of this, but  here's what I got:

dmesg output:

Last relevant mention is "moving STA <my AP's MAC> to state 4" which
happened during startup, before everything shut down.

/sys/kernel/debug/ieee80211/phy0

airtime_flags = 7

stations/<my AP's MAC>/airtime =

RX: 6583578 us
TX: 32719 us
Weight: 256
Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us
Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us
Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/12000

(I have no idea how to interpret this, but that '32719 us' seems odd,
I thought the airtime usage was in 4us units?)


Doing an 'echo 3 | tee airtime_flags' to clear the (old) AQL-enabled
bit seemed to *immediately* restore network connectivity.

I ran a ping, and saw this:

- pings coming back in <5ms
- re-enable AQL (echo 7 | tee airtime_flags)
- pings stop coming back immediately
- some seconds later, disable AQL again (echo 3 | tee airtime_flags)
- immediate *flood* of ping replies registered, with times 16000ms,
15000ms, 14000ms, .. down to 1000ms, 15ms, then stabilizing sub-5ms
- According to the icmp_seq values, all 28 requests were replied to,
and their replies were delivered in-order

This certainly looks like a missing TX queue restart to me?


-- 
-- Stevie-O
Real programmers use COPY CON PROGRAM.EXE

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected)
  2020-01-02 23:05   ` Stephen Oberholtzer
@ 2020-01-03  6:31     ` Johannes Berg
  2020-01-03  7:21     ` Justin Capella
  1 sibling, 0 replies; 7+ messages in thread
From: Johannes Berg @ 2020-01-03  6:31 UTC (permalink / raw)
  To: Stephen Oberholtzer
  Cc: toke, David S. Miller, linux-wireless, netdev, linux-kernel

Hi Stephen,


> To answer your immediate question, no, I don't get any dmesg output at
> all. Nothing about underruns.

OK.

> However, while pursuing other avenues -- specifically, enabling
> mac80211 debugfs and log messages -- I realized that my 'master' was
> out-of-date from linux-stable and did a git pull.  Imagine my surprise
> when the resulting kernel did not exhibit the problem!
> 
> Apparently, I had been a bit too pessimistic; since the problem
> existed in 5.5-rc1 release, I'd assumed that the problem wouldn't get
> rectified before 5.5.
> 
> However, I decided to bisect the fix, and ended up with: 911bde0f
> ("mac80211: Turn AQL into an NL80211_EXT_FEATURE"), which appears to
> have "solved" the problem by just disabling the feature (this is
> ath9k, by the way.)

Oh. I didn't pay attention and thought you actually had ath10k, not
ath9k!

> This AQL stuff sounds pretty nifty, and I'd love to try my hand at
> making it work for ath9k (also, since I put so much effort into an
> automated build-and-test framework, it'd be a shame to just abandon
> it.)

:-)

> However, the ath9k code is rather lacking for comments, so I
> don't even know where I should start, except for (I suspect) a call to
> `wiphy_ext_feature_set(whatever, NL80211_EXT_FEATURE_AQL);` from
> inside ath9k_set_hw_capab()?

Honestly, I don't know, you'd probably have to wait for Toke to be back
from vacations to get a pointer on what could be done here.

> In the meantime, I went back to e548f749b096 -- the commit prior to
> the one making AQL support opt-in -- and cranked up the debugging.
> 
> I'm not sure how to interpret any of this, but  here's what I got:
> 
> dmesg output:
> 
> Last relevant mention is "moving STA <my AP's MAC> to state 4" which
> happened during startup, before everything shut down.

That just means the STA was set to authorized (state 4) due to the 4-
way-handshake completing, nothing more.

> /sys/kernel/debug/ieee80211/phy0
> 
> airtime_flags = 7
> 
> stations/<my AP's MAC>/airtime =
> 
> RX: 6583578 us
> TX: 32719 us
> Weight: 256
> Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us
> Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us
> Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/12000
> 
> (I have no idea how to interpret this, but that '32719 us' seems odd,
> I thought the airtime usage was in 4us units?)

Me neither, off the top of my head, let's wait for Toke.

> Doing an 'echo 3 | tee airtime_flags' to clear the (old) AQL-enabled
> bit seemed to *immediately* restore network connectivity.

Hmm. That probably means it was blocked on AQL and not some kind of soft
"didn't know I had to transmit" type scenario that the comment in the
commit log you quoted would have implied (if it was actually wrong).

> I ran a ping, and saw this:
> 
> - pings coming back in <5ms
> - re-enable AQL (echo 7 | tee airtime_flags)
> - pings stop coming back immediately
> - some seconds later, disable AQL again (echo 3 | tee airtime_flags)
> - immediate *flood* of ping replies registered, with times 16000ms,
> 15000ms, 14000ms, .. down to 1000ms, 15ms, then stabilizing sub-5ms
> - According to the icmp_seq values, all 28 requests were replied to,
> and their replies were delivered in-order
> 
> This certainly looks like a missing TX queue restart to me?

I don't think it does. If it was just a missing TX queue restart then
changing the AQL bit shouldn't have had any effect, since changing the
flags via debugfs doesn't trigger a TX queue restart.

Rather, it seems to me that this implies some accounting went wrong, and
doing this clears/recovers that?

But I guess Toke will have a much better idea from the debug data above.

johannes


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected)
  2020-01-02 23:05   ` Stephen Oberholtzer
  2020-01-03  6:31     ` Johannes Berg
@ 2020-01-03  7:21     ` Justin Capella
  2020-01-04  2:34       ` Kan Yan
  1 sibling, 1 reply; 7+ messages in thread
From: Justin Capella @ 2020-01-03  7:21 UTC (permalink / raw)
  To: Stephen Oberholtzer
  Cc: Johannes Berg, Toke Høiland-Jørgensen, David S. Miller,
	linux-wireless, netdev, LKML

The rather large negative deficit stands out to me. See this patch,
https://patchwork.kernel.org/patch/11246363/ specifically the comments
by Kan Yan

On Thu, Jan 2, 2020, 3:14 PM Stephen Oberholtzer <stevie@qrpff.net> wrote:
>
>
> /sys/kernel/debug/ieee80211/phy0
>
> airtime_flags = 7
>
> stations/<my AP's MAC>/airtime =
>
> RX: 6583578 us
> TX: 32719 us
> Weight: 256
> Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us
> Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us
> Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/1200

On Thu, Jan 2, 2020 at 3:14 PM Stephen Oberholtzer <stevie@qrpff.net> wrote:
>
> On Thu, Jan 2, 2020 at 8:28 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> >
> > On Tue, 2019-12-31 at 19:49 -0500, Stephen Oberholtzer wrote:
> > > Wireless networking goes down on Acer C720P Chromebook (bisected)
> > >
> > > Culprit: 7a89233a ("mac80211: Use Airtime-based Queue Limits (AQL) on
> > > packet dequeue")
> > >
>
> <snip>
>
> > I think I found at least one hole in this, but IIRC (it was before my
> > vacation, sorry) it was pretty unlikely to actually happen. Perhaps
> > there are more though.
> >
> > https://lore.kernel.org/r/b14519e81b6d2335bd0cb7dcf074f0d1a4eec707.camel@sipsolutions.net
>
> <snippety-snip>
>
> > Do you get any output at all? Like a WARN_ON() for an underflow, or
> > something?
> >
> > johannes
> >
>
> Johannes,
>
> To answer your immediate question, no, I don't get any dmesg output at
> all. Nothing about underruns.
> However, while pursuing other avenues -- specifically, enabling
> mac80211 debugfs and log messages -- I realized that my 'master' was
> out-of-date from linux-stable and did a git pull.  Imagine my surprise
> when the resulting kernel did not exhibit the problem!
>
> Apparently, I had been a bit too pessimistic; since the problem
> existed in 5.5-rc1 release, I'd assumed that the problem wouldn't get
> rectified before 5.5.
>
> However, I decided to bisect the fix, and ended up with: 911bde0f
> ("mac80211: Turn AQL into an NL80211_EXT_FEATURE"), which appears to
> have "solved" the problem by just disabling the feature (this is
> ath9k, by the way.)
>
> This AQL stuff sounds pretty nifty, and I'd love to try my hand at
> making it work for ath9k (also, since I put so much effort into an
> automated build-and-test framework, it'd be a shame to just abandon
> it.)  However, the ath9k code is rather lacking for comments, so I
> don't even know where I should start, except for (I suspect) a call to
> `wiphy_ext_feature_set(whatever, NL80211_EXT_FEATURE_AQL);` from
> inside ath9k_set_hw_capab()?
>
> In the meantime, I went back to e548f749b096 -- the commit prior to
> the one making AQL support opt-in -- and cranked up the debugging.
>
> I'm not sure how to interpret any of this, but  here's what I got:
>
> dmesg output:
>
> Last relevant mention is "moving STA <my AP's MAC> to state 4" which
> happened during startup, before everything shut down.
>
> /sys/kernel/debug/ieee80211/phy0
>
> airtime_flags = 7
>
> stations/<my AP's MAC>/airtime =
>
> RX: 6583578 us
> TX: 32719 us
> Weight: 256
> Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us
> Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us
> Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/12000
>
> (I have no idea how to interpret this, but that '32719 us' seems odd,
> I thought the airtime usage was in 4us units?)
>
>
> Doing an 'echo 3 | tee airtime_flags' to clear the (old) AQL-enabled
> bit seemed to *immediately* restore network connectivity.
>
> I ran a ping, and saw this:
>
> - pings coming back in <5ms
> - re-enable AQL (echo 7 | tee airtime_flags)
> - pings stop coming back immediately
> - some seconds later, disable AQL again (echo 3 | tee airtime_flags)
> - immediate *flood* of ping replies registered, with times 16000ms,
> 15000ms, 14000ms, .. down to 1000ms, 15ms, then stabilizing sub-5ms
> - According to the icmp_seq values, all 28 requests were replied to,
> and their replies were delivered in-order
>
> This certainly looks like a missing TX queue restart to me?
>
>
> --
> -- Stevie-O
> Real programmers use COPY CON PROGRAM.EXE

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected)
  2020-01-03  7:21     ` Justin Capella
@ 2020-01-04  2:34       ` Kan Yan
  2020-01-06 13:34         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 7+ messages in thread
From: Kan Yan @ 2020-01-04  2:34 UTC (permalink / raw)
  To: Justin Capella
  Cc: Stephen Oberholtzer, Johannes Berg,
	Toke Høiland-Jørgensen, David S. Miller,
	linux-wireless, netdev, LKML

> This AQL stuff sounds pretty nifty, and I'd love to try my hand at
> making it work for ath9k (also, since I put so much effort into an
> automated build-and-test framework, it'd be a shame to just abandon
> it.)  However, the ath9k code is rather lacking for comments, so I
> don't even know where I should start, except for (I suspect) a call to
> `wiphy_ext_feature_set(whatever, NL80211_EXT_FEATURE_AQL);` from
> inside ath9k_set_hw_capab()?
> In the meantime, I went back to e548f749b096 -- the commit prior to
> the one making AQL support opt-in -- and cranked up the debugging.

AQL is designed for wireless chipset that uses firmware/hardware
offloading, to manage the firmware/hardware queue size. For ath9k, the
TX queues are controlled by the host driver and chipsets that use
ath9k have a much smaller hardware queue compared to ath10k, so AQL is
probably not needed for ath9k. The airtime based TX scheduler alone
should be sufficient.

> > /sys/kernel/debug/ieee80211/phy0
> >
> > airtime_flags = 7
> >
> > stations/<my AP's MAC>/airtime =
> >
> > RX: 6583578 us
> > TX: 32719 us
> > Weight: 256
> > Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us
> > Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us
> > Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/12000
> >
> > (I have no idea how to interpret this, but that '32719 us' seems odd,
> > I thought the airtime usage was in 4us units?)
> Me neither, off the top of my head, let's wait for Toke.

"TX: 32719 us" is the airtime reported by firmware, which is not in 4us units.
There are two airtime: the "consumed" airtime reported by firmware,
which is used by the airtimed based TX scheduler to enforce fairness,
and the "estimated" airtime used by AQL to control the queue length
for frames pending in the firmware/hardware queue, which in 4us unit.

> I ran a ping, and saw this:
>
> - pings coming back in <5ms
> - re-enable AQL (echo 7 | tee airtime_flags)
> - pings stop coming back immediately
> - some seconds later, disable AQL again (echo 3 | tee airtime_flags)
> - immediate *flood* of ping replies registered, with times 16000ms,
> 15000ms, 14000ms, .. down to 1000ms, 15ms, then stabilizing sub-5ms
> - According to the icmp_seq values, all 28 requests were replied to,
> and their replies were delivered in-order
>
> This certainly looks like a missing TX queue restart to me?
I don't think TX queue restart is "missing", the TX queue should get
restarted when the pending frames is completed and returned to the
host driver. However, It looks like there is some issue with the
deficit refill logic in ath9k, and the TX queue got blocked due to the
negative deficit.


On Thu, Jan 2, 2020 at 11:22 PM Justin Capella <justincapella@gmail.com> wrote:
>
> The rather large negative deficit stands out to me. See this patch,
> https://patchwork.kernel.org/patch/11246363/ specifically the comments
> by Kan Yan
>
> On Thu, Jan 2, 2020, 3:14 PM Stephen Oberholtzer <stevie@qrpff.net> wrote:
> >
> >
> > /sys/kernel/debug/ieee80211/phy0
> >
> > airtime_flags = 7
> >
> > stations/<my AP's MAC>/airtime =
> >
> > RX: 6583578 us
> > TX: 32719 us
> > Weight: 256
> > Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us
> > Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us
> > Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/1200
>
> On Thu, Jan 2, 2020 at 3:14 PM Stephen Oberholtzer <stevie@qrpff.net> wrote:
> >
> > On Thu, Jan 2, 2020 at 8:28 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> > >
> > > On Tue, 2019-12-31 at 19:49 -0500, Stephen Oberholtzer wrote:
> > > > Wireless networking goes down on Acer C720P Chromebook (bisected)
> > > >
> > > > Culprit: 7a89233a ("mac80211: Use Airtime-based Queue Limits (AQL) on
> > > > packet dequeue")
> > > >
> >
> > <snip>
> >
> > > I think I found at least one hole in this, but IIRC (it was before my
> > > vacation, sorry) it was pretty unlikely to actually happen. Perhaps
> > > there are more though.
> > >
> > > https://lore.kernel.org/r/b14519e81b6d2335bd0cb7dcf074f0d1a4eec707.camel@sipsolutions.net
> >
> > <snippety-snip>
> >
> > > Do you get any output at all? Like a WARN_ON() for an underflow, or
> > > something?
> > >
> > > johannes
> > >
> >
> > Johannes,
> >
> > To answer your immediate question, no, I don't get any dmesg output at
> > all. Nothing about underruns.
> > However, while pursuing other avenues -- specifically, enabling
> > mac80211 debugfs and log messages -- I realized that my 'master' was
> > out-of-date from linux-stable and did a git pull.  Imagine my surprise
> > when the resulting kernel did not exhibit the problem!
> >
> > Apparently, I had been a bit too pessimistic; since the problem
> > existed in 5.5-rc1 release, I'd assumed that the problem wouldn't get
> > rectified before 5.5.
> >
> > However, I decided to bisect the fix, and ended up with: 911bde0f
> > ("mac80211: Turn AQL into an NL80211_EXT_FEATURE"), which appears to
> > have "solved" the problem by just disabling the feature (this is
> > ath9k, by the way.)
> >
> > This AQL stuff sounds pretty nifty, and I'd love to try my hand at
> > making it work for ath9k (also, since I put so much effort into an
> > automated build-and-test framework, it'd be a shame to just abandon
> > it.)  However, the ath9k code is rather lacking for comments, so I
> > don't even know where I should start, except for (I suspect) a call to
> > `wiphy_ext_feature_set(whatever, NL80211_EXT_FEATURE_AQL);` from
> > inside ath9k_set_hw_capab()?
> >
> > In the meantime, I went back to e548f749b096 -- the commit prior to
> > the one making AQL support opt-in -- and cranked up the debugging.
> >
> > I'm not sure how to interpret any of this, but  here's what I got:
> >
> > dmesg output:
> >
> > Last relevant mention is "moving STA <my AP's MAC> to state 4" which
> > happened during startup, before everything shut down.
> >
> > /sys/kernel/debug/ieee80211/phy0
> >
> > airtime_flags = 7
> >
> > stations/<my AP's MAC>/airtime =
> >
> > RX: 6583578 us
> > TX: 32719 us
> > Weight: 256
> > Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us
> > Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us
> > Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/12000
> >
> > (I have no idea how to interpret this, but that '32719 us' seems odd,
> > I thought the airtime usage was in 4us units?)
> >
> >
> > Doing an 'echo 3 | tee airtime_flags' to clear the (old) AQL-enabled
> > bit seemed to *immediately* restore network connectivity.
> >
> > I ran a ping, and saw this:
> >
> > - pings coming back in <5ms
> > - re-enable AQL (echo 7 | tee airtime_flags)
> > - pings stop coming back immediately
> > - some seconds later, disable AQL again (echo 3 | tee airtime_flags)
> > - immediate *flood* of ping replies registered, with times 16000ms,
> > 15000ms, 14000ms, .. down to 1000ms, 15ms, then stabilizing sub-5ms
> > - According to the icmp_seq values, all 28 requests were replied to,
> > and their replies were delivered in-order
> >
> > This certainly looks like a missing TX queue restart to me?
> >
> >
> > --
> > -- Stevie-O
> > Real programmers use COPY CON PROGRAM.EXE

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected)
  2020-01-04  2:34       ` Kan Yan
@ 2020-01-06 13:34         ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 7+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-01-06 13:34 UTC (permalink / raw)
  To: Kan Yan, Justin Capella
  Cc: Stephen Oberholtzer, Johannes Berg, David S. Miller,
	linux-wireless, netdev, LKML

Kan Yan <kyan@google.com> writes:

>> I ran a ping, and saw this:
>>
>> - pings coming back in <5ms
>> - re-enable AQL (echo 7 | tee airtime_flags)
>> - pings stop coming back immediately
>> - some seconds later, disable AQL again (echo 3 | tee airtime_flags)
>> - immediate *flood* of ping replies registered, with times 16000ms,
>> 15000ms, 14000ms, .. down to 1000ms, 15ms, then stabilizing sub-5ms
>> - According to the icmp_seq values, all 28 requests were replied to,
>> and their replies were delivered in-order
>>
>> This certainly looks like a missing TX queue restart to me?
> I don't think TX queue restart is "missing", the TX queue should get
> restarted when the pending frames is completed and returned to the
> host driver. However, It looks like there is some issue with the
> deficit refill logic in ath9k, and the TX queue got blocked due to the
> negative deficit.

s/deficit refill/packet freeing/. I.e., there's an issue with the ath9k
driver either stomping on the tx_time_est field in the cb, or it's not
reporting back all freed TX skbs properly, so the AQL Q depth doesn't go
back down.

The large negative deficit is just because the queue is being blocked by
AQL, so it won't get its deficit refilled (and it keeps decreasing as RX
packets are being accounted).

All this being said, given the fact that ath9k definitely doesn't need
AQL, I think it's probably not worth it to try to find out exactly what
is causing this, and instead just leave AQL off for that driver?

(As an aside, this definitely disproves my initial "AQL should be benign
for drivers that don't need it" hypothesis. Guess that was way too
optimistic anyway ;))

-Toke


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-01-06 13:34 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-01  0:49 PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected) Stephen Oberholtzer
2020-01-02 13:28 ` Johannes Berg
2020-01-02 23:05   ` Stephen Oberholtzer
2020-01-03  6:31     ` Johannes Berg
2020-01-03  7:21     ` Justin Capella
2020-01-04  2:34       ` Kan Yan
2020-01-06 13:34         ` Toke Høiland-Jørgensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).