regressions.lists.linux.dev archive mirror
* [Regression] rt2800usb - Wifi performance issues and connection drops
@ 2023-03-04 16:24 Linux regression tracking (Thorsten Leemhuis)
  2023-03-05 17:25 ` Thorsten Leemhuis
  0 siblings, 1 reply; 20+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-03-04 16:24 UTC (permalink / raw)
  To: Stanislaw Gruszka, Helmut Schaa
  Cc: linux-wireless, LKML, Linux kernel regressions list, Thomas Mann

Hi, this is your Linux kernel regression tracker.

I noticed a regression report in bugzilla.kernel.org. As many (most?)
kernel developers don't keep an eye on it, I decided to forward it by
mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=217119 :

>  Thomas Mann 2023-03-03 15:12:03 UTC
> 
> After the update of linux to 6.2.x, i get connection drops and bandwidth problems.
> 
> 6.2.1 was completely unusable and 6.2.2 still has bandwidth problems but works a bit better
> 
> The device in use is:
> 
> 13d3:3273 IMC Networks 802.11 n/g/b Wireless LAN USB Mini-Card
> 
> Downgrading the kernel to 6.1.[14,15] fixes the problem and the wifi gets stable again and the available bandwidth increases.
> 
> dmesg shows no errors
> 
> Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-03-04 05:45:33 UTC
> 
> Please attach dmesg [without it most people won't even know which driver is in use for your card]
> 
> Comment 2 Thomas Mann 2023-03-04 12:36:45 UTC
> 
> driver in use is rt2800usb
> 
> Comment 3 Thomas Mann 2023-03-04 12:38:01 UTC
> 
> Created attachment 303840
> dmesg output
> 


See the ticket for more details.


[TLDR for the rest of this mail: I'm adding this report to the list of
tracked Linux kernel regressions; the text you find below is based on a
few template paragraphs you might have encountered already in similar
form.]

BTW, let me use this mail to also add the report to the list of tracked
regressions to ensure it doesn't fall through the cracks:

#regzbot introduced: v6.1..v6.2
https://bugzilla.kernel.org/show_bug.cgi?id=217119
#regzbot title: net: wireless: rt2800usb: wifi performance issues and
connection drops
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (e.g. the bugzilla ticket and maybe this mail as well, if
this thread sees some discussion). See page linked in footer for details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.


* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-04 16:24 [Regression] rt2800usb - Wifi performance issues and connection drops Linux regression tracking (Thorsten Leemhuis)
@ 2023-03-05 17:25 ` Thorsten Leemhuis
  2023-03-05 22:05   ` Alexander Wetzel
  0 siblings, 1 reply; 20+ messages in thread
From: Thorsten Leemhuis @ 2023-03-05 17:25 UTC (permalink / raw)
  To: Alexander Wetzel
  Cc: linux-wireless, LKML, Linux kernel regressions list, Thomas Mann,
	Stanislaw Gruszka, Helmut Schaa, Johannes Berg

On 04.03.23 17:24, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi, this is your Linux kernel regression tracker.
> 
> I noticed a regression report in bugzilla.kernel.org. As many (most?)
> kernel developer don't keep an eye on it, I decided to forward it by
> mail. Quoting from https://bugzilla.kernel.org/show_bug.cgi?id=217119 :
> 
>>  Thomas Mann 2023-03-03 15:12:03 UTC
>>
>> After the update of linux to 6.2.x, i get connection drops and bandwidth problems.
>>
>> 6.2.1 was completely unusable and 6.2.2 still has bandwidth problems but works a bit better
>>
>> The device in use is:
>>
>> 13d3:3273 IMC Networks 802.11 n/g/b Wireless LAN USB Mini-Card
>>
>> Downgrading the kernel to 6.1.[14,15] fixes the problem and the wifi gets stable again and the available bandwidth increases.

Quick update from bugzilla:

```
--- Comment #4 from Thomas Mann (rauchwolke@gmx.net) ---
i bisected and found the commit that introduced the regression:

# first bad commit: [4444bc2116aecdcde87dce80373540adc8bd478b] wifi:
mac80211: Proper mark iTXQs for resumption
```

That's a commit from Alexander, applied by Johannes.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

> #regzbot introduced: v6.1..v6.2
> https://bugzilla.kernel.org/show_bug.cgi?id=217119

P.S.:

#regzbot introduced: 4444bc2116aecdcde8
#regzbot ignore-activity



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-05 17:25 ` Thorsten Leemhuis
@ 2023-03-05 22:05   ` Alexander Wetzel
  2023-03-07 20:54     ` Alexander Wetzel
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Wetzel @ 2023-03-05 22:05 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

> Quick update from bugzilla:
> 
> ```
> --- Comment #4 from Thomas Mann (rauchwolke@gmx.net) ---
> i bisected and found the commit that introduced the regression:
> 
> # first bad commit: [4444bc2116aecdcde87dce80373540adc8bd478b] wifi:
> mac80211: Proper mark iTXQs for resumption
> ```
> 
> That's a commit from Alexander, applied by Johannes.
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 

I just uploaded a test patch to bugzilla.
Please have a look if that fixes the issue.

If not I would be interested in the output of your iTXQ status.
Enable CONFIG_MAC80211_DEBUGFS and run this command when the connection 
is bad and send/share/upload to bugzilla the resulting debug.out:

k=1; while [ $k -lt 10 ]; do \
cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
k=$(($k+1)); done >> debug.out
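
A minimal sketch of a variant of the loop above, in case timestamps and a
pause between samples are wanted. The function name sample_aqm, its three
arguments and the "###" marker lines are made up for illustration; only the
debugfs path pattern is taken from the command above. It assumes debugfs is
mounted at /sys/kernel/debug and readable (usually as root):

```sh
# sample_aqm [COUNT] [INTERVAL] [ROOT]   (hypothetical helper, not a kernel tool)
# Dumps every station's aqm file COUNT times, INTERVAL seconds apart, and
# prefixes each dump with a "### <epoch-seconds> <path>" marker line.
# The body runs in a subshell "( ... )" so its variables stay local.
sample_aqm() (
    count=${1:-9}
    interval=${2:-1}
    root=${3:-/sys/kernel/debug/ieee80211}
    i=0
    while [ "$i" -lt "$count" ]; do
        for f in "$root"/phy*/netdev:*/stations/*/aqm; do
            # Skip the unexpanded glob when no station is connected.
            [ -r "$f" ] || continue
            printf '### %s %s\n' "$(date +%s)" "$f"
            cat "$f"
        done
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
)
```

Usage would be something like "sample_aqm 9 1 >> debug.out" while the
connection is bad.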

Alexander


* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-05 22:05   ` Alexander Wetzel
@ 2023-03-07 20:54     ` Alexander Wetzel
  2023-03-07 22:31       ` Thomas Mann
                         ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Alexander Wetzel @ 2023-03-07 20:54 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

>>
> 
> I just uploaded a test patch to bugzilla.
> Please have a look if that fixes the issue.
> 
> If not I would be interested in the output of your iTXQ status.
> Enable CONFIG_MAC80211_DEBUGFS and run this command when the connection 
> is bad and send/share/upload to bugzilla the resulting debug.out:
> 
> k=1; while [ $k -lt 10 ]; do \
> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
> k=$(($k+1)); done >> debug.out

Thomas and I continued with some debugging in
https://bugzilla.kernel.org/show_bug.cgi?id=217119

But the results so far are unexpected, and we decided to continue the 
debugging with the wider round here, hoping someone sees something I miss.

A very short summary of where we are:
I can't reproduce the bug with a very similar card and kernel config so 
far. Thomas's card stops the iTXQs for intervals >30s. Mine operates 
normally.

A more useful but longer summary:

Thomas updated to a 6.2 kernel and reported "connection drops and 
bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.) Asked for 
some more details he reported:
"...slow bandwidth stuff works better, but the main problem/test case is 
to start a 8-16 mbit video stream, which sometimes runs for a few 
seconds and then stops or it doesn't start at all"

He bisected the issue and identified my commit 4444bc2116ae ("wifi: 
mac80211: Proper mark iTXQs for resumption") as the culprit.

Checking the internal iTXQ status while the issue is ongoing shows that 
TID zero is flagged as dirty and thus is not transmitting queued 
packets. Interesting line from 
/sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit 
collisions tx-bytes tx-packets flags
0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)

--> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets and is 
flagged as DIRTY. There is even a potential race when setting the DIRTY 
flag, but the fix for that does not help.

Thus Thomas applied two debug patches, to better understand why the 
DIRTY flag is not cleared.

Looking at the output from those, we see that the driver stops Tx by 
calling ieee80211_stop_queue(). When ieee80211_wake_queue() is called, 
mac80211 correctly resumes TX but gets stopped by the driver again after 
a single packet. (The start of the relevant log is missing, so there may 
initially be more.)
I assume TX is still ok at that stage. But after some single Tx 
operations the driver stops the queues again. Here is the relevant part 
of the log:
[  179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
[  179.585022] XXXX drv_tx: TX
[  179.585027] XXXX ieee80211_stop_queue: called
[  179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
[  179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
[  179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
[  179.585034] XXXX __ieee80211_wake_txqs: EXIT
[  179.585035] XXXX __ieee80211_wake_txqs: ENTRY
[  179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
[  179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
[  179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
[  179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
[  179.585041] XXXX __ieee80211_wake_txqs: EXIT
[  179.585047] XXXX drv_tx: TX
[  179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
....
[  214.307617] XXXX ieee80211_wake_queue: called


--> So the driver blocked TX for more than 30s. Which is a good 
explanation of what Thomas observes.
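
The length of such a stall can be read off the log mechanically. A rough
sketch, assuming the "ieee80211_stop_queue: called" and
"ieee80211_wake_queue: called" messages exactly as printed by the debug
patches above (they are not stock kernel output) and the usual dmesg
"[seconds.micros]" timestamp prefix. The function name stall_durations
is made up:

```sh
# stall_durations [FILE...]   (hypothetical helper)
# Pairs each "stop_queue" message with the next "wake_queue" message and
# prints "<stop-time> <wake-time> <seconds-stopped>" per pair.
stall_durations() {
    awk '
        # Extract the dmesg timestamp, e.g. "[  179.585027]", as a number.
        function ts(line,    s) {
            if (!match(line, /\[ *[0-9]+\.[0-9]+\]/)) { return -1 }
            s = substr(line, RSTART + 1, RLENGTH - 2)
            return s + 0
        }
        BEGIN { stop = -1 }
        # Remember only the first stop until the queue is woken again.
        /ieee80211_stop_queue: called/ && stop < 0 { stop = ts($0) }
        /ieee80211_wake_queue: called/ && stop >= 0 {
            wake = ts($0)
            printf "%.6f %.6f %.3f\n", stop, wake, wake - stop
            stop = -1
        }
    ' "$@"
}
```

Fed with the excerpt above, and provided the elided part contains no
further stop/wake pair, this reports a stall of about 34.7 seconds.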

But there is nothing mac80211 can do differently here. Whatever the 
real reason for the issue is, it's nothing obvious that I can see.

Luckily I found a nearly identical card that uses the same driver.
Thomas's system:
Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo 
Hardened 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 
p5) 2.39.0) #2 SMP Fri Mar  3 16:59:02 CET 2023
ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'

My system, using the kernel config from Thomas with only minor 
modifications (different filesystems and initramfs settings and enabled 
mac80211 debug and developer options):
Linux version 6.2.2-gentoo (root@Perry.mordor) (gcc (Gentoo 
12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2) 
2.40.0) #2 SMP Tue Mar  7 18:18:47 CET 2023
ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware file 'rt2870.bin'
ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware detected - version: 0.36

But there is one big difference on my system: I can't reproduce the bug 
so far. It's working as it should... (I did not apply the debug patches 
myself so far)

I'm now planning to look a bit more into the rt2800usb driver and 
provide another debug patch for interesting looking code pieces in it.

@Thomas:
I've also uploaded the binary kernel I'm running at the moment here:
https://www.awhome.eu/s/5FjqMS73rtCtSBM

That kernel should also be able to boot and operate your system. Can you 
try that and tell me, if that makes any difference?

I'm also planning to provide some more debug patches, to figure out 
which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs 
for resumption") fixes the issue for you. Assuming my understanding 
above is correct, the patch should not really fix/break anything for 
you... With the findings above I would have expected your git bisect to 
identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue 
callback to drivers") as the first broken commit...

Alexander


* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-07 20:54     ` Alexander Wetzel
@ 2023-03-07 22:31       ` Thomas Mann
  2023-03-08  7:13         ` Alexander Wetzel
  2023-03-08  7:52       ` Felix Fietkau
  2023-03-09 17:00       ` Alexander Wetzel
  2 siblings, 1 reply; 20+ messages in thread
From: Thomas Mann @ 2023-03-07 22:31 UTC (permalink / raw)
  To: Alexander Wetzel
  Cc: Linux regressions mailing list, linux-wireless, LKML,
	Stanislaw Gruszka, Helmut Schaa, Johannes Berg

Hi Alexander,

I can't boot the binary kernel here, as the initramfs is included in my
kernel. If you send me a patch, I can apply it and test it.

Thomas



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-07 22:31       ` Thomas Mann
@ 2023-03-08  7:13         ` Alexander Wetzel
  2023-03-08 10:26           ` Thomas Mann
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Wetzel @ 2023-03-08  7:13 UTC (permalink / raw)
  To: Thomas Mann
  Cc: Linux regressions mailing list, linux-wireless, LKML,
	Stanislaw Gruszka, Helmut Schaa, Johannes Berg

On 07.03.23 23:31, Thomas Mann wrote:
> Hi Alexander,

Since I suspect we'll exchange quite a few mails here:
Top posting is frowned upon on the mailing lists on copy.
Details here: https://www.infradead.org/~dwmw2/email.html

I've moved your post to the correct position and replied there.

>> @Thomas:
>> I've also uploaded you my binary kernel I'm running at the moment
>> here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
>>
>> That kernel should also be able to boot and operate your system. Can
>> you try that and tell me, if that makes any difference?

>
> i can't boot the binary kernel here, as the initramfs is included in
> my kernel, if you send me a patch, i can apply it and test it.

That was an unpatched kernel. The idea was to verify that it's not a 
compiler issue. (You seem to be using a hardened Gentoo profile.)

Can you share your initrd, so I can include it? (Mail it to me directly, 
upload it to the bug in bugzilla or send a link to some cloud storage.)



>>
>> I'm also planning to provide some more debug patches, to figuring out
>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>> for resumption") fixes the issue for you. Assuming my understanding
>> above is correct the patch should not really fix/break anything for
>> you...With the findings above I would have expected your git bisec to
>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>> callback to drivers") as the first broken commit...
>>
>> Alexander
> 



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-07 20:54     ` Alexander Wetzel
  2023-03-07 22:31       ` Thomas Mann
@ 2023-03-08  7:52       ` Felix Fietkau
  2023-03-08 11:41         ` Alexander Wetzel
  2023-03-09 17:00       ` Alexander Wetzel
  2 siblings, 1 reply; 20+ messages in thread
From: Felix Fietkau @ 2023-03-08  7:52 UTC (permalink / raw)
  To: Alexander Wetzel, Linux regressions mailing list
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

On 07.03.23 21:54, Alexander Wetzel wrote:
>>>
>> 
>> I just uploaded a test patch to bugzilla.
>> Please have a look if that fixes the issue.
>> 
>> If not I would be interested in the output of your iTXQ status.
>> Enable CONFIG_MAC80211_DEBUGFS and run this command when the connection 
>> is bad and send/share/upload to bugzilla the resulting debug.out:
>> 
>> k=1; while [ $k -lt 10 ]; do \
>> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
>> k=$(($k+1)); done >> debug.out
> 
> Thomas and I continued with some debugging in
> https://bugzilla.kernel.org/show_bug.cgi?id=217119
> 
> But the results so far are unexpected and we decided to continue the
> debugging with the wider round here, hoping someone sees something I miss.
> 
> A very short summary of where we are:
> I can't reproduce the bug with a very similar card and kernel config so
> far. Thomas' card stops the iTXQs for intervals >30s. Mine operates
> normally.
> 
> A more useful but longer summary:
> 
> Thomas updated to a 6.2 kernel and reported "connection drops and
> bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.) Asked for
> some more details he reported:
> "...slow bandwidth stuff works better, but the main problem/test case is
> to start a 8-16 mbit video stream, which sometimes runs for a few
> seconds and then stops or it doesn't start at all"
> 
> He bisected the issue and identified my commit 4444bc2116ae ("wifi:
> mac80211: Proper mark iTXQs for resumption") as culprit.
> 
> Checking the internal iTXQ status when the issue is ongoing shows, that
> TID zero is flagged as dirty and thus is not transmitting queued
> packets. Interesting line from
> /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> tid ac backlog-bytes backlog-packets new-flows drops marks overlimit
> collisions tx-bytes tx-packets flags
> 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)
> 
> --> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets and is
> flagged as DIRTY. There even is a potential race setting the DIRTY flag,
> but the fix for that is not helping.
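
For anyone who wants to check their own debug.out for the same pattern, here is a minimal Python sketch. It assumes the column order given in the header line quoted above; the names parse_aqm_line and is_stuck are made up for illustration and are not part of any kernel tooling, and the "stuck" test is only a heuristic based on the description above (queued packets while the TID is flagged DIRTY).

```python
# Field names copied from the header line of the aqm debugfs file quoted above.
AQM_FIELDS = (
    "tid", "ac", "backlog-bytes", "backlog-packets", "new-flows", "drops",
    "marks", "overlimit", "collisions", "tx-bytes", "tx-packets",
)


def parse_aqm_line(line):
    """Split one aqm data line into integer counters and decoded flag names."""
    # Split off the 11 numeric columns; the remainder is "0xe(RUN AMPDU ...)".
    parts = line.strip().split(None, len(AQM_FIELDS))
    row = {name: int(value) for name, value in zip(AQM_FIELDS, parts)}
    raw, _, names = parts[-1].partition("(")
    row["flags-raw"] = int(raw, 16)
    row["flags"] = names.rstrip(")").split()
    return row


def is_stuck(row):
    """Hypothetical heuristic: packets are queued while the TID is DIRTY."""
    return row["backlog-packets"] > 0 and "DIRTY" in row["flags"]


sample = "0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)"
row = parse_aqm_line(sample)
print(row["tid"], row["backlog-packets"], row["flags"], is_stuck(row))
```

Running it over every data line of debug.out (skipping the header lines) shows quickly whether a TID keeps a backlog while DIRTY across several samples.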
> 
> Thus Thomas applied two debug patches, to better understand why the
> DIRTY flag is not cleared.
> 
> And looking at the output from those we see that the driver stops Tx by
> calling ieee80211_stop_queue(). When ieee80211_wake_queue() is called,
> mac80211 correctly resumes TX but is stopped by the driver again after a
> single packet. (The start of the relevant log is missing, so it may
> initially have been more.)
> I assume TX is still ok at that stage. But after some single Tx
> operations the driver stops the queues again. Here is the relevant part
> of the log:
> [  179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> [  179.585022] XXXX drv_tx: TX
> [  179.585027] XXXX ieee80211_stop_queue: called
> [  179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> [  179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> [  179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> [  179.585034] XXXX __ieee80211_wake_txqs: EXIT
> [  179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> [  179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> [  179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> [  179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> [  179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> [  179.585041] XXXX __ieee80211_wake_txqs: EXIT
> [  179.585047] XXXX drv_tx: TX
> [  179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> ....
> [  214.307617] XXXX ieee80211_wake_queue: called
> 
> 
> --> So the driver blocked TX for more than 30s. Which is a good
> explanation of what Thomas observes.
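
The length of such a stall can be read off the log mechanically. The sketch below is Python and assumes the "XXXX <function>: <message>" line format produced by the debug patches in this thread (those prints are not in a mainline kernel); stall_durations is an illustrative name. It pairs each ieee80211_stop_queue line with the next ieee80211_wake_queue line and reports the time in between.

```python
import re

# Matches the debug-patch lines quoted above, with or without a "> " prefix:
# "[  179.585027] XXXX ieee80211_stop_queue: called"
LOG_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+XXXX\s+(\w+):")


def stall_durations(lines):
    """Return the seconds between each queue stop and the following wake."""
    stalls, stopped_at = [], None
    for line in lines:
        match = LOG_RE.search(line)
        if not match:
            continue
        timestamp, func = float(match.group(1)), match.group(2)
        if func == "ieee80211_stop_queue" and stopped_at is None:
            stopped_at = timestamp
        elif func == "ieee80211_wake_queue" and stopped_at is not None:
            stalls.append(round(timestamp - stopped_at, 6))
            stopped_at = None
    return stalls


log = """\
[  179.585022] XXXX drv_tx: TX
[  179.585027] XXXX ieee80211_stop_queue: called
[  179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
[  214.307617] XXXX ieee80211_wake_queue: called
""".splitlines()

print(stall_durations(log))
```

For the excerpt above this gives a single stall of roughly 34.7 seconds, matching the ">30s" figure.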
> 
> But there is nothing mac80211 can do differently here. Whatever is the
> real reason for the issue, it's nothing obvious I see.
> 
> Luckily I found a nearly identical card using the same driver.
> Thomas' system:
> Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo Hardened
> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5)
> 2.39.0) #2 SMP Fri Mar  3 16:59:02 CET 2023
> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> 
> My system, using the kernel config from Thomas with only minor
> modifications (different filesystems and initramfs settings and enabled
> mac80211 debug and developer options):
> Linux version 6.2.2-gentoo (root@Perry.mordor) (gcc (Gentoo
> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2)
> 2.40.0) #2 SMP Tue Mar  7 18:18:47 CET 2023
> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware file
> 'rt2870.bin'
> ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware detected -
> version: 0.36
> 
> But there is one big difference on my system: I can't reproduce the bug
> so far. It's working as it should... (I did not apply the debug patches
> myself so far)
> 
> I'm now planning to look a bit more into the rt2800usb driver and
> provide another debug patch for interesting looking code pieces in it.
> 
> @Thomas:
> I've also uploaded you my binary kernel I'm running at the moment here:
> https://www.awhome.eu/s/5FjqMS73rtCtSBM
> 
> That kernel should also be able to boot and operate your system. Can you
> try that and tell me, if that makes any difference?
> 
> I'm also planning to provide some more debug patches, to figure out
> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
> for resumption") fixes the issue for you. Assuming my understanding
> above is correct the patch should not really fix/break anything for
> you... With the findings above I would have expected your git bisect to
> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
> callback to drivers") as the first broken commit...
I can't point to any specific series of events where it would go wrong, 
but I suspect that the problem might be the fact that you're doing tx 
scheduling from within ieee80211_handle_wake_tx_queue. I don't see how 
it's properly protected from potentially being called on different CPUs 
concurrently.

Back when I was debugging some iTXQ issues in mt76, I also had problems 
when tx scheduling could happen from multiple places. My solution was to 
have a single worker thread that handles tx, which is scheduled from the 
wake_tx_queue op.
Maybe you could do something similar in mac80211 for non-iTXQ drivers.

- Felix


* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-08  7:13         ` Alexander Wetzel
@ 2023-03-08 10:26           ` Thomas Mann
  2023-03-08 12:10             ` Alexander Wetzel
  2023-03-08 12:29             ` Thomas Mann
  0 siblings, 2 replies; 20+ messages in thread
From: Thomas Mann @ 2023-03-08 10:26 UTC (permalink / raw)
  To: Alexander Wetzel
  Cc: Linux regressions mailing list, linux-wireless, LKML,
	Stanislaw Gruszka, Helmut Schaa, Johannes Berg

On Wed, 8 Mar 2023 08:13:32 +0100
Alexander Wetzel <alexander@wetzel-home.de> wrote:

> On 07.03.23 23:31, Thomas Mann wrote:
> > Hi Alexander,
>
> Since I suspect we'll exchange quite some mails here:
> Top posting is frowned upon on the mailing lists on copy.
> Details here: https://www.infradead.org/~dwmw2/email.html
>
> I've moved your post to the correct position and replied there.
>
> >
> >>>>
> >>>
> >>> I just uploaded a test patch to bugzilla.
> >>> Please have a look if that fixes the issue.
> >>>
> >>> If not I would be interested in the output of your iTXQ status.
> >>> Enable CONFIG_MAC80211_DEBUGFS and run this command when the
> >>> connection is bad and send/share/upload to bugzilla the resulting
> >>> debug.out:
> >>>
> >>> k=1; while [ $k -lt 10 ]; do \
> >>> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
> >>> k=$(($k+1)); done >> debug.out
> >>
> >> Thomas and I continued with some debugging in
> >> https://bugzilla.kernel.org/show_bug.cgi?id=217119
> >>
> >> But the results so far are unexpected and we decided to continue
> >> the debugging with the round here. Hoping someone sees something I
> >> miss.
> >>
> >> A very summary where we are:
> >> I can't reproduce the bug with a very similar card and kernel
> >> config so far. Thomas card stops the iTXQs for intervalls >30s.
> >> Mine operates normally.
> >>
> >> A more useful but longer summary:
> >>
> >> Thomas updated to a 6.2 kernel and reported "connection drops and
> >> bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.)
> >> Asked for some more details he reported:
> >> "...slow bandwidth stuff works better, but the main problem/test
> >> case is to start a 8-16 mbit video stream, which sometimes runs
> >> for a few seconds and then stops or it doesn't start at all"
> >>
> >> He bisected the issue and identified my commit 4444bc2116ae ("wifi:
> >> mac80211: Proper mark iTXQs for resumption") as culprit.
> >>
> >> Checking the internal iTXQ status when the issue is ongoing shows,
> >> that TID zero is flagged as dirty and thus is not transmitting
> >> queued packets. Interesting line from
> >> /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> >> tid ac backlog-bytes backlog-packets new-flows drops marks
> >> overlimit collisions tx-bytes tx-packets flags
> >> 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU
> >> DIRTY)
> >> --> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets
> >> and is flagged as DIRTY. There even is a potential race setting
> >> the DIRTY flag, but the fix for that is not helping.
> >>
> >> Thus Thomas applied two debug patches, to better understand why the
> >> DIRTY flag is not cleared.
> >>
> >> And looking at the output from those we see that the driver stops
> >> Tx by calling ieee80211_stop_queue(). When ieee80211_wake_queue()
> >> mac80211 correctly resumes TX but is getting stopped by the driver
> >> after a single packet again. (The start of the relevant log is
> >> missing, so that may be initially more).
> >> I assume TX is still ok at that stage. But after some singe Tx
> >> operations the driver stops the queues again. Here the relevant
> >> part of the log:
> >> [  179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> >> [  179.585022] XXXX drv_tx: TX
> >> [  179.585027] XXXX ieee80211_stop_queue: called
> >> [  179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [  179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> >> [  179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> >> [  179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> >> [  179.585034] XXXX __ieee80211_wake_txqs: EXIT
> >> [  179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> >> [  179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> >> [  179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> >> [  179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> >> [  179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> >> [  179.585041] XXXX __ieee80211_wake_txqs: EXIT
> >> [  179.585047] XXXX drv_tx: TX
> >> [  179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [  179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [  179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [  179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [  179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [  179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [  179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [  179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> [  179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> >> ....
> >> [  214.307617] XXXX ieee80211_wake_queue: called
> >>
> >>
> >> --> So the driver blocked TX for more than 30s. Which is a good
> >> explanation of what Thomas observes.
> >>
> >> But there is nothing mac80211 can do differently here. Whatever is
> >> the real reason for the issue, it's nothing obvious I see.
> >>
> >> Luckily I found a card using the same driver and nearly the same
> >> card:
> >> Thomas systems:
> >> Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo Hardened
> >> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5)
> >> 2.39.0) #2 SMP Fri Mar  3 16:59:02 CET 2023
> >> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201
> >> detected
> >> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> >> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> >>
> >> My system, using the kernel config from Thomas with only minor
> >> modifications (different filesystems and initramfs settings and
> >> enabled mac80211 debug and developer options):
> >> Linux version 6.2.2-gentoo (root@Perry.mordor) (gcc (Gentoo
> >> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2)
> >> 2.40.0) #2 SMP Tue Mar  7 18:18:47 CET 2023ieee80211 phy0:
> >> rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> >> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> >> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> >> ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware
> >> file 'rt2870.bin'
> >> ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware
> >> detected
> >> - version: 0.36
> >>
> >> But there is one big difference on my system: I can't reproduce the
> >> bug so far. It's working as it should... (I did not apply the debug
> >> patches myself so far)
> >>
> >> I'm now planning to look a bit more into the rt2800usb driver and
> >> provide another debug patch for interesting looking code pieces in
> >> it.
> >>
> >> @Thomas:
> >> I've also uploaded you my binary kernel I'm running at the moment
> >> here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
> >>
> >> That kernel should also be able to boot and operate your system.
> >> Can you try that and tell me, if that makes any difference?
>
>  >
>  > i can't boot the binary kernel here, as the initramfs is included
>  > in my kernel, if you send me a patch, i can apply it and test it.
>
> That was an unpatched kernel. Idea was to verify that it's not a
> compiler issue. (You seem to be using a hardened Gentoo profile.)
>
> Can you share your initrd, so I can include it? (Mail it to me
> directly, upload it to bug in buguilla or send a link to some cloud
> storage.)
>
I can't share this config, as it's a production system, and I'm not
allowed to run arbitrary binary code on the system. As 6.1.x works
without a problem, I don't think it's a compiler problem. I will try to
get a non-hardened compiler and recompile the kernel.

>
>
> >>
> >> I'm also planning to provide some more debug patches, to figuring
> >> out which part of commit 4444bc2116ae ("wifi: mac80211: Proper
> >> mark iTXQs for resumption") fixes the issue for you. Assuming my
> >> understanding above is correct the patch should not really
> >> fix/break anything for you...With the findings above I would have
> >> expected your git bisec to identify commit a790cc3a4fad ("wifi:
> >> mac80211: add wake_tx_queue callback to drivers") as the first
> >> broken commit...
> >>
> >> Alexander
> >
>



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-08  7:52       ` Felix Fietkau
@ 2023-03-08 11:41         ` Alexander Wetzel
  2023-03-08 11:57           ` Felix Fietkau
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Wetzel @ 2023-03-08 11:41 UTC (permalink / raw)
  To: Felix Fietkau, Linux regressions mailing list
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

On 08.03.23 08:52, Felix Fietkau wrote:

>> I'm also planning to provide some more debug patches, to figuring out
>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>> for resumption") fixes the issue for you. Assuming my understanding
>> above is correct the patch should not really fix/break anything for
>> you...With the findings above I would have expected your git bisec to
>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>> callback to drivers") as the first broken commit...
> I can't point to any specific series of events where it would go wrong, 
> but I suspect that the problem might be the fact that you're doing tx 
> scheduling from within ieee80211_handle_wake_tx_queue. I don't see how 
> it's properly protected from potentially being called on different CPUs 
> concurrently.
> 
> Back when I was debugging some iTXQ issues in mt76, I also had problems 
> when tx scheduling could happen from multiple places. My solution was to 
> have a single worker thread that handles tx, which is scheduled from the 
> wake_tx_queue op.
> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
> 

I think it's already doing all of that:
ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the 
wake_tx_queue op. The drivers without native iTXQ support simply link it 
to this handler.

Alexander




* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-08 11:41         ` Alexander Wetzel
@ 2023-03-08 11:57           ` Felix Fietkau
  2023-03-08 12:21             ` Linux regression tracking (Thorsten Leemhuis)
  2023-03-09 22:13             ` Alexander Wetzel
  0 siblings, 2 replies; 20+ messages in thread
From: Felix Fietkau @ 2023-03-08 11:57 UTC (permalink / raw)
  To: Alexander Wetzel, Linux regressions mailing list
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

On 08.03.23 12:41, Alexander Wetzel wrote:
> On 08.03.23 08:52, Felix Fietkau wrote:
> 
>>> I'm also planning to provide some more debug patches, to figuring out
>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>> for resumption") fixes the issue for you. Assuming my understanding
>>> above is correct the patch should not really fix/break anything for
>>> you...With the findings above I would have expected your git bisec to
>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>> callback to drivers") as the first broken commit...
>> I can't point to any specific series of events where it would go wrong, 
>> but I suspect that the problem might be the fact that you're doing tx 
>> scheduling from within ieee80211_handle_wake_tx_queue. I don't see how 
>> it's properly protected from potentially being called on different CPUs 
>> concurrently.
>> 
>> Back when I was debugging some iTXQ issues in mt76, I also had problems 
>> when tx scheduling could happen from multiple places. My solution was to 
>> have a single worker thread that handles tx, which is scheduled from the 
>> wake_tx_queue op.
>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>> 
> 
> I think it's already doing all of that:
> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
> wake_tx_queue op. The drivers without native iTXQ support simply link it
> to this handler.
I know. The problem I see is that I can't find anything that guarantees
that the .wake_tx_queue op is not being called concurrently from multiple
different places. ieee80211_handle_wake_tx_queue is doing the scheduling
directly, instead of deferring it to a single workqueue/tasklet/thread,
and multiple concurrent calls to it could potentially cause issues.
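
As a rough userspace model of that deferral pattern (Python, not kernel code; the class and method names are invented for illustration and this is not the mt76 or mac80211 implementation): the wake call only sets a flag, and a single worker thread is the only place where frames are dequeued and transmitted, no matter how many threads call wake concurrently.

```python
import queue
import threading


class TxScheduler:
    """Userspace model: wake calls only signal; one worker does all TX."""

    def __init__(self, transmit):
        self._transmit = transmit          # stand-in for the driver's tx path
        self._pending = queue.Queue()      # stand-in for the queued frames
        self._wake = threading.Event()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def enqueue(self, frame):
        self._pending.put(frame)
        self.wake_tx_queue()

    def wake_tx_queue(self):
        # May be called from any thread; it never dequeues by itself.
        self._wake.set()

    def flush(self):
        # Block until the worker has handled every queued frame.
        self._pending.join()

    def _run(self):
        while True:
            self._wake.wait()
            self._wake.clear()             # clear first, then drain: no lost wakeups
            while True:
                try:
                    frame = self._pending.get_nowait()
                except queue.Empty:
                    break
                self._transmit(frame)
                self._pending.task_done()


sent = []                                  # (thread id, frame) per transmission
sched = TxScheduler(lambda f: sent.append((threading.get_ident(), f)))


def producer(base):
    for i in range(50):
        sched.enqueue(base + i)


threads = [threading.Thread(target=producer, args=(n * 100,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
sched.flush()

print(len(sent), "frames sent from", len({tid for tid, _ in sent}), "thread(s)")
```

Four producers wake the scheduler concurrently, yet every frame is transmitted from the one worker thread, so the transmit path never runs in parallel with itself.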

- Felix


* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-08 10:26           ` Thomas Mann
@ 2023-03-08 12:10             ` Alexander Wetzel
  2023-03-08 12:29             ` Thomas Mann
  1 sibling, 0 replies; 20+ messages in thread
From: Alexander Wetzel @ 2023-03-08 12:10 UTC (permalink / raw)
  To: Thomas Mann
  Cc: Linux regressions mailing list, linux-wireless, LKML,
	Stanislaw Gruszka, Helmut Schaa, Johannes Berg

>>>> @Thomas:
>>>> I've also uploaded you my binary kernel I'm running at the moment
>>>> here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
>>>>
>>>> That kernel should also be able to boot and operate your system.
>>>> Can you try that and tell me, if that makes any difference?
>>
>>   >
>>   > i can't boot the binary kernel here, as the initramfs is included
>>   > in my kernel, if you send me a patch, i can apply it and test it.
>>
>> That was an unpatched kernel. Idea was to verify that it's not a
>> compiler issue. (You seem to be using a hardened Gentoo profile.)
>>
>> Can you share your initrd, so I can include it? (Mail it to me
>> directly, upload it to bug in buguilla or send a link to some cloud
>> storage.)
>>
> I can't share this config, as it's a production system, and i'm not
> allowed to run abitrary binary code on the system. As 6.1.x works
> without a problem, i don't think it's a compiler problem. I will try to
> get a none hardened compiler and recompile the kernel.
> 

I was suspecting something like that. I may try the same in reverse. But
so far it's quite some way down on the list. There are more promising
ways to spend the debug time I have.

But one remark:
As far as TX is concerned, 6.1 and 6.2 kernels handle TX in drastically
different ways for many cards, including yours.

The patch you identified as the culprit comes well after the move to the
new TX mode of operation and only fixes a comparably minor issue.

Your setup seems to require both, the move to iTXQ AND this minor fix.

>>
>>>>
>>>> I'm also planning to provide some more debug patches, to figuring
>>>> out which part of commit 4444bc2116ae ("wifi: mac80211: Proper
>>>> mark iTXQs for resumption") fixes the issue for you. Assuming my
>>>> understanding above is correct the patch should not really
>>>> fix/break anything for you...With the findings above I would have
>>>> expected your git bisec to identify commit a790cc3a4fad ("wifi:
>>>> mac80211: add wake_tx_queue callback to drivers") as the first
>>>> broken commit...
>>>>
>>>> Alexander
>>>
>>
> 



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-08 11:57           ` Felix Fietkau
@ 2023-03-08 12:21             ` Linux regression tracking (Thorsten Leemhuis)
  2023-03-08 16:50               ` Alexander Wetzel
  2023-03-09 22:13             ` Alexander Wetzel
  1 sibling, 1 reply; 20+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-03-08 12:21 UTC (permalink / raw)
  To: Felix Fietkau, Alexander Wetzel, Linux regressions mailing list
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

On 08.03.23 12:57, Felix Fietkau wrote:
> On 08.03.23 12:41, Alexander Wetzel wrote:
>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>> I'm also planning to provide some more debug patches, to figuring out
>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>> above is correct the patch should not really fix/break anything for
>>>> you...With the findings above I would have expected your git bisec to
>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>> callback to drivers") as the first broken commit...
>>> I can't point to any specific series of events where it would go
>>> wrong, but I suspect that the problem might be the fact that you're
>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>> don't see how it's properly protected from potentially being called
>>> on different CPUs concurrently.
>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>> problems when tx scheduling could happen from multiple places. My
>>> solution was to have a single worker thread that handles tx, which is
>>> scheduled from the wake_tx_queue op.
>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>> I think it's already doing all of that:
>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>> to this handler.
> I know. The problem I see is that I can't find anything that guarantees
> that .wake_tx_queue_op is not being called concurrently from multiple
> different places. ieee80211_handle_wake_tx_queue is doing the scheduling
> directly, instead of deferring it to a single workqueue/tasklet/thread,
> and multiple concurrent calls to it could potentially cause issues.

Alexander, Felix, many thx for looking into this.

This more and more sounds like something that might take a while to get
fixed, which makes it harder to get this fixed within those time-frames
Documentation/process/handling-regressions.rst outlines. So please allow
me to ask:

Is reverting the culprit (and reapplying it later once the real cause is
found and fixed) an option, or would that cause other regressions?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.


* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-08 10:26           ` Thomas Mann
  2023-03-08 12:10             ` Alexander Wetzel
@ 2023-03-08 12:29             ` Thomas Mann
  1 sibling, 0 replies; 20+ messages in thread
From: Thomas Mann @ 2023-03-08 12:29 UTC (permalink / raw)
  To: Alexander Wetzel
  Cc: Linux regressions mailing list, linux-wireless, LKML,
	Stanislaw Gruszka, Helmut Schaa, Johannes Berg

On Wed, 8 Mar 2023 11:26:50 +0100
Thomas Mann <rauchwolke@gmx.net> wrote:

> On Wed, 8 Mar 2023 08:13:32 +0100
> Alexander Wetzel <alexander@wetzel-home.de> wrote:
>
> > On 07.03.23 23:31, Thomas Mann wrote:
> > > Hi Alexander,
> >
> > Since I suspect we'll exchange quite some mails here:
> > Top posting is being frowned on the mailing lists on copy.
> > Details here: https://www.infradead.org/~dwmw2/email.html
> >
> > I've moved your post to the correct position and replied there.
> >
> > >
> > >>>>
> > >>>
> > >>> I just uploaded a test patch to bugzilla.
> > >>> Please have a look if that fixes the issue.
> > >>>
> > >>> If not I would be interested in the output of your iTXQ status.
> > >>> Enable CONFIG_MAC80211_DEBUGFS and run this command when the
> > >>> connection is bad and send/share/upload to bugzilla the
> > >>> resulting debug.out:
> > >>>
> > >>> k=1; while [ $k -lt 10 ]; do \
> > >>> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
> > >>> k=$(($k+1)); done >> debug.out
> > >>
> > >> Thomas and I continued with some debugging in
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=217119
> > >>
> > >> But the results so far are unexpected and we decided to continue
> > >> the debugging with the round here. Hoping someone sees something
> > >> I miss.
> > >>
> > >> A very summary where we are:
> > >> I can't reproduce the bug with a very similar card and kernel
> > >> config so far. Thomas card stops the iTXQs for intervalls >30s.
> > >> Mine operates normally.
> > >>
> > >> A more useful but longer summary:
> > >>
> > >> Thomas updated to a 6.2 kernel and reported "connection drops and
> > >> bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.)
> > >> Asked for some more details he reported:
> > >> "...slow bandwidth stuff works better, but the main problem/test
> > >> case is to start a 8-16 mbit video stream, which sometimes runs
> > >> for a few seconds and then stops or it doesn't start at all"
> > >>
> > >> He bisected the issue and identified my commit 4444bc2116ae
> > >> ("wifi: mac80211: Proper mark iTXQs for resumption") as culprit.
> > >>
> > >> Checking the internal iTXQ status when the issue is ongoing
> > >> shows, that TID zero is flagged as dirty and thus is not
> > >> transmitting queued packets. Interesting line from
> > >> /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> > >> tid ac backlog-bytes backlog-packets new-flows drops marks
> > >> overlimit collisions tx-bytes tx-packets flags
> > >> 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU
> > >> DIRTY)
> > >> --> The "normal" iTXQ handling IEEE80211_AC_BE has queued
> > >> packets and is flagged as DIRTY. There even is a potential race
> > >> setting the DIRTY flag, but the fix for that is not helping.
> > >>
> > >> Thus Thomas applied two debug patches, to better understand why
> > >> the DIRTY flag is not cleared.
> > >>
> > >> And looking at the output from those we see that the driver stops
> > >> Tx by calling ieee80211_stop_queue(). When ieee80211_wake_queue()
> > >> mac80211 correctly resumes TX but is getting stopped by the
> > >> driver after a single packet again. (The start of the relevant
> > >> log is missing, so that may be initially more).
> > >> I assume TX is still ok at that stage. But after some singe Tx
> > >> operations the driver stops the queues again. Here the relevant
> > >> part of the log:
> > >> [  179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> > >> [  179.585022] XXXX drv_tx: TX
> > >> [  179.585027] XXXX ieee80211_stop_queue: called
> > >> [  179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [  179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> > >> [  179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> > >> [  179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> > >> [  179.585034] XXXX __ieee80211_wake_txqs: EXIT
> > >> [  179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> > >> [  179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> > >> [  179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> > >> [  179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> > >> [  179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> > >> [  179.585041] XXXX __ieee80211_wake_txqs: EXIT
> > >> [  179.585047] XXXX drv_tx: TX
> > >> [  179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [  179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [  179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [  179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [  179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [  179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [  179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [  179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> [  179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > >> ....
> > >> [  214.307617] XXXX ieee80211_wake_queue: called
> > >>
> > >>
> > >> --> So the driver blocked TX for more than 30s. Which is a good
> > >> explanation of what Thomas observes.
> > >>
> > >> But there is nothing mac80211 can do differently here. Whatever
> > >> is the real reason for the issue, it's nothing obvious I see.
> > >>
> > >> Luckily I found a card using the same driver and nearly the same
> > >> card:
> > >> Thomas systems:
> > >> Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo Hardened
> > >> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5)
> > >> 2.39.0) #2 SMP Fri Mar  3 16:59:02 CET 2023
> > >> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
> > >> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> > >> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> > >>
> > >> My system, using the kernel config from Thomas with only minor
> > >> modifications (different filesystems and initramfs settings and
> > >> enabled mac80211 debug and developer options):
> > >> Linux version 6.2.2-gentoo (root@Perry.mordor) (gcc (Gentoo
> > >> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2)
> > >> 2.40.0) #2 SMP Tue Mar  7 18:18:47 CET 2023
> > >> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> > >> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> > >> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> > >> ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading
> > >> firmware file 'rt2870.bin'
> > >> ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware
> > >> detected - version: 0.36
> > >>
> > >> But there is one big difference on my system: I can't reproduce
> > >> the bug so far. It's working as it should... (I did not apply
> > >> the debug patches myself so far)
> > >>
> > >> I'm now planning to look a bit more into the rt2800usb driver and
> > >> provide another debug patch for interesting looking code pieces
> > >> in it.
> > >>
> > >> @Thomas:
> > >> I've also uploaded you my binary kernel I'm running at the moment
> > >> here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
> > >>
> > >> That kernel should also be able to boot and operate your system.
> > >> Can you try that and tell me, if that makes any difference?
> >
> >  >
> >  > i can't boot the binary kernel here, as the initramfs is included
> >  > in my kernel, if you send me a patch, i can apply it and test
> >  > it.
> >
> > That was an unpatched kernel. Idea was to verify that it's not a
> > compiler issue. (You seem to be using a hardened Gentoo profile.)
> >
> > Can you share your initrd, so I can include it? (Mail it to me
> > directly, upload it to the bug in bugzilla or send a link to some
> > cloud storage.)
> >
> I can't share this config, as it's a production system, and I'm not
> allowed to run arbitrary binary code on the system. As 6.1.x works
> without a problem, I don't think it's a compiler problem. I will try
> to get a non-hardened compiler and recompile the kernel.

I have now compiled the kernel with a non-hardened toolchain/compiler
(gcc (Gentoo 12.2.1_p20230121-r1 p10) 12.2.1 20230121) and the kernel
still shows the same bug/behaviour.

>
> >
> >
> > >>
> > >> I'm also planning to provide some more debug patches, to figuring
> > >> out which part of commit 4444bc2116ae ("wifi: mac80211: Proper
> > >> mark iTXQs for resumption") fixes the issue for you. Assuming my
> > >> understanding above is correct the patch should not really
> > >> fix/break anything for you...With the findings above I would have
> > >> expected your git bisec to identify commit a790cc3a4fad ("wifi:
> > >> mac80211: add wake_tx_queue callback to drivers") as the first
> > >> broken commit...
> > >>
> > >> Alexander
> > >
> >
>



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-08 12:21             ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-03-08 16:50               ` Alexander Wetzel
  2023-03-09  7:59                 ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Wetzel @ 2023-03-08 16:50 UTC (permalink / raw)
  To: Linux regressions mailing list, Felix Fietkau
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

On 08.03.23 13:21, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 08.03.23 12:57, Felix Fietkau wrote:
>> On 08.03.23 12:41, Alexander Wetzel wrote:
>>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>>> I'm also planning to provide some more debug patches, to figuring out
>>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>>> above is correct the patch should not really fix/break anything for
>>>>> you...With the findings above I would have expected your git bisec to
>>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>>> callback to drivers") as the first broken commit...
>>>> I can't point to any specific series of events where it would go
>>>> wrong, but I suspect that the problem might be the fact that you're
>>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>>> don't see how it's properly protected from potentially being called
>>>> on different CPUs concurrently.
>>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>>> problems when tx scheduling could happen from multiple places. My
>>>> solution was to have a single worker thread that handles tx, which is
>>>> scheduled from the wake_tx_queue op.
>>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>> I think it's already doing all of that:
>>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>>> to this handler.
>> I know. The problem I see is that I can't find anything that guarantees
>> that .wake_tx_queue_op is not being called concurrently from multiple
>> different places. ieee80211_handle_wake_tx_queue is doing the scheduling
>> directly, instead of deferring it to a single workqueue/tasklet/thread,
>> and multiple concurrent calls to it could potentially cause issues.
> 
> Alexander, Felix, many thx for looking into this.
> 
> This more and more sounds like something that might take a while to get
> fixed, which makes it harder to get this fixed within those time-frames
> Documentation/process/handling-regressions.rst outlines. So please allow
> me to ask:
> 
> Is reverting the culprit (and reapplying it later once the real cause is
> found and fixed) an option, or would that cause other regressions?

This patch turned out to fix a (much worse) pre-release regression. See e.g.
https://lore.kernel.org/linux-wireless/7cff27f8-d363-bbfb-241e-8d6fc0009c40@leemhuis.info/T/#t

Fixing both regressions by reverting would force us to revert further
commits that other patches depend on...

> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-08 16:50               ` Alexander Wetzel
@ 2023-03-09  7:59                 ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 20+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-03-09  7:59 UTC (permalink / raw)
  To: Alexander Wetzel, Linux regressions mailing list, Felix Fietkau
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

On 08.03.23 17:50, Alexander Wetzel wrote:
> On 08.03.23 13:21, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 08.03.23 12:57, Felix Fietkau wrote:
>>> On 08.03.23 12:41, Alexander Wetzel wrote:
>>>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>>>> I'm also planning to provide some more debug patches, to figuring out
>>>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>>>> above is correct the patch should not really fix/break anything for
>>>>>> you...With the findings above I would have expected your git bisec to
>>>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>>>> callback to drivers") as the first broken commit...
>>>>> I can't point to any specific series of events where it would go
>>>>> wrong, but I suspect that the problem might be the fact that you're
>>>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>>>> don't see how it's properly protected from potentially being called
>>>>> on different CPUs concurrently.
>>>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>>>> problems when tx scheduling could happen from multiple places. My
>>>>> solution was to have a single worker thread that handles tx, which is
>>>>> scheduled from the wake_tx_queue op.
>>>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>>> I think it's already doing all of that:
>>>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>>>> wake_tx_queue op. The drivers without native iTXQ support simply
>>>> link it
>>>> to this handler.
>>> I know. The problem I see is that I can't find anything that guarantees
>>> that .wake_tx_queue_op is not being called concurrently from multiple
>>> different places. ieee80211_handle_wake_tx_queue is doing the scheduling
>>> directly, instead of deferring it to a single workqueue/tasklet/thread,
>>> and multiple concurrent calls to it could potentially cause issues.
>>
>> Alexander, Felix, many thx for looking into this.
>>
>> This more and more sounds like something that might take a while to get
>> fixed, which makes it harder to get this fixed within those time-frames
>> Documentation/process/handling-regressions.rst outlines. So please allow
>> me to ask:
>>
>> Is reverting the culprit (and reapplying it later once the real cause is
>> found and fixed) an option, or would that cause other regressions?
> 
> This patch turned out to fix a (much worse) pre-release regression. See
> e.g.
> https://lore.kernel.org/linux-wireless/7cff27f8-d363-bbfb-241e-8d6fc0009c40@leemhuis.info/T/#t

Uggh, thx for the update, that's unfortunate, but that's how it is
sometimes. I just asked because the culprit didn't have a Reported-by
together with a Link: to the backstory, so it looked like it might be
fine to revert. But then it's not an option.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.


* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-07 20:54     ` Alexander Wetzel
  2023-03-07 22:31       ` Thomas Mann
  2023-03-08  7:52       ` Felix Fietkau
@ 2023-03-09 17:00       ` Alexander Wetzel
  2023-03-09 17:29         ` Thomas Mann
  2 siblings, 1 reply; 20+ messages in thread
From: Alexander Wetzel @ 2023-03-09 17:00 UTC (permalink / raw)
  To: Thomas Mann, sgruszka
  Cc: linux-wireless, LKML, Helmut Schaa, Johannes Berg,
	Linux regressions mailing list, Stanislaw Gruszka

On 07.03.23 21:54, Alexander Wetzel wrote:
>>>
>>
>> I just uploaded a test patch to bugzilla.
>> Please have a look if that fixes the issue.
>>
>> If not I would be interested in the output of your iTXQ status.
>> Enable CONFIG_MAC80211_DEBUGFS and run this command when the 
>> connection is bad and send/share/upload to bugzilla the resulting 
>> debug.out:
>>
>> k=1; while [ $k -lt 10 ]; do \
>> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
>> k=$(($k+1)); done >> debug.out
> 
> Thomas and I continued with some debugging in
> https://bugzilla.kernel.org/show_bug.cgi?id=217119
> 
> But the results so far are unexpected and we decided to continue the 
> debugging with the round here. Hoping someone sees something I miss.
> 
> A very short summary of where we are:
> I can't reproduce the bug with a very similar card and kernel config so
> far. Thomas' card stops the iTXQs for intervals >30s. Mine operates
> normally.
> 
> A more useful but longer summary:
> 
> Thomas updated to a 6.2 kernel and reported "connection drops and 
> bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.) Asked for 
> some more details he reported:
> "...slow bandwidth stuff works better, but the main problem/test case is 
> to start a 8-16 mbit video stream, which sometimes runs for a few 
> seconds and then stops or it doesn't start at all"
> 
> He bisected the issue and identified my commit 4444bc2116ae ("wifi: 
> mac80211: Proper mark iTXQs for resumption") as culprit.
> 
> Checking the internal iTXQ status when the issue is ongoing shows, that 
> TID zero is flagged as dirty and thus is not transmitting queued 
> packets. Interesting line from 
> /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> tid ac backlog-bytes backlog-packets new-flows drops marks overlimit 
> collisions tx-bytes tx-packets flags
> 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)
> 
> --> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets and is 
> flagged as DIRTY. There even is a potential race setting the DIRTY flag, 
> but the fix for that is not helping.
> 
> Thus Thomas applied two debug patches, to better understand why the 
> DIRTY flag is not cleared.
> 
> And looking at the output from those we see that the driver stops Tx by 
> calling ieee80211_stop_queue(). When ieee80211_wake_queue() mac80211 
> correctly resumes TX but is getting stopped by the driver after a single 
> packet again. (The start of the relevant log is missing, so that may be 
> initially more).
> I assume TX is still ok at that stage. But after some singe Tx 
> operations the driver stops the queues again. Here the relevant part of 
> the log:
> [  179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> [  179.585022] XXXX drv_tx: TX
> [  179.585027] XXXX ieee80211_stop_queue: called
> [  179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> [  179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> [  179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> [  179.585034] XXXX __ieee80211_wake_txqs: EXIT
> [  179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> [  179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> [  179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> [  179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> [  179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> [  179.585041] XXXX __ieee80211_wake_txqs: EXIT
> [  179.585047] XXXX drv_tx: TX
> [  179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> [  179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> ....
> [  214.307617] XXXX ieee80211_wake_queue: called
> 
> 
> --> So the driver blocked TX for more than 30s. Which is a good 
> explanation of what Thomas observes.
> 
> But there is nothing mac80211 can do differently here. Whatever is the 
> real reason for the issue, it's nothing obvious I see.

My best guess so far is a driver bug/issue that is now exposed by the
changed traffic pattern from mac80211. And while digging into the
rt2800usb driver I found a watchdog introduced here:
https://lore.kernel.org/20190615100100.29800-1-sgruszka@redhat.com

From the mac80211 debugging it looks like it may be just that: a random
hang of the driver/card.

What is certain is that rt2800usb tells mac80211 to stop TXing and needs
ages (>30s in the known sample) to unblock the queue. And this watchdog
is disabled by default.
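
The length of that stall can be read straight off the two log lines
quoted above (the ieee80211_stop_queue and ieee80211_wake_queue entries).
A minimal sketch in C, assuming the usual dmesg-style "[ seconds]" prefix;
parse_ts() and stall_seconds() are hypothetical helper names, not kernel
code:

```c
#include <stdio.h>

/* Hypothetical helper: extract the timestamp from a dmesg-style line
 * such as "[  179.585027] XXXX ieee80211_stop_queue: called".
 * Returns 1 on success, 0 if the line has no leading "[ <seconds>". */
static int parse_ts(const char *line, double *ts)
{
    return sscanf(line, "[ %lf", ts) == 1;
}

/* Seconds between a stop_queue line and the matching wake_queue line,
 * or a negative value if either line cannot be parsed. */
static double stall_seconds(const char *stop_line, const char *wake_line)
{
    double stop_ts, wake_ts;

    if (!parse_ts(stop_line, &stop_ts) || !parse_ts(wake_line, &wake_ts))
        return -1.0;
    return wake_ts - stop_ts;
}
```

Fed with the stop line at 179.585027 and the wake line at 214.307617,
this gives roughly 34.7s during which the queue stayed stopped.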

So now I'm wondering whether the changed traffic pattern due to the
mac80211 patch is just triggering the random hangs...

I've also uploaded more test patches to bugzilla.

@Thomas
Can you also try with this watchdog enabled? It must be enabled for
rt2800lib. Since you have compiled the driver into the kernel, the
following boot parameter should enable it:
  rt2800lib.watchdog=1
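
To double-check after booting that the parameter took effect, its value
can be read back from sysfs. A minimal sketch, assuming the parameter is
exported readable at /sys/module/rt2800lib/parameters/watchdog (an
assumption, not checked here) and that bool parameters read back as "Y"
or "N"; read_bool_param() is a hypothetical helper:

```c
#include <stdio.h>

/* Hypothetical helper: read a boolean module parameter from sysfs.
 * Returns 1 for "Y"/"1", 0 for "N"/"0", and -1 if the file is missing,
 * unreadable or holds something else. */
static int read_bool_param(const char *path)
{
    FILE *f = fopen(path, "r");
    int c;

    if (!f)
        return -1;
    c = fgetc(f);
    fclose(f);

    if (c == 'Y' || c == 'y' || c == '1')
        return 1;
    if (c == 'N' || c == 'n' || c == '0')
        return 0;
    return -1;
}
```

read_bool_param("/sys/module/rt2800lib/parameters/watchdog") should
return 1 when the watchdog is on; -1 would mean the path assumption does
not hold on that kernel.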

@Stanislaw
Can you maybe also have a look at the issue and at how it compares
to the known random hangs you introduced the watchdog for?

> 
> Luckily I found a card using the same driver and nearly the same card:
> Thomas systems:
> Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo Hardened
> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5)
> 2.39.0) #2 SMP Fri Mar  3 16:59:02 CET 2023
> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> 
> My system, using the kernel config from Thomas with only minor 
> modifications (different filesystems and initramfs settings and enabled 
> mac80211 debug and developer options):
> Linux version 6.2.2-gentoo (root@Perry.mordor) (gcc (Gentoo 
> 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2) 
> 2.40.0) #2 SMP Tue Mar  7 18:18:47 CET 2023
> ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware file 
> 'rt2870.bin'
> ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware detected - 
> version: 0.36
> 
> But there is one big difference on my system: I can't reproduce the bug 
> so far. It's working as it should... (I did not apply the debug patches 
> myself so far)
> 
> I'm now planning to look a bit more into the rt2800usb driver and 
> provide another debug patch for interesting looking code pieces in it.
> 
> @Thomas:
> I've also uploaded you my binary kernel I'm running at the moment here:
> https://www.awhome.eu/s/5FjqMS73rtCtSBM
> 
> That kernel should also be able to boot and operate your system. Can you 
> try that and tell me, if that makes any difference?
> 
> I'm also planning to provide some more debug patches, to figuring out 
> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs 
> for resumption") fixes the issue for you. Assuming my understanding 
> above is correct the patch should not really fix/break anything for 
> you...With the findings above I would have expected your git bisec to 
> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue 
> callback to drivers") as the first broken commit...
> 
> Alexander



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-09 17:00       ` Alexander Wetzel
@ 2023-03-09 17:29         ` Thomas Mann
  0 siblings, 0 replies; 20+ messages in thread
From: Thomas Mann @ 2023-03-09 17:29 UTC (permalink / raw)
  To: Alexander Wetzel
  Cc: sgruszka, linux-wireless, LKML, Helmut Schaa, Johannes Berg,
	Linux regressions mailing list, Stanislaw Gruszka

On Thu, 9 Mar 2023 18:00:04 +0100
Alexander Wetzel <alexander@wetzel-home.de> wrote:

> On 07.03.23 21:54, Alexander Wetzel wrote:
> >>>  
> >>
> >> I just uploaded a test patch to bugzilla.
> >> Please have a look if that fixes the issue.
> >>
> >> If not I would be interested in the output of your iTXQ status.
> >> Enable CONFIG_MAC80211_DEBUGFS and run this command when the 
> >> connection is bad and send/share/upload to bugzilla the resulting 
> >> debug.out:
> >>
> >> k=1; while [ $k -lt 10 ]; do \
> >> cat /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm; \
> >> k=$(($k+1)); done >> debug.out  
> > 
> > Thomas and I continued with some debugging in
> > https://bugzilla.kernel.org/show_bug.cgi?id=217119
> > 
> > But the results so far are unexpected and we decided to continue
> > the debugging with the round here. Hoping someone sees something I
> > miss.
> > 
> > A very short summary of where we are:
> > I can't reproduce the bug with a very similar card and kernel
> > config so far. Thomas' card stops the iTXQs for intervals >30s.
> > Mine operates normally.
> > 
> > A more useful but longer summary:
> > 
> > Thomas updated to a 6.2 kernel and reported "connection drops and 
> > bandwidth problems" with his rt2800usb wlan card. (6.1 is ok.)
> > Asked for some more details he reported:
> > "...slow bandwidth stuff works better, but the main problem/test
> > case is to start a 8-16 mbit video stream, which sometimes runs for
> > a few seconds and then stops or it doesn't start at all"
> > 
> > He bisected the issue and identified my commit 4444bc2116ae ("wifi: 
> > mac80211: Proper mark iTXQs for resumption") as culprit.
> > 
> > Checking the internal iTXQ status when the issue is ongoing shows,
> > that TID zero is flagged as dirty and thus is not transmitting
> > queued packets. Interesting line from 
> > /sys/kernel/debug/ieee80211/phy?/netdev:*/stations/*/aqm:
> > tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets flags
> > 0 2 619736 404 1681 0 0 0 1 4513965 3019 0xe(RUN AMPDU NO-AMSDU DIRTY)
> >
> > --> The "normal" iTXQ handling IEEE80211_AC_BE has queued packets
> > and is flagged as DIRTY. There even is a potential race setting the
> > DIRTY flag, but the fix for that is not helping.
> > 
> > Thus Thomas applied two debug patches, to better understand why the 
> > DIRTY flag is not cleared.
> > 
> > And looking at the output from those we see that the driver stops
> > Tx by calling ieee80211_stop_queue(). When ieee80211_wake_queue()
> > mac80211 correctly resumes TX but is getting stopped by the driver
> > after a single packet again. (The start of the relevant log is
> > missing, so that may be initially more).
> > I assume TX is still ok at that stage. But after some singe Tx 
> > operations the driver stops the queues again. Here the relevant
> > part of the log:
> > [  179.584997] XXXX __ieee80211_wake_txqs: waking TID 0
> > [  179.585022] XXXX drv_tx: TX
> > [  179.585027] XXXX ieee80211_stop_queue: called
> > [  179.585028] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.585030] XXXX __ieee80211_wake_txqs: TID 3 NOT dirty
> > [  179.585031] XXXX __ieee80211_wake_txqs: TID 8 NOT dirty
> > [  179.585033] XXXX __ieee80211_wake_txqs: TID 11 NOT dirty
> > [  179.585034] XXXX __ieee80211_wake_txqs: EXIT
> > [  179.585035] XXXX __ieee80211_wake_txqs: ENTRY
> > [  179.585036] XXXX __ieee80211_wake_txqs: TID 1 NOT dirty
> > [  179.585037] XXXX __ieee80211_wake_txqs: TID 2 NOT dirty
> > [  179.585038] XXXX __ieee80211_wake_txqs: TID 9 NOT dirty
> > [  179.585040] XXXX __ieee80211_wake_txqs: TID 10 NOT dirty
> > [  179.585041] XXXX __ieee80211_wake_txqs: EXIT
> > [  179.585047] XXXX drv_tx: TX
> > [  179.585056] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.585271] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.585868] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.586120] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.586544] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.586792] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.587317] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.587591] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > [  179.588569] XXXX ieee80211_tx_dequeue: mark TID 0 dirty. Reason: 1
> > ....
> > [  214.307617] XXXX ieee80211_wake_queue: called
> > 
> >   
> > --> So the driver blocked TX for more than 30s. Which is a good   
> > explanation of what Thomas observes.
> > 
> > But there is nothing mac80211 can do differently here. Whatever is
> > the real reason for the issue, it's nothing obvious I see.  
> 
> Best shot I have so far is a driver bug/issue now exposed by the
> changed traffic pattern from mac80211. And while digging into the
> rt2800usb driver I found a watchdog introduced here:
> https://lore.kernel.org/20190615100100.29800-1-sgruszka@redhat.com
> 
>  From mac80211 debugging it looks like it may just be that: A random 
> hang of the driver/card.
> 
> For sure rt2800usb tells mac80211 to stop TXing and needs ages (>30s
> in known sample) to unblock the queue. And this watchdog is disabled
> by default.
> 
> Now I'm clearly wondering, if the changed traffic pattern due to the 
> mac80211 patch is just triggering the random hangs...
> 
> I've also uploaded more test patches to bugzilla.
> 
> @Thomas
> Can you also try with this watchdog enabled? It must be enabled for 
> rt2800lib. Since you have compiled in the driver the following boot 
> parameter should enable it:
>   rt2800lib.watchdog=1

I responded on the bug tracker. Enabling the watchdog doesn't solve the
problem.

> 
> @Stanislaw
> Can you maybe also have a look at the issue and how that looks
> compared to the known random hangs you introduced the watchdog for?
> 
> > 
> > Luckily I found a card using the same driver and nearly the same
> > card:
> > Thomas systems:
> > Linux version 6.2.2-gentoo (root@foo) (gcc (Gentoo Hardened
> > 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5)
> > 2.39.0) #2 SMP Fri Mar  3 16:59:02 CET 2023
> > ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0201 detected
> > ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> > ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> > 
> > My system, using the kernel config from Thomas with only minor 
> > modifications (different filesystems and initramfs settings and
> > enabled mac80211 debug and developer options):
> > Linux version 6.2.2-gentoo (root@Perry.mordor) (gcc (Gentoo 
> > 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.40 p2) 
> > 2.40.0) #2 SMP Tue Mar  7 18:18:47 CET 2023
> > ieee80211 phy0: rt2x00_set_rt: Info - RT chipset 3070, rev 0200 detected
> > ieee80211 phy0: rt2x00_set_rf: Info - RF chipset 0005 detected
> > ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
> > ieee80211 phy0: rt2x00lib_request_firmware: Info - Loading firmware
> > file 'rt2870.bin'
> > ieee80211 phy0: rt2x00lib_request_firmware: Info - Firmware
> > detected - version: 0.36
> > 
> > But there is one big difference on my system: I can't reproduce the
> > bug so far. It's working as it should... (I did not apply the debug
> > patches myself so far)
> > 
> > I'm now planning to look a bit more into the rt2800usb driver and 
> > provide another debug patch for interesting looking code pieces in
> > it.
> > 
> > @Thomas:
> > I've also uploaded you my binary kernel I'm running at the moment
> > here: https://www.awhome.eu/s/5FjqMS73rtCtSBM
> > 
> > That kernel should also be able to boot and operate your system.
> > Can you try that and tell me, if that makes any difference?
> > 
> > I'm also planning to provide some more debug patches, to figuring
> > out which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark
> > iTXQs for resumption") fixes the issue for you. Assuming my
> > understanding above is correct the patch should not really
> > fix/break anything for you...With the findings above I would have
> > expected your git bisec to identify commit a790cc3a4fad ("wifi:
> > mac80211: add wake_tx_queue callback to drivers") as the first
> > broken commit...
> > 
> > Alexander  
> 



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-08 11:57           ` Felix Fietkau
  2023-03-08 12:21             ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-03-09 22:13             ` Alexander Wetzel
  2023-03-11 21:26               ` Alexander Wetzel
  1 sibling, 1 reply; 20+ messages in thread
From: Alexander Wetzel @ 2023-03-09 22:13 UTC (permalink / raw)
  To: Felix Fietkau, Linux regressions mailing list
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

On 08.03.23 12:57, Felix Fietkau wrote:
> On 08.03.23 12:41, Alexander Wetzel wrote:
>> On 08.03.23 08:52, Felix Fietkau wrote:
>>
>>>> I'm also planning to provide some more debug patches, to figuring out
>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>> above is correct the patch should not really fix/break anything for
>>>> you...With the findings above I would have expected your git bisec to
>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>> callback to drivers") as the first broken commit...
>>> I can't point to any specific series of events where it would go 
>>> wrong, but I suspect that the problem might be the fact that you're 
>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I 
>>> don't see how it's properly protected from potentially being called 
>>> on different CPUs concurrently.
>>>
>>> Back when I was debugging some iTXQ issues in mt76, I also had 
>>> problems when tx scheduling could happen from multiple places. My 
>>> solution was to have a single worker thread that handles tx, which is 
>>> scheduled from the wake_tx_queue op.
>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>>
>>
>> I think it's already doing all of that:
>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>> to this handler.
> I know. The problem I see is that I can't find anything that guarantees 
> that .wake_tx_queue_op is not being called concurrently from multiple 
> different places. ieee80211_handle_wake_tx_queue is doing the scheduling 
> directly, instead of deferring it to a single workqueue/tasklet/thread, 
> and multiple concurrent calls to it could potentially cause issues.

Good hint, thanks.
According to the latest debug log exactly that seems to be happening:

ieee80211_wake_queue() is called by the driver and the wake_txqs_tasklet
tasklet is started. But while the tasklet is executing
drv_wake_tx_queue(), userspace queues a new skb and also calls into
drv_wake_tx_queue(), so the two calls overlap...

I'm not sure yet how that could cause the problem. But it breaks the
assumption that drv_wake_tx_queue() calls never overlap. And TX fails
directly after such an overlapping TX...

I'll probably just serialize the calls and then we can verify whether
that helps...
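
What serializing the calls means can be sketched in userspace. This is
only an analogy under stated assumptions, not mac80211 code:
schedule_tx() is a hypothetical stand-in for the TX scheduling path, two
pthreads stand in for the tasklet and the userspace TX path, and a mutex
stands in for whatever lock the kernel side would use.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-ins; none of these names exist in mac80211. */
static pthread_mutex_t tx_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int in_flight;      /* callers currently inside schedule_tx() */
static atomic_int max_in_flight;  /* highest overlap observed so far */

static void schedule_tx(void)
{
    pthread_mutex_lock(&tx_lock);   /* only one caller may schedule TX */

    int now = atomic_fetch_add(&in_flight, 1) + 1;
    int prev = atomic_load(&max_in_flight);
    /* Record a new maximum; a failed exchange reloads prev for the retry. */
    while (now > prev &&
           !atomic_compare_exchange_weak(&max_in_flight, &prev, now))
        ;

    /* ... dequeueing frames and handing them to the driver goes here ... */

    atomic_fetch_sub(&in_flight, 1);
    pthread_mutex_unlock(&tx_lock);
}

static void *caller(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++)
        schedule_tx();
    return NULL;
}

/* Runs two concurrent callers and returns the largest overlap seen. */
static int run_overlap_test(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, caller, NULL);
    pthread_create(&b, NULL, caller, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&max_in_flight);
}
```

With the lock in place run_overlap_test() reports a maximum overlap of 1.
Without the lock/unlock pair the two callers may be inside schedule_tx()
at the same time, which is the situation the debug log points to.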

Alexander


* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-09 22:13             ` Alexander Wetzel
@ 2023-03-11 21:26               ` Alexander Wetzel
  2023-03-12  8:58                 ` Felix Fietkau
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Wetzel @ 2023-03-11 21:26 UTC (permalink / raw)
  To: Felix Fietkau, Linux regressions mailing list
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

On 09.03.23 23:13, Alexander Wetzel wrote:
> On 08.03.23 12:57, Felix Fietkau wrote:
>> On 08.03.23 12:41, Alexander Wetzel wrote:
>>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>
>>>>> I'm also planning to provide some more debug patches, to figure out
>>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>>> above is correct the patch should not really fix/break anything for
>>>>> you... With the findings above I would have expected your git bisect to
>>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>>> callback to drivers") as the first broken commit...
>>>> I can't point to any specific series of events where it would go 
>>>> wrong, but I suspect that the problem might be the fact that you're 
>>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I 
>>>> don't see how it's properly protected from potentially being called 
>>>> on different CPUs concurrently.
>>>>
>>>> Back when I was debugging some iTXQ issues in mt76, I also had 
>>>> problems when tx scheduling could happen from multiple places. My 
>>>> solution was to have a single worker thread that handles tx, which 
>>>> is scheduled from the wake_tx_queue op.
>>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>>>
>>>
>>> I think it's already doing all of that:
>>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>>> to this handler.
>> I know. The problem I see is that I can't find anything that 
>> guarantees that .wake_tx_queue_op is not being called concurrently 
>> from multiple different places. ieee80211_handle_wake_tx_queue is 
>> doing the scheduling directly, instead of deferring it to a single 
>> workqueue/tasklet/thread, and multiple concurrent calls to it could 
>> potentially cause issues.
> 
> Good hint, thanks.
> According to the latest debug log exactly that seems to be happening:
> 
> ieee80211_wake_queue() is called by the driver and the wake_txqs_tasklet 
> is started. But while drv_wake_tx_queue() is executing from the tasklet, 
> userspace queues a new skb and also calls into drv_wake_tx_queue(), 
> which then runs overlapping...
> 
> Not sure yet how that could cause the problem. But this breaks the 
> assumption that drv_wake_tx_queue() calls do not overlap. And TX fails 
> directly after such an overlapping TX...
> 
> I'll probably just serialize the calls and then we verify if that helps...

Serialization helps. A (crude and in multiple ways incorrect) patch 
preventing two drv_wake_tx_queue() running for the same ac fixed the 
issue for Thomas:
https://bugzilla.kernel.org/show_bug.cgi?id=217119#c20

So it looks like we'll soon have a fix for the issue.

The driver often wakes the queue for IEEE80211_AC_BE for only a single 
skb and then stops it again.
The short run time is insufficient for wake_txqs_tasklet to properly wake 
all queues itself, and from time to time a new TX operation squeezes in 
after IEEE80211_AC_BE has been unblocked but prior to drv_wake_tx_queue 
being called from the wake_txqs_tasklet. When this happens, 
drv_wake_tx_queue is called twice: once from the tasklet and once from 
userspace.
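
To make the interleaving concrete, here is a minimal single-threaded 
replay in userspace C. It is only a model under stated assumptions: 
struct ac_model and the model_*/replay_* functions are hypothetical names 
and not mac80211 code, and the "TX squeezing in" is replayed as a nested 
call instead of a real second CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one access category (AC); not mac80211 code. */
struct ac_model {
    bool stopped;        /* driver has stopped the queue */
    bool tx_during_wake; /* replay a TX arriving while the tasklet's wake runs */
    int in_wake;         /* wake callbacks currently in progress */
    int max_overlap;     /* highest value in_wake ever reached */
};

void model_wake_tx_queue(struct ac_model *ac);

/* TX path: a new skb triggers the wake callback unless the AC is stopped. */
void model_tx_from_userspace(struct ac_model *ac)
{
    if (!ac->stopped)
        model_wake_tx_queue(ac);
}

/* Stand-in for the wake callback; it only tracks how many calls overlap. */
void model_wake_tx_queue(struct ac_model *ac)
{
    ac->in_wake++;
    if (ac->in_wake > ac->max_overlap)
        ac->max_overlap = ac->in_wake;

    /* Replay the second caller arriving before the first one has finished. */
    if (ac->tx_during_wake) {
        ac->tx_during_wake = false;
        model_tx_from_userspace(ac);
    }
    ac->in_wake--;
}

/* Driver unblocks the AC, then the tasklet runs the wake callback. */
int replay_overlap(bool tx_squeezes_in)
{
    struct ac_model ac = { .stopped = true, .tx_during_wake = tx_squeezes_in };

    ac.stopped = false;        /* queue woken by the driver */
    model_wake_tx_queue(&ac);  /* call made from the tasklet */
    return ac.max_overlap;
}
```

replay_overlap(true) reports two overlapping wake calls for the same AC, 
replay_overlap(false) reports one.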

ieee80211_handle_wake_tx_queue is using ieee80211_txq_schedule_start, 
which has this documented requirement:
"The driver must not call multiple TXQ scheduling rounds concurrently."
Now I don't think that is causing the reported regression. Nevertheless 
we should prevent concurrent calls of ieee80211_handle_wake_tx_queue for 
that reason alone.

The real reason for the hangs is probably in the rt2800usb driver or 
hardware. I don't see anything in the driver code, so probably the HW 
itself has a problem with the two near-concurrent TX operations.

The real culprit of the regression should be commit a790cc3a4fad ("wifi: 
mac80211: add wake_tx_queue callback to drivers"), which switched 
rt2800usb over to iTXQs. But without the fix from commit 4444bc2116ae 
("wifi: mac80211: Proper mark iTXQs for resumption") mac80211 failed to 
schedule the required run of the wake_txqs_tasklet. Thus, instead of 
two concurrent drv_wake_tx_queue calls we only got one and the driver 
continued to work.

I asked Thomas on bugzilla to test the "best" solution I came up with.

There seem to be multiple ways to fix this. But I can't find a simple, 
low-risk and complete fix. So I compromised...

Once Thomas confirms the fix, we can discuss it on 
linux-wireless.


Alexander



* Re: [Regression] rt2800usb - Wifi performance issues and connection drops
  2023-03-11 21:26               ` Alexander Wetzel
@ 2023-03-12  8:58                 ` Felix Fietkau
  0 siblings, 0 replies; 20+ messages in thread
From: Felix Fietkau @ 2023-03-12  8:58 UTC (permalink / raw)
  To: Alexander Wetzel, Linux regressions mailing list
  Cc: linux-wireless, LKML, Thomas Mann, Stanislaw Gruszka,
	Helmut Schaa, Johannes Berg

On 11.03.23 22:26, Alexander Wetzel wrote:
> Serialization helps. A (crude and in multiple ways incorrect) patch
> preventing two drv_wake_tx_queue() running for the same ac fixed the
> issue for Thomas:
> https://bugzilla.kernel.org/show_bug.cgi?id=217119#c20
> 
> So it looks like we'll soon have a fix for the issue.
> 
> The driver often wakes the queue for IEEE80211_AC_BE for only a single
> skb and then stops it again.
> The short run time is insufficient for wake_txqs_tasklet to properly wake
> all queues itself, and from time to time a new TX operation squeezes in
> after IEEE80211_AC_BE has been unblocked but prior to drv_wake_tx_queue
> being called from the wake_txqs_tasklet. When this happens,
> drv_wake_tx_queue is called twice: once from the tasklet and once from
> userspace.
> 
> ieee80211_handle_wake_tx_queue is using ieee80211_txq_schedule_start,
> which has this documented requirement:
> "The driver must not call multiple TXQ scheduling rounds concurrently."
> Now I don't think that is causing the reported regression. Nevertheless
> we should prevent concurrent calls of ieee80211_handle_wake_tx_queue for
> that reason alone.
> 
> The real reason for the hangs is probably in the rt2800usb driver or
> hardware. I don't see anything in the driver code, so probably the HW
> itself has a problem with the two near-concurrent TX operations.
> 
> The real culprit of the regression should be commit a790cc3a4fad ("wifi:
> mac80211: add wake_tx_queue callback to drivers"), which switched
> rt2800usb over to iTXQs. But without the fix from commit 4444bc2116ae
> ("wifi: mac80211: Proper mark iTXQs for resumption") mac80211 failed to
> schedule the required run of the wake_txqs_tasklet. Thus, instead of
> two concurrent drv_wake_tx_queue calls we only got one and the driver
> continued to work.
> 
> I asked Thomas on bugzilla to test the "best" solution I came up with.
> 
> There seem to be multiple ways to fix this. But I can't find a simple,
> low-risk and complete fix. So I compromised...
> 
> Once Thomas confirms the fix, we can discuss it on
> linux-wireless.

I would recommend the following approach for properly fixing this issue:

On init if the .wake_tx_queue op is set to 
ieee80211_handle_wake_tx_queue, create a single kthread that iterates 
over all hw queues and schedules each one of them like 
ieee80211_handle_wake_tx_queue does now.
Change ieee80211_handle_wake_tx_queue to simply schedule the kthread 
without doing anything else.
This is how mt76 handles tx scheduling in the driver, and it works quite 
well.
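
As a rough userspace model of that split (not mt76 or mac80211 code; 
wake_op(), worker_run_once(), worker_kick and NUM_HW_QUEUES are 
hypothetical names chosen for this sketch): the wake op only records that 
work is pending, and all scheduling happens in one worker body. Here the 
worker body is called directly; in a real implementation it would be the 
loop of the single thread, which sleeps until kicked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_HW_QUEUES 4  /* assumption for this sketch */

atomic_bool worker_kick;          /* set by the wake op, consumed by the worker */
int queue_rounds[NUM_HW_QUEUES];  /* rounds per hw queue, touched by the worker only */

/* Stand-in for the wake op: request a worker run, schedule nothing here. */
void wake_op(void)
{
    atomic_store(&worker_kick, true);
    /* A real implementation would also wake the sleeping thread here. */
}

/* One pass of the worker body. Returns false if no run was requested.
 * Because only the worker calls this, rounds can never overlap. */
bool worker_run_once(void)
{
    if (!atomic_exchange(&worker_kick, false))
        return false;

    for (int q = 0; q < NUM_HW_QUEUES; q++)
        queue_rounds[q]++;  /* the scheduling round for queue q goes here */
    return true;
}
```

Several wake_op() calls that arrive before the worker runs collapse into 
a single pass over all queues, so concurrent callers can no longer start 
overlapping scheduling rounds.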

- Felix


Thread overview: 20+ messages
2023-03-04 16:24 [Regression] rt2800usb - Wifi performance issues and connection drops Linux regression tracking (Thorsten Leemhuis)
2023-03-05 17:25 ` Thorsten Leemhuis
2023-03-05 22:05   ` Alexander Wetzel
2023-03-07 20:54     ` Alexander Wetzel
2023-03-07 22:31       ` Thomas Mann
2023-03-08  7:13         ` Alexander Wetzel
2023-03-08 10:26           ` Thomas Mann
2023-03-08 12:10             ` Alexander Wetzel
2023-03-08 12:29             ` Thomas Mann
2023-03-08  7:52       ` Felix Fietkau
2023-03-08 11:41         ` Alexander Wetzel
2023-03-08 11:57           ` Felix Fietkau
2023-03-08 12:21             ` Linux regression tracking (Thorsten Leemhuis)
2023-03-08 16:50               ` Alexander Wetzel
2023-03-09  7:59                 ` Linux regression tracking (Thorsten Leemhuis)
2023-03-09 22:13             ` Alexander Wetzel
2023-03-11 21:26               ` Alexander Wetzel
2023-03-12  8:58                 ` Felix Fietkau
2023-03-09 17:00       ` Alexander Wetzel
2023-03-09 17:29         ` Thomas Mann
