linux-wireless.vger.kernel.org archive mirror
* [RFT] ath9k: multi-rate-retry fails at HW level
@ 2020-10-23 14:06 Zefir Kurtisi
  2020-11-24 14:45 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 6+ messages in thread
From: Zefir Kurtisi @ 2020-10-23 14:06 UTC (permalink / raw)
  To: linux-wireless; +Cc: Felix Fietkau, qca-developer-program

[-- Attachment #1: Type: text/plain, Size: 2028 bytes --]

Hi,

I am running into a strange issue with ath9k operating a 9590 device. To me it
looks like a HW issue, but since work on rate controllers has been going on for
decades, I can hardly imagine this has never shown up before.

The issue observed is this: the TX status descriptors never report rateindex 1, it
is always 0, 2, or 3, but never 1.

I noticed this by overriding the rate configuration provided by minstrel with a
static setup, e.g. (7,3)(5,3)(3,3)(1,3), all MCS. The device operates as an iperf
client to a connected AP and continuously transmits data. Meanwhile, the
attenuation between the endpoints is gradually increased, expecting to see a
gradual shift in the reported TX status rateindex from 0 to 3. But nada, the
values reported are 0, 2, and 3 - never 1.

I double checked that the TX descriptors are correctly set with the rates and
retry counts - all looking sane.

More obviously, after changing the rate configuration to (7,3)(1,3)(5,3)(3,3) the
expectation would be to see only 0 or 1 reported as rateidx, since the
transmission ought to succeed at the lowest rate or never. Again, all rates
are reported but 1.

Now the question for me is: what exactly is the HW doing with such a
configuration? Is it skipping the second rate, or is it just reporting it wrongly?

Both possibilities have great impact, since upper layers (like airtime) use the
returned rateidx to calculate and configure operating parameters at runtime.


If this is a known issue, never mind and thanks for pointing me to it. Otherwise,
if some of you have the named device operational, it would help a lot to get the
issue confirmed. Just apply the attached patch and perform some TX testing in
a setup with either adjustable attenuation or varying link conditions. Whenever
a frame is reported to have been transmitted at a rateidx > 0, the collected
stats are logged, e.g.
MRR: 2: [51029, 0, 4741, 6454]

In essence, the failure is confirmed if the counter for 1 is 0 or very low
compared to higher numbers for 0, 2, or 3.
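
To make the criterion above concrete, here is a minimal userspace sketch in C
that parses one logged "MRR:" line and flags the case where the counter for 1
is suspiciously empty. The helper names and the 1% threshold are assumptions
picked for illustration; they are not part of the patch.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MRR_SLOTS 4

/* Parse "MRR: <idx>: [a, b, c, d]" anywhere in a log line into hist[]. */
static bool mrr_parse(const char *line, unsigned int hist[MRR_SLOTS])
{
	unsigned int idx;
	const char *p = strstr(line, "MRR:");

	if (!p)
		return false;
	return sscanf(p, "MRR: %u: [%u, %u, %u, %u]", &idx,
		      &hist[0], &hist[1], &hist[2], &hist[3]) == 5;
}

/*
 * Assumed criterion (not from the patch): slot 1 counts as "skipped" when
 * slots 2 and 3 saw traffic but slot 1 has under 1% of their combined count.
 */
static bool mrr_slot1_skipped(const unsigned int hist[MRR_SLOTS])
{
	unsigned long long later = (unsigned long long)hist[2] + hist[3];

	return later > 0 && (unsigned long long)hist[1] * 100 < later;
}
```

Fed the sample line above, mrr_parse() fills the four counters and
mrr_slot1_skipped() returns true. A log with no traffic in slots 2 and 3 is
deliberately not flagged, since it says nothing about slot 1.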



Cheers,
Zefir

[-- Attachment #2: 0001-ath9k-count-TX-successes-at-rate.patch --]
[-- Type: text/x-patch, Size: 1044 bytes --]

From 1548245968a97592b39abe1867106a22a30250c8 Mon Sep 17 00:00:00 2001
From: Zefir Kurtisi <zefir.kurtisi@westermo.com>
Date: Fri, 23 Oct 2020 14:31:54 +0200
Subject: [PATCH] ath9k: count TX successes at rate

---
 drivers/net/wireless/ath/ath9k/xmit.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/wireless/ath/ath9k/xmit.c b/drivers/net/wireless/ath/ath9k/xmit.c
index afb6a2f..de87ce7 100644
--- a/drivers/net/wireless/ath/ath9k/xmit.c
+++ b/drivers/net/wireless/ath/ath9k/xmit.c
@@ -3074,6 +3074,13 @@ void ath_tx_edma_tasklet(struct ath_softc *sc)
 			ath_dbg(common, XMIT, "Error processing tx status\n");
 			break;
 		}
+		/* Count successful TXes per rateidx; log stats whenever rateidx > 0. */
+		static u32 rthist[IEEE80211_TX_MAX_RATES];
+		rthist[ts.ts_rateindex]++;
+		if (ts.ts_rateindex)
+			pr_info("MRR: %u: [%u, %u, %u, %u]\n", ts.ts_rateindex,
+				rthist[0], rthist[1], rthist[2], rthist[3]);
+
 
 		/* Process beacon completions separately */
 		if (ts.qid == sc->beacon.beaconq) {
-- 
2.17.1



* Re: [RFT] ath9k: multi-rate-retry fails at HW level
  2020-10-23 14:06 [RFT] ath9k: multi-rate-retry fails at HW level Zefir Kurtisi
@ 2020-11-24 14:45 ` Toke Høiland-Jørgensen
  2020-11-27 15:38   ` Zefir Kurtisi
  0 siblings, 1 reply; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-11-24 14:45 UTC (permalink / raw)
  To: Zefir Kurtisi, linux-wireless; +Cc: Felix Fietkau, qca-developer-program

Zefir Kurtisi <zefku@westermo.com> writes:

> Hi,
>
> I am running into a strange issue with the ath9k operating a 9590
> device which to me seems like a HW issue, but since work on rate
> controllers is already going for decades, I hardly can imagine this
> never showed up.
>
> The issue observed is this: the TX status descriptors never report
> rateindex 1, it is always 0, 2, or 3, but never 1.
>
> I noticed this by overwriting the rate configuration provided by
> minstrel to a static setup, e.g. (7,3)(5,3)(3,3)(1,3), all MCS. The
> device operates as iperf client to a connected AP and continuously
> transmits data. While at that, the attenuation between the endpoints
> is gradually increased, expecting to see a gradual shift in the
> reported TX status rateindex from 0 to 3. But nada, the values
> reported are 0,2, and 3 - never 1.
>
> I double checked that the TX descriptors are correctly set with the
> rates and retry counts - all looking sane.
>
> More obvious, after changing the rate configuration to
> (7,3)(1,3)(5,3)(3,3) the expectation would be to have either 0 or 1
> reported as rateidx, since the transmission ought to be successful
> with the lowest rate or never. Again all rates are reported but 1.
>
> Now the question for me is: what is the HW exactly doing with such a
> configuration? Is it skipping the second rate, or is it just reporting
> wrong?

You should be able to see this by looking at the rates the frames are
being sent at, shouldn't you?

> Both possibilities have great impact, since upper layers (like
> airtime) use the returned rateidx to calculate and configure operating
> parameters at runtime.

Have you actually observed any issues from this? If it's just skipping a
rate, minstrel should still be able to make decisions based on the
actual values returned, no?

> If this is a known issue, never mind and thanks for pointing me to it. Otherwise if
> some of you have the named device operational, it would help a lot to get the
> issue confirmed. Just apply the attached patch and perform some TX testing in
> either attenuation adjustable or varying link condition setups. Whenever a frame
> is reported to have been transmitted at a rateidx > 0, the collected stats are
> logged, e.g.
> MRR: 2: [51029, 0, 4741, 6454]
>
> In essence, the failure is confirmed if the counter for 1 is 0 or very low
> compared to higher numbers for 0, 2, or 3.

Tried your patch and couldn't reproduce. Not the same hardware, though.
Mine is:

01:00.0 Network controller: Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) (rev 01)

-Toke



* Re: [RFT] ath9k: multi-rate-retry fails at HW level
  2020-11-24 14:45 ` Toke Høiland-Jørgensen
@ 2020-11-27 15:38   ` Zefir Kurtisi
  2020-12-01 13:33     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 6+ messages in thread
From: Zefir Kurtisi @ 2020-11-27 15:38 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Zefir Kurtisi, linux-wireless
  Cc: Felix Fietkau, qca-developer-program, Adrian Chadd

CC += adrian

On 24.11.20 15:45, Toke Høiland-Jørgensen wrote:
> Zefir Kurtisi <zefku@westermo.com> writes:
> 
> You should be able to see this by looking at the rates the frames are
> being sent at, shouldn't you?
> 
Yes, I did that, and it indicates that the second rate is simply skipped.

Here are some use cases and their sniffing results. The setup is an 11ng STA
connected to an AP, with the attenuation adjusted such that MCS 7 fails while
MCS 5 and below succeed. A monitor is sniffing while a single ping is sent from
AP to STA.

With a rate configuration of (7/2)(3/2)(1/2) we get:
14:02:42.923880 9481489761us tsft 2412 MHz 11n -68dBm signal 65.0 Mb/s MCS 7 20 MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV:  e Pad 20 KeyID 0
14:02:42.923909 9481490037us tsft 2412 MHz 11n -69dBm signal 65.0 Mb/s MCS 7 20 MHz long GI RX-STBC0 -69dBm signal antenna 0 Data IV:  e Pad 20 KeyID 0
14:02:42.925244 9481491044us tsft 2412 MHz 11n -68dBm signal 13.0 Mb/s MCS 1 20 MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV:  e Pad 20 KeyID 0


with (7/2)(1/2)(3/2):
13:59:37.073147 9295637087us tsft 2412 MHz 11n -69dBm signal 65.0 Mb/s MCS 7 20 MHz long GI RX-STBC0 -69dBm signal antenna 0 Data IV:  c Pad 20 KeyID 0
13:59:37.073467 9295637438us tsft 2412 MHz 11n -69dBm signal 65.0 Mb/s MCS 7 20 MHz long GI RX-STBC0 -69dBm signal antenna 0 Data IV:  c Pad 20 KeyID 0
13:59:37.074591 9295638498us tsft 2412 MHz 11n -68dBm signal 26.0 Mb/s MCS 3 20 MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV:  c Pad 20 KeyID 0

and with (7/2)(3/2):
14:04:27.269806 9585836783us tsft 2412 MHz 11n -69dBm signal 65.0 Mb/s MCS 7 20 MHz long GI RX-STBC0 -69dBm signal antenna 0 Data IV: 10 Pad 20 KeyID 0
14:04:27.270342 9585837344us tsft 2412 MHz 11n -68dBm signal 65.0 Mb/s MCS 7 20 MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV: 10 Pad 20 KeyID 0
14:04:27.271368 9585838370us tsft 2412 MHz 11n -68dBm signal 65.0 Mb/s MCS 7 20 MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV: 10 Pad 20 KeyID 0
[..]

a total of 14 attempts at MCS 7 with the ping finally failing.
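
The skip hypothesis can be checked against these captures with a small model.
The sketch below walks a retry chain of (MCS/tries) entries and records the MCS
of every attempt, optionally never attempting the second entry. All names are
hypothetical, the "MCS 5 and below succeed" threshold is taken from the setup
described above, and the model covers one pass through the chain only; it does
not try to explain why 14 attempts show up in the last capture.

```c
#include <stdbool.h>
#include <stddef.h>

struct mrr_entry {
	int mcs;	/* MCS index used by this chain entry */
	int tries;	/* attempts configured for this entry */
};

/*
 * Simulate one pass through the chain. Every attempt's MCS is written to
 * out[]; an attempt succeeds when its MCS is <= max_ok_mcs. With skip_slot1
 * set, chain[1] is never attempted (the suspected behaviour).
 * Returns the number of attempts put on air.
 */
static size_t mrr_simulate(const struct mrr_entry *chain, size_t n,
			   int max_ok_mcs, bool skip_slot1,
			   int *out, size_t out_len)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		if (skip_slot1 && i == 1)
			continue;
		for (int t = 0; t < chain[i].tries; t++) {
			if (used == out_len)
				return used;
			out[used++] = chain[i].mcs;
			if (chain[i].mcs <= max_ok_mcs)
				return used;	/* delivered, stop retrying */
		}
	}
	return used;	/* chain exhausted without success */
}
```

With skip_slot1 set and max_ok_mcs = 5, (7/2)(3/2)(1/2) yields MCS 7, 7, 1 and
(7/2)(1/2)(3/2) yields MCS 7, 7, 3, matching the first two captures. For
(7/2)(3/2) the model produces MCS 7 attempts only and no success, which is
consistent with the third capture showing nothing but MCS 7.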

>> Both possibilities have great impact, since upper layers (like
>> airtime) use the returned rateidx to calculate and configure operating
>> parameters at runtime.
> 
> Have you actually observed any issues from this? If it's just skipping a
> rate, minstrel should still be able to make decisions based on the
> actual values returned, no?
> 
The issues arise from the fact that the device reports a
(tx-rateindex/tx-attempt-index) per TX descriptor, leaving the driver to calculate
what was put on air based on these two values. If one had rates set to
(7/2)(3/7)(1/2) and the TX status reports (tx-rateindex=2/tx-attempt-index=0),
the driver assumes there were 10 attempts in total (2 + 7 + 1), while in fact
there were 3 (2 + 1) when the second rate is skipped. What direct effect this
has on RC I can't grasp, but it definitely falsifies statistics.

Same goes for airtime: check how this falsifies its calculation in
ath_tx_count_airtime().
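
As a rough illustration of how far both numbers drift, the sketch below
reconstructs attempts and airtime from a (tx-rateindex/tx-attempt-index) pair,
once assuming every earlier rate was fully used and once assuming the second
rate was skipped. It is not the logic of ath_tx_count_airtime(); the names are
hypothetical, the 1500-byte frame is an assumption, and the duration is a crude
payload-only estimate (no preamble, ACK or interframe spacing) using the PHY
rates seen in the captures (65.0, 26.0 and 13.0 Mb/s for MCS 7, 3 and 1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mrr_slot {
	unsigned int rate_100kbps;	/* PHY rate, e.g. 650 for 65.0 Mb/s */
	unsigned int tries;		/* attempts configured for this slot */
};

struct mrr_account {
	unsigned int attempts;		/* frames believed to be on air */
	unsigned int airtime_us;	/* payload-only airtime estimate */
};

/* Payload duration of one attempt in microseconds (integer division). */
static unsigned int slot_dur_us(const struct mrr_slot *s,
				unsigned int frame_bits)
{
	return frame_bits * 10 / s->rate_100kbps;
}

/*
 * Rebuild what went on air from the final slot index and the attempt index
 * within that slot. With slot1_skipped set, slot 1 contributes nothing.
 */
static struct mrr_account mrr_reconstruct(const struct mrr_slot *s, size_t n,
					  size_t final_idx,
					  unsigned int final_attempt_idx,
					  bool slot1_skipped,
					  unsigned int frame_bits)
{
	struct mrr_account acc = { 0, 0 };
	unsigned int final_tries = final_attempt_idx + 1;

	assert(final_idx < n);
	for (size_t i = 0; i < final_idx; i++) {
		if (slot1_skipped && i == 1)
			continue;
		acc.attempts += s[i].tries;
		acc.airtime_us += s[i].tries * slot_dur_us(&s[i], frame_bits);
	}
	acc.attempts += final_tries;
	acc.airtime_us += final_tries * slot_dur_us(&s[final_idx], frame_bits);
	return acc;
}
```

For (7/2)(3/7)(1/2) with a status of (2/0) and 12000 frame bits, the naive
reconstruction gives 10 attempts and 4518 us, while the skipped-slot variant
gives 3 attempts and 1291 us. Under these assumptions the airtime is
overestimated by about a factor of 3.5.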

Also, the above-mentioned case is an immediately visible issue: if RC provides
two rates, e.g. (7/3)(5/3), of which the first is too high and the second is not
even attempted, frames don't make it through.

>> If this is a known issue, never mind and thanks for pointing me to it. Otherwise if
>> some of you have the named device operational, it would help a lot to get the
>> issue confirmed. Just apply the attached patch and perform some TX testing in
>> either attenuation adjustable or varying link condition setups. Whenever a frame
>> is reported to have been transmitted at a rateidx > 0, the collected stats are
>> logged, e.g.
>> MRR: 2: [51029, 0, 4741, 6454]
>>
>> In essence, the failure is confirmed if the counter for 1 is 0 or very low
>> compared to higher numbers for 0, 2, or 3.
> 
> Tried your patch and couldn't reproduce. Not the same hardware, though.
> Mine is:
> 
> 01:00.0 Network controller: Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) (rev 01)
> 
> -Toke
> 

Thanks a lot for trying, let's see if someone else has the affected variant still
in use.


Cheers,
Zefir


* Re: [RFT] ath9k: multi-rate-retry fails at HW level
  2020-11-27 15:38   ` Zefir Kurtisi
@ 2020-12-01 13:33     ` Toke Høiland-Jørgensen
  2020-12-11  9:00       ` Zefir Kurtisi
  0 siblings, 1 reply; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-12-01 13:33 UTC (permalink / raw)
  To: Zefir Kurtisi, Zefir Kurtisi, linux-wireless
  Cc: Felix Fietkau, qca-developer-program, Adrian Chadd

Zefir Kurtisi <zefir.kurtisi@westermo.com> writes:

>>> Both possibilities have great impact, since upper layers (like
>>> airtime) use the returned rateidx to calculate and configure operating
>>> parameters at runtime.
>> 
>> Have you actually observed any issues from this? If it's just skipping a
>> rate, minstrel should still be able to make decisions based on the
>> actual values returned, no?
>> 
> The issues arise from the fact that the driver reports a
> (tx-rateindex/tx-attempt-index) per TX descriptor, leaving the driver to calculate
> what was put on air based on these two values. If one had rates set to
> (7/2)(3/7)(1/2) and the TX status reports (tx-rateindex=2/tx-attempt-index=0),
> driver assumes there were 10 attempts in total while in fact they were 3 when the
> second rate is skipped. What direct effect this has on RC I can't grasp, but it
> definitively falsifies statistics.
>
> Same goes for airtime: check how this falsifies its calculation in
> ath_tx_count_airtime().

Ah, right, I was assuming that rates[1].count would be reset to zero
somehow. Have you confirmed that the attempts actually go up in the
Minstrel stats for the skipped rate?

> Also, the above mentioned is an immediate visible issue: if RC
> provides two rates e.g. (7/3)(5/3) of which the first is too high and
> the second is not even attempted, frames don't make it through.

Yeah, rate control would likely take longer to converge to the right
rate. I suppose that if this is a hardware model-specific issue, a quirks
bit could be added to instruct Minstrel to disregard the second index.
But it does sound a bit odd; have you verified that it's consistent on
different units of the same model (and not just a busted device)?

-Toke



* Re: [RFT] ath9k: multi-rate-retry fails at HW level
  2020-12-01 13:33     ` Toke Høiland-Jørgensen
@ 2020-12-11  9:00       ` Zefir Kurtisi
  2020-12-11 10:37         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 6+ messages in thread
From: Zefir Kurtisi @ 2020-12-11  9:00 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Zefir Kurtisi, linux-wireless
  Cc: Felix Fietkau, qca-developer-program, Adrian Chadd

On 01.12.20 14:33, Toke Høiland-Jørgensen wrote:
> Zefir Kurtisi <zefir.kurtisi@westermo.com> writes:
> 
>> Also, the above mentioned is an immediate visible issue: if RC
>> provides two rates e.g. (7/3)(5/3) of which the first is too high and
>> the second is not even attempted, frames don't make it through.
> 
> Yeah, rate control would likely take longer to converge to the right
> rate. I suppose if this is a hardware model-specific issue that a quirks
> bit could be added to instruct Minstrel to disregard the second index.
> But it does sound a bit odd; have you verified that it's consistent on
> different units of the same model (and not just a busted device)?
> 

False alarm.

We got confirmation that the observed failure does not happen with that exact
chip revision on a different platform. It still might be a HW issue specific to
our rarely used PPC platform, but it is not an ath9k malfunction. I'll dig
further into that and report back if it is relevant for the list.

Thanks Toke for the feedback and insights, and sorry for the noise.


Cheers,
Zefir



* Re: [RFT] ath9k: multi-rate-retry fails at HW level
  2020-12-11  9:00       ` Zefir Kurtisi
@ 2020-12-11 10:37         ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-12-11 10:37 UTC (permalink / raw)
  To: Zefir Kurtisi, Zefir Kurtisi, linux-wireless
  Cc: Felix Fietkau, qca-developer-program, Adrian Chadd

Zefir Kurtisi <zefir.kurtisi@westermo.com> writes:

> False alarm.
>
> We got confirmation that the observed failure with that exact chip
> revision is not happening on a different platform. It still might be a
> HW issue specific to our rarely used PPC platform, but it is not an
> ath9k malfunction. I'll dig further into that and report back if it is
> relevant for the list.
>
> Thanks Toke for the feedback and insights and sorry for noise.

You're welcome, and great to hear that you got closer to a resolution :)

-Toke



end of thread, other threads:[~2020-12-11 10:40 UTC | newest]

Thread overview: 6+ messages
2020-10-23 14:06 [RFT] ath9k: multi-rate-retry fails at HW level Zefir Kurtisi
2020-11-24 14:45 ` Toke Høiland-Jørgensen
2020-11-27 15:38   ` Zefir Kurtisi
2020-12-01 13:33     ` Toke Høiland-Jørgensen
2020-12-11  9:00       ` Zefir Kurtisi
2020-12-11 10:37         ` Toke Høiland-Jørgensen
