* [PATCH] ath10k:  Restart xmit queues below low-water mark.
@ 2020-04-27 14:54 greearb
  2020-04-28 14:56 ` Steve deRosier
  2020-04-28 19:37 ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 27+ messages in thread
From: greearb @ 2020-04-27 14:54 UTC (permalink / raw)
  To: linux-wireless; +Cc: Ben Greear

From: Ben Greear <greearb@candelatech.com>

While running tcp upload + download tests with ~200
concurrent TCP streams, 1-2 processes, and 30 station
vdevs, I noticed that the __ieee80211_stop_queue was taking
around 20% of the CPU according to perf-top, with other locking
taking an additional ~15%.

I believe the issue is that the ath10k driver would unlock the
txqueue when a single frame could be transmitted, instead of
waiting for a low water mark.

So, this patch adds a low-water mark that is 1/4 of the total
tx buffers allowed.

This appears to resolve the performance problem that I saw.

Tested with recent wave-1 ath10k-ct firmware.

Signed-off-by: Ben Greear <greearb@candelatech.com>
---
 drivers/net/wireless/ath/ath10k/htt.h    | 1 +
 drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
index 31c4ddbf45cb..b5634781c0dc 100644
--- a/drivers/net/wireless/ath/ath10k/htt.h
+++ b/drivers/net/wireless/ath/ath10k/htt.h
@@ -1941,6 +1941,7 @@ struct ath10k_htt {
 
 	u8 target_version_major;
 	u8 target_version_minor;
+	bool needs_unlock;
 	struct completion target_version_received;
 	u8 max_num_amsdu;
 	u8 max_num_ampdu;
diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index 9b3c3b080e92..44795d9a7c0c 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
 	lockdep_assert_held(&htt->tx_lock);
 
 	htt->num_pending_tx--;
-	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
+	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
+		htt->needs_unlock = false;
 		ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
+	}
 }
 
 int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
@@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
 		return -EBUSY;
 
 	htt->num_pending_tx++;
-	if (htt->num_pending_tx == htt->max_num_pending_tx)
+	if (htt->num_pending_tx == htt->max_num_pending_tx) {
+		htt->needs_unlock = true;
 		ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
+	}
 
 	return 0;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-27 14:54 [PATCH] ath10k: Restart xmit queues below low-water mark greearb
@ 2020-04-28 14:56 ` Steve deRosier
  2020-04-28 14:58   ` Ben Greear
  2020-04-28 16:27   ` Dave Taht
  2020-04-28 19:37 ` Toke Høiland-Jørgensen
  1 sibling, 2 replies; 27+ messages in thread
From: Steve deRosier @ 2020-04-28 14:56 UTC (permalink / raw)
  To: Ben Greear, dave.taht; +Cc: linux-wireless

On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
>
> From: Ben Greear <greearb@candelatech.com>
>
> While running tcp upload + download tests with ~200
> concurrent TCP streams, 1-2 processes, and 30 station
> vdevs, I noticed that the __ieee80211_stop_queue was taking
> around 20% of the CPU according to perf-top, with other locking
> taking an additional ~15%.
>
> I believe the issue is that the ath10k driver would unlock the
> txqueue when a single frame could be transmitted, instead of
> waiting for a low water mark.
>
> So, this patch adds a low-water mark that is 1/4 of the total
> tx buffers allowed.
>
> This appears to resolve the performance problem that I saw.
>
> Tested with recent wave-1 ath10k-ct firmware.
>

Hey Ben,

Did you do any testing with this patch around latency?  The nature of
the thing that you fixed makes me wonder if it was intentional with
respect to making WiFi fast - ie getting rid of buffers as much as
possible.  Obviously the CPU impact is likely to be an unintended
consequence. In any case, I don't know anything for sure, it was just
a thought that went through my head when reading this.


> Signed-off-by: Ben Greear <greearb@candelatech.com>
> ---
>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> index 31c4ddbf45cb..b5634781c0dc 100644
> --- a/drivers/net/wireless/ath/ath10k/htt.h
> +++ b/drivers/net/wireless/ath/ath10k/htt.h
> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>
>         u8 target_version_major;
>         u8 target_version_minor;
> +       bool needs_unlock;
>         struct completion target_version_received;
>         u8 max_num_amsdu;
>         u8 max_num_ampdu;
> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> index 9b3c3b080e92..44795d9a7c0c 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>         lockdep_assert_held(&htt->tx_lock);
>
>         htt->num_pending_tx--;
> -       if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> +       if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
> +               htt->needs_unlock = false;
>                 ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> +       }
>  }
>
>  int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>                 return -EBUSY;
>
>         htt->num_pending_tx++;
> -       if (htt->num_pending_tx == htt->max_num_pending_tx)
> +       if (htt->num_pending_tx == htt->max_num_pending_tx) {
> +               htt->needs_unlock = true;
>                 ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> +       }
>
>         return 0;
>  }
> --
> 2.20.1
>


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 14:56 ` Steve deRosier
@ 2020-04-28 14:58   ` Ben Greear
  2020-04-28 19:36     ` Toke Høiland-Jørgensen
  2020-04-28 16:27   ` Dave Taht
  1 sibling, 1 reply; 27+ messages in thread
From: Ben Greear @ 2020-04-28 14:58 UTC (permalink / raw)
  To: Steve deRosier, dave.taht; +Cc: linux-wireless



On 04/28/2020 07:56 AM, Steve deRosier wrote:
> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
>>
>> From: Ben Greear <greearb@candelatech.com>
>>
>> While running tcp upload + download tests with ~200
>> concurrent TCP streams, 1-2 processes, and 30 station
>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>> around 20% of the CPU according to perf-top, with other locking
>> taking an additional ~15%.
>>
>> I believe the issue is that the ath10k driver would unlock the
>> txqueue when a single frame could be transmitted, instead of
>> waiting for a low water mark.
>>
>> So, this patch adds a low-water mark that is 1/4 of the total
>> tx buffers allowed.
>>
>> This appears to resolve the performance problem that I saw.
>>
>> Tested with recent wave-1 ath10k-ct firmware.
>>
>
> Hey Ben,
>
> Did you do any testing with this patch around latency?  The nature of
> the thing that you fixed makes me wonder if it was intentional with
> respect to making WiFi fast - ie getting rid of buffers as much as
> possible.  Obviously the CPU impact is likely to be an unintended
> consequence. In any case, I don't know anything for sure, it was just
> a thought that went through my head when reading this.

I did not, but on average my patch should make the queues less full,
so I doubt it will hurt latency.

Thanks,
Ben

>
>
>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>> ---
>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>> index 31c4ddbf45cb..b5634781c0dc 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>
>>         u8 target_version_major;
>>         u8 target_version_minor;
>> +       bool needs_unlock;
>>         struct completion target_version_received;
>>         u8 max_num_amsdu;
>>         u8 max_num_ampdu;
>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> index 9b3c3b080e92..44795d9a7c0c 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>         lockdep_assert_held(&htt->tx_lock);
>>
>>         htt->num_pending_tx--;
>> -       if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>> +       if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>> +               htt->needs_unlock = false;
>>                 ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>> +       }
>>  }
>>
>>  int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>> @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>>                 return -EBUSY;
>>
>>         htt->num_pending_tx++;
>> -       if (htt->num_pending_tx == htt->max_num_pending_tx)
>> +       if (htt->num_pending_tx == htt->max_num_pending_tx) {
>> +               htt->needs_unlock = true;
>>                 ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>> +       }
>>
>>         return 0;
>>  }
>> --
>> 2.20.1
>>
>

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 14:56 ` Steve deRosier
  2020-04-28 14:58   ` Ben Greear
@ 2020-04-28 16:27   ` Dave Taht
  2020-04-28 16:35     ` Ben Greear
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Taht @ 2020-04-28 16:27 UTC (permalink / raw)
  To: Steve deRosier; +Cc: Ben Greear, linux-wireless

On Tue, Apr 28, 2020 at 7:56 AM Steve deRosier <derosier@gmail.com> wrote:
>
> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
> >
> > From: Ben Greear <greearb@candelatech.com>
> >
> > While running tcp upload + download tests with ~200
> > concurrent TCP streams, 1-2 processes, and 30 station
> > vdevs, I noticed that the __ieee80211_stop_queue was taking
> > around 20% of the CPU according to perf-top, with other locking
> > taking an additional ~15%.
> >
> > I believe the issue is that the ath10k driver would unlock the
> > txqueue when a single frame could be transmitted, instead of
> > waiting for a low water mark.
> >
> > So, this patch adds a low-water mark that is 1/4 of the total
> > tx buffers allowed.
> >
> > This appears to resolve the performance problem that I saw.
> >
> > Tested with recent wave-1 ath10k-ct firmware.
> >
>
> Hey Ben,
>
> Did you do any testing with this patch around latency?  The nature of
> the thing that you fixed makes me wonder if it was intentional with
> respect to making WiFi fast - ie getting rid of buffers as much as
> possible.  Obviously the CPU impact is likely to be an unintended
> consequence. In any case, I don't know anything for sure, it was just
> a thought that went through my head when reading this.

I note that I'd prefer a BQL-like high/low watermark approach in
general... bytes, not packets, or better, perceived
airtime in a revision of AQL...

... but we'll try this patch, thx!

>
> > Signed-off-by: Ben Greear <greearb@candelatech.com>
> > ---
> >  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
> >  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
> >  2 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> > index 31c4ddbf45cb..b5634781c0dc 100644
> > --- a/drivers/net/wireless/ath/ath10k/htt.h
> > +++ b/drivers/net/wireless/ath/ath10k/htt.h
> > @@ -1941,6 +1941,7 @@ struct ath10k_htt {
> >
> >         u8 target_version_major;
> >         u8 target_version_minor;
> > +       bool needs_unlock;
> >         struct completion target_version_received;
> >         u8 max_num_amsdu;
> >         u8 max_num_ampdu;
> > diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> > index 9b3c3b080e92..44795d9a7c0c 100644
> > --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> > +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> > @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
> >         lockdep_assert_held(&htt->tx_lock);
> >
> >         htt->num_pending_tx--;
> > -       if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> > +       if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
> > +               htt->needs_unlock = false;
> >                 ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> > +       }
> >  }
> >
> >  int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> > @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> >                 return -EBUSY;
> >
> >         htt->num_pending_tx++;
> > -       if (htt->num_pending_tx == htt->max_num_pending_tx)
> > +       if (htt->num_pending_tx == htt->max_num_pending_tx) {
> > +               htt->needs_unlock = true;
> >                 ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> > +       }
> >
> >         return 0;
> >  }
> > --
> > 2.20.1
> >



-- 
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 16:27   ` Dave Taht
@ 2020-04-28 16:35     ` Ben Greear
  2020-04-28 17:10       ` Dave Taht
  2020-04-28 19:43       ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 27+ messages in thread
From: Ben Greear @ 2020-04-28 16:35 UTC (permalink / raw)
  To: Dave Taht, Steve deRosier; +Cc: linux-wireless



On 04/28/2020 09:27 AM, Dave Taht wrote:
> On Tue, Apr 28, 2020 at 7:56 AM Steve deRosier <derosier@gmail.com> wrote:
>>
>> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
>>>
>>> From: Ben Greear <greearb@candelatech.com>
>>>
>>> While running tcp upload + download tests with ~200
>>> concurrent TCP streams, 1-2 processes, and 30 station
>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>> around 20% of the CPU according to perf-top, with other locking
>>> taking an additional ~15%.
>>>
>>> I believe the issue is that the ath10k driver would unlock the
>>> txqueue when a single frame could be transmitted, instead of
>>> waiting for a low water mark.
>>>
>>> So, this patch adds a low-water mark that is 1/4 of the total
>>> tx buffers allowed.
>>>
>>> This appears to resolve the performance problem that I saw.
>>>
>>> Tested with recent wave-1 ath10k-ct firmware.
>>>
>>
>> Hey Ben,
>>
>> Did you do any testing with this patch around latency?  The nature of
>> the thing that you fixed makes me wonder if it was intentional with
>> respect to making WiFi fast - ie getting rid of buffers as much as
>> possible.  Obviously the CPU impact is likely to be an unintended
>> consequence. In any case, I don't know anything for sure, it was just
>> a thought that went through my head when reading this.
>
> I note that I'd prefer a BQL-like high/low watermark approach in
> general... bytes, not packets, or better, perceived
> airtime in a revision of AQL...
>
> ... but we'll try this patch, thx!

Is there a nice diagram somewhere that shows where the various
buffer-bloat technologies sit in the stack?  I suspect they sit
above the txqueue start/stop logic in the driver that I mucked
with, and certainly the old behaviour has no obvious tie-in with
any higher-level algorithms.

I doubt my patch will change much except in pathological cases where
the system is transmitting frames fast enough to fully fill the tx
buffers while processing only a few tx frames at a time, so the queue
keeps bouncing between full and not-full over and over.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 16:35     ` Ben Greear
@ 2020-04-28 17:10       ` Dave Taht
  2020-04-28 19:43       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 27+ messages in thread
From: Dave Taht @ 2020-04-28 17:10 UTC (permalink / raw)
  To: Ben Greear; +Cc: Steve deRosier, linux-wireless, Make-Wifi-fast

On Tue, Apr 28, 2020 at 9:35 AM Ben Greear <greearb@candelatech.com> wrote:
>
>
>
> On 04/28/2020 09:27 AM, Dave Taht wrote:
> > On Tue, Apr 28, 2020 at 7:56 AM Steve deRosier <derosier@gmail.com> wrote:
> >>
> >> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
> >>>
> >>> From: Ben Greear <greearb@candelatech.com>
> >>>
> >>> While running tcp upload + download tests with ~200
> >>> concurrent TCP streams, 1-2 processes, and 30 station
> >>> vdevs, I noticed that the __ieee80211_stop_queue was taking
> >>> around 20% of the CPU according to perf-top, with other locking
> >>> taking an additional ~15%.
> >>>
> >>> I believe the issue is that the ath10k driver would unlock the
> >>> txqueue when a single frame could be transmitted, instead of
> >>> waiting for a low water mark.
> >>>
> >>> So, this patch adds a low-water mark that is 1/4 of the total
> >>> tx buffers allowed.
> >>>
> >>> This appears to resolve the performance problem that I saw.
> >>>
> >>> Tested with recent wave-1 ath10k-ct firmware.
> >>>
> >>
> >> Hey Ben,
> >>
> >> Did you do any testing with this patch around latency?  The nature of
> >> the thing that you fixed makes me wonder if it was intentional with
> >> respect to making WiFi fast - ie getting rid of buffers as much as
> >> possible.  Obviously the CPU impact is likely to be an unintended
> >> consequence. In any case, I don't know anything for sure, it was just
> >> a thought that went through my head when reading this.
> >
> > I note that I'd prefer a BQL-like high/low watermark approach in
> > general... bytes, not packets, or better, perceived
> > airtime in a revision of AQL...
> >
> > ... but we'll try this patch, thx!
>
> Is there a nice diagram somewhere that shows where the various
> buffer-bloat technologies sit in the stack?  I suspect such should
> be above the txqueue start/stop logic in the driver that I mucked
> with, and certainly the old behaviour has no obvious tie-in with
> any higher-level algorithms.

There are some good diagrams of the new queue stuff buried in toke's
book and online papers, notably "ending the anomaly"

https://bufferbloat-and-beyond.net/

Plug: They just did a print run, and it makes for good bathroom
reading. There's also a preso on it around here somewhere.

That said, let's see... there's some rants in this:
http://flent-fremont.bufferbloat.net/~d/broadcom_aug9.pdf and here
... https://conferences.sigcomm.org/sigcomm/2014/doc/slides/137.pdf
but that's mostly about what was wrong at the time.

Definitely! Revising this piece would be a good idea in light of
modern developments and increased knowledge.
https://www.linuxjournal.com/content/queueing-linux-network-stack

IMHO "how to use AQL" is underdocumented at the moment. I'd hoped to
produce some documentation after we successfully got the iwl drivers
to use it, but we haven't got around to it, and merely getting the
ath10k to use it (really really) well has eaten into my ax200
time.....


>
> I doubt my patch will change much except in pathological cases where
> the system is transmitting frames fast enough to fully fill the tx
> buffers while processing only a few tx frames at a time, so the queue
> keeps bouncing between full and not-full over and over.
>
> Thanks,
> Ben
>
> --
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com



-- 
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 14:58   ` Ben Greear
@ 2020-04-28 19:36     ` Toke Høiland-Jørgensen
  2020-04-28 19:39       ` Dave Taht
  0 siblings, 1 reply; 27+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-28 19:36 UTC (permalink / raw)
  To: Ben Greear, Steve deRosier, dave.taht; +Cc: linux-wireless

Ben Greear <greearb@candelatech.com> writes:

> On 04/28/2020 07:56 AM, Steve deRosier wrote:
>> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
>>>
>>> From: Ben Greear <greearb@candelatech.com>
>>>
>>> While running tcp upload + download tests with ~200
>>> concurrent TCP streams, 1-2 processes, and 30 station
>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>> around 20% of the CPU according to perf-top, with other locking
>>> taking an additional ~15%.
>>>
>>> I believe the issue is that the ath10k driver would unlock the
>>> txqueue when a single frame could be transmitted, instead of
>>> waiting for a low water mark.
>>>
>>> So, this patch adds a low-water mark that is 1/4 of the total
>>> tx buffers allowed.
>>>
>>> This appears to resolve the performance problem that I saw.
>>>
>>> Tested with recent wave-1 ath10k-ct firmware.
>>>
>>
>> Hey Ben,
>>
>> Did you do any testing with this patch around latency?  The nature of
>> the thing that you fixed makes me wonder if it was intentional with
>> respect to making WiFi fast - ie getting rid of buffers as much as
>> possible.  Obviously the CPU impact is likely to be an unintended
>> consequence. In any case, I don't know anything for sure, it was just
>> a thought that went through my head when reading this.
>
> I did not, but on average my patch should make the queues be less full,
> so I doubt it will hurt latency.

I would tend to agree with that.

-Toke

>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>> ---
>>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>
>>>         u8 target_version_major;
>>>         u8 target_version_minor;
>>> +       bool needs_unlock;
>>>         struct completion target_version_received;
>>>         u8 max_num_amsdu;
>>>         u8 max_num_ampdu;
>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>         lockdep_assert_held(&htt->tx_lock);
>>>
>>>         htt->num_pending_tx--;
>>> -       if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>> +       if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>> +               htt->needs_unlock = false;
>>>                 ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>> +       }
>>>  }
>>>
>>>  int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>>> @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>>>                 return -EBUSY;
>>>
>>>         htt->num_pending_tx++;
>>> -       if (htt->num_pending_tx == htt->max_num_pending_tx)
>>> +       if (htt->num_pending_tx == htt->max_num_pending_tx) {
>>> +               htt->needs_unlock = true;
>>>                 ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>> +       }
>>>
>>>         return 0;
>>>  }
>>> --
>>> 2.20.1
>>>
>>
>
> -- 
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] ath10k:  Restart xmit queues below low-water mark.
  2020-04-27 14:54 [PATCH] ath10k: Restart xmit queues below low-water mark greearb
  2020-04-28 14:56 ` Steve deRosier
@ 2020-04-28 19:37 ` Toke Høiland-Jørgensen
  2020-04-28 19:51   ` Ben Greear
  1 sibling, 1 reply; 27+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-28 19:37 UTC (permalink / raw)
  To: greearb, linux-wireless; +Cc: Ben Greear

greearb@candelatech.com writes:

> From: Ben Greear <greearb@candelatech.com>
>
> While running tcp upload + download tests with ~200
> concurrent TCP streams, 1-2 processes, and 30 station
> vdevs, I noticed that the __ieee80211_stop_queue was taking
> around 20% of the CPU according to perf-top, with other locking
> taking an additional ~15%.
>
> I believe the issue is that the ath10k driver would unlock the
> txqueue when a single frame could be transmitted, instead of
> waiting for a low water mark.
>
> So, this patch adds a low-water mark that is 1/4 of the total
> tx buffers allowed.
>
> This appears to resolve the performance problem that I saw.
>
> Tested with recent wave-1 ath10k-ct firmware.
>
> Signed-off-by: Ben Greear <greearb@candelatech.com>
> ---
>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> index 31c4ddbf45cb..b5634781c0dc 100644
> --- a/drivers/net/wireless/ath/ath10k/htt.h
> +++ b/drivers/net/wireless/ath/ath10k/htt.h
> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>  
>  	u8 target_version_major;
>  	u8 target_version_minor;
> +	bool needs_unlock;
>  	struct completion target_version_received;
>  	u8 max_num_amsdu;
>  	u8 max_num_ampdu;
> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> index 9b3c3b080e92..44795d9a7c0c 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>  	lockdep_assert_held(&htt->tx_lock);
>  
>  	htt->num_pending_tx--;
> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {

Why /4? Seems a bit arbitrary?

What's a typical value of max_num_pending_tx?

-Toke


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 19:36     ` Toke Høiland-Jørgensen
@ 2020-04-28 19:39       ` Dave Taht
  2020-04-28 20:00         ` Ben Greear
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Taht @ 2020-04-28 19:39 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Ben Greear, Steve deRosier, linux-wireless

On Tue, Apr 28, 2020 at 12:37 PM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
> Ben Greear <greearb@candelatech.com> writes:
>
> > On 04/28/2020 07:56 AM, Steve deRosier wrote:
> >> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
> >>>
> >>> From: Ben Greear <greearb@candelatech.com>
> >>>
> >>> While running tcp upload + download tests with ~200
> >>> concurrent TCP streams, 1-2 processes, and 30 station
> >>> vdevs, I noticed that the __ieee80211_stop_queue was taking
> >>> around 20% of the CPU according to perf-top, with other locking
> >>> taking an additional ~15%.
> >>>
> >>> I believe the issue is that the ath10k driver would unlock the
> >>> txqueue when a single frame could be transmitted, instead of
> >>> waiting for a low water mark.
> >>>
> >>> So, this patch adds a low-water mark that is 1/4 of the total
> >>> tx buffers allowed.
> >>>
> >>> This appears to resolve the performance problem that I saw.
> >>>
> >>> Tested with recent wave-1 ath10k-ct firmware.
> >>>
> >>
> >> Hey Ben,
> >>
> >> Did you do any testing with this patch around latency?  The nature of
> >> the thing that you fixed makes me wonder if it was intentional with
> >> respect to making WiFi fast - ie getting rid of buffers as much as
> >> possible.  Obviously the CPU impact is likely to be an unintended
> >> consequence. In any case, I don't know anything for sure, it was just
> >> a thought that went through my head when reading this.
> >
> > I did not, but on average my patch should make the queues be less full,
> > so I doubt it will hurt latency.
>
> I would tend to agree with that.

Well, I don't, as it's dependent on right sizing the ring in the first place.

But testing is in order! :)

> -Toke
>
> >>> Signed-off-by: Ben Greear <greearb@candelatech.com>
> >>> ---
> >>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
> >>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
> >>>  2 files changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> >>> index 31c4ddbf45cb..b5634781c0dc 100644
> >>> --- a/drivers/net/wireless/ath/ath10k/htt.h
> >>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
> >>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
> >>>
> >>>         u8 target_version_major;
> >>>         u8 target_version_minor;
> >>> +       bool needs_unlock;
> >>>         struct completion target_version_received;
> >>>         u8 max_num_amsdu;
> >>>         u8 max_num_ampdu;
> >>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> >>> index 9b3c3b080e92..44795d9a7c0c 100644
> >>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> >>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> >>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
> >>>         lockdep_assert_held(&htt->tx_lock);
> >>>
> >>>         htt->num_pending_tx--;
> >>> -       if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> >>> +       if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
> >>> +               htt->needs_unlock = false;
> >>>                 ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> >>> +       }
> >>>  }
> >>>
> >>>  int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> >>> @@ -157,8 +159,10 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
> >>>                 return -EBUSY;
> >>>
> >>>         htt->num_pending_tx++;
> >>> -       if (htt->num_pending_tx == htt->max_num_pending_tx)
> >>> +       if (htt->num_pending_tx == htt->max_num_pending_tx) {
> >>> +               htt->needs_unlock = true;
> >>>                 ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
> >>> +       }
> >>>
> >>>         return 0;
> >>>  }
> >>> --
> >>> 2.20.1
> >>>
> >>
> >
> > --
> > Ben Greear <greearb@candelatech.com>
> > Candela Technologies Inc  http://www.candelatech.com



-- 
Make Music, Not War

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-435-0729


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 16:35     ` Ben Greear
  2020-04-28 17:10       ` Dave Taht
@ 2020-04-28 19:43       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 27+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-28 19:43 UTC (permalink / raw)
  To: Ben Greear, Dave Taht, Steve deRosier; +Cc: linux-wireless

Ben Greear <greearb@candelatech.com> writes:

> On 04/28/2020 09:27 AM, Dave Taht wrote:
>> On Tue, Apr 28, 2020 at 7:56 AM Steve deRosier <derosier@gmail.com> wrote:
>>>
>>> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
>>>>
>>>> From: Ben Greear <greearb@candelatech.com>
>>>>
>>>> While running tcp upload + download tests with ~200
>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>> around 20% of the CPU according to perf-top, with other locking
>>>> taking an additional ~15%.
>>>>
>>>> I believe the issue is that the ath10k driver would unlock the
>>>> txqueue when a single frame could be transmitted, instead of
>>>> waiting for a low water mark.
>>>>
>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>> tx buffers allowed.
>>>>
>>>> This appears to resolve the performance problem that I saw.
>>>>
>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>
>>>
>>> Hey Ben,
>>>
>>> Did you do any testing with this patch around latency?  The nature of
>>> the thing that you fixed makes me wonder if it was intentional with
>>> respect to making WiFi fast - ie getting rid of buffers as much as
>>> possible.  Obviously the CPU impact is likely to be an unintended
>>> consequence. In any case, I don't know anything for sure, it was just
>>> a thought that went through my head when reading this.
>>
>> I note that I'd prefer a BQL-like high/low watermark approach in
>> general... bytes, not packets, or better, perceived
>> airtime in a revision of AQL...
>>
>> ... but we'll try this patch, thx!
>
> Is there a nice diagram somewhere that shows where the various
> buffer-bloat technologies sit in the stack?

Not really. Best thing I know of is the one I drew myself: Figure 3 of this paper:
https://www.usenix.org/system/files/conference/atc17/atc17-hoiland-jorgensen.pdf

That is still a semi-accurate representation of the queueing structure
in mac80211. For ath10k, just imagine that the bit that says "ath9k
driver" is part of mac80211, and that the "HW queue" is everything the
driver and firmware does... AQL also activates in the circle labelled
"RR" there, but the figure predates AQL.

> I suspect such should be above the txqueue start/stop logic in the
> driver that I mucked with, and certainly the old behaviour has no
> obvious tie-in with any higher-level algorithms.
>
> I doubt my patch will change much except in pathological cases where
> the system is transmitting frames fast enough to fully fill the tx
> buffers and also loaded in such a way that it can process just a few
> tx frames at a time to keep bouncing to full and not full over and
> over.

The latter part of that ("can process just a few tx frames at a time")
mostly happens when many stations are active at the same time, right?

-Toke


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 19:37 ` Toke Høiland-Jørgensen
@ 2020-04-28 19:51   ` Ben Greear
  2020-04-28 20:39     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 27+ messages in thread
From: Ben Greear @ 2020-04-28 19:51 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, linux-wireless



On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
> greearb@candelatech.com writes:
>
>> From: Ben Greear <greearb@candelatech.com>
>>
>> While running tcp upload + download tests with ~200
>> concurrent TCP streams, 1-2 processes, and 30 station
>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>> around 20% of the CPU according to perf-top, with other locking
>> taking an additional ~15%.
>>
>> I believe the issue is that the ath10k driver would unlock the
>> txqueue when a single frame could be transmitted, instead of
>> waiting for a low water mark.
>>
>> So, this patch adds a low-water mark that is 1/4 of the total
>> tx buffers allowed.
>>
>> This appears to resolve the performance problem that I saw.
>>
>> Tested with recent wave-1 ath10k-ct firmware.
>>
>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>> ---
>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>> index 31c4ddbf45cb..b5634781c0dc 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>
>>  	u8 target_version_major;
>>  	u8 target_version_minor;
>> +	bool needs_unlock;
>>  	struct completion target_version_received;
>>  	u8 max_num_amsdu;
>>  	u8 max_num_ampdu;
>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> index 9b3c3b080e92..44795d9a7c0c 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>  	lockdep_assert_held(&htt->tx_lock);
>>
>>  	htt->num_pending_tx--;
>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>
> Why /4? Seems a bit arbitrary?

Yes, arbitrary for sure.  I figure restart filling the queue when 1/4 full so that it
is unlikely to run dry.  Possibly it should restart sooner to keep it more full
on average?

Before my patch, the behaviour would be to try to keep it as full as possible, as in
restart the queues as soon as a single slot opens up in the tx queue.
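
For what it's worth, the stop/restart hysteresis the patch introduces can be
modelled in plain userspace C (a sketch with made-up names; the real driver
calls ath10k_mac_tx_lock()/ath10k_mac_tx_unlock() under htt->tx_lock):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the patch's hysteresis: stop the
 * queue when the ring fills, and restart it only once it drains to
 * max/4 (instead of the old "one free slot" rule). */
struct txq_model {
	int num_pending;
	int max_pending;
	bool stopped;      /* mirrors needs_unlock in the patch */
	int wake_events;   /* how often we would wake the mac80211 queues */
};

static bool txq_enqueue(struct txq_model *q)
{
	if (q->num_pending >= q->max_pending)
		return false;            /* the -EBUSY case in the driver */
	if (++q->num_pending == q->max_pending)
		q->stopped = true;       /* ath10k_mac_tx_lock() */
	return true;
}

static void txq_complete(struct txq_model *q)
{
	q->num_pending--;
	if (q->stopped && q->num_pending <= q->max_pending / 4) {
		q->stopped = false;      /* ath10k_mac_tx_unlock() */
		q->wake_events++;
	}
}
```

With the old behaviour the queue is woken on every completion while hovering
near full; with the low-water mark it is woken once per drain cycle.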

>
> What's a typical value of max_num_pending_tx?

1424

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 19:39       ` Dave Taht
@ 2020-04-28 20:00         ` Ben Greear
  2020-04-28 20:58           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 27+ messages in thread
From: Ben Greear @ 2020-04-28 20:00 UTC (permalink / raw)
  To: Dave Taht, Toke Høiland-Jørgensen
  Cc: Steve deRosier, linux-wireless



On 04/28/2020 12:39 PM, Dave Taht wrote:
> On Tue, Apr 28, 2020 at 12:37 PM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>
>> Ben Greear <greearb@candelatech.com> writes:
>>
>>> On 04/28/2020 07:56 AM, Steve deRosier wrote:
>>>> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
>>>>>
>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>
>>>>> While running tcp upload + download tests with ~200
>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
> >>>>> around 20% of the CPU according to perf-top, with other locking
>>>>> taking an additional ~15%.
>>>>>
>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>> txqueue when a single frame could be transmitted, instead of
>>>>> waiting for a low water mark.
>>>>>
>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>> tx buffers allowed.
>>>>>
>>>>> This appears to resolve the performance problem that I saw.
>>>>>
>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>
>>>>
>>>> Hey Ben,
>>>>
>>>> Did you do any testing with this patch around latency?  The nature of
>>>> the thing that you fixed makes me wonder if it was intentional with
>>>> respect to making WiFi fast - ie getting rid of buffers as much as
>>>> possible.  Obviously the CPU impact is likely to be an unintended
>>>> consequence. In any case, I don't know anything for sure, it was just
>>>> a thought that went through my head when reading this.
>>>
>>> I did not, but on average my patch should make the queues be less full,
>>> so I doubt it will hurt latency.
>>
>> I would tend to agree with that.
>
> Well, I don't, as it's dependent on right sizing the ring in the first place.

My patch, barring strange issues elsewhere, can only make the firmware tx queues less full on
average.

If you want to test with different ring sizes, you can play with the tx_desc
setting in the ath10k-ct driver 'fwcfg' options.

http://www.candelatech.com/ath10k-10.4.php#config

My testing shows that overall throughput goes down when using lots of peers
if you have smaller numbers of txbuffers.  This is because the firmware
will typically spread its buffers over lots of peers and have smaller ampdu
chains per transmit.  An upper stack that more intelligently fed frames
to the firmware could mitigate this, and it is not all bad anyway since
giving everyone 64-frame ampdu chains will increase burstiness at least somewhat.

I've always envisioned that the stuff you and Toke and others have been
working on would help in this area, but I don't understand your stuff well
enough to know if that is true or not.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 19:51   ` Ben Greear
@ 2020-04-28 20:39     ` Toke Høiland-Jørgensen
  2020-04-28 21:18       ` Ben Greear
  2020-04-30 23:31       ` John Deere
  0 siblings, 2 replies; 27+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-28 20:39 UTC (permalink / raw)
  To: Ben Greear, linux-wireless

Ben Greear <greearb@candelatech.com> writes:

> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>> greearb@candelatech.com writes:
>>
>>> From: Ben Greear <greearb@candelatech.com>
>>>
>>> While running tcp upload + download tests with ~200
>>> concurrent TCP streams, 1-2 processes, and 30 station
>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>> around 20% of the CPU according to perf-top, with other locking
>>> taking an additional ~15%.
>>>
>>> I believe the issue is that the ath10k driver would unlock the
>>> txqueue when a single frame could be transmitted, instead of
>>> waiting for a low water mark.
>>>
>>> So, this patch adds a low-water mark that is 1/4 of the total
>>> tx buffers allowed.
>>>
>>> This appears to resolve the performance problem that I saw.
>>>
>>> Tested with recent wave-1 ath10k-ct firmware.
>>>
>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>> ---
>>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>
>>>  	u8 target_version_major;
>>>  	u8 target_version_minor;
>>> +	bool needs_unlock;
>>>  	struct completion target_version_received;
>>>  	u8 max_num_amsdu;
>>>  	u8 max_num_ampdu;
>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>  	lockdep_assert_held(&htt->tx_lock);
>>>
>>>  	htt->num_pending_tx--;
>>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>
>> Why /4? Seems a bit arbitrary?
>
> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
> full so that it is unlikely to run dry. Possibly it should restart
> sooner to keep it more full on average?

Theoretically, the "keep the queue at the lowest possible level that
keeps it from underflowing" is what BQL is supposed to do. The diff
below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
place of num_pending_tx. I've only compile tested it, and I'm a bit
skeptical that it will work right for this, but if anyone wants to give
it a shot, there it is.

BTW, while doing that, I noticed there's a similar arbitrary limit in
ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
to keep the arbitrary limit maybe use the same one? :)

> Before my patch, the behaviour would be to try to keep it as full as
> possible, as in restart the queues as soon as a single slot opens up
> in the tx queue.

Yeah, that seems somewhat misguided as well, from a latency perspective,
at least. But I guess that's what we're fixing with AQL. What does the
firmware do with the frames queued within? Do they just go on a FIFO
queue altogether, or something smarter?

-Toke




diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index f26cc6989dad..72771ff38a94 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -2497,6 +2497,10 @@ static int ath10k_core_init_firmware_features(struct ath10k *ar)
 		return -EINVAL;
 	}
 
+	dql_init(&ar->htt.dql, HZ);
+	ar->htt.dql.max_limit = ar->htt.max_num_pending_tx;
+	ar->htt.dql.min_limit = 8;
+
 	if (ar->hw_params.num_peers)
 		ar->max_num_peers = ar->hw_params.num_peers;
 	else
diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
index 4a12564fc30e..19024d063896 100644
--- a/drivers/net/wireless/ath/ath10k/htt.h
+++ b/drivers/net/wireless/ath/ath10k/htt.h
@@ -13,6 +13,7 @@
 #include <linux/dmapool.h>
 #include <linux/hashtable.h>
 #include <linux/kfifo.h>
+#include <linux/dynamic_queue_limits.h>
 #include <net/mac80211.h>
 
 #include "htc.h"
@@ -1965,8 +1966,8 @@ struct ath10k_htt {
 	/* Protects access to pending_tx, num_pending_tx */
 	spinlock_t tx_lock;
 	int max_num_pending_tx;
-	int num_pending_tx;
 	int num_pending_mgmt_tx;
+	struct dql dql;
 	struct idr pending_tx;
 	wait_queue_head_t empty_tx_wq;
 
diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index e9d12ea708b6..911a79470bdf 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
 {
 	lockdep_assert_held(&htt->tx_lock);
 
-	htt->num_pending_tx--;
-	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
+	dql_completed(&htt->dql, 1);
+	if (dql_avail(&htt->dql) > 0)
 		ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
 }
 
@@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
 {
 	lockdep_assert_held(&htt->tx_lock);
 
-	if (htt->num_pending_tx >= htt->max_num_pending_tx)
+	if (dql_avail(&htt->dql) <= 0)
 		return -EBUSY;
 
-	htt->num_pending_tx++;
-	if (htt->num_pending_tx == htt->max_num_pending_tx)
+	dql_queued(&htt->dql, 1);
+	if (dql_avail(&htt->dql) <= 0)
 		ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
 
 	return 0;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index 2d03b8dd3b8c..1fe251742b0a 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct ieee80211_hw *hw,
 	if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH)
 		return true;
 
-	if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed)
+	if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed)
 		return true;
 
 	if (artxq->num_fw_queued < artxq->num_push_allowed)
@@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
 	if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
 		return;
 
-	if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2))
+	if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2))
 		return;
 
 	rcu_read_lock();
@@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k *ar)
 			bool empty;
 
 			spin_lock_bh(&ar->htt.tx_lock);
-			empty = (ar->htt.num_pending_tx == 0);
+			empty = (ar->htt.dql.num_completed == ar->htt.dql.num_queued);
 			spin_unlock_bh(&ar->htt.tx_lock);
 
 			skip = (ar->state == ATH10K_STATE_WEDGED) ||
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
index 39abf8b12903..fe7cd53c2bf9 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
 
 	ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
 	ath10k_htt_tx_dec_pending(htt);
-	if (htt->num_pending_tx == 0)
+	if (htt->dql.num_completed == htt->dql.num_queued)
 		wake_up(&htt->empty_tx_wq);
 	spin_unlock_bh(&htt->tx_lock);
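
For intuition, the dql counters used above boil down to roughly the following
bookkeeping (a simplified userspace model; the real
<linux/dynamic_queue_limits.h> additionally grows and shrinks the limit at
runtime based on completion behaviour, which is the whole point of BQL and is
elided here):

```c
#include <assert.h>

/* Simplified model of the dql bookkeeping in the diff above.
 * Available space is the limit minus frames in flight (queued but
 * not yet completed).  Dynamic limit adjustment is elided. */
struct dql_model {
	unsigned int num_queued;     /* what dql_queued() increments    */
	unsigned int num_completed;  /* what dql_completed() increments */
	unsigned int limit;
};

static int dql_model_avail(const struct dql_model *d)
{
	return (int)d->limit - (int)(d->num_queued - d->num_completed);
}
```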
 


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 20:00         ` Ben Greear
@ 2020-04-28 20:58           ` Toke Høiland-Jørgensen
  2020-04-28 21:23             ` Steve deRosier
  0 siblings, 1 reply; 27+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-28 20:58 UTC (permalink / raw)
  To: Ben Greear, Dave Taht; +Cc: Steve deRosier, linux-wireless

Ben Greear <greearb@candelatech.com> writes:

> On 04/28/2020 12:39 PM, Dave Taht wrote:
>> On Tue, Apr 28, 2020 at 12:37 PM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>>>
>>> Ben Greear <greearb@candelatech.com> writes:
>>>
>>>> On 04/28/2020 07:56 AM, Steve deRosier wrote:
>>>>> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
>>>>>>
>>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>>
>>>>>> While running tcp upload + download tests with ~200
>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
> >>>>>> around 20% of the CPU according to perf-top, with other locking
>>>>>> taking an additional ~15%.
>>>>>>
>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>> waiting for a low water mark.
>>>>>>
>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>> tx buffers allowed.
>>>>>>
>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>
>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>
>>>>>
>>>>> Hey Ben,
>>>>>
>>>>> Did you do any testing with this patch around latency?  The nature of
>>>>> the thing that you fixed makes me wonder if it was intentional with
>>>>> respect to making WiFi fast - ie getting rid of buffers as much as
>>>>> possible.  Obviously the CPU impact is likely to be an unintended
>>>>> consequence. In any case, I don't know anything for sure, it was just
>>>>> a thought that went through my head when reading this.
>>>>
>>>> I did not, but on average my patch should make the queues be less full,
>>>> so I doubt it will hurt latency.
>>>
>>> I would tend to agree with that.
>>
>> Well, I don't, as it's dependent on right sizing the ring in the first place.
>
> My patch, barring strange issues elsewhere, can only make the firmware tx queues less full on
> average.
>
> If you want to test with different ring sizes, you can play with the tx_desc
> setting in the ath10k-ct driver 'fwcfg' options.
>
> http://www.candelatech.com/ath10k-10.4.php#config
>
> My testing shows that overall throughput goes down when using lots of peers
> if you have smaller numbers of txbuffers.  This is because the firmware
> will typically spread its buffers over lots of peers and have smaller ampdu
> chains per transmit.  An upper stack that more intelligently fed frames
> to the firmware could mitigate this, and it is not all bad anyway since
> giving everyone 64-frame ampdu chains will increase burstiness at least
> somewhat.

Making each transmission shorter is arguably the right thing to do in
the "extremely congested" scenario, though. If you have to wait your
turn behind 100 other stations for your next TXOP you'd generally want
each of those other stations to only transmit (say) 1ms instead of their
full 4ms. Yes, this will hurt aggregate throughput somewhat, but I'd
argue that in most cases the overall application performance would be
better. You're right, though, that ideally this would be managed a bit
smarter than by just running out of buffers :)
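
Rough numbers behind that, with a hypothetical helper (ignoring contention,
inter-frame overhead, and unequal airtime):

```c
#include <assert.h>

/* With S stations taking round-robin TXOPs of t ms each, a given
 * station waits roughly (S - 1) * t ms between its own turns. */
static unsigned int txop_wait_ms(unsigned int stations, unsigned int txop_ms)
{
	return (stations - 1) * txop_ms;
}
```

So capping 100 stations at 1ms instead of 4ms cuts the worst-case wait between
a station's transmit opportunities from roughly 396ms to roughly 99ms.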

> I've always envisioned that the stuff you and Toke and others have been
> working on would help in this area, but I don't understand your stuff well
> enough to know if that is true or not.

It might, although as per above I'm not quite sure what "helps" really
means in this context. What would you expect a good behaviour to be
here? I think what you're alluding to is to limit the total number of
stations that will be allowed to have outstanding data in the firmware
at the same time, right? Here it would help a bit to know some more
details of how the firmware manages its internal queueing, and how it
schedules stations (if at all)?

BTW, are you running any of these tests with AQL enabled?

-Toke


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 20:39     ` Toke Høiland-Jørgensen
@ 2020-04-28 21:18       ` Ben Greear
  2020-04-29  9:28         ` Toke Høiland-Jørgensen
  2020-04-30 23:31       ` John Deere
  1 sibling, 1 reply; 27+ messages in thread
From: Ben Greear @ 2020-04-28 21:18 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, linux-wireless



On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
> Ben Greear <greearb@candelatech.com> writes:
>
>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>> greearb@candelatech.com writes:
>>>
>>>> From: Ben Greear <greearb@candelatech.com>
>>>>
>>>> While running tcp upload + download tests with ~200
>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>> around 20% of the CPU according to perf-top, with other locking
>>>> taking an additional ~15%.
>>>>
>>>> I believe the issue is that the ath10k driver would unlock the
>>>> txqueue when a single frame could be transmitted, instead of
>>>> waiting for a low water mark.
>>>>
>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>> tx buffers allowed.
>>>>
>>>> This appears to resolve the performance problem that I saw.
>>>>
>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>
>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>> ---
>>>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>
>>>>  	u8 target_version_major;
>>>>  	u8 target_version_minor;
>>>> +	bool needs_unlock;
>>>>  	struct completion target_version_received;
>>>>  	u8 max_num_amsdu;
>>>>  	u8 max_num_ampdu;
>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>  	lockdep_assert_held(&htt->tx_lock);
>>>>
>>>>  	htt->num_pending_tx--;
>>>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>
>>> Why /4? Seems a bit arbitrary?
>>
>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>> full so that it is unlikely to run dry. Possibly it should restart
>> sooner to keep it more full on average?
>
> Theoretically, the "keep the queue at the lowest possible level that
> keeps it from underflowing" is what BQL is supposed to do. The diff
> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
> place of num_pending_tx. I've only compile tested it, and I'm a bit
> skeptical that it will work right for this, but if anyone wants to give
> it a shot, there it is.
>
> BTW, while doing that, I noticed there's a similar arbitrary limit in
> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
> to keep the arbitrary limit maybe use the same one? :)
>
>> Before my patch, the behaviour would be to try to keep it as full as
>> possible, as in restart the queues as soon as a single slot opens up
>> in the tx queue.
>
> Yeah, that seems somewhat misguided as well, from a latency perspective,
> at least. But I guess that's what we're fixing with AQL. What does the
> firmware do with the frames queued within? Do they just go on a FIFO
> queue altogether, or something smarter?

Something like a mini-mac80211 stack inside the firmware is used to
create ampdu/amsdu chains and schedule them with its own scheduler.

For optimal throughput with 200 users steaming video,
the ath10k driver should think that it has only a few active peers wanting
to send data at a time (and so firmware would think the same), and the driver should
be fed a large chunk of pkts for those peers.  And then the next few peers.
That should let firmware send large ampdu/amsdu to each peer, increasing throughput
over all.

If you feed a few frames to each of the 200 peers, then even if firmware has 2000
tx buffers, that is only 10 frames per peer at best, leading to small ampdu/amsdu
and thus worse over-all throughput and utilization of airtime.
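
To put numbers on that (a hypothetical helper; 64 being the usual A-MPDU
subframe limit, and assuming at least one active peer):

```c
#include <assert.h>

/* If N tx buffers are spread evenly over P active peers (P >= 1),
 * each burst to a peer can aggregate at most N/P frames, capped by
 * the 64-frame A-MPDU limit. */
static unsigned int max_ampdu_frames(unsigned int tx_bufs, unsigned int peers)
{
	unsigned int per_peer = tx_bufs / peers;
	return per_peer < 64 ? per_peer : 64;
}
```

So 2000 buffers over 200 peers allows at most 10-frame aggregates, while the
same firmware serving 10 peers would hit the 64-frame cap.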

It would be nice to be able to set certain traffic flows to have the
throughput optimization and others to have the latency optimization.
For instance, high latency on a streaming download is a good trade-off
if it increases total throughput.  The end device will have plenty of
buffers to handle the bursts of data.

And of course other traffic will benefit from lower latency.

Maybe some of the AI folks training their AI to categorize cat pictures could
instead start categorizing traffic flows and adjusting the stack realtime...

And now...back to the grind for me.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 20:58           ` Toke Høiland-Jørgensen
@ 2020-04-28 21:23             ` Steve deRosier
  0 siblings, 0 replies; 27+ messages in thread
From: Steve deRosier @ 2020-04-28 21:23 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Ben Greear, Dave Taht, linux-wireless

On Tue, Apr 28, 2020 at 1:58 PM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
> Ben Greear <greearb@candelatech.com> writes:
>
> > On 04/28/2020 12:39 PM, Dave Taht wrote:
> >> On Tue, Apr 28, 2020 at 12:37 PM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> >>>
> >>> Ben Greear <greearb@candelatech.com> writes:
> >>>
> >>>> On 04/28/2020 07:56 AM, Steve deRosier wrote:
> >>>>> On Mon, Apr 27, 2020 at 7:54 AM <greearb@candelatech.com> wrote:
> >>>>>>
> >>>>>> From: Ben Greear <greearb@candelatech.com>
> >>>>>>
> >>>>>> While running tcp upload + download tests with ~200
> >>>>>> concurrent TCP streams, 1-2 processes, and 30 station
> >>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
> >>>>>> around 20% of the CPU according to perf-top, with other locking
> >>>>>> taking an additional ~15%.
> >>>>>>
> >>>>>> I believe the issue is that the ath10k driver would unlock the
> >>>>>> txqueue when a single frame could be transmitted, instead of
> >>>>>> waiting for a low water mark.
> >>>>>>
> >>>>>> So, this patch adds a low-water mark that is 1/4 of the total
> >>>>>> tx buffers allowed.
> >>>>>>
> >>>>>> This appears to resolve the performance problem that I saw.
> >>>>>>
> >>>>>> Tested with recent wave-1 ath10k-ct firmware.
> >>>>>>
> >>>>>
> >>>>> Hey Ben,
> >>>>>
> >>>>> Did you do any testing with this patch around latency?  The nature of
> >>>>> the thing that you fixed makes me wonder if it was intentional with
> >>>>> respect to making WiFi fast - ie getting rid of buffers as much as
> >>>>> possible.  Obviously the CPU impact is likely to be an unintended
> >>>>> consequence. In any case, I don't know anything for sure, it was just
> >>>>> a thought that went through my head when reading this.
> >>>>
> >>>> I did not, but on average my patch should make the queues be less full,
> >>>> so I doubt it will hurt latency.
> >>>
> >>> I would tend to agree with that.
> >>
> >> Well, I don't, as it's dependent on right sizing the ring in the first place.
> >
> > My patch, barring strange issues elsewhere, can only make the firmware tx queues less full on
> > average.
> >
> > If you want to test with different ring sizes, you can play with the tx_desc
> > setting in the ath10k-ct driver 'fwcfg' options.
> >
> > http://www.candelatech.com/ath10k-10.4.php#config
> >
> > My testing shows that overall throughput goes down when using lots of peers
> > if you have smaller numbers of txbuffers.  This is because the firmware
> > will typically spread its buffers over lots of peers and have smaller ampdu
> > chains per transmit.  An upper stack that more intelligently fed frames
> > to the firmware could mitigate this, and it is not all bad anyway since
> > giving everyone a 64 ampdu chains will increase burstiness at least
> > somewhat.
>
> Making each transmission shorter is arguably the right thing to do in
> the "extremely congested" scenario, though. If you have to wait your
> turn behind 100 other stations for your next TXOP you'd generally want
> each of those other stations to only transmit (say) 1ms instead of their
> full 4ms. Yes, this will hurt aggregate throughput somewhat, but I'd
> argue that in most cases the overall application performance would be
> better. You're right, though, that ideally this would be managed a bit
> smarter than by just running out of buffers :)
>
> > I've always envisioned that the stuff you and Toke and others have been
> > working on would help in this area, but I don't understand your stuff well
> > enough to know if that is true or not.
>
> It might, although as per above I'm not quite sure what "helps" really
> means in this context. What would you expect a good behaviour to be
> here? I think what you're alluding to is to limit the total number of
> stations that will be allowed to have outstanding data in the firmware
> at the same time, right? Here it would help a bit to know some more
> details of how the firmware manages its internal queueing, and how it
> schedules stations (if at all)?
>
> BTW, are you running any of these tests with AQL enabled?
>

I don't know if Ben is doing so, but I will be doing so here very
soon. I've got some unrelated things in the way to clear up first, but
within a couple of weeks we hope to be testing AQL with ath10k-ct
firmware and driver.

And everyone thanks for the discussion, it's been very interesting and
useful to me. Hopefully we can improve ath10k even more with this
information.

- Steve


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 21:18       ` Ben Greear
@ 2020-04-29  9:28         ` Toke Høiland-Jørgensen
  2020-04-29 13:54           ` Ben Greear
  0 siblings, 1 reply; 27+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-29  9:28 UTC (permalink / raw)
  To: Ben Greear, linux-wireless

Ben Greear <greearb@candelatech.com> writes:

> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <greearb@candelatech.com> writes:
>>
>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>> greearb@candelatech.com writes:
>>>>
>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>
>>>>> While running tcp upload + download tests with ~200
>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>> around 20% of the CPU according to perf-top, which other locking
>>>>> taking an additional ~15%.
>>>>>
>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>> txqueue when a single frame could be transmitted, instead of
>>>>> waiting for a low water mark.
>>>>>
>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>> tx buffers allowed.
>>>>>
>>>>> This appears to resolve the performance problem that I saw.
>>>>>
>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>
>>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>> ---
>>>>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>
>>>>>  	u8 target_version_major;
>>>>>  	u8 target_version_minor;
>>>>> +	bool needs_unlock;
>>>>>  	struct completion target_version_received;
>>>>>  	u8 max_num_amsdu;
>>>>>  	u8 max_num_ampdu;
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>  	lockdep_assert_held(&htt->tx_lock);
>>>>>
>>>>>  	htt->num_pending_tx--;
>>>>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>
>>>> Why /4? Seems a bit arbitrary?
>>>
>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>> full so that it is unlikely to run dry. Possibly it should restart
>>> sooner to keep it more full on average?
>>
>> Theoretically, the "keep the queue at the lowest possible level that
>> keeps it from underflowing" is what BQL is supposed to do. The diff
>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>> skeptical that it will work right for this, but if anyone wants to give
>> it a shot, there it is.
>>
>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>> to keep the arbitrary limit maybe use the same one? :)
>>
>>> Before my patch, the behaviour would be to try to keep it as full as
>>> possible, as in restart the queues as soon as a single slot opens up
>>> in the tx queue.
>>
>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>> at least. But I guess that's what we're fixing with AQL. What does the
>> firmware do with the frames queued within? Do they just go on a FIFO
>> queue altogether, or something smarter?
>
> Sort of like a mini-mac80211 stack inside the firmware is used to
> create ampdu/amsdu chains and schedule them with its own scheduler.
>
> For optimal throughput with 200 users steaming video,
> the ath10k driver should think that it has only a few active peers wanting
> to send data at a time (and so firmware would think the same), and the driver should
> be fed a large chunk of pkts for those peers.  And then the next few peers.
> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
> over all.

Yes, but also increasing latency because all other stations have to wait
for a longer TXOP (see my other reply).

> If you feed a few frames to each of the 200 peers, then even if
> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
> leading to small ampdu/amsdu and thus worse over-all throughput and
> utilization of airtime.

Do you have any data on exactly how long (in time) each txop becomes in
these highly congested scenarios?

> It would be nice to be able to set certain traffic flows to have the
> throughput optimization and others to have the latency optimization.
> For instance, high latency on a streaming download is a good trade-off
> if it increases total throughput.

For the individual flows to a peer, fq_codel actually does a pretty good
job at putting the latency-sensitive flows first. Which is why we want
the queueing to happen in mac80211 (where fq_codel is active) instead of
in the firmware.

> Maybe some of the AI folks training their AI to categorize cat
> pictures could instead start categorizing traffic flows and adjusting
> the stack realtime...

I know there are people trying to classify traffic with machine learning
(although usually for more nefarious purposes), but I'm not sure it's
feasible to do in a queue management algorithm (if at all). :)

-Toke


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-29  9:28         ` Toke Høiland-Jørgensen
@ 2020-04-29 13:54           ` Ben Greear
  2020-04-29 14:56             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 27+ messages in thread
From: Ben Greear @ 2020-04-29 13:54 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, linux-wireless



On 04/29/2020 02:28 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <greearb@candelatech.com> writes:
>
>> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>>> Ben Greear <greearb@candelatech.com> writes:
>>>
>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>> greearb@candelatech.com writes:
>>>>>
>>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>>
>>>>>> While running tcp upload + download tests with ~200
>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>>> around 20% of the CPU according to perf-top, which other locking
>>>>>> taking an additional ~15%.
>>>>>>
>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>> waiting for a low water mark.
>>>>>>
>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>> tx buffers allowed.
>>>>>>
>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>
>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>
>>>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>>> ---
>>>>>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>>
>>>>>>  	u8 target_version_major;
>>>>>>  	u8 target_version_minor;
>>>>>> +	bool needs_unlock;
>>>>>>  	struct completion target_version_received;
>>>>>>  	u8 max_num_amsdu;
>>>>>>  	u8 max_num_ampdu;
>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>>  	lockdep_assert_held(&htt->tx_lock);
>>>>>>
>>>>>>  	htt->num_pending_tx--;
>>>>>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>
>>>>> Why /4? Seems a bit arbitrary?
>>>>
>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>> sooner to keep it more full on average?
>>>
>>> Theoretically, the "keep the queue at the lowest possible level that
>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>> skeptical that it will work right for this, but if anyone wants to give
>>> it a shot, there it is.
>>>
>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>> to keep the arbitrary limit maybe use the same one? :)
>>>
>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>> possible, as in restart the queues as soon as a single slot opens up
>>>> in the tx queue.
>>>
>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>> at least. But I guess that's what we're fixing with AQL. What does the
>>> firmware do with the frames queued within? Do they just go on a FIFO
>>> queue altogether, or something smarter?
>>
>> Sort of like a mini-mac80211 stack inside the firmware is used to
>> create ampdu/amsdu chains and schedule them with its own scheduler.
>>
>> For optimal throughput with 200 users steaming video,
>> the ath10k driver should think that it has only a few active peers wanting
>> to send data at a time (and so firmware would think the same), and the driver should
>> be fed a large chunk of pkts for those peers.  And then the next few peers.
>> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
>> over all.
>
> Yes, but also increasing latency because all other stations have to wait
> for a longer TXOP (see my other reply).

If you sent at most 4 stations' worth of data to the firmware, and the max is 4ms per
txop, then you have at most 16ms of latency.  You could also send just two stations'
worth of data at a time; as long as you can quickly service the tx-queues again,
that should be enough to keep the firmware/radio productive.

In the case where you have many users wanting lots of throughput, 8 or 16ms of extra
latency is a good tradeoff vs no one being able to reliably get the bandwidth they
need.

Higher priority TIDs will get precedence in ath10k firmware anyway, so even if at time 0
you sent 64 frames to a peer on background TID, if you sent a VO frame at time 0+1,
it could be transmitted first.

>
>> If you feed a few frames to each of the 200 peers, then even if
>> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
>> leading to small ampdu/amsdu and thus worse over-all throughput and
>> utilization of airtime.
>
> Do you have any data on exactly how long (in time) each txop becomes in
> these highly congested scenarios?

I didn't look at time, but avg packets-per-ampdu chain is 30+ in single station tests,
and with many stations it goes into the 4-8 range (from memory, so maybe I'm
off a bit).

Here is an open-source tool that can give you those metrics by processing a pcap:

https://github.com/greearb/lanforge-scripts/tree/master/wifi_diag

# Ignore the LANforge bits about creating capture files, here is an example of how to use it:
https://www.candelatech.com/cookbook/wifire/wifi+diagnostic


>
>> It would be nice to be able to set certain traffic flows to have the
>> throughput optimization and others to have the latency optimization.
>> For instance, high latency on a streaming download is a good trade-off
>> if it increases total throughput.
>
> For the individual flows to a peer, fq_codel actually does a pretty good
> job at putting the latency-sensitive flows first. Which is why we want
> the queueing to happen in mac80211 (where fq_codel is active) instead of
> in the firmware.

That sounds good to me.  What is needed from the driver/firmware to make
this work well?

>> Maybe some of the AI folks training their AI to categorize cat
>> pictures could instead start categorizing traffic flows and adjusting
>> the stack realtime...
>
> I know there are people trying to classify traffic with machine learning
> (although usually for more nefarious purposes), but I'm not sure it's
> feasible to do in a queue management algorithm (if at all). :)

If you provide an API (maybe using 'tc'?) that user-space can twiddle, then anyone
doing clever things can do it at higher levels.  It could easily be that multiple
APs would benefit from being coordinated together, for instance, and no local
queueing logic would be able to achieve that by itself.

Thanks,
Ben

>
> -Toke
>

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-29 13:54           ` Ben Greear
@ 2020-04-29 14:56             ` Toke Høiland-Jørgensen
  2020-04-29 15:26               ` Ben Greear
  0 siblings, 1 reply; 27+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-29 14:56 UTC (permalink / raw)
  To: Ben Greear, linux-wireless

Ben Greear <greearb@candelatech.com> writes:

> On 04/29/2020 02:28 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <greearb@candelatech.com> writes:
>>
>>> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>>>> Ben Greear <greearb@candelatech.com> writes:
>>>>
>>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>>> greearb@candelatech.com writes:
>>>>>>
>>>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>>>
>>>>>>> While running tcp upload + download tests with ~200
>>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>>>> around 20% of the CPU according to perf-top, which other locking
>>>>>>> taking an additional ~15%.
>>>>>>>
>>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>>> waiting for a low water mark.
>>>>>>>
>>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>>> tx buffers allowed.
>>>>>>>
>>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>>
>>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>>
>>>>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>>>> ---
>>>>>>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>>>
>>>>>>>  	u8 target_version_major;
>>>>>>>  	u8 target_version_minor;
>>>>>>> +	bool needs_unlock;
>>>>>>>  	struct completion target_version_received;
>>>>>>>  	u8 max_num_amsdu;
>>>>>>>  	u8 max_num_ampdu;
>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>>>  	lockdep_assert_held(&htt->tx_lock);
>>>>>>>
>>>>>>>  	htt->num_pending_tx--;
>>>>>>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>>
>>>>>> Why /4? Seems a bit arbitrary?
>>>>>
>>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>>> sooner to keep it more full on average?
>>>>
>>>> Theoretically, the "keep the queue at the lowest possible level that
>>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>>> skeptical that it will work right for this, but if anyone wants to give
>>>> it a shot, there it is.
>>>>
>>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>>> to keep the arbitrary limit maybe use the same one? :)
>>>>
>>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>>> possible, as in restart the queues as soon as a single slot opens up
>>>>> in the tx queue.
>>>>
>>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>>> at least. But I guess that's what we're fixing with AQL. What does the
>>>> firmware do with the frames queued within? Do they just go on a FIFO
>>>> queue altogether, or something smarter?
>>>
>>> Sort of like a mini-mac80211 stack inside the firmware is used to
>>> create ampdu/amsdu chains and schedule them with its own scheduler.
>>>
>>> For optimal throughput with 200 users steaming video,
>>> the ath10k driver should think that it has only a few active peers wanting
>>> to send data at a time (and so firmware would think the same), and the driver should
>>> be fed a large chunk of pkts for those peers.  And then the next few peers.
>>> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
>>> over all.
>>
>> Yes, but also increasing latency because all other stations have to wait
>> for a longer TXOP (see my other reply).
>
> If you at most sent 4 station's worth of data to the firmware, and max
> is 4ms per txop, then you have at most 16ms of latency. You could also
> send just two station's worth of data at a time, as long as you can
> quickly service the tx-queues again that should be enough to keep the
> firmware/radio productive.

Sure, but if you have 100 stations that are backlogged, and they each
transmit for 4ms every time they get a chance, then on average each
station has to wait 400 ms between each TXOP. That is way too long; so
the maximum TXOP size should go down as the number of backlogged
stations goes up.

AQL already does some of the right thing, BTW: When the total outstanding
data queued in the firmware exceeds 24ms (by the AQL estimation), it'll
switch the per-station limit from 12ms to 5ms of queued data. Arguably
that could be lower (say, 2ms for the low per-station limit).

> In the case where you have many users wanting lots of throughput, 8 or
> 16ms of extra latency is a good tradeoff vs no one being able to
> reliably get the bandwidth they need.

Yes, 8-16ms of extra latency is likely worth it, but not much more than
that...

> Higher priority TIDs will get precedence in ath10k firmware anyway, so
> even if at time 0 you sent 64 frames to a peer on back-ground TID, if
> you sent a VO frame at time 0+1, it could be transmitted first.

Assuming anyone is actually using the priorities, that is; most
applications are not (and those that are often do it wrong). Also, using
the VO queue hurts efficiency as well for the same reason, since it
can't aggregate at all.

>>> If you feed a few frames to each of the 200 peers, then even if
>>> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
>>> leading to small ampdu/amsdu and thus worse over-all throughput and
>>> utilization of airtime.
>>
>> Do you have any data on exactly how long (in time) each txop becomes in
>> these highly congested scenarios?
>
> I didn't look at time, but avg packets-per-ampdu chain is 30+ in
> single station tests, and with many stations it goes into the 4-8
> range (from memory, so maybe I'm off a bit).

Right; was just looking for rough numbers, so that is fine.

> Here is an open-source tool that can give you those metrics by processing a pcap:
>
> https://github.com/greearb/lanforge-scripts/tree/master/wifi_diag
>
> # Ignore the LANforge bits about creating capture files, here is an example of how to use it:
> https://www.candelatech.com/cookbook/wifire/wifi+diagnostic
>
>>
>>> It would be nice to be able to set certain traffic flows to have the
>>> throughput optimization and others to have the latency optimization.
>>> For instance, high latency on a streaming download is a good trade-off
>>> if it increases total throughput.
>>
>> For the individual flows to a peer, fq_codel actually does a pretty good
>> job at putting the latency-sensitive flows first. Which is why we want
>> the queueing to happen in mac80211 (where fq_codel is active) instead of
>> in the firmware.
>
> That sounds good to me. What is needed from the driver/firmware to
> make this work well?

To queue as little as possible :)

AQL is the mechanism we have to enforce this (for ath10k), by throttling
queueing into the firmware earlier. It only enforces a per-station
limit, though, and there's currently no mechanism to limit the total
number of stations with outstanding data, as you're requesting. Which I
guess means it won't help much in your case. One could imagine building
a mechanism on top of AQL to do this, although I think it may
not be quite trivial to get the interaction with the station scheduler
right. The basic building blocks are there, though...

-Toke


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-29 14:56             ` Toke Høiland-Jørgensen
@ 2020-04-29 15:26               ` Ben Greear
  2020-04-29 15:42                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 27+ messages in thread
From: Ben Greear @ 2020-04-29 15:26 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, linux-wireless



On 04/29/2020 07:56 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <greearb@candelatech.com> writes:
>
>> On 04/29/2020 02:28 AM, Toke Høiland-Jørgensen wrote:
>>> Ben Greear <greearb@candelatech.com> writes:
>>>
>>>> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>>>>> Ben Greear <greearb@candelatech.com> writes:
>>>>>
>>>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>>>> greearb@candelatech.com writes:
>>>>>>>
>>>>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>>>>
>>>>>>>> While running tcp upload + download tests with ~200
>>>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>>>>> around 20% of the CPU according to perf-top, which other locking
>>>>>>>> taking an additional ~15%.
>>>>>>>>
>>>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>>>> waiting for a low water mark.
>>>>>>>>
>>>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>>>> tx buffers allowed.
>>>>>>>>
>>>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>>>
>>>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>>>
>>>>>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>>>>> ---
>>>>>>>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>>>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>>>>
>>>>>>>>  	u8 target_version_major;
>>>>>>>>  	u8 target_version_minor;
>>>>>>>> +	bool needs_unlock;
>>>>>>>>  	struct completion target_version_received;
>>>>>>>>  	u8 max_num_amsdu;
>>>>>>>>  	u8 max_num_ampdu;
>>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>>>>  	lockdep_assert_held(&htt->tx_lock);
>>>>>>>>
>>>>>>>>  	htt->num_pending_tx--;
>>>>>>>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>>>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>>>
>>>>>>> Why /4? Seems a bit arbitrary?
>>>>>>
>>>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>>>> sooner to keep it more full on average?
>>>>>
>>>>> Theoretically, the "keep the queue at the lowest possible level that
>>>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>>>> skeptical that it will work right for this, but if anyone wants to give
>>>>> it a shot, there it is.
>>>>>
>>>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>>>> to keep the arbitrary limit maybe use the same one? :)
>>>>>
>>>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>>>> possible, as in restart the queues as soon as a single slot opens up
>>>>>> in the tx queue.
>>>>>
>>>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>>>> at least. But I guess that's what we're fixing with AQL. What does the
>>>>> firmware do with the frames queued within? Do they just go on a FIFO
>>>>> queue altogether, or something smarter?
>>>>
>>>> Sort of like a mini-mac80211 stack inside the firmware is used to
>>>> create ampdu/amsdu chains and schedule them with its own scheduler.
>>>>
>>>> For optimal throughput with 200 users steaming video,
>>>> the ath10k driver should think that it has only a few active peers wanting
>>>> to send data at a time (and so firmware would think the same), and the driver should
>>>> be fed a large chunk of pkts for those peers.  And then the next few peers.
>>>> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
>>>> over all.
>>>
>>> Yes, but also increasing latency because all other stations have to wait
>>> for a longer TXOP (see my other reply).
>>
>> If you at most sent 4 station's worth of data to the firmware, and max
>> is 4ms per txop, then you have at most 16ms of latency. You could also
>> send just two station's worth of data at a time, as long as you can
>> quickly service the tx-queues again that should be enough to keep the
>> firmware/radio productive.
>
> Sure, but if you have 100 stations that are backlogged, and they each
> transmit for 4ms every time they get a chance, then on average each
> station has to wait 400 ms between each TXOP. That is way too long; so
> the maximum TXOP size should go down as the number of backlogged
> stations go up.

400ms should be fine for streaming netflix, and there is no reason that
higher priority traffic cannot jump ahead in the queue for much
better latency.  If the frames are queued up in mac80211, then you don't
have to depend on TID scheduling in the driver/firmware anyway.

> AQL already does some the right thing, BTW: When the total outstanding
> data queued in the firmware exceeds 24ms (by the AQL estimation), it'll
> switch the per-station limit from 12ms to 5ms of queued data. Arguably
> that could be lower (say, 2ms for the low per-station limit).

That should still work OK for bulk transport, as 5ms is a full ampdu chain.

>> In the case where you have many users wanting lots of throughput, 8 or
>> 16ms of extra latency is a good tradeoff vs no one being able to
>> reliably get the bandwidth they need.
>
> Yes, 8-16ms of extra latency is likely worth it, but not much more than
> that...
>
>> Higher priority TIDs will get precedence in ath10k firmware anyway, so
>> even if at time 0 you sent 64 frames to a peer on back-ground TID, if
>> you sent a VO frame at time 0+1, it could be transmitted first.
>
> Assuming anyone is actually using the priorities, that is; most
> appliations are not (and those that are often do it wrong). Also, using
> the VO queue hurts efficiency as well for the same reason, since it
> can't aggregate at all.

I think ath10k aggregates VO traffic just fine, but either way, you could
use VI instead.

The kernel can adjust the TID appropriately, so the mac80211 scheduler could
tweak the tid so that bulk transport always goes on BK, and things needing lower
latency go out on VI, perhaps?

>
>>>> If you feed a few frames to each of the 200 peers, then even if
>>>> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
>>>> leading to small ampdu/amsdu and thus worse over-all throughput and
>>>> utilization of airtime.
>>>
>>> Do you have any data on exactly how long (in time) each txop becomes in
>>> these highly congested scenarios?
>>
>> I didn't look at time, but avg packets-per-ampdu chain is 30+ in
>> single station tests, and with many stations it goes into the 4-8
>> range (from memory, so maybe I'm off a bit).
>
> Right; was just looking for rough numbers, so that is fine.
>
>> Here is an open-source tool that can give you those metrics by processing a pcap:
>>
>> https://github.com/greearb/lanforge-scripts/tree/master/wifi_diag
>>
>> # Ignore the LANforge bits about creating capture files, here is an example of how to use it:
>> https://www.candelatech.com/cookbook/wifire/wifi+diagnostic
>>
>>>
>>>> It would be nice to be able to set certain traffic flows to have the
>>>> throughput optimization and others to have the latency optimization.
>>>> For instance, high latency on a streaming download is a good trade-off
>>>> if it increases total throughput.
>>>
>>> For the individual flows to a peer, fq_codel actually does a pretty good
>>> job at putting the latency-sensitive flows first. Which is why we want
>>> the queueing to happen in mac80211 (where fq_codel is active) instead of
>>> in the firmware.
>>
>> That sounds good to me. What is needed from the driver/firmware to
>> make this work well?
>
> To queue as little as possible :)
>
> AQL is the mechanism we have to enforce this (for ath10k), by throttling
> queueing into the firmware earlier. It only does this on a per-station
> limit, though, and there's currently no mechanism to limit the total
> number of stations with outstanding data, as you're requesting. Which I
> guess means it won't help much in your case. One could imagine building
> a mechanism on top of AQL to do this, though, although I think it may
> not be quite trivial to get the interaction with the station scheduler
> right. The basic building blocks are there, though...

(I think what follows is right, but it is about as complicated as can be,
so maybe I am wrong in places.)

The ath10k wave-2 firmware has its own scheduler.  It asks the driver what
peers need to transmit, then it requests buffers for those peers (based on
its own idea of fairness and scheduling).  If the driver tells it that all
peers want to tx frames, it will queue lots, and you are at the mercy of its scheduler.
But, if the driver tells firmware only a few peers want to transmit, then
the driver (and upper stacks) will have a lot more control over the firmware's
queueing and scheduling.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-29 15:26               ` Ben Greear
@ 2020-04-29 15:42                 ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 27+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-04-29 15:42 UTC (permalink / raw)
  To: Ben Greear, linux-wireless

Ben Greear <greearb@candelatech.com> writes:

> On 04/29/2020 07:56 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <greearb@candelatech.com> writes:
>>
>>> On 04/29/2020 02:28 AM, Toke Høiland-Jørgensen wrote:
>>>> Ben Greear <greearb@candelatech.com> writes:
>>>>
>>>>> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>>>>>> Ben Greear <greearb@candelatech.com> writes:
>>>>>>
>>>>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>>>>> greearb@candelatech.com writes:
>>>>>>>>
>>>>>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>>>>>
>>>>>>>>> While running tcp upload + download tests with ~200
>>>>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>>>>>> around 20% of the CPU according to perf-top, which other locking
>>>>>>>>> taking an additional ~15%.
>>>>>>>>>
>>>>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>>>>> waiting for a low water mark.
>>>>>>>>>
>>>>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>>>>> tx buffers allowed.
>>>>>>>>>
>>>>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>>>>
>>>>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>>>>>> ---
>>>>>>>>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>>>>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>>>>>
>>>>>>>>>  	u8 target_version_major;
>>>>>>>>>  	u8 target_version_minor;
>>>>>>>>> +	bool needs_unlock;
>>>>>>>>>  	struct completion target_version_received;
>>>>>>>>>  	u8 max_num_amsdu;
>>>>>>>>>  	u8 max_num_ampdu;
>>>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>>>>>  	lockdep_assert_held(&htt->tx_lock);
>>>>>>>>>
>>>>>>>>>  	htt->num_pending_tx--;
>>>>>>>>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>>>>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>>>>
>>>>>>>> Why /4? Seems a bit arbitrary?
>>>>>>>
>>>>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>>>>> sooner to keep it more full on average?
>>>>>>
>>>>>> Theoretically, the "keep the queue at the lowest possible level that
>>>>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>>>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>>>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>>>>> skeptical that it will work right for this, but if anyone wants to give
>>>>>> it a shot, there it is.
>>>>>>
>>>>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>>>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>>>>> to keep the arbitrary limit maybe use the same one? :)
>>>>>>
>>>>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>>>>> possible, as in restart the queues as soon as a single slot opens up
>>>>>>> in the tx queue.
>>>>>>
>>>>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>>>>> at least. But I guess that's what we're fixing with AQL. What does the
>>>>>> firmware do with the frames queued within? Do they just go on a FIFO
>>>>>> queue altogether, or something smarter?
>>>>>
>>>>> Sort of like a mini-mac80211 stack inside the firmware is used to
>>>>> create ampdu/amsdu chains and schedule them with its own scheduler.
>>>>>
>>>>> For optimal throughput with 200 users streaming video,
>>>>> the ath10k driver should think that it has only a few active peers wanting
>>>>> to send data at a time (and so firmware would think the same), and the driver should
>>>>> be fed a large chunk of pkts for those peers.  And then the next few peers.
>>>>> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
>>>>> overall.
>>>>
>>>> Yes, but also increasing latency because all other stations have to wait
>>>> for a longer TXOP (see my other reply).
>>>
>>> If you sent at most 4 stations' worth of data to the firmware, and the max
>>> is 4ms per txop, then you have at most 16ms of latency. You could also
>>> send just two stations' worth of data at a time; as long as you can
>>> quickly service the tx-queues again, that should be enough to keep the
>>> firmware/radio productive.
>>
>> Sure, but if you have 100 stations that are backlogged, and they each
>> transmit for 4ms every time they get a chance, then on average each
>> station has to wait 400 ms between each TXOP. That is way too long; so
>> the maximum TXOP size should go down as the number of backlogged
>> stations goes up.
>
> 400ms should be fine for streaming netflix, and there is no reason
> that higher priority traffic cannot jump ahead in the queue for much
> better latency. If the frames are queued up in mac80211, then you
> don't have to depend on TID scheduling in the driver/firmware anyway.

Yeah, in an ideal world where you have total information about current
and future traffic patterns that is of course what you would do. The
trick is to design a general algorithm that works for all the weird
cases (and traffic mixes!) with only the information that's available to
it. We've gotten a fair way towards that, but as you point out there are
still improvements possible.
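
As a sanity check on the waiting-time arithmetic quoted above (a rough
round-robin model only; it ignores contention, rate differences, and any
airtime-weighted scheduling):

```c
#include <assert.h>

/* If S backlogged stations each hold the medium for a full t-ms TXOP in
 * round-robin, a station waits roughly S * t ms between its own transmit
 * opportunities.  100 stations at 4 ms gives the ~400 ms figure above;
 * 4 stations at 4 ms gives the 16 ms figure from earlier in the thread. */
static int avg_wait_ms(int n_stations, int txop_ms)
{
	return n_stations * txop_ms;
}
```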

Really, I think you're the only one who's testing with this many
stations active at the same time, so I guess it's not surprising that
this case has not seen too much scrutiny thus far :)

>> AQL already does some of the right thing, BTW: When the total outstanding
>> data queued in the firmware exceeds 24ms (by the AQL estimation), it'll
>> switch the per-station limit from 12ms to 5ms of queued data. Arguably
>> that could be lower (say, 2ms for the low per-station limit).
>
> That still should work OK for bulk transport, as 5ms is a full ampdu chain.
>
>>> In the case where you have many users wanting lots of throughput, 8 or
>>> 16ms of extra latency is a good tradeoff vs no one being able to
>>> reliably get the bandwidth they need.
>>
>> Yes, 8-16ms of extra latency is likely worth it, but not much more than
>> that...
>>
>>> Higher priority TIDs will get precedence in ath10k firmware anyway, so
>>> even if at time 0 you sent 64 frames to a peer on back-ground TID, if
>>> you sent a VO frame at time 0+1, it could be transmitted first.
>>
>> Assuming anyone is actually using the priorities, that is; most
>> applications are not (and those that are often do it wrong). Also, using
>> the VO queue hurts efficiency as well for the same reason, since it
>> can't aggregate at all.
>
> I think ath10k aggregates VO traffic just fine, but either way, you could
> use VI instead.
>
> The kernel can adjust the TID appropriately, so the mac80211 scheduler
> could tweak the TID so that bulk transport always goes out on BK, and
> things needing lower latency go out on VI, perhaps?

I thought about messing with priorities like this, actually. The reason
I didn't go any further with it was that we found that it was often not
a win at all; if a station has both latency-sensitive and bulk traffic
you might as well just stick the low-latency traffic at the front of the
aggregate and send the whole thing in one go. Saves a TXOP :)
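
That "low-latency traffic at the front of the aggregate" idea can be
sketched as a stable sort over the pending frames (illustrative only;
"prio" here is a two-class stand-in, not a real TID mapping):

```c
#include <assert.h>
#include <stddef.h>

struct frame {
	int seq;	/* arrival order */
	int prio;	/* 0 = latency-sensitive, 1 = bulk */
};

/* Move latency-sensitive frames to the front of the aggregate.  A
 * stable insertion sort preserves arrival order within each class. */
static void latency_first(struct frame *f, size_t n)
{
	size_t i, j;

	for (i = 1; i < n; i++) {
		struct frame key = f[i];

		for (j = i; j > 0 && f[j - 1].prio > key.prio; j--)
			f[j] = f[j - 1];
		f[j] = key;
	}
}
```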

>>>>> If you feed a few frames to each of the 200 peers, then even if
>>>>> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
>>>>> leading to small ampdu/amsdu and thus worse over-all throughput and
>>>>> utilization of airtime.
>>>>
>>>> Do you have any data on exactly how long (in time) each txop becomes in
>>>> these highly congested scenarios?
>>>
>>> I didn't look at time, but avg packets-per-ampdu chain is 30+ in
>>> single station tests, and with many stations it goes into the 4-8
>>> range (from memory, so maybe I'm off a bit).
>>
>> Right; was just looking for rough numbers, so that is fine.
>>
>>> Here is an open-source tool that can give you those metrics by processing a pcap:
>>>
>>> https://github.com/greearb/lanforge-scripts/tree/master/wifi_diag
>>>
>>> # Ignore the LANforge bits about creating capture files, here is an example of how to use it:
>>> https://www.candelatech.com/cookbook/wifire/wifi+diagnostic
>>>
>>>>
>>>>> It would be nice to be able to set certain traffic flows to have the
>>>>> throughput optimization and others to have the latency optimization.
>>>>> For instance, high latency on a streaming download is a good trade-off
>>>>> if it increases total throughput.
>>>>
>>>> For the individual flows to a peer, fq_codel actually does a pretty good
>>>> job at putting the latency-sensitive flows first. Which is why we want
>>>> the queueing to happen in mac80211 (where fq_codel is active) instead of
>>>> in the firmware.
>>>
>>> That sounds good to me. What is needed from the driver/firmware to
>>> make this work well?
>>
>> To queue as little as possible :)
>>
>> AQL is the mechanism we have to enforce this (for ath10k), by throttling
>> queueing into the firmware earlier. It only does this on a per-station
>> limit, though, and there's currently no mechanism to limit the total
>> number of stations with outstanding data, as you're requesting. Which I
>> guess means it won't help much in your case. One could imagine building
>> a mechanism on top of AQL to do this, though, although I think it may
>> not be quite trivial to get the interaction with the station scheduler
>> right. The basic building blocks are there, though...
>
> (I think this below is right, but it is complicated as can be so maybe I
> am wrong in places.)
>
> The ath10k wave-2 firmware has its own scheduler. It asks the driver
> what peers need to transmit, then it requests buffers for those peers
> (based on its own idea of fairness and scheduling). If the driver
> tells it that all peers want to tx frames, it will queue lots and you are at the
> mercy of its scheduler. But, if the driver tells firmware only a few
> peers want to transmit, then the driver (and upper stacks) will have a
> lot more control over the firmware's queueing and scheduling.

Yeah, that's basically what we're trying to do in the airtime fairness
scheduler (see ieee80211_txq_may_transmit()).

-Toke


* Re: Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-28 20:39     ` Toke Høiland-Jørgensen
  2020-04-28 21:18       ` Ben Greear
@ 2020-04-30 23:31       ` John Deere
  2020-05-01  2:16         ` Ben Greear
  1 sibling, 1 reply; 27+ messages in thread
From: John Deere @ 2020-04-30 23:31 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Ben Greear, linux-wireless

I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors out 
with:

ath10k_pci  failed to increase tx pending count: -16, dropping

This is with ath10k non-ct firmware and driver on OpenWrt (w/backports 
5.4.27 & AQL). ath10k starts up but is unable to let any stations connect.

I've also tried the standalone patch by Ben and it seems to have reduced 
the latencies on top of AQL by another 5 ms.

On 4/29/20 4:39 AM, Toke Høiland-Jørgensen wrote:
> Ben Greear <greearb@candelatech.com> writes:
> 
>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>> greearb@candelatech.com writes:
>>>
>>>> From: Ben Greear <greearb@candelatech.com>
>>>>
>>>> While running tcp upload + download tests with ~200
>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>> around 20% of the CPU according to perf-top, with other locking
>>>> taking an additional ~15%.
>>>>
>>>> I believe the issue is that the ath10k driver would unlock the
>>>> txqueue when a single frame could be transmitted, instead of
>>>> waiting for a low water mark.
>>>>
>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>> tx buffers allowed.
>>>>
>>>> This appears to resolve the performance problem that I saw.
>>>>
>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>
>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>> ---
>>>>   drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>   drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>   2 files changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>
>>>>   	u8 target_version_major;
>>>>   	u8 target_version_minor;
>>>> +	bool needs_unlock;
>>>>   	struct completion target_version_received;
>>>>   	u8 max_num_amsdu;
>>>>   	u8 max_num_ampdu;
>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>   	lockdep_assert_held(&htt->tx_lock);
>>>>
>>>>   	htt->num_pending_tx--;
>>>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>
>>> Why /4? Seems a bit arbitrary?
>>
>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>> full so that it is unlikely to run dry. Possibly it should restart
>> sooner to keep it more full on average?
> 
> Theoretically, the "keep the queue at the lowest possible level that
> keeps it from underflowing" is what BQL is supposed to do. The diff
> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
> place of num_pending_tx. I've only compile tested it, and I'm a bit
> skeptical that it will work right for this, but if anyone wants to give
> it a shot, there it is.
> 
> BTW, while doing that, I noticed there's a similar arbitrary limit in
> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
> to keep the arbitrary limit maybe use the same one? :)
> 
>> Before my patch, the behaviour would be to try to keep it as full as
>> possible, as in restart the queues as soon as a single slot opens up
>> in the tx queue.
> 
> Yeah, that seems somewhat misguided as well, from a latency perspective,
> at least. But I guess that's what we're fixing with AQL. What does the
> firmware do with the frames queued within? Do they just go on a FIFO
> queue altogether, or something smarter?
> 
> -Toke
> 
> 
> 
> 
> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
> index f26cc6989dad..72771ff38a94 100644
> --- a/drivers/net/wireless/ath/ath10k/core.c
> +++ b/drivers/net/wireless/ath/ath10k/core.c
> @@ -2497,6 +2497,10 @@ static int ath10k_core_init_firmware_features(struct ath10k *ar)
>   		return -EINVAL;
>   	}
>   
> +	dql_init(&ar->htt.dql, HZ);
> +	ar->htt.dql.max_limit = ar->htt.max_num_pending_tx;
> +	ar->htt.dql.min_limit = 8;
> +
>   	if (ar->hw_params.num_peers)
>   		ar->max_num_peers = ar->hw_params.num_peers;
>   	else
> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
> index 4a12564fc30e..19024d063896 100644
> --- a/drivers/net/wireless/ath/ath10k/htt.h
> +++ b/drivers/net/wireless/ath/ath10k/htt.h
> @@ -13,6 +13,7 @@
>   #include <linux/dmapool.h>
>   #include <linux/hashtable.h>
>   #include <linux/kfifo.h>
> +#include <linux/dynamic_queue_limits.h>
>   #include <net/mac80211.h>
>   
>   #include "htc.h"
> @@ -1965,8 +1966,8 @@ struct ath10k_htt {
>   	/* Protects access to pending_tx, num_pending_tx */
>   	spinlock_t tx_lock;
>   	int max_num_pending_tx;
> -	int num_pending_tx;
>   	int num_pending_mgmt_tx;
> +	struct dql dql;
>   	struct idr pending_tx;
>   	wait_queue_head_t empty_tx_wq;
>   
> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
> index e9d12ea708b6..911a79470bdf 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>   {
>   	lockdep_assert_held(&htt->tx_lock);
>   
> -	htt->num_pending_tx--;
> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
> +	dql_completed(&htt->dql, 1);
> +	if (dql_avail(&htt->dql) > 0)
>   		ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>   }
>   
> @@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>   {
>   	lockdep_assert_held(&htt->tx_lock);
>   
> -	if (htt->num_pending_tx >= htt->max_num_pending_tx)
> +	if (dql_avail(&htt->dql) <= 0)
>   		return -EBUSY;
>   
> -	htt->num_pending_tx++;
> -	if (htt->num_pending_tx == htt->max_num_pending_tx)
> +	dql_queued(&htt->dql, 1);
> +	if (dql_avail(&htt->dql) <= 0)
>   		ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>   
>   	return 0;
> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
> index 2d03b8dd3b8c..1fe251742b0a 100644
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct ieee80211_hw *hw,
>   	if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH)
>   		return true;
>   
> -	if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed)
> +	if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed)
>   		return true;
>   
>   	if (artxq->num_fw_queued < artxq->num_push_allowed)
> @@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>   	if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
>   		return;
>   
> -	if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2))
> +	if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2))
>   		return;
>   
>   	rcu_read_lock();
> @@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k *ar)
>   			bool empty;
>   
>   			spin_lock_bh(&ar->htt.tx_lock);
> -			empty = (ar->htt.num_pending_tx == 0);
> +			empty = (ar->htt.dql.num_completed == ar->htt.dql.num_queued);
>   			spin_unlock_bh(&ar->htt.tx_lock);
>   
>   			skip = (ar->state == ATH10K_STATE_WEDGED) ||
> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
> index 39abf8b12903..fe7cd53c2bf9 100644
> --- a/drivers/net/wireless/ath/ath10k/txrx.c
> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> @@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>   
>   	ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
>   	ath10k_htt_tx_dec_pending(htt);
> -	if (htt->num_pending_tx == 0)
> +	if (htt->dql.num_completed == htt->dql.num_queued)
>   		wake_up(&htt->empty_tx_wq);
>   	spin_unlock_bh(&htt->tx_lock);
>   
> 
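
For readers unfamiliar with the dql API used in the quoted diff, a minimal
userspace mock of its accounting (simplified: the real
<linux/dynamic_queue_limits.h> also grows and shrinks the limit at runtime,
which is the whole point of BQL) behaves like this:

```c
#include <assert.h>

/* dql_queued() counts objects handed to the hardware, dql_completed()
 * counts completions, and dql_avail() says how many more objects may be
 * queued under the current limit.  This mock keeps 'limit' fixed. */
struct mock_dql {
	unsigned int num_queued;
	unsigned int num_completed;
	unsigned int limit;
};

static void mock_dql_queued(struct mock_dql *d, unsigned int n)
{
	d->num_queued += n;
}

static void mock_dql_completed(struct mock_dql *d, unsigned int n)
{
	d->num_completed += n;
}

static int mock_dql_avail(const struct mock_dql *d)
{
	return (int)d->limit - (int)(d->num_queued - d->num_completed);
}
```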


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-04-30 23:31       ` John Deere
@ 2020-05-01  2:16         ` Ben Greear
  2020-05-01 15:50           ` John Deere
  0 siblings, 1 reply; 27+ messages in thread
From: Ben Greear @ 2020-05-01  2:16 UTC (permalink / raw)
  To: John Deere, Toke Høiland-Jørgensen, linux-wireless



On 04/30/2020 04:31 PM, John Deere wrote:
> I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors out with:
>
> ath10k_pci  failed to increase tx pending count: -16, dropping
>
> This is with ath10k non-ct firmware and driver on OpenWrt (w/backports 5.4.27 & AQL). ath10k starts up but is unable to let any stations connect.
>
> I've also tried the standalone patch by Ben and it seems to have reduced the latencies on top of AQL by another 5 ms.

Hello,

Did you notice any throughput changes or system load changes in the test that you did with my patch?

Thanks,
Ben

>
> On 4/29/20 4:39 AM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <greearb@candelatech.com> writes:
>>
>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>> greearb@candelatech.com writes:
>>>>
>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>
>>>>> While running tcp upload + download tests with ~200
>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>> around 20% of the CPU according to perf-top, with other locking
>>>>> taking an additional ~15%.
>>>>>
>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>> txqueue when a single frame could be transmitted, instead of
>>>>> waiting for a low water mark.
>>>>>
>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>> tx buffers allowed.
>>>>>
>>>>> This appears to resolve the performance problem that I saw.
>>>>>
>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>
>>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>> ---
>>>>>   drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>   drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>   2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>
>>>>>       u8 target_version_major;
>>>>>       u8 target_version_minor;
>>>>> +    bool needs_unlock;
>>>>>       struct completion target_version_received;
>>>>>       u8 max_num_amsdu;
>>>>>       u8 max_num_ampdu;
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>       lockdep_assert_held(&htt->tx_lock);
>>>>>
>>>>>       htt->num_pending_tx--;
>>>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>> +    if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>
>>>> Why /4? Seems a bit arbitrary?
>>>
>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>> full so that it is unlikely to run dry. Possibly it should restart
>>> sooner to keep it more full on average?
>>
>> Theoretically, the "keep the queue at the lowest possible level that
>> keeps it from underflowing" is what BQL is supposed to do. The diff
>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>> skeptical that it will work right for this, but if anyone wants to give
>> it a shot, there it is.
>>
>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>> to keep the arbitrary limit maybe use the same one? :)
>>
>>> Before my patch, the behaviour would be to try to keep it as full as
>>> possible, as in restart the queues as soon as a single slot opens up
>>> in the tx queue.
>>
>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>> at least. But I guess that's what we're fixing with AQL. What does the
>> firmware do with the frames queued within? Do they just go on a FIFO
>> queue altogether, or something smarter?
>>
>> -Toke
>>
>>
>>
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
>> index f26cc6989dad..72771ff38a94 100644
>> --- a/drivers/net/wireless/ath/ath10k/core.c
>> +++ b/drivers/net/wireless/ath/ath10k/core.c
>> @@ -2497,6 +2497,10 @@ static int ath10k_core_init_firmware_features(struct ath10k *ar)
>>           return -EINVAL;
>>       }
>>   +    dql_init(&ar->htt.dql, HZ);
>> +    ar->htt.dql.max_limit = ar->htt.max_num_pending_tx;
>> +    ar->htt.dql.min_limit = 8;
>> +
>>       if (ar->hw_params.num_peers)
>>           ar->max_num_peers = ar->hw_params.num_peers;
>>       else
>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>> index 4a12564fc30e..19024d063896 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>> @@ -13,6 +13,7 @@
>>   #include <linux/dmapool.h>
>>   #include <linux/hashtable.h>
>>   #include <linux/kfifo.h>
>> +#include <linux/dynamic_queue_limits.h>
>>   #include <net/mac80211.h>
>>     #include "htc.h"
>> @@ -1965,8 +1966,8 @@ struct ath10k_htt {
>>       /* Protects access to pending_tx, num_pending_tx */
>>       spinlock_t tx_lock;
>>       int max_num_pending_tx;
>> -    int num_pending_tx;
>>       int num_pending_mgmt_tx;
>> +    struct dql dql;
>>       struct idr pending_tx;
>>       wait_queue_head_t empty_tx_wq;
>>   diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> index e9d12ea708b6..911a79470bdf 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> @@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>   {
>>       lockdep_assert_held(&htt->tx_lock);
>>   -    htt->num_pending_tx--;
>> -    if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>> +    dql_completed(&htt->dql, 1);
>> +    if (dql_avail(&htt->dql) > 0)
>>           ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>   }
>>   @@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>>   {
>>       lockdep_assert_held(&htt->tx_lock);
>>   -    if (htt->num_pending_tx >= htt->max_num_pending_tx)
>> +    if (dql_avail(&htt->dql) <= 0)
>>           return -EBUSY;
>>   -    htt->num_pending_tx++;
>> -    if (htt->num_pending_tx == htt->max_num_pending_tx)
>> +    dql_queued(&htt->dql, 1);
>> +    if (dql_avail(&htt->dql) <= 0)
>>           ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>         return 0;
>> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
>> index 2d03b8dd3b8c..1fe251742b0a 100644
>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>> @@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct ieee80211_hw *hw,
>>       if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH)
>>           return true;
>>   -    if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed)
>> +    if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed)
>>           return true;
>>         if (artxq->num_fw_queued < artxq->num_push_allowed)
>> @@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>       if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
>>           return;
>>   -    if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2))
>> +    if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2))
>>           return;
>>         rcu_read_lock();
>> @@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k *ar)
>>               bool empty;
>>                 spin_lock_bh(&ar->htt.tx_lock);
>> -            empty = (ar->htt.num_pending_tx == 0);
>> +            empty = (ar->htt.dql.num_completed == ar->htt.dql.num_queued);
>>               spin_unlock_bh(&ar->htt.tx_lock);
>>                 skip = (ar->state == ATH10K_STATE_WEDGED) ||
>> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
>> index 39abf8b12903..fe7cd53c2bf9 100644
>> --- a/drivers/net/wireless/ath/ath10k/txrx.c
>> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
>> @@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>>         ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
>>       ath10k_htt_tx_dec_pending(htt);
>> -    if (htt->num_pending_tx == 0)
>> +    if (htt->dql.num_completed == htt->dql.num_queued)
>>           wake_up(&htt->empty_tx_wq);
>>       spin_unlock_bh(&htt->tx_lock);
>>
>

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-05-01  2:16         ` Ben Greear
@ 2020-05-01 15:50           ` John Deere
  2020-05-01 17:51             ` Mark Baker
  0 siblings, 1 reply; 27+ messages in thread
From: John Deere @ 2020-05-01 15:50 UTC (permalink / raw)
  To: Ben Greear, Toke Høiland-Jørgensen, linux-wireless



On 5/1/20 10:16 AM, Ben Greear wrote:
> 
> 
> On 04/30/2020 04:31 PM, John Deere wrote:
>> I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors 
>> out with:
>>
>> ath10k_pci  failed to increase tx pending count: -16, dropping
>>
>> This is with ath10k non-ct firmware and driver on OpenWrt (w/backports 
>> 5.4.27 & AQL). ath10k starts up but is unable to let any stations 
>> connect.
>>
>> I've also tried the standalone patch by Ben and it seems to have 
>> reduced the latencies on top of AQL by another 5 ms.
> 
> Hello,
> 
> Did you notice any throughput changes or system load changes in the test 
> that you did with my patch?
> 
> Thanks,
> Ben
> 

I have noticed a reduction in system load and memory use. Whereas 
previously, with 11 clients on one Archer C7 acting as an AP only, my 
free memory would be 51%, it now shows 55-56% - a 4 to 5 percentage 
point reduction in memory use. Note, these results were obtained 
alongside reverting the following commit: 
https://github.com/openwrt/openwrt/commit/1e27befe63ff4c69f110c7310316f4af75ee63e9. 
I believe this same set of patches is also currently in use for 
the ath10k-ct smallbuffers variant on OpenWrt.

I have not had the time to conduct any meaningful throughput measurements.

>>
>> On 4/29/20 4:39 AM, Toke Høiland-Jørgensen wrote:
>>> Ben Greear <greearb@candelatech.com> writes:
>>>
>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>> greearb@candelatech.com writes:
>>>>>
>>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>>
>>>>>> While running tcp upload + download tests with ~200
>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>>> around 20% of the CPU according to perf-top, with other locking
>>>>>> taking an additional ~15%.
>>>>>>
>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>> waiting for a low water mark.
>>>>>>
>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>> tx buffers allowed.
>>>>>>
>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>
>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>
>>>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>>> ---
>>>>>>   drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>>   drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>>   2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h 
>>>>>> b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>>
>>>>>>       u8 target_version_major;
>>>>>>       u8 target_version_minor;
>>>>>> +    bool needs_unlock;
>>>>>>       struct completion target_version_received;
>>>>>>       u8 max_num_amsdu;
>>>>>>       u8 max_num_ampdu;
>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c 
>>>>>> b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct 
>>>>>> ath10k_htt *htt)
>>>>>>       lockdep_assert_held(&htt->tx_lock);
>>>>>>
>>>>>>       htt->num_pending_tx--;
>>>>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>> +    if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && 
>>>>>> htt->needs_unlock) {
>>>>>
>>>>> Why /4? Seems a bit arbitrary?
>>>>
>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>> sooner to keep it more full on average?
>>>
>>> Theoretically, the "keep the queue at the lowest possible level that
>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>> skeptical that it will work right for this, but if anyone wants to give
>>> it a shot, there it is.
>>>
>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>> to keep the arbitrary limit maybe use the same one? :)
>>>
>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>> possible, as in restart the queues as soon as a single slot opens up
>>>> in the tx queue.
>>>
>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>> at least. But I guess that's what we're fixing with AQL. What does the
>>> firmware do with the frames queued within? Do they just go on a FIFO
>>> queue altogether, or something smarter?
>>>
>>> -Toke
>>>
>>>
>>>
>>>
>>> diff --git a/drivers/net/wireless/ath/ath10k/core.c 
>>> b/drivers/net/wireless/ath/ath10k/core.c
>>> index f26cc6989dad..72771ff38a94 100644
>>> --- a/drivers/net/wireless/ath/ath10k/core.c
>>> +++ b/drivers/net/wireless/ath/ath10k/core.c
>>> @@ -2497,6 +2497,10 @@ static int 
>>> ath10k_core_init_firmware_features(struct ath10k *ar)
>>>           return -EINVAL;
>>>       }
>>>   +    dql_init(&ar->htt.dql, HZ);
>>> +    ar->htt.dql.max_limit = ar->htt.max_num_pending_tx;
>>> +    ar->htt.dql.min_limit = 8;
>>> +
>>>       if (ar->hw_params.num_peers)
>>>           ar->max_num_peers = ar->hw_params.num_peers;
>>>       else
>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h 
>>> b/drivers/net/wireless/ath/ath10k/htt.h
>>> index 4a12564fc30e..19024d063896 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>> @@ -13,6 +13,7 @@
>>>   #include <linux/dmapool.h>
>>>   #include <linux/hashtable.h>
>>>   #include <linux/kfifo.h>
>>> +#include <linux/dynamic_queue_limits.h>
>>>   #include <net/mac80211.h>
>>>     #include "htc.h"
>>> @@ -1965,8 +1966,8 @@ struct ath10k_htt {
>>>       /* Protects access to pending_tx, num_pending_tx */
>>>       spinlock_t tx_lock;
>>>       int max_num_pending_tx;
>>> -    int num_pending_tx;
>>>       int num_pending_mgmt_tx;
>>> +    struct dql dql;
>>>       struct idr pending_tx;
>>>       wait_queue_head_t empty_tx_wq;
>>>   diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c 
>>> b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> index e9d12ea708b6..911a79470bdf 100644
>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>> @@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt 
>>> *htt)
>>>   {
>>>       lockdep_assert_held(&htt->tx_lock);
>>>   -    htt->num_pending_tx--;
>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>> +    dql_completed(&htt->dql, 1);
>>> +    if (dql_avail(&htt->dql) > 0)
>>>           ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>>   }
>>>   @@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct 
>>> ath10k_htt *htt)
>>>   {
>>>       lockdep_assert_held(&htt->tx_lock);
>>>   -    if (htt->num_pending_tx >= htt->max_num_pending_tx)
>>> +    if (dql_avail(&htt->dql) <= 0)
>>>           return -EBUSY;
>>>   -    htt->num_pending_tx++;
>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx)
>>> +    dql_queued(&htt->dql, 1);
>>> +    if (dql_avail(&htt->dql) <= 0)
>>>           ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>>         return 0;
>>> diff --git a/drivers/net/wireless/ath/ath10k/mac.c 
>>> b/drivers/net/wireless/ath/ath10k/mac.c
>>> index 2d03b8dd3b8c..1fe251742b0a 100644
>>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>>> @@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct 
>>> ieee80211_hw *hw,
>>>       if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH)
>>>           return true;
>>>   -    if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed)
>>> +    if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed)
>>>           return true;
>>>         if (artxq->num_fw_queued < artxq->num_push_allowed)
>>> @@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>>       if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
>>>           return;
>>>   -    if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2))
>>> +    if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2))
>>>           return;
>>>         rcu_read_lock();
>>> @@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k 
>>> *ar)
>>>               bool empty;
>>>                 spin_lock_bh(&ar->htt.tx_lock);
>>> -            empty = (ar->htt.num_pending_tx == 0);
>>> +            empty = (ar->htt.dql.num_completed == 
>>> ar->htt.dql.num_queued);
>>>               spin_unlock_bh(&ar->htt.tx_lock);
>>>                 skip = (ar->state == ATH10K_STATE_WEDGED) ||
>>> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c 
>>> b/drivers/net/wireless/ath/ath10k/txrx.c
>>> index 39abf8b12903..fe7cd53c2bf9 100644
>>> --- a/drivers/net/wireless/ath/ath10k/txrx.c
>>> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
>>> @@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>>>         ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
>>>       ath10k_htt_tx_dec_pending(htt);
>>> -    if (htt->num_pending_tx == 0)
>>> +    if (htt->dql.num_completed == htt->dql.num_queued)
>>>           wake_up(&htt->empty_tx_wq);
>>>       spin_unlock_bh(&htt->tx_lock);
>>>
>>
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-05-01 15:50           ` John Deere
@ 2020-05-01 17:51             ` Mark Baker
  2020-05-02  5:49               ` John Deere
  0 siblings, 1 reply; 27+ messages in thread
From: Mark Baker @ 2020-05-01 17:51 UTC (permalink / raw)
  To: John Deere; +Cc: Ben Greear, Toke Høiland-Jørgensen, linux-wireless

I have been using Toke’s patch successfully with my Netgear R7800 (QCA9984) running an OpenWrt snapshot build with ath10k-ct firmware and the ath10k-ct-smallbuffers driver. I am using my R7800 as an AP only, so I am running a very stripped-down build of OpenWrt.

I provided several Flent tests to Dave Taht and Toke showing some pretty significant improvements in latency. I did not capture any CPU/sirq stats on my R7800 during these tests, though. It does appear my maximum download bandwidth has decreased somewhat, but I am more than happy to take that hit for the benefit of significantly lower latency.

> On May 1, 2020, at 11:50 AM, John Deere <24601deerej@gmail.com> wrote:
> 
> 
> 
> On 5/1/20 10:16 AM, Ben Greear wrote:
>> On 04/30/2020 04:31 PM, John Deere wrote:
>>> I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors out with:
>>> 
>>> ath10k_pci  failed to increase tx pending count: -16, dropping
>>> 
>>> This is with ath10k non-ct firmware and driver on OpenWrt (w/backports 5.4.27 & AQL). ath10k starts up but is unable to let any stations connect.
>>> 
>>> I've also tried the standalone patch by Ben and it seems to have reduced the latencies on top of AQL by another 5 ms.
>> Hello,
>> Did you notice any throughput changes or system load changes in the test that you did with my patch?
>> Thanks,
>> Ben
> 
> I have noticed that there has been a reduction in system load and memory use. Whereas previously with 11 clients on one Archer C7 acting as an AP only, my free memory available would be 51%, it now shows 55-56% - a 4% to 5% reduction. Note, these results were obtained alongside with reverting the following comit https://github.com/openwrt/openwrt/commit/1e27befe63ff4c69f110c7310316f4af75ee63e9. I believe that this same set of patches also are currently in use for the ath10k-ct smallbuffers variant on OpenWrt.
> 
> I have not had the time to conduct any meaningful throughput measurements.
> 
>>> 
>>> On 4/29/20 4:39 AM, Toke Høiland-Jørgensen wrote:
>>>> Ben Greear <greearb@candelatech.com> writes:
>>>> 
>>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>>> greearb@candelatech.com writes:
>>>>>> 
>>>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>>> 
>>>>>>> While running tcp upload + download tests with ~200
>>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>>>> around 20% of the CPU according to perf-top, which other locking
>>>>>>> taking an additional ~15%.
>>>>>>> 
>>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>>> waiting for a low water mark.
>>>>>>> 
>>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>>> tx buffers allowed.
>>>>>>> 
>>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>> 
>>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>> 
>>>>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>>>> ---
>>>>>>>   drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>>>   drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>>>   2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>> 
>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>>> 
>>>>>>>       u8 target_version_major;
>>>>>>>       u8 target_version_minor;
>>>>>>> +    bool needs_unlock;
>>>>>>>       struct completion target_version_received;
>>>>>>>       u8 max_num_amsdu;
>>>>>>>       u8 max_num_ampdu;
>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>>>       lockdep_assert_held(&htt->tx_lock);
>>>>>>> 
>>>>>>>       htt->num_pending_tx--;
>>>>>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>>> +    if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>> 
>>>>>> Why /4? Seems a bit arbitrary?
>>>>> 
>>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>>> sooner to keep it more full on average?
>>>> 
>>>> Theoretically, the "keep the queue at the lowest possible level that
>>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>>> skeptical that it will work right for this, but if anyone wants to give
>>>> it a shot, there it is.
>>>> 
>>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>>> to keep the arbitrary limit maybe use the same one? :)
>>>> 
>>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>>> possible, as in restart the queues as soon as a single slot opens up
>>>>> in the tx queue.
>>>> 
>>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>>> at least. But I guess that's what we're fixing with AQL. What does the
>>>> firmware do with the frames queued within? Do they just go on a FIFO
>>>> queue altogether, or something smarter?
>>>> 
>>>> -Toke
>>>> 
>>>> 
>>>> 
>>>> 
>>>> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
>>>> index f26cc6989dad..72771ff38a94 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/core.c
>>>> +++ b/drivers/net/wireless/ath/ath10k/core.c
>>>> @@ -2497,6 +2497,10 @@ static int ath10k_core_init_firmware_features(struct ath10k *ar)
>>>>           return -EINVAL;
>>>>       }
>>>>   +    dql_init(&ar->htt.dql, HZ);
>>>> +    ar->htt.dql.max_limit = ar->htt.max_num_pending_tx;
>>>> +    ar->htt.dql.min_limit = 8;
>>>> +
>>>>       if (ar->hw_params.num_peers)
>>>>           ar->max_num_peers = ar->hw_params.num_peers;
>>>>       else
>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>> index 4a12564fc30e..19024d063896 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>> @@ -13,6 +13,7 @@
>>>>   #include <linux/dmapool.h>
>>>>   #include <linux/hashtable.h>
>>>>   #include <linux/kfifo.h>
>>>> +#include <linux/dynamic_queue_limits.h>
>>>>   #include <net/mac80211.h>
>>>>     #include "htc.h"
>>>> @@ -1965,8 +1966,8 @@ struct ath10k_htt {
>>>>       /* Protects access to pending_tx, num_pending_tx */
>>>>       spinlock_t tx_lock;
>>>>       int max_num_pending_tx;
>>>> -    int num_pending_tx;
>>>>       int num_pending_mgmt_tx;
>>>> +    struct dql dql;
>>>>       struct idr pending_tx;
>>>>       wait_queue_head_t empty_tx_wq;
>>>>   diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> index e9d12ea708b6..911a79470bdf 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>> @@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>   {
>>>>       lockdep_assert_held(&htt->tx_lock);
>>>>   -    htt->num_pending_tx--;
>>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>> +    dql_completed(&htt->dql, 1);
>>>> +    if (dql_avail(&htt->dql) > 0)
>>>>           ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>>>   }
>>>>   @@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>>>>   {
>>>>       lockdep_assert_held(&htt->tx_lock);
>>>>   -    if (htt->num_pending_tx >= htt->max_num_pending_tx)
>>>> +    if (dql_avail(&htt->dql) <= 0)
>>>>           return -EBUSY;
>>>>   -    htt->num_pending_tx++;
>>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx)
>>>> +    dql_queued(&htt->dql, 1);
>>>> +    if (dql_avail(&htt->dql) <= 0)
>>>>           ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>>>         return 0;
>>>> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
>>>> index 2d03b8dd3b8c..1fe251742b0a 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>>>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>>>> @@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct ieee80211_hw *hw,
>>>>       if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH)
>>>>           return true;
>>>>   -    if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed)
>>>> +    if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed)
>>>>           return true;
>>>>         if (artxq->num_fw_queued < artxq->num_push_allowed)
>>>> @@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>>>       if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
>>>>           return;
>>>>   -    if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2))
>>>> +    if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2))
>>>>           return;
>>>>         rcu_read_lock();
>>>> @@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k *ar)
>>>>               bool empty;
>>>>                 spin_lock_bh(&ar->htt.tx_lock);
>>>> -            empty = (ar->htt.num_pending_tx == 0);
>>>> +            empty = (ar->htt.dql.num_completed == ar->htt.dql.num_queued);
>>>>               spin_unlock_bh(&ar->htt.tx_lock);
>>>>                 skip = (ar->state == ATH10K_STATE_WEDGED) ||
>>>> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
>>>> index 39abf8b12903..fe7cd53c2bf9 100644
>>>> --- a/drivers/net/wireless/ath/ath10k/txrx.c
>>>> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
>>>> @@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>>>>         ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
>>>>       ath10k_htt_tx_dec_pending(htt);
>>>> -    if (htt->num_pending_tx == 0)
>>>> +    if (htt->dql.num_completed == htt->dql.num_queued)
>>>>           wake_up(&htt->empty_tx_wq);
>>>>       spin_unlock_bh(&htt->tx_lock);
>>>> 
>>> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-05-01 17:51             ` Mark Baker
@ 2020-05-02  5:49               ` John Deere
  2020-05-02 15:56                 ` Ben Greear
  0 siblings, 1 reply; 27+ messages in thread
From: John Deere @ 2020-05-02  5:49 UTC (permalink / raw)
  To: Mark Baker; +Cc: Ben Greear, Toke Høiland-Jørgensen, linux-wireless



On 5/2/20 1:51 AM, Mark Baker wrote:
> I have been using Toke’s patch successfully with my Netgear R7800 (QCA9884) running OpenWRT snapshot build with ath10k-ct firmware and ath10k-ct-smallbuffers driver. I am using my R7800 as an AP only, so I am running a very stripped down build of OpenWRT.

Interesting to hear that it works for you. To narrow down whether it 
is due to the CT firmware & driver vs. the device being Wave 2, is it 
possible for you to test a build with the ath10k non-CT driver and FW?

> 
> I provided several Flent tests to Dave Taht and Toke showing some pretty significant improvements in latency. I did not capture any CPU/sirq stats on my R7800 during these tests, though. It does appear my maximum download bandwidth has decreased somewhat, but I am more than happy to take that hit for the benefit of significantly lower latency.
> 
>> On May 1, 2020, at 11:50 AM, John Deere <24601deerej@gmail.com> wrote:
>>
>>
>>
>> On 5/1/20 10:16 AM, Ben Greear wrote:
>>> On 04/30/2020 04:31 PM, John Deere wrote:
>>>> I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors out with:
>>>>
>>>> ath10k_pci  failed to increase tx pending count: -16, dropping
>>>>
>>>> This is with ath10k non-ct firmware and driver on OpenWrt (w/backports 5.4.27 & AQL). ath10k starts up but is unable to let any stations connect.
>>>>
>>>> I've also tried the standalone patch by Ben and it seems to have reduced the latencies on top of AQL by another 5 ms.
>>> Hello,
>>> Did you notice any throughput changes or system load changes in the test that you did with my patch?

Hi Ben,

I've run some basic throughput tests with iperf, and it seems there 
was also a slight hit in throughput, but only by 5-20 Mbps or so.


>>> Thanks,
>>> Ben
>>
>> I have noticed that there has been a reduction in system load and memory use. Whereas previously with 11 clients on one Archer C7 acting as an AP only, my free memory available would be 51%, it now shows 55-56% - a 4% to 5% reduction. Note, these results were obtained alongside with reverting the following comit https://github.com/openwrt/openwrt/commit/1e27befe63ff4c69f110c7310316f4af75ee63e9. I believe that this same set of patches also are currently in use for the ath10k-ct smallbuffers variant on OpenWrt.
>>
>> I have not had the time to conduct any meaningful throughput measurements.
>>
>>>>
>>>> On 4/29/20 4:39 AM, Toke Høiland-Jørgensen wrote:
>>>>> Ben Greear <greearb@candelatech.com> writes:
>>>>>
>>>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>>>>> greearb@candelatech.com writes:
>>>>>>>
>>>>>>>> From: Ben Greear <greearb@candelatech.com>
>>>>>>>>
>>>>>>>> While running tcp upload + download tests with ~200
>>>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking
>>>>>>>> around 20% of the CPU according to perf-top, which other locking
>>>>>>>> taking an additional ~15%.
>>>>>>>>
>>>>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>>>>> txqueue when a single frame could be transmitted, instead of
>>>>>>>> waiting for a low water mark.
>>>>>>>>
>>>>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>>>>> tx buffers allowed.
>>>>>>>>
>>>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>>>
>>>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>>>>
>>>>>>>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>>>>>>> ---
>>>>>>>>    drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>>>>    drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>>>>    2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>>>>
>>>>>>>>        u8 target_version_major;
>>>>>>>>        u8 target_version_minor;
>>>>>>>> +    bool needs_unlock;
>>>>>>>>        struct completion target_version_received;
>>>>>>>>        u8 max_num_amsdu;
>>>>>>>>        u8 max_num_ampdu;
>>>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>>>>        lockdep_assert_held(&htt->tx_lock);
>>>>>>>>
>>>>>>>>        htt->num_pending_tx--;
>>>>>>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>>>>> +    if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>>>>
>>>>>>> Why /4? Seems a bit arbitrary?
>>>>>>
>>>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>>>>> full so that it is unlikely to run dry. Possibly it should restart
>>>>>> sooner to keep it more full on average?
>>>>>
>>>>> Theoretically, the "keep the queue at the lowest possible level that
>>>>> keeps it from underflowing" is what BQL is supposed to do. The diff
>>>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>>>>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>>>>> skeptical that it will work right for this, but if anyone wants to give
>>>>> it a shot, there it is.
>>>>>
>>>>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>>>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>>>>> to keep the arbitrary limit maybe use the same one? :)
>>>>>
>>>>>> Before my patch, the behaviour would be to try to keep it as full as
>>>>>> possible, as in restart the queues as soon as a single slot opens up
>>>>>> in the tx queue.
>>>>>
>>>>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>>>>> at least. But I guess that's what we're fixing with AQL. What does the
>>>>> firmware do with the frames queued within? Do they just go on a FIFO
>>>>> queue altogether, or something smarter?
>>>>>
>>>>> -Toke
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
>>>>> index f26cc6989dad..72771ff38a94 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/core.c
>>>>> +++ b/drivers/net/wireless/ath/ath10k/core.c
>>>>> @@ -2497,6 +2497,10 @@ static int ath10k_core_init_firmware_features(struct ath10k *ar)
>>>>>            return -EINVAL;
>>>>>        }
>>>>>    +    dql_init(&ar->htt.dql, HZ);
>>>>> +    ar->htt.dql.max_limit = ar->htt.max_num_pending_tx;
>>>>> +    ar->htt.dql.min_limit = 8;
>>>>> +
>>>>>        if (ar->hw_params.num_peers)
>>>>>            ar->max_num_peers = ar->hw_params.num_peers;
>>>>>        else
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> index 4a12564fc30e..19024d063896 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> @@ -13,6 +13,7 @@
>>>>>    #include <linux/dmapool.h>
>>>>>    #include <linux/hashtable.h>
>>>>>    #include <linux/kfifo.h>
>>>>> +#include <linux/dynamic_queue_limits.h>
>>>>>    #include <net/mac80211.h>
>>>>>      #include "htc.h"
>>>>> @@ -1965,8 +1966,8 @@ struct ath10k_htt {
>>>>>        /* Protects access to pending_tx, num_pending_tx */
>>>>>        spinlock_t tx_lock;
>>>>>        int max_num_pending_tx;
>>>>> -    int num_pending_tx;
>>>>>        int num_pending_mgmt_tx;
>>>>> +    struct dql dql;
>>>>>        struct idr pending_tx;
>>>>>        wait_queue_head_t empty_tx_wq;
>>>>>    diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> index e9d12ea708b6..911a79470bdf 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> @@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>    {
>>>>>        lockdep_assert_held(&htt->tx_lock);
>>>>>    -    htt->num_pending_tx--;
>>>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>> +    dql_completed(&htt->dql, 1);
>>>>> +    if (dql_avail(&htt->dql) > 0)
>>>>>            ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>>>>    }
>>>>>    @@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt)
>>>>>    {
>>>>>        lockdep_assert_held(&htt->tx_lock);
>>>>>    -    if (htt->num_pending_tx >= htt->max_num_pending_tx)
>>>>> +    if (dql_avail(&htt->dql) <= 0)
>>>>>            return -EBUSY;
>>>>>    -    htt->num_pending_tx++;
>>>>> -    if (htt->num_pending_tx == htt->max_num_pending_tx)
>>>>> +    dql_queued(&htt->dql, 1);
>>>>> +    if (dql_avail(&htt->dql) <= 0)
>>>>>            ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL);
>>>>>          return 0;
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
>>>>> index 2d03b8dd3b8c..1fe251742b0a 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>>>>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>>>>> @@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct ieee80211_hw *hw,
>>>>>        if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH)
>>>>>            return true;
>>>>>    -    if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed)
>>>>> +    if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed)
>>>>>            return true;
>>>>>          if (artxq->num_fw_queued < artxq->num_push_allowed)
>>>>> @@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>>>>        if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH)
>>>>>            return;
>>>>>    -    if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2))
>>>>> +    if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2))
>>>>>            return;
>>>>>          rcu_read_lock();
>>>>> @@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k *ar)
>>>>>                bool empty;
>>>>>                  spin_lock_bh(&ar->htt.tx_lock);
>>>>> -            empty = (ar->htt.num_pending_tx == 0);
>>>>> +            empty = (ar->htt.dql.num_completed == ar->htt.dql.num_queued);
>>>>>                spin_unlock_bh(&ar->htt.tx_lock);
>>>>>                  skip = (ar->state == ATH10K_STATE_WEDGED) ||
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c
>>>>> index 39abf8b12903..fe7cd53c2bf9 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/txrx.c
>>>>> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
>>>>> @@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>>>>>          ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id);
>>>>>        ath10k_htt_tx_dec_pending(htt);
>>>>> -    if (htt->num_pending_tx == 0)
>>>>> +    if (htt->dql.num_completed == htt->dql.num_queued)
>>>>>            wake_up(&htt->empty_tx_wq);
>>>>>        spin_unlock_bh(&htt->tx_lock);
>>>>>
>>>>
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] ath10k: Restart xmit queues below low-water mark.
  2020-05-02  5:49               ` John Deere
@ 2020-05-02 15:56                 ` Ben Greear
  0 siblings, 0 replies; 27+ messages in thread
From: Ben Greear @ 2020-05-02 15:56 UTC (permalink / raw)
  To: John Deere, Mark Baker; +Cc: Toke Høiland-Jørgensen, linux-wireless



On 05/01/2020 10:49 PM, John Deere wrote:
>

>>>>> I've also tried the standalone patch by Ben and it seems to have reduced the latencies on top of AQL by another 5 ms.
>>>> Hello,
>>>> Did you notice any throughput changes or system load changes in the test that you did with my patch?
>
> Hi Ben,
>
> I've ran some basic throughput tests with iperf and it seems that there was also slight hit in throughput, but only by 5-20 mbps or so.

If you are using smaller buffers anyway, maybe the 1/4 restart limit is too low.  I'm running closer to 2000 tx buffers,
so my restart level would be around 500 buffers.  I did not see any drop in throughput in my test case, but rather saw
improvement.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2020-05-02 15:56 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-27 14:54 [PATCH] ath10k: Restart xmit queues below low-water mark greearb
2020-04-28 14:56 ` Steve deRosier
2020-04-28 14:58   ` Ben Greear
2020-04-28 19:36     ` Toke Høiland-Jørgensen
2020-04-28 19:39       ` Dave Taht
2020-04-28 20:00         ` Ben Greear
2020-04-28 20:58           ` Toke Høiland-Jørgensen
2020-04-28 21:23             ` Steve deRosier
2020-04-28 16:27   ` Dave Taht
2020-04-28 16:35     ` Ben Greear
2020-04-28 17:10       ` Dave Taht
2020-04-28 19:43       ` Toke Høiland-Jørgensen
2020-04-28 19:37 ` Toke Høiland-Jørgensen
2020-04-28 19:51   ` Ben Greear
2020-04-28 20:39     ` Toke Høiland-Jørgensen
2020-04-28 21:18       ` Ben Greear
2020-04-29  9:28         ` Toke Høiland-Jørgensen
2020-04-29 13:54           ` Ben Greear
2020-04-29 14:56             ` Toke Høiland-Jørgensen
2020-04-29 15:26               ` Ben Greear
2020-04-29 15:42                 ` Toke Høiland-Jørgensen
2020-04-30 23:31       ` John Deere
2020-05-01  2:16         ` Ben Greear
2020-05-01 15:50           ` John Deere
2020-05-01 17:51             ` Mark Baker
2020-05-02  5:49               ` John Deere
2020-05-02 15:56                 ` Ben Greear
