linux-wireless.vger.kernel.org archive mirror
* Optimizing performance for lots of virtual stations.
@ 2013-03-14 17:22 Ben Greear
  2013-03-14 23:12 ` Felix Fietkau
  0 siblings, 1 reply; 7+ messages in thread
From: Ben Greear @ 2013-03-14 17:22 UTC (permalink / raw)
  To: linux-wireless

I've been doing some performance testing, and having lots of
stations causes quite a drag: total TCP throughput with 1 station is 250Mbps,
with 50 stations it's 225Mbps, and with 128 stations it's 20-40Mbps (it varies a lot...not sure why).

I poked around in the rx logic and it seems the rx-data path is fairly
clean for data packets.  But, from what I can tell, each beacon is going
to cause an skb_copy() call and a queued work-item for each station interface,
and there are going to be lots of beacons per second in most scenarios...

I was wondering if this could be optimized a bit to special case beacons
and not make a new copy (or possibly move some of the beacon handling
logic up to the radio object and out of the sdata).

And of course, it could be there are more important optimizations...I'm curious
if anyone is aware of any other code that should be optimized to have better
performance with lots of stations...

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Optimizing performance for lots of virtual stations.
  2013-03-14 17:22 Optimizing performance for lots of virtual stations Ben Greear
@ 2013-03-14 23:12 ` Felix Fietkau
  2013-03-14 23:18   ` Ben Greear
  0 siblings, 1 reply; 7+ messages in thread
From: Felix Fietkau @ 2013-03-14 23:12 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-wireless

On 2013-03-14 6:22 PM, Ben Greear wrote:
> I've been doing some performance testing, and having lots of
> stations causes quite a drag: total TCP throughput with 1 station is 250Mbps,
> with 50 stations it's 225Mbps, and with 128 stations it's 20-40Mbps (it varies a lot...not sure why).
> 
> I poked around in the rx logic and it seems the rx-data path is fairly
> clean for data packets.  But, from what I can tell, each beacon is going
> to cause an skb_copy() call and a queued work-item for each station interface,
> and there are going to be lots of beacons per second in most scenarios...
> 
> I was wondering if this could be optimized a bit to special case beacons
> and not make a new copy (or possibly move some of the beacon handling
> logic up to the radio object and out of the sdata).
> 
> And of course, it could be there are more important optimizations...I'm curious
> if anyone is aware of any other code that should be optimized to have better
> performance with lots of stations...
How about doing some profiling with lots of stations - that should
hopefully reveal where the real bottleneck is.
By the way, with that many stations and low throughput, is the CPU usage
on your system significantly higher, or could it just be some extra
latency introduced somewhere else in the code?

- Felix



* Re: Optimizing performance for lots of virtual stations.
  2013-03-14 23:12 ` Felix Fietkau
@ 2013-03-14 23:18   ` Ben Greear
  2013-03-15  1:44     ` Felix Fietkau
  0 siblings, 1 reply; 7+ messages in thread
From: Ben Greear @ 2013-03-14 23:18 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: linux-wireless

On 03/14/2013 04:12 PM, Felix Fietkau wrote:
> On 2013-03-14 6:22 PM, Ben Greear wrote:
>> I've been doing some performance testing, and having lots of
>> stations causes quite a drag: total TCP throughput with 1 station is 250Mbps,
>> with 50 stations it's 225Mbps, and with 128 stations it's 20-40Mbps (it varies a lot...not sure why).
>>
>> I poked around in the rx logic and it seems the rx-data path is fairly
>> clean for data packets.  But, from what I can tell, each beacon is going
>> to cause an skb_copy() call and a queued work-item for each station interface,
>> and there are going to be lots of beacons per second in most scenarios...
>>
>> I was wondering if this could be optimized a bit to special case beacons
>> and not make a new copy (or possibly move some of the beacon handling
>> logic up to the radio object and out of the sdata).
>>
>> And of course, it could be there are more important optimizations...I'm curious
>> if anyone is aware of any other code that should be optimized to have better
>> performance with lots of stations...
> How about doing some profiling with lots of stations - that should
> hopefully reveal where the real bottleneck is.
> By the way, with that many stations and low throughput, is the CPU usage
> on your system significantly higher, or could it just be some extra
> latency introduced somewhere else in the code?

CPU load is fairly high, but the system doesn't seem to be purely CPU bound.  Maybe
lots and lots of work items all piled up, or something like that...

I'll work on some profiling as soon as I get a chance.

I'm suspicious that the management frame handling will
need some optimization though...I think it basically copies
the skb and broadcasts all mgt frames to all running stations...

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Optimizing performance for lots of virtual stations.
  2013-03-14 23:18   ` Ben Greear
@ 2013-03-15  1:44     ` Felix Fietkau
  2013-03-15  3:26       ` Ben Greear
  0 siblings, 1 reply; 7+ messages in thread
From: Felix Fietkau @ 2013-03-15  1:44 UTC (permalink / raw)
  To: Ben Greear; +Cc: linux-wireless

On 2013-03-15 12:18 AM, Ben Greear wrote:
> On 03/14/2013 04:12 PM, Felix Fietkau wrote:
>> On 2013-03-14 6:22 PM, Ben Greear wrote:
>>> I've been doing some performance testing, and having lots of
>>> stations causes quite a drag: total TCP throughput with 1 station is 250Mbps,
>>> with 50 stations it's 225Mbps, and with 128 stations it's 20-40Mbps (it varies a lot...not sure why).
>>>
>>> I poked around in the rx logic and it seems the rx-data path is fairly
>>> clean for data packets.  But, from what I can tell, each beacon is going
>>> to cause an skb_copy() call and a queued work-item for each station interface,
>>> and there are going to be lots of beacons per second in most scenarios...
>>>
>>> I was wondering if this could be optimized a bit to special case beacons
>>> and not make a new copy (or possibly move some of the beacon handling
>>> logic up to the radio object and out of the sdata).
>>>
>>> And of course, it could be there are more important optimizations...I'm curious
>>> if anyone is aware of any other code that should be optimized to have better
>>> performance with lots of stations...
>> How about doing some profiling with lots of stations - that should
>> hopefully reveal where the real bottleneck is.
>> By the way, with that many stations and low throughput, is the CPU usage
>> on your system significantly higher, or could it just be some extra
>> latency introduced somewhere else in the code?
> 
> CPU load is fairly high, but the system doesn't seem to be purely CPU bound.  Maybe
> lots and lots of work items all piled up, or something like that...
> 
> I'll work on some profiling as soon as I get a chance.
> 
> I'm suspicious that the management frame handling will
> need some optimization though...I think it basically copies
> the skb and broadcasts all mgt frames to all running stations...
Here's another thing that might be negatively affecting your tests. The
driver has a 128-packet buffer limit per hardware queue for aggregation.
With too many stations, they will be competing for a very limited number
of buffers, making aggregation a lot less effective.
Increasing the number of buffers is a bad idea here, as it will harm
environments with fewer stations due to bufferbloat.

What's required to fix this properly is better queue management,
something that will require some bigger changes to the ath9k tx path and
some mac80211 changes as well. It's on my TODO list, but I don't know
when I'll get around to implementing it.

- Felix



* Re: Optimizing performance for lots of virtual stations.
  2013-03-15  1:44     ` Felix Fietkau
@ 2013-03-15  3:26       ` Ben Greear
  2013-03-15 17:14         ` Ben Greear
  0 siblings, 1 reply; 7+ messages in thread
From: Ben Greear @ 2013-03-15  3:26 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: linux-wireless

On 03/14/2013 06:44 PM, Felix Fietkau wrote:
> On 2013-03-15 12:18 AM, Ben Greear wrote:
>> On 03/14/2013 04:12 PM, Felix Fietkau wrote:
>>> On 2013-03-14 6:22 PM, Ben Greear wrote:
>>>> I've been doing some performance testing, and having lots of
>>>> stations causes quite a drag: total TCP throughput with 1 station is 250Mbps,
>>>> with 50 stations it's 225Mbps, and with 128 stations it's 20-40Mbps (it varies a lot...not sure why).
>>>>
>>>> I poked around in the rx logic and it seems the rx-data path is fairly
>>>> clean for data packets.  But, from what I can tell, each beacon is going
>>>> to cause an skb_copy() call and a queued work-item for each station interface,
>>>> and there are going to be lots of beacons per second in most scenarios...
>>>>
>>>> I was wondering if this could be optimized a bit to special case beacons
>>>> and not make a new copy (or possibly move some of the beacon handling
>>>> logic up to the radio object and out of the sdata).
>>>>
>>>> And of course, it could be there are more important optimizations...I'm curious
>>>> if anyone is aware of any other code that should be optimized to have better
>>>> performance with lots of stations...
>>> How about doing some profiling with lots of stations - that should
>>> hopefully reveal where the real bottleneck is.
>>> By the way, with that many stations and low throughput, is the CPU usage
>>> on your system significantly higher, or could it just be some extra
>>> latency introduced somewhere else in the code?
>>
>> CPU load is fairly high, but the system doesn't seem to be purely CPU bound.  Maybe
>> lots and lots of work items all piled up, or something like that...
>>
>> I'll work on some profiling as soon as I get a chance.
>>
>> I'm suspicious that the management frame handling will
>> need some optimization though...I think it basically copies
>> the skb and broadcasts all mgt frames to all running stations...
> Here's another thing that might be negatively affecting your tests. The
> driver has a 128-packet buffer limit per hardware queue for aggregation.
> With too many stations, they will be competing for a very limited number
> of buffers, making aggregation a lot less effective.
> Increasing the number of buffers is a bad idea here, as it will harm
> environments with fewer stations due to bufferbloat.
>
> What's required to fix this properly is better queue management,
> something that will require some bigger changes to the ath9k tx path and
> some mac80211 changes as well. It's on my TODO list, but I don't know
> when I'll get around to implementing it.

I thought of that too, but I saw something that made me think rx
might be a big part of it as well:

With 50 stations each trying to transmit a 5Mbps TCP stream, I get around 210-220Mbps
of total TCP throughput.  But, if I simply add another 78 associated stations and do
not run any traffic on them, throughput drops to about 80Mbps.

But, when I add traffic on those extra 78 stations, total throughput does drop
down to around 20-40Mbps, so that part could easily be tx aggregation issues...

Would the tx-bytes-all / xmit-ampdus ratio give an idea of how well aggregation
is working?  (As reported by the ath9k xmit debugfs file).

I think I'll be better at trying to optimize the rx path than the tx path,
as I get endlessly confused when trying to figure out the ath9k xmit path,
but I can almost start to understand the mac80211 rx path after a while :)

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Optimizing performance for lots of virtual stations.
  2013-03-15  3:26       ` Ben Greear
@ 2013-03-15 17:14         ` Ben Greear
  2013-03-15 17:50           ` Ben Greear
  0 siblings, 1 reply; 7+ messages in thread
From: Ben Greear @ 2013-03-15 17:14 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: linux-wireless

I probably should have done this first of course...but here is a 'perf top' on
the station machine (50 TCP streams, one transmitting on each of 50 stations,
and 78 associated-but-mostly-idle stations):

Looks like sta_info_get would be a good place to start :)


---------------------------------------------------------------------------------------------------------------------------
    PerfTop:    1890 irqs/sec  kernel:87.9%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)
---------------------------------------------------------------------------------------------------------------------------

              samples  pcnt function                        DSO
              _______ _____ _______________________________ ______________

              2261.00 20.8% sta_info_get                    [mac80211]
              1707.00 15.7% ieee80211_tx_status             [mac80211]
              1192.00 11.0% intel_idle                      [kernel]
               462.00  4.3% __ieee80211_recalc_idle         [mac80211]
               414.00  3.8% ieee80211_prepare_and_rx_handle [mac80211]
               240.00  2.2% dev_queue_xmit_nit              [kernel]
               199.00  1.8% ieee80211_rx                    [mac80211]
               154.00  1.4% ieee80211_find_sta_by_ifaddr    [mac80211]
               124.00  1.1% read_hpet                       [kernel]
               101.00  0.9% _raw_spin_lock_irqsave          [kernel]
                92.00  0.8% __netif_receive_skb             [kernel]
                80.00  0.7% tg_load_down                    [kernel]
                76.00  0.7% fget_light                      [kernel]
                75.00  0.7% ieee80211_propagate_queue_wake  [mac80211]
                75.00  0.7% memcpy                          [kernel]
                71.00  0.7% __ieee80211_stop_queue          [mac80211]
                70.00  0.6% ipt_do_table                    [kernel]
                66.00  0.6% csum_partial_copy_generic       [kernel]
                65.00  0.6% datagram_poll                   [kernel]
                63.00  0.6% _raw_spin_lock_bh               [kernel]
                61.00  0.6% ath_get_rate                    [ath9k]
                49.00  0.5% ieee80211_subif_start_xmit      [mac80211]
                48.00  0.4% tcp_poll                        [kernel]

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: Optimizing performance for lots of virtual stations.
  2013-03-15 17:14         ` Ben Greear
@ 2013-03-15 17:50           ` Ben Greear
  0 siblings, 0 replies; 7+ messages in thread
From: Ben Greear @ 2013-03-15 17:50 UTC (permalink / raw)
  To: Felix Fietkau; +Cc: linux-wireless

On 03/15/2013 10:14 AM, Ben Greear wrote:
> I probably should have done this first of course...but here is a 'perf top' on
> the station machine (50 TCP streams, one transmitting on each of 50 stations,
> and 78 associated-but-mostly-idle stations):
>
> Looks like sta_info_get would be a good place to start :)
>
>
> ---------------------------------------------------------------------------------------------------------------------------
>     PerfTop:    1890 irqs/sec  kernel:87.9%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------
>
>               samples  pcnt function                        DSO
>               _______ _____ _______________________________ ______________
>
>               2261.00 20.8% sta_info_get                    [mac80211]

Ahh, crap...I see the problem.  The 'sta->addr' is the MAC of the VAP,
so if I have 100 stations all connected to the same AP, then the hashing
is worthless and just ends up being a linear search.

Probably not going to be fun to fix that!

Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



