All of lore.kernel.org
 help / color / mirror / Atom feed
* Throughput regression with `tcp: refine TSO autosizing`
@ 2015-01-29 11:48 Michal Kazior
  2015-01-29 13:14 ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Michal Kazior @ 2015-01-29 11:48 UTC (permalink / raw)
  To: eric.dumazet; +Cc: linux-wireless, Network Development, eyalpe

Hi,

I'm not subscribed to netdev list and I can't find the message-id so I
can't reply directly to the original thread `BW regression after "tcp:
refine TSO autosizing"`.

I've noticed a big TCP performance drop with ath10k
(drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
get 250mbps in my testbed.

After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
`tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
to non-TSO packets` (for conflict free reverts) fixes the problem.

My testing setup is as follows:

 a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
 b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
3.19-rc5, w/ reverts
 c) (b) w/o reverts

Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
2x2 has 866mbps modulation rate. In practice this should deliver
~700mbps of real UDP traffic.

Here are some numbers:

UDP: (b) -> (a): 672mbps
UDP: (a) -> (b): 687mbps
TCP: (b) -> (a): 526mbps
TCP: (a) -> (b): 500mbps

UDP: (c) -> (a): 669mbps*
UDP: (a) -> (c): 689mbps*
TCP: (c) -> (a): 240mbps**
TCP: (a) -> (c): 490mbps*

* no changes/within error margin
** the performance drop

I'm using iperf:
  UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
  TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20

Result values were obtained at the receiver side.

Iperf reports a few frames lost and out-of-order at each UDP test
start (during first second) but later has no packet loss and no
out-of-order. This shouldn't have any effect on a TCP session, right?

The device delivers batched up tx/rx completions (no way to change
that). I suppose this could be an issue for timing sensitive
algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
aggregation windows so there's an inherent extra (and non-uniform)
latency when compared to, e.g. ethernet devices.

The driver doesn't have GRO. I have an internal patch which implements
it. It improves overall TCP traffic (more stable, up to 600mbps TCP
which is ~100mbps more than without GRO) but the TCP: (c) -> (a)
performance drop remains unaffected regardless.

I've tried applying stretch ACK patchset (v2) on both machines and
re-run the above tests. I got no measurable difference in performance.

I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
(c). It didn't seem to be affected by the TSO patch at all (it runs at
~360mbps of TCP regardless of the TSO patch).

Any hints/ideas?


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-01-29 11:48 Throughput regression with `tcp: refine TSO autosizing` Michal Kazior
@ 2015-01-29 13:14 ` Eric Dumazet
  2015-01-30 10:29   ` Arend van Spriel
  2015-01-30 13:39     ` Michal Kazior
  0 siblings, 2 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-01-29 13:14 UTC (permalink / raw)
  To: Michal Kazior; +Cc: linux-wireless, Network Development, eyalpe

On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
> Hi,
> 
> I'm not subscribed to netdev list and I can't find the message-id so I
> can't reply directly to the original thread `BW regression after "tcp:
> refine TSO autosizing"`.
> 
> I've noticed a big TCP performance drop with ath10k
> (drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
> get 250mbps in my testbed.
> 
> After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
> `tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
> to non-TSO packets` (for conflict free reverts) fixes the problem.
> 
> My testing setup is as follows:
> 
>  a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
>  b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
> 3.19-rc5, w/ reverts
>  c) (b) w/o reverts
> 
> Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
> 2x2 has 866mbps modulation rate. In practice this should deliver
> ~700mbps of real UDP traffic.
> 
> Here are some numbers:
> 
> UDP: (b) -> (a): 672mbps
> UDP: (a) -> (b): 687mbps
> TCP: (b) -> (a): 526mbps
> TCP: (a) -> (b): 500mbps
> 
> UDP: (c) -> (a): 669mbps*
> UDP: (a) -> (c): 689mbps*
> TCP: (c) -> (a): 240mbps**
> TCP: (a) -> (c): 490mbps*
> 
> * no changes/within error margin
> ** the performance drop
> 
> I'm using iperf:
>   UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
>   TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20
> 
> Result values were obtained at the receiver side.
> 
> Iperf reports a few frames lost and out-of-order at each UDP test
> start (during first second) but later has no packet loss and no
> out-of-order. This shouldn't have any effect on a TCP session, right?
> 
> The device delivers batched up tx/rx completions (no way to change
> that). I suppose this could be an issue for timing sensitive
> algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
> aggregation windows so there's an inherent extra (and non-uniform)
> latency when compared to, e.g. ethernet devices.
> 
> The driver doesn't have GRO. I have an internal patch which implements
> it. It improves overall TCP traffic (more stable, up to 600mbps TCP
> which is ~100mbps more than without GRO) but the TCP: (c) -> (a)
> performance drop remains unaffected regardless.
> 
> I've tried applying stretch ACK patchset (v2) on both machines and
> re-run the above tests. I got no measurable difference in performance.
> 
> I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
> (c). It didn't seem to be affected by the TSO patch at all (it runs at
> ~360mbps of TCP regardless of the TSO patch).
> 
> Any hints/ideas?
> 

Hi Michal

This patch restored original TSQ behavior, because the 1ms worth of data
per flow had totally destroyed TSQ intent.

vi +630 Documentation/networking/ip-sysctl.txt

tcp_limit_output_bytes - INTEGER
        Controls TCP Small Queue limit per tcp socket.
        TCP bulk sender tends to increase packets in flight until it
        gets losses notifications. With SNDBUF autotuning, this can
        result in a large amount of packets queued in qdisc/device
        on the local machine, hurting latency of other flows, for
        typical pfifo_fast qdiscs.
        tcp_limit_output_bytes limits the number of bytes on qdisc
        or device to reduce artificial RTT/cwnd and reduce bufferbloat.
        Default: 131072

This is why I suggested to Eyal Perry to change the TX interrupt
mitigation parameters as in :

ethtool -C eth0 tx-frames 4 rx-frames 4

With this change and the stretch ack fixes, I got 37Gbps of throughput
on a single flow, on a 40Gbit NIC (mlx4)

If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
get line rate, I suggest that you either :

1) tweak tcp_limit_output_bytes, but its not practical from a driver.

2) change the driver, knowing what are its exact requirements, by
removing a fraction of skb->truesize at ndo_start_xmit() time as in :

if ((skb->destructor == sock_wfree ||
     skb->restuctor == tcp_wfree) &&
    skb->sk) {
    u32 fraction = skb->truesize / 2;

    skb->truesize -= fraction;
    atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
}

Thanks.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-01-29 13:14 ` Eric Dumazet
@ 2015-01-30 10:29   ` Arend van Spriel
  2015-01-30 13:19       ` Eric Dumazet
  2015-01-30 13:39     ` Michal Kazior
  1 sibling, 1 reply; 107+ messages in thread
From: Arend van Spriel @ 2015-01-30 10:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Michal Kazior, linux-wireless, Network Development, eyalpe

On 01/29/15 14:14, Eric Dumazet wrote:
> On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
>> Hi,
>>
>> I'm not subscribed to netdev list and I can't find the message-id so I
>> can't reply directly to the original thread `BW regression after "tcp:
>> refine TSO autosizing"`.
>>
>> I've noticed a big TCP performance drop with ath10k
>> (drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
>> get 250mbps in my testbed.
>>
>> After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
>> `tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
>> to non-TSO packets` (for conflict free reverts) fixes the problem.
>>
>> My testing setup is as follows:
>>
>>   a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
>>   b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
>> 3.19-rc5, w/ reverts
>>   c) (b) w/o reverts
>>
>> Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
>> 2x2 has 866mbps modulation rate. In practice this should deliver
>> ~700mbps of real UDP traffic.
>>
>> Here are some numbers:
>>
>> UDP: (b) ->  (a): 672mbps
>> UDP: (a) ->  (b): 687mbps
>> TCP: (b) ->  (a): 526mbps
>> TCP: (a) ->  (b): 500mbps
>>
>> UDP: (c) ->  (a): 669mbps*
>> UDP: (a) ->  (c): 689mbps*
>> TCP: (c) ->  (a): 240mbps**
>> TCP: (a) ->  (c): 490mbps*
>>
>> * no changes/within error margin
>> ** the performance drop
>>
>> I'm using iperf:
>>    UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
>>    TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20
>>
>> Result values were obtained at the receiver side.
>>
>> Iperf reports a few frames lost and out-of-order at each UDP test
>> start (during first second) but later has no packet loss and no
>> out-of-order. This shouldn't have any effect on a TCP session, right?
>>
>> The device delivers batched up tx/rx completions (no way to change
>> that). I suppose this could be an issue for timing sensitive
>> algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
>> aggregation windows so there's an inherent extra (and non-uniform)
>> latency when compared to, e.g. ethernet devices.
>>
>> The driver doesn't have GRO. I have an internal patch which implements
>> it. It improves overall TCP traffic (more stable, up to 600mbps TCP
>> which is ~100mbps more than without GRO) but the TCP: (c) ->  (a)
>> performance drop remains unaffected regardless.
>>
>> I've tried applying stretch ACK patchset (v2) on both machines and
>> re-run the above tests. I got no measurable difference in performance.
>>
>> I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
>> (c). It didn't seem to be affected by the TSO patch at all (it runs at
>> ~360mbps of TCP regardless of the TSO patch).
>>
>> Any hints/ideas?
>>
>
> Hi Michal
>
> This patch restored original TSQ behavior, because the 1ms worth of data
> per flow had totally destroyed TSQ intent.
>
> vi +630 Documentation/networking/ip-sysctl.txt
>
> tcp_limit_output_bytes - INTEGER
>          Controls TCP Small Queue limit per tcp socket.
>          TCP bulk sender tends to increase packets in flight until it
>          gets losses notifications. With SNDBUF autotuning, this can
>          result in a large amount of packets queued in qdisc/device
>          on the local machine, hurting latency of other flows, for
>          typical pfifo_fast qdiscs.
>          tcp_limit_output_bytes limits the number of bytes on qdisc
>          or device to reduce artificial RTT/cwnd and reduce bufferbloat.
>          Default: 131072
>
> This is why I suggested to Eyal Perry to change the TX interrupt
> mitigation parameters as in :
>
> ethtool -C eth0 tx-frames 4 rx-frames 4
>
> With this change and the stretch ack fixes, I got 37Gbps of throughput
> on a single flow, on a 40Gbit NIC (mlx4)
>
> If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
> get line rate, I suggest that you either :
>
> 1) tweak tcp_limit_output_bytes, but its not practical from a driver.
>
> 2) change the driver, knowing what are its exact requirements, by
> removing a fraction of skb->truesize at ndo_start_xmit() time as in :
>
> if ((skb->destructor == sock_wfree ||
>       skb->restuctor == tcp_wfree)&&
>      skb->sk) {
>      u32 fraction = skb->truesize / 2;
>
>      skb->truesize -= fraction;
>      atomic_sub(fraction,&skb->sk->sk_wmem_alloc);
> }

Hi Eric,

Your suggestions are still based on the fact that you consider wireless 
networking to be similar to ethernet, but as Michal indicated there are 
some fundamental differences starting with CSMA/CD versus CSMA/CA. Also 
the medium conditions are far from comparable. There is no shielding so 
it needs to deal with interference and dynamically drops the link rate 
so transmission of packets can take several milliseconds. Then with 11n 
they came up with aggregation with sends up to 64 packets in a single 
transmit over the air at worst case 6.5 Mbps (if I am not mistaken). The 
parameter value for tcp_limit_output_bytes of 131072 means that it 
allows queuing for about 1ms on a 1Gbps link, but I hope you can see 
this is not realistic for dealing with all variances of the wireless 
medium/standard. I suggested this as topic for the wireless workshop in 
Otawa [1], but I can not attend there. Still hope that there will be 
some discussions to get more awareness.

Regards,
Arend

[1] http://mid.gmane.org/54BE9791.1070706@broadcom.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-01-30 13:19       ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-01-30 13:19 UTC (permalink / raw)
  To: Arend van Spriel
  Cc: Michal Kazior, linux-wireless, Network Development, eyalpe

On Fri, 2015-01-30 at 11:29 +0100, Arend van Spriel wrote:

> Hi Eric,
> 
> Your suggestions are still based on the fact that you consider wireless 
> networking to be similar to ethernet, but as Michal indicated there are 
> some fundamental differences starting with CSMA/CD versus CSMA/CA. Also 
> the medium conditions are far from comparable. There is no shielding so 
> it needs to deal with interference and dynamically drops the link rate 
> so transmission of packets can take several milliseconds. Then with 11n 
> they came up with aggregation with sends up to 64 packets in a single 
> transmit over the air at worst case 6.5 Mbps (if I am not mistaken). The 
> parameter value for tcp_limit_output_bytes of 131072 means that it 
> allows queuing for about 1ms on a 1Gbps link, but I hope you can see 
> this is not realistic for dealing with all variances of the wireless 
> medium/standard. I suggested this as topic for the wireless workshop in 
> Otawa [1], but I can not attend there. Still hope that there will be 
> some discussions to get more awareness.

Ever heard about bufferbloat ?

Have you read my suggestions and tried them ?

You can adjust the limit per flow to pretty much you want. If you need
64 packets, just do the math. If in 2018 you need 128 packets, do the
math again.

I am very well aware that wireless wants aggregation, thank you.

131072 bytes of queue on 40Gbit is not 1ms, but 26 usec of queueing, and
we get line rate nevertheless.

We need this level of shallow queues (BQL, TSQ), to get very precise rtt
estimations so that TCP has good entropy for its pacing, even in the 50
usec rtt ranges.

If we allowed 1ms of queueing, then a 40Gbit flow would queue 5 MBytes.

This was terrible, because it increased cwnd and all sender queues to
insane levels.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-01-30 13:19       ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-01-30 13:19 UTC (permalink / raw)
  To: Arend van Spriel
  Cc: Michal Kazior, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Fri, 2015-01-30 at 11:29 +0100, Arend van Spriel wrote:

> Hi Eric,
> 
> Your suggestions are still based on the fact that you consider wireless 
> networking to be similar to ethernet, but as Michal indicated there are 
> some fundamental differences starting with CSMA/CD versus CSMA/CA. Also 
> the medium conditions are far from comparable. There is no shielding so 
> it needs to deal with interference and dynamically drops the link rate 
> so transmission of packets can take several milliseconds. Then with 11n 
> they came up with aggregation with sends up to 64 packets in a single 
> transmit over the air at worst case 6.5 Mbps (if I am not mistaken). The 
> parameter value for tcp_limit_output_bytes of 131072 means that it 
> allows queuing for about 1ms on a 1Gbps link, but I hope you can see 
> this is not realistic for dealing with all variances of the wireless 
> medium/standard. I suggested this as topic for the wireless workshop in 
> Otawa [1], but I can not attend there. Still hope that there will be 
> some discussions to get more awareness.

Ever heard about bufferbloat ?

Have you read my suggestions and tried them ?

You can adjust the limit per flow to pretty much you want. If you need
64 packets, just do the math. If in 2018 you need 128 packets, do the
math again.

I am very well aware that wireless wants aggregation, thank you.

131072 bytes of queue on 40Gbit is not 1ms, but 26 usec of queueing, and
we get line rate nevertheless.

We need this level of shallow queues (BQL, TSQ), to get very precise rtt
estimations so that TCP has good entropy for its pacing, even in the 50
usec rtt ranges.

If we allowed 1ms of queueing, then a 40Gbit flow would queue 5 MBytes.

This was terrible, because it increased cwnd and all sender queues to
insane levels.


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-01-30 13:39     ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-01-30 13:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-wireless, Network Development, eyalpe

On 29 January 2015 at 14:14, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
>> Hi,
>>
>> I'm not subscribed to netdev list and I can't find the message-id so I
>> can't reply directly to the original thread `BW regression after "tcp:
>> refine TSO autosizing"`.
>>
>> I've noticed a big TCP performance drop with ath10k
>> (drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
>> get 250mbps in my testbed.
>>
>> After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
>> `tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
>> to non-TSO packets` (for conflict free reverts) fixes the problem.
[...]
>> Any hints/ideas?
>>
>
> Hi Michal
>
> This patch restored original TSQ behavior, because the 1ms worth of data
> per flow had totally destroyed TSQ intent.
>
> vi +630 Documentation/networking/ip-sysctl.txt
[...]
> With this change and the stretch ack fixes, I got 37Gbps of throughput
> on a single flow, on a 40Gbit NIC (mlx4)
>
> If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
> get line rate, I suggest that you either :
>
> 1) tweak tcp_limit_output_bytes, but its not practical from a driver.

I've briefly tried playing with this knob to no avail unfortunately. I
tried 256K, 1M - it didn't improve TCP performance. When I tried to
make it smaller (e.g. 16K) the traffic dropped even more so it does
have an effect. It seems there's some other limiting factor in this
case.


> 2) change the driver, knowing what are its exact requirements, by
> removing a fraction of skb->truesize at ndo_start_xmit() time as in :
>
> if ((skb->destructor == sock_wfree ||
>      skb->restuctor == tcp_wfree) &&
>     skb->sk) {
>     u32 fraction = skb->truesize / 2;
>
>     skb->truesize -= fraction;
>     atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
> }

tcp_wfree isn't exported so I went and just did a quick hack and used
`if (skb->sk)` as the condition. Didn't help with TCP.

I wonder if this TSO patch is the single thing to blame. Maybe it's
just the trigger and something else is broken? I've been seeing some
UDP performance inconsistencies in 3.19 (I don't recall seeing these
in 3.18) - but maybe my perception is failing on me.

Anyway, thanks for the tips so far.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-01-30 13:39     ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-01-30 13:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On 29 January 2015 at 14:14, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
>> Hi,
>>
>> I'm not subscribed to netdev list and I can't find the message-id so I
>> can't reply directly to the original thread `BW regression after "tcp:
>> refine TSO autosizing"`.
>>
>> I've noticed a big TCP performance drop with ath10k
>> (drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
>> get 250mbps in my testbed.
>>
>> After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
>> `tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
>> to non-TSO packets` (for conflict free reverts) fixes the problem.
[...]
>> Any hints/ideas?
>>
>
> Hi Michal
>
> This patch restored original TSQ behavior, because the 1ms worth of data
> per flow had totally destroyed TSQ intent.
>
> vi +630 Documentation/networking/ip-sysctl.txt
[...]
> With this change and the stretch ack fixes, I got 37Gbps of throughput
> on a single flow, on a 40Gbit NIC (mlx4)
>
> If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
> get line rate, I suggest that you either :
>
> 1) tweak tcp_limit_output_bytes, but its not practical from a driver.

I've briefly tried playing with this knob to no avail unfortunately. I
tried 256K, 1M - it didn't improve TCP performance. When I tried to
make it smaller (e.g. 16K) the traffic dropped even more so it does
have an effect. It seems there's some other limiting factor in this
case.


> 2) change the driver, knowing what are its exact requirements, by
> removing a fraction of skb->truesize at ndo_start_xmit() time as in :
>
> if ((skb->destructor == sock_wfree ||
>      skb->restuctor == tcp_wfree) &&
>     skb->sk) {
>     u32 fraction = skb->truesize / 2;
>
>     skb->truesize -= fraction;
>     atomic_sub(fraction, &skb->sk->sk_wmem_alloc);
> }

tcp_wfree isn't exported so I went and just did a quick hack and used
`if (skb->sk)` as the condition. Didn't help with TCP.

I wonder if this TSO patch is the single thing to blame. Maybe it's
just the trigger and something else is broken? I've been seeing some
UDP performance inconsistencies in 3.19 (I don't recall seeing these
in 3.18) - but maybe my perception is failing on me.

Anyway, thanks for the tips so far.


Michał
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-01-30 13:19       ` Eric Dumazet
  (?)
@ 2015-01-30 13:47       ` Arend van Spriel
  2015-01-30 14:37         ` Eric Dumazet
       [not found]         ` <CAA93jw5fqhz0Hiw74L2GXgtZ9JsMg+NtYydKxKzGDrvQcZn4hA@mail.gmail.com>
  -1 siblings, 2 replies; 107+ messages in thread
From: Arend van Spriel @ 2015-01-30 13:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Michal Kazior, linux-wireless, Network Development, eyalpe

On 01/30/15 14:19, Eric Dumazet wrote:
> On Fri, 2015-01-30 at 11:29 +0100, Arend van Spriel wrote:
>
>> Hi Eric,
>>
>> Your suggestions are still based on the fact that you consider wireless
>> networking to be similar to ethernet, but as Michal indicated there are
>> some fundamental differences starting with CSMA/CD versus CSMA/CA. Also
>> the medium conditions are far from comparable. There is no shielding so
>> it needs to deal with interference and dynamically drops the link rate
>> so transmission of packets can take several milliseconds. Then with 11n
>> they came up with aggregation with sends up to 64 packets in a single
>> transmit over the air at worst case 6.5 Mbps (if I am not mistaken). The
>> parameter value for tcp_limit_output_bytes of 131072 means that it
>> allows queuing for about 1ms on a 1Gbps link, but I hope you can see
>> this is not realistic for dealing with all variances of the wireless
>> medium/standard. I suggested this as topic for the wireless workshop in
>> Otawa [1], but I can not attend there. Still hope that there will be
>> some discussions to get more awareness.
>
> Ever heard about bufferbloat ?

Sure. I am trying to get awareness about that in our wireless 
driver/firmware development teams. So bear with me.

> Have you read my suggestions and tried them ?
>
> You can adjust the limit per flow to pretty much you want. If you need
> 64 packets, just do the math. If in 2018 you need 128 packets, do the
> math again.
>
> I am very well aware that wireless wants aggregation, thank you.

Sorry if I offended you. I was just giving these as example combined 
with effective rate usable on the medium to say that the bandwidth is 
more dynamic in wireless and as such need dynamic change of queue depth. 
Now this can be done by making the fraction size as used in your 
suggestion adaptive to these conditions.

> 131072 bytes of queue on 40Gbit is not 1ms, but 26 usec of queueing, and
> we get line rate nevertheless.

I was saying it was about 1ms on *1Gbit* as the wireless TCP rates are 
moving into that direction in 11ac.

> We need this level of shallow queues (BQL, TSQ), to get very precise rtt
> estimations so that TCP has good entropy for its pacing, even in the 50
> usec rtt ranges.
>
> If we allowed 1ms of queueing, then a 40Gbit flow would queue 5 MBytes.
>
> This was terrible, because it increased cwnd and all sender queues to
> insane levels.

Indeed and that is what we would like to address in our wireless 
drivers. I will setup some experiments using the fraction sizing and 
post my findings. Again sorry if I offended you.

Regards,
Arend

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-01-30 13:47       ` Arend van Spriel
@ 2015-01-30 14:37         ` Eric Dumazet
       [not found]         ` <CAA93jw5fqhz0Hiw74L2GXgtZ9JsMg+NtYydKxKzGDrvQcZn4hA@mail.gmail.com>
  1 sibling, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-01-30 14:37 UTC (permalink / raw)
  To: Arend van Spriel
  Cc: Michal Kazior, linux-wireless, Network Development, eyalpe

On Fri, 2015-01-30 at 14:47 +0100, Arend van Spriel wrote:

> Indeed and that is what we would like to address in our wireless 
> drivers. I will setup some experiments using the fraction sizing and 
> post my findings. Again sorry if I offended you.

You did not, but I had no feedback about my suggestions.

Michal sent it now.

Thanks



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-01-30 14:40       ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-01-30 14:40 UTC (permalink / raw)
  To: Michal Kazior; +Cc: linux-wireless, Network Development, eyalpe

On Fri, 2015-01-30 at 14:39 +0100, Michal Kazior wrote:

> I've briefly tried playing with this knob to no avail unfortunately. I
> tried 256K, 1M - it didn't improve TCP performance. When I tried to
> make it smaller (e.g. 16K) the traffic dropped even more so it does
> have an effect. It seems there's some other limiting factor in this
> case.

Interesting.

Could you take some tcpdump/pcap with various tcp_limit_output_bytes
values ?

echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 20000 -w 128k.pcap

echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 20000 -w 256k.pcap

...

Thanks !



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-01-30 14:40       ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-01-30 14:40 UTC (permalink / raw)
  To: Michal Kazior
  Cc: linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Fri, 2015-01-30 at 14:39 +0100, Michal Kazior wrote:

> I've briefly tried playing with this knob to no avail unfortunately. I
> tried 256K, 1M - it didn't improve TCP performance. When I tried to
> make it smaller (e.g. 16K) the traffic dropped even more so it does
> have an effect. It seems there's some other limiting factor in this
> case.

Interesting.

Could you take some tcpdump/pcap with various tcp_limit_output_bytes
values ?

echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 20000 -w 128k.pcap

echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
tcpdump -p -i wlanX -s 128 -c 20000 -w 256k.pcap

...

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Fwd: Throughput regression with `tcp: refine TSO autosizing`
       [not found]                 ` <CAA_e5Z46Bu+zZZFzf_ejzA35Gw3g1_OG85yv6yd7MpbwZcE-nw@mail.gmail.com>
@ 2015-02-01  8:45                   ` Dave Taht
  2015-02-01 10:47                     ` Jonathan Morton
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Taht @ 2015-02-01  8:45 UTC (permalink / raw)
  To: Andrew McGregor
  Cc: dstanley, Stig Thormodsrud, netdev, linux-wireless,
	Jesper Dangaard Brouer, Derrick Pallas, Matt Mathis,
	cerowrt-devel, Kathy Giori, Mahesh Paolini-Subramanya,
	Jonathan Morton, Tim Shepard, Avery Pennarun


[-- Attachment #1.1: Type: text/plain, Size: 20483 bytes --]

This convo ended up being cut from netdev due to containing HTML.

I'd really like to get started on finally fixing WiFi starting in April.

---------- Forwarded message ----------
From: "Andrew McGregor" <andrewmcgr@gmail.com>
Date: Feb 1, 2015 12:07 AM
Subject: Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine
TSO autosizing`
To: "Avery Pennarun" <apenwarr@google.com>
Cc: "David Reed" <dpreed@reed.com>, "Dave Taht" <dave.taht@gmail.com>, "Jim
Gettys" <jg@freedesktop.org>, "Tim Shepard" <shep@alum.mit.edu>, "Matt
Mathis" <mattmathis@google.com>, "Jesper Dangaard Brouer" <
jbrouer@redhat.com>, "Jonathan Morton" <chromatix99@gmail.com>, "
cerowrt-devel@lists.bufferbloat.net" <cerowrt-devel@lists.bufferbloat.net>

> I think I have a good idea how to do it, actually.  Or at least, I had a
design sketched out that was kind of plausible.
>
> In essence, we have a good way of controlling queues in the IP stack,
that being fq_codel.  But aggregating wifi drivers have to pull all the
traffic straight through that and queue it themselves, with a different
rotation policy and (at present) no AQM.
>
> Well, that can be fixed.  We need a transmission and aggregation
scheduler at the wifi layer that can do a per-station round-robin with a
time-based fairness policy, based on airtime.  This has to be in the wifi
layer because that's where rate selection policy is, and you can't
calculate airtime until you know the selected rate, the current success
probabilities, and how much aggregation you're going to do within your time
budget.  We need to be able to plug in per-flow and per-packet scheduling
and drop policies on top of that, to control scheduling of individual
packets into aggregates.
>
> We could also do the trick of assigning a target drop probability and
telling a Minstrel-like rate controller what drop probability we're after;
then you get the nice property that the transmission scheduler and rate
control are work-conserving and actually get faster when they get congested.
>
> There is also now the mathematics available to improve Minstrel, which is
about 80% of a truly optimal rate controller (interestingly, that math was
published in 2010, years after we did Minstrel; I had the intuition that it
was close to optimal but no way to do a proof).
>
> None of this is rocket science, but it is a pretty deep and complex
refactor of something that isn't the simplest code to start with.  That
said, I'd not expect it to be more than around 1k LoC, it's just going to
be a real bear to test.

I think the best path forward is to do proof of concept on the two open
enough drivers we have - the mt76 and ath9k.

A lot of what is needed to make rational decisions is simply not available
from closed firmwares.

> I do have the feeling that the people receiving this email ought to be
sufficient to fix the problem, if we can direct enough time to it; for
myself, I have to think quite hard as to whether this should be my 20%
project.

There needs to be focus and direct funding for quite a few people. Felix
for one. It would be nice to acquire half an engineer from each of multiple
companies making wireless products and chips in a linaro like fashion, also.

> On Sun, Feb 1, 2015 at 2:06 PM, Avery Pennarun <apenwarr@google.com>
wrote:
>>
>> I would argue that insofar as the bufferbloat project has made a
>> difference, it's because there was a very clear message and product:
>> - here's what sucks when you have bufferbloat
>> - here's how you can detect it
>> - here's how you can get rid of it
>> - by the way, here's which of your competitors are already beating you
at it.
>>
>> It turns out you don't need a standards org in order to push any of
>> the above things.  The IEEE exists to make sure things interop at the
>> MAC/PHY layer.  The IETF exists to make sure things interop at the
>> higher layers.  But bufferbloat isn't about interop, it's just a thing
>> that happens inside gateways, so it's not something you can really
>> write standards about.  It is something you can turn into a
>> competitive advantage (or disadvantage, if you're a straggler).
>>
>> ...but meanwhile, if we want to fix bufferbloat in wifi, nobody
>> actually knows how to do it, so we are still at steps 1 and 2.
>>
>> This is why we're funding Dave to continue work on netperf-wrappers.

I still don't have POs for the other 75% of me this year. It is comforting
to have got such a public  confirmation that I have at least this much
floor under me this year, tho.

>> Those diagrams of latency under load are pretty convincing.  The
>> diagrams of page load times under different levels of latency are even
>> more convincing.  First, we prove there's a problem and a way to
>> measure the problem.  Then hopefully more people will be interested in
>> solving it.

Well, a wall of shame might help, I guess. I have a ton of data on a dozen
products that is miserable to see and should be quite embarrassing for the
vendors.

I'd prefer to spend time trying to fix the darn problem more directly, but
I would expect this tool will be of use in tracking improvements.

>>
>> On Sat, Jan 31, 2015 at 4:51 PM,  <dpreed@reed.com> wrote:
>> > I think we need to create an Internet focused 802.11 working group that
>> > would be to the "OS wireless designers and IEEE 802.11 standards
groups" as
>> > the WHATML group was to W3C.
>> >
>> >
>> >
>> > W3C was clueless about the real world at the point WHATML was
created.  And
>> > WHATML was a "revenge of the real" against W3C - advancing a wide
variety of
>> > important practical innovations rather than attending endless standards
>> > meetings with people who were not focused on solving actually important
>> > problems.
>> >
>> >
>> >
>> > It took a bunch of work to get WHATML going, and it offended W3C, who
became
>> > unhelpful.  But the approach actually worked - we now have a Web that
really
>> > uses browser-side expressivity and that would never have happened if
W3C
>> > were left to its own devices.
>> >
>> >
>> >
>> > The WiFi consortium was an attempt to wrest control of pragmatic
direction
>> > from 802.11 and the proprietary-divergence folks at Qualcomm, Broadcom,
>> > Cisco, etc.  But it failed, because it became thieves on a raft, more
>> > focused on picking each others' pockets than on actually addressing
the big
>> > issues.
>> >
>> >
>> >
>> > Jim has seen this play out in the Linux community around X.  Though
there
>> > are lots of interests who would benefit by moving the engineering ball
>> > forward, everyone resists action because it means giving up the chance
at
>> > dominance, and the central group is far too weak to do anything beyond
>> > adjudicating the worst battles.
>> >
>> >
>> >
>> > When I say "we" I definitely include myself (though my time is limited
due
>> > to other commitments and the need to support my family), but I would
only
>> > play with people who actually are committed to making stuff happen -
which
>> > includes raising hell with the vendors if need be, but also effective
>> > engineering steps that can achieve quick adoption.
>> >
>> >
>> >
>> > Sadly, and I think it is manageable at the moment, there are moves out
there
>> > being made to get the FCC to "protect" WiFi from "interference".  The
>> > current one was Marriott, who requested the FCC for a rule to make it
legal
>> > to disrupt and block use of WiFi in people's rooms in their hotels,
except
>> > with their access points.  This also needs some technical defense.  I
>> > believe any issues with WiFi performance in actual Marriott hotels are
due
>> > to bufferbloat in their hotel-wide systems, just as the issues with
GoGo are
>> > the same.  But it's possible that queueing problems in their own WiFi
gear
>> > are bad as well.
>> >
>> >
>> >
>> > I mention this because it is related, and to the layperson, or
>> > non-radio-knowledgeable executive, indistinguishable.  It will take
away the
>> > incentive to actually fix the 802.11 implementations to be better
>> > performing, making the problem seem to be a "management" issue that
can be
>> > solved by making WiFi less interoperable and less flexible by rules,
rather
>> > than by engineering.
>> >
>> >
>> >
>> > However, solving the problems of hotspot networks and hotel networks
are
>> > definitely "real world" issues, and quite along the same lines you
mention,
>> > Dave.  FQ is almost certainly a big deal both in WiFi and in the
>> > distribution networks behind WiFi. Co-existence is also a big deal
>> > (RTS/CTS-like mechanisms can go a long way to remediate hidden-terminal
>> > disruption of the basic protocols). Roaming and scaling need work as
well.
>> >
>> >
>> >
>> > It would even be a good thing to invent pragmatic ways to provide "low
rate"
>> > subnets and "high rate" subnets that can coexist, so that
compatibility with
>> > ancient "b" networks need not be maintained on all nets, at great cost
-
>> > just send beacons at a high rate, so that the "b" NICs can't see
them....
>> > but you need pragmatic stack implementations.
>> >
>> >
>> >
>> > But the engineering is not the only challenge. The other challenge is
to
>> > take the initiative and get stuff deployed.  In the case of
bufferbloat, the
>> > grade currently is a "D" for deployments, maybe a "D-".  Beautiful
technical
>> > work, but the economic/business/political side of things has been poor.
>> > Look at how slow IETF has been to achieve anything (the perfect is
truly the
>> > enemy of the good, and Dave Clark's "rough consensus and working code"
has
>> > been replaced by technocratic malaise, and what appears to me to be a
class
>> > of people who love traveling the world to a floating cocktail party
without
>> > getting anything important done).
>> >
>> >
>> >
>> > The problem with communications is that you can't just ship a product
with a
>> > new "feature", because the innovation only works if widely adopted.
Since
>> > there is no "Linux Desktop" (and Linus hates the idea, to a large
extent)
>> > Linux can't be the sole carrier of the idea.  You pretty much need iOS
and
>> > Android both to buy in or to provide a path for easy third-party
upgrades.
>> > How do you do that?  Well, that's where the WHATML-type approach is
>> > necessary.
>> >
>> >
>> >
>> > I don't know if this can be achieved, and there are lots of details to
be
>> > worked out.  But I'll play.

I'll play.

>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Saturday, January 31, 2015 4:05pm, "Dave Taht" <dave.taht@gmail.com>
>> > said:
>> >
>> > I would like to have somehow assembled all the focused resources to
make a
>> > go at fixing wifi, or at least having a f2f with a bunch of people in
the
>> > late march timeframe. This message of mine to linux-wireless bounced
for
>> > some reason and I am off to log out for 10 days, so...
>> > see relevant netdev thread also for ore details.
>> >
>> > ---------- Forwarded message ----------
>> > From: Dave Taht <dave.taht@gmail.com>
>> > Date: Sat, Jan 31, 2015 at 12:29 PM
>> > Subject: Re: Throughput regression with `tcp: refine TSO autosizing`
>> > To: Arend van Spriel <arend@broadcom.com>
>> > Cc: linux-wireless <linux-wireless@vger.kernel.org>, Michal Kazior
>> > <michal.kazior@tieto.com>, Eyal Perry <eyalpe@dev.mellanox.co.il>,
Network
>> > Development <netdev@vger.kernel.org>, Eric Dumazet <
eric.dumazet@gmail.com>
>> >
>> >
>> > The wifi industry as a whole has vastly bigger problems than achieving
>> > 1500Mbits in a faraday cage on a single flow.
>> >
>> > I encourage you to try tests in netperf-wrapper that explicitly test
for
>> > latency under load, and in particular, the RTT_FAIR tests against 4 or
more
>> > stations on a single wifi AP. You will find the results very
depressing.
>> > Similarly, on your previous test series, a latency figure would have
been
>> > nice to have. I just did a talk at nznog, where I tested the local
wifi with
>> > less than ambits of throughput, and 3 seconds of latency, filmed here:
>> >
>> > https://plus.google.com/u/0/107942175615993706558/posts/CY8ew8MPnMt
>> >
>> > Do wish more folk were testing in the busy real world environments,
like
>> > coffee shops, cities... really, anywhere outside a faraday cage!
>> >
>> > I am not attending netconf - I was unable to raise funds to go, and the
>> > program committee wanted something "new",
>> >
>> > instead of the preso I gave the IEEE 802.11 working group back in
september.
>> > (
>> >
http://snapon.lab.bufferbloat.net/~d/ieee802.11-sept-17-2014/11-14-1265-00-0wng-More-on-Bufferbloat.pdf
>> > )
>> >
>> > I was very pleased with the results of that talk - the day after I
gave it,
>> > the phrase "test for latency" showed up in a bunch of 802.11ax (the
next
>> > generation after ac) documents. :) Still, we are stuck with the train
wreck
>> > that is 802.11ac glommed on top of 802.11n, glommed on top of 802.11g,
in
>> > terms of queue management, terrible uses of airtime, rate control and
other
>> > stuff. Aruba and Meraki, in particular took a big interest in what I'd
>> > outlined in the preso above (we have a half dozen less well baked
ideas -
>> > that's just the easy stuff that can be done to improve wifi).  I gave a
>> > followup at meraki but I don't think that's online.
>> >
>> > Felix (nbd) is on vacation right now, as I am I. In fact I am going
>> > somewhere for a week totally lacking internet access.
>> >
>> > Presently the plan, with what budget (none) we have and time (very
little)
>> > we have is to produce a pair of proof of concept implementations for
per tid
>> > queuing (see relevant commit by nbd),  leveraging the new minstrel
stats,
>> > the new minstrel-blues stuff, and an aggregation aware codel with a
>> > calculated target based on the most recently active stations, and a
bunch of
>> > the other stuff outlined above at IEEE.
>> >
>> > It is my hope that this will start to provide accurate back pressure
(or
>> > sufficient lack thereof for TSQ), to also improve throughput while
still
>> > retaining low latency. But it is a certainty that we will run into more
>> > cross layer issues that will be a pita to resolve.
>> >
>> > If we can put together a meet up around or during ELC in california in
>> > march?
>> >
>> > I am really not terribly optimistic on anything other than the 2
chipsets we
>> > can hack on (ath9k, mt76). Negotiations to get qualcomm to open up
their
>> > ath10k firmware have thus far failed, nor has a ath10k-lite got
anywhere.
>> > Perhaps broadcom would be willing to open up their firmware
sufficiently to
>> > build in a better API?
>> >
>> > A bit more below.
>> >
>> >
>> > On Jan 30, 2015 5:59 AM, "Arend van Spriel" <arend@broadcom.com> wrote:
>> >>
>> >> On 01/30/15 14:19, Eric Dumazet wrote:
>> >>>
>> >>> On Fri, 2015-01-30 at 11:29 +0100, Arend van Spriel wrote:
>> >>>
>> >>>> Hi Eric,
>> >>>>
>> >>>> Your suggestions are still based on the fact that you consider
wireless
>> >>>> networking to be similar to ethernet, but as Michal indicated there
are
>> >>>> some fundamental differences starting with CSMA/CD versus CSMA/CA.
Also
>> >>>> the medium conditions are far from comparable.
>> >
>> > The analogy i now use for it is that switched ethernet is generally
your
>> > classic "dumbbell"
>> >
>> > topology. Wifi is more like a "taxi-stand" topology. If you think
about how
>> > people
>> >
>> > queue up at a taxi stand (and sometimes agree to share a ride), the
inter
>> > arrival
>> >
>> > and departure times of a taxi stand make for a better mental model.
>> >
>> > Admittedly, I seem to spend a lot of time, waiting for taxies, thinking
>> > about
>> >
>> > wifi.
>> >
>> >>> There is no shielding so
>> >>>> it needs to deal with interference and dynamically drops the link
rate
>> >>>> so transmission of packets can take several milliseconds. Then with
11n
>> >>>> they came up with aggregation with sends up to 64 packets in a
single
>> >>>> transmit over the air at worst case 6.5 Mbps (if I am not
mistaken). The
>> >>>> parameter value for tcp_limit_output_bytes of 131072 means that it
>> >>>> allows queuing for about 1ms on a 1Gbps link, but I hope you can see
>> >>>> this is not realistic for dealing with all variances of the wireless
>> >>>> medium/standard. I suggested this as topic for the wireless
workshop in
>> >>>> Otawa [1], but I can not attend there. Still hope that there will be
>> >>>> some discussions to get more awareness.
>> >
>> > I have sometimes hoped that TSQ could be made more a function of the
>> >
>> > number of active flows exiting an interface, but eric tells me that's
>> > impossible.
>> >
>> > This is possibly another case where TSQ could use to be a callback
>> > function...
>> >
>> > but frankly I care not a whit about maximizing single flow tcp
throughput on
>> > wifi
>> >
>> > in a faraday cage.
>> >
>> >
>> >>>
>> >>> Ever heard about bufferbloat ?
>> >>
>> >>
>> >> Sure. I am trying to get awareness about that in our wireless
>> >> driver/firmware development teams. So bear with me.
>> >>
>> >>
>> >>> Have you read my suggestions and tried them ?
>> >>>
>> >>> You can adjust the limit per flow to pretty much you want. If you
need
>> >>> 64 packets, just do the math. If in 2018 you need 128 packets, do the
>> >>> math again.
>> >>>
>> >>> I am very well aware that wireless wants aggregation, thank you.
>> >
>> > I note that a lot of people testing this are getting it backwards.
Usually
>> > it is the AP that is sending lots and lots of big packets, where the
return
>> > path is predominately acks from the station.
>> >
>> > I am not a huge fan of stretch acks, but certainly a little bit of
thinning
>> > doesn't bother me on the return path there.
>> >
>> > Going the other way, particularly in a wifi world that insists on
treating
>> > every packet as sacred (which I don't agree with at all), thinning
acks can
>> > help, but single stream throughput is of interest only on benchmarks,
FQing
>> > as much as possible all the flows destined the station in each
aggregate
>> > masks loss and reduces the need to protect everything so much.
>> >
>> >>
>> >> Sorry if I offended you. I was just giving these as example combined
with
>> >> effective rate usable on the medium to say that the bandwidth is more
>> >> dynamic in wireless and as such need dynamic change of queue depth.
Now this
>> >> can be done by making the fraction size as used in your suggestion
adaptive
>> >> to these conditions.
>> >
>> > Well... see above. Maybe this technique will do more of the right
thing,
>> > but... go test.
>> >
>> >
>> >>
>> >>> 131072 bytes of queue on 40Gbit is not 1ms, but 26 usec of queueing,
and
>> >>> we get line rate nevertheless.
>> >>
>> >>
>> >> I was saying it was about 1ms on *1Gbit* as the wireless TCP rates are
>> >> moving into that direction in 11ac.
>> >>
>> >>
>> >>> We need this level of shallow queues (BQL, TSQ), to get very precise
rtt
>> >>> estimations so that TCP has good entropy for its pacing, even in the
50
>> >>> usec rtt ranges.
>> >>>
>> >>> If we allowed 1ms of queueing, then a 40Gbit flow would queue 5
MBytes.
>> >>>
>> >>> This was terrible, because it increased cwnd and all sender queues to
>> >>> insane levels.
>> >>
>> >>
>> >> Indeed and that is what we would like to address in our wireless
drivers.
>> >> I will setup some experiments using the fraction sizing and post my
>> >> findings. Again sorry if I offended you.
>> >
>> > You really, really, really need to test at rates below 50mbit and with
other
>> > stations, also while doing this. It's not going to be a linear curve.
>> >
>> >
>> >
>> >>
>> >> Regards,
>> >> Arend
>> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> >
>> >
>> > --
>> > Dave Täht
>> >
>> > thttp://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks
>
>

[-- Attachment #1.2: Type: text/html, Size: 27633 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Fwd: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-01  8:45                   ` Fwd: " Dave Taht
@ 2015-02-01 10:47                     ` Jonathan Morton
  2015-02-01 14:43                       ` dpreed
  0 siblings, 1 reply; 107+ messages in thread
From: Jonathan Morton @ 2015-02-01 10:47 UTC (permalink / raw)
  To: Dave Taht
  Cc: dstanley, Andrew McGregor, Stig Thormodsrud, netdev,
	linux-wireless, Jesper Dangaard Brouer, cerowrt-devel,
	Matt Mathis, Derrick Pallas, Mahesh Paolini-Subramanya,
	Kathy Giori, Tim Shepard, Avery Pennarun


[-- Attachment #1.1: Type: text/plain, Size: 2975 bytes --]

Since this is going to be a big job, it's worth prioritising parts of it
appropriately.

Minstrel is probably already the single best feature of the Linux Wi-Fi
stack. AFAIK it still outperforms any other rate selector we know about. So
I don't consider improving it further to be a high priority, although that
trick of using it as a sneaky random packet loss inducer is intriguing.

Much more important and urgent is getting some form of functioning SQM
closer to the hardware, where the information is. I don't think we need to
get super fancy here to do some real good, in the same way that PIE is a
major improvement over drop-tail. I'd settle for a variant of fq_codel that
gets and uses information about whether the current packet request might be
aggregated with the previous packet provided, and adjusts its choice of
packet accordingly.

At the same time, models would undoubtedly be useful to help test and
repeatably demonstrate the advantages of both simple and more sophisticated
solutions. Ns3 allows laying out a reasonably complex radio environment,
which is great for this. To counter the prevalence of one-station Faraday
cage tests in the industry, the simulated environments should represent
realistic, challenging use cases:

1) the family home, with half a dozen client devices competing with several
interference sources (Bluetooth, RC toys, microwave oven, etc). This is a
relatively easy environment, representing the expected environment for
consumer equipment.

2) the apartment block, with fewer clients per AP but lots of APs
distributed throughout a large building. Walls and floors may provide
valuable attenuation here - unless you're in Japan, where they can be
notoriously thin.

3) the railway carriage, consisting of eighty passengers in a 20x3 m space,
and roughly the same number of client devices. The uplink is 3G based and
has some inherent latency. Add some Bluetooth for flavour, stir gently.
This one is rather challenging, but there is scope to optimise AP antenna
placement, and to scale the test down slightly by reducing seat occupancy.

4) the jumbo jet, consisting of several hundred passengers crammed in like
sardines. The uplink has satellite latencies built in. Good luck.

5) the business hotel. Multiple APs will be needed to provide adequate
coverage for this environment, which should encompass the rooms as well as
lounge, conference and dining areas. Some visitors may bring their own APs,
and the system must be able to cope with this without seriously degrading
performance.

6) the trade conference. A large arena filled with thousands of people.
Multiple APs required. Good luck.

I also feel that ultimately we're going to have to get industry on board.
Not just passively letting us play around as with ath9k, but actively
taking note of our findings and implementing at least a few of our ideas
themselves. Of course, tools, models and real-world results are likely to
make that easier.

- Jonathan Morton

[-- Attachment #1.2: Type: text/html, Size: 3211 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Fwd: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-01 10:47                     ` Jonathan Morton
@ 2015-02-01 14:43                       ` dpreed
  2015-02-01 23:34                         ` Andrew McGregor
  2015-02-02  4:21                         ` Avery Pennarun
  0 siblings, 2 replies; 107+ messages in thread
From: dpreed @ 2015-02-01 14:43 UTC (permalink / raw)
  To: Jonathan Morton
  Cc: dstanley, Andrew McGregor, Stig Thormodsrud, netdev,
	linux-wireless, Jesper Dangaard Brouer, cerowrt-devel,
	Matt Mathis, Derrick Pallas, Mahesh Paolini-Subramanya,
	Kathy Giori, Tim Shepard, Avery Pennarun


[-- Attachment #1.1: Type: text/plain, Size: 7456 bytes --]


Just to clarify, managing queueing in a single access point WiFi network is only a small part of the problem of fixing the rapidly degrading performance of WiFi based systems.  Similarly, mesh routing is only a small part of the problem with the scalability of cooperative meshes based on the WiFi MAC.
 
So I don't disagree with work on queue management (which is not good in any commercial product).
 
But Dave T has done some great talks on "fixing WiFi" that don't have to do with queueing very much at all.
 
For example, rate selection for various packets is terrible.  When you have nearly 1000:1 ratios of transmission rates and codes that are not backward compatible, there's a huge opportunity for improvement.  Similarly, choice of frequency bandwidth and center frequency at each station offers huge opportunities for practical scalability of systems.  Also, as we noted earlier, "handoff" from one next hop to another is a huge problem with performance in practical deployments (a factor of 10x at least, just in that).
 
Propagation information is not used at all when 802.11 systems share a channel, even in single AP deployments, yet all stations can measure propagation quite accurately in their hardware.
 
Finally, Listen-before-talk is highly wasteful for two reasons: 1) any random radio noise from other sources unnecessarily degrades communications (and in the 5.8 MHz band, the rule about "radar avoidance" requires treating very low level noise as a "signal to shut the net down by law", but there is a loophole if you can tell that it's not actually "radar" (the technique requires two or more stations to measure the same noise event, and if the power is significantly different - more than a few dB - then it can't possibly be due to a distant transmitter, and therefore can be ignored). 2) the transmitter cannot tell when the intended receiver will be perfectly able to decode the signal without interference with the station it hears (this second point is actually proven in theory in a paper by Jon Peha that argued against trivial "etiquettes" as a mechanism for sharing among uncooperative and non-interoperable stations).
 
Dave T has discussed more, as have I in other venues.
 
The reason no one is making progress on any of these particular issues is that there is no coordination at the "systems level" around creating rising tides that lift all boats in the WiFi-ish space.  It's all about ripping the competition by creating stuff that can sell better than the other guys' stuff, and avoiding cooperation at all costs.
 
I agree that, to the extent that managing queues in a single box or a single operating system doesn't require cooperation, it's much easier to get such things into the market.  That's why CeroWRT has been as effective as it has been.  But has Microsoft done anything at all about it?   Do the better ECN signals that can arise from good queue management get used by the TCP endpoints, or for that matter UDP-based protocol endpoints?
 
But the big wins in making WiFi better are going begging.  As WiFi becomes more closed, as it will as the major Internet Access Providers and Gadget builders (Google, Apple) start excluding innovators in wireless from the market by closed, proprietary solutions, the problem WILL get worse.  You won't be able to fix those problems at all.  If you have a solution you will have to convince the oligopoly to even bother trying it.
 
So, let me reiterate.  The problem is not just "getting Minstrel adopted", though I have nothing against that as a subgoal.  The problem is to find good systems-level answers, and to find a strategy to deliver those answers to a WiFi ecology that spans the planet, and where the marketing value-story focuses on things one can measure between two stations in a Faraday cage, and never on any systems-level issues.
 
 
I personally think that things like promoting semi-closed, essentially proprietary ESSID-based bridged distribution systems as "good ideas" are counterproductive to this goal.  But that's perhaps too radical for this crowd.  It reminds me of Cisco's attempt to create a proprietary Internet technology with IOS, which fortunately was not the success Cisco hoped for, or Juniper would not have existed. Maybe IOS would have been a fine standard, but it would have killed the evolution of the Internet as we know it.

On Sunday, February 1, 2015 5:47am, "Jonathan Morton" <chromatix99@gmail.com> said:



Since this is going to be a big job, it's worth prioritising parts of it appropriately.
Minstrel is probably already the single best feature of the Linux Wi-Fi stack. AFAIK it still outperforms any other rate selector we know about. So I don't consider improving it further to be a high priority, although that trick of using it as a sneaky random packet loss inducer is intriguing.
Much more important and urgent is getting some form of functioning SQM closer to the hardware, where the information is. I don't think we need to get super fancy here to do some real good, in the same way that PIE is a major improvement over drop-tail. I'd settle for a variant of fq_codel that gets and uses information about whether the current packet request might be aggregated with the previous packet provided, and adjusts its choice of packet accordingly.
At the same time, models would undoubtedly be useful to help test and repeatably demonstrate the advantages of both simple and more sophisticated solutions. Ns3 allows laying out a reasonably complex radio environment, which is great for this. To counter the prevalence of one-station Faraday cage tests in the industry, the simulated environments should represent realistic, challenging use cases:
1) the family home, with half a dozen client devices competing with several interference sources (Bluetooth, RC toys, microwave oven, etc). This is a relatively easy environment, representing the expected environment for consumer equipment.
2) the apartment block, with fewer clients per AP but lots of APs distributed throughout a large building. Walls and floors may provide valuable attenuation here - unless you're in Japan, where they can be notoriously thin.
3) the railway carriage, consisting of eighty passengers in a 20x3 m space, and roughly the same number of client devices. The uplink is 3G based and has some inherent latency. Add some Bluetooth for flavour, stir gently. This one is rather challenging, but there is scope to optimise AP antenna placement, and to scale the test down slightly by reducing seat occupancy.
4) the jumbo jet, consisting of several hundred passengers crammed in like sardines. The uplink has satellite latencies built in. Good luck.
5) the business hotel. Multiple APs will be needed to provide adequate coverage for this environment, which should encompass the rooms as well as lounge, conference and dining areas. Some visitors may bring their own APs, and the system must be able to cope with this without seriously degrading performance.
6) the trade conference. A large arena filled with thousands of people. Multiple APs required. Good luck.
I also feel that ultimately we're going to have to get industry on board. Not just passively letting us play around as with ath9k, but actively taking note of our findings and implementing at least a few of our ideas themselves. Of course, tools, models and real-world results are likely to make that easier.
- Jonathan Morton

[-- Attachment #1.2: Type: text/html, Size: 11383 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Fwd: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-01 14:43                       ` dpreed
@ 2015-02-01 23:34                         ` Andrew McGregor
  2015-02-02  4:04                             ` Avery Pennarun
  2015-02-02  4:21                         ` Avery Pennarun
  1 sibling, 1 reply; 107+ messages in thread
From: Andrew McGregor @ 2015-02-01 23:34 UTC (permalink / raw)
  To: dpreed
  Cc: dstanley, Stig Thormodsrud, netdev, linux-wireless,
	Jesper Dangaard Brouer, cerowrt-devel, Matt Mathis,
	Derrick Pallas, Kathy Giori, Mahesh Paolini-Subramanya,
	Jonathan Morton, Tim Shepard, Avery Pennarun


[-- Attachment #1.1: Type: text/plain, Size: 9023 bytes --]

So far as I'm concerned, Minstrel is as widely adopted as we could want; it
could go further, but it doesn't need any help with that.  It could be
better, however, and that's a fairly straightforward change.  One of the
problems with rate selection is that the standard itself has a bit to say
on what rates to use for what packets, and what it says is braindead.  So
if you want to be standards compliant you are also leaving quite a bit of
potential performance on the table.

I missed one item in my list of potential improvements: the most braindead
thing 802.11 has to say about rates is that broadcast and multicast packets
should be sent at 'the lowest basic rate in the current supported rate
set', which is really wasteful.  There are a couple of ways of dealing with
this: one, ignore the standard and pick the rate that is most likely to get
the frame to as many neighbours as possible (by a scan of the Minstrel
tables).  Or two, fan it out as unicast, which might well take less airtime
(due to aggregation) as well as being much more likely to be delivered,
since you get ACKs and retries by doing that.

As for the industry... sure, there are those forces, but then again there's
a non-trivial amount of influence in this audience.

On Mon, Feb 2, 2015 at 1:43 AM, <dpreed@reed.com> wrote:

> Just to clarify, managing queueing in a single access point WiFi network
> is only a small part of the problem of fixing the rapidly degrading
> performance of WiFi based systems.  Similarly, mesh routing is only a small
> part of the problem with the scalability of cooperative meshes based on the
> WiFi MAC.
>
>
>
> So I don't disagree with work on queue management (which is not good in
> any commercial product).
>
>
>
> But Dave T has done some great talks on "fixing WiFi" that don't have to
> do with queueing very much at all.
>
>
>
> For example, rate selection for various packets is terrible.  When you
> have nearly 1000:1 ratios of transmission rates and codes that are not
> backward compatible, there's a huge opportunity for improvement.
> Similarly, choice of frequency bandwidth and center frequency at each
> station offers huge opportunities for practical scalability of systems.
> Also, as we noted earlier, "handoff" from one next hop to another is a huge
> problem with performance in practical deployments (a factor of 10x at
> least, just in that).
>
>
>
> Propagation information is not used at all when 802.11 systems share a
> channel, even in single AP deployments, yet all stations can measure
> propagation quite accurately in their hardware.
>
>
>
> Finally, Listen-before-talk is highly wasteful for two reasons: 1) any
> random radio noise from other sources unnecessarily degrades communications
> (and in the 5.8 MHz band, the rule about "radar avoidance" requires
> treating very low level noise as a "signal to shut the net down by law",
> but there is a loophole if you can tell that it's not actually "radar" (the
> technique requires two or more stations to measure the same noise event,
> and if the power is significantly different - more than a few dB - then it
> can't possibly be due to a distant transmitter, and therefore can be
> ignored). 2) the transmitter cannot tell when the intended receiver will be
> perfectly able to decode the signal without interference with the station
> it hears (this second point is actually proven in theory in a paper by Jon
> Peha that argued against trivial "etiquettes" as a mechanism for sharing
> among uncooperative and non-interoperable stations).
>
>
>
> Dave T has discussed more, as have I in other venues.
>
>
>
> The reason no one is making progress on any of these particular issues is
> that there is no coordination at the "systems level" around creating rising
> tides that lift all boats in the WiFi-ish space.  It's all about ripping
> the competition by creating stuff that can sell better than the other guys'
> stuff, and avoiding cooperation at all costs.
>
>
>
> I agree that, to the extent that managing queues in a single box or a
> single operating system doesn't require cooperation, it's much easier to
> get such things into the market.  That's why CeroWRT has been as effective
> as it has been.  But has Microsoft done anything at all about it?   Do the
> better ECN signals that can arise from good queue management get used by
> the TCP endpoints, or for that matter UDP-based protocol endpoints?
>
>
>
> But the big wins in making WiFi better are going begging.  As WiFi becomes
> more closed, as it will as the major Internet Access Providers and Gadget
> builders (Google, Apple) start excluding innovators in wireless from the
> market by closed, proprietary solutions, the problem WILL get worse.  You
> won't be able to fix those problems at all.  If you have a solution you
> will have to convince the oligopoly to even bother trying it.
>
>
>
> So, let me reiterate.  The problem is not just "getting Minstrel adopted",
> though I have nothing against that as a subgoal.  The problem is to find
> good systems-level answers, and to find a strategy to deliver those answers
> to a WiFi ecology that spans the planet, and where the marketing
> value-story focuses on things one can measure between two stations in a
> Faraday cage, and never on any systems-level issues.
>
>
>
>
>
> I personally think that things like promoting semi-closed, essentially
> proprietary ESSID-based bridged distribution systems as "good ideas" are
> counterproductive to this goal.  But that's perhaps too radical for this
> crowd.  It reminds me of Cisco's attempt to create a proprietary Internet
> technology with IOS, which fortunately was not the success Cisco hoped for,
> or Juniper would not have existed. Maybe IOS would have been a fine
> standard, but it would have killed the evolution of the Internet as we know
> it.
>
>
> On Sunday, February 1, 2015 5:47am, "Jonathan Morton" <
> chromatix99@gmail.com> said:
>
>  Since this is going to be a big job, it's worth prioritising parts of it
> appropriately.
>
> Minstrel is probably already the single best feature of the Linux Wi-Fi
> stack. AFAIK it still outperforms any other rate selector we know about. So
> I don't consider improving it further to be a high priority, although that
> trick of using it as a sneaky random packet loss inducer is intriguing.
>
> Much more important and urgent is getting some form of functioning SQM
> closer to the hardware, where the information is. I don't think we need to
> get super fancy here to do some real good, in the same way that PIE is a
> major improvement over drop-tail. I'd settle for a variant of fq_codel that
> gets and uses information about whether the current packet request might be
> aggregated with the previous packet provided, and adjusts its choice of
> packet accordingly.
>
> At the same time, models would undoubtedly be useful to help test and
> repeatably demonstrate the advantages of both simple and more sophisticated
> solutions. Ns3 allows laying out a reasonably complex radio environment,
> which is great for this. To counter the prevalence of one-station Faraday
> cage tests in the industry, the simulated environments should represent
> realistic, challenging use cases:
>
> 1) the family home, with half a dozen client devices competing with
> several interference sources (Bluetooth, RC toys, microwave oven, etc).
> This is a relatively easy environment, representing the expected
> environment for consumer equipment.
>
> 2) the apartment block, with fewer clients per AP but lots of APs
> distributed throughout a large building. Walls and floors may provide
> valuable attenuation here - unless you're in Japan, where they can be
> notoriously thin.
>
> 3) the railway carriage, consisting of eighty passengers in a 20x3 m
> space, and roughly the same number of client devices. The uplink is 3G
> based and has some inherent latency. Add some Bluetooth for flavour, stir
> gently. This one is rather challenging, but there is scope to optimise AP
> antenna placement, and to scale the test down slightly by reducing seat
> occupancy.
>
> 4) the jumbo jet, consisting of several hundred passengers crammed in like
> sardines. The uplink has satellite latencies built in. Good luck.
>
> 5) the business hotel. Multiple APs will be needed to provide adequate
> coverage for this environment, which should encompass the rooms as well as
> lounge, conference and dining areas. Some visitors may bring their own APs,
> and the system must be able to cope with this without seriously degrading
> performance.
>
> 6) the trade conference. A large arena filled with thousands of people.
> Multiple APs required. Good luck.
>
> I also feel that ultimately we're going to have to get industry on board.
> Not just passively letting us play around as with ath9k, but actively
> taking note of our findings and implementing at least a few of our ideas
> themselves. Of course, tools, models and real-world results are likely to
> make that easier.
>
> - Jonathan Morton
>

[-- Attachment #1.2: Type: text/html, Size: 12890 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-02  4:04                             ` Avery Pennarun
  0 siblings, 0 replies; 107+ messages in thread
From: Avery Pennarun @ 2015-02-02  4:04 UTC (permalink / raw)
  To: Andrew McGregor
  Cc: David Reed, Jonathan Morton, Dave Taht, Matt Mathis, Tim Shepard,
	dstanley, Kathy Giori, Stig Thormodsrud, Derrick Pallas,
	cerowrt-devel, Mahesh Paolini-Subramanya, Jim Gettys,
	Jesper Dangaard Brouer, linux-wireless, netdev

On Sun, Feb 1, 2015 at 6:34 PM, Andrew McGregor <andrewmcgr@gmail.com> wrote:
> I missed one item in my list of potential improvements: the most braindead
> thing 802.11 has to say about rates is that broadcast and multicast packets
> should be sent at 'the lowest basic rate in the current supported rate set',
> which is really wasteful.  There are a couple of ways of dealing with this:
> one, ignore the standard and pick the rate that is most likely to get the
> frame to as many neighbours as possible (by a scan of the Minstrel tables).
> Or two, fan it out as unicast, which might well take less airtime (due to
> aggregation) as well as being much more likely to be delivered, since you
> get ACKs and retries by doing that.

As far as I can see, the only sensible thing to do with
multicast/broadcast is some variation of the unicast fanout, unless
you've got a truly huge number of nodes.  I don't know of any
protocols (certainly not video streams) that actually work well with
the kind of packet loss you see at medium/long range with wifi if
retransmits aren't used.  I've heard that openwrt already has a patch
included that does this kind of fanout at the bridge layer.

I've also heard of a new "reliable multicast" in some newer 802.11
variant, which essentially sends out a single multicast packet and
expects an ACK from each intended recipient.  Other than adding
complexity, it seems like the best of both worlds.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-02  4:04                             ` Avery Pennarun
  0 siblings, 0 replies; 107+ messages in thread
From: Avery Pennarun @ 2015-02-02  4:04 UTC (permalink / raw)
  To: Andrew McGregor
  Cc: David Reed, Jonathan Morton, Dave Taht, Matt Mathis, Tim Shepard,
	dstanley-lTB0Xd325TxqO0yjgLDq5wC/G2K4zDHf, Kathy Giori,
	Stig Thormodsrud, Derrick Pallas,
	cerowrt-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1,
	Mahesh Paolini-Subramanya, Jim Gettys, Jesper Dangaard Brouer,
	linux-wireless, netdev-u79uwXL29TY76Z2rM5mHXA

On Sun, Feb 1, 2015 at 6:34 PM, Andrew McGregor <andrewmcgr-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> I missed one item in my list of potential improvements: the most braindead
> thing 802.11 has to say about rates is that broadcast and multicast packets
> should be sent at 'the lowest basic rate in the current supported rate set',
> which is really wasteful.  There are a couple of ways of dealing with this:
> one, ignore the standard and pick the rate that is most likely to get the
> frame to as many neighbours as possible (by a scan of the Minstrel tables).
> Or two, fan it out as unicast, which might well take less airtime (due to
> aggregation) as well as being much more likely to be delivered, since you
> get ACKs and retries by doing that.

As far as I can see, the only sensible thing to do with
multicast/broadcast is some variation of the unicast fanout, unless
you've got a truly huge number of nodes.  I don't know of any
protocols (certainly not video streams) that actually work well with
the kind of packet loss you see at medium/long range with wifi if
retransmits aren't used.  I've heard that openwrt already has a patch
included that does this kind of fanout at the bridge layer.

I've also heard of a new "reliable multicast" in some newer 802.11
variant, which essentially sends out a single multicast packet and
expects an ACK from each intended recipient.  Other than adding
complexity, it seems like the best of both worlds.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-01 14:43                       ` dpreed
  2015-02-01 23:34                         ` Andrew McGregor
@ 2015-02-02  4:21                         ` Avery Pennarun
  2015-02-02  7:07                             ` David Lang
  1 sibling, 1 reply; 107+ messages in thread
From: Avery Pennarun @ 2015-02-02  4:21 UTC (permalink / raw)
  To: David Reed
  Cc: Jonathan Morton, Dave Taht, Matt Mathis, Tim Shepard, dstanley,
	Kathy Giori, Stig Thormodsrud, Derrick Pallas, cerowrt-devel,
	Mahesh Paolini-Subramanya, Jim Gettys, Andrew McGregor,
	Jesper Dangaard Brouer, linux-wireless, netdev

On Sun, Feb 1, 2015 at 9:43 AM,  <dpreed@reed.com> wrote:
> Just to clarify, managing queueing in a single access point WiFi network is
> only a small part of the problem of fixing the rapidly degrading performance
> of WiFi based systems.

Can you explain what you mean by "rapidly degrading?"  The performance
in odd situations is certainly not inspirational, but I haven't
noticed it getting worse over time.

> Similarly, mesh routing is only a small part of the
> problem with the scalability of cooperative meshes based on the WiFi MAC.

That's certainly true.  Not to say the mesh routing algorithms are
much good either.

>  Also, as we noted
> earlier, "handoff" from one next hop to another is a huge problem with
> performance in practical deployments (a factor of 10x at least, just in
> that).

While there is definitely some work to be done in handoff, it seems
like there are some find implementations of this already in existence.
Several brands of "enterprise access point" setups seem to do well at
this.  It would be nice if they interoperated, I guess.

The fact that there's no open source version of this kind of handoff
feature bugs me, but we are working on it here and the work is all
planned to be open source, for example: (very early version)
https://gfiber.googlesource.com/vendor/google/platform/+/master/waveguide/

> Propagation information is not used at all when 802.11 systems share a
> channel, even in single AP deployments, yet all stations can measure
> propagation quite accurately in their hardware.

802.11k seems to provide for sharing this information.  But I'm not
clear what I should use it for. :)

> Finally, Listen-before-talk is highly wasteful for two reasons: 1) any
> random radio noise from other sources unnecessarily degrades communications [...]
> 2) the transmitter cannot tell when the intended receiver will be perfectly
> able to decode the signal without interference with the station it hears
> (this second point is actually proven in theory in a paper by Jon Peha that
> argued against trivial "etiquettes" as a mechanism for sharing among
> uncooperative and non-interoperable stations).

I've thought quite a bit about your point #2 above, but I don't know
which direction to pursue.  The idea is that sometimes "just shout
over the background noise" is a globally optimal solution, right?  The
question seems to be to figure out when that is true and when it
isn't.

> I agree that, to the extent that managing queues in a single box or a single
> operating system doesn't require cooperation, it's much easier to get such
> things into the market.  That's why CeroWRT has been as effective as it has
> been.  But has Microsoft done anything at all about it?   Do the better ECN
> signals that can arise from good queue management get used by the TCP
> endpoints, or for that matter UDP-based protocol endpoints?

If we don't know the answer to the questions, then that is itself the
problem.  It's a lot easier to say, hey, ChromeOS and MacOS have good
network performance but Microsoft has bad network performance, if it's
true and we have good reproducible tests to demonstrate that.

> The reason no one is making progress on any of these particular issues is
> that there is no coordination at the "systems level" around creating rising
> tides that lift all boats in the WiFi-ish space.  It's all about ripping the
> competition by creating stuff that can sell better than the other guys'
> stuff, and avoiding cooperation at all costs.
> [...]
> But the big wins in making WiFi better are going begging.  As WiFi becomes
> more closed, as it will as the major Internet Access Providers and Gadget
> builders (Google, Apple) start excluding innovators in wireless from the
> market by closed, proprietary solutions, the problem WILL get worse.  You
> won't be able to fix those problems at all.  If you have a solution you will
> have to convince the oligopoly to even bother trying it.

As someone who works at Google Fiber (which is both a gadget maker and
an ISP) and who pushes all day long for our wifi stuff to be open
source, I'm slightly offended to be lumped in with other vendors in
your story :)  I think the ChromeOS team (which insists on only open
source wifi drivers in all chromebooks) would feel similarly.  We are
lucky to have defined our competitive advantage as something other
than short-lived slight improvements in wifi that will soon be
wastefully duplicated by everyone else.

That said, I see what you mean about the general state of the
industry.  The way to fix it is the way Linux always fixes it: make
the open source version so much better that building a proprietary
one, just to gather a small incremental advantage, is a huge waste of
time and effort.  Work on minstrel and fq_codel go really far here.

> I personally think that things like promoting semi-closed, essentially
> proprietary ESSID-based bridged distribution systems as "good ideas" are
> counterproductive to this goal.  But that's perhaps too radical for this
> crowd.

Not sure what you mean here.  ESSID-based distribution systems seem
pretty well defined to me.  The only proprietary part is the
decision-making process for assisted roaming (ie. the "inter-AP
protocol") which is only an optional performance optimization.  There
really should be an open source version of this, and I'm in fact
feebly attempting to build one, but I don't feel like the world is
falling apart through not having it.  You can build a bridged
multi-BSS ESSID today with plain out-of-the-box hostapd.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-02  4:21                         ` Avery Pennarun
@ 2015-02-02  7:07                             ` David Lang
  0 siblings, 0 replies; 107+ messages in thread
From: David Lang @ 2015-02-02  7:07 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: David Reed, dstanley, Andrew McGregor, Stig Thormodsrud, netdev,
	linux-wireless, Jesper Dangaard Brouer, cerowrt-devel,
	Matt Mathis, Derrick Pallas, Kathy Giori,
	Mahesh Paolini-Subramanya, Jonathan Morton, Tim Shepard

On Sun, 1 Feb 2015, Avery Pennarun wrote:

> On Sun, Feb 1, 2015 at 9:43 AM,  <dpreed@reed.com> wrote:
>> I personally think that things like promoting semi-closed, essentially
>> proprietary ESSID-based bridged distribution systems as "good ideas" are
>> counterproductive to this goal.  But that's perhaps too radical for this
>> crowd.
>
> Not sure what you mean here.  ESSID-based distribution systems seem
> pretty well defined to me.  The only proprietary part is the
> decision-making process for assisted roaming (ie. the "inter-AP
> protocol") which is only an optional performance optimization.  There
> really should be an open source version of this, and I'm in fact
> feebly attempting to build one, but I don't feel like the world is
> falling apart through not having it.  You can build a bridged
> multi-BSS ESSID today with plain out-of-the-box hostapd.

I will be running a fully opensource bridged ESSID system at SCaLE this month. 
last year we had ~2500 people and devices with ~50 APs deployed, and it worked 
well. The only problem was that I needed to deploy a few more APs to cover some 
of the hallway areas more reliably.

There are tricks that the commercial systems pull that I can't currently 
duplicate with opensource tools. But as Avery says, they are optimizations, not 
something required for successful operation. It would be nice to get the 
assisted roaming portion available. But it's not required.

David Lang

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Fwd: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-02  7:07                             ` David Lang
  0 siblings, 0 replies; 107+ messages in thread
From: David Lang @ 2015-02-02  7:07 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: dstanley, Andrew McGregor, Stig Thormodsrud, netdev,
	linux-wireless, Jesper Dangaard Brouer, Derrick Pallas,
	Matt Mathis, cerowrt-devel, Jonathan Morton,
	Mahesh Paolini-Subramanya, Kathy Giori, Tim Shepard

On Sun, 1 Feb 2015, Avery Pennarun wrote:

> On Sun, Feb 1, 2015 at 9:43 AM,  <dpreed@reed.com> wrote:
>> I personally think that things like promoting semi-closed, essentially
>> proprietary ESSID-based bridged distribution systems as "good ideas" are
>> counterproductive to this goal.  But that's perhaps too radical for this
>> crowd.
>
> Not sure what you mean here.  ESSID-based distribution systems seem
> pretty well defined to me.  The only proprietary part is the
> decision-making process for assisted roaming (ie. the "inter-AP
> protocol") which is only an optional performance optimization.  There
> really should be an open source version of this, and I'm in fact
> feebly attempting to build one, but I don't feel like the world is
> falling apart through not having it.  You can build a bridged
> multi-BSS ESSID today with plain out-of-the-box hostapd.

I will be running a fully opensource bridged ESSID system at SCaLE this month. 
last year we had ~2500 people and devices with ~50 APs deployed, and it worked 
well. The only problem was that I needed to deploy a few more APs to cover some 
of the hallway areas more reliably.

There are tricks that the commercial systems pull that I can't currently 
duplicate with opensource tools. But as Avery says, they are optimizations, not 
something required for successful operation. It would be nice to get the 
assisted roaming portion available. But it's not required.

David Lang

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-02 10:27         ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-02 10:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-wireless, Network Development, eyalpe

On 30 January 2015 at 15:40, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2015-01-30 at 14:39 +0100, Michal Kazior wrote:
>
>> I've briefly tried playing with this knob to no avail unfortunately. I
>> tried 256K, 1M - it didn't improve TCP performance. When I tried to
>> make it smaller (e.g. 16K) the traffic dropped even more so it does
>> have an effect. It seems there's some other limiting factor in this
>> case.
>
> Interesting.
>
> Could you take some tcpdump/pcap with various tcp_limit_output_bytes
> values ?
>
> echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> tcpdump -p -i wlanX -s 128 -c 20000 -w 128k.pcap
>
> echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> tcpdump -p -i wlanX -s 128 -c 20000 -w 256k.pcap

I've run a couple of tests across different kernels. This got pretty
big so I decided to use an external file hosting:
 http://www.filedropper.com/testtar

Let me know if you can't access it (and perhaps you could suggest how
you prefer the logs to be delivered in that case).

The layout of logs is: $kernel/$limit-P$threads.pcap. I've also
included the test script and output of each test run.

While testing I've had my internal GRO patch for ath10k and no stretch
ack patches.

When I was trying to come up with a testing methodology I've noticed
something interesting:
 1. set 16k limit
 2. start iperf -P1
 3. observe 200mbps
 4. set 2048k limit (while iperf is running)
 5. observe 600mbps
 6. set 16limit back (while iperf is running)
 7. observe 500-600mbps (i.e. no drop to 200mbps)

Due to that I've decided to re-start iperf for each limit test.

If you want me to gather some other logs/dumps/configuration
permutations let me know, please.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-02 10:27         ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-02 10:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On 30 January 2015 at 15:40, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, 2015-01-30 at 14:39 +0100, Michal Kazior wrote:
>
>> I've briefly tried playing with this knob to no avail unfortunately. I
>> tried 256K, 1M - it didn't improve TCP performance. When I tried to
>> make it smaller (e.g. 16K) the traffic dropped even more so it does
>> have an effect. It seems there's some other limiting factor in this
>> case.
>
> Interesting.
>
> Could you take some tcpdump/pcap with various tcp_limit_output_bytes
> values ?
>
> echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> tcpdump -p -i wlanX -s 128 -c 20000 -w 128k.pcap
>
> echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> tcpdump -p -i wlanX -s 128 -c 20000 -w 256k.pcap

I've run a couple of tests across different kernels. This got pretty
big so I decided to use an external file hosting:
 http://www.filedropper.com/testtar

Let me know if you can't access it (and perhaps you could suggest how
you prefer the logs to be delivered in that case).

The layout of logs is: $kernel/$limit-P$threads.pcap. I've also
included the test script and output of each test run.

While testing I've had my internal GRO patch for ath10k and no stretch
ack patches.

When I was trying to come up with a testing methodology I've noticed
something interesting:
 1. set 16k limit
 2. start iperf -P1
 3. observe 200mbps
 4. set 2048k limit (while iperf is running)
 5. observe 600mbps
 6. set 16limit back (while iperf is running)
 7. observe 500-600mbps (i.e. no drop to 200mbps)

Due to that I've decided to re-start iperf for each limit test.

If you want me to gather some other logs/dumps/configuration
permutations let me know, please.


Michał
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [Cerowrt-devel] Fwd: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-02  4:04                             ` Avery Pennarun
  (?)
@ 2015-02-02 15:25                             ` Jim Gettys
  -1 siblings, 0 replies; 107+ messages in thread
From: Jim Gettys @ 2015-02-02 15:25 UTC (permalink / raw)
  To: Avery Pennarun
  Cc: Andrew McGregor, David Reed, Jonathan Morton, Dave Taht,
	Matt Mathis, Tim Shepard, dstanley, Kathy Giori,
	Stig Thormodsrud, Derrick Pallas, cerowrt-devel,
	Mahesh Paolini-Subramanya, Jesper Dangaard Brouer,
	linux-wireless, netdev

On Sun, Feb 1, 2015 at 11:04 PM, Avery Pennarun <apenwarr@google.com> wrote:
> On Sun, Feb 1, 2015 at 6:34 PM, Andrew McGregor <andrewmcgr@gmail.com> wrote:
>> I missed one item in my list of potential improvements: the most braindead
>> thing 802.11 has to say about rates is that broadcast and multicast packets
>> should be sent at 'the lowest basic rate in the current supported rate set',
>> which is really wasteful.  There are a couple of ways of dealing with this:
>> one, ignore the standard and pick the rate that is most likely to get the
>> frame to as many neighbours as possible (by a scan of the Minstrel tables).
>> Or two, fan it out as unicast, which might well take less airtime (due to
>> aggregation) as well as being much more likely to be delivered, since you
>> get ACKs and retries by doing that.
>
> As far as I can see, the only sensible thing to do with
> multicast/broadcast is some variation of the unicast fanout, unless
> you've got a truly huge number of nodes.  I don't know of any
> protocols (certainly not video streams) that actually work well with
> the kind of packet loss you see at medium/long range with wifi if
> retransmits aren't used.  I've heard that openwrt already has a patch
> included that does this kind of fanout at the bridge layer.

I gather some Windows drivers from some vendors do this unicast fanout
(claim made by one of their engineers in an early homenet meeting).

>
> I've also heard of a new "reliable multicast" in some newer 802.11
> variant, which essentially sends out a single multicast packet and
> expects an ACK from each intended recipient.  Other than adding
> complexity, it seems like the best of both worlds.

So long as it times out in some very small, finite time.  We don't
want a repeat of the infinite retry bugs Dave found in drivers a few
years back...

"Reliable multicast" ultimately is an oxymoron, particularly on a
medium with hundreds/one bandwidth variation.  One remote low
bandwidth station cannot be allowed to drag the entire network to the
basement.
                                     - Jim

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-02 10:27         ` Michal Kazior
  (?)
@ 2015-02-02 18:52         ` Eric Dumazet
  2015-02-02 21:25           ` Ben Greear
                             ` (2 more replies)
  -1 siblings, 3 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-02 18:52 UTC (permalink / raw)
  To: Michal Kazior; +Cc: linux-wireless, Network Development, eyalpe

On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:

> While testing I've had my internal GRO patch for ath10k and no stretch
> ack patches.

Thanks for the data, I took a look at it.

I am afraid this GRO patch might be the problem.

It seems to break ACK clocking badly (linux stack has a somewhat buggy
tcp_tso_should_defer(), which relies on ACK being received smoothly, as
no timer is setup to split the TSO packet.)

I am seeing huge delays on ACK packets and bursts like that :

05:01:53.413038 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 76745, win 4435, options [nop,nop,TS val 4294758508 ecr 4294757300], length 0
05:01:53.413407 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 79641, win 4435, options [nop,nop,TS val 4294758508 ecr 4294757301], length 0
05:01:53.413969 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 92673, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
05:01:53.413990 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 97017, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
05:01:53.414011 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 110049, win 4435, options [nop,nop,TS val 4294758510 ecr 4294757302], length 0
...
05:01:53.422663 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 189689, win 4435, options [nop,nop,TS val 4294758519 ecr 4294757310], length 0
05:01:53.424354 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 198377, win 4435, options [nop,nop,TS val 4294758520 ecr 4294757311], length 0
05:01:53.424400 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 202721, win 4435, options [nop,nop,TS val 4294758520 ecr 4294757313], length 0
05:01:53.424409 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 205617, win 4435, options [nop,nop,TS val 4294758520 ecr 4294757313], length 0
...
05:01:53.450248 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 419921, win 4435, options [nop,nop,TS val 4294758547 ecr 4294757337], length 0
05:01:53.450266 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 427161, win 4435, options [nop,nop,TS val 4294758547 ecr 4294757340], length 0
05:01:53.450289 IP 192.168.1.2.5001 > 192.168.1.3.49669: Flags [.], ack 431505, win 4435, options [nop,nop,TS val 4294758547 ecr 4294757340], length 0

Could you make again your experiments using upstream kernel (David
Miller net tree) ?

You also could post the GRO patch so that we can comment on it.

Thanks



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-02 18:52         ` Eric Dumazet
@ 2015-02-02 21:25           ` Ben Greear
  2015-02-02 23:06               ` Eric Dumazet
  2015-02-03  1:18             ` Eric Dumazet
  2015-02-03  8:44           ` Michal Kazior
  2 siblings, 1 reply; 107+ messages in thread
From: Ben Greear @ 2015-02-02 21:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Michal Kazior, linux-wireless, Network Development, eyalpe

On 02/02/2015 10:52 AM, Eric Dumazet wrote:
> On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:
> 
>> While testing I've had my internal GRO patch for ath10k and no stretch
>> ack patches.
> 
> Thanks for the data, I took a look at it.
> 
> I am afraid this GRO patch might be the problem.
> 
> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> no timer is setup to split the TSO packet.)

It is a big throughput win to have fewer TCP ack packets on
wireless since it is a half-duplex environment.  Is there anything
we could improve so that we can have fewer acks and still get
good tcp stack behaviour?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-02 23:06               ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-02 23:06 UTC (permalink / raw)
  To: Ben Greear; +Cc: Michal Kazior, linux-wireless, Network Development, eyalpe

On Mon, 2015-02-02 at 13:25 -0800, Ben Greear wrote:

> It is a big throughput win to have fewer TCP ack packets on
> wireless since it is a half-duplex environment.  Is there anything
> we could improve so that we can have fewer acks and still get
> good tcp stack behaviour?

First apply TCP stretch ack fixes to the sender. There is no way to get
good performance if the sender does not handle stretch ack.

d6b1a8a92a14 tcp: fix timing issue in CUBIC slope calculation
9cd981dcf174 tcp: fix stretch ACK bugs in CUBIC
c22bdca94782 tcp: fix stretch ACK bugs in Reno
814d488c6126 tcp: fix the timid additive increase on stretch ACKs
e73ebb0881ea tcp: stretch ACK fixes prep

Then, make sure you do not throttle ACK too long, especially if you hope
to get Gbit line rate on a 4 ms RTT flow.

GRO does not mean : send one ACK every ms, or after 3ms delay...

It is literally :
  aggregate X packets at receive, and send the ACK asap.

If the receiver expects to have 64 ACK packets in the TX ring buffer to
actually send them (wifi aggregation), then you certainly do not want to
compress ACK too much.




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-02 23:06               ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-02 23:06 UTC (permalink / raw)
  To: Ben Greear
  Cc: Michal Kazior, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Mon, 2015-02-02 at 13:25 -0800, Ben Greear wrote:

> It is a big throughput win to have fewer TCP ack packets on
> wireless since it is a half-duplex environment.  Is there anything
> we could improve so that we can have fewer acks and still get
> good tcp stack behaviour?

First apply TCP stretch ack fixes to the sender. There is no way to get
good performance if the sender does not handle stretch ack.

d6b1a8a92a14 tcp: fix timing issue in CUBIC slope calculation
9cd981dcf174 tcp: fix stretch ACK bugs in CUBIC
c22bdca94782 tcp: fix stretch ACK bugs in Reno
814d488c6126 tcp: fix the timid additive increase on stretch ACKs
e73ebb0881ea tcp: stretch ACK fixes prep

Then, make sure you do not throttle ACK too long, especially if you hope
to get Gbit line rate on a 4 ms RTT flow.

GRO does not mean : send one ACK every ms, or after 3ms delay...

It is literally :
  aggregate X packets at receive, and send the ACK asap.

If the receiver expects to have 64 ACK packets in the TX ring buffer to
actually send them (wifi aggregation), then you certainly do not want to
compress ACK too much.



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-03  1:18             ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-03  1:18 UTC (permalink / raw)
  To: Michal Kazior; +Cc: linux-wireless, Network Development, eyalpe

On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:

> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> no timer is setup to split the TSO packet.)

Following patch might help the TSO split defer logic.

It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
Small Queue logic prevented the xmit at the expiration of first 'timer'.

This patch clears the tso_deferred variable only if we could really
send something.

Please try it, thanks !


diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..e735f38557db 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1821,7 +1821,6 @@ static bool tcp_tso_should_defer(struct sock *sk,
struct sk_buff *skb,
 	return true;
 
 send_now:
-	tp->tso_deferred = 0;
 	return false;
 }
 
@@ -2070,6 +2069,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
 
+		tp->tso_deferred = 0;
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-03  1:18             ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-03  1:18 UTC (permalink / raw)
  To: Michal Kazior
  Cc: linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:

> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> no timer is setup to split the TSO packet.)

Following patch might help the TSO split defer logic.

It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
Small Queue logic prevented the xmit at the expiration of first 'timer'.

This patch clears the tso_deferred variable only if we could really
send something.

Please try it, thanks !


diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..e735f38557db 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1821,7 +1821,6 @@ static bool tcp_tso_should_defer(struct sock *sk,
struct sk_buff *skb,
 	return true;
 
 send_now:
-	tp->tso_deferred = 0;
 	return false;
 }
 
@@ -2070,6 +2069,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
 			break;
 
+		tp->tso_deferred = 0;
 repair:
 		/* Advance the send_head.  This one is sent out.
 		 * This call will increment packets_out.


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-02 18:52         ` Eric Dumazet
  2015-02-02 21:25           ` Ben Greear
  2015-02-03  1:18             ` Eric Dumazet
@ 2015-02-03  8:44           ` Michal Kazior
  2 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-03  8:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-wireless, Network Development, eyalpe

On 2 February 2015 at 19:52, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2015-02-02 at 11:27 +0100, Michal Kazior wrote:
>
>> While testing I've had my internal GRO patch for ath10k and no stretch
>> ack patches.
>
> Thanks for the data, I took a look at it.
>
> I am afraid this GRO patch might be the problem.

The entire performance drop happens without the GRO patch as well. I
tested with it included because I intended to upstream it later. I'll
run without it in future tests.


[...]
> Could you make again your experiments using upstream kernel (David
> Miller net tree) ?

Sure.


> You also could post the GRO patch so that we can comment on it.

(You probably want to see mac80211 patch as well:
06d181a8fd58031db9c114d920b40d8820380a6e "mac80211: add NAPI support
back")

diff --git a/drivers/net/wireless/ath/ath10k/core.c
b/drivers/net/wireless/ath/ath10k/core.c
index 36a8fcf..367e896 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -1147,6 +1147,12 @@ err:
 }
 EXPORT_SYMBOL(ath10k_core_start);

+static int ath10k_core_napi_dummy_poll(struct napi_struct *napi, int budget)
+{
+       WARN_ON(1);
+       return 0;
+}
+
 int ath10k_wait_for_suspend(struct ath10k *ar, u32 suspend_opt)
 {
        int ret;
@@ -1414,6 +1420,10 @@ struct ath10k *ath10k_core_create(size_t
priv_size, struct device *dev,
        INIT_WORK(&ar->register_work, ath10k_core_register_work);
        INIT_WORK(&ar->restart_work, ath10k_core_restart);

+       init_dummy_netdev(&ar->napi_dev);
+       ieee80211_napi_add(ar->hw, &ar->napi, &ar->napi_dev,
+                          ath10k_core_napi_dummy_poll, 64);
+
        ret = ath10k_debug_create(ar);
        if (ret)
                goto err_free_wq;
@@ -1434,6 +1444,7 @@ void ath10k_core_destroy(struct ath10k *ar)
 {
        flush_workqueue(ar->workqueue);
        destroy_workqueue(ar->workqueue);
+       netif_napi_del(&ar->napi);

        ath10k_debug_destroy(ar);
        ath10k_mac_destroy(ar);
diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 2d9f871..b5a8847 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -623,6 +623,9 @@ struct ath10k {

        struct dfs_pattern_detector *dfs_detector;

+       struct net_device napi_dev;
+       struct napi_struct napi;
+
 #ifdef CONFIG_ATH10K_DEBUGFS
        struct ath10k_debug debug;
 #endif
diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c
b/drivers/net/wireless/ath/ath10k/htt_rx.c
index c1da44f..7e58b38 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -2061,5 +2061,7 @@ static void ath10k_htt_txrx_compl_task(unsigned long ptr)
                ath10k_htt_rx_in_ord_ind(ar, skb);
                dev_kfree_skb_any(skb);
        }
+
+       napi_gro_flush(&htt->ar->napi, false);
        spin_unlock_bh(&htt->rx_ring.lock);
 }

So that you can quickly get an understanding how ath10k Rx works:
first tasklet (not visible in the patch) picks up smallish event
buffers from firmware and puts them into ath10k queue for latter
processing by another tasklet (the last hunk). Each such event buffer
is just some metainfo but can "carry" tens of frames (both Rx and Tx
completions). The count is arbitrary and depends on fw/hw combo and
air conditions. The GRO flush is called after all queued small event
buffers are processed (frames delivered up to mac80211 which can in
turn perform aggregation reordering in case some frames were
re-transmitted in the meantime before handing them to net subsystem).


Michał

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-03  9:00                 ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-03  9:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, linux-wireless, Network Development, eyalpe

On 3 February 2015 at 00:06, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2015-02-02 at 13:25 -0800, Ben Greear wrote:
>
>> It is a big throughput win to have fewer TCP ack packets on
>> wireless since it is a half-duplex environment.  Is there anything
>> we could improve so that we can have fewer acks and still get
>> good tcp stack behaviour?
>
> First apply TCP stretch ack fixes to the sender. There is no way to get
> good performance if the sender does not handle stretch ack.
>
> d6b1a8a92a14 tcp: fix timing issue in CUBIC slope calculation
> 9cd981dcf174 tcp: fix stretch ACK bugs in CUBIC
> c22bdca94782 tcp: fix stretch ACK bugs in Reno
> 814d488c6126 tcp: fix the timid additive increase on stretch ACKs
> e73ebb0881ea tcp: stretch ACK fixes prep
>
> Then, make sure you do not throttle ACK too long, especially if you hope
> to get Gbit line rate on a 4 ms RTT flow.
>
> GRO does not mean : send one ACK every ms, or after 3ms delay...

I think it's worth pointing out that If you assume 3-frame A-MSDU and
64-frame A-MPDU you get 192 frames (as far as TCP/IP is concerned) per
aggregation window. Assuming effective 600mbps throughput:

 python> 1.0/((((600/8)*1024*1024)/1500)/(3*64))
 0.003663003663003663

This is probably worst case, but still probably worth to keep in mind.

ath10k has a knob to tune A-MSDU aggregation count. The default is "3"
and it's what I've been testing so far.

When I change it to "1" on sender I get 250->400mbps boost in TCP -P5
but see no difference with -P1 (number of flows). Changing it to "1"
on receiver yields no difference. I can try adding this configuration
permutation to my future tests if you're interested.

So that you have an idea - using "1" on sender degrades UDP throughput
(even 690->500mbps in some cases).


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-03  9:00                 ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-03  9:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Greear, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On 3 February 2015 at 00:06, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, 2015-02-02 at 13:25 -0800, Ben Greear wrote:
>
>> It is a big throughput win to have fewer TCP ack packets on
>> wireless since it is a half-duplex environment.  Is there anything
>> we could improve so that we can have fewer acks and still get
>> good tcp stack behaviour?
>
> First apply TCP stretch ack fixes to the sender. There is no way to get
> good performance if the sender does not handle stretch ack.
>
> d6b1a8a92a14 tcp: fix timing issue in CUBIC slope calculation
> 9cd981dcf174 tcp: fix stretch ACK bugs in CUBIC
> c22bdca94782 tcp: fix stretch ACK bugs in Reno
> 814d488c6126 tcp: fix the timid additive increase on stretch ACKs
> e73ebb0881ea tcp: stretch ACK fixes prep
>
> Then, make sure you do not throttle ACK too long, especially if you hope
> to get Gbit line rate on a 4 ms RTT flow.
>
> GRO does not mean : send one ACK every ms, or after 3ms delay...

I think it's worth pointing out that If you assume 3-frame A-MSDU and
64-frame A-MPDU you get 192 frames (as far as TCP/IP is concerned) per
aggregation window. Assuming effective 600mbps throughput:

 python> 1.0/((((600/8)*1024*1024)/1500)/(3*64))
 0.003663003663003663

This is probably worst case, but still probably worth to keep in mind.

ath10k has a knob to tune A-MSDU aggregation count. The default is "3"
and it's what I've been testing so far.

When I change it to "1" on sender I get 250->400mbps boost in TCP -P5
but see no difference with -P1 (number of flows). Changing it to "1"
on receiver yields no difference. I can try adding this configuration
permutation to my future tests if you're interested.

So that you have an idea - using "1" on sender degrades UDP throughput
(even 690->500mbps in some cases).


Michał
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-03  1:18             ` Eric Dumazet
  (?)
@ 2015-02-03 11:50             ` Michal Kazior
  2015-02-03 14:27                 ` Eric Dumazet
  -1 siblings, 1 reply; 107+ messages in thread
From: Michal Kazior @ 2015-02-03 11:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-wireless, Network Development, eyalpe

On 3 February 2015 at 02:18, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:
>
>> It seems to break ACK clocking badly (linux stack has a somewhat buggy
>> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
>> no timer is setup to split the TSO packet.)
>
> Following patch might help the TSO split defer logic.
>
> It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
> Small Queue logic prevented the xmit at the expiration of first 'timer'.
>
> This patch clears the tso_deferred variable only if we could really
> send something.
>
> Please try it, thanks !
[..patch..]

I've done a second round of tests. I've added the A-MSDU count
parameter I've mentioned in my other email into the mix.

 net - net/master (includes stretch ack patches)
 net-tso - net/master + your TSO defer patch
 net-gro - net/master + my ath10k GRO patch
 net-gro-tso - net/master + duh

Here's the best of amsdu count 1 and 3:

 ; for (i in */output.txt) { echo $i; for (j in (1 3)) { cat $i | awk
'x && /Mbits/ {y=$0}; x && y && !/Mbits/ {print y; x=0; y=""}; /set
amsdu cnt to '$j'/{x=1}' | awk '{ if (x < $(NF-1)) {x=$(NF-1)} }
END{print "A-MSDU limit='$j', " x " Mbits/sec"}' } }
 net-gro-tso/output.txt
 A-MSDU limit=1, 436 Mbits/sec
 A-MSDU limit=3, 284 Mbits/sec
 net-gro/output.txt
 A-MSDU limit=1, 444 Mbits/sec
 A-MSDU limit=3, 283 Mbits/sec
 net-tso/output.txt
 A-MSDU limit=1, 376 Mbits/sec
 A-MSDU limit=3, 251 Mbits/sec
 net/output.txt
 A-MSDU limit=1, 387 Mbits/sec
 A-MSDU limit=3, 260 Mbits/sec

IOW:
 - stretch acks / TSO defer don't seem to help much (when compared to
throughput results from yesterday)
 - GRO helps
 - disabling A-MSDU on sender helps
 - net/master+GRO still doesn't reach the performance from before the
regression (~600mbps w/ GRO)

You can grab logs and dumps here: http://www.filedropper.com/test2tar


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-03 14:27                 ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-03 14:27 UTC (permalink / raw)
  To: Michal Kazior; +Cc: linux-wireless, Network Development, eyalpe

On Tue, 2015-02-03 at 12:50 +0100, Michal Kazior wrote:
> On 3 February 2015 at 02:18, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:
> >
> >> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> >> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> >> no timer is setup to split the TSO packet.)
> >
> > Following patch might help the TSO split defer logic.
> >
> > It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
> > Small Queue logic prevented the xmit at the expiration of first 'timer'.
> >
> > This patch clears the tso_deferred variable only if we could really
> > send something.
> >
> > Please try it, thanks !
> [..patch..]
> 
> I've done a second round of tests. I've added the A-MSDU count
> parameter I've mentioned in my other email into the mix.
> 
>  net - net/master (includes stretch ack patches)
>  net-tso - net/master + your TSO defer patch
>  net-gro - net/master + my ath10k GRO patch
>  net-gro-tso - net/master + duh
> 
> Here's the best of amsdu count 1 and 3:
> 
>  ; for (i in */output.txt) { echo $i; for (j in (1 3)) { cat $i | awk
> 'x && /Mbits/ {y=$0}; x && y && !/Mbits/ {print y; x=0; y=""}; /set
> amsdu cnt to '$j'/{x=1}' | awk '{ if (x < $(NF-1)) {x=$(NF-1)} }
> END{print "A-MSDU limit='$j', " x " Mbits/sec"}' } }
>  net-gro-tso/output.txt
>  A-MSDU limit=1, 436 Mbits/sec
>  A-MSDU limit=3, 284 Mbits/sec
>  net-gro/output.txt
>  A-MSDU limit=1, 444 Mbits/sec
>  A-MSDU limit=3, 283 Mbits/sec
>  net-tso/output.txt
>  A-MSDU limit=1, 376 Mbits/sec
>  A-MSDU limit=3, 251 Mbits/sec
>  net/output.txt
>  A-MSDU limit=1, 387 Mbits/sec
>  A-MSDU limit=3, 260 Mbits/sec
> 
> IOW:
>  - stretch acks / TSO defer don't seem to help much (when compared to
> throughput results from yesterday)
>  - GRO helps
>  - disabling A-MSDU on sender helps
>  - net/master+GRO still doesn't reach the performance from before the
> regression (~600mbps w/ GRO)
> 
> You can grab logs and dumps here: http://www.filedropper.com/test2tar
> 

Thanks for these traces.

There is absolutely a problem at the sender, as we can see a big 2ms
delay between reception of ACK and send of following packets.
TCP stack should generate them immediately.
Are you using some kind of netem qdisc ?

These 2ms delays, in a flow with a 5ms RTT are terrible.

06:54:57.408391 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294899240, win 11268, options [nop,nop,TS val 1053302 ecr 1052250], length 0
06:54:57.408418 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294910824, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408431 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294936888, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408453 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294962952, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408474 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 0, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
<this 2ms delay is not generated by TCP stack.>
06:54:57.410243 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 82536:83984, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410271 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 83984:85432, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410289 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 85432:86880, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410310 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 86880:88328, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410326 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 88328:89776, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410339 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 89776:91224, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410353 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 91224:92672, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410370 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 92672:94120, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448

...
06:54:57.411178 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 28960, win 11268, options [nop,nop,TS val 1053306 ecr 1052253], length 0
06:54:57.411190 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 52128, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
06:54:57.411220 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 78192, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
06:54:57.411243 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 82536, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
< same 2ms unexplained gap here >
06:54:57.412912 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 165072:166520, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
06:54:57.412935 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 166520:167968, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
06:54:57.412948 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 167968:169416, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
...
06:54:57.413650 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 244712:246160, ack 1, win 457, options [nop,nop,TS val 1052260 ecr 1053306], length 1448
06:54:57.413662 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 246160:247608, ack 1, win 457, options [nop,nop,TS val 1052260 ecr 1053306], length 1448
06:54:57.413712 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 112944, win 11268, options [nop,nop,TS val 1053308 ecr 1052256], length 0
06:54:57.413730 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 130320, win 11268, options [nop,nop,TS val 1053308 ecr 1052257], length 0
06:54:57.413754 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 160728, win 11268, options [nop,nop,TS val 1053309 ecr 1052257], length 0
06:54:57.413779 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 165072, win 11268, options [nop,nop,TS val 1053309 ecr 1052257], length 0
< same 2ms delay >
06:54:57.415682 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 247608:249056, ack 1, win 457, options [nop,nop,TS val 1052262 ecr 1053309], length 1448
06:54:57.415709 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 249056:250504, ack 1, win 457, options [nop,nop,TS val 1052262 ecr 1053309], length 1448

Are packets TX completed after a timer or something ?

Some very heavy stuff might run from tasklet (or other softirq triggered) event.

BTW, traces tend to show that you 'receive' multiple ACK in the same burst,
its not clear if they are delayed at one side or the other.

GRO should delay only GRO candidates. ACK packets are not GRO candidates.

Have you tried to disable GSO on sender ?

(Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-03 14:27                 ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-03 14:27 UTC (permalink / raw)
  To: Michal Kazior
  Cc: linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Tue, 2015-02-03 at 12:50 +0100, Michal Kazior wrote:
> On 3 February 2015 at 02:18, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Mon, 2015-02-02 at 10:52 -0800, Eric Dumazet wrote:
> >
> >> It seems to break ACK clocking badly (linux stack has a somewhat buggy
> >> tcp_tso_should_defer(), which relies on ACK being received smoothly, as
> >> no timer is setup to split the TSO packet.)
> >
> > Following patch might help the TSO split defer logic.
> >
> > It would avoid setting the TSO defer 'pseudo timer' twice, if/when TCP
> > Small Queue logic prevented the xmit at the expiration of first 'timer'.
> >
> > This patch clears the tso_deferred variable only if we could really
> > send something.
> >
> > Please try it, thanks !
> [..patch..]
> 
> I've done a second round of tests. I've added the A-MSDU count
> parameter I've mentioned in my other email into the mix.
> 
>  net - net/master (includes stretch ack patches)
>  net-tso - net/master + your TSO defer patch
>  net-gro - net/master + my ath10k GRO patch
>  net-gro-tso - net/master + duh
> 
> Here's the best of amsdu count 1 and 3:
> 
>  ; for (i in */output.txt) { echo $i; for (j in (1 3)) { cat $i | awk
> 'x && /Mbits/ {y=$0}; x && y && !/Mbits/ {print y; x=0; y=""}; /set
> amsdu cnt to '$j'/{x=1}' | awk '{ if (x < $(NF-1)) {x=$(NF-1)} }
> END{print "A-MSDU limit='$j', " x " Mbits/sec"}' } }
>  net-gro-tso/output.txt
>  A-MSDU limit=1, 436 Mbits/sec
>  A-MSDU limit=3, 284 Mbits/sec
>  net-gro/output.txt
>  A-MSDU limit=1, 444 Mbits/sec
>  A-MSDU limit=3, 283 Mbits/sec
>  net-tso/output.txt
>  A-MSDU limit=1, 376 Mbits/sec
>  A-MSDU limit=3, 251 Mbits/sec
>  net/output.txt
>  A-MSDU limit=1, 387 Mbits/sec
>  A-MSDU limit=3, 260 Mbits/sec
> 
> IOW:
>  - stretch acks / TSO defer don't seem to help much (when compared to
> throughput results from yesterday)
>  - GRO helps
>  - disabling A-MSDU on sender helps
>  - net/master+GRO still doesn't reach the performance from before the
> regression (~600mbps w/ GRO)
> 
> You can grab logs and dumps here: http://www.filedropper.com/test2tar
> 

Thanks for these traces.

There is absolutely a problem at the sender, as we can see a big 2ms
delay between reception of ACK and send of following packets.
TCP stack should generate them immediately.
Are you using some kind of netem qdisc ?

These 2ms delays, in a flow with a 5ms RTT are terrible.

06:54:57.408391 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294899240, win 11268, options [nop,nop,TS val 1053302 ecr 1052250], length 0
06:54:57.408418 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294910824, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408431 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294936888, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408453 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294962952, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
06:54:57.408474 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 0, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
<this 2ms delay is not generated by TCP stack.>
06:54:57.410243 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 82536:83984, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410271 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 83984:85432, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410289 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 85432:86880, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410310 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 86880:88328, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410326 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 88328:89776, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410339 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 89776:91224, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410353 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 91224:92672, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
06:54:57.410370 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 92672:94120, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448

...
06:54:57.411178 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 28960, win 11268, options [nop,nop,TS val 1053306 ecr 1052253], length 0
06:54:57.411190 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 52128, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
06:54:57.411220 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 78192, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
06:54:57.411243 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 82536, win 11268, options [nop,nop,TS val 1053306 ecr 1052254], length 0
< same 2ms unexplained gap here >
06:54:57.412912 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 165072:166520, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
06:54:57.412935 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 166520:167968, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
06:54:57.412948 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 167968:169416, ack 1, win 457, options [nop,nop,TS val 1052259 ecr 1053306], length 1448
...
06:54:57.413650 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 244712:246160, ack 1, win 457, options [nop,nop,TS val 1052260 ecr 1053306], length 1448
06:54:57.413662 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 246160:247608, ack 1, win 457, options [nop,nop,TS val 1052260 ecr 1053306], length 1448
06:54:57.413712 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 112944, win 11268, options [nop,nop,TS val 1053308 ecr 1052256], length 0
06:54:57.413730 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 130320, win 11268, options [nop,nop,TS val 1053308 ecr 1052257], length 0
06:54:57.413754 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 160728, win 11268, options [nop,nop,TS val 1053309 ecr 1052257], length 0
06:54:57.413779 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 165072, win 11268, options [nop,nop,TS val 1053309 ecr 1052257], length 0
< same 2ms delay >
06:54:57.415682 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 247608:249056, ack 1, win 457, options [nop,nop,TS val 1052262 ecr 1053309], length 1448
06:54:57.415709 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 249056:250504, ack 1, win 457, options [nop,nop,TS val 1052262 ecr 1053309], length 1448

Are packets TX completed after a timer or something ?

Some very heavy stuff might run from tasklet (or other softirq triggered) event.

BTW, traces tend to show that you 'receive' multiple ACK in the same burst,
its not clear if they are delayed at one side or the other.

GRO should delay only GRO candidates. ACK packets are not GRO candidates.

Have you tried to disable GSO on sender ?

(Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-03 15:03                   ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-03 15:03 UTC (permalink / raw)
  To: Michal Kazior; +Cc: linux-wireless, Network Development, eyalpe

On Tue, 2015-02-03 at 06:27 -0800, Eric Dumazet wrote:

> Are packets TX completed after a timer or something ?
> 
> Some very heavy stuff might run from tasklet (or other softirq triggered) event.
> 

Right, commit 6c5151a9ffa9f796f2d707617cecb6b6b241dff8
("ath10k: batch htt tx/rx completions")
is very suspicious.

Please revert it.

BTW, ath10k_htt_txrx_compl_task() runs from softirq context, so the 
_bh() prefixes are not really needed.

It seems lot of batching happens in wifi drivers, not necessarily at the
right places.




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-03 15:03                   ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-03 15:03 UTC (permalink / raw)
  To: Michal Kazior
  Cc: linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Tue, 2015-02-03 at 06:27 -0800, Eric Dumazet wrote:

> Are packets TX completed after a timer or something ?
> 
> Some very heavy stuff might run from tasklet (or other softirq triggered) event.
> 

Right, commit 6c5151a9ffa9f796f2d707617cecb6b6b241dff8
("ath10k: batch htt tx/rx completions")
is very suspicious.

Please revert it.

BTW, ath10k_htt_txrx_compl_task() runs from softirq context, so the 
_bh() prefixes are not really needed.

It seems lot of batching happens in wifi drivers, not necessarily at the
right places.



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-03 14:27                 ` Eric Dumazet
  (?)
  (?)
@ 2015-02-04 11:35                 ` Michal Kazior
  2015-02-04 11:57                     ` Eric Dumazet
  -1 siblings, 1 reply; 107+ messages in thread
From: Michal Kazior @ 2015-02-04 11:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-wireless, Network Development, eyalpe

On 3 February 2015 at 15:27, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-02-03 at 12:50 +0100, Michal Kazior wrote:
[...]
>> IOW:
>>  - stretch acks / TSO defer don't seem to help much (when compared to
>> throughput results from yesterday)
>>  - GRO helps
>>  - disabling A-MSDU on sender helps
>>  - net/master+GRO still doesn't reach the performance from before the
>> regression (~600mbps w/ GRO)
>>
>> You can grab logs and dumps here: http://www.filedropper.com/test2tar
>>
>
> Thanks for these traces.
>
> There is absolutely a problem at the sender, as we can see a big 2ms
> delay between reception of ACK and send of following packets.
> TCP stack should generate them immediately.
> Are you using some kind of netem qdisc ?

Both systems have identical setup:

; tc qdisc
qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1
2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 root refcnt 2 bands 3 priomap  1 2 2 2 1
2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev wlan1 root
qdisc pfifo_fast 0: dev wlan1 parent :1 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :2 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :3 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev wlan1 parent :4 bands 3 priomap  1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1


> These 2ms delays, in a flow with a 5ms RTT are terrible.
>
> 06:54:57.408391 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294899240, win 11268, options [nop,nop,TS val 1053302 ecr 1052250], length 0
> 06:54:57.408418 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294910824, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> 06:54:57.408431 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294936888, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> 06:54:57.408453 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 4294962952, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> 06:54:57.408474 IP 192.168.1.2.5001 > 192.168.1.3.51645: Flags [.], ack 0, win 11268, options [nop,nop,TS val 1053303 ecr 1052251], length 0
> <this 2ms delay is not generated by TCP stack.>
> 06:54:57.410243 IP 192.168.1.3.51645 > 192.168.1.2.5001: Flags [.], seq 82536:83984, ack 1, win 457, options [nop,nop,TS val 1052256 ecr 1053303], length 1448
[...]
>
> Are packets TX completed after a timer or something ?

As far as ath10k is concerned - no timers here. Not sure about
firmware itself though.


> Some very heavy stuff might run from tasklet (or other softirq triggered) event.
>
> BTW, traces tend to show that you 'receive' multiple ACK in the same burst,
> its not clear if they are delayed at one side or the other.
>
> GRO should delay only GRO candidates. ACK packets are not GRO candidates.
>
> Have you tried to disable GSO on sender ?

I assume I do that via ethtool? This is my current setup on both systems:

; ethtool -k wlan1
Features for wlan1:
rx-checksumming: off [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on [fixed]
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: off
        tx-scatter-gather: off [fixed]
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [fixed]
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: on [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]

; ethtool -K wlan1 generic-segmentation-offload off
ethtool: bad command line argument(s)
For more information run ethtool -h


> (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)

This could work if your firmware/device supports this kind of thing.
To my understanding ath10k firmware doesn't.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-04 11:57                     ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-04 11:57 UTC (permalink / raw)
  To: Michal Kazior; +Cc: linux-wireless, Network Development, eyalpe

On Wed, 2015-02-04 at 12:35 +0100, Michal Kazior wrote:

> > (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)
> 
> This could work if your firmware/device supports this kind of thing.
> To my understanding ath10k firmware doesn't.

This is a pure software signal. You do not need firmware support.

Idea is the following :

Your driver gets a train of messages, coming from upper layers (TCP, IP,
qdisc)

It can know that a packet is not the last one, by looking at
skb->xmit_more.

Basically, aggregation logic could use this signal as a very clear
indicator you got the end of a train -> force the xmit right now.

To disable gso you would have to use :

ethtool -K wlan1 gso off




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-04 11:57                     ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-04 11:57 UTC (permalink / raw)
  To: Michal Kazior
  Cc: linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Wed, 2015-02-04 at 12:35 +0100, Michal Kazior wrote:

> > (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)
> 
> This could work if your firmware/device supports this kind of thing.
> To my understanding ath10k firmware doesn't.

This is a pure software signal. You do not need firmware support.

Idea is the following :

Your driver gets a train of messages, coming from upper layers (TCP, IP,
qdisc)

It can know that a packet is not the last one, by looking at
skb->xmit_more.

Basically, aggregation logic could use this signal as a very clear
indicator you got the end of a train -> force the xmit right now.

To disable gso you would have to use :

ethtool -K wlan1 gso off



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-04 11:57                     ` Eric Dumazet
  (?)
@ 2015-02-04 12:22                     ` Michal Kazior
  2015-02-04 12:38                       ` Eric Dumazet
  -1 siblings, 1 reply; 107+ messages in thread
From: Michal Kazior @ 2015-02-04 12:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-wireless, Network Development, eyalpe

On 4 February 2015 at 12:57, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-02-04 at 12:35 +0100, Michal Kazior wrote:
>
>> > (Or maybe wifi drivers should start to use skb->xmit_more as a signal to end aggregation)
>>
>> This could work if your firmware/device supports this kind of thing.
>> To my understanding ath10k firmware doesn't.
>
> This is a pure software signal. You do not need firmware support.
>
> Idea is the following :
>
> Your driver gets a train of messages, coming from upper layers (TCP, IP,
> qdisc)
>
> It can know that a packet is not the last one, by looking at
> skb->xmit_more.
>
> Basically, aggregation logic could use this signal as a very clear
> indicator you got the end of a train -> force the xmit right now.

There's no way to tell ath10k firmware: "xmit right now". The firmware
does all tx aggregation logic by itself. Host driver just submits a
frame and hopes it'll get out soon. It's not even a tx-ring you'd
expect. Each frame has a host assigned id which firmware then uses in
tx completion.


> To disable gso you would have to use :
>
> ethtool -K wlan1 gso off

Oh, thanks! This works. However I can't turn it on:

; ethtool -K wlan1 gso on
Could not change any device features

..so I guess it makes no sense to re-run tests because:

; ethtool -k wlan1 | grep generic
        tx-checksum-ip-generic: on [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on

And this seems to never change.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-04 12:22                     ` Michal Kazior
@ 2015-02-04 12:38                       ` Eric Dumazet
  2015-02-04 12:53                         ` Michal Kazior
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2015-02-04 12:38 UTC (permalink / raw)
  To: Michal Kazior; +Cc: linux-wireless, Network Development, eyalpe

On Wed, 2015-02-04 at 13:22 +0100, Michal Kazior wrote:
> On 4 February 2015 at 12:57, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> 
> > To disable gso you would have to use :
> >
> > ethtool -K wlan1 gso off
> 
> Oh, thanks! This works. However I can't turn it on:
> 
> ; ethtool -K wlan1 gso on
> Could not change any device features
> 
> ..so I guess it makes no sense to re-run tests because:
> 
> ; ethtool -k wlan1 | grep generic
>         tx-checksum-ip-generic: on [fixed]
> generic-segmentation-offload: off [requested on]
> generic-receive-offload: on
> 
> And this seems to never change.

GSO requires SG (Scatter Gather)

Are you sure this hardware has no SG support ?



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-04 12:38                       ` Eric Dumazet
@ 2015-02-04 12:53                         ` Michal Kazior
  2015-02-04 12:55                           ` Johannes Berg
  2015-02-04 13:16                           ` Eric Dumazet
  0 siblings, 2 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-04 12:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-wireless, Network Development, eyalpe

On 4 February 2015 at 13:38, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-02-04 at 13:22 +0100, Michal Kazior wrote:
>> On 4 February 2015 at 12:57, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>>
>> > To disable gso you would have to use :
>> >
>> > ethtool -K wlan1 gso off
>>
>> Oh, thanks! This works. However I can't turn it on:
>>
>> ; ethtool -K wlan1 gso on
>> Could not change any device features
>>
>> ..so I guess it makes no sense to re-run tests because:
>>
>> ; ethtool -k wlan1 | grep generic
>>         tx-checksum-ip-generic: on [fixed]
>> generic-segmentation-offload: off [requested on]
>> generic-receive-offload: on
>>
>> And this seems to never change.
>
> GSO requires SG (Scatter Gather)
>
> Are you sure this hardware has no SG support ?

The hardware itself seems to be capable. The firmware is a problem
though. I'm also not sure if mac80211 can handle this as is. No 802.11
driver seems to support SG except wil6210 which uses cfg80211 and
netdevs directly.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-04 12:53                         ` Michal Kazior
@ 2015-02-04 12:55                           ` Johannes Berg
  2015-02-04 13:16                           ` Eric Dumazet
  1 sibling, 0 replies; 107+ messages in thread
From: Johannes Berg @ 2015-02-04 12:55 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Eric Dumazet, linux-wireless, Network Development, eyalpe

On Wed, 2015-02-04 at 13:53 +0100, Michal Kazior wrote:

> The hardware itself seems to be capable. The firmware is a problem
> though. I'm also not sure if mac80211 can handle this as is. No 802.11
> driver seems to support SG except wil6210 which uses cfg80211 and
> netdevs directly.

mac80211 cannot deal with this right now. This would make a good topic
for the workshop since there's interest elsewhere in this as well. It's
probably not terribly hard to do as far as mac80211 is concerned.

How much offload do you really have though? Sometimes people just want
to build A-MSDUs.

johannes


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-04 12:53                         ` Michal Kazior
  2015-02-04 12:55                           ` Johannes Berg
@ 2015-02-04 13:16                           ` Eric Dumazet
  2015-02-04 13:29                             ` Eric Dumazet
  1 sibling, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2015-02-04 13:16 UTC (permalink / raw)
  To: Michal Kazior, Neal Cardwell; +Cc: linux-wireless, Network Development, eyalpe

OK guys

Using a mlx4 testbed I can reproduce the problem by pushing coalescing
settings and disabling SG (thus disabling GSO)

ethtool -K eth0 sg off
Actual changes:
scatter-gather: off
	tx-scatter-gather: off
generic-segmentation-offload: off [requested on]

ethtool -C eth0 tx-usecs 1024 tx-frames 64

Meaning that NIC waits one ms before sending the TX IRQ,
and can accumulate 64 frames before forcing the interrupt.

We probably have a bug in cwnd expansion logic :

lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=230 rttvar=30 snd_ssthresh=41 cwnd=59 reordering=3 total_retrans=1 ca_state=0 pacing_rate=5943.1 Mbits
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       530.39   0.40     0.32     2.965   2.398  


-> final cwnd=59 which is not enough to avoid the 1ms delay between each
burst. 

So sender sends ~60 packets, then has to wait 1ms (to get NIC TX IRQ)
before sending the following burst.

I am CCing Neal, he probably can help to root cause the problem.

Thanks



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-04 13:16                           ` Eric Dumazet
@ 2015-02-04 13:29                             ` Eric Dumazet
  2015-02-04 21:11                                 ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2015-02-04 13:29 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Wed, 2015-02-04 at 05:16 -0800, Eric Dumazet wrote:
> OK guys
> 
> Using a mlx4 testbed I can reproduce the problem by pushing coalescing
> settings and disabling SG (thus disabling GSO)
> 
> ethtool -K eth0 sg off
> Actual changes:
> scatter-gather: off
> 	tx-scatter-gather: off
> generic-segmentation-offload: off [requested on]
> 
> ethtool -C eth0 tx-usecs 1024 tx-frames 64
> 
> Meaning that NIC waits one ms before sending the TX IRQ,
> and can accumulate 64 frames before forcing the interrupt.
> 
> We probably have a bug in cwnd expansion logic :
> 
> lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
> rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=230 rttvar=30 snd_ssthresh=41 cwnd=59 reordering=3 total_retrans=1 ca_state=0 pacing_rate=5943.1 Mbits
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  16384    10.00       530.39   0.40     0.32     2.965   2.398  
> 
> 
> -> final cwnd=59 which is not enough to avoid the 1ms delay between each
> burst. 
> 
> So sender sends ~60 packets, then has to wait 1ms (to get NIC TX IRQ)
> before sending the following burst.
> 
> I am CCing Neal, he probably can help to root cause the problem.

Arg, this was with net-next, ie not including our recent stretch ack
fixes.

Using David Miller 'net' tree, cwnd seems OK.

Speed is low because of 64 queued frames are exceeding
tcp_limit_output_bytes

lpaa23:~# cat /proc/sys/net/ipv4/tcp_limit_output_bytes
131072
lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=166 rttvar=16 snd_ssthresh=26 cwnd=59 reordering=3 total_retrans=0 ca_state=0 pacing_rate=8203.52 Mbits
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       569.96   0.52     0.38     3.588   2.625  


lpaa23:~# echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
lpaa23:~# DUMP_TCP_INFO=1 ./netperf -H 10.246.7.152 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.152 () port 0 AF_INET
rto=201000 ato=0 pmtu=1500 rcv_ssthresh=29200 rtt=98 rttvar=18 snd_ssthresh=312 cwnd=313 reordering=3 total_retrans=23 ca_state=0
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      8518.40   2.60     1.57     1.200   0.727  





^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-04 21:11                                 ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-04 21:11 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

I do not see how a TSO patch could hurt a flow not using TSO/GSO.

This makes no sense.

ath10k tx completions being batched/deferred to a tasklet might increase
probability to hit this condition in tcp_wfree() :

        /* If this softirq is serviced by ksoftirqd, we are likely under stress.
         * Wait until our queues (qdisc + devices) are drained.
         * This gives :
         * - less callbacks to tcp_write_xmit(), reducing stress (batches)
         * - chance for incoming ACK (processed by another cpu maybe)
         *   to migrate this flow (skb->ooo_okay will be eventually set)
         */
        if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
                goto out;

Meaning tcp stack waits all skbs left qdisc/NIC queues before queuing
additional packets.

I would try to call skb_orphan() in ath10k if you really want to keep
these batches.

I have hard time to understand why tx completed packets go through
ath10k_htc_rx_completion_handler().. anyway...

Most conservative patch would be :

diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
 		break;
 	}
 	case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
+		skb_orphan(skb);
 		spin_lock_bh(&htt->tx_lock);
 		__skb_queue_tail(&htt->tx_compl_q, skb);
 		spin_unlock_bh(&htt->tx_lock);



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-04 21:11                                 ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-04 21:11 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

I do not see how a TSO patch could hurt a flow not using TSO/GSO.

This makes no sense.

ath10k tx completions being batched/deferred to a tasklet might increase
probability to hit this condition in tcp_wfree() :

        /* If this softirq is serviced by ksoftirqd, we are likely under stress.
         * Wait until our queues (qdisc + devices) are drained.
         * This gives :
         * - less callbacks to tcp_write_xmit(), reducing stress (batches)
         * - chance for incoming ACK (processed by another cpu maybe)
         *   to migrate this flow (skb->ooo_okay will be eventually set)
         */
        if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
                goto out;

Meaning tcp stack waits all skbs left qdisc/NIC queues before queuing
additional packets.

I would try to call skb_orphan() in ath10k if you really want to keep
these batches.

I have hard time to understand why tx completed packets go through
ath10k_htc_rx_completion_handler().. anyway...

Most conservative patch would be :

diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
--- a/drivers/net/wireless/ath/ath10k/htt_rx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
@@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
 		break;
 	}
 	case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
+		skb_orphan(skb);
 		spin_lock_bh(&htt->tx_lock);
 		__skb_queue_tail(&htt->tx_compl_q, skb);
 		spin_unlock_bh(&htt->tx_lock);


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05  6:46                                   ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-05  6:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I do not see how a TSO patch could hurt a flow not using TSO/GSO.
>
> This makes no sense.
>
> ath10k tx completions being batched/deferred to a tasklet might increase
> probability to hit this condition in tcp_wfree() :
>
>         /* If this softirq is serviced by ksoftirqd, we are likely under stress.
>          * Wait until our queues (qdisc + devices) are drained.
>          * This gives :
>          * - less callbacks to tcp_write_xmit(), reducing stress (batches)
>          * - chance for incoming ACK (processed by another cpu maybe)
>          *   to migrate this flow (skb->ooo_okay will be eventually set)
>          */
>         if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
>                 goto out;
>
> Meaning tcp stack waits all skbs left qdisc/NIC queues before queuing
> additional packets.
>
> I would try to call skb_orphan() in ath10k if you really want to keep
> these batches.
>
> I have hard time to understand why tx completed packets go through
> ath10k_htc_rx_completion_handler().. anyway...

There's a couple of layers for host-firmware communication. The
transport layer (e.g. PCI) delivers HTC packets. These contain WMI
(configuration stuff) or HTT (traffic stuff). HTT can contain
different events (tx complete, rx complete, etc). HTT Tx completion
contains a list of ids which refer to frames that have been completed
(either sent or dropped).

I've tried reverting tx/rx tasklet batching. No change in throughput.
I can get tcpdump if you're interested.


> Most conservative patch would be :
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
> index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
> @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
>                 break;
>         }
>         case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
> +               skb_orphan(skb);
>                 spin_lock_bh(&htt->tx_lock);
>                 __skb_queue_tail(&htt->tx_compl_q, skb);
>                 spin_unlock_bh(&htt->tx_lock);

I suppose you want to call skb_orphan() on actual data packets, right?
This skb is just a host-firmware communication buffer.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05  6:46                                   ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-05  6:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> I do not see how a TSO patch could hurt a flow not using TSO/GSO.
>
> This makes no sense.
>
> ath10k tx completions being batched/deferred to a tasklet might increase
> probability to hit this condition in tcp_wfree() :
>
>         /* If this softirq is serviced by ksoftirqd, we are likely under stress.
>          * Wait until our queues (qdisc + devices) are drained.
>          * This gives :
>          * - less callbacks to tcp_write_xmit(), reducing stress (batches)
>          * - chance for incoming ACK (processed by another cpu maybe)
>          *   to migrate this flow (skb->ooo_okay will be eventually set)
>          */
>         if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
>                 goto out;
>
> Meaning tcp stack waits all skbs left qdisc/NIC queues before queuing
> additional packets.
>
> I would try to call skb_orphan() in ath10k if you really want to keep
> these batches.
>
> I have hard time to understand why tx completed packets go through
> ath10k_htc_rx_completion_handler().. anyway...

There's a couple of layers for host-firmware communication. The
transport layer (e.g. PCI) delivers HTC packets. These contain WMI
(configuration stuff) or HTT (traffic stuff). HTT can contain
different events (tx complete, rx complete, etc). HTT Tx completion
contains a list of ids which refer to frames that have been completed
(either sent or dropped).

I've tried reverting tx/rx tasklet batching. No change in throughput.
I can get tcpdump if you're interested.


> Most conservative patch would be :
>
> diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
> index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
> --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
> @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
>                 break;
>         }
>         case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
> +               skb_orphan(skb);
>                 spin_lock_bh(&htt->tx_lock);
>                 __skb_queue_tail(&htt->tx_compl_q, skb);
>                 spin_unlock_bh(&htt->tx_lock);

I suppose you want to call skb_orphan() on actual data packets, right?
This skb is just a host-firmware communication buffer.


Michał
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-04 21:11                                 ` Eric Dumazet
  (?)
  (?)
@ 2015-02-05  8:38                                 ` Michal Kazior
  2015-02-05 12:57                                   ` Eric Dumazet
  -1 siblings, 1 reply; 107+ messages in thread
From: Michal Kazior @ 2015-02-05  8:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I do not see how a TSO patch could hurt a flow not using TSO/GSO.
>
> This makes no sense.

Hmm..

@@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
-               limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-                             sk->sk_pacing_rate >> 10);
+               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+               limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

                if (atomic_read(&sk->sk_wmem_alloc) > limit) {
                        set_bit(TSQ_THROTTLED, &tp->tsq_flags);

Doesn't this effectively invert how tcp_limit_output_bytes is used?
This would explain why raising the limit wasn't changing anything
anymore when you asked me do so. Only decreasing it yielded any
change.

I've added a printk to show up the new and old values. Excerpt from logs:

[  114.782740] (4608 39126 131072 = 39126) vs (131072 39126 = 131072)

(2*truesize, pacing_rate, tcp_limit = limit) vs (tcp_limit, pacing_rate = limit)

Reverting this patch hunk alone fixes my TCP problem. Not that I'm
saying the old logic was correct (it seems it wasn't, a limit should
be applied as min(value, max_value), right?).

Anyway the change doesn't seem to be TSO-only oriented so it would
explain the "makes no sense".


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-05  8:38                                 ` Michal Kazior
@ 2015-02-05 12:57                                   ` Eric Dumazet
  2015-02-05 13:19                                     ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 12:57 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Thu, 2015-02-05 at 09:38 +0100, Michal Kazior wrote:
> On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > I do not see how a TSO patch could hurt a flow not using TSO/GSO.
> >
> > This makes no sense.
> 
> Hmm..
> 
> @@ -2018,8 +2053,8 @@ static bool tcp_write_xmit(struct sock *sk,
> unsigned int mss_now, int nonagle,
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
> -               limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
> -                             sk->sk_pacing_rate >> 10);
> +               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> +               limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> 
>                 if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>                         set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> 
> Doesn't this effectively invert how tcp_limit_output_bytes is used?
> This would explain why raising the limit wasn't changing anything
> anymore when you asked me do so. Only decreasing it yielded any
> change.
> 
> I've added a printk to show up the new and old values. Excerpt from logs:
> 
> [  114.782740] (4608 39126 131072 = 39126) vs (131072 39126 = 131072)
> 
> (2*truesize, pacing_rate, tcp_limit = limit) vs (tcp_limit, pacing_rate = limit)
> 
> Reverting this patch hunk alone fixes my TCP problem. Not that I'm
> saying the old logic was correct (it seems it wasn't, a limit should
> be applied as min(value, max_value), right?).
> 
> Anyway the change doesn't seem to be TSO-only oriented so it would
> explain the "makes no sense".


The intention is to control the queues to the following :

1 ms of buffering, but limited to a configurable value.

On a 40Gbps flow, 1ms represents 5 MB, which is insane.

We do not want to queue 5 MB of traffic, this would destroy latencies
for all concurrent flows. (Or would require having fq_codel or fq as
packet schedulers, instead of default pfifo_fast)

This is why having 1.5 ms delay between the transmit and TX completion
is a problem in your case.

 




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05 13:03                                     ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 13:03 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Thu, 2015-02-05 at 07:46 +0100, Michal Kazior wrote:
> On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > Most conservative patch would be :
> >
> > diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
> > index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
> > --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
> > +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
> > @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
> >                 break;
> >         }
> >         case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
> > +               skb_orphan(skb);
> >                 spin_lock_bh(&htt->tx_lock);
> >                 __skb_queue_tail(&htt->tx_compl_q, skb);
> >                 spin_unlock_bh(&htt->tx_lock);
> 
> I suppose you want to call skb_orphan() on actual data packets, right?
> This skb is just a host-firmware communication buffer.

Right. I have no idea how you find the actual data packet at this stage.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05 13:03                                     ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 13:03 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Thu, 2015-02-05 at 07:46 +0100, Michal Kazior wrote:
> On 4 February 2015 at 22:11, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> > Most conservative patch would be :
> >
> > diff --git a/drivers/net/wireless/ath/ath10k/htt_rx.c b/drivers/net/wireless/ath/ath10k/htt_rx.c
> > index 9c782a42665e1aaf43bfbca441631ee58da50c09..6a36317d6bb0447202dee15528130bd5e21248c4 100644
> > --- a/drivers/net/wireless/ath/ath10k/htt_rx.c
> > +++ b/drivers/net/wireless/ath/ath10k/htt_rx.c
> > @@ -1642,6 +1642,7 @@ void ath10k_htt_t2h_msg_handler(struct ath10k *ar, struct sk_buff *skb)
> >                 break;
> >         }
> >         case HTT_T2H_MSG_TYPE_TX_COMPL_IND:
> > +               skb_orphan(skb);
> >                 spin_lock_bh(&htt->tx_lock);
> >                 __skb_queue_tail(&htt->tx_compl_q, skb);
> >                 spin_unlock_bh(&htt->tx_lock);
> 
> I suppose you want to call skb_orphan() on actual data packets, right?
> This skb is just a host-firmware communication buffer.

Right. I have no idea how you find the actual data packet at this stage.


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-05 12:57                                   ` Eric Dumazet
@ 2015-02-05 13:19                                     ` Eric Dumazet
  2015-02-05 13:33                                         ` Eric Dumazet
  2015-02-05 13:44                                       ` Michal Kazior
  0 siblings, 2 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 13:19 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:

> The intention is to control the queues to the following :
> 
> 1 ms of buffering, but limited to a configurable value.
> 
> On a 40Gbps flow, 1ms represents 5 MB, which is insane.
> 
> We do not want to queue 5 MB of traffic, this would destroy latencies
> for all concurrent flows. (Or would require having fq_codel or fq as
> packet schedulers, instead of default pfifo_fast)
> 
> This is why having 1.5 ms delay between the transmit and TX completion
> is a problem in your case.

Note that TCP stack could detect when this happens, *if* ACK where
delivered before the TX completions, or when TX completion happens,
we could detect that the clone of the freed packet was freed.

In my test, when I did "ethtool -C eth0 tx-usecs 1024 tx-frames 64", and
disabling GSO, TCP stack sends a bunch of packets (a bit less than 64),
blocks on tcp_limit_output_bytes.

Then we receive 2 stretch ACKS after ~50 usec.

TCP stack tries to push again some packets but blocks on
tcp_limit_output_bytes again.

1ms later, TX completion happens, tcp_wfree() is called, and TCP stack
push following ~60 packets.


TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
using a per flow dynamic value, but I would rather not add a kludge in
TCP stack only to deal with a possible bug in ath10k driver.

niu has a similar issue and simply had to call skb_orphan() :

drivers/net/ethernet/sun/niu.c:6669:            skb_orphan(skb);




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05 13:33                                         ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 13:33 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Thu, 2015-02-05 at 05:19 -0800, Eric Dumazet wrote:

> 
> TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
> using a per flow dynamic value, but I would rather not add a kludge in
> TCP stack only to deal with a possible bug in ath10k driver.
> 
> niu has a similar issue and simply had to call skb_orphan() :
> 
> drivers/net/ethernet/sun/niu.c:6669:            skb_orphan(skb);

In your case that might be the place :

diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index 4bc51d8a14a3..cbda7a87d5a1 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -468,6 +468,7 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
 	msdu_id = res;
 	htt->pending_tx[msdu_id] = msdu;
 	spin_unlock_bh(&htt->tx_lock);
+	skb_orphan(msdu);
 
 	prefetch_len = min(htt->prefetch_len, msdu->len);
 	prefetch_len = roundup(prefetch_len, 4);




^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05 13:33                                         ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 13:33 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Thu, 2015-02-05 at 05:19 -0800, Eric Dumazet wrote:

> 
> TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
> using a per flow dynamic value, but I would rather not add a kludge in
> TCP stack only to deal with a possible bug in ath10k driver.
> 
> niu has a similar issue and simply had to call skb_orphan() :
> 
> drivers/net/ethernet/sun/niu.c:6669:            skb_orphan(skb);

In your case that might be the place :

diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
index 4bc51d8a14a3..cbda7a87d5a1 100644
--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -468,6 +468,7 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
 	msdu_id = res;
 	htt->pending_tx[msdu_id] = msdu;
 	spin_unlock_bh(&htt->tx_lock);
+	skb_orphan(msdu);
 
 	prefetch_len = min(htt->prefetch_len, msdu->len);
 	prefetch_len = roundup(prefetch_len, 4);



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-05 13:19                                     ` Eric Dumazet
  2015-02-05 13:33                                         ` Eric Dumazet
@ 2015-02-05 13:44                                       ` Michal Kazior
  2015-02-05 14:41                                           ` Eric Dumazet
                                                           ` (2 more replies)
  1 sibling, 3 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-05 13:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On 5 February 2015 at 14:19, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:
>
>> The intention is to control the queues to the following :
>>
>> 1 ms of buffering, but limited to a configurable value.
>>
>> On a 40Gbps flow, 1ms represents 5 MB, which is insane.
>>
>> We do not want to queue 5 MB of traffic, this would destroy latencies
>> for all concurrent flows. (Or would require having fq_codel or fq as
>> packet schedulers, instead of default pfifo_fast)
>>
>> This is why having 1.5 ms delay between the transmit and TX completion
>> is a problem in your case.

I do get your point. But 1.5ms is really tough on Wi-Fi.

Just look at this:

; ping 192.168.1.2 -c 3
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

; ping 192.168.1.2 -c 3 -Q 224
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.939 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.906 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.946 ms

This was run with no load so batching code in the driver itself should
have no measurable effect. The channel was near-ideal: low noise
floor, cabled rf, no other traffic.

The lower latency ping is when 802.11 QoS Voice Access Category is
used. I also get 400mbps instead of 250mbps in this case with 5 flows
(net/master).

Dealing with black box firmware blobs is a pain.


> Note that TCP stack could detect when this happens, *if* ACK where
> delivered before the TX completions, or when TX completion happens,
> we could detect that the clone of the freed packet was freed.
>
> In my test, when I did "ethtool -C eth0 tx-usecs 1024 tx-frames 64", and
> disabling GSO, TCP stack sends a bunch of packets (a bit less than 64),
> blocks on tcp_limit_output_bytes.
>
> Then we receive 2 stretch ACKS after ~50 usec.
>
> TCP stack tries to push again some packets but blocks on
> tcp_limit_output_bytes again.
>
> 1ms later, TX completion happens, tcp_wfree() is called, and TCP stack
> push following ~60 packets.
>
>
> TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
> using a per flow dynamic value, but I would rather not add a kludge in
> TCP stack only to deal with a possible bug in ath10k driver.
>
> niu has a similar issue and simply had to call skb_orphan() :
>
> drivers/net/ethernet/sun/niu.c:6669:            skb_orphan(skb);

Ok. I tried calling skb_orphan() right after I submit each Tx frame
(similar to niu which does this in start_xmit):

--- a/drivers/net/wireless/ath/ath10k/htt_tx.c
+++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
@@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct
sk_buff *msdu)
        if (res)
                goto err_unmap_msdu;

+       skb_orphan(msdu);
+
        return 0;

 err_unmap_msdu:


Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
single flow (even better then before). Wow.

Does this look ok/safe as a solution to you?


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05 14:41                                           ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 14:41 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

> Ok. I tried calling skb_orphan() right after I submit each Tx frame
> (similar to niu which does this in start_xmit):
> 
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct
> sk_buff *msdu)
>         if (res)
>                 goto err_unmap_msdu;
> 
> +       skb_orphan(msdu);
> +
>         return 0;
> 
>  err_unmap_msdu:
> 
> 
> Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
> single flow (even better then before). Wow.
> 
> Does this look ok/safe as a solution to you?

Not at all. This basically removes backpressure.

A single UDP socket can now blast packets regardless of SO_SNDBUF
limits.

This basically remove years of work trying to fix bufferbloat.

I still do not understand why increasing tcp_limit_output_bytes is not
working for you.





^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05 14:41                                           ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 14:41 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

> Ok. I tried calling skb_orphan() right after I submit each Tx frame
> (similar to niu which does this in start_xmit):
> 
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct
> sk_buff *msdu)
>         if (res)
>                 goto err_unmap_msdu;
> 
> +       skb_orphan(msdu);
> +
>         return 0;
> 
>  err_unmap_msdu:
> 
> 
> Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
> single flow (even better then before). Wow.
> 
> Does this look ok/safe as a solution to you?

Not at all. This basically removes backpressure.

A single UDP socket can now blast packets regardless of SO_SNDBUF
limits.

This basically remove years of work trying to fix bufferbloat.

I still do not understand why increasing tcp_limit_output_bytes is not
working for you.




--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05 14:48                                           ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 14:48 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

> I do get your point. But 1.5ms is really tough on Wi-Fi.
> 
> Just look at this:
> 
> ; ping 192.168.1.2 -c 3
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

Thats a different point.

I dont care about rtt but TX completions. (usually much much lower than
rtt)

I can have a 4 usec delay from the moment a NIC submits a packet to the
wire and I get TX completion IRQ, free the packet.

Yet the pong reply can come 100 ms later.

It does not mean the 4 usec delay is a problem.




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-05 14:48                                           ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 14:48 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:

> I do get your point. But 1.5ms is really tough on Wi-Fi.
> 
> Just look at this:
> 
> ; ping 192.168.1.2 -c 3
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms

Thats a different point.

I dont care about rtt but TX completions. (usually much much lower than
rtt)

I can have a 4 usec delay from the moment a NIC submits a packet to the
wire and I get TX completion IRQ, free the packet.

Yet the pong reply can come 100 ms later.

It does not mean the 4 usec delay is a problem.



--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-05 14:41                                           ` Eric Dumazet
  (?)
@ 2015-02-05 17:10                                           ` Eric Dumazet
  2015-02-06  9:42                                             ` Michal Kazior
  -1 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2015-02-05 17:10 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:

> Not at all. This basically removes backpressure.
> 
> A single UDP socket can now blast packets regardless of SO_SNDBUF
> limits.
> 
> This basically remove years of work trying to fix bufferbloat.
> 
> I still do not understand why increasing tcp_limit_output_bytes is not
> working for you.

Oh well, tcp_limit_output_bytes might be ok.

In fact, the problem comes from GSO assumption. Maybe Herbert was right,
when he suggested TCP would be simpler if we enforced GSO...

When GSO is used, the thing works because 2*skb->truesize is roughly 2
ms worth of traffic.

Because you do not use GSO, and tx completions are slow, we need this :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b95e17..ac01b4cd0035 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 
 		/* TCP Small Queues :
-		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
+		 * Control number of packets in qdisc/devices to two packets /
+		 * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
 		 * This allows for :
 		 *  - better RTT estimation and ACK scheduling
 		 *  - faster recovery
@@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
 		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-05 13:44                                       ` Michal Kazior
  2015-02-05 14:41                                           ` Eric Dumazet
  2015-02-05 14:48                                           ` Eric Dumazet
@ 2015-02-05 19:50                                         ` Dave Taht
  2015-02-06  9:57                                             ` Michal Kazior
  2 siblings, 1 reply; 107+ messages in thread
From: Dave Taht @ 2015-02-05 19:50 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Eric Dumazet, Neal Cardwell, Eyal Perry, netdev, linux-wireless

On Fri, Feb 6, 2015 at 2:44 AM, Michal Kazior <michal.kazior@tieto.com> wrote:
>
> On 5 February 2015 at 14:19, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Thu, 2015-02-05 at 04:57 -0800, Eric Dumazet wrote:
> >
> >> The intention is to control the queues to the following :
> >>
> >> 1 ms of buffering, but limited to a configurable value.
> >>
> >> On a 40Gbps flow, 1ms represents 5 MB, which is insane.
> >>
> >> We do not want to queue 5 MB of traffic, this would destroy latencies
> >> for all concurrent flows. (Or would require having fq_codel or fq as
> >> packet schedulers, instead of default pfifo_fast)
> >>
> >> This is why having 1.5 ms delay between the transmit and TX completion
> >> is a problem in your case.
>
> I do get your point. But 1.5ms is really tough on Wi-Fi.
>
> Just look at this:
>
> ; ping 192.168.1.2 -c 3
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms
>
> ; ping 192.168.1.2 -c 3 -Q 224
> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=0.939 ms
> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.906 ms
> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.946 ms
>
> This was run with no load so batching code in the driver itself should
> have no measurable effect. The channel was near-ideal: low noise
> floor, cabled rf, no other traffic.
>
> The lower latency ping is when 802.11 QoS Voice Access Category is
> used. I also get 400mbps instead of 250mbps in this case with 5 flows
> (net/master).
>

The VO queue is now nearly useless in a real world environment. Whlle
it does grab the media mildly faster in some cases, on a good day with
no other competing APs, it cannot aggregate packets, and wastes TXOPS.
It is far saner to aim for better aggregate (use the VI queue if you
must try to get better media acquisition).

It is disabled in multiple products I know of.

And I really, really, really wish, that just once during this thread,
someone had bothered to try running a test
at a real world MCS rate - say MCS1, or MCS4, and measured the latency
under load of that...

or tried talking to two or more stations at the same time.

Instead of trying for 1.5Gbits in a faraday cage.


>
> Dealing with black box firmware blobs is a pain.
>

+10

>
>
> > Note that TCP stack could detect when this happens, *if* ACK where
> > delivered before the TX completions, or when TX completion happens,
> > we could detect that the clone of the freed packet was freed.
> >
> > In my test, when I did "ethtool -C eth0 tx-usecs 1024 tx-frames 64", and
> > disabling GSO, TCP stack sends a bunch of packets (a bit less than 64),
> > blocks on tcp_limit_output_bytes.
> >
> > Then we receive 2 stretch ACKS after ~50 usec.
> >
> > TCP stack tries to push again some packets but blocks on
> > tcp_limit_output_bytes again.
> >
> > 1ms later, TX completion happens, tcp_wfree() is called, and TCP stack
> > push following ~60 packets.
> >
> >
> > TCP could  eventually dynamically adjust the tcp_limit_output_bytes,
> > using a per flow dynamic value, but I would rather not add a kludge in
> > TCP stack only to deal with a possible bug in ath10k driver.
> >
> > niu has a similar issue and simply had to call skb_orphan() :
> >
> > drivers/net/ethernet/sun/niu.c:6669:            skb_orphan(skb);
>
> Ok. I tried calling skb_orphan() right after I submit each Tx frame
> (similar to niu which does this in start_xmit):
>
> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
> @@ -564,6 +564,8 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct
> sk_buff *msdu)
>         if (res)
>                 goto err_unmap_msdu;
>
> +       skb_orphan(msdu);
> +
>         return 0;
>
>  err_unmap_msdu:
>
>
> Now, with {net/master + ath10k GRO + the above} I get 620mbps on a
> single flow (even better then before). Wow.
>
> Does this look ok/safe as a solution to you?
>
>
> Michał
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




-- 
Dave Täht

thttp://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-05 14:48                                           ` Eric Dumazet
@ 2015-02-06  9:39                                             ` Nicolas Cavallari
  -1 siblings, 0 replies; 107+ messages in thread
From: Nicolas Cavallari @ 2015-02-06  9:39 UTC (permalink / raw)
  To: Eric Dumazet, Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On 05/02/2015 15:48, Eric Dumazet wrote:
> On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:
> 
>> I do get your point. But 1.5ms is really tough on Wi-Fi.
>>
>> Just look at this:
>>
>> ; ping 192.168.1.2 -c 3
>> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
>> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
>> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
>> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms
> 
> Thats a different point.
> 
> I dont care about rtt but TX completions. (usually much much lower than
> rtt)

On wired network perhaps, but definitely not on Wi-Fi.

With aggregation, you may send up to 4ms of data before the receiver
can acknowledge anything. But you have to gain access to the channel
first, so you may wait while others finish off their 4ms
transmissions. And this does not account for retransmissions.

And aggregation is not the only problem as far as bufferbloat is
concerned. I don't even want to think about powersave.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06  9:39                                             ` Nicolas Cavallari
  0 siblings, 0 replies; 107+ messages in thread
From: Nicolas Cavallari @ 2015-02-06  9:39 UTC (permalink / raw)
  To: Eric Dumazet, Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On 05/02/2015 15:48, Eric Dumazet wrote:
> On Thu, 2015-02-05 at 14:44 +0100, Michal Kazior wrote:
> 
>> I do get your point. But 1.5ms is really tough on Wi-Fi.
>>
>> Just look at this:
>>
>> ; ping 192.168.1.2 -c 3
>> PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
>> 64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1.83 ms
>> 64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=2.02 ms
>> 64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=1.98 ms
> 
> Thats a different point.
> 
> I dont care about rtt but TX completions. (usually much much lower than
> rtt)

On wired network perhaps, but definitely not on Wi-Fi.

With aggregation, you may send up to 4ms of data before the receiver
can acknowledge anything. But you have to gain access to the channel
first, so you may wait while others finish off their 4ms
transmissions. And this does not account for retransmissions.

And aggregation is not the only problem as far as bufferbloat is
concerned. I don't even want to think about powersave.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-05 17:10                                           ` Eric Dumazet
@ 2015-02-06  9:42                                             ` Michal Kazior
  2015-02-06 13:40                                                 ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Michal Kazior @ 2015-02-06  9:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On 5 February 2015 at 18:10, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-02-05 at 06:41 -0800, Eric Dumazet wrote:
>
>> Not at all. This basically removes backpressure.
>>
>> A single UDP socket can now blast packets regardless of SO_SNDBUF
>> limits.
>>
>> This basically remove years of work trying to fix bufferbloat.
>>
>> I still do not understand why increasing tcp_limit_output_bytes is not
>> working for you.
>
> Oh well, tcp_limit_output_bytes might be ok.
>
> In fact, the problem comes from GSO assumption. Maybe Herbert was right,
> when he suggested TCP would be simpler if we enforced GSO...
>
> When GSO is used, the thing works because 2*skb->truesize is roughly 2
> ms worth of traffic.
>
> Because you do not use GSO, and tx completions are slow, we need this :
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b95e17..ac01b4cd0035 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2044,7 +2044,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                         break;
>
>                 /* TCP Small Queues :
> -                * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> +                * Control number of packets in qdisc/devices to two packets /
> +                * or ~2 ms (sk->sk_pacing_rate >> 9) in case GSO is off.
>                  * This allows for :
>                  *  - better RTT estimation and ACK scheduling
>                  *  - faster recovery
> @@ -2053,7 +2054,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
> -               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> +               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
>                 if (atomic_read(&sk->sk_wmem_alloc) > limit) {
>

The above brings back previous behaviour, i.e. I can get 600mbps TCP
on 5 flows again. Single flow is still (as it was before TSO
autosizing) limited to roughly ~280mbps.

I never really bothered before to understand why I need to push a few
flows through ath10k to max it out, i.e. if I run a single UDP flow I
get ~300mbps while with, e.g. 5 I get 670mbps easily.

I guess it was the tx completion latency all along.

I just put an extra debug to ath10k to see the latency between
submission and completion. Here's a log
(http://www.filedropper.com/complete-log) of 2s run of UDP iperf
trying to push 1gbps but managing only 300mbps.

I've made sure to not hold any locks nor introduce internal to ath10k
delays. Frames get completed between 2-4ms in avarage during load.

When I tried using different ath10k hw&fw I got between 1-2ms of
latency for tx completionsyielding ~430mbps while max should be around
670mbps.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06  9:57                                             ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-06  9:57 UTC (permalink / raw)
  To: Dave Taht; +Cc: Eric Dumazet, Neal Cardwell, Eyal Perry, netdev, linux-wireless

On 5 February 2015 at 20:50, Dave Taht <dave.taht@gmail.com> wrote:
[...]
> And I really, really, really wish, that just once during this thread,
> someone had bothered to try running a test
> at a real world MCS rate - say MCS1, or MCS4, and measured the latency
> under load of that...

Time between frame submission to firmware and tx-completion on one of
my ath10k machines:

Legacy 54mbps: ~18ms
Legacy 6mbps: ~37ms
11n MCS 3 (nss=0): ~13ms
11n MCS 8 (nss=1): ~6-8ms
11ac NSS=1 MCS=2: ~4-6ms
11ac NSS=2 MCS=0: ~5-8ms

Keep in mind this is a clean room environment so retransmissions are
kept at minimum. Obviously with a noisy environment you'll get retries
at different rates and higher latency.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06  9:57                                             ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-06  9:57 UTC (permalink / raw)
  To: Dave Taht
  Cc: Eric Dumazet, Neal Cardwell, Eyal Perry,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-wireless

On 5 February 2015 at 20:50, Dave Taht <dave.taht-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
[...]
> And I really, really, really wish, that just once during this thread,
> someone had bothered to try running a test
> at a real world MCS rate - say MCS1, or MCS4, and measured the latency
> under load of that...

Time between frame submission to firmware and tx-completion on one of
my ath10k machines:

Legacy 54mbps: ~18ms
Legacy 6mbps: ~37ms
11n MCS 3 (nss=0): ~13ms
11n MCS 8 (nss=1): ~6-8ms
11ac NSS=1 MCS=2: ~4-6ms
11ac NSS=2 MCS=0: ~5-8ms

Keep in mind this is a clean room environment so retransmissions are
kept at minimum. Obviously with a noisy environment you'll get retries
at different rates and higher latency.


Michał
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 13:40                                                 ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-06 13:40 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:

> The above brings back previous behaviour, i.e. I can get 600mbps TCP
> on 5 flows again. Single flow is still (as it was before TSO
> autosizing) limited to roughly ~280mbps.
> 
> I never really bothered before to understand why I need to push a few
> flows through ath10k to max it out, i.e. if I run a single UDP flow I
> get ~300mbps while with, e.g. 5 I get 670mbps easily.
> 

For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
enough : UDP has no callback from TX completion to feed following frames
(No write queue like TCP)

# cat /proc/sys/net/core/wmem_default
212992
# ethtool -C eth1 tx-usecs 1024 tx-frames 120
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1450   10.00      697705      0     809.27
212992           10.00      673412            781.09

# echo 800000 >/proc/sys/net/core/wmem_default
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

800000    1450   10.00     7329221      0    8501.84
212992           10.00     7284051           8449.44


> I guess it was the tx completion latency all along.
> 
> I just put an extra debug to ath10k to see the latency between
> submission and completion. Here's a log
> (http://www.filedropper.com/complete-log) of 2s run of UDP iperf
> trying to push 1gbps but managing only 300mbps.
> 
> I've made sure to not hold any locks nor introduce internal to ath10k
> delays. Frames get completed between 2-4ms in avarage during load.


tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
of TX completion delay. But this would require yet another expensive
call to ktime_get() if HZ < 1000.

Then tcp_write_xmit() could use it to adjust :

   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

to

   amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate 

   limit = max(2 * skb->truesize, amount / 1000);

I'll cook a patch.

Thanks.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 13:40                                                 ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-06 13:40 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:

> The above brings back previous behaviour, i.e. I can get 600mbps TCP
> on 5 flows again. Single flow is still (as it was before TSO
> autosizing) limited to roughly ~280mbps.
> 
> I never really bothered before to understand why I need to push a few
> flows through ath10k to max it out, i.e. if I run a single UDP flow I
> get ~300mbps while with, e.g. 5 I get 670mbps easily.
> 

For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
enough : UDP has no callback from TX completion to feed following frames
(No write queue like TCP)

# cat /proc/sys/net/core/wmem_default
212992
# ethtool -C eth1 tx-usecs 1024 tx-frames 120
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1450   10.00      697705      0     809.27
212992           10.00      673412            781.09

# echo 800000 >/proc/sys/net/core/wmem_default
# ./netperf -H remote -t UDP_STREAM -- -m 1450
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

800000    1450   10.00     7329221      0    8501.84
212992           10.00     7284051           8449.44


> I guess it was the tx completion latency all along.
> 
> I just put an extra debug to ath10k to see the latency between
> submission and completion. Here's a log
> (http://www.filedropper.com/complete-log) of 2s run of UDP iperf
> trying to push 1gbps but managing only 300mbps.
> 
> I've made sure to not hold any locks nor introduce internal to ath10k
> delays. Frames get completed between 2-4ms in avarage during load.


tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
of TX completion delay. But this would require yet another expensive
call to ktime_get() if HZ < 1000.

Then tcp_write_xmit() could use it to adjust :

   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);

to

   amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate 

   limit = max(2 * skb->truesize, amount / 1000);

I'll cook a patch.

Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 13:53                                                   ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-06 13:53 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
> of TX completion delay. But this would require yet another expensive
> call to ktime_get() if HZ < 1000.
> 
> Then tcp_write_xmit() could use it to adjust :
> 
>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
> 
> to
> 
>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate 
> 
>    limit = max(2 * skb->truesize, amount / 1000);
> 
> I'll cook a patch.

Hmm... doing this in all protocols would be too expensive,
and we do not want to include time spent in qdiscs.

wifi could eventually do that, providing in skb->tx_completion_delay_us
the time spent in wifi driver.

This way, we would have no penalty for network devices doing normal skb
orphaning (loopback interface, ethernet, ...)



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 13:53                                                   ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-06 13:53 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development,
	eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:

> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
> of TX completion delay. But this would require yet another expensive
> call to ktime_get() if HZ < 1000.
> 
> Then tcp_write_xmit() could use it to adjust :
> 
>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
> 
> to
> 
>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate 
> 
>    limit = max(2 * skb->truesize, amount / 1000);
> 
> I'll cook a patch.

Hmm... doing this in all protocols would be too expensive,
and we do not want to include time spent in qdiscs.

wifi could eventually do that, providing in skb->tx_completion_delay_us
the time spent in wifi driver.

This way, we would have no penalty for network devices doing normal skb
orphaning (loopback interface, ethernet, ...)


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 14:08                                                   ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-06 14:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On 6 February 2015 at 14:40, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:
>
>> The above brings back previous behaviour, i.e. I can get 600mbps TCP
>> on 5 flows again. Single flow is still (as it was before TSO
>> autosizing) limited to roughly ~280mbps.
>>
>> I never really bothered before to understand why I need to push a few
>> flows through ath10k to max it out, i.e. if I run a single UDP flow I
>> get ~300mbps while with, e.g. 5 I get 670mbps easily.
>>
>
> For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
> enough : UDP has no callback from TX completion to feed following frames
> (No write queue like TCP)
>
> # cat /proc/sys/net/core/wmem_default
> 212992
> # ethtool -C eth1 tx-usecs 1024 tx-frames 120
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1450   10.00      697705      0     809.27
> 212992           10.00      673412            781.09
>
> # echo 800000 >/proc/sys/net/core/wmem_default
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 800000    1450   10.00     7329221      0    8501.84
> 212992           10.00     7284051           8449.44

Hmm.. I confirm it works. However the value at which I get full rate
on a single flow is more than 2048K. Also using non-default
wmem_default seems to introduce packet loss as per iperf reports at
the receiver. I suppose this is kind of expected but on the other hand
wmem_default=262992 and 5 flows of UDP max the device out with 0
packet loss.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 14:08                                                   ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-06 14:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On 6 February 2015 at 14:40, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, 2015-02-06 at 10:42 +0100, Michal Kazior wrote:
>
>> The above brings back previous behaviour, i.e. I can get 600mbps TCP
>> on 5 flows again. Single flow is still (as it was before TSO
>> autosizing) limited to roughly ~280mbps.
>>
>> I never really bothered before to understand why I need to push a few
>> flows through ath10k to max it out, i.e. if I run a single UDP flow I
>> get ~300mbps while with, e.g. 5 I get 670mbps easily.
>>
>
> For single UDP flow, tweaking /proc/sys/net/core/wmem_default might be
> enough : UDP has no callback from TX completion to feed following frames
> (No write queue like TCP)
>
> # cat /proc/sys/net/core/wmem_default
> 212992
> # ethtool -C eth1 tx-usecs 1024 tx-frames 120
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1450   10.00      697705      0     809.27
> 212992           10.00      673412            781.09
>
> # echo 800000 >/proc/sys/net/core/wmem_default
> # ./netperf -H remote -t UDP_STREAM -- -m 1450
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 800000    1450   10.00     7329221      0    8501.84
> 212992           10.00     7284051           8449.44

Hmm.. I confirm it works. However the value at which I get full rate
on a single flow is more than 2048K. Also using non-default
wmem_default seems to introduce packet loss as per iperf reports at
the receiver. I suppose this is kind of expected but on the other hand
wmem_default=262992 and 5 flows of UDP max the device out with 0
packet loss.


Michał
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-06 13:53                                                   ` Eric Dumazet
  (?)
@ 2015-02-06 14:09                                                   ` Michal Kazior
  2015-02-09 13:47                                                     ` Michal Kazior
  -1 siblings, 1 reply; 107+ messages in thread
From: Michal Kazior @ 2015-02-06 14:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On 6 February 2015 at 14:53, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:
>
>> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
>> of TX completion delay. But this would require yet another expensive
>> call to ktime_get() if HZ < 1000.
>>
>> Then tcp_write_xmit() could use it to adjust :
>>
>>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>>
>> to
>>
>>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>>
>>    limit = max(2 * skb->truesize, amount / 1000);
>>
>> I'll cook a patch.
>
> Hmm... doing this in all protocols would be too expensive,
> and we do not want to include time spent in qdiscs.
>
> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
>
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

I'll play around with this idea and report back later.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-06 13:53                                                   ` Eric Dumazet
  (?)
  (?)
@ 2015-02-06 14:10                                                   ` Eric Dumazet
  2015-02-06 14:31                                                       ` David Laight
  -1 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2015-02-06 14:10 UTC (permalink / raw)
  To: Michal Kazior; +Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:


> wifi could eventually do that, providing in skb->tx_completion_delay_us
> the time spent in wifi driver.
> 
> This way, we would have no penalty for network devices doing normal skb
> orphaning (loopback interface, ethernet, ...)

Another way would be that wifi does an automatic orphaning after 1 or
2ms.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-06 14:10                                                   ` Eric Dumazet
@ 2015-02-06 14:31                                                       ` David Laight
  0 siblings, 0 replies; 107+ messages in thread
From: David Laight @ 2015-02-06 14:31 UTC (permalink / raw)
  To: 'Eric Dumazet', Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

RnJvbTogRXJpYyBEdW1hemV0DQo+IE9uIEZyaSwgMjAxNS0wMi0wNiBhdCAwNTo1MyAtMDgwMCwg
RXJpYyBEdW1hemV0IHdyb3RlOg0KPiANCj4gDQo+ID4gd2lmaSBjb3VsZCBldmVudHVhbGx5IGRv
IHRoYXQsIHByb3ZpZGluZyBpbiBza2ItPnR4X2NvbXBsZXRpb25fZGVsYXlfdXMNCj4gPiB0aGUg
dGltZSBzcGVudCBpbiB3aWZpIGRyaXZlci4NCj4gPg0KPiA+IFRoaXMgd2F5LCB3ZSB3b3VsZCBo
YXZlIG5vIHBlbmFsdHkgZm9yIG5ldHdvcmsgZGV2aWNlcyBkb2luZyBub3JtYWwgc2tiDQo+ID4g
b3JwaGFuaW5nIChsb29wYmFjayBpbnRlcmZhY2UsIGV0aGVybmV0LCAuLi4pDQo+IA0KPiBBbm90
aGVyIHdheSB3b3VsZCBiZSB0aGF0IHdpZmkgZG9lcyBhbiBhdXRvbWF0aWMgb3JwaGFuaW5nIGFm
dGVyIDEgb3INCj4gMm1zLg0KDQpDb3VsZG4ndCB5b3UgZG8gYnl0ZSBjb3VudGluZz8NClNvIG9y
cGhhbiBlbm91Z2ggcGFja2V0cyB0byBrZWVwIGEgZmV3IG1zIG9mIHR4IHRyYWZmaWMgKGF0IHRo
ZSBjdXJyZW50DQp0eCByYXRlKSBvcnBoYW5lZC4NCllvdSBtaWdodCBuZWVkIHRvIGdpdmUgdGhl
IGhhcmR3YXJlIGJvdGggb3JwaGFuZWQgYW5kIG5vbi1vcnBoYW5lZCAocGFyZW50ZWQ/KQ0KcGFj
a2V0cyBhbmQgb3JwaGFuIHNvbWUgd2hlbiB5b3UgZ2V0IGEgdHggY29tcGxldGUgZm9yIGFuIG9y
cGhhbmVkIHBhY2tldC4NCg0KCURhdmlkDQoNCg==

^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 14:31                                                       ` David Laight
  0 siblings, 0 replies; 107+ messages in thread
From: David Laight @ 2015-02-06 14:31 UTC (permalink / raw)
  To: 'Eric Dumazet', Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, eyalpe

From: Eric Dumazet
> On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
> 
> 
> > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > the time spent in wifi driver.
> >
> > This way, we would have no penalty for network devices doing normal skb
> > orphaning (loopback interface, ethernet, ...)
> 
> Another way would be that wifi does an automatic orphaning after 1 or
> 2ms.

Couldn't you do byte counting?
So orphan enough packets to keep a few ms of tx traffic (at the current
tx rate) orphaned.
You might need to give the hardware both orphaned and non-orphaned (parented?)
packets and orphan some when you get a tx complete for an orphaned packet.

	David


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 14:35                                                     ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-06 14:35 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On Fri, 2015-02-06 at 15:08 +0100, Michal Kazior wrote:

> Hmm.. I confirm it works. However the value at which I get full rate
> on a single flow is more than 2048K. Also using non-default
> wmem_default seems to introduce packet loss as per iperf reports at
> the receiver. I suppose this is kind of expected but on the other hand
> wmem_default=262992 and 5 flows of UDP max the device out with 0
> packet loss.

If you increase ability to flood on one flow, then you need to make sure
receiver has big rcvbuf as well.

echo 2000000 >/proc/sys/net/core/rmem_default

Otherwise it might drop bursts.

This is the kind of things that TCP does automatically, not UDP.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 14:35                                                     ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-06 14:35 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On Fri, 2015-02-06 at 15:08 +0100, Michal Kazior wrote:

> Hmm.. I confirm it works. However the value at which I get full rate
> on a single flow is more than 2048K. Also using non-default
> wmem_default seems to introduce packet loss as per iperf reports at
> the receiver. I suppose this is kind of expected but on the other hand
> wmem_default=262992 and 5 flows of UDP max the device out with 0
> packet loss.

If you increase ability to flood on one flow, then you need to make sure
receiver has big rcvbuf as well.

echo 2000000 >/proc/sys/net/core/rmem_default

Otherwise it might drop bursts.

This is the kind of things that TCP does automatically, not UDP.


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 15:02                                                         ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-06 15:02 UTC (permalink / raw)
  To: David Laight
  Cc: Michal Kazior, Neal Cardwell, linux-wireless,
	Network Development, eyalpe

On Fri, 2015-02-06 at 14:31 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
> > 
> > 
> > > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > > the time spent in wifi driver.
> > >
> > > This way, we would have no penalty for network devices doing normal skb
> > > orphaning (loopback interface, ethernet, ...)
> > 
> > Another way would be that wifi does an automatic orphaning after 1 or
> > 2ms.
> 
> Couldn't you do byte counting?
> So orphan enough packets to keep a few ms of tx traffic (at the current
> tx rate) orphaned.
> You might need to give the hardware both orphaned and non-orphaned (parented?)
> packets and orphan some when you get a tx complete for an orphaned packet.

We already have byte counting.

The thing is : A driver can keep an skb for itself, but calling
skb_orphan() in time to allow a socket to send more packets.

For say a UDP server, it would be quite mandatory, as it usually uses a
single UDP socket to receive and send messages.






^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 15:02                                                         ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-06 15:02 UTC (permalink / raw)
  To: David Laight
  Cc: Michal Kazior, Neal Cardwell, linux-wireless,
	Network Development, eyalpe-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb

On Fri, 2015-02-06 at 14:31 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2015-02-06 at 05:53 -0800, Eric Dumazet wrote:
> > 
> > 
> > > wifi could eventually do that, providing in skb->tx_completion_delay_us
> > > the time spent in wifi driver.
> > >
> > > This way, we would have no penalty for network devices doing normal skb
> > > orphaning (loopback interface, ethernet, ...)
> > 
> > Another way would be that wifi does an automatic orphaning after 1 or
> > 2ms.
> 
> Couldn't you do byte counting?
> So orphan enough packets to keep a few ms of tx traffic (at the current
> tx rate) orphaned.
> You might need to give the hardware both orphaned and non-orphaned (parented?)
> packets and orphan some when you get a tx complete for an orphaned packet.

We already have byte counting.

The thing is : A driver can keep an skb for itself, but calling
skb_orphan() in time to allow a socket to send more packets.

For say a UDP server, it would be quite mandatory, as it usually uses a
single UDP socket to receive and send messages.





--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-06 14:35                                                     ` Eric Dumazet
@ 2015-02-06 17:48                                                       ` Rick Jones
  -1 siblings, 0 replies; 107+ messages in thread
From: Rick Jones @ 2015-02-06 17:48 UTC (permalink / raw)
  To: Eric Dumazet, Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

> If you increase ability to flood on one flow, then you need to make sure
> receiver has big rcvbuf as well.
>
> echo 2000000 >/proc/sys/net/core/rmem_default
>
> Otherwise it might drop bursts.
>
> This is the kind of things that TCP does automatically, not UDP.

An alternative, if the application involved can make explicit 
setsockopt() calls to set SO_SNDBUF and/or SO_RCVBUF, is to tweak 
rmem_max and wmem_max and then let the application make the setsockopt() 
calls.

Which path one would take would depend on circumstances I suspect.

rick jones

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-06 17:48                                                       ` Rick Jones
  0 siblings, 0 replies; 107+ messages in thread
From: Rick Jones @ 2015-02-06 17:48 UTC (permalink / raw)
  To: Eric Dumazet, Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

> If you increase ability to flood on one flow, then you need to make sure
> receiver has big rcvbuf as well.
>
> echo 2000000 >/proc/sys/net/core/rmem_default
>
> Otherwise it might drop bursts.
>
> This is the kind of things that TCP does automatically, not UDP.

An alternative, if the application involved can make explicit 
setsockopt() calls to set SO_SNDBUF and/or SO_RCVBUF, is to tweak 
rmem_max and wmem_max and then let the application make the setsockopt() 
calls.

Which path one would take would depend on circumstances I suspect.

rick jones

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-06 14:09                                                   ` Michal Kazior
@ 2015-02-09 13:47                                                     ` Michal Kazior
  2015-02-09 15:11                                                       ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Michal Kazior @ 2015-02-09 13:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On 6 February 2015 at 15:09, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 6 February 2015 at 14:53, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Fri, 2015-02-06 at 05:40 -0800, Eric Dumazet wrote:
>>
>>> tcp_wfree() could maintain in tp->tx_completion_delay_ms an EWMA
>>> of TX completion delay. But this would require yet another expensive
>>> call to ktime_get() if HZ < 1000.
>>>
>>> Then tcp_write_xmit() could use it to adjust :
>>>
>>>    limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 9);
>>>
>>> to
>>>
>>>    amount = (2 + tp->tx_completion_delay_ms) * sk->sk_pacing_rate
>>>
>>>    limit = max(2 * skb->truesize, amount / 1000);
>>>
>>> I'll cook a patch.
>>
>> Hmm... doing this in all protocols would be too expensive,
>> and we do not want to include time spent in qdiscs.
>>
>> wifi could eventually do that, providing in skb->tx_completion_delay_us
>> the time spent in wifi driver.
>>
>> This way, we would have no penalty for network devices doing normal skb
>> orphaning (loopback interface, ethernet, ...)
>
> I'll play around with this idea and report back later.

I'm able to get 600mbps with 5 flows and 250mbps with 1 flow, i.e.
same as before the regression. I'm attaching the patch at the end of
my mail - is this approach viable?

I wonder if there's anything that can be done to allow 600mbps (line
rate) on 1 flow with ath10k without tweaking tcp_limit_output_bytes
(you can't expect end-users to tweak this).

Perhaps tcp_limit_output_bytes should also consider tx_completion_delay, e.g.:

  amount = sk->sk_tx_completion_delay_us;
  amount *= sk->sk_pacing_rate >> 10;
  limit = max(2 * skb->truesize, amount >> 10);
  max_limit = sysctl_tcp_limit_output_bytes;
  max_limit *= 1 + (sk->sk_tx_completion_delay_us / USEC_PER_MSEC);
  limit = min(u32, limit, max_limit);

With this I get ~400mbps on 1 flow. If I add the original 1ms extra
delay from your formula to tx_completion_delay I fill in ath10k I get
nearly line rate in 1 flow (almost 600mbps; it hops between 570-620).
Decreasing tcp_limit_output_bytes decreases throughput (e.g. 64K gives
300mbps, 32K gives 180mbps, 16K gives 110mbps). Multiple flows in
iperf seem unbalanced with 128K limit, but look okay with 32K).


Michał


diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 3be3a59..4ff0ae8 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -82,6 +82,7 @@ struct ath10k_skb_cb {
        dma_addr_t paddr;
        u8 eid;
        u8 vdev_id;
+       ktime_t stamp;

        struct {
                u8 tid;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c
b/drivers/net/wireless/ath/ath10k/mac.c
index 15e47f4..5efb2a7 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2620,6 +2620,7 @@ static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       ATH10K_SKB_CB(skb)->stamp = ktime_get();
        ATH10K_SKB_CB(skb)->htt.is_offchan = false;
        ATH10K_SKB_CB(skb)->htt.tid = ath10k_tx_h_get_tid(hdr);
        ATH10K_SKB_CB(skb)->vdev_id = ath10k_tx_h_get_vdev_id(ar, vif);
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
index 3f00cec..0d5539b 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -15,6 +15,7 @@
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */

+#include <net/sock.h>
 #include "core.h"
 #include "txrx.h"
 #include "htt.h"
@@ -82,6 +83,13 @@ void ath10k_txrx_tx_unref(struct ath10k_htt *htt,

        ath10k_report_offchan_tx(htt->ar, msdu);

+       if (msdu->sk) {
+               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_us) =
+                               ktime_to_ns(ktime_sub(ktime_get(),
+                                                     skb_cb->stamp)) /
+                               NSEC_PER_USEC;
+       }
+
        info = IEEE80211_SKB_CB(msdu);
        memset(&info->status, 0, sizeof(info->status));
        trace_ath10k_txrx_tx_unref(ar, tx_done->msdu_id);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..6b15d71 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -390,6 +390,7 @@ struct sock {
        int                     sk_wmem_queued;
        gfp_t                   sk_allocation;
        u32                     sk_pacing_rate; /* bytes per second */
+       u32                     sk_tx_completion_delay_us;
        u32                     sk_max_pacing_rate;
        netdev_features_t       sk_route_caps;
        netdev_features_t       sk_route_nocaps;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b..5e249bf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
        max_segs = tcp_tso_autosize(sk, mss_now);
        while ((skb = tcp_send_head(sk))) {
                unsigned int limit;
+               unsigned int amount;

                tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
                BUG_ON(!tso_segs);
@@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
-               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+               amount = sk->sk_tx_completion_delay_us *
+                        (sk->sk_pacing_rate >> 10);
+               limit = max(2 * skb->truesize, amount >> 10);
                limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

                if (atomic_read(&sk->sk_wmem_alloc) > limit) {

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-09 13:47                                                     ` Michal Kazior
@ 2015-02-09 15:11                                                       ` Eric Dumazet
  2015-02-10 10:33                                                         ` Michal Kazior
  2015-03-31 11:08                                                         ` Johannes Berg
  0 siblings, 2 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-09 15:11 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On Mon, 2015-02-09 at 14:47 +0100, Michal Kazior wrote:

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 65caf8b..5e249bf 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
> unsigned int mss_now, int nonagle,
>         max_segs = tcp_tso_autosize(sk, mss_now);
>         while ((skb = tcp_send_head(sk))) {
>                 unsigned int limit;
> +               unsigned int amount;
> 
>                 tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
>                 BUG_ON(!tso_segs);
> @@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
> unsigned int mss_now, int nonagle,
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
> -               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
> +               amount = sk->sk_tx_completion_delay_us *
> +                        (sk->sk_pacing_rate >> 10);
> +               limit = max(2 * skb->truesize, amount >> 10);
>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> 
>                 if (atomic_read(&sk->sk_wmem_alloc) > limit) {

This is not what I suggested.

If you test this on any other network device, you'll have
sk->sk_tx_completion_delay_us == 0

amount = 0 * (sk->sk_pacing_rate >> 10); --> 0
limit = max(2 * skb->truesize, amount >> 10); --> 2 * skb->truesize

So non TSO/GSO NIC will not be able to queue more than 2 MSS (one MSS
per skb)

Then if you store only the last tx completion, you have the possibility
of having a last packet of a train (say a retransmit) to make it very
low.

Ideally the formula would be in TCP something very fast to compute :

amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
limit = max(2 * skb->truesize, amount);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

So a 'problematic' driver would have to do the math (64 bit maths) like
this :


sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;






^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-09 15:11                                                       ` Eric Dumazet
@ 2015-02-10 10:33                                                         ` Michal Kazior
  2015-02-10 12:54                                                           ` Eric Dumazet
                                                                             ` (2 more replies)
  2015-03-31 11:08                                                         ` Johannes Berg
  1 sibling, 3 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-10 10:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On 9 February 2015 at 16:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2015-02-09 at 14:47 +0100, Michal Kazior wrote:
[...]
> This is not what I suggested.
>
> If you test this on any other network device, you'll have
> sk->sk_tx_completion_delay_us == 0
>
> amount = 0 * (sk->sk_pacing_rate >> 10); --> 0
> limit = max(2 * skb->truesize, amount >> 10); --> 2 * skb->truesize

You're right. Sorry for mixing up.


> So non TSO/GSO NIC will not be able to queue more than 2 MSS (one MSS
> per skb)
>
> Then if you store only the last tx completion, you have the possibility
> of having a last packet of a train (say a retransmit) to make it very
> low.
>
> Ideally the formula would be in TCP something very fast to compute :
>
> amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
> limit = max(2 * skb->truesize, amount);
> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>
> So a 'problematic' driver would have to do the math (64 bit maths) like
> this :
>
>
> sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;

Hmm. So I've done like you suggested (hopefully I didn't mix anything
up this time around).

I now get pre-regression performance, ~250mbps on 1 flow, 600mbps on 5
flows (vs 250mbps whatever number of flows).


Michał


diff --git a/drivers/net/wireless/ath/ath10k/core.c
b/drivers/net/wireless/ath/ath10k/core.c
index 367e896..a29111c 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -18,6 +18,7 @@
 #include <linux/module.h>
 #include <linux/firmware.h>
 #include <linux/of.h>
+#include <linux/average.h>

 #include "core.h"
 #include "mac.h"
@@ -1423,6 +1424,7 @@ struct ath10k *ath10k_core_create(size_t
priv_size, struct device *dev,
        init_dummy_netdev(&ar->napi_dev);
        ieee80211_napi_add(ar->hw, &ar->napi, &ar->napi_dev,
                           ath10k_core_napi_dummy_poll, 64);
+       ewma_init(&ar->tx_delay_us, 16384, 8);

        ret = ath10k_debug_create(ar);
        if (ret)
diff --git a/drivers/net/wireless/ath/ath10k/core.h
b/drivers/net/wireless/ath/ath10k/core.h
index 3be3a59..34f6d78 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -24,6 +24,7 @@
 #include <linux/pci.h>
 #include <linux/uuid.h>
 #include <linux/time.h>
+#include <linux/average.h>

 #include "htt.h"
 #include "htc.h"
@@ -82,6 +83,7 @@ struct ath10k_skb_cb {
        dma_addr_t paddr;
        u8 eid;
        u8 vdev_id;
+       ktime_t stamp;

        struct {
                u8 tid;
@@ -625,6 +627,7 @@ struct ath10k {

        struct net_device napi_dev;
        struct napi_struct napi;
+       struct ewma tx_delay_us;

 #ifdef CONFIG_ATH10K_DEBUGFS
        struct ath10k_debug debug;
diff --git a/drivers/net/wireless/ath/ath10k/mac.c
b/drivers/net/wireless/ath/ath10k/mac.c
index 15e47f4..5efb2a7 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -2620,6 +2620,7 @@ static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       ATH10K_SKB_CB(skb)->stamp = ktime_get();
        ATH10K_SKB_CB(skb)->htt.is_offchan = false;
        ATH10K_SKB_CB(skb)->htt.tid = ath10k_tx_h_get_tid(hdr);
        ATH10K_SKB_CB(skb)->vdev_id = ath10k_tx_h_get_vdev_id(ar, vif);
diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
b/drivers/net/wireless/ath/ath10k/txrx.c
index 3f00cec..0f5f0f2 100644
--- a/drivers/net/wireless/ath/ath10k/txrx.c
+++ b/drivers/net/wireless/ath/ath10k/txrx.c
@@ -15,6 +15,8 @@
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */

+#include <net/sock.h>
+#include <linux/average.h>
 #include "core.h"
 #include "txrx.h"
 #include "htt.h"
@@ -82,6 +84,16 @@ void ath10k_txrx_tx_unref(struct ath10k_htt *htt,

        ath10k_report_offchan_tx(htt->ar, msdu);

+       if (msdu->sk) {
+               ewma_add(&ar->tx_delay_us,
+                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
+                        NSEC_PER_USEC);
+
+               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
+                               (ewma_read(&ar->tx_delay_us) *
+                                msdu->sk->sk_pacing_rate) >> 20;
+       }
+
        info = IEEE80211_SKB_CB(msdu);
        memset(&info->status, 0, sizeof(info->status));
        trace_ath10k_txrx_tx_unref(ar, tx_done->msdu_id);
diff --git a/include/net/sock.h b/include/net/sock.h
index 2210fec..6772543 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -391,6 +391,7 @@ struct sock {
        gfp_t                   sk_allocation;
        u32                     sk_pacing_rate; /* bytes per second */
        u32                     sk_max_pacing_rate;
+       u32                     sk_tx_completion_delay_cushion;
        netdev_features_t       sk_route_caps;
        netdev_features_t       sk_route_nocaps;
        int                     sk_gso_type;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 65caf8b..526a568 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1996,6 +1996,7 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
        max_segs = tcp_tso_autosize(sk, mss_now);
        while ((skb = tcp_send_head(sk))) {
                unsigned int limit;
+               unsigned int amount;

                tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
                BUG_ON(!tso_segs);
@@ -2053,7 +2054,9 @@ static bool tcp_write_xmit(struct sock *sk,
unsigned int mss_now, int nonagle,
                 * of queued bytes to ensure line rate.
                 * One example is wifi aggregation (802.11 AMPDU)
                 */
-               limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+               amount = (sk->sk_pacing_rate >> 10) +
+                        sk->sk_tx_completion_delay_cushion;
+               limit = max(2 * skb->truesize, amount);
                limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

                if (atomic_read(&sk->sk_wmem_alloc) > limit) {

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-10 10:33                                                         ` Michal Kazior
@ 2015-02-10 12:54                                                           ` Eric Dumazet
  2015-02-10 13:05                                                             ` Eric Dumazet
  2015-02-10 13:14                                                           ` Eric Dumazet
  2015-02-10 14:19                                                           ` Johannes Berg
  2 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2015-02-10 12:54 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

> +       if (msdu->sk) {
> +               ewma_add(&ar->tx_delay_us,
> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
> +                        NSEC_PER_USEC);
> +
> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
> +                               (ewma_read(&ar->tx_delay_us) *
> +                                msdu->sk->sk_pacing_rate) >> 20;
> +       }
> +

Hi Michal

This is almost it ;)

As I said you must do this using u64 arithmetics, we still support 32bit
kernels.

Also, >> 20 instead of / 1000000 introduces a 5% error, I would use a
plain divide, as the compiler will use a reciprocal divide (ie : a
multiply)

We use >> 10 instead of /1000 because a 2.4 % error is probably okay.

        ewma_add(&ar->tx_delay_us,
                 ktime_to_ns(ktime_sub(ktime_get(),
skb_cb->stamp)) /
	                        NSEC_PER_USEC);
	u64 val = (u64)ewma_read(&ar->tx_delay_us) *
                   msdu->sk->sk_pacing_rate;

	do_div(val, USEC_PER_SEC);
		
        ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
                    (u32)val;
                 
(WRITE_ONCE() would be better for new kernels, but ACCESS_ONCE() is ok
since we probably want to backport to stable kernels)


Thanks



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-10 12:54                                                           ` Eric Dumazet
@ 2015-02-10 13:05                                                             ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-10 13:05 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On Tue, 2015-02-10 at 04:54 -0800, Eric Dumazet wrote:

> Hi Michal
> 
> This is almost it ;)
> 
> As I said you must do this using u64 arithmetics, we still support 32bit
> kernels.
> 
> Also, >> 20 instead of / 1000000 introduces a 5% error, I would use a
> plain divide, as the compiler will use a reciprocal divide (ie : a
> multiply)
> 
> We use >> 10 instead of /1000 because a 2.4 % error is probably okay.
> 
>         ewma_add(&ar->tx_delay_us,
>                  ktime_to_ns(ktime_sub(ktime_get(),
> skb_cb->stamp)) /
> 	                        NSEC_PER_USEC);

btw I suspect this wont compile on 32 bit kernel

You need to use do_div() as well :

u64 val = ktime_to_ns(ktime_sub(ktime_get(),
				skb_cb->stamp));

do_div(val, NSEC_PER_USEC);

ewma_add(&ar->tx_delay_us, val);



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-10 10:33                                                         ` Michal Kazior
  2015-02-10 12:54                                                           ` Eric Dumazet
@ 2015-02-10 13:14                                                           ` Eric Dumazet
  2015-02-11  8:33                                                             ` Michal Kazior
  2015-02-10 14:19                                                           ` Johannes Berg
  2 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2015-02-10 13:14 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>                            ath10k_core_napi_dummy_poll, 64);
> +       ewma_init(&ar->tx_delay_us, 16384, 8);


1) 16384 factor might be too big.

2) a weight of 8 seems too low given aggregation values used in wifi.

On 32bit arches, the max range for ewma value would be 262144 usec,
a quarter of a second...

You could use a factor of 64 instead, and a weight of 16.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-10 10:33                                                         ` Michal Kazior
  2015-02-10 12:54                                                           ` Eric Dumazet
  2015-02-10 13:14                                                           ` Eric Dumazet
@ 2015-02-10 14:19                                                           ` Johannes Berg
  2015-02-10 15:09                                                             ` Eric Dumazet
  2015-02-11  8:57                                                               ` Michal Kazior
  2 siblings, 2 replies; 107+ messages in thread
From: Johannes Berg @ 2015-02-10 14:19 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Eric Dumazet, Neal Cardwell, linux-wireless, Network Development,
	Eyal Perry

On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:

> +       if (msdu->sk) {
> +               ewma_add(&ar->tx_delay_us,
> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
> +                        NSEC_PER_USEC);
> +
> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
> +                               (ewma_read(&ar->tx_delay_us) *
> +                                msdu->sk->sk_pacing_rate) >> 20;
> +       }

To some extent, every wifi driver is going to have this problem. Perhaps
we should do this in mac80211?

johannes


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-10 14:19                                                           ` Johannes Berg
@ 2015-02-10 15:09                                                             ` Eric Dumazet
  2015-02-11  8:57                                                               ` Michal Kazior
  1 sibling, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-10 15:09 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Michal Kazior, Neal Cardwell, linux-wireless,
	Network Development, Eyal Perry

On Tue, 2015-02-10 at 15:19 +0100, Johannes Berg wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
> 
> > +       if (msdu->sk) {
> > +               ewma_add(&ar->tx_delay_us,
> > +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
> > +                        NSEC_PER_USEC);
> > +
> > +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
> > +                               (ewma_read(&ar->tx_delay_us) *
> > +                                msdu->sk->sk_pacing_rate) >> 20;
> > +       }
> 
> To some extent, every wifi driver is going to have this problem. Perhaps
> we should do this in mac80211?

I'll provide the TCP patch.

sk->sk_tx_completion_delay_cushion is probably a wrong name, as the
units here are in bytes, since it is really number of bytes in the
network driver that accommodate for tx completions delays. 

tx_completion_delay * pacing_rate

sk_tx_completion_cushion maybe.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-10 13:14                                                           ` Eric Dumazet
@ 2015-02-11  8:33                                                             ` Michal Kazior
  2015-02-11 13:17                                                                 ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Michal Kazior @ 2015-02-11  8:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On 10 February 2015 at 14:14, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>>                            ath10k_core_napi_dummy_poll, 64);
>> +       ewma_init(&ar->tx_delay_us, 16384, 8);
>
>
> 1) 16384 factor might be too big.
>
> 2) a weight of 8 seems too low given aggregation values used in wifi.
>
> On 32bit arches, the max range for ewma value would be 262144 usec,
> a quarter of a second...
>
> You could use a factor of 64 instead, and a weight of 16.

64/16 seems to work fine as well.

On a related note: I still wonder how to get single TCP flow to reach
line rate with ath10k (it still doesn't; I reach line rate with
multiple flows only). Isn't the tcp_limit_output_bytes just too small
for devices like Wi-Fi where you can send aggregates of even 64*3*1500
bytes long in a single shot and you can't expect even a single
tx-completion of it to come in before its transmitted entirely? You
effectively operate with bursts of traffic.

Some numbers:
 ath10k w/o cushion w/o aggregation 1 flow: UDP 65mbps, TCP 30mbps
 ath10k w/ cushion w/o aggregation 1 flow: UDP 65mbps, TCP 59mbps
 ath10k w/o cushion w/ aggregation 1 flow: UDP 650mbps, TCP 250mbps
 ath10k w/ cushion w/ aggregation 1 flow: UDP 650mbps, TCP 250mbps
 ath10k w/o cushion w/ aggregation 5 flows: UDP 650mbps, TCP 250mbps
 ath10k w/ cushion w/ aggregation 5 flows: UDP 650mbps, TCP 600mbps

"w/o aggregation" means forcing ath10k to use 1 A-MSDU and 1 A-MPDU
per aggregate so latencies due to aggregation itself should be pretty
much nil.

If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
w/ aggregation to reach 600mbps on a single flow.


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-11  8:57                                                               ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-11  8:57 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Eric Dumazet, Neal Cardwell, linux-wireless, Network Development,
	Eyal Perry

On 10 February 2015 at 15:19, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>
>> +       if (msdu->sk) {
>> +               ewma_add(&ar->tx_delay_us,
>> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>> +                        NSEC_PER_USEC);
>> +
>> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>> +                               (ewma_read(&ar->tx_delay_us) *
>> +                                msdu->sk->sk_pacing_rate) >> 20;
>> +       }
>
> To some extent, every wifi driver is going to have this problem. Perhaps
> we should do this in mac80211?

Good point. I was actually thinking about it. I can try cooking a
patch unless you want to do it yourself :-)


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-11  8:57                                                               ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-11  8:57 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Eric Dumazet, Neal Cardwell, linux-wireless, Network Development,
	Eyal Perry

On 10 February 2015 at 15:19, Johannes Berg <johannes-cdvu00un1VgdHxzADdlk8Q@public.gmane.org> wrote:
> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>
>> +       if (msdu->sk) {
>> +               ewma_add(&ar->tx_delay_us,
>> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>> +                        NSEC_PER_USEC);
>> +
>> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>> +                               (ewma_read(&ar->tx_delay_us) *
>> +                                msdu->sk->sk_pacing_rate) >> 20;
>> +       }
>
> To some extent, every wifi driver is going to have this problem. Perhaps
> we should do this in mac80211?

Good point. I was actually thinking about it. I can try cooking a
patch unless you want to do it yourself :-)


Michał
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-11  8:33                                                             ` Michal Kazior
@ 2015-02-11 13:17                                                                 ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-11 13:17 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:

> If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
> w/ aggregation to reach 600mbps on a single flow.

You know, there is a reason this sysctl exists in the first place ;)

The first suggestion I made to you was to raise it.

The default setting must stay as is as long default Qdisc is pfifo_fast.

I believe I already mentioned skb->truesize tricks for drivers willing
to adjust the TSQ given their constraints.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-11 13:17                                                                 ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2015-02-11 13:17 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry

On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:

> If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
> w/ aggregation to reach 600mbps on a single flow.

You know, there is a reason this sysctl exists in the first place ;)

The first suggestion I made to you was to raise it.

The default setting must stay as is as long default Qdisc is pfifo_fast.

I believe I already mentioned skb->truesize tricks for drivers willing
to adjust the TSQ given their constraints.


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-12  7:16                                                                   ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-12  7:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry,
	Johannes Berg

On 11 February 2015 at 14:17, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:
>
>> If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
>> w/ aggregation to reach 600mbps on a single flow.
>
> You know, there is a reason this sysctl exists in the first place ;)
>
> The first suggestion I made to you was to raise it.
>
> The default setting must stay as is as long default Qdisc is pfifo_fast.
>
> I believe I already mentioned skb->truesize tricks for drivers willing
> to adjust the TSQ given their constraints.

Right. truesize didn't help in my early tests and once the cushion
thing came about I had assumed that it's not relevant anymore.

I just checked:

@@ -2620,6 +2621,12 @@ static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       if (skb->sk) {
+               u32 trim = skb->truesize - (skb->truesize / 8);
+               skb->truesize -= trim;
+               atomic_sub(trim, &skb->sk->sk_wmem_alloc);
+       }

With this I get 600mbps on a single flow. The /2 wasn't enough (it
barely made a difference, 250->300mbps). The question is how do I know
how much of trimming is too much? Could the tx completion delay be
used to compute the trim factor, hmm..

Maybe this should be done in mac80211 as well?


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-12  7:16                                                                   ` Michal Kazior
  0 siblings, 0 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-12  7:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, linux-wireless, Network Development, Eyal Perry,
	Johannes Berg

On 11 February 2015 at 14:17, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, 2015-02-11 at 09:33 +0100, Michal Kazior wrote:
>
>> If I set tcp_limit_output_bytes to 700K+ I can get ath10k w/ cushion
>> w/ aggregation to reach 600mbps on a single flow.
>
> You know, there is a reason this sysctl exists in the first place ;)
>
> The first suggestion I made to you was to raise it.
>
> The default setting must stay as is as long default Qdisc is pfifo_fast.
>
> I believe I already mentioned skb->truesize tricks for drivers willing
> to adjust the TSQ given their constraints.

Right. truesize didn't help in my early tests and once the cushion
thing came about I had assumed that it's not relevant anymore.

I just checked:

@@ -2620,6 +2621,12 @@ static void ath10k_tx(struct ieee80211_hw *hw,
        if (info->flags & IEEE80211_TX_CTL_NO_CCK_RATE)
                ath10k_dbg(ar, ATH10K_DBG_MAC,
"IEEE80211_TX_CTL_NO_CCK_RATE\n");

+       if (skb->sk) {
+               u32 trim = skb->truesize - (skb->truesize / 8);
+               skb->truesize -= trim;
+               atomic_sub(trim, &skb->sk->sk_wmem_alloc);
+       }

With this I get 600mbps on a single flow. The /2 wasn't enough (it
barely made a difference, 250->300mbps). The question is how do I know
how much of trimming is too much? Could the tx completion delay be
used to compute the trim factor, hmm..

Maybe this should be done in mac80211 as well?


Michał
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-11  8:57                                                               ` Michal Kazior
  (?)
@ 2015-02-12  7:48                                                               ` Michal Kazior
  2015-02-12  8:33                                                                 ` Dave Taht
  2015-02-24 10:24                                                                   ` Johannes Berg
  -1 siblings, 2 replies; 107+ messages in thread
From: Michal Kazior @ 2015-02-12  7:48 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Eric Dumazet, Neal Cardwell, linux-wireless, Network Development,
	Eyal Perry

On 11 February 2015 at 09:57, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 10 February 2015 at 15:19, Johannes Berg <johannes@sipsolutions.net> wrote:
>> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>>
>>> +       if (msdu->sk) {
>>> +               ewma_add(&ar->tx_delay_us,
>>> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>>> +                        NSEC_PER_USEC);
>>> +
>>> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>>> +                               (ewma_read(&ar->tx_delay_us) *
>>> +                                msdu->sk->sk_pacing_rate) >> 20;
>>> +       }
>>
>> To some extent, every wifi driver is going to have this problem. Perhaps
>> we should do this in mac80211?
>
> Good point. I was actually thinking about it. I can try cooking a
> patch unless you want to do it yourself :-)

I've taken a look into this. The most obvious place to add the
timestamp for each packet would be ieee80211_tx_info (i.e. the
skb->cb[48]). The problem is it's very tight there. Even squeezing 2
bytes (allowing up to 64ms of tx completion delay which I'm worried
won't be enough) will be troublesome. Some drivers already use every
last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
40 bytes of driver_data).

I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
of changes required to implement the tx completion delay accounting?


Michał

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-12  7:48                                                               ` Michal Kazior
@ 2015-02-12  8:33                                                                 ` Dave Taht
  2015-02-24 10:24                                                                   ` Johannes Berg
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Taht @ 2015-02-12  8:33 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Johannes Berg, Eric Dumazet, Neal Cardwell, linux-wireless,
	Network Development, Eyal Perry

On Wed, Feb 11, 2015 at 11:48 PM, Michal Kazior <michal.kazior@tieto.com> wrote:
> On 11 February 2015 at 09:57, Michal Kazior <michal.kazior@tieto.com> wrote:
>> On 10 February 2015 at 15:19, Johannes Berg <johannes@sipsolutions.net> wrote:
>>> On Tue, 2015-02-10 at 11:33 +0100, Michal Kazior wrote:
>>>
>>>> +       if (msdu->sk) {
>>>> +               ewma_add(&ar->tx_delay_us,
>>>> +                        ktime_to_ns(ktime_sub(ktime_get(), skb_cb->stamp)) /
>>>> +                        NSEC_PER_USEC);
>>>> +
>>>> +               ACCESS_ONCE(msdu->sk->sk_tx_completion_delay_cushion) =
>>>> +                               (ewma_read(&ar->tx_delay_us) *
>>>> +                                msdu->sk->sk_pacing_rate) >> 20;
>>>> +       }
>>>
>>> To some extent, every wifi driver is going to have this problem. Perhaps
>>> we should do this in mac80211?
>>
>> Good point. I was actually thinking about it. I can try cooking a
>> patch unless you want to do it yourself :-)
>
> I've taken a look into this. The most obvious place to add the
> timestamp for each packet would be ieee80211_tx_info (i.e. the
> skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> bytes (allowing up to 64ms of tx completion delay which I'm worried

I will argue strongly in favor of never allowing more than 4ms packets
to accumulate in the firmware.

> won't be enough) will be troublesome. Some drivers already use every
> last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> 40 bytes of driver_data).
>
> I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
> of changes required to implement the tx completion delay accounting?
>
>
> Michał
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Dave Täht

thttp://www.bufferbloat.net/projects/bloat/wiki/Upcoming_Talks

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-24 10:24                                                                   ` Johannes Berg
  0 siblings, 0 replies; 107+ messages in thread
From: Johannes Berg @ 2015-02-24 10:24 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Eric Dumazet, Neal Cardwell, linux-wireless, Network Development,
	Eyal Perry

On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:

> > Good point. I was actually thinking about it. I can try cooking a
> > patch unless you want to do it yourself :-)
> 
> I've taken a look into this. The most obvious place to add the
> timestamp for each packet would be ieee80211_tx_info (i.e. the
> skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> bytes (allowing up to 64ms of tx completion delay which I'm worried
> won't be enough) will be troublesome. Some drivers already use every
> last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> 40 bytes of driver_data).

Couldn't we just repurpose the existing skb->tstamp field for this, as
long as the skb is fully contained within the wireless layer?

Actually, it looks like we can't, since I guess timestamping options can
be turned on on any socket.

> I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
> of changes required to implement the tx completion delay accounting?

I have no doubt that would be rejected :)

johannes


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
@ 2015-02-24 10:24                                                                   ` Johannes Berg
  0 siblings, 0 replies; 107+ messages in thread
From: Johannes Berg @ 2015-02-24 10:24 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Eric Dumazet, Neal Cardwell, linux-wireless, Network Development,
	Eyal Perry

On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:

> > Good point. I was actually thinking about it. I can try cooking a
> > patch unless you want to do it yourself :-)
> 
> I've taken a look into this. The most obvious place to add the
> timestamp for each packet would be ieee80211_tx_info (i.e. the
> skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> bytes (allowing up to 64ms of tx completion delay which I'm worried
> won't be enough) will be troublesome. Some drivers already use every
> last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> 40 bytes of driver_data).

Couldn't we just repurpose the existing skb->tstamp field for this, as
long as the skb is fully contained within the wireless layer?

Actually, it looks like we can't, since I guess timestamping options can
be turned on on any socket.

> I wonder if it's okay to bump skb->cb to 56 bytes to avoid the cascade
> of changes required to implement the tx completion delay accounting?

I have no doubt that would be rejected :)

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-24 10:24                                                                   ` Johannes Berg
  (?)
@ 2015-02-24 10:30                                                                   ` Johannes Berg
  2015-02-24 10:59                                                                     ` Johannes Berg
  -1 siblings, 1 reply; 107+ messages in thread
From: Johannes Berg @ 2015-02-24 10:30 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Eric Dumazet, Neal Cardwell, linux-wireless, Network Development,
	Eyal Perry

On Tue, 2015-02-24 at 11:24 +0100, Johannes Berg wrote:
> On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:
> 
> > > Good point. I was actually thinking about it. I can try cooking a
> > > patch unless you want to do it yourself :-)
> > 
> > I've taken a look into this. The most obvious place to add the
> > timestamp for each packet would be ieee80211_tx_info (i.e. the
> > skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> > bytes (allowing up to 64ms of tx completion delay which I'm worried
> > won't be enough) will be troublesome. Some drivers already use every
> > last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> > 40 bytes of driver_data).
> 
> Couldn't we just repurpose the existing skb->tstamp field for this, as
> long as the skb is fully contained within the wireless layer?
> 
> Actually, it looks like we can't, since I guess timestamping options can
> be turned on on any socket.

Actually, that creates a clone or a new skb? Hmm.

Anyway, I think we should just move the vif and hw_key pointers out of
the tx_info and into ieee80211_tx_control. That will give us plenty of
space, and they can't really be used long-term anyway? Although ... both
might be somewhat problematic for mac80211 itself? Hmm.

johannes


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-24 10:30                                                                   ` Johannes Berg
@ 2015-02-24 10:59                                                                     ` Johannes Berg
  0 siblings, 0 replies; 107+ messages in thread
From: Johannes Berg @ 2015-02-24 10:59 UTC (permalink / raw)
  To: Michal Kazior
  Cc: Eric Dumazet, Neal Cardwell, linux-wireless, Network Development,
	Eyal Perry

On Tue, 2015-02-24 at 11:30 +0100, Johannes Berg wrote:
> On Tue, 2015-02-24 at 11:24 +0100, Johannes Berg wrote:
> > On Thu, 2015-02-12 at 08:48 +0100, Michal Kazior wrote:
> > 
> > > > Good point. I was actually thinking about it. I can try cooking a
> > > > patch unless you want to do it yourself :-)
> > > 
> > > I've taken a look into this. The most obvious place to add the
> > > timestamp for each packet would be ieee80211_tx_info (i.e. the
> > > skb->cb[48]). The problem is it's very tight there. Even squeezing 2
> > > bytes (allowing up to 64ms of tx completion delay which I'm worried
> > > won't be enough) will be troublesome. Some drivers already use every
> > > last byte of their allowance on 64bit archs (e.g. ar5523 uses entire
> > > 40 bytes of driver_data).
> > 
> > Couldn't we just repurpose the existing skb->tstamp field for this, as
> > long as the skb is fully contained within the wireless layer?
> > 
> > Actually, it looks like we can't, since I guess timestamping options can
> > be turned on on any socket.
> 
> Actually, that creates a clone or a new skb? Hmm.

Ah and then it puts it on the error queue right away, so I think we can
reuse it.

johannes


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: Throughput regression with `tcp: refine TSO autosizing`
  2015-02-09 15:11                                                       ` Eric Dumazet
  2015-02-10 10:33                                                         ` Michal Kazior
@ 2015-03-31 11:08                                                         ` Johannes Berg
  1 sibling, 0 replies; 107+ messages in thread
From: Johannes Berg @ 2015-03-31 11:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michal Kazior, Neal Cardwell, linux-wireless,
	Network Development, Eyal Perry

To revive an old thread ...

On Mon, 2015-02-09 at 07:11 -0800, Eric Dumazet wrote:


> Ideally the formula would be in TCP something very fast to compute :
> 
> amount = (sk->sk_pacing_rate >> 10) + sk->tx_completion_delay_cushion;
> limit = max(2 * skb->truesize, amount);
> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> 
> So a 'problematic' driver would have to do the math (64 bit maths) like
> this :
> 
> 
> sk->tx_completion_delay_cushion = ewma_tx_delay * sk->sk_pacing_rate;

The whole notion with "ewma_tx_delay" seems very wrong to me. Measuring
something while also trying to control it (or something closely related)
seems a bit strange, but perhaps you meant to measure something
different than what Michal implemented.

What he implemented was measuring the time it takes from submission to
the hardware queues, but that seems to create a bad feedback cycle:
Allowing it as extra transmit "cushion" will, IIUC, cause TCP to submit
more data to the queues, which will in turn cause the next packets to be
potentially delayed more (since there are still waiting ones) thus
causing a longer delay to be measured, which in turn allows even more
data to be submitted etc.

IOW, while traffic is flowing this will likely cause feedback that
completely removes the point of this, no? Leaving only
sysctl_tcp_limit_output_bytes as the limit (*).

It seems it'd be better to provide a calculated estimate, perhaps based
on current transmit rate and (if available) CCA/TXOP acquisition time.

johannes

(*) Which, btw, isn't all that big given that a maximally sized A-MPDU
is like 1MB containing close to that much actual data! Don't think that
can actually be done at current transmit rates though.


^ permalink raw reply	[flat|nested] 107+ messages in thread

end of thread, other threads:[~2015-03-31 11:08 UTC | newest]

Thread overview: 107+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-29 11:48 Throughput regression with `tcp: refine TSO autosizing` Michal Kazior
2015-01-29 13:14 ` Eric Dumazet
2015-01-30 10:29   ` Arend van Spriel
2015-01-30 13:19     ` Eric Dumazet
2015-01-30 13:19       ` Eric Dumazet
2015-01-30 13:47       ` Arend van Spriel
2015-01-30 14:37         ` Eric Dumazet
     [not found]         ` <CAA93jw5fqhz0Hiw74L2GXgtZ9JsMg+NtYydKxKzGDrvQcZn4hA@mail.gmail.com>
     [not found]           ` <CAA93jw7b0E9jjQYXrEPzjLLC9j8xNC0TFYXpWVtgFameJaNBdw@mail.gmail.com>
     [not found]             ` <1422741065.199624134@apps.rackspace.com>
     [not found]               ` <CAPp0ZBb2nkA6Y0s=W0kw=zvyn0wi0NMBRsBCw_xcD61ScOmgQg@mail.gmail.com>
     [not found]                 ` <CAA_e5Z46Bu+zZZFzf_ejzA35Gw3g1_OG85yv6yd7MpbwZcE-nw@mail.gmail.com>
2015-02-01  8:45                   ` Fwd: " Dave Taht
2015-02-01 10:47                     ` Jonathan Morton
2015-02-01 14:43                       ` dpreed
2015-02-01 23:34                         ` Andrew McGregor
2015-02-02  4:04                           ` [Cerowrt-devel] " Avery Pennarun
2015-02-02  4:04                             ` Avery Pennarun
2015-02-02 15:25                             ` Jim Gettys
2015-02-02  4:21                         ` Avery Pennarun
2015-02-02  7:07                           ` David Lang
2015-02-02  7:07                             ` David Lang
2015-01-30 13:39   ` Michal Kazior
2015-01-30 13:39     ` Michal Kazior
2015-01-30 14:40     ` Eric Dumazet
2015-01-30 14:40       ` Eric Dumazet
2015-02-02 10:27       ` Michal Kazior
2015-02-02 10:27         ` Michal Kazior
2015-02-02 18:52         ` Eric Dumazet
2015-02-02 21:25           ` Ben Greear
2015-02-02 23:06             ` Eric Dumazet
2015-02-02 23:06               ` Eric Dumazet
2015-02-03  9:00               ` Michal Kazior
2015-02-03  9:00                 ` Michal Kazior
2015-02-03  1:18           ` Eric Dumazet
2015-02-03  1:18             ` Eric Dumazet
2015-02-03 11:50             ` Michal Kazior
2015-02-03 14:27               ` Eric Dumazet
2015-02-03 14:27                 ` Eric Dumazet
2015-02-03 15:03                 ` Eric Dumazet
2015-02-03 15:03                   ` Eric Dumazet
2015-02-04 11:35                 ` Michal Kazior
2015-02-04 11:57                   ` Eric Dumazet
2015-02-04 11:57                     ` Eric Dumazet
2015-02-04 12:22                     ` Michal Kazior
2015-02-04 12:38                       ` Eric Dumazet
2015-02-04 12:53                         ` Michal Kazior
2015-02-04 12:55                           ` Johannes Berg
2015-02-04 13:16                           ` Eric Dumazet
2015-02-04 13:29                             ` Eric Dumazet
2015-02-04 21:11                               ` Eric Dumazet
2015-02-04 21:11                                 ` Eric Dumazet
2015-02-05  6:46                                 ` Michal Kazior
2015-02-05  6:46                                   ` Michal Kazior
2015-02-05 13:03                                   ` Eric Dumazet
2015-02-05 13:03                                     ` Eric Dumazet
2015-02-05  8:38                                 ` Michal Kazior
2015-02-05 12:57                                   ` Eric Dumazet
2015-02-05 13:19                                     ` Eric Dumazet
2015-02-05 13:33                                       ` Eric Dumazet
2015-02-05 13:33                                         ` Eric Dumazet
2015-02-05 13:44                                       ` Michal Kazior
2015-02-05 14:41                                         ` Eric Dumazet
2015-02-05 14:41                                           ` Eric Dumazet
2015-02-05 17:10                                           ` Eric Dumazet
2015-02-06  9:42                                             ` Michal Kazior
2015-02-06 13:40                                               ` Eric Dumazet
2015-02-06 13:40                                                 ` Eric Dumazet
2015-02-06 13:53                                                 ` Eric Dumazet
2015-02-06 13:53                                                   ` Eric Dumazet
2015-02-06 14:09                                                   ` Michal Kazior
2015-02-09 13:47                                                     ` Michal Kazior
2015-02-09 15:11                                                       ` Eric Dumazet
2015-02-10 10:33                                                         ` Michal Kazior
2015-02-10 12:54                                                           ` Eric Dumazet
2015-02-10 13:05                                                             ` Eric Dumazet
2015-02-10 13:14                                                           ` Eric Dumazet
2015-02-11  8:33                                                             ` Michal Kazior
2015-02-11 13:17                                                               ` Eric Dumazet
2015-02-11 13:17                                                                 ` Eric Dumazet
2015-02-12  7:16                                                                 ` Michal Kazior
2015-02-12  7:16                                                                   ` Michal Kazior
2015-02-10 14:19                                                           ` Johannes Berg
2015-02-10 15:09                                                             ` Eric Dumazet
2015-02-11  8:57                                                             ` Michal Kazior
2015-02-11  8:57                                                               ` Michal Kazior
2015-02-12  7:48                                                               ` Michal Kazior
2015-02-12  8:33                                                                 ` Dave Taht
2015-02-24 10:24                                                                 ` Johannes Berg
2015-02-24 10:24                                                                   ` Johannes Berg
2015-02-24 10:30                                                                   ` Johannes Berg
2015-02-24 10:59                                                                     ` Johannes Berg
2015-03-31 11:08                                                         ` Johannes Berg
2015-02-06 14:10                                                   ` Eric Dumazet
2015-02-06 14:31                                                     ` David Laight
2015-02-06 14:31                                                       ` David Laight
2015-02-06 15:02                                                       ` Eric Dumazet
2015-02-06 15:02                                                         ` Eric Dumazet
2015-02-06 14:08                                                 ` Michal Kazior
2015-02-06 14:08                                                   ` Michal Kazior
2015-02-06 14:35                                                   ` Eric Dumazet
2015-02-06 14:35                                                     ` Eric Dumazet
2015-02-06 17:48                                                     ` Rick Jones
2015-02-06 17:48                                                       ` Rick Jones
2015-02-05 14:48                                         ` Eric Dumazet
2015-02-05 14:48                                           ` Eric Dumazet
2015-02-06  9:39                                           ` Nicolas Cavallari
2015-02-06  9:39                                             ` Nicolas Cavallari
2015-02-05 19:50                                         ` Dave Taht
2015-02-06  9:57                                           ` Michal Kazior
2015-02-06  9:57                                             ` Michal Kazior
2015-02-03  8:44           ` Michal Kazior

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.