* TCP performance regression in mac80211 triggered by the fq code @ 2016-07-12 10:09 Felix Fietkau 2016-07-12 12:13 ` Dave Taht ` (4 more replies) 0 siblings, 5 replies; 23+ messages in thread From: Felix Fietkau @ 2016-07-12 10:09 UTC (permalink / raw) To: linux-wireless; +Cc: Michal Kazior, Toke Høiland-Jørgensen Hi, With Toke's ath9k txq patch I've noticed a pretty nasty performance regression when running local iperf on an AP (running the txq stuff) to a wireless client. Here's some things that I found: - when I use only one TCP stream I get around 90-110 Mbit/s - when running multiple TCP streams, I get only 35-40 Mbit/s total - fairness between TCP streams looks completely fine - there's no big queue buildup, the code never actually drops any packets - if I put a hack in the fq code to force the hash to a constant value (effectively disabling fq without disabling codel), the problem disappears and even multiple streams get proper performance. Please let me know if you have any ideas. - Felix ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 10:09 TCP performance regression in mac80211 triggered by the fq code Felix Fietkau @ 2016-07-12 12:13 ` Dave Taht 2016-07-12 13:21 ` Felix Fietkau 2016-07-12 12:28 ` Toke Høiland-Jørgensen ` (3 subsequent siblings) 4 siblings, 1 reply; 23+ messages in thread From: Dave Taht @ 2016-07-12 12:13 UTC (permalink / raw) To: Felix Fietkau, make-wifi-fast Cc: linux-wireless, Michal Kazior, Toke Høiland-Jørgensen On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <nbd@nbd.name> wrote: > Hi, > > With Toke's ath9k txq patch I've noticed a pretty nasty performance > regression when running local iperf on an AP (running the txq stuff) to > a wireless client. Your kernel? cpu architecture? What happens when going through the AP to a server from the wireless client? Which direction? > Here's some things that I found: > - when I use only one TCP stream I get around 90-110 Mbit/s with how much cpu left over? > - when running multiple TCP streams, I get only 35-40 Mbit/s total with how much cpu left over? context switch difference between the two tests? tcp_limit_output_bytes is? got perf? > - fairness between TCP streams looks completely fine A codel will get to long term fairness pretty fast. Packet captures from a fq will show much more regular interleaving of packets, regardless. > - there's no big queue buildup, the code never actually drops any packets A "trick" I have been using to observe codel behavior has been to enable ecn on server and client, then checking in wireshark for ect(3) marked packets. > - if I put a hack in the fq code to force the hash to a constant value You could also set "flows" to 1 to keep the hash being generated, but not actually use it. > (effectively disabling fq without disabling codel), the problem > disappears and even multiple streams get proper performance. Meaning you get 90-110Mbits ? Do you have a "before toke" figure for this platform? 
> Please let me know if you have any ideas. I am in berlin, packing hardware... > > - Felix > -- > To unsubscribe from this list: send the line "unsubscribe linux-wireless" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 12:13 ` Dave Taht @ 2016-07-12 13:21 ` Felix Fietkau 2016-07-12 14:02 ` Dave Taht 0 siblings, 1 reply; 23+ messages in thread From: Felix Fietkau @ 2016-07-12 13:21 UTC (permalink / raw) To: Dave Taht, make-wifi-fast Cc: linux-wireless, Michal Kazior, Toke Høiland-Jørgensen On 2016-07-12 14:13, Dave Taht wrote: > On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <nbd@nbd.name> wrote: >> Hi, >> >> With Toke's ath9k txq patch I've noticed a pretty nasty performance >> regression when running local iperf on an AP (running the txq stuff) to >> a wireless client. > > Your kernel? cpu architecture? QCA9558, 720 MHz, running Linux 4.4.14 > What happens when going through the AP to a server from the wireless client? Will test that next. > Which direction? AP->STA, iperf running on the AP. Client is a regular MacBook Pro (Broadcom). >> Here's some things that I found: >> - when I use only one TCP stream I get around 90-110 Mbit/s > > with how much cpu left over? ~20% >> - when running multiple TCP streams, I get only 35-40 Mbit/s total > with how much cpu left over? ~30% > context switch difference between the two tests? What's the easiest way to track that? > tcp_limit_output_bytes is? 262144 > got perf? Need to make a new build for that. >> - fairness between TCP streams looks completely fine > > A codel will get to long term fairness pretty fast. Packet captures > from a fq will show much more regular interleaving of packets, > regardless. > >> - there's no big queue buildup, the code never actually drops any packets > > A "trick" I have been using to observe codel behavior has been to > enable ecn on server and client, then checking in wireshark for ect(3) > marked packets. I verified this with printk. The same issue already appears if I have just the fq patch (with the codel patch reverted). 
>> - if I put a hack in the fq code to force the hash to a constant value > > You could also set "flows" to 1 to keep the hash being generated, but > not actually use it. > >> (effectively disabling fq without disabling codel), the problem >> disappears and even multiple streams get proper performance. > > Meaning you get 90-110Mbits ? Right. > Do you have a "before toke" figure for this platform? It's quite similar. >> Please let me know if you have any ideas. > > I am in berlin, packing hardware... Nice! - Felix ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 13:21 ` Felix Fietkau @ 2016-07-12 14:02 ` Dave Taht 2016-07-13 7:57 ` Dave Taht 2016-07-19 13:10 ` Michal Kazior 0 siblings, 2 replies; 23+ messages in thread From: Dave Taht @ 2016-07-12 14:02 UTC (permalink / raw) To: Felix Fietkau Cc: make-wifi-fast, linux-wireless, Michal Kazior, Toke Høiland-Jørgensen On Tue, Jul 12, 2016 at 3:21 PM, Felix Fietkau <nbd@nbd.name> wrote: > On 2016-07-12 14:13, Dave Taht wrote: >> On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <nbd@nbd.name> wrote: >>> Hi, >>> >>> With Toke's ath9k txq patch I've noticed a pretty nasty performance >>> regression when running local iperf on an AP (running the txq stuff) to >>> a wireless client. >> >> Your kernel? cpu architecture? > QCA9558, 720 MHz, running Linux 4.4.14 > >> What happens when going through the AP to a server from the wireless client? > Will test that next. > >> Which direction? > AP->STA, iperf running on the AP. Client is a regular MacBook Pro > (Broadcom). There are always 2 wifi chips in play. Like the Sith. >>> Here's some things that I found: >>> - when I use only one TCP stream I get around 90-110 Mbit/s >> >> with how much cpu left over? > ~20% > >>> - when running multiple TCP streams, I get only 35-40 Mbit/s total >> with how much cpu left over? > ~30% Hmm. Care to try netperf? > >> context switch difference between the two tests? > What's the easiest way to track that? if you have gnu "time": time -v the_process or: perf record -e context-switches -ag or: process /proc/$PID/status for cntx >> tcp_limit_output_bytes is? > 262144 I keep hoping to be able to reduce this to something saner like 4096 one day. It got bumped to 64k based on bad wifi performance once, and then to its current size to make the Xen folk happier. The other param I'd like to see fiddled with is tcp_notsent_lowat. 
In both cases reductions will increase your context switches but reduce memory pressure and lead to a more reactive tcp. And in neither case I think this is the real cause of this problem. >> got perf? > Need to make a new build for that. > >>> - fairness between TCP streams looks completely fine >> >> A codel will get to long term fairness pretty fast. Packet captures >> from a fq will show much more regular interleaving of packets, >> regardless. >> >>> - there's no big queue buildup, the code never actually drops any packets >> >> A "trick" I have been using to observe codel behavior has been to >> enable ecn on server and client, then checking in wireshark for ect(3) >> marked packets. > I verified this with printk. The same issue already appears if I have > just the fq patch (with the codel patch reverted). OK. A four flow test "should" trigger codel.... Running out of cpu (or hitting some other bottleneck), without loss/marking "should" result in a tcptrace -G and xplot.org of the packet capture showing the window continuing to increase.... >>> - if I put a hack in the fq code to force the hash to a constant value >> >> You could also set "flows" to 1 to keep the hash being generated, but >> not actually use it. >> >>> (effectively disabling fq without disabling codel), the problem >>> disappears and even multiple streams get proper performance. >> >> Meaning you get 90-110Mbits ? > Right. > >> Do you have a "before toke" figure for this platform? > It's quite similar. > >>> Please let me know if you have any ideas. >> >> I am in berlin, packing hardware... > Nice! > > - Felix > -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 14:02 ` Dave Taht @ 2016-07-13 7:57 ` Dave Taht 2016-07-13 8:53 ` Felix Fietkau 2016-07-19 13:10 ` Michal Kazior 1 sibling, 1 reply; 23+ messages in thread From: Dave Taht @ 2016-07-13 7:57 UTC (permalink / raw) To: Felix Fietkau Cc: make-wifi-fast, linux-wireless, Michal Kazior, Toke Høiland-Jørgensen On Tue, Jul 12, 2016 at 4:02 PM, Dave Taht <dave.taht@gmail.com> wrote: > On Tue, Jul 12, 2016 at 3:21 PM, Felix Fietkau <nbd@nbd.name> wrote: >> On 2016-07-12 14:13, Dave Taht wrote: >>> On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <nbd@nbd.name> wrote: >>>> Hi, >>>> >>>> With Toke's ath9k txq patch I've noticed a pretty nasty performance >>>> regression when running local iperf on an AP (running the txq stuff) to >>>> a wireless client. >>> >>> Your kernel? cpu architecture? >> QCA9558, 720 MHz, running Linux 4.4.14 So this is a single core at the near-bottom end of the range. I guess we also should find a MIPS 24c derivative that runs at 400Mhz or so. What HZ? (I no longer know how much higher HZ settings make any difference, but I'm usually at NOHZ and 250, rather than 100.) And all the testing to date was on much higher end multi-cores. >>> What happens when going through the AP to a server from the wireless client? >> Will test that next. Anddddd? >> >>> Which direction? >> AP->STA, iperf running on the AP. Client is a regular MacBook Pro >> (Broadcom). > > There are always 2 wifi chips in play. Like the Sith. > >>>> Here's some things that I found: >>>> - when I use only one TCP stream I get around 90-110 Mbit/s >>> >>> with how much cpu left over? >> ~20% >> >>>> - when running multiple TCP streams, I get only 35-40 Mbit/s total >>> with how much cpu left over? >> ~30% To me this implies a contending lock issue, too much work in the irq handler or too delayed work in the softirq handler.... I thought you were very brave to try and backport this. > > Hmm. 
> > Care to try netperf? > >> >>> context switch difference between the two tests? >> What's the easiest way to track that? > > if you have gnu "time" time -v the_process > > or: > > perf record -e context-switches -ag > > or: process /proc/$PID/status for cntx > >>> tcp_limit_output_bytes is? >> 262144 > > I keep hoping to be able to reduce this to something saner like 4096 > one day. It got bumped to 64k based on bad wifi performance once, and > then to it's current size to make the Xen folk happier. > > The other param I'd like to see fiddled with is tcp_notsent_lowat. > > In both cases reductions will increase your context switches but > reduce memory pressure and lead to a more reactive tcp. > > And in neither case I think this is the real cause of this problem. > > >>> got perf? >> Need to make a new build for that. >> >>>> - fairness between TCP streams looks completely fine >>> >>> A codel will get to long term fairness pretty fast. Packet captures >>> from a fq will show much more regular interleaving of packets, >>> regardless. >>> >>>> - there's no big queue buildup, the code never actually drops any packets >>> >>> A "trick" I have been using to observe codel behavior has been to >>> enable ecn on server and client, then checking in wireshark for ect(3) >>> marked packets. >> I verified this with printk. The same issue already appears if I have >> just the fq patch (with the codel patch reverted). > > OK. A four flow test "should" trigger codel.... > > Running out of cpu (or hitting some other bottleneck), without > loss/marking "should" result in a tcptrace -G and xplot.org of the > packet capture showing the window continuing to increase.... > > >>>> - if I put a hack in the fq code to force the hash to a constant value >>> >>> You could also set "flows" to 1 to keep the hash being generated, but >>> not actually use it. 
>>> >>>> (effectively disabling fq without disabling codel), the problem >>>> disappears and even multiple streams get proper performance. >>> >>> Meaning you get 90-110Mbits ? >> Right. >> >>> Do you have a "before toke" figure for this platform? >> It's quite similar. >> >>>> Please let me know if you have any ideas. >>> >>> I am in berlin, packing hardware... >> Nice! >> >> - Felix >> > > > > -- > Dave Täht > Let's go make home routers and wifi faster! With better software! > http://blog.cerowrt.org -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-13 7:57 ` Dave Taht @ 2016-07-13 8:53 ` Felix Fietkau 2016-07-13 9:13 ` Dave Taht 0 siblings, 1 reply; 23+ messages in thread From: Felix Fietkau @ 2016-07-13 8:53 UTC (permalink / raw) To: Dave Taht Cc: make-wifi-fast, linux-wireless, Michal Kazior, Toke Høiland-Jørgensen On 2016-07-13 09:57, Dave Taht wrote: > On Tue, Jul 12, 2016 at 4:02 PM, Dave Taht <dave.taht@gmail.com> wrote: >> On Tue, Jul 12, 2016 at 3:21 PM, Felix Fietkau <nbd@nbd.name> wrote: >>> On 2016-07-12 14:13, Dave Taht wrote: >>>> On Tue, Jul 12, 2016 at 12:09 PM, Felix Fietkau <nbd@nbd.name> wrote: >>>>> Hi, >>>>> >>>>> With Toke's ath9k txq patch I've noticed a pretty nasty performance >>>>> regression when running local iperf on an AP (running the txq stuff) to >>>>> a wireless client. >>>> >>>> Your kernel? cpu architecture? >>> QCA9558, 720 MHz, running Linux 4.4.14 > > So this is a single core at the near-bottom end of the range. I guess > we also should find a MIPS 24c derivative that runs at 400Mhz or so. > > What HZ? (I no longer know how much higher HZ settings make any > difference, but I'm usually at NOHZ and 250, rather than 100.) > > And all the testing to date was on much higher end multi-cores. > >>>> What happens when going through the AP to a server from the wireless client? >>> Will test that next. > > Anddddd? Single stream: 130 Mbit/s, 70% idle Two streams: 50-80 Mbit/s (wildly fluctuating), 73% idle. >>>> Which direction? >>> AP->STA, iperf running on the AP. Client is a regular MacBook Pro >>> (Broadcom). >> >> There are always 2 wifi chips in play. Like the Sith. >> >>>>> Here's some things that I found: >>>>> - when I use only one TCP stream I get around 90-110 Mbit/s >>>> >>>> with how much cpu left over? >>> ~20% >>> >>>>> - when running multiple TCP streams, I get only 35-40 Mbit/s total >>>> with how much cpu left over? 
>>> ~30% > > To me this implies a contending lock issue, too much work in the irq > handler or too delayed work in the softirq handler.... > > I thought you were very brave to try and backport this. I don't think this has anything to do with contending locks, CPU utilization, etc. The code does something to the packets that TCP really doesn't like. - Felix ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-13 8:53 ` Felix Fietkau @ 2016-07-13 9:13 ` Dave Taht 0 siblings, 0 replies; 23+ messages in thread From: Dave Taht @ 2016-07-13 9:13 UTC (permalink / raw) To: Felix Fietkau Cc: make-wifi-fast, linux-wireless, Michal Kazior, Toke Høiland-Jørgensen On Wed, Jul 13, 2016 at 10:53 AM, Felix Fietkau <nbd@nbd.name> wrote: >> To me this implies a contending lock issue, too much work in the irq >> handler or too delayed work in the softirq handler.... >> >> I thought you were very brave to try and backport this. > I don't think this has anything to do with contending locks, CPU > utilization, etc. The code does something to the packets that TCP really > doesn't like. With your 70% idle figure, I am inclined to agree... could you get an aircap of the two different tests? - as well as a regular packetcap taken at the client or server? And put somewhere I can get at them? What version of OSX are you running? I will setup an ath9k box shortly... -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 14:02 ` Dave Taht 2016-07-13 7:57 ` Dave Taht @ 2016-07-19 13:10 ` Michal Kazior 1 sibling, 0 replies; 23+ messages in thread From: Michal Kazior @ 2016-07-19 13:10 UTC (permalink / raw) To: Dave Taht Cc: Felix Fietkau, make-wifi-fast, linux-wireless, Toke Høiland-Jørgensen On 12 July 2016 at 16:02, Dave Taht <dave.taht@gmail.com> wrote: [...] >>> tcp_limit_output_bytes is? >> 262144 > > I keep hoping to be able to reduce this to something saner like 4096 > one day. It got bumped to 64k based on bad wifi performance once, and > then to it's current size to make the Xen folk happier. Not sure if it's possible. You do need this to be at least as big as a single A-MPDU can get. In extreme 11ac cases it can be pretty big. I recall a discussion from a long time ago and the tcp limit output logic was/is coupled with the assumption that tx-completions always come max 1ms after tx submission. This is rather tricky to *guarantee* on wifi, especially with firmware blobs, big aggregates, lots of stations and retries. Michał ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 10:09 TCP performance regression in mac80211 triggered by the fq code Felix Fietkau 2016-07-12 12:13 ` Dave Taht @ 2016-07-12 12:28 ` Toke Høiland-Jørgensen 2016-07-12 12:44 ` Dave Taht ` (2 more replies) 2016-07-19 13:13 ` Michal Kazior ` (2 subsequent siblings) 4 siblings, 3 replies; 23+ messages in thread From: Toke Høiland-Jørgensen @ 2016-07-12 12:28 UTC (permalink / raw) To: Felix Fietkau; +Cc: linux-wireless, Michal Kazior Felix Fietkau <nbd@nbd.name> writes: > Hi, > > With Toke's ath9k txq patch I've noticed a pretty nasty performance > regression when running local iperf on an AP (running the txq stuff) to > a wireless client. > > Here's some things that I found: > - when I use only one TCP stream I get around 90-110 Mbit/s > - when running multiple TCP streams, I get only 35-40 Mbit/s total > - fairness between TCP streams looks completely fine > - there's no big queue buildup, the code never actually drops any packets > - if I put a hack in the fq code to force the hash to a constant value > (effectively disabling fq without disabling codel), the problem > disappears and even multiple streams get proper performance. > > Please let me know if you have any ideas. Hmm, I see two TCP streams get about the same aggregate throughput as one, both when started from the AP and when started one hop away. However, I do see TCP flows take a while to ramp up when started from the AP - a short test gets ~70Mbps when run from one hop away and ~50Mbps when run from the AP. How long are you running the tests for? (I seem to recall the ramp-up issue to be there pre-patch as well, though). As for why this would happen... There could be a bug in the dequeue code somewhere, but since you get better performance from sticking everything into one queue, my best guess would be that the client is choking on the interleaved packets? I.e. 
expending more CPU when it can't stick subsequent packets into the same TCP flow? -Toke ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 12:28 ` Toke Høiland-Jørgensen @ 2016-07-12 12:44 ` Dave Taht 2016-07-12 12:57 ` Toke Høiland-Jørgensen 2016-07-12 13:22 ` Felix Fietkau 2016-07-12 13:23 ` Felix Fietkau 2016-07-18 21:49 ` Toke Høiland-Jørgensen 2 siblings, 2 replies; 23+ messages in thread From: Dave Taht @ 2016-07-12 12:44 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Felix Fietkau, linux-wireless, Michal Kazior On Tue, Jul 12, 2016 at 2:28 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote: > Felix Fietkau <nbd@nbd.name> writes: > >> Hi, >> >> With Toke's ath9k txq patch I've noticed a pretty nasty performance >> regression when running local iperf on an AP (running the txq stuff) to >> a wireless client. >> >> Here's some things that I found: >> - when I use only one TCP stream I get around 90-110 Mbit/s >> - when running multiple TCP streams, I get only 35-40 Mbit/s total >> - fairness between TCP streams looks completely fine >> - there's no big queue buildup, the code never actually drops any packets >> - if I put a hack in the fq code to force the hash to a constant value >> (effectively disabling fq without disabling codel), the problem >> disappears and even multiple streams get proper performance. >> >> Please let me know if you have any ideas. > > Hmm, I see two TCP streams get about the same aggregate throughput as > one, both when started from the AP and when started one hop away. > However, do see TCP flows take a while to ramp up when started from the > AP - a short test gets ~70Mbps when run from one hop away and ~50Mbps > when run from the AP. how long are you running the tests for? > > (I seem to recall the ramp-up issue to be there pre-patch as well, > though). The original ath10k code had a "swag" at hooking in an estimator from rate control. With minstrel in play that can be done better in the ath9k. > As for why this would happen... 
There could be a bug in the dequeue code > somewhere, but since you get better performance from sticking everything > into one queue, my best guess would be that the client is choking on the > interleaved packets? I.e. expending more CPU when it can't stick > subsequent packets into the same TCP flow? I share this concern. The quantum is? I am not opposed to a larger quantum (2 full size packets = 3028 in this case?). > -Toke > -- > To unsubscribe from this list: send the line "unsubscribe linux-wireless" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 12:44 ` Dave Taht @ 2016-07-12 12:57 ` Toke Høiland-Jørgensen 2016-07-12 13:03 ` Dave Taht 2016-07-12 13:22 ` Felix Fietkau 1 sibling, 1 reply; 23+ messages in thread From: Toke Høiland-Jørgensen @ 2016-07-12 12:57 UTC (permalink / raw) To: Dave Taht; +Cc: Felix Fietkau, linux-wireless, Michal Kazior Dave Taht <dave.taht@gmail.com> writes: >> As for why this would happen... There could be a bug in the dequeue code >> somewhere, but since you get better performance from sticking everything >> into one queue, my best guess would be that the client is choking on the >> interleaved packets? I.e. expending more CPU when it can't stick >> subsequent packets into the same TCP flow? > > I share this concern. > > The quantum is? I am not opposed to a larger quantum (2 full size > packets = 3028 in this case?). The quantum is hard-coded to 300 bytes in the current implementation (see net/fq_impl.h). -Toke ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 12:57 ` Toke Høiland-Jørgensen @ 2016-07-12 13:03 ` Dave Taht 0 siblings, 0 replies; 23+ messages in thread From: Dave Taht @ 2016-07-12 13:03 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Felix Fietkau, linux-wireless, Michal Kazior, make-wifi-fast On Tue, Jul 12, 2016 at 2:57 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote: > Dave Taht <dave.taht@gmail.com> writes: > >>> As for why this would happen... There could be a bug in the dequeue code >>> somewhere, but since you get better performance from sticking everything >>> into one queue, my best guess would be that the client is choking on the >>> interleaved packets? I.e. expending more CPU when it can't stick >>> subsequent packets into the same TCP flow? >> >> I share this concern. >> >> The quantum is? I am not opposed to a larger quantum (2 full size >> packets = 3028 in this case?). > > The quantum is hard-coded to 300 bytes in the current implementation > (see net/fq_impl.h). don't do that. :) A single full size packet is preferable, and saves going around the main dequeue loop 5-6 times per flow on this workload. My tests on the prior patch set were mostly at the larger quantum. > -Toke -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 12:44 ` Dave Taht 2016-07-12 12:57 ` Toke Høiland-Jørgensen @ 2016-07-12 13:22 ` Felix Fietkau 1 sibling, 0 replies; 23+ messages in thread From: Felix Fietkau @ 2016-07-12 13:22 UTC (permalink / raw) To: Dave Taht, Toke Høiland-Jørgensen; +Cc: linux-wireless, Michal Kazior On 2016-07-12 14:44, Dave Taht wrote: > On Tue, Jul 12, 2016 at 2:28 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote: >> Felix Fietkau <nbd@nbd.name> writes: >> >>> Hi, >>> >>> With Toke's ath9k txq patch I've noticed a pretty nasty performance >>> regression when running local iperf on an AP (running the txq stuff) to >>> a wireless client. >>> >>> Here's some things that I found: >>> - when I use only one TCP stream I get around 90-110 Mbit/s >>> - when running multiple TCP streams, I get only 35-40 Mbit/s total >>> - fairness between TCP streams looks completely fine >>> - there's no big queue buildup, the code never actually drops any packets >>> - if I put a hack in the fq code to force the hash to a constant value >>> (effectively disabling fq without disabling codel), the problem >>> disappears and even multiple streams get proper performance. >>> >>> Please let me know if you have any ideas. >> >> Hmm, I see two TCP streams get about the same aggregate throughput as >> one, both when started from the AP and when started one hop away. >> However, do see TCP flows take a while to ramp up when started from the >> AP - a short test gets ~70Mbps when run from one hop away and ~50Mbps >> when run from the AP. how long are you running the tests for? >> >> (I seem to recall the ramp-up issue to be there pre-patch as well, >> though). > > The original ath10k code had a "swag" at hooking in an estimator from > rate control. > With minstrel in play that can be done better in the ath9k. > >> As for why this would happen... 
There could be a bug in the dequeue code >> somewhere, but since you get better performance from sticking everything >> into one queue, my best guess would be that the client is choking on the >> interleaved packets? I.e. expending more CPU when it can't stick >> subsequent packets into the same TCP flow? > > I share this concern. > > The quantum is? I am not opposed to a larger quantum (2 full size > packets = 3028 in this case?). I also agree with increasing quantum, however that did not make any difference in my tests. - Felix ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 12:28 ` Toke Høiland-Jørgensen 2016-07-12 12:44 ` Dave Taht @ 2016-07-12 13:23 ` Felix Fietkau 2016-07-18 21:49 ` Toke Høiland-Jørgensen 2 siblings, 0 replies; 23+ messages in thread From: Felix Fietkau @ 2016-07-12 13:23 UTC (permalink / raw) To: Toke Høiland-Jørgensen; +Cc: linux-wireless, Michal Kazior On 2016-07-12 14:28, Toke Høiland-Jørgensen wrote: > Felix Fietkau <nbd@nbd.name> writes: > >> Hi, >> >> With Toke's ath9k txq patch I've noticed a pretty nasty performance >> regression when running local iperf on an AP (running the txq stuff) to >> a wireless client. >> >> Here's some things that I found: >> - when I use only one TCP stream I get around 90-110 Mbit/s >> - when running multiple TCP streams, I get only 35-40 Mbit/s total >> - fairness between TCP streams looks completely fine >> - there's no big queue buildup, the code never actually drops any packets >> - if I put a hack in the fq code to force the hash to a constant value >> (effectively disabling fq without disabling codel), the problem >> disappears and even multiple streams get proper performance. >> >> Please let me know if you have any ideas. > > Hmm, I see two TCP streams get about the same aggregate throughput as > one, both when started from the AP and when started one hop away. > However, do see TCP flows take a while to ramp up when started from the > AP - a short test gets ~70Mbps when run from one hop away and ~50Mbps > when run from the AP. how long are you running the tests for? Long enough to see that it's not ramping up. > (I seem to recall the ramp-up issue to be there pre-patch as well, > though). > > As for why this would happen... There could be a bug in the dequeue code > somewhere, but since you get better performance from sticking everything > into one queue, my best guess would be that the client is choking on the > interleaved packets? I.e. 
expending more CPU when it can't stick > subsequent packets into the same TCP flow? Could be. I'll see what the tests show when I push traffic through the AP instead of from the AP. - Felix ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 12:28 ` Toke Høiland-Jørgensen 2016-07-12 12:44 ` Dave Taht 2016-07-12 13:23 ` Felix Fietkau @ 2016-07-18 21:49 ` Toke Høiland-Jørgensen 2016-07-18 22:02 ` Dave Taht 2 siblings, 1 reply; 23+ messages in thread From: Toke Høiland-Jørgensen @ 2016-07-18 21:49 UTC (permalink / raw) To: Felix Fietkau; +Cc: linux-wireless, Michal Kazior Toke Høiland-Jørgensen <toke@toke.dk> writes: > Felix Fietkau <nbd@nbd.name> writes: > >> Hi, >> >> With Toke's ath9k txq patch I've noticed a pretty nasty performance >> regression when running local iperf on an AP (running the txq stuff) to >> a wireless client. >> >> Here's some things that I found: >> - when I use only one TCP stream I get around 90-110 Mbit/s >> - when running multiple TCP streams, I get only 35-40 Mbit/s total >> - fairness between TCP streams looks completely fine >> - there's no big queue buildup, the code never actually drops any packets >> - if I put a hack in the fq code to force the hash to a constant value >> (effectively disabling fq without disabling codel), the problem >> disappears and even multiple streams get proper performance. >> >> Please let me know if you have any ideas. > > Hmm, I see two TCP streams get about the same aggregate throughput as > one, both when started from the AP and when started one hop away. So while I have still not been able to reproduce the issue you described, I have seen something else that is at least puzzling, and may or may not be related: When monitoring the output of /sys/kernel/debug/ieee80211/phy0/aqm I see that all stations have their queues empty all the way to zero several times per second. This is a bit puzzling; the queue should be kept under control, but really shouldn't empty completely. I figure this might also be the reason why you're seeing degraded performance... 
Since the stats output doesn't include a counter for drops, I haven't gotten any further with figuring out if it's CoDel that's being too aggressive, or what is happening. But will probably add that in and take another look. -Toke ^ permalink raw reply [flat|nested] 23+ messages in thread
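The empty-queue pattern Toke describes can be spotted from userspace by sampling the aqm debugfs file. A rough sketch — the column layout (field 6 = backlog-packets) is assumed from the stats header quoted later in this thread, and the demo station addresses are made up; on a live system point it at /sys/kernel/debug/ieee80211/phy0/aqm instead of the sample file:

```shell
#!/bin/sh
# Sketch: flag stations whose per-txq backlog has drained to zero.
# Assumes field 6 of each row is backlog-packets, per the aqm header
# quoted elsewhere in this thread.
report_empty_queues() {
    awk 'NR > 1 && $6 == 0 { print "queue empty:", $1, $2, "tid", $3 }' "$1"
}

# Demo on a captured sample (hypothetical second station address):
sample=$(mktemp)
cat > "$sample" <<'EOF'
ifname addr tid ac backlog-bytes backlog-packets flows drops marks
wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0
wlp2s0 aa:bb:cc:dd:ee:ff 0 2 4521 3 12 0 0
EOF
report_empty_queues "$sample"
# prints: queue empty: wlp2s0 04:f0:21:1e:74:20 tid 0
rm -f "$sample"
```

Run under `watch` (or in a loop with a sub-second sleep), this makes the several-times-per-second drain to zero visible without instrumenting the kernel.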
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-18 21:49 ` Toke Høiland-Jørgensen @ 2016-07-18 22:02 ` Dave Taht 0 siblings, 0 replies; 23+ messages in thread From: Dave Taht @ 2016-07-18 22:02 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Felix Fietkau, linux-wireless, Michal Kazior Just to add another datapoint, the "rack" optimization for tcp entered the kernel recently. It has some "interesting" timing/batching sensitive behaviors. While the TSO case is described, the packet aggregation case seems similar, and is not. https://www.ietf.org/proceedings/96/slides/slides-96-tcpm-3.pdf 10 Jan 2016 https://kernelnewbies.org/Linux_4.4#head-2583c31a65e6592bef9af426a78940078df7f630 The draft was significantly updated this month. https://tools.ietf.org/html/draft-cheng-tcpm-rack-01 -- Andrew Shewmaker On Mon, Jul 18, 2016 at 2:49 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote: > Toke Høiland-Jørgensen <toke@toke.dk> writes: > >> Felix Fietkau <nbd@nbd.name> writes: >> >>> Hi, >>> >>> With Toke's ath9k txq patch I've noticed a pretty nasty performance >>> regression when running local iperf on an AP (running the txq stuff) to >>> a wireless client. >>> >>> Here's some things that I found: >>> - when I use only one TCP stream I get around 90-110 Mbit/s >>> - when running multiple TCP streams, I get only 35-40 Mbit/s total >>> - fairness between TCP streams looks completely fine >>> - there's no big queue buildup, the code never actually drops any packets >>> - if I put a hack in the fq code to force the hash to a constant value >>> (effectively disabling fq without disabling codel), the problem >>> disappears and even multiple streams get proper performance. >>> >>> Please let me know if you have any ideas. >> >> Hmm, I see two TCP streams get about the same aggregate throughput as >> one, both when started from the AP and when started one hop away. 
> > So while I have still not been able to reproduce the issue you > described, I have seen something else that is at least puzzling, and may > or may not be related: > > When monitoring the output of /sys/kernel/debug/ieee80211/phy0/aqm I see > that all stations have their queues empty all the way to zero several > times per second. This is a bit puzzling; the queue should be kept under > control, but really shouldn't empty completely. I figure this might also > be the reason why you're seeing degraded performance... > > Since the stats output doesn't include a counter for drops, I haven't > gotten any further with figuring out if it's CoDel that's being too > aggressive, or what is happening. But will probably add that in and take > another look. > > -Toke > -- > To unsubscribe from this list: send the line "unsubscribe linux-wireless" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Dave Täht Let's go make home routers and wifi faster! With better software! http://blog.cerowrt.org ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 10:09 TCP performance regression in mac80211 triggered by the fq code Felix Fietkau 2016-07-12 12:13 ` Dave Taht 2016-07-12 12:28 ` Toke Høiland-Jørgensen @ 2016-07-19 13:13 ` Michal Kazior 2016-07-19 14:32 ` Felix Fietkau 2016-07-20 14:45 ` Toke Høiland-Jørgensen 2016-07-22 10:51 ` Toke Høiland-Jørgensen 4 siblings, 1 reply; 23+ messages in thread From: Michal Kazior @ 2016-07-19 13:13 UTC (permalink / raw) To: Felix Fietkau; +Cc: linux-wireless, Toke Høiland-Jørgensen On 12 July 2016 at 12:09, Felix Fietkau <nbd@nbd.name> wrote: > Hi, > > With Toke's ath9k txq patch I've noticed a pretty nasty performance > regression when running local iperf on an AP (running the txq stuff) to > a wireless client. > > Here's some things that I found: > - when I use only one TCP stream I get around 90-110 Mbit/s > - when running multiple TCP streams, I get only 35-40 Mbit/s total What is the baseline here (i.e. without fq/txq stuff)? Is it ~100mbps? Did you try running multiple streams, each on separate tids (matching the same AC perhaps) or different clients? Michał ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-19 13:13 ` Michal Kazior @ 2016-07-19 14:32 ` Felix Fietkau 0 siblings, 0 replies; 23+ messages in thread From: Felix Fietkau @ 2016-07-19 14:32 UTC (permalink / raw) To: Michal Kazior; +Cc: linux-wireless, Toke Høiland-Jørgensen On 2016-07-19 15:13, Michal Kazior wrote: > On 12 July 2016 at 12:09, Felix Fietkau <nbd@nbd.name> wrote: >> Hi, >> >> With Toke's ath9k txq patch I've noticed a pretty nasty performance >> regression when running local iperf on an AP (running the txq stuff) to >> a wireless client. >> >> Here's some things that I found: >> - when I use only one TCP stream I get around 90-110 Mbit/s >> - when running multiple TCP streams, I get only 35-40 Mbit/s total > > What is the baseline here (i.e. without fq/txq stuff)? Is it ~100mbps? Yes. > Did you try running multiple streams, each on separate tids (matching > the same AC perhaps) or different clients? Not yet. - Felix ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 10:09 TCP performance regression in mac80211 triggered by the fq code Felix Fietkau ` (2 preceding siblings ...) 2016-07-19 13:13 ` Michal Kazior @ 2016-07-20 14:45 ` Toke Høiland-Jørgensen 2016-07-20 15:24 ` Toke Høiland-Jørgensen 2016-07-22 10:51 ` Toke Høiland-Jørgensen 4 siblings, 1 reply; 23+ messages in thread From: Toke Høiland-Jørgensen @ 2016-07-20 14:45 UTC (permalink / raw) To: Felix Fietkau; +Cc: linux-wireless, Michal Kazior Felix Fietkau <nbd@nbd.name> writes: > - if I put a hack in the fq code to force the hash to a constant value > (effectively disabling fq without disabling codel), the problem > disappears and even multiple streams get proper performance. There's definitely something iffy about the hashing. Here's the relevant line from the aqm debug file after running a single TCP stream for 60 seconds to that station:

ifname addr tid ac backlog-bytes backlog-packets flows drops marks overlimit collisions tx-bytes tx-packets
wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0 0 0 717758966 467925

(there are two extra fields here; I added per-txq CoDel stats, will send a patch later). This shows that the txq has 146 flows associated with that one TCP flow. Looking at this over time, it seems that each time the queue runs empty (which happens way too often, which is what I was originally investigating), another flow is assigned. Michal, any idea why? :) -Toke ^ permalink raw reply [flat|nested] 23+ messages in thread
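For reference, the stats line quoted above can be paired with its header mechanically; a small sketch (header and values taken verbatim from the debug output in the mail — which two fields are the added per-txq CoDel stats is not stated, so no assumption is made about that here):

```python
# Sketch: zip the aqm header with the stats row quoted in the mail,
# giving named access to each field.
header = ("ifname addr tid ac backlog-bytes backlog-packets flows drops "
          "marks overlimit collisions tx-bytes tx-packets").split()
row = "wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0 0 0 717758966 467925".split()
stats = dict(zip(header, row))

print(stats["flows"])   # 146 — flows counted for a single TCP stream
print(stats["drops"])   # 16
```

The oddity Toke points at is immediately visible this way: `flows` is 146 for what should be a single five-tuple.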
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-20 14:45 ` Toke Høiland-Jørgensen @ 2016-07-20 15:24 ` Toke Høiland-Jørgensen 2016-07-25 5:15 ` Michal Kazior 0 siblings, 1 reply; 23+ messages in thread From: Toke Høiland-Jørgensen @ 2016-07-20 15:24 UTC (permalink / raw) To: Felix Fietkau; +Cc: linux-wireless, Michal Kazior Toke Høiland-Jørgensen <toke@toke.dk> writes: > Felix Fietkau <nbd@nbd.name> writes: > >> - if I put a hack in the fq code to force the hash to a constant value >> (effectively disabling fq without disabling codel), the problem >> disappears and even multiple streams get proper performance. > > There's definitely something iffy about the hashing. Here's the output > relevant line from the aqm debug file after running a single TCP stream > for 60 seconds to that station: > > ifname addr tid ac backlog-bytes backlog-packets flows drops marks overlimit collisions > tx-bytes tx-packets > wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0 0 0 717758966 467925 > > (there are two extra fields here; I added per-txq CoDel stats, will send > a patch later). > > This shows that the txq has 146 flows associated from that one TCP flow. > Looking at this over time, it seems that each time the queue runs empty > (which happens way too often, which is what I was originally > investigating), another flow is assigned. > > Michal, any idea why? :) And to answer this: because the flow is being freed to be reassigned when it runs empty, but the counter is not decremented. Is this deliberate? I.e. is the 'flows' var supposed to be a total 'new_flows' counter and not a measure of the current number of assigned flows? -Toke ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-20 15:24 ` Toke Høiland-Jørgensen @ 2016-07-25 5:15 ` Michal Kazior 2016-07-27 17:31 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 23+ messages in thread From: Michal Kazior @ 2016-07-25 5:15 UTC (permalink / raw) To: Toke Høiland-Jørgensen; +Cc: Felix Fietkau, linux-wireless On 20 July 2016 at 17:24, Toke Høiland-Jørgensen <toke@toke.dk> wrote: > Toke Høiland-Jørgensen <toke@toke.dk> writes: > >> Felix Fietkau <nbd@nbd.name> writes: >> >>> - if I put a hack in the fq code to force the hash to a constant value >>> (effectively disabling fq without disabling codel), the problem >>> disappears and even multiple streams get proper performance. >> >> There's definitely something iffy about the hashing. Here's the output >> relevant line from the aqm debug file after running a single TCP stream >> for 60 seconds to that station: >> >> ifname addr tid ac backlog-bytes backlog-packets flows drops marks overlimit collisions >> tx-bytes tx-packets >> wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0 0 0 717758966 467925 >> >> (there are two extra fields here; I added per-txq CoDel stats, will send >> a patch later). >> >> This shows that the txq has 146 flows associated from that one TCP flow. >> Looking at this over time, it seems that each time the queue runs empty >> (which happens way too often, which is what I was originally >> investigating), another flow is assigned. >> >> Michal, any idea why? :) > > And to answer this: because the flow is being freed to be reassigned > when it runs empty, but the counter is not decremented. Is this > deliberate? I.e. is the 'flows' var supposed to be a total 'new_flows' > counter and not a measure of the current number of assigned flows? Yes, it is deliberate. fq_codel qdisc does the same thing and I just mimicked it. Michał ^ permalink raw reply [flat|nested] 23+ messages in thread
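To illustrate the semantics Michal confirms here — the counter tracks cumulative flow (re)assignments, not concurrently active flows — a toy model of the counting behaviour (this is not the mac80211 code, just a minimal sketch of an fq_codel-style "new flow" counter under stated assumptions about slot reuse):

```python
# Toy model: a flow slot is detached once its queue drains, and
# re-attaching it bumps the counter again, so one TCP stream whose
# queue repeatedly runs empty inflates the cumulative 'flows' count.
class Flow:
    def __init__(self):
        self.backlog = 0
        self.attached = False

class Tin:
    def __init__(self, n_slots=16):
        self.slots = [Flow() for _ in range(n_slots)]
        self.flows = 0  # cumulative (re)assignment counter, not active flows

    def enqueue(self, flow_hash):
        f = self.slots[flow_hash % len(self.slots)]
        if not f.attached:      # empty slot: counts as a "new" flow
            f.attached = True
            self.flows += 1
        f.backlog += 1

    def drain(self, flow_hash):
        f = self.slots[flow_hash % len(self.slots)]
        f.backlog = 0
        f.attached = False      # queue ran empty: slot is freed

tin = Tin()
for _ in range(146):            # one stream, queue drains each round
    tin.enqueue(flow_hash=42)
    tin.drain(flow_hash=42)

print(tin.flows)                # 146, from a single flow hash
```

This reproduces the "146 flows from one TCP stream" reading in Toke's debug output: the number grows once per drain cycle even though only one hash bucket is ever used.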
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-25 5:15 ` Michal Kazior @ 2016-07-27 17:31 ` Toke Høiland-Jørgensen 0 siblings, 0 replies; 23+ messages in thread From: Toke Høiland-Jørgensen @ 2016-07-27 17:31 UTC (permalink / raw) To: Michal Kazior; +Cc: Felix Fietkau, linux-wireless Michal Kazior <michal.kazior@tieto.com> writes: > On 20 July 2016 at 17:24, Toke Høiland-Jørgensen <toke@toke.dk> wrote: >> Toke Høiland-Jørgensen <toke@toke.dk> writes: >> >>> Felix Fietkau <nbd@nbd.name> writes: >>> >>>> - if I put a hack in the fq code to force the hash to a constant value >>>> (effectively disabling fq without disabling codel), the problem >>>> disappears and even multiple streams get proper performance. >>> >>> There's definitely something iffy about the hashing. Here's the output >>> relevant line from the aqm debug file after running a single TCP stream >>> for 60 seconds to that station: >>> >>> ifname addr tid ac backlog-bytes backlog-packets flows drops marks overlimit collisions >>> tx-bytes tx-packets >>> wlp2s0 04:f0:21:1e:74:20 0 2 0 0 146 16 0 0 0 717758966 467925 >>> >>> (there are two extra fields here; I added per-txq CoDel stats, will send >>> a patch later). >>> >>> This shows that the txq has 146 flows associated from that one TCP flow. >>> Looking at this over time, it seems that each time the queue runs empty >>> (which happens way too often, which is what I was originally >>> investigating), another flow is assigned. >>> >>> Michal, any idea why? :) >> >> And to answer this: because the flow is being freed to be reassigned >> when it runs empty, but the counter is not decremented. Is this >> deliberate? I.e. is the 'flows' var supposed to be a total 'new_flows' >> counter and not a measure of the current number of assigned flows? > > Yes, it is deliberate. fq_codel qdisc does the same thing and I just > mimicked it. Right. Think it was the name that sent me down the wrong track ('flows' instead of 'new_flows'). 
Especially since, the way you structured things, having a counter for how many flows are currently assigned to each tid might actually make sense... -Toke ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: TCP performance regression in mac80211 triggered by the fq code 2016-07-12 10:09 TCP performance regression in mac80211 triggered by the fq code Felix Fietkau ` (3 preceding siblings ...) 2016-07-20 14:45 ` Toke Høiland-Jørgensen @ 2016-07-22 10:51 ` Toke Høiland-Jørgensen 4 siblings, 0 replies; 23+ messages in thread From: Toke Høiland-Jørgensen @ 2016-07-22 10:51 UTC (permalink / raw) To: Felix Fietkau; +Cc: linux-wireless, Michal Kazior, Dave Taht Felix Fietkau <nbd@nbd.name> writes: > Please let me know if you have any ideas. Two more things to try:

- Andrew McGregor mentioned that some versions of iperf on OS X have a threading bug when running multiple streams against the same server instance. So try running two separate instances of iperf on the server side, or run netperf instead.
- It could be that HyStart is acting up. You could try disabling it (echo 0 > /sys/module/tcp_cubic/parameters/hystart) and see if that makes a difference.

-Toke ^ permalink raw reply [flat|nested] 23+ messages in thread
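The HyStart suggestion can be wrapped in a small guard so it is a no-op on systems where the knob is absent (tcp_cubic not loaded, or a non-Linux box); the sysfs path is the one from the mail, everything else is defensive scaffolding:

```shell
#!/bin/sh
# Sketch: disable CUBIC's HyStart for a test run, guarding against the
# module parameter not being present. The function form also allows a
# dry run against a scratch file.
disable_hystart() {
    knob="$1"
    if [ -w "$knob" ]; then
        echo "hystart was: $(cat "$knob")"
        echo 0 > "$knob"
    else
        echo "hystart knob not available, skipping"
    fi
}

disable_hystart /sys/module/tcp_cubic/parameters/hystart
```

Remember to restore the previous value (normally 1) after the experiment, since the setting persists until the module is reloaded.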
end of thread, other threads:[~2016-07-27 17:31 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page --
2016-07-12 10:09 TCP performance regression in mac80211 triggered by the fq code Felix Fietkau
2016-07-12 12:13 ` Dave Taht
2016-07-12 13:21 ` Felix Fietkau
2016-07-12 14:02 ` Dave Taht
2016-07-13  7:57 ` Dave Taht
2016-07-13  8:53 ` Felix Fietkau
2016-07-13  9:13 ` Dave Taht
2016-07-19 13:10 ` Michal Kazior
2016-07-12 12:28 ` Toke Høiland-Jørgensen
2016-07-12 12:44 ` Dave Taht
2016-07-12 12:57 ` Toke Høiland-Jørgensen
2016-07-12 13:03 ` Dave Taht
2016-07-12 13:22 ` Felix Fietkau
2016-07-12 13:23 ` Felix Fietkau
2016-07-18 21:49 ` Toke Høiland-Jørgensen
2016-07-18 22:02 ` Dave Taht
2016-07-19 13:13 ` Michal Kazior
2016-07-19 14:32 ` Felix Fietkau
2016-07-20 14:45 ` Toke Høiland-Jørgensen
2016-07-20 15:24 ` Toke Høiland-Jørgensen
2016-07-25  5:15 ` Michal Kazior
2016-07-27 17:31 ` Toke Høiland-Jørgensen
2016-07-22 10:51 ` Toke Høiland-Jørgensen