* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 15:13 ` [RFC PATCH v2] tcp: TCP Small Queues Eric Dumazet
@ 2012-07-10 17:06 ` Eric Dumazet
2012-07-10 17:37 ` Yuchung Cheng
` (2 subsequent siblings)
3 siblings, 0 replies; 44+ messages in thread
From: Eric Dumazet @ 2012-07-10 17:06 UTC (permalink / raw)
To: David Miller; +Cc: nanditad, netdev, ycheng, codel, mattmathis, ncardwell
On Tue, 2012-07-10 at 17:13 +0200, Eric Dumazet wrote:
> This introduces TSQ (TCP Small Queues).
>
> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in the qdisc/dev layers
> at a given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2 = 20000: having smaller TSO
> packets can help reduce latencies of high-prio packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send the following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, with or without TSO/GSO, and with no reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and both sides' socket autotuning no longer uses 4 MBytes.
>
> As the skb destructor cannot restart xmit itself (as the qdisc lock
> might be taken at this point), we delegate the work to a tasklet. We
> use one tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
> but some drivers call it in their start_xmit() handler.
> These drivers should at least use BQL, or else a single TCP
> session can still fill the whole NIC TX ring, since TSQ will
> have no effect.
>
> Not-Yet-Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
By the way, Rick Jones asked me:
"Is there also any change in service demand?"
I copy my answer here since it's a very good point:
I worked on the idea of a CoDel-like feedback, to have a timed limit
instead of a byte limit ("allow up to 1ms of delay" in the qdisc/dev
queue). But it seemed a bit complex: I would need to add skb fields to
properly track the residence time (sojourn time) of queued packets.
An alternative would be a per-tcp-socket tracking array, but it might
be expensive to search for a packet in it...
With multiqueue devices or bad qdiscs, skb orphaning can happen out of
order, so the lookup can be relatively expensive.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 15:13 ` [RFC PATCH v2] tcp: TCP Small Queues Eric Dumazet
2012-07-10 17:06 ` Eric Dumazet
@ 2012-07-10 17:37 ` Yuchung Cheng
2012-07-10 18:32 ` Eric Dumazet
2012-07-11 15:11 ` Eric Dumazet
2012-07-12 13:33 ` John Heffner
3 siblings, 1 reply; 44+ messages in thread
From: Yuchung Cheng @ 2012-07-10 17:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, dave.taht, netdev, codel, therbert, mattmathis,
nanditad, ncardwell, andrewmcgr
On Tue, Jul 10, 2012 at 8:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This introduces TSQ (TCP Small Queues).
>
> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in the qdisc/dev layers
> at a given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2 = 20000: having smaller TSO
> packets can help reduce latencies of high-prio packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send the following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, with or without TSO/GSO, and with no reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and both sides' socket autotuning no longer uses 4 MBytes.
>
> As the skb destructor cannot restart xmit itself (as the qdisc lock
> might be taken at this point), we delegate the work to a tasklet. We
> use one tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
> but some drivers call it in their start_xmit() handler.
> These drivers should at least use BQL, or else a single TCP
> session can still fill the whole NIC TX ring, since TSQ will
> have no effect.
>
> Not-Yet-Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/linux/tcp.h | 9 ++
> include/net/tcp.h | 3
> net/ipv4/sysctl_net_ipv4.c | 7 +
> net/ipv4/tcp.c | 14 ++-
> net/ipv4/tcp_minisocks.c | 1
> net/ipv4/tcp_output.c | 132 ++++++++++++++++++++++++++++++++++-
> 6 files changed, 160 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 7d3bced..55b8cf9 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -339,6 +339,9 @@ struct tcp_sock {
> u32 rcv_tstamp; /* timestamp of last received ACK (for keepalives) */
> u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
>
> + struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
> + unsigned long tsq_flags;
> +
> /* Data for direct copy to user */
> struct {
> struct sk_buff_head prequeue;
> @@ -494,6 +497,12 @@ struct tcp_sock {
> struct tcp_cookie_values *cookie_values;
> };
>
> +enum tsq_flags {
> + TSQ_THROTTLED,
> + TSQ_QUEUED,
> + TSQ_OWNED, /* tcp_tasklet_func() found socket was locked */
> +};
> +
> static inline struct tcp_sock *tcp_sk(const struct sock *sk)
> {
> return (struct tcp_sock *)sk;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 53fb7d8..3a6ed09 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -253,6 +253,7 @@ extern int sysctl_tcp_cookie_size;
> extern int sysctl_tcp_thin_linear_timeouts;
> extern int sysctl_tcp_thin_dupack;
> extern int sysctl_tcp_early_retrans;
> +extern int sysctl_tcp_limit_output_bytes;
>
> extern atomic_long_t tcp_memory_allocated;
> extern struct percpu_counter tcp_sockets_allocated;
> @@ -321,6 +322,8 @@ extern struct proto tcp_prot;
>
> extern void tcp_init_mem(struct net *net);
>
> +extern void tcp_tasklet_init(void);
> +
> extern void tcp_v4_err(struct sk_buff *skb, u32);
>
> extern void tcp_shutdown (struct sock *sk, int how);
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 12aa0c5..70730f7 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -598,6 +598,13 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec
> },
> + {
> + .procname = "tcp_limit_output_bytes",
> + .data = &sysctl_tcp_limit_output_bytes,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec
> + },
> #ifdef CONFIG_NET_DMA
> {
> .procname = "tcp_dma_copybreak",
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3ba605f..8838bd2 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -376,6 +376,7 @@ void tcp_init_sock(struct sock *sk)
> skb_queue_head_init(&tp->out_of_order_queue);
> tcp_init_xmit_timers(sk);
> tcp_prequeue_init(tp);
> + INIT_LIST_HEAD(&tp->tsq_node);
>
> icsk->icsk_rto = TCP_TIMEOUT_INIT;
> tp->mdev = TCP_TIMEOUT_INIT;
> @@ -786,15 +787,17 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
> int large_allowed)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> - u32 xmit_size_goal, old_size_goal;
> + u32 xmit_size_goal, old_size_goal, gso_max_size;
>
> xmit_size_goal = mss_now;
>
> if (large_allowed && sk_can_gso(sk)) {
> - xmit_size_goal = ((sk->sk_gso_max_size - 1) -
> - inet_csk(sk)->icsk_af_ops->net_header_len -
> - inet_csk(sk)->icsk_ext_hdr_len -
> - tp->tcp_header_len);
> + gso_max_size = min_t(u32, sk->sk_gso_max_size,
> + sysctl_tcp_limit_output_bytes >> 1);
> + xmit_size_goal = (gso_max_size - 1) -
> + inet_csk(sk)->icsk_af_ops->net_header_len -
> + inet_csk(sk)->icsk_ext_hdr_len -
> + tp->tcp_header_len;
>
> xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
>
> @@ -3573,4 +3576,5 @@ void __init tcp_init(void)
> tcp_secret_primary = &tcp_secret_one;
> tcp_secret_retiring = &tcp_secret_two;
> tcp_secret_secondary = &tcp_secret_two;
> + tcp_tasklet_init();
> }
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 72b7c63..83b358f 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -482,6 +482,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
> treq->snt_isn + 1 + tcp_s_data_size(oldtp);
>
> tcp_prequeue_init(newtp);
> + INIT_LIST_HEAD(&newtp->tsq_node);
>
> tcp_init_wl(newtp, treq->rcv_isn);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c465d3e..991ae45 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -50,6 +50,9 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
> */
> int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
>
> +/* Default TSQ limit of two TSO segments */
> +int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
> +
> /* This limits the percentage of the congestion window which we
> * will allow a single TSO frame to consume. Building TSO frames
> * which are too large can cause TCP streams to be bursty.
> @@ -65,6 +68,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
> int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
> EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
>
> +static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> + int push_one, gfp_t gfp);
>
> /* Account for new data that has been sent to the network. */
> static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
> @@ -783,6 +788,118 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
> return size;
> }
>
> +
> +/* TCP SMALL QUEUES (TSQ)
> + *
> + * TSQ's goal is to keep a small number of skbs per tcp flow in tx queues (qdisc+dev)
> + * to reduce RTT and bufferbloat.
> + * We do this using a special skb destructor (tcp_wfree).
> + *
> + * It's important that tcp_wfree() can be replaced by sock_wfree() in the
> + * event the skb needs to be reallocated in a driver.
> + * The invariant being skb->truesize subtracted from sk->sk_wmem_alloc.
> + *
> + * Since transmit from skb destructor is forbidden, we use a tasklet
> + * to process all sockets that eventually need to send more skbs.
> + * We use one tasklet per cpu, with its own queue of sockets.
> + */
> +struct tsq_tasklet {
> + struct tasklet_struct tasklet;
> + struct list_head head; /* queue of tcp sockets */
> +};
> +static DEFINE_PER_CPU(struct tsq_tasklet, tsq_tasklet);
> +
> +/*
> + * One tasklet per cpu tries to send more skbs.
> + * We run in tasklet context but need to disable irqs when
> + * transferring tsq->head because tcp_wfree() might
> + * interrupt us (non NAPI drivers)
> + */
> +static void tcp_tasklet_func(unsigned long data)
> +{
> + struct tsq_tasklet *tsq = (struct tsq_tasklet *)data;
> + LIST_HEAD(list);
> + unsigned long flags;
> + struct list_head *q, *n;
> + struct tcp_sock *tp;
> + struct sock *sk;
> +
> + local_irq_save(flags);
> + list_splice_init(&tsq->head, &list);
> + local_irq_restore(flags);
> +
> + list_for_each_safe(q, n, &list) {
> + tp = list_entry(q, struct tcp_sock, tsq_node);
> + list_del(&tp->tsq_node);
> +
> + sk = (struct sock *)tp;
> + bh_lock_sock(sk);
> +
> + if (!sock_owned_by_user(sk)) {
> + if ((1 << sk->sk_state) &
> + (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED))
> + tcp_write_xmit(sk,
> + tcp_current_mss(sk),
> + 0, 0,
> + GFP_ATOMIC);
Is this case possible: the app does a large send and immediately closes
the socket. Then the queue is throttled and tcp_write_xmit is called
back when the state is TCP_FIN_WAIT1.
I think tcp_write_xmit should continue regardless of the current state,
because the send may be throttled/delayed but the state change is
synchronous.
> + } else {
> + /* TODO:
> + * setup a timer, or check TSQ_OWNED in release_sock()
> + */
> + set_bit(TSQ_OWNED, &tp->tsq_flags);
> + }
> + bh_unlock_sock(sk);
> +
> + clear_bit(TSQ_QUEUED, &tp->tsq_flags);
> + sk_free(sk);
> + }
> +}
> +
> +void __init tcp_tasklet_init(void)
> +{
> + int i;
> +
> + for_each_possible_cpu(i) {
> + struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
> +
> + INIT_LIST_HEAD(&tsq->head);
> + tasklet_init(&tsq->tasklet,
> + tcp_tasklet_func,
> + (unsigned long)tsq);
> + }
> +}
> +
> +/*
> + * Write buffer destructor automatically called from kfree_skb.
> + * We can't xmit new skbs from this context, as we might already
> + * hold qdisc lock.
> + */
> +void tcp_wfree(struct sk_buff *skb)
> +{
> + struct sock *sk = skb->sk;
> + struct tcp_sock *tp = tcp_sk(sk);
> +
> + if (test_and_clear_bit(TSQ_THROTTLED, &tp->tsq_flags) &&
> + !test_and_set_bit(TSQ_QUEUED, &tp->tsq_flags)) {
> + unsigned long flags;
> + struct tsq_tasklet *tsq;
> +
> + /* Keep a ref on socket.
> + * This last ref will be released in tcp_tasklet_func()
> + */
> + atomic_sub(skb->truesize - 1, &sk->sk_wmem_alloc);
> +
> + /* queue this socket to tasklet queue */
> + local_irq_save(flags);
> + tsq = &__get_cpu_var(tsq_tasklet);
> + list_add(&tp->tsq_node, &tsq->head);
> + tasklet_schedule(&tsq->tasklet);
> + local_irq_restore(flags);
> + } else {
> + sock_wfree(skb);
> + }
> +}
> +
> /* This routine actually transmits TCP packets queued in by
> * tcp_do_sendmsg(). This is used by both the initial
> * transmission and possible later retransmissions.
> @@ -844,7 +961,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>
> skb_push(skb, tcp_header_size);
> skb_reset_transport_header(skb);
> - skb_set_owner_w(skb, sk);
> +
> + skb_orphan(skb);
> + skb->sk = sk;
> + skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
> + tcp_wfree : sock_wfree;
> + atomic_add(skb->truesize, &sk->sk_wmem_alloc);
>
> /* Build TCP header and checksum it. */
> th = tcp_hdr(skb);
> @@ -1780,6 +1902,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> while ((skb = tcp_send_head(sk))) {
> unsigned int limit;
>
> +
> tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
> BUG_ON(!tso_segs);
>
> @@ -1800,6 +1923,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> break;
> }
>
> + /* TSQ : sk_wmem_alloc accounts skb truesize,
> + * including skb overhead. But that's OK.
> + */
> + if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> + set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> + break;
> + }
> limit = mss_now;
> if (tso_segs > 1 && !tcp_urg_mode(tp))
> limit = tcp_mss_split_point(sk, skb, mss_now,
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 17:37 ` Yuchung Cheng
@ 2012-07-10 18:32 ` Eric Dumazet
0 siblings, 0 replies; 44+ messages in thread
From: Eric Dumazet @ 2012-07-10 18:32 UTC (permalink / raw)
To: Yuchung Cheng
Cc: David Miller, dave.taht, netdev, codel, therbert, mattmathis,
nanditad, ncardwell, andrewmcgr
On Tue, 2012-07-10 at 10:37 -0700, Yuchung Cheng wrote:
> On Tue, Jul 10, 2012 at 8:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > +
> > + if (!sock_owned_by_user(sk)) {
> > + if ((1 << sk->sk_state) &
> > + (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED))
> > + tcp_write_xmit(sk,
> > + tcp_current_mss(sk),
> > + 0, 0,
> > + GFP_ATOMIC);
> Is this case possible: the app does a large send and immediately closes
> the socket. Then the queue is throttled and tcp_write_xmit is called
> back when the state is TCP_FIN_WAIT1.
>
> I think tcp_write_xmit should continue regardless of the current state,
> because the send may be throttled/delayed but the state change is
> synchronous.
>
I need to test some allowed states, I think.
Maybe I missed some states, but I don't think we should call
tcp_write_xmit() if the socket is now in TIMEWAIT state?
(Because of the tasklet delay, we might handle TX completion _after_
the socket state change.)
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 15:13 ` [RFC PATCH v2] tcp: TCP Small Queues Eric Dumazet
2012-07-10 17:06 ` Eric Dumazet
2012-07-10 17:37 ` Yuchung Cheng
@ 2012-07-11 15:11 ` Eric Dumazet
2012-07-11 15:16 ` Ben Greear
` (2 more replies)
2012-07-12 13:33 ` John Heffner
3 siblings, 3 replies; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 15:11 UTC (permalink / raw)
To: David Miller; +Cc: nanditad, netdev, codel, mattmathis, ncardwell
On Tue, 2012-07-10 at 17:13 +0200, Eric Dumazet wrote:
> This introduces TSQ (TCP Small Queues).
>
> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in the qdisc/dev layers
> at a given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2 = 20000: having smaller TSO
> packets can help reduce latencies of high-prio packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send the following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, with or without TSO/GSO, and with no reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and both sides' socket autotuning no longer uses 4 MBytes.
>
> As the skb destructor cannot restart xmit itself (as the qdisc lock
> might be taken at this point), we delegate the work to a tasklet. We
> use one tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
> but some drivers call it in their start_xmit() handler.
> These drivers should at least use BQL, or else a single TCP
> session can still fill the whole NIC TX ring, since TSQ will
> have no effect.
I am going to send an official patch (I'll put a v3 tag in it).
I believe I did a full implementation, including the xmit() done
by the user at release_sock() time, if the tasklet found the socket
owned by the user.
Some bench results about the choice of 128KB as the default value:
64KB seems to be the 'good' value to reach max throughput on 10Gb links
on my lab machines (ixgbe adapters).
Using 128KB is a conservative value to allow link rate at 20Gbps.
Still, it allows less than 1ms of buffering on a Gbit link, and less
than 8ms on a 100Mbit link (instead of 130ms without Small Queues).
Tests using a single TCP flow.
Tests on 10Gbit links :
echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
392360 392360 16384 20.00 1389.53 10^6bits/s 0.52 S 4.30 S 0.737 1.014 usec/KB
echo 24576 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 33 tpci_snd_cwnd 86
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
396976 396976 16384 20.00 1483.03 10^6bits/s 0.45 S 4.51 S 0.603 0.997 usec/KB
echo 32768 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 19 tpci_snd_cwnd 100
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
461600 461600 16384 20.00 2039.67 10^6bits/s 0.64 S 5.17 S 0.620 0.830 usec/KB
echo 49152 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 28 tpci_snd_cwnd 207
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
955512 955512 16384 20.00 4448.86 10^6bits/s 1.19 S 11.16 S 0.526 0.822 usec/KB
echo 65536 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 399 tpci_snd_cwnd 488
tcpi_reordering 127 tcpi_total_retrans 75
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2460328 2460328 16384 20.00 5975.12 10^6bits/s 1.81 S 14.65 S 0.595 0.803 usec/KB
echo 81920 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 24 tpci_snd_cwnd 236
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1144768 1144768 16384 20.00 5190.08 10^6bits/s 1.56 S 12.63 S 0.591 0.798 usec/KB
echo 98304 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 20 tpci_snd_cwnd 644
tcpi_reordering 59 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2991168 2991168 16384 20.00 5976.00 10^6bits/s 1.60 S 14.61 S 0.526 0.801 usec/KB
echo 114688 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 23 tpci_snd_cwnd 683
tcpi_reordering 59 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3161960 3161960 16384 20.00 5975.14 10^6bits/s 1.42 S 14.78 S 0.469 0.810 usec/KB
echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 23 tpci_snd_cwnd 591
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2728056 2728056 16384 20.00 5976.16 10^6bits/s 1.71 S 14.62 S 0.562 0.802 usec/KB
echo 147456 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 27 tpci_snd_cwnd 697
tcpi_reordering 64 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3240432 3240432 16384 20.00 5975.64 10^6bits/s 1.51 S 14.78 S 0.498 0.811 usec/KB
echo 163840 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 18 tpci_snd_cwnd 710
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3277360 3277360 16384 20.00 5975.56 10^6bits/s 1.59 S 14.79 S 0.525 0.811 usec/KB
echo 180224 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 32 tpci_snd_cwnd 701
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3235816 3235816 16384 20.00 5976.80 10^6bits/s 1.56 S 14.61 S 0.514 0.801 usec/KB
echo 196608 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 502 tpci_snd_cwnd 690
tcpi_reordering 127 tcpi_total_retrans 37
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3185040 3185040 16384 20.00 5975.46 10^6bits/s 1.50 S 14.67 S 0.493 0.804 usec/KB
echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 721
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3448152 3448152 16384 20.00 5975.49 10^6bits/s 1.57 S 14.78 S 0.516 0.811 usec/KB
echo 524288 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2000 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 927
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
4194304 4194304 16384 20.01 5976.61 10^6bits/s 1.63 S 14.56 S 0.538 0.798 usec/KB
echo 1048576 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2500 tcpi_rttvar 750 tcpi_snd_ssthresh 17 tpci_snd_cwnd 1272
tcpi_reordering 90 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
4194304 4194304 16384 20.01 5975.11 10^6bits/s 1.64 S 14.69 S 0.541 0.805 usec/KB
Tests on Gbit link :
echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 30 tpci_snd_cwnd 274
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1264784 1264784 16384 20.01 689.70 10^6bits/s 0.22 S 15.05 S 0.634 7.149 usec/KB
echo 24576 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 43 tpci_snd_cwnd 245
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1130920 1130920 16384 20.01 860.21 10^6bits/s 0.25 S 16.05 S 0.576 6.112 usec/KB
echo 32768 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 36 tpci_snd_cwnd 229
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1057064 1057064 16384 20.01 867.76 10^6bits/s 0.28 S 15.46 S 0.634 5.839 usec/KB
echo 49152 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 32 tpci_snd_cwnd 293
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1352488 1352488 16384 20.01 873.61 10^6bits/s 0.21 S 16.25 S 0.483 6.095 usec/KB
echo 65536 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 48 tpci_snd_cwnd 274
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1264784 1264784 16384 20.01 875.90 10^6bits/s 0.19 S 15.56 S 0.421 5.822 usec/KB
echo 81920 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 18 tpci_snd_cwnd 246
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1135536 1135536 16384 20.01 879.10 10^6bits/s 0.26 S 15.92 S 0.590 5.935 usec/KB
echo 98304 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 20 tpci_snd_cwnd 361
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1666376 1666376 16384 20.02 880.30 10^6bits/s 0.25 S 16.07 S 0.560 5.980 usec/KB
echo 114688 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 41 tpci_snd_cwnd 281
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1297096 1297096 16384 20.01 881.30 10^6bits/s 0.26 S 15.96 S 0.569 5.933 usec/KB
echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 30 tpci_snd_cwnd 292
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1347872 1347872 16384 20.01 880.43 10^6bits/s 0.23 S 16.71 S 0.511 6.219 usec/KB
echo 147456 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2000 tcpi_rttvar 750 tcpi_snd_ssthresh 31 tpci_snd_cwnd 286
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1320176 1320176 16384 20.01 880.57 10^6bits/s 0.24 S 16.62 S 0.534 6.187 usec/KB
echo 163840 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 19 tpci_snd_cwnd 406
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1874096 1874096 16384 20.02 880.23 10^6bits/s 0.25 S 17.08 S 0.550 6.358 usec/KB
echo 180224 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2000 tcpi_rttvar 750 tcpi_snd_ssthresh 27 tpci_snd_cwnd 304
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1403264 1403264 16384 20.01 880.34 10^6bits/s 0.22 S 16.03 S 0.501 5.965 usec/KB
echo 196608 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2000 tcpi_rttvar 750 tcpi_snd_ssthresh 42 tpci_snd_cwnd 365
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1684840 1684840 16384 20.02 879.73 10^6bits/s 0.26 S 16.82 S 0.578 6.267 usec/KB
echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2875 tcpi_rttvar 750 tcpi_snd_ssthresh 27 tpci_snd_cwnd 471
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2174136 2174136 16384 20.01 879.89 10^6bits/s 0.25 S 18.52 S 0.556 6.898 usec/KB
echo 524288 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 205000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 5000 tcpi_rttvar 750 tcpi_snd_ssthresh 42 tpci_snd_cwnd 627
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2894232 2894232 16384 20.03 879.84 10^6bits/s 0.25 S 17.12 S 0.564 6.374 usec/KB
echo 1048576 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 209000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 9875 tcpi_rttvar 750 tcpi_snd_ssthresh 33 tpci_snd_cwnd 950
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
4194304 4194304 16384 20.03 880.70 10^6bits/s 0.25 S 18.44 S 0.560 6.861 usec/KB
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:11 ` Eric Dumazet
@ 2012-07-11 15:16 ` Ben Greear
2012-07-11 15:25 ` Eric Dumazet
2012-07-11 18:23 ` Rick Jones
2012-07-11 18:44 ` Rick Jones
2 siblings, 1 reply; 44+ messages in thread
From: Ben Greear @ 2012-07-11 15:16 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr, Rick Jones
On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> On Tue, 2012-07-10 at 17:13 +0200, Eric Dumazet wrote:
>> This introduces TSQ (TCP Small Queues)
>>
>> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
>> device queues), reducing RTT and cwnd bias, part of the bufferbloat
>> problem.
>>
>> sk->sk_wmem_alloc is not allowed to grow above a given limit,
>> allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
>> given time.
>>
>> TSO packets are sized/capped to half the limit, so that we have two
>> TSO packets in flight, allowing better bandwidth use.
>>
>> As a side effect, setting the limit to 40000 automatically reduces the
>> standard gso max limit (65536) to 40000/2 : It can help to reduce
>> latencies of high prio packets, having smaller TSO packets.
>>
>> This means we divert sock_wfree() to a tcp_wfree() handler, to
>> queue/send following frames when skb_orphan() [2] is called for the
>> already queued skbs.
>>
>> Results on my dev machine (tg3 nic) are really impressive, using
>> standard pfifo_fast, and with or without TSO/GSO. Without reduction of
>> nominal bandwidth.
>>
>> I no longer have 3MBytes backlogged in qdisc by a single netperf
>> session, and both side socket autotuning no longer use 4 Mbytes.
>>
>> As the skb destructor cannot restart xmit itself (the qdisc lock might be
>> held at this point), we delegate the work to a tasklet. We use one
>> tasklet per cpu for performance reasons.
>>
>>
>>
>> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
>> [2] skb_orphan() is usually called at TX completion time,
>> but some drivers call it in their start_xmit() handler.
>> These drivers should at least use BQL, or else a single TCP
>> session can still fill the whole NIC TX ring, since TSQ will
>> have no effect.
>
> I am going to send an official patch (I'll put a v3 tag in it)
>
> I believe I did a full implementation, including the xmit() done
> by the user at release_sock() time, if the tasklet found socket owned by
> the user.
>
> Some bench results about the choice of 128KB being the default value:
>
> 64KB seems the 'good' value on 10Gb links to reach max throughput on my
> lab machines (ixgbe adapters).
>
> Using 128KB is a very conservative value to allow link rate on 20Gbps.
>
> Still, it allows less than 1ms of buffering on a Gbit link, and less
> than 8ms on 100Mbit link (instead of 130ms without Small Queues)
I haven't read your patch in detail, but I was wondering if this feature
would cause trouble for applications that are servicing many sockets at once
and so might take several ms between handling each individual socket.
Or, applications that for other reasons cannot service sockets quite
as fast. Without this feature, they could poke more data into the
xmit queues to be handled by the kernel while the app goes about its
other user-space work?
Maybe this feature could be enabled/tuned on a per-socket basis?
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:16 ` Ben Greear
@ 2012-07-11 15:25 ` Eric Dumazet
2012-07-11 15:43 ` Ben Greear
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 15:25 UTC (permalink / raw)
To: Ben Greear; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Wed, 2012-07-11 at 08:16 -0700, Ben Greear wrote:
> I haven't read your patch in detail, but I was wondering if this feature
> would cause trouble for applications that are servicing many sockets at once
> and so might take several ms between handling each individual socket.
>
Well, this patch has no impact on such applications. In fact their
send()/write() will return to userland faster than before (for very
large send()s).
> Or, applications that for other reasons cannot service sockets quite
> as fast. Without this feature, they could poke more data into the
> xmit queues to be handled by the kernel while the app goes about it's
> other user-space work?
>
There is no impact on the applications. They queue their data in the
socket write queue, and the TCP stack does the work of actually
transmitting data and handling ACKs.
Before this patch, this work was triggered by:
- Timers
- Incoming ACKs
We now add a third trigger: TX completion.
> Maybe this feature could be enabled/tuned on a per-socket basis?
Well, why not, but first I want to see why it would be needed.
I mean, if a single application _needs_ to put megabytes of TCP data in
the qdisc at once, everything else on the machine is stuck (as today),
so just increase the global parameter.
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:25 ` Eric Dumazet
@ 2012-07-11 15:43 ` Ben Greear
2012-07-11 15:54 ` Eric Dumazet
0 siblings, 1 reply; 44+ messages in thread
From: Ben Greear @ 2012-07-11 15:43 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr, Rick Jones
On 07/11/2012 08:25 AM, Eric Dumazet wrote:
> On Wed, 2012-07-11 at 08:16 -0700, Ben Greear wrote:
>
>> I haven't read your patch in detail, but I was wondering if this feature
>> would cause trouble for applications that are servicing many sockets at once
>> and so might take several ms between handling each individual socket.
>>
>
> Well, this patch has no impact for such applications. In fact their
> send()/write() will return to userland faster than before (for very
> large send())
Maybe I'm just confused. Is your patch just mucking with
the queues below the tcp xmit queues? From the patch description
I was thinking you were somehow directly limiting the TCP xmit
queues...
If you are just draining the tcp xmit queues on a new/faster
trigger, then I see no problem with that, and no need for
a per-socket control.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:43 ` Ben Greear
@ 2012-07-11 15:54 ` Eric Dumazet
2012-07-11 16:03 ` Ben Greear
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 15:54 UTC (permalink / raw)
To: Ben Greear; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Wed, 2012-07-11 at 08:43 -0700, Ben Greear wrote:
> On 07/11/2012 08:25 AM, Eric Dumazet wrote:
> > On Wed, 2012-07-11 at 08:16 -0700, Ben Greear wrote:
> >
> >> I haven't read your patch in detail, but I was wondering if this feature
> >> would cause trouble for applications that are servicing many sockets at once
> >> and so might take several ms between handling each individual socket.
> >>
> >
> > Well, this patch has no impact for such applications. In fact their
> > send()/write() will return to userland faster than before (for very
> > large send())
>
> Maybe I'm just confused. Is your patch just mucking with
> the queues below the tcp xmit queues? From the patch description
> I was thinking you were somehow directly limiting the TCP xmit
> queues...
>
I don't limit TCP xmit queues. I might avoid excessive autotuning.
> If you are just draining the tcp xmit queues on a new/faster
> trigger, then I see no problem with that, and no need for
> a per-socket control.
That's the plan: limiting the number of bytes in the qdisc, not the
number of bytes in the socket write queue.
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:54 ` Eric Dumazet
@ 2012-07-11 16:03 ` Ben Greear
0 siblings, 0 replies; 44+ messages in thread
From: Ben Greear @ 2012-07-11 16:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr, Rick Jones
On 07/11/2012 08:54 AM, Eric Dumazet wrote:
> On Wed, 2012-07-11 at 08:43 -0700, Ben Greear wrote:
>> On 07/11/2012 08:25 AM, Eric Dumazet wrote:
>>> On Wed, 2012-07-11 at 08:16 -0700, Ben Greear wrote:
>>>
>>>> I haven't read your patch in detail, but I was wondering if this feature
>>>> would cause trouble for applications that are servicing many sockets at once
>>>> and so might take several ms between handling each individual socket.
>>>>
>>>
>>> Well, this patch has no impact for such applications. In fact their
>>> send()/write() will return to userland faster than before (for very
>>> large send())
>>
>> Maybe I'm just confused. Is your patch just mucking with
>> the queues below the tcp xmit queues? From the patch description
>> I was thinking you were somehow directly limiting the TCP xmit
>> queues...
>>
>
> I dont limit tcp xmit queues. I might avoid excessive autotuning.
>
>
>
>> If you are just draining the tcp xmit queues on a new/faster
>> trigger, then I see no problem with that, and no need for
>> a per-socket control.
>
> That's the plan: limiting the number of bytes in the qdisc, not the
> number of bytes in the socket write queue.
Thanks for the explanation.
Out of curiosity, have you tried running multiple TCP streams
with different processes driving each stream, where each is trying
to drive, say, 700Mbps bi-directional traffic over a 1Gbps link?
Perhaps with 50ms of latency generated by a network emulator.
This used to cause some extremely high latency
due to excessive TCP xmit queues (from what I could tell),
but maybe this new patch will cure that.
I'll re-run my tests with your patch eventually..but too bogged
down to do so soon.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:11 ` Eric Dumazet
2012-07-11 15:16 ` Ben Greear
@ 2012-07-11 18:23 ` Rick Jones
2012-07-11 23:38 ` Eric Dumazet
2012-07-11 18:44 ` Rick Jones
2 siblings, 1 reply; 44+ messages in thread
From: Rick Jones @ 2012-07-11 18:23 UTC (permalink / raw)
To: Eric Dumazet; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On 07/11/2012 08:11 AM, Eric Dumazet wrote:
>
>
> Tests using a single TCP flow.
>
> Tests on 10Gbit links :
>
>
> echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> tcpi_reordering 53 tcpi_total_retrans 0
I take it you hacked your local copy of netperf to emit those? Or did I
leave some cruft behind in something I committed to the repository?
What was the ultimate limiter on throughput? I notice it didn't achieve
link-rate on either 10 GbE nor 1 GbE.
> That's the plan: limiting the number of bytes in the qdisc, not the
> number of bytes in the socket write queue.
So the SO_SNDBUF can still grow rather larger than necessary? It is
just that TCP will be nice to the other flows by not dumping all of it
into the qdisc at once. Latency seen by the application itself is then
unchanged since there will still be (potentially) as much stuff queued
in the SO_SNDBUF as before, right?
rick
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 18:23 ` Rick Jones
@ 2012-07-11 23:38 ` Eric Dumazet
0 siblings, 0 replies; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 23:38 UTC (permalink / raw)
To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Wed, 2012-07-11 at 11:23 -0700, Rick Jones wrote:
> On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> >
> >
> > Tests using a single TCP flow.
> >
> > Tests on 10Gbit links :
> >
> >
> > echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> > tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> > tcpi_reordering 53 tcpi_total_retrans 0
>
> I take it you hacked your local copy of netperf to emit those? Or did I
> leave some cruft behind in something I committed to the repository?
>
Yep, it's netperf-2.5.0 with a one-line change to output these TCP_INFO
bits.
> What was the ultimate limiter on throughput? I notice it didn't achieve
> link-rate on either 10 GbE nor 1 GbE.
>
My lab has one fast machine (source in this 10Gb test), and one slow
machine (Intel Q6600 quad core), both with ixgbe cards.
On Gigabit test, the receiver is a laptop.
> > Thats the plan : limiting numer of bytes in Qdisc, not number of bytes
> > in socket write queue.
>
> So the SO_SNDBUF can still grow rather larger than necessary? It is
> just that TCP will be nice to the other flows by not dumping all of it
> into the qdisc at once. Latency seen by the application itself is then
> unchanged since there will still be (potentially) as much stuff queued
> in the SO_SNDBUF as before right?
Of course SO_SNDBUF can grow if autotuning is enabled.
I think there is a bit of misunderstanding about this patch and what it
does.
It only makes sure that packets (from the socket write queue) are
cloned into the qdisc/device queue in a limited way, not "as much as
allowed".
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:11 ` Eric Dumazet
2012-07-11 15:16 ` Ben Greear
2012-07-11 18:23 ` Rick Jones
@ 2012-07-11 18:44 ` Rick Jones
2012-07-11 23:49 ` Eric Dumazet
2 siblings, 1 reply; 44+ messages in thread
From: Rick Jones @ 2012-07-11 18:44 UTC (permalink / raw)
To: Eric Dumazet; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On 07/11/2012 08:11 AM, Eric Dumazet wrote:
>
> Some bench results about the choice of 128KB being the default value:
What were the starting/baseline figures?
>
> Tests using a single TCP flow.
>
> Tests on 10Gbit links :
>
>
> echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> tcpi_reordering 53 tcpi_total_retrans 0
> Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
> Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
> Size Size Size (sec) Util Util Util Util Demand Demand Units
> Final Final % Method % Method
> 392360 392360 16384 20.00 1389.53 10^6bits/s 0.52 S 4.30 S 0.737 1.014 usec/KB
By the way, that double reporting of the local socket send size is fixed in:
------------------------------------------------------------------------
r516 | raj | 2012-01-05 15:48:52 -0800 (Thu, 05 Jan 2012) | 1 line
report the rsr_size_end in an omni stream test rather than a copy of the
lss_size_end
of netperf and later. Also, any idea why the local socket send size got
so much larger with 1GbE than 10 GbE at that setting of
tcp_limit_output_bytes?
> Tests on Gbit link :
>
>
> echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
> tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 30 tpci_snd_cwnd 274
> tcpi_reordering 3 tcpi_total_retrans 0
> Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
> Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
> Size Size Size (sec) Util Util Util Util Demand Demand Units
> Final Final % Method % Method
> 1264784 1264784 16384 20.01 689.70 10^6bits/s 0.22 S 15.05 S 0.634 7.149 usec/KB
rick jones
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 18:44 ` Rick Jones
@ 2012-07-11 23:49 ` Eric Dumazet
2012-07-12 7:34 ` Eric Dumazet
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 23:49 UTC (permalink / raw)
To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Wed, 2012-07-11 at 11:44 -0700, Rick Jones wrote:
> On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> >
> > Some bench results about the choice of 128KB being the default value:
>
> What were the starting/baseline figures?
>
> >
> > Tests using a single TCP flow.
> >
> > Tests on 10Gbit links :
> >
> >
> > echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> > tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> > tcpi_reordering 53 tcpi_total_retrans 0
> > Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
> > Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
> > Size Size Size (sec) Util Util Util Util Demand Demand Units
> > Final Final % Method % Method
> > 392360 392360 16384 20.00 1389.53 10^6bits/s 0.52 S 4.30 S 0.737 1.014 usec/KB
>
> By the way, that double reporting of the local socket send size is fixed in:
>
> ------------------------------------------------------------------------
> r516 | raj | 2012-01-05 15:48:52 -0800 (Thu, 05 Jan 2012) | 1 line
>
> report the rsr_size_end in an omni stream test rather than a copy of the
> lss_size_end
>
> of netperf and later. Also, any idea why the local socket send size got
> so much larger with 1GbE than 10 GbE at that setting of
> tcp_limit_output_bytes?
The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
ubuntu kernel. They probably have very different TCP behavior.
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 23:49 ` Eric Dumazet
@ 2012-07-12 7:34 ` Eric Dumazet
2012-07-12 7:37 ` David Miller
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-12 7:34 UTC (permalink / raw)
To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Thu, 2012-07-12 at 01:49 +0200, Eric Dumazet wrote:
> The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
> ubuntu kernel. They probably have very different TCP behavior.
I tested TSQ on bnx2x and 10Gb links.
I get full rate even using 65536 bytes for
the /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.8.37 () port 0 AF_INET : histogram
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1606536 2097152 16384 20.00 9411.12 10^6bits/s 2.40 S 4.27 S 0.502 0.892 usec/KB
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 7:34 ` Eric Dumazet
@ 2012-07-12 7:37 ` David Miller
2012-07-12 7:51 ` Eric Dumazet
0 siblings, 1 reply; 44+ messages in thread
From: David Miller @ 2012-07-12 7:37 UTC (permalink / raw)
To: eric.dumazet; +Cc: nanditad, netdev, mattmathis, codel, ncardwell
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 12 Jul 2012 09:34:19 +0200
> On Thu, 2012-07-12 at 01:49 +0200, Eric Dumazet wrote:
>
>> The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
>> ubuntu kernel. They probably have very different TCP behavior.
>
>
> I tested TSQ on bnx2x and 10Gb links.
>
> I get full rate even using 65536 bytes for
> the /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
Great work Eric.
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 7:37 ` David Miller
@ 2012-07-12 7:51 ` Eric Dumazet
2012-07-12 14:55 ` Tom Herbert
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-12 7:51 UTC (permalink / raw)
To: David Miller; +Cc: nanditad, netdev, mattmathis, codel, ncardwell
On Thu, 2012-07-12 at 00:37 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 12 Jul 2012 09:34:19 +0200
>
> > On Thu, 2012-07-12 at 01:49 +0200, Eric Dumazet wrote:
> >
> >> The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
> >> ubuntu kernel. They probably have very different TCP behavior.
> >
> >
> > I tested TSQ on bnx2x and 10Gb links.
> >
> > I get full rate even using 65536 bytes for
> > the /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
>
> Great work Eric.
Thanks !
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 7:51 ` Eric Dumazet
@ 2012-07-12 14:55 ` Tom Herbert
0 siblings, 0 replies; 44+ messages in thread
From: Tom Herbert @ 2012-07-12 14:55 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, rick.jones2, ycheng, dave.taht, netdev, codel,
mattmathis, nanditad, ncardwell, andrewmcgr
On Thu, Jul 12, 2012 at 12:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-07-12 at 00:37 -0700, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Thu, 12 Jul 2012 09:34:19 +0200
>>
>> > On Thu, 2012-07-12 at 01:49 +0200, Eric Dumazet wrote:
>> >
>> >> The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
>> >> ubuntu kernel. They probably have very different TCP behavior.
>> >
>> >
>> > I tested TSQ on bnx2x and 10Gb links.
>> >
>> > I get full rate even using 65536 bytes for
>> > the /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
>>
>> Great work Eric.
>
> Thanks !
>
This is indeed great work! A couple of comments...
Do you know if there are any qdiscs that function less efficiently
when we are restricting the number of packets? For instance, will HTB
work as expected in various configurations?
One extension to this work would be to make the limit dynamic and mostly
eliminate the tunable. I'm thinking we might be able to correlate the
limit to the BQL limit of the egress queue for the flow, if there is
one.
Assuming all work-conserving qdiscs, the minimal amount of outstanding
host data for a queue could be associated with the BQL limit of the
egress NIC queue. We want to minimize the outstanding data subject to:
sum(data_of_tcp_flows_sharing_same_queue) > bql_limit_for_queue
So this could imply a per flow limit of:
tcp_limit = max(bql_limit - bql_inflight, one_packet)
For a single active connection on a queue, the tcp_limit is equal to
the BQL limit. Once the BQL limit is hit in the NIC, we only need one
packet outstanding per flow to maintain flow control. For fairness,
we might need "one_packet" to actually be max GSO data. Also, this
disregards any latency in scheduling and running the tasklet, which
might also need to be taken into account.
Tom
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 15:13 ` [RFC PATCH v2] tcp: TCP Small Queues Eric Dumazet
` (2 preceding siblings ...)
2012-07-11 15:11 ` Eric Dumazet
@ 2012-07-12 13:33 ` John Heffner
2012-07-12 13:46 ` Eric Dumazet
3 siblings, 1 reply; 44+ messages in thread
From: John Heffner @ 2012-07-12 13:33 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr
One general question: why a per-connection limit? I haven't been
following the bufferbloat conversation closely so I may have missed
some of the conversation. But it seems that multiple connections will
still cause longer queue times.
Thanks,
-John
On Tue, Jul 10, 2012 at 11:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This introduces TSQ (TCP Small Queues)
>
> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
> given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2: it can help to reduce
> latencies of high-prio packets by using smaller TSO packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, and with or without TSO/GSO. Without reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and socket autotuning on both sides no longer uses 4 Mbytes.
>
> As the skb destructor cannot restart xmit itself (as the qdisc lock might be
> taken at this point), we delegate the work to a tasklet. We use one
> tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
> but some drivers call it in their start_xmit() handler.
> These drivers should at least use BQL, or else a single TCP
> session can still fill the whole NIC TX ring, since TSQ will
> have no effect.
>
> Not-Yet-Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/linux/tcp.h | 9 ++
> include/net/tcp.h | 3
> net/ipv4/sysctl_net_ipv4.c | 7 +
> net/ipv4/tcp.c | 14 ++-
> net/ipv4/tcp_minisocks.c | 1
> net/ipv4/tcp_output.c | 132 ++++++++++++++++++++++++++++++++++-
> 6 files changed, 160 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 7d3bced..55b8cf9 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -339,6 +339,9 @@ struct tcp_sock {
> u32 rcv_tstamp; /* timestamp of last received ACK (for keepalives) */
> u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
>
> + struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
> + unsigned long tsq_flags;
> +
> /* Data for direct copy to user */
> struct {
> struct sk_buff_head prequeue;
> @@ -494,6 +497,12 @@ struct tcp_sock {
> struct tcp_cookie_values *cookie_values;
> };
>
> +enum tsq_flags {
> + TSQ_THROTTLED,
> + TSQ_QUEUED,
> + TSQ_OWNED, /* tcp_tasklet_func() found socket was locked */
> +};
> +
> static inline struct tcp_sock *tcp_sk(const struct sock *sk)
> {
> return (struct tcp_sock *)sk;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 53fb7d8..3a6ed09 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -253,6 +253,7 @@ extern int sysctl_tcp_cookie_size;
> extern int sysctl_tcp_thin_linear_timeouts;
> extern int sysctl_tcp_thin_dupack;
> extern int sysctl_tcp_early_retrans;
> +extern int sysctl_tcp_limit_output_bytes;
>
> extern atomic_long_t tcp_memory_allocated;
> extern struct percpu_counter tcp_sockets_allocated;
> @@ -321,6 +322,8 @@ extern struct proto tcp_prot;
>
> extern void tcp_init_mem(struct net *net);
>
> +extern void tcp_tasklet_init(void);
> +
> extern void tcp_v4_err(struct sk_buff *skb, u32);
>
> extern void tcp_shutdown (struct sock *sk, int how);
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 12aa0c5..70730f7 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -598,6 +598,13 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec
> },
> + {
> + .procname = "tcp_limit_output_bytes",
> + .data = &sysctl_tcp_limit_output_bytes,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec
> + },
> #ifdef CONFIG_NET_DMA
> {
> .procname = "tcp_dma_copybreak",
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3ba605f..8838bd2 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -376,6 +376,7 @@ void tcp_init_sock(struct sock *sk)
> skb_queue_head_init(&tp->out_of_order_queue);
> tcp_init_xmit_timers(sk);
> tcp_prequeue_init(tp);
> + INIT_LIST_HEAD(&tp->tsq_node);
>
> icsk->icsk_rto = TCP_TIMEOUT_INIT;
> tp->mdev = TCP_TIMEOUT_INIT;
> @@ -786,15 +787,17 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
> int large_allowed)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> - u32 xmit_size_goal, old_size_goal;
> + u32 xmit_size_goal, old_size_goal, gso_max_size;
>
> xmit_size_goal = mss_now;
>
> if (large_allowed && sk_can_gso(sk)) {
> - xmit_size_goal = ((sk->sk_gso_max_size - 1) -
> - inet_csk(sk)->icsk_af_ops->net_header_len -
> - inet_csk(sk)->icsk_ext_hdr_len -
> - tp->tcp_header_len);
> + gso_max_size = min_t(u32, sk->sk_gso_max_size,
> + sysctl_tcp_limit_output_bytes >> 1);
> + xmit_size_goal = (gso_max_size - 1) -
> + inet_csk(sk)->icsk_af_ops->net_header_len -
> + inet_csk(sk)->icsk_ext_hdr_len -
> + tp->tcp_header_len;
>
> xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
>
> @@ -3573,4 +3576,5 @@ void __init tcp_init(void)
> tcp_secret_primary = &tcp_secret_one;
> tcp_secret_retiring = &tcp_secret_two;
> tcp_secret_secondary = &tcp_secret_two;
> + tcp_tasklet_init();
> }
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 72b7c63..83b358f 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -482,6 +482,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
> treq->snt_isn + 1 + tcp_s_data_size(oldtp);
>
> tcp_prequeue_init(newtp);
> + INIT_LIST_HEAD(&newtp->tsq_node);
>
> tcp_init_wl(newtp, treq->rcv_isn);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c465d3e..991ae45 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -50,6 +50,9 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
> */
> int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
>
> +/* Default TSQ limit of two TSO segments */
> +int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
> +
> /* This limits the percentage of the congestion window which we
> * will allow a single TSO frame to consume. Building TSO frames
> * which are too large can cause TCP streams to be bursty.
> @@ -65,6 +68,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
> int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
> EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
>
> +static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> + int push_one, gfp_t gfp);
>
> /* Account for new data that has been sent to the network. */
> static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
> @@ -783,6 +788,118 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
> return size;
> }
>
> +
> +/* TCP SMALL QUEUES (TSQ)
> + *
> + * TSQ's goal is to keep a small number of skbs per tcp flow in tx queues (qdisc+dev)
> + * to reduce RTT and bufferbloat.
> + * We do this using a special skb destructor (tcp_wfree).
> + *
> + * It's important that tcp_wfree() can be replaced by sock_wfree() in the event the skb
> + * needs to be reallocated in a driver.
> + * The invariant being skb->truesize subtracted from sk->sk_wmem_alloc
> + *
> + * Since transmit from skb destructor is forbidden, we use a tasklet
> + * to process all sockets that eventually need to send more skbs.
> + * We use one tasklet per cpu, with its own queue of sockets.
> + */
> +struct tsq_tasklet {
> + struct tasklet_struct tasklet;
> + struct list_head head; /* queue of tcp sockets */
> +};
> +static DEFINE_PER_CPU(struct tsq_tasklet, tsq_tasklet);
> +
> +/*
> + * One tasklet per cpu tries to send more skbs.
> + * We run in tasklet context but need to disable irqs when
> + * transferring tsq->head because tcp_wfree() might
> + * interrupt us (non NAPI drivers)
> + */
> +static void tcp_tasklet_func(unsigned long data)
> +{
> + struct tsq_tasklet *tsq = (struct tsq_tasklet *)data;
> + LIST_HEAD(list);
> + unsigned long flags;
> + struct list_head *q, *n;
> + struct tcp_sock *tp;
> + struct sock *sk;
> +
> + local_irq_save(flags);
> + list_splice_init(&tsq->head, &list);
> + local_irq_restore(flags);
> +
> + list_for_each_safe(q, n, &list) {
> + tp = list_entry(q, struct tcp_sock, tsq_node);
> + list_del(&tp->tsq_node);
> +
> + sk = (struct sock *)tp;
> + bh_lock_sock(sk);
> +
> + if (!sock_owned_by_user(sk)) {
> + if ((1 << sk->sk_state) &
> + (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED))
> + tcp_write_xmit(sk,
> + tcp_current_mss(sk),
> + 0, 0,
> + GFP_ATOMIC);
> + } else {
> + /* TODO:
> + * setup a timer, or check TSQ_OWNED in release_sock()
> + */
> + set_bit(TSQ_OWNED, &tp->tsq_flags);
> + }
> + bh_unlock_sock(sk);
> +
> + clear_bit(TSQ_QUEUED, &tp->tsq_flags);
> + sk_free(sk);
> + }
> +}
> +
> +void __init tcp_tasklet_init(void)
> +{
> + int i;
> +
> + for_each_possible_cpu(i) {
> + struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
> +
> + INIT_LIST_HEAD(&tsq->head);
> + tasklet_init(&tsq->tasklet,
> + tcp_tasklet_func,
> + (unsigned long)tsq);
> + }
> +}
> +
> +/*
> + * Write buffer destructor automatically called from kfree_skb.
> + * We can't xmit new skbs from this context, as we might already
> + * hold the qdisc lock.
> + */
> +void tcp_wfree(struct sk_buff *skb)
> +{
> + struct sock *sk = skb->sk;
> + struct tcp_sock *tp = tcp_sk(sk);
> +
> + if (test_and_clear_bit(TSQ_THROTTLED, &tp->tsq_flags) &&
> + !test_and_set_bit(TSQ_QUEUED, &tp->tsq_flags)) {
> + unsigned long flags;
> + struct tsq_tasklet *tsq;
> +
> + /* Keep a ref on socket.
> + * This last ref will be released in tcp_tasklet_func()
> + */
> + atomic_sub(skb->truesize - 1, &sk->sk_wmem_alloc);
> +
> + /* queue this socket to tasklet queue */
> + local_irq_save(flags);
> + tsq = &__get_cpu_var(tsq_tasklet);
> + list_add(&tp->tsq_node, &tsq->head);
> + tasklet_schedule(&tsq->tasklet);
> + local_irq_restore(flags);
> + } else {
> + sock_wfree(skb);
> + }
> +}
> +
> /* This routine actually transmits TCP packets queued in by
> * tcp_do_sendmsg(). This is used by both the initial
> * transmission and possible later retransmissions.
> @@ -844,7 +961,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>
> skb_push(skb, tcp_header_size);
> skb_reset_transport_header(skb);
> - skb_set_owner_w(skb, sk);
> +
> + skb_orphan(skb);
> + skb->sk = sk;
> + skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
> + tcp_wfree : sock_wfree;
> + atomic_add(skb->truesize, &sk->sk_wmem_alloc);
>
> /* Build TCP header and checksum it. */
> th = tcp_hdr(skb);
> @@ -1780,6 +1902,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> while ((skb = tcp_send_head(sk))) {
> unsigned int limit;
>
> +
> tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
> BUG_ON(!tso_segs);
>
> @@ -1800,6 +1923,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> break;
> }
>
> + /* TSQ : sk_wmem_alloc accounts for skb truesize,
> + * including skb overhead. But that's OK.
> + */
> + if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> + set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> + break;
> + }
> limit = mss_now;
> if (tso_segs > 1 && !tcp_urg_mode(tp))
> limit = tcp_mss_split_point(sk, skb, mss_now,
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 13:33 ` John Heffner
@ 2012-07-12 13:46 ` Eric Dumazet
2012-07-12 16:44 ` John Heffner
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-12 13:46 UTC (permalink / raw)
To: John Heffner
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr
On Thu, 2012-07-12 at 09:33 -0400, John Heffner wrote:
> One general question: why a per-connection limit? I haven't been
> following the bufferbloat conversation closely so I may have missed
> some of the conversation. But it seems that multiple connections will
> still cause longer queue times.
We already have a per-device limit, in qdisc.
If you want to monitor several tcp sessions, I urge you to use a controller
for that, like codel or fq_codel.
Experiments show that limiting to two TSO packets in qdisc per tcp flow
is enough to stop insane qdisc queueing, without impact on throughput
for people wanting fast tcp sessions.
That's not solving the more general problem of having 1000 competing
flows.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 13:46 ` Eric Dumazet
@ 2012-07-12 16:44 ` John Heffner
2012-07-12 16:54 ` Jim Gettys
0 siblings, 1 reply; 44+ messages in thread
From: John Heffner @ 2012-07-12 16:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr
On Thu, Jul 12, 2012 at 9:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-07-12 at 09:33 -0400, John Heffner wrote:
>> One general question: why a per-connection limit? I haven't been
>> following the bufferbloat conversation closely so I may have missed
>> some of the conversation. But it seems that multiple connections will
>> still cause longer queue times.
>
> We already have a per-device limit, in qdisc.
>
> If you want to monitor several tcp sessions, I urge you to use a controller
> for that, like codel or fq_codel.
>
> Experiments show that limiting to two TSO packets in qdisc per tcp flow
> is enough to stop insane qdisc queueing, without impact on throughput
> for people wanting fast tcp sessions.
>
> That's not solving the more general problem of having 1000 competing
> flows.
Right, AQM (and probably some modifications to the congestion control)
is the more general solution.
I guess I'm just trying to justify in my mind that the case of a small
number of local connections is worth handling in this special way. It
seems like a generally reasonable thing, but it's definitely not a
general solution to minimizing latency. One thing worth noting: on a
system routing traffic, local connections may be at a disadvantage
relative to connections being forwarded, sharing the same interface
queue, if that queue is the bottleneck.
Architecturally, the inconsistency between a local queue and a queue
one hop away bothers me a bit, but it's something I can learn to live
with if it really does improve a common case significantly. ;-)
Thanks,
-John
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 16:44 ` John Heffner
@ 2012-07-12 16:54 ` Jim Gettys
0 siblings, 0 replies; 44+ messages in thread
From: Jim Gettys @ 2012-07-12 16:54 UTC (permalink / raw)
To: John Heffner; +Cc: nanditad, netdev, codel, mattmathis, ncardwell, David Miller
On 07/12/2012 12:44 PM, John Heffner wrote:
> On Thu, Jul 12, 2012 at 9:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Thu, 2012-07-12 at 09:33 -0400, John Heffner wrote:
>>> One general question: why a per-connection limit? I haven't been
>>> following the bufferbloat conversation closely so I may have missed
>>> some of the conversation. But it seems that multiple connections will
>>> still cause longer queue times.
>> We already have a per-device limit, in qdisc.
>>
>> If you want to monitor several tcp sessions, I urge you to use a controller
>> for that, like codel or fq_codel.
>>
>> Experiments show that limiting to two TSO packets in qdisc per tcp flow
>> is enough to stop insane qdisc queueing, without impact on throughput
>> for people wanting fast tcp sessions.
>>
>> That's not solving the more general problem of having 1000 competing
>> flows.
> Right, AQM (and probably some modifications to the congestion control)
> is the more general solution.
>
> I guess I'm just trying to justify in my mind that the case of a small
> number of local connections is worth handling in this special way. It
> seems like a generally reasonable thing, but it's definitely not a
> general solution to minimizing latency. One thing worth noting: on a
> system routing traffic, local connections may be at a disadvantage
> relative to connections being forwarded, sharing the same interface
> queue, if that queue is the bottleneck.
Kathy simulated CoDel across a pretty wide range of RTTs seen at the
edge of the network, and things behave pretty well. She did say she
needed to think more and simulate the data center cases; haven't had a
chance to chat with her about that. Of course, you can do some
experiments pretty easily yourself, and we'd love to see whatever
results you get.
- Jim
>
> Architecturally, the inconsistency between a local queue and a queue
> one hop away bothers me a bit, but it's something I can learn to live
> with if it really does improve a common case significantly. ;-)
>
> Thanks,
> -John
> _______________________________________________
> Codel mailing list
> Codel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/codel
^ permalink raw reply [flat|nested] 44+ messages in thread