* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 15:13 ` [RFC PATCH v2] tcp: TCP Small Queues Eric Dumazet
@ 2012-07-10 17:06 ` Eric Dumazet
2012-07-10 17:37 ` Yuchung Cheng
` (2 subsequent siblings)
3 siblings, 0 replies; 44+ messages in thread
From: Eric Dumazet @ 2012-07-10 17:06 UTC (permalink / raw)
To: David Miller; +Cc: nanditad, netdev, ycheng, codel, mattmathis, ncardwell
On Tue, 2012-07-10 at 17:13 +0200, Eric Dumazet wrote:
> This introduces TSQ (TCP Small Queues).
>
> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in the qdisc/dev layers
> at a given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2 = 20000: having smaller TSO
> packets can help reduce latencies of high-prio packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send the following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, with or without TSO/GSO, and with no reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and both sides' socket autotuning no longer uses 4 MBytes.
>
> As the skb destructor cannot restart xmit itself (as the qdisc lock
> might be taken at this point), we delegate the work to a tasklet. We
> use one tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
> but some drivers call it in their start_xmit() handler.
> These drivers should at least use BQL, or else a single TCP
> session can still fill the whole NIC TX ring, since TSQ will
> have no effect.
>
> Not-Yet-Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
By the way, Rick Jones asked me:
"Is there also any change in service demand?"
I copy my answer here since it's a very good point:
I worked on the idea of a CoDel-like feedback, to have a timed limit
instead of a byte limit ("allow up to 1ms of delay" in the qdisc/dev
queue). But it seemed a bit complex: I would need to add skb fields to
properly track the residence time (sojourn time) of queued packets.
An alternative would be a per-tcp-socket tracking array, but it might
be expensive to search for a packet in it...
With multiqueue devices or bad qdiscs, skb orphaning can happen out of
order, so the lookup can be relatively expensive.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 15:13 ` [RFC PATCH v2] tcp: TCP Small Queues Eric Dumazet
2012-07-10 17:06 ` Eric Dumazet
@ 2012-07-10 17:37 ` Yuchung Cheng
2012-07-10 18:32 ` Eric Dumazet
2012-07-11 15:11 ` Eric Dumazet
2012-07-12 13:33 ` John Heffner
3 siblings, 1 reply; 44+ messages in thread
From: Yuchung Cheng @ 2012-07-10 17:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, dave.taht, netdev, codel, therbert, mattmathis,
nanditad, ncardwell, andrewmcgr
On Tue, Jul 10, 2012 at 8:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This introduces TSQ (TCP Small Queues).
>
> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in the qdisc/dev layers
> at a given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2 = 20000: having smaller TSO
> packets can help reduce latencies of high-prio packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send the following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, with or without TSO/GSO, and with no reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and both sides' socket autotuning no longer uses 4 MBytes.
>
> As the skb destructor cannot restart xmit itself (as the qdisc lock
> might be taken at this point), we delegate the work to a tasklet. We
> use one tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
> but some drivers call it in their start_xmit() handler.
> These drivers should at least use BQL, or else a single TCP
> session can still fill the whole NIC TX ring, since TSQ will
> have no effect.
>
> Not-Yet-Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/linux/tcp.h | 9 ++
> include/net/tcp.h | 3
> net/ipv4/sysctl_net_ipv4.c | 7 +
> net/ipv4/tcp.c | 14 ++-
> net/ipv4/tcp_minisocks.c | 1
> net/ipv4/tcp_output.c | 132 ++++++++++++++++++++++++++++++++++-
> 6 files changed, 160 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 7d3bced..55b8cf9 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -339,6 +339,9 @@ struct tcp_sock {
> u32 rcv_tstamp; /* timestamp of last received ACK (for keepalives) */
> u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
>
> + struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
> + unsigned long tsq_flags;
> +
> /* Data for direct copy to user */
> struct {
> struct sk_buff_head prequeue;
> @@ -494,6 +497,12 @@ struct tcp_sock {
> struct tcp_cookie_values *cookie_values;
> };
>
> +enum tsq_flags {
> + TSQ_THROTTLED,
> + TSQ_QUEUED,
> + TSQ_OWNED, /* tcp_tasklet_func() found socket was locked */
> +};
> +
> static inline struct tcp_sock *tcp_sk(const struct sock *sk)
> {
> return (struct tcp_sock *)sk;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 53fb7d8..3a6ed09 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -253,6 +253,7 @@ extern int sysctl_tcp_cookie_size;
> extern int sysctl_tcp_thin_linear_timeouts;
> extern int sysctl_tcp_thin_dupack;
> extern int sysctl_tcp_early_retrans;
> +extern int sysctl_tcp_limit_output_bytes;
>
> extern atomic_long_t tcp_memory_allocated;
> extern struct percpu_counter tcp_sockets_allocated;
> @@ -321,6 +322,8 @@ extern struct proto tcp_prot;
>
> extern void tcp_init_mem(struct net *net);
>
> +extern void tcp_tasklet_init(void);
> +
> extern void tcp_v4_err(struct sk_buff *skb, u32);
>
> extern void tcp_shutdown (struct sock *sk, int how);
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 12aa0c5..70730f7 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -598,6 +598,13 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec
> },
> + {
> + .procname = "tcp_limit_output_bytes",
> + .data = &sysctl_tcp_limit_output_bytes,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec
> + },
> #ifdef CONFIG_NET_DMA
> {
> .procname = "tcp_dma_copybreak",
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3ba605f..8838bd2 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -376,6 +376,7 @@ void tcp_init_sock(struct sock *sk)
> skb_queue_head_init(&tp->out_of_order_queue);
> tcp_init_xmit_timers(sk);
> tcp_prequeue_init(tp);
> + INIT_LIST_HEAD(&tp->tsq_node);
>
> icsk->icsk_rto = TCP_TIMEOUT_INIT;
> tp->mdev = TCP_TIMEOUT_INIT;
> @@ -786,15 +787,17 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
> int large_allowed)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> - u32 xmit_size_goal, old_size_goal;
> + u32 xmit_size_goal, old_size_goal, gso_max_size;
>
> xmit_size_goal = mss_now;
>
> if (large_allowed && sk_can_gso(sk)) {
> - xmit_size_goal = ((sk->sk_gso_max_size - 1) -
> - inet_csk(sk)->icsk_af_ops->net_header_len -
> - inet_csk(sk)->icsk_ext_hdr_len -
> - tp->tcp_header_len);
> + gso_max_size = min_t(u32, sk->sk_gso_max_size,
> + sysctl_tcp_limit_output_bytes >> 1);
> + xmit_size_goal = (gso_max_size - 1) -
> + inet_csk(sk)->icsk_af_ops->net_header_len -
> + inet_csk(sk)->icsk_ext_hdr_len -
> + tp->tcp_header_len;
>
> xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
>
> @@ -3573,4 +3576,5 @@ void __init tcp_init(void)
> tcp_secret_primary = &tcp_secret_one;
> tcp_secret_retiring = &tcp_secret_two;
> tcp_secret_secondary = &tcp_secret_two;
> + tcp_tasklet_init();
> }
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 72b7c63..83b358f 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -482,6 +482,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
> treq->snt_isn + 1 + tcp_s_data_size(oldtp);
>
> tcp_prequeue_init(newtp);
> + INIT_LIST_HEAD(&newtp->tsq_node);
>
> tcp_init_wl(newtp, treq->rcv_isn);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c465d3e..991ae45 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -50,6 +50,9 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
> */
> int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
>
> +/* Default TSQ limit of two TSO segments */
> +int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
> +
> /* This limits the percentage of the congestion window which we
> * will allow a single TSO frame to consume. Building TSO frames
> * which are too large can cause TCP streams to be bursty.
> @@ -65,6 +68,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
> int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
> EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
>
> +static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> + int push_one, gfp_t gfp);
>
> /* Account for new data that has been sent to the network. */
> static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
> @@ -783,6 +788,118 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
> return size;
> }
>
> +
> +/* TCP SMALL QUEUES (TSQ)
> + *
> + * TSQ's goal is to keep a small number of skbs per tcp flow in tx queues (qdisc+dev)
> + * to reduce RTT and bufferbloat.
> + * We do this using a special skb destructor (tcp_wfree).
> + *
> + * It's important that tcp_wfree() can be replaced by sock_wfree() in the
> + * event the skb needs to be reallocated in a driver.
> + * The invariant being skb->truesize subtracted from sk->sk_wmem_alloc.
> + *
> + * Since transmit from skb destructor is forbidden, we use a tasklet
> + * to process all sockets that eventually need to send more skbs.
> + * We use one tasklet per cpu, with its own queue of sockets.
> + */
> +struct tsq_tasklet {
> + struct tasklet_struct tasklet;
> + struct list_head head; /* queue of tcp sockets */
> +};
> +static DEFINE_PER_CPU(struct tsq_tasklet, tsq_tasklet);
> +
> +/*
> + * One tasklet per cpu tries to send more skbs.
> + * We run in tasklet context but need to disable irqs when
> + * transferring tsq->head because tcp_wfree() might
> + * interrupt us (non NAPI drivers)
> + */
> +static void tcp_tasklet_func(unsigned long data)
> +{
> + struct tsq_tasklet *tsq = (struct tsq_tasklet *)data;
> + LIST_HEAD(list);
> + unsigned long flags;
> + struct list_head *q, *n;
> + struct tcp_sock *tp;
> + struct sock *sk;
> +
> + local_irq_save(flags);
> + list_splice_init(&tsq->head, &list);
> + local_irq_restore(flags);
> +
> + list_for_each_safe(q, n, &list) {
> + tp = list_entry(q, struct tcp_sock, tsq_node);
> + list_del(&tp->tsq_node);
> +
> + sk = (struct sock *)tp;
> + bh_lock_sock(sk);
> +
> + if (!sock_owned_by_user(sk)) {
> + if ((1 << sk->sk_state) &
> + (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED))
> + tcp_write_xmit(sk,
> + tcp_current_mss(sk),
> + 0, 0,
> + GFP_ATOMIC);
Is this case possible: the app does a large send and immediately closes
the socket. Then the queue is throttled and tcp_write_xmit is called
back when the state is TCP_FIN_WAIT1.
I think tcp_write_xmit should continue regardless of the current state,
because the send may be throttled/delayed but the state change is
synchronous.
> + } else {
> + /* TODO:
> + * setup a timer, or check TSQ_OWNED in release_sock()
> + */
> + set_bit(TSQ_OWNED, &tp->tsq_flags);
> + }
> + bh_unlock_sock(sk);
> +
> + clear_bit(TSQ_QUEUED, &tp->tsq_flags);
> + sk_free(sk);
> + }
> +}
> +
> +void __init tcp_tasklet_init(void)
> +{
> + int i;
> +
> + for_each_possible_cpu(i) {
> + struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
> +
> + INIT_LIST_HEAD(&tsq->head);
> + tasklet_init(&tsq->tasklet,
> + tcp_tasklet_func,
> + (unsigned long)tsq);
> + }
> +}
> +
> +/*
> + * Write buffer destructor automatically called from kfree_skb.
> + * We can't xmit new skbs from this context, as we might already
> + * hold qdisc lock.
> + */
> +void tcp_wfree(struct sk_buff *skb)
> +{
> + struct sock *sk = skb->sk;
> + struct tcp_sock *tp = tcp_sk(sk);
> +
> + if (test_and_clear_bit(TSQ_THROTTLED, &tp->tsq_flags) &&
> + !test_and_set_bit(TSQ_QUEUED, &tp->tsq_flags)) {
> + unsigned long flags;
> + struct tsq_tasklet *tsq;
> +
> + /* Keep a ref on socket.
> + * This last ref will be released in tcp_tasklet_func()
> + */
> + atomic_sub(skb->truesize - 1, &sk->sk_wmem_alloc);
> +
> + /* queue this socket to tasklet queue */
> + local_irq_save(flags);
> + tsq = &__get_cpu_var(tsq_tasklet);
> + list_add(&tp->tsq_node, &tsq->head);
> + tasklet_schedule(&tsq->tasklet);
> + local_irq_restore(flags);
> + } else {
> + sock_wfree(skb);
> + }
> +}
> +
> /* This routine actually transmits TCP packets queued in by
> * tcp_do_sendmsg(). This is used by both the initial
> * transmission and possible later retransmissions.
> @@ -844,7 +961,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>
> skb_push(skb, tcp_header_size);
> skb_reset_transport_header(skb);
> - skb_set_owner_w(skb, sk);
> +
> + skb_orphan(skb);
> + skb->sk = sk;
> + skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
> + tcp_wfree : sock_wfree;
> + atomic_add(skb->truesize, &sk->sk_wmem_alloc);
>
> /* Build TCP header and checksum it. */
> th = tcp_hdr(skb);
> @@ -1780,6 +1902,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> while ((skb = tcp_send_head(sk))) {
> unsigned int limit;
>
> +
> tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
> BUG_ON(!tso_segs);
>
> @@ -1800,6 +1923,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> break;
> }
>
> + /* TSQ : sk_wmem_alloc accounts skb truesize,
> + * including skb overhead. But that's OK.
> + */
> + if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> + set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> + break;
> + }
> limit = mss_now;
> if (tso_segs > 1 && !tcp_urg_mode(tp))
> limit = tcp_mss_split_point(sk, skb, mss_now,
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 17:37 ` Yuchung Cheng
@ 2012-07-10 18:32 ` Eric Dumazet
0 siblings, 0 replies; 44+ messages in thread
From: Eric Dumazet @ 2012-07-10 18:32 UTC (permalink / raw)
To: Yuchung Cheng
Cc: David Miller, dave.taht, netdev, codel, therbert, mattmathis,
nanditad, ncardwell, andrewmcgr
On Tue, 2012-07-10 at 10:37 -0700, Yuchung Cheng wrote:
> On Tue, Jul 10, 2012 at 8:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > +
> > + if (!sock_owned_by_user(sk)) {
> > + if ((1 << sk->sk_state) &
> > + (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED))
> > + tcp_write_xmit(sk,
> > + tcp_current_mss(sk),
> > + 0, 0,
> > + GFP_ATOMIC);
> Is this case possible: the app does a large send and immediately closes
> the socket. Then the queue is throttled and tcp_write_xmit is called
> back when the state is TCP_FIN_WAIT1.
>
> I think tcp_write_xmit should continue regardless of the current state,
> because the send may be throttled/delayed but the state change is
> synchronous.
>
I need to test some allowed states, I think.
Maybe I missed some states, but I don't think we should call
tcp_write_xmit() if the socket is now in TIMEWAIT state?
(Because of the tasklet delay, we might handle TX completion _after_
the socket state change.)
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 15:13 ` [RFC PATCH v2] tcp: TCP Small Queues Eric Dumazet
2012-07-10 17:06 ` Eric Dumazet
2012-07-10 17:37 ` Yuchung Cheng
@ 2012-07-11 15:11 ` Eric Dumazet
2012-07-11 15:16 ` Ben Greear
` (2 more replies)
2012-07-12 13:33 ` John Heffner
3 siblings, 3 replies; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 15:11 UTC (permalink / raw)
To: David Miller; +Cc: nanditad, netdev, codel, mattmathis, ncardwell
On Tue, 2012-07-10 at 17:13 +0200, Eric Dumazet wrote:
> This introduces TSQ (TCP Small Queues).
>
> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in the qdisc/dev layers
> at a given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2 = 20000: having smaller TSO
> packets can help reduce latencies of high-prio packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send the following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, with or without TSO/GSO, and with no reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and both sides' socket autotuning no longer uses 4 MBytes.
>
> As the skb destructor cannot restart xmit itself (as the qdisc lock
> might be taken at this point), we delegate the work to a tasklet. We
> use one tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
> but some drivers call it in their start_xmit() handler.
> These drivers should at least use BQL, or else a single TCP
> session can still fill the whole NIC TX ring, since TSQ will
> have no effect.
I am going to send an official patch (I'll put a v3 tag in it).
I believe I did a full implementation, including the xmit() done
by the user at release_sock() time, if the tasklet found the socket
owned by the user.
Some bench results about the choice of 128KB as the default value:
64KB seems to be the 'good' value to reach max throughput on 10Gb links
on my lab machines (ixgbe adapters).
Using 128KB is a conservative value to allow link rate at 20Gbps.
Still, it allows less than 1ms of buffering on a Gbit link, and less
than 8ms on a 100Mbit link (instead of 130ms without Small Queues).
Tests using a single TCP flow.
Tests on 10Gbit links :
echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
392360 392360 16384 20.00 1389.53 10^6bits/s 0.52 S 4.30 S 0.737 1.014 usec/KB
echo 24576 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 33 tpci_snd_cwnd 86
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
396976 396976 16384 20.00 1483.03 10^6bits/s 0.45 S 4.51 S 0.603 0.997 usec/KB
echo 32768 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 19 tpci_snd_cwnd 100
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
461600 461600 16384 20.00 2039.67 10^6bits/s 0.64 S 5.17 S 0.620 0.830 usec/KB
echo 49152 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 28 tpci_snd_cwnd 207
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
955512 955512 16384 20.00 4448.86 10^6bits/s 1.19 S 11.16 S 0.526 0.822 usec/KB
echo 65536 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 399 tpci_snd_cwnd 488
tcpi_reordering 127 tcpi_total_retrans 75
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2460328 2460328 16384 20.00 5975.12 10^6bits/s 1.81 S 14.65 S 0.595 0.803 usec/KB
echo 81920 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 24 tpci_snd_cwnd 236
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1144768 1144768 16384 20.00 5190.08 10^6bits/s 1.56 S 12.63 S 0.591 0.798 usec/KB
echo 98304 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 20 tpci_snd_cwnd 644
tcpi_reordering 59 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2991168 2991168 16384 20.00 5976.00 10^6bits/s 1.60 S 14.61 S 0.526 0.801 usec/KB
echo 114688 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 23 tpci_snd_cwnd 683
tcpi_reordering 59 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3161960 3161960 16384 20.00 5975.14 10^6bits/s 1.42 S 14.78 S 0.469 0.810 usec/KB
echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 23 tpci_snd_cwnd 591
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2728056 2728056 16384 20.00 5976.16 10^6bits/s 1.71 S 14.62 S 0.562 0.802 usec/KB
echo 147456 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 27 tpci_snd_cwnd 697
tcpi_reordering 64 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3240432 3240432 16384 20.00 5975.64 10^6bits/s 1.51 S 14.78 S 0.498 0.811 usec/KB
echo 163840 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 18 tpci_snd_cwnd 710
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3277360 3277360 16384 20.00 5975.56 10^6bits/s 1.59 S 14.79 S 0.525 0.811 usec/KB
echo 180224 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 32 tpci_snd_cwnd 701
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3235816 3235816 16384 20.00 5976.80 10^6bits/s 1.56 S 14.61 S 0.514 0.801 usec/KB
echo 196608 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 502 tpci_snd_cwnd 690
tcpi_reordering 127 tcpi_total_retrans 37
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3185040 3185040 16384 20.00 5975.46 10^6bits/s 1.50 S 14.67 S 0.493 0.804 usec/KB
echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 721
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
3448152 3448152 16384 20.00 5975.49 10^6bits/s 1.57 S 14.78 S 0.516 0.811 usec/KB
echo 524288 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2000 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 927
tcpi_reordering 53 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
4194304 4194304 16384 20.01 5976.61 10^6bits/s 1.63 S 14.56 S 0.538 0.798 usec/KB
echo 1048576 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2500 tcpi_rttvar 750 tcpi_snd_ssthresh 17 tpci_snd_cwnd 1272
tcpi_reordering 90 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
4194304 4194304 16384 20.01 5975.11 10^6bits/s 1.64 S 14.69 S 0.541 0.805 usec/KB
Tests on Gbit link :
echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 30 tpci_snd_cwnd 274
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1264784 1264784 16384 20.01 689.70 10^6bits/s 0.22 S 15.05 S 0.634 7.149 usec/KB
echo 24576 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 43 tpci_snd_cwnd 245
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1130920 1130920 16384 20.01 860.21 10^6bits/s 0.25 S 16.05 S 0.576 6.112 usec/KB
echo 32768 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 36 tpci_snd_cwnd 229
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1057064 1057064 16384 20.01 867.76 10^6bits/s 0.28 S 15.46 S 0.634 5.839 usec/KB
echo 49152 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 32 tpci_snd_cwnd 293
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1352488 1352488 16384 20.01 873.61 10^6bits/s 0.21 S 16.25 S 0.483 6.095 usec/KB
echo 65536 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 48 tpci_snd_cwnd 274
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1264784 1264784 16384 20.01 875.90 10^6bits/s 0.19 S 15.56 S 0.421 5.822 usec/KB
echo 81920 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 18 tpci_snd_cwnd 246
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1135536 1135536 16384 20.01 879.10 10^6bits/s 0.26 S 15.92 S 0.590 5.935 usec/KB
echo 98304 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 20 tpci_snd_cwnd 361
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1666376 1666376 16384 20.02 880.30 10^6bits/s 0.25 S 16.07 S 0.560 5.980 usec/KB
echo 114688 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 41 tpci_snd_cwnd 281
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1297096 1297096 16384 20.01 881.30 10^6bits/s 0.26 S 15.96 S 0.569 5.933 usec/KB
echo 131072 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 30 tpci_snd_cwnd 292
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1347872 1347872 16384 20.01 880.43 10^6bits/s 0.23 S 16.71 S 0.511 6.219 usec/KB
echo 147456 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2000 tcpi_rttvar 750 tcpi_snd_ssthresh 31 tpci_snd_cwnd 286
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1320176 1320176 16384 20.01 880.57 10^6bits/s 0.24 S 16.62 S 0.534 6.187 usec/KB
echo 163840 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 19 tpci_snd_cwnd 406
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1874096 1874096 16384 20.02 880.23 10^6bits/s 0.25 S 17.08 S 0.550 6.358 usec/KB
echo 180224 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2000 tcpi_rttvar 750 tcpi_snd_ssthresh 27 tpci_snd_cwnd 304
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1403264 1403264 16384 20.01 880.34 10^6bits/s 0.22 S 16.03 S 0.501 5.965 usec/KB
echo 196608 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2000 tcpi_rttvar 750 tcpi_snd_ssthresh 42 tpci_snd_cwnd 365
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1684840 1684840 16384 20.02 879.73 10^6bits/s 0.26 S 16.82 S 0.578 6.267 usec/KB
echo 262144 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 202000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 2875 tcpi_rttvar 750 tcpi_snd_ssthresh 27 tpci_snd_cwnd 471
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2174136 2174136 16384 20.01 879.89 10^6bits/s 0.25 S 18.52 S 0.556 6.898 usec/KB
echo 524288 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 205000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 5000 tcpi_rttvar 750 tcpi_snd_ssthresh 42 tpci_snd_cwnd 627
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
2894232 2894232 16384 20.03 879.84 10^6bits/s 0.25 S 17.12 S 0.564 6.374 usec/KB
echo 1048576 >/proc/sys/net/ipv4/tcp_limit_output_bytes
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
tcpi_rto 209000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
tcpi_rtt 9875 tcpi_rttvar 750 tcpi_snd_ssthresh 33 tpci_snd_cwnd 950
tcpi_reordering 3 tcpi_total_retrans 0
Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
4194304 4194304 16384 20.03 880.70 10^6bits/s 0.25 S 18.44 S 0.560 6.861 usec/KB
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:11 ` Eric Dumazet
@ 2012-07-11 15:16 ` Ben Greear
2012-07-11 15:25 ` Eric Dumazet
2012-07-11 18:23 ` Rick Jones
2012-07-11 18:44 ` Rick Jones
2 siblings, 1 reply; 44+ messages in thread
From: Ben Greear @ 2012-07-11 15:16 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr, Rick Jones
On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> On Tue, 2012-07-10 at 17:13 +0200, Eric Dumazet wrote:
>> This introduces TSQ (TCP Small Queues)
>>
>> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
>> device queues), reducing RTT and cwnd bias, part of the bufferbloat
>> problem.
>>
>> sk->sk_wmem_alloc is not allowed to grow above a given limit,
>> allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
>> given time.
>>
>> TSO packets are sized/capped to half the limit, so that we have two
>> TSO packets in flight, allowing better bandwidth use.
>>
>> As a side effect, setting the limit to 40000 automatically reduces the
>> standard gso max limit (65536) to 40000/2 : It can help to reduce
>> latencies of high prio packets, having smaller TSO packets.
>>
>> This means we divert sock_wfree() to a tcp_wfree() handler, to
>> queue/send following frames when skb_orphan() [2] is called for the
>> already queued skbs.
>>
>> Results on my dev machine (tg3 nic) are really impressive, using
>> standard pfifo_fast, and with or without TSO/GSO. Without reduction of
>> nominal bandwidth.
>>
>> I no longer have 3MBytes backlogged in qdisc by a single netperf
>> session, and both side socket autotuning no longer use 4 Mbytes.
>>
>> As the skb destructor cannot restart xmit itself (the qdisc lock might be
>> held at this point), we delegate the work to a tasklet. We use one
>> tasklet per cpu for performance reasons.
>>
>>
>>
>> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
>> [2] skb_orphan() is usually called at TX completion time,
>> but some drivers call it in their start_xmit() handler.
>> These drivers should at least use BQL, or else a single TCP
>> session can still fill the whole NIC TX ring, since TSQ will
>> have no effect.
>
> I am going to send an official patch (I'll put a v3 tag in it)
>
> I believe I did a full implementation, including the xmit() done
> by the user at release_sock() time, if the tasklet found socket owned by
> the user.
>
> Some bench results about the choice of 128KB being the default value:
>
> 64KB seems the 'good' value on 10Gb links to reach max throughput on my
> lab machines (ixgbe adapters).
>
> Using 128KB is a very conservative value to allow link rate on 20Gbps.
>
> Still, it allows less than 1ms of buffering on a Gbit link, and less
> than 8ms on 100Mbit link (instead of 130ms without Small Queues)
I haven't read your patch in detail, but I was wondering if this feature
would cause trouble for applications that are servicing many sockets at once
and so might take several ms between handling each individual socket.
Or, applications that for other reasons cannot service sockets quite
as fast. Without this feature, they could poke more data into the
xmit queues to be handled by the kernel while the app goes about its
other user-space work?
Maybe this feature could be enabled/tuned on a per-socket basis?
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:16 ` Ben Greear
@ 2012-07-11 15:25 ` Eric Dumazet
2012-07-11 15:43 ` Ben Greear
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 15:25 UTC (permalink / raw)
To: Ben Greear; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Wed, 2012-07-11 at 08:16 -0700, Ben Greear wrote:
> I haven't read your patch in detail, but I was wondering if this feature
> would cause trouble for applications that are servicing many sockets at once
> and so might take several ms between handling each individual socket.
>
Well, this patch has no impact on such applications. In fact their
send()/write() will return to userland faster than before (for very
large send()s).
> Or, applications that for other reasons cannot service sockets quite
> as fast. Without this feature, they could poke more data into the
> xmit queues to be handled by the kernel while the app goes about it's
> other user-space work?
>
There is no impact on the applications. They queue their data in the
socket write queue, and the TCP stack does the work of actually
transmitting data and handling ACKs.
Before this patch, this work was triggered by:
- Timers
- Incoming ACKs
We now add a third trigger: TX completion.
> Maybe this feature could be enabled/tuned on a per-socket basis?
Well, why not, but first I want to see why it would be needed.
I mean, if a single application _needs_ to put megabytes of TCP data in
the qdisc at once, everything else on the machine is stuck (as today),
so just increase the global parameter.
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:25 ` Eric Dumazet
@ 2012-07-11 15:43 ` Ben Greear
2012-07-11 15:54 ` Eric Dumazet
0 siblings, 1 reply; 44+ messages in thread
From: Ben Greear @ 2012-07-11 15:43 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr, Rick Jones
On 07/11/2012 08:25 AM, Eric Dumazet wrote:
> On Wed, 2012-07-11 at 08:16 -0700, Ben Greear wrote:
>
>> I haven't read your patch in detail, but I was wondering if this feature
>> would cause trouble for applications that are servicing many sockets at once
>> and so might take several ms between handling each individual socket.
>>
>
> Well, this patch has no impact for such applications. In fact their
> send()/write() will return to userland faster than before (for very
> large send())
Maybe I'm just confused. Is your patch just mucking with
the queues below the tcp xmit queues? From the patch description
I was thinking you were somehow directly limiting the TCP xmit
queues...
If you are just draining the tcp xmit queues on a new/faster
trigger, then I see no problem with that, and no need for
a per-socket control.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:43 ` Ben Greear
@ 2012-07-11 15:54 ` Eric Dumazet
2012-07-11 16:03 ` Ben Greear
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 15:54 UTC (permalink / raw)
To: Ben Greear; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Wed, 2012-07-11 at 08:43 -0700, Ben Greear wrote:
> On 07/11/2012 08:25 AM, Eric Dumazet wrote:
> > On Wed, 2012-07-11 at 08:16 -0700, Ben Greear wrote:
> >
> >> I haven't read your patch in detail, but I was wondering if this feature
> >> would cause trouble for applications that are servicing many sockets at once
> >> and so might take several ms between handling each individual socket.
> >>
> >
> > Well, this patch has no impact for such applications. In fact their
> > send()/write() will return to userland faster than before (for very
> > large send())
>
> Maybe I'm just confused. Is your patch just mucking with
> the queues below the tcp xmit queues? From the patch description
> I was thinking you were somehow directly limiting the TCP xmit
> queues...
>
I don't limit TCP xmit queues. I might avoid excessive autotuning.
> If you are just draining the tcp xmit queues on a new/faster
> trigger, then I see no problem with that, and no need for
> a per-socket control.
That's the plan: limiting the number of bytes in the qdisc, not the
number of bytes in the socket write queue.
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:54 ` Eric Dumazet
@ 2012-07-11 16:03 ` Ben Greear
0 siblings, 0 replies; 44+ messages in thread
From: Ben Greear @ 2012-07-11 16:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr, Rick Jones
On 07/11/2012 08:54 AM, Eric Dumazet wrote:
> On Wed, 2012-07-11 at 08:43 -0700, Ben Greear wrote:
>> On 07/11/2012 08:25 AM, Eric Dumazet wrote:
>>> On Wed, 2012-07-11 at 08:16 -0700, Ben Greear wrote:
>>>
>>>> I haven't read your patch in detail, but I was wondering if this feature
>>>> would cause trouble for applications that are servicing many sockets at once
>>>> and so might take several ms between handling each individual socket.
>>>>
>>>
>>> Well, this patch has no impact for such applications. In fact their
>>> send()/write() will return to userland faster than before (for very
>>> large send())
>>
>> Maybe I'm just confused. Is your patch just mucking with
>> the queues below the tcp xmit queues? From the patch description
>> I was thinking you were somehow directly limiting the TCP xmit
>> queues...
>>
>
> I dont limit tcp xmit queues. I might avoid excessive autotuning.
>
>
>
>> If you are just draining the tcp xmit queues on a new/faster
>> trigger, then I see no problem with that, and no need for
>> a per-socket control.
>
> That's the plan: limiting the number of bytes in the qdisc, not the
> number of bytes in the socket write queue.
Thanks for the explanation.
Out of curiosity, have you tried running multiple TCP streams
with different processes driving each stream, where each is trying
to drive, say, 700Mbps bi-directional traffic over a 1Gbps link?
Perhaps with 50ms of latency generated by a network emulator.
This used to cause some extremely high latency
due to excessive TCP xmit queues (from what I could tell),
but maybe this new patch will cure that.
I'll re-run my tests with your patch eventually..but too bogged
down to do so soon.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:11 ` Eric Dumazet
2012-07-11 15:16 ` Ben Greear
@ 2012-07-11 18:23 ` Rick Jones
2012-07-11 23:38 ` Eric Dumazet
2012-07-11 18:44 ` Rick Jones
2 siblings, 1 reply; 44+ messages in thread
From: Rick Jones @ 2012-07-11 18:23 UTC (permalink / raw)
To: Eric Dumazet; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On 07/11/2012 08:11 AM, Eric Dumazet wrote:
>
>
> Tests using a single TCP flow.
>
> Tests on 10Gbit links :
>
>
> echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> tcpi_reordering 53 tcpi_total_retrans 0
I take it you hacked your local copy of netperf to emit those? Or did I
leave some cruft behind in something I committed to the repository?
What was the ultimate limiter on throughput? I notice it didn't achieve
link-rate on either 10 GbE nor 1 GbE.
> That's the plan: limiting the number of bytes in the qdisc, not the
> number of bytes in the socket write queue.
So the SO_SNDBUF can still grow rather larger than necessary? It is
just that TCP will be nice to the other flows by not dumping all of it
into the qdisc at once. Latency seen by the application itself is then
unchanged since there will still be (potentially) as much stuff queued
in the SO_SNDBUF as before, right?
rick
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 18:23 ` Rick Jones
@ 2012-07-11 23:38 ` Eric Dumazet
0 siblings, 0 replies; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 23:38 UTC (permalink / raw)
To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Wed, 2012-07-11 at 11:23 -0700, Rick Jones wrote:
> On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> >
> >
> > Tests using a single TCP flow.
> >
> > Tests on 10Gbit links :
> >
> >
> > echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> > tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> > tcpi_reordering 53 tcpi_total_retrans 0
>
> I take it you hacked your local copy of netperf to emit those? Or did I
> leave some cruft behind in something I committed to the repository?
>
Yep, it's netperf-2.5.0 with a one-line change to output these TCP_INFO
bits.
> What was the ultimate limiter on throughput? I notice it didn't achieve
> link-rate on either 10 GbE nor 1 GbE.
>
My lab has one fast machine (source in this 10Gb test), and one slow
machine (Intel Q6600 quad core), both with ixgbe cards.
On Gigabit test, the receiver is a laptop.
> > Thats the plan : limiting numer of bytes in Qdisc, not number of bytes
> > in socket write queue.
>
> So the SO_SNDBUF can still grow rather larger than necessary? It is
> just that TCP will be nice to the other flows by not dumping all of it
> into the qdisc at once. Latency seen by the application itself is then
> unchanged since there will still be (potentially) as much stuff queued
> in the SO_SNDBUF as before right?
Of course SO_SNDBUF can grow if autotuning is enabled.
I think there is a bit of misunderstanding about this patch and what it
does.
It only makes sure that packets (from the socket write queue) are
cloned into the qdisc/device queue in a limited way, not "as much as
allowed".
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 15:11 ` Eric Dumazet
2012-07-11 15:16 ` Ben Greear
2012-07-11 18:23 ` Rick Jones
@ 2012-07-11 18:44 ` Rick Jones
2012-07-11 23:49 ` Eric Dumazet
2 siblings, 1 reply; 44+ messages in thread
From: Rick Jones @ 2012-07-11 18:44 UTC (permalink / raw)
To: Eric Dumazet; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On 07/11/2012 08:11 AM, Eric Dumazet wrote:
>
> Some bench results about the choice of 128KB being the default value:
What were the starting/baseline figures?
>
> Tests using a single TCP flow.
>
> Tests on 10Gbit links :
>
>
> echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> tcpi_reordering 53 tcpi_total_retrans 0
> Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
> Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
> Size Size Size (sec) Util Util Util Util Demand Demand Units
> Final Final % Method % Method
> 392360 392360 16384 20.00 1389.53 10^6bits/s 0.52 S 4.30 S 0.737 1.014 usec/KB
By the way, that double reporting of the local socket send size is fixed in:
------------------------------------------------------------------------
r516 | raj | 2012-01-05 15:48:52 -0800 (Thu, 05 Jan 2012) | 1 line
report the rsr_size_end in an omni stream test rather than a copy of the
lss_size_end
of netperf and later. Also, any idea why the local socket send size got
so much larger with 1GbE than 10 GbE at that setting of
tcp_limit_output_bytes?
> Tests on Gbit link :
>
>
> echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.18 (172.30.42.18) port 0 AF_INET
> tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 30 tpci_snd_cwnd 274
> tcpi_reordering 3 tcpi_total_retrans 0
> Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
> Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
> Size Size Size (sec) Util Util Util Util Demand Demand Units
> Final Final % Method % Method
> 1264784 1264784 16384 20.01 689.70 10^6bits/s 0.22 S 15.05 S 0.634 7.149 usec/KB
rick jones
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 18:44 ` Rick Jones
@ 2012-07-11 23:49 ` Eric Dumazet
2012-07-12 7:34 ` Eric Dumazet
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-11 23:49 UTC (permalink / raw)
To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Wed, 2012-07-11 at 11:44 -0700, Rick Jones wrote:
> On 07/11/2012 08:11 AM, Eric Dumazet wrote:
> >
> > Some bench results about the choice of 128KB being the default value:
>
> What were the starting/baseline figures?
>
> >
> > Tests using a single TCP flow.
> >
> > Tests on 10Gbit links :
> >
> >
> > echo 16384 >/proc/sys/net/ipv4/tcp_limit_output_bytes
> > OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.99.2 (192.168.99.2) port 0 AF_INET
> > tcpi_rto 201000 tcpi_ato 0 tcpi_pmtu 1500 tcpi_rcv_ssthresh 14600
> > tcpi_rtt 1875 tcpi_rttvar 750 tcpi_snd_ssthresh 16 tpci_snd_cwnd 79
> > tcpi_reordering 53 tcpi_total_retrans 0
> > Local Local Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
> > Send Socket Send Socket Send Time Units CPU CPU CPU CPU Service Service Demand
> > Size Size Size (sec) Util Util Util Util Demand Demand Units
> > Final Final % Method % Method
> > 392360 392360 16384 20.00 1389.53 10^6bits/s 0.52 S 4.30 S 0.737 1.014 usec/KB
>
> By the way, that double reporting of the local socket send size is fixed in:
>
> ------------------------------------------------------------------------
> r516 | raj | 2012-01-05 15:48:52 -0800 (Thu, 05 Jan 2012) | 1 line
>
> report the rsr_size_end in an omni stream test rather than a copy of the
> lss_size_end
>
> of netperf and later. Also, any idea why the local socket send size got
> so much larger with 1GbE than 10 GbE at that setting of
> tcp_limit_output_bytes?
The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
ubuntu kernel. They probably have very different TCP behavior.
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-11 23:49 ` Eric Dumazet
@ 2012-07-12 7:34 ` Eric Dumazet
2012-07-12 7:37 ` David Miller
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-12 7:34 UTC (permalink / raw)
To: Rick Jones; +Cc: nanditad, netdev, mattmathis, codel, ncardwell, David Miller
On Thu, 2012-07-12 at 01:49 +0200, Eric Dumazet wrote:
> The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
> ubuntu kernel. They probably have very different TCP behavior.
I tested TSQ on bnx2x and 10Gb links.
I get full rate even using 65536 bytes for
the /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.8.37 () port 0 AF_INET : histogram
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1606536 2097152 16384 20.00 9411.12 10^6bits/s 2.40 S 4.27 S 0.502 0.892 usec/KB
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 7:34 ` Eric Dumazet
@ 2012-07-12 7:37 ` David Miller
2012-07-12 7:51 ` Eric Dumazet
0 siblings, 1 reply; 44+ messages in thread
From: David Miller @ 2012-07-12 7:37 UTC (permalink / raw)
To: eric.dumazet; +Cc: nanditad, netdev, mattmathis, codel, ncardwell
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 12 Jul 2012 09:34:19 +0200
> On Thu, 2012-07-12 at 01:49 +0200, Eric Dumazet wrote:
>
>> The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
>> ubuntu kernel. They probably have very different TCP behavior.
>
>
> I tested TSQ on bnx2x and 10Gb links.
>
> I get full rate even using 65536 bytes for
> the /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
Great work Eric.
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 7:37 ` David Miller
@ 2012-07-12 7:51 ` Eric Dumazet
2012-07-12 14:55 ` Tom Herbert
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-12 7:51 UTC (permalink / raw)
To: David Miller; +Cc: nanditad, netdev, mattmathis, codel, ncardwell
On Thu, 2012-07-12 at 00:37 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 12 Jul 2012 09:34:19 +0200
>
> > On Thu, 2012-07-12 at 01:49 +0200, Eric Dumazet wrote:
> >
> >> The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
> >> ubuntu kernel. They probably have very different TCP behavior.
> >
> >
> > I tested TSQ on bnx2x and 10Gb links.
> >
> > I get full rate even using 65536 bytes for
> > the /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
>
> Great work Eric.
Thanks !
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 7:51 ` Eric Dumazet
@ 2012-07-12 14:55 ` Tom Herbert
0 siblings, 0 replies; 44+ messages in thread
From: Tom Herbert @ 2012-07-12 14:55 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, rick.jones2, ycheng, dave.taht, netdev, codel,
mattmathis, nanditad, ncardwell, andrewmcgr
On Thu, Jul 12, 2012 at 12:51 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-07-12 at 00:37 -0700, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Thu, 12 Jul 2012 09:34:19 +0200
>>
>> > On Thu, 2012-07-12 at 01:49 +0200, Eric Dumazet wrote:
>> >
>> >> The 10Gb receiver is a net-next kernel, but the 1Gb receiver is a 2.6.38
>> >> ubuntu kernel. They probably have very different TCP behavior.
>> >
>> >
>> > I tested TSQ on bnx2x and 10Gb links.
>> >
>> > I get full rate even using 65536 bytes for
>> > the /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
>>
>> Great work Eric.
>
> Thanks !
>
This is indeed great work! A couple of comments...
Do you know if there are any qdiscs that function less efficiently
when we are restricting the number of packets? For instance, will HTB
work as expected in various configurations?
One extension to this work would be to make the limit dynamic and mostly
eliminate the tunable. I'm thinking we might be able to correlate the
limit to the BQL limit of the egress queue for the flow, if there is
one.
Assuming all work-conserving qdiscs, the minimal amount of outstanding
host data for a queue could be associated with the BQL limit of the
egress NIC queue. We want to minimize the outstanding data subject to:
sum(data_of_tcp_flows_sharing_same_queue) > bql_limit_for_queue
So this could imply a per flow limit of:
tcp_limit = max(bql_limit - bql_inflight, one_packet)
For a single active connection on a queue, the tcp_limit is equal to
the BQL limit. Once the BQL limit is hit in the NIC, we only need one
packet outstanding per flow to maintain flow control. For fairness,
we might need "one_packet" to actually be max GSO data. Also, this
disregards any latency in scheduling and running the tasklet, which
might also need to be taken into account.
Tom
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-10 15:13 ` [RFC PATCH v2] tcp: TCP Small Queues Eric Dumazet
` (2 preceding siblings ...)
2012-07-11 15:11 ` Eric Dumazet
@ 2012-07-12 13:33 ` John Heffner
2012-07-12 13:46 ` Eric Dumazet
3 siblings, 1 reply; 44+ messages in thread
From: John Heffner @ 2012-07-12 13:33 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr
One general question: why a per-connection limit? I haven't been
following the bufferbloat conversation closely so I may have missed
some of the conversation. But it seems that multiple connections will
still cause longer queue times.
Thanks,
-John
On Tue, Jul 10, 2012 at 11:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> This introduces TSQ (TCP Small Queues)
>
> TSQ's goal is to reduce the number of TCP packets in xmit queues (qdisc &
> device queues), to reduce RTT and cwnd bias, part of the bufferbloat
> problem.
>
> sk->sk_wmem_alloc is not allowed to grow above a given limit,
> allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
> given time.
>
> TSO packets are sized/capped to half the limit, so that we have two
> TSO packets in flight, allowing better bandwidth use.
>
> As a side effect, setting the limit to 40000 automatically reduces the
> standard gso max limit (65536) to 40000/2: it can help to reduce
> latencies of high-prio packets by using smaller TSO packets.
>
> This means we divert sock_wfree() to a tcp_wfree() handler, to
> queue/send following frames when skb_orphan() [2] is called for the
> already queued skbs.
>
> Results on my dev machine (tg3 nic) are really impressive, using
> standard pfifo_fast, and with or without TSO/GSO. Without reduction of
> nominal bandwidth.
>
> I no longer have 3MBytes backlogged in qdisc by a single netperf
> session, and socket autotuning on both sides no longer uses 4 Mbytes.
>
> As the skb destructor cannot restart xmit itself (as the qdisc lock might be
> taken at this point), we delegate the work to a tasklet. We use one
> tasklet per cpu for performance reasons.
>
>
>
> [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
> [2] skb_orphan() is usually called at TX completion time,
> but some drivers call it in their start_xmit() handler.
> These drivers should at least use BQL, or else a single TCP
> session can still fill the whole NIC TX ring, since TSQ will
> have no effect.
>
> Not-Yet-Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/linux/tcp.h | 9 ++
> include/net/tcp.h | 3
> net/ipv4/sysctl_net_ipv4.c | 7 +
> net/ipv4/tcp.c | 14 ++-
> net/ipv4/tcp_minisocks.c | 1
> net/ipv4/tcp_output.c | 132 ++++++++++++++++++++++++++++++++++-
> 6 files changed, 160 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 7d3bced..55b8cf9 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -339,6 +339,9 @@ struct tcp_sock {
> u32 rcv_tstamp; /* timestamp of last received ACK (for keepalives) */
> u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
>
> + struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
> + unsigned long tsq_flags;
> +
> /* Data for direct copy to user */
> struct {
> struct sk_buff_head prequeue;
> @@ -494,6 +497,12 @@ struct tcp_sock {
> struct tcp_cookie_values *cookie_values;
> };
>
> +enum tsq_flags {
> + TSQ_THROTTLED,
> + TSQ_QUEUED,
> + TSQ_OWNED, /* tcp_tasklet_func() found socket was locked */
> +};
> +
> static inline struct tcp_sock *tcp_sk(const struct sock *sk)
> {
> return (struct tcp_sock *)sk;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 53fb7d8..3a6ed09 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -253,6 +253,7 @@ extern int sysctl_tcp_cookie_size;
> extern int sysctl_tcp_thin_linear_timeouts;
> extern int sysctl_tcp_thin_dupack;
> extern int sysctl_tcp_early_retrans;
> +extern int sysctl_tcp_limit_output_bytes;
>
> extern atomic_long_t tcp_memory_allocated;
> extern struct percpu_counter tcp_sockets_allocated;
> @@ -321,6 +322,8 @@ extern struct proto tcp_prot;
>
> extern void tcp_init_mem(struct net *net);
>
> +extern void tcp_tasklet_init(void);
> +
> extern void tcp_v4_err(struct sk_buff *skb, u32);
>
> extern void tcp_shutdown (struct sock *sk, int how);
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 12aa0c5..70730f7 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -598,6 +598,13 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec
> },
> + {
> + .procname = "tcp_limit_output_bytes",
> + .data = &sysctl_tcp_limit_output_bytes,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec
> + },
> #ifdef CONFIG_NET_DMA
> {
> .procname = "tcp_dma_copybreak",
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 3ba605f..8838bd2 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -376,6 +376,7 @@ void tcp_init_sock(struct sock *sk)
> skb_queue_head_init(&tp->out_of_order_queue);
> tcp_init_xmit_timers(sk);
> tcp_prequeue_init(tp);
> + INIT_LIST_HEAD(&tp->tsq_node);
>
> icsk->icsk_rto = TCP_TIMEOUT_INIT;
> tp->mdev = TCP_TIMEOUT_INIT;
> @@ -786,15 +787,17 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
> int large_allowed)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> - u32 xmit_size_goal, old_size_goal;
> + u32 xmit_size_goal, old_size_goal, gso_max_size;
>
> xmit_size_goal = mss_now;
>
> if (large_allowed && sk_can_gso(sk)) {
> - xmit_size_goal = ((sk->sk_gso_max_size - 1) -
> - inet_csk(sk)->icsk_af_ops->net_header_len -
> - inet_csk(sk)->icsk_ext_hdr_len -
> - tp->tcp_header_len);
> + gso_max_size = min_t(u32, sk->sk_gso_max_size,
> + sysctl_tcp_limit_output_bytes >> 1);
> + xmit_size_goal = (gso_max_size - 1) -
> + inet_csk(sk)->icsk_af_ops->net_header_len -
> + inet_csk(sk)->icsk_ext_hdr_len -
> + tp->tcp_header_len;
>
> xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
>
> @@ -3573,4 +3576,5 @@ void __init tcp_init(void)
> tcp_secret_primary = &tcp_secret_one;
> tcp_secret_retiring = &tcp_secret_two;
> tcp_secret_secondary = &tcp_secret_two;
> + tcp_tasklet_init();
> }
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 72b7c63..83b358f 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -482,6 +482,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
> treq->snt_isn + 1 + tcp_s_data_size(oldtp);
>
> tcp_prequeue_init(newtp);
> + INIT_LIST_HEAD(&newtp->tsq_node);
>
> tcp_init_wl(newtp, treq->rcv_isn);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c465d3e..991ae45 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -50,6 +50,9 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
> */
> int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
>
> +/* Default TSQ limit of two TSO segments */
> +int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
> +
> /* This limits the percentage of the congestion window which we
> * will allow a single TSO frame to consume. Building TSO frames
> * which are too large can cause TCP streams to be bursty.
> @@ -65,6 +68,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
> int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
> EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
>
> +static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> + int push_one, gfp_t gfp);
>
> /* Account for new data that has been sent to the network. */
> static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
> @@ -783,6 +788,118 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
> return size;
> }
>
> +
> +/* TCP SMALL QUEUES (TSQ)
> + *
> + * TSQ's goal is to keep a small number of skbs per tcp flow in tx queues (qdisc+dev)
> + * to reduce RTT and bufferbloat.
> + * We do this using a special skb destructor (tcp_wfree).
> + *
> + * It's important that tcp_wfree() can be replaced by sock_wfree() in the event the skb
> + * needs to be reallocated in a driver.
> + * The invariant being skb->truesize subtracted from sk->sk_wmem_alloc
> + *
> + * Since transmit from skb destructor is forbidden, we use a tasklet
> + * to process all sockets that eventually need to send more skbs.
> + * We use one tasklet per cpu, with its own queue of sockets.
> + */
> +struct tsq_tasklet {
> + struct tasklet_struct tasklet;
> + struct list_head head; /* queue of tcp sockets */
> +};
> +static DEFINE_PER_CPU(struct tsq_tasklet, tsq_tasklet);
> +
> +/*
> + * One tasklet per cpu tries to send more skbs.
> + * We run in tasklet context but need to disable irqs when
> + * transferring tsq->head because tcp_wfree() might
> + * interrupt us (non NAPI drivers)
> + */
> +static void tcp_tasklet_func(unsigned long data)
> +{
> + struct tsq_tasklet *tsq = (struct tsq_tasklet *)data;
> + LIST_HEAD(list);
> + unsigned long flags;
> + struct list_head *q, *n;
> + struct tcp_sock *tp;
> + struct sock *sk;
> +
> + local_irq_save(flags);
> + list_splice_init(&tsq->head, &list);
> + local_irq_restore(flags);
> +
> + list_for_each_safe(q, n, &list) {
> + tp = list_entry(q, struct tcp_sock, tsq_node);
> + list_del(&tp->tsq_node);
> +
> + sk = (struct sock *)tp;
> + bh_lock_sock(sk);
> +
> + if (!sock_owned_by_user(sk)) {
> + if ((1 << sk->sk_state) &
> + (TCPF_CLOSE_WAIT | TCPF_ESTABLISHED))
> + tcp_write_xmit(sk,
> + tcp_current_mss(sk),
> + 0, 0,
> + GFP_ATOMIC);
> + } else {
> + /* TODO:
> + * setup a timer, or check TSQ_OWNED in release_sock()
> + */
> + set_bit(TSQ_OWNED, &tp->tsq_flags);
> + }
> + bh_unlock_sock(sk);
> +
> + clear_bit(TSQ_QUEUED, &tp->tsq_flags);
> + sk_free(sk);
> + }
> +}
> +
> +void __init tcp_tasklet_init(void)
> +{
> + int i;
> +
> + for_each_possible_cpu(i) {
> + struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
> +
> + INIT_LIST_HEAD(&tsq->head);
> + tasklet_init(&tsq->tasklet,
> + tcp_tasklet_func,
> + (unsigned long)tsq);
> + }
> +}
> +
> +/*
> + * Write buffer destructor automatically called from kfree_skb.
> + * We can't xmit new skbs from this context, as we might already
> + * hold the qdisc lock.
> + */
> +void tcp_wfree(struct sk_buff *skb)
> +{
> + struct sock *sk = skb->sk;
> + struct tcp_sock *tp = tcp_sk(sk);
> +
> + if (test_and_clear_bit(TSQ_THROTTLED, &tp->tsq_flags) &&
> + !test_and_set_bit(TSQ_QUEUED, &tp->tsq_flags)) {
> + unsigned long flags;
> + struct tsq_tasklet *tsq;
> +
> + /* Keep a ref on socket.
> + * This last ref will be released in tcp_tasklet_func()
> + */
> + atomic_sub(skb->truesize - 1, &sk->sk_wmem_alloc);
> +
> + /* queue this socket to tasklet queue */
> + local_irq_save(flags);
> + tsq = &__get_cpu_var(tsq_tasklet);
> + list_add(&tp->tsq_node, &tsq->head);
> + tasklet_schedule(&tsq->tasklet);
> + local_irq_restore(flags);
> + } else {
> + sock_wfree(skb);
> + }
> +}
> +
> /* This routine actually transmits TCP packets queued in by
> * tcp_do_sendmsg(). This is used by both the initial
> * transmission and possible later retransmissions.
> @@ -844,7 +961,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>
> skb_push(skb, tcp_header_size);
> skb_reset_transport_header(skb);
> - skb_set_owner_w(skb, sk);
> +
> + skb_orphan(skb);
> + skb->sk = sk;
> + skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
> + tcp_wfree : sock_wfree;
> + atomic_add(skb->truesize, &sk->sk_wmem_alloc);
>
> /* Build TCP header and checksum it. */
> th = tcp_hdr(skb);
> @@ -1780,6 +1902,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> while ((skb = tcp_send_head(sk))) {
> unsigned int limit;
>
> +
> tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
> BUG_ON(!tso_segs);
>
> @@ -1800,6 +1923,13 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> break;
> }
>
> + /* TSQ : sk_wmem_alloc accounts for skb truesize,
> + * including skb overhead. But that's OK.
> + */
> + if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> + set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> + break;
> + }
> limit = mss_now;
> if (tso_segs > 1 && !tcp_urg_mode(tp))
> limit = tcp_mss_split_point(sk, skb, mss_now,
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 13:33 ` John Heffner
@ 2012-07-12 13:46 ` Eric Dumazet
2012-07-12 16:44 ` John Heffner
0 siblings, 1 reply; 44+ messages in thread
From: Eric Dumazet @ 2012-07-12 13:46 UTC (permalink / raw)
To: John Heffner
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr
On Thu, 2012-07-12 at 09:33 -0400, John Heffner wrote:
> One general question: why a per-connection limit? I haven't been
> following the bufferbloat conversation closely so I may have missed
> some of the conversation. But it seems that multiple connections will
> still cause longer queue times.
We already have a per-device limit, in qdisc.
If you want to monitor several tcp sessions, I urge you to use a controller
for that, like codel or fq_codel.
Experiments show that limiting to two TSO packets in qdisc per tcp flow
is enough to stop insane qdisc queueing, without impact on throughput
for people wanting fast tcp sessions.
That's not solving the more general problem of having 1000 competing
flows.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 13:46 ` Eric Dumazet
@ 2012-07-12 16:44 ` John Heffner
2012-07-12 16:54 ` Jim Gettys
0 siblings, 1 reply; 44+ messages in thread
From: John Heffner @ 2012-07-12 16:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, ycheng, dave.taht, netdev, codel, therbert,
mattmathis, nanditad, ncardwell, andrewmcgr
On Thu, Jul 12, 2012 at 9:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-07-12 at 09:33 -0400, John Heffner wrote:
>> One general question: why a per-connection limit? I haven't been
>> following the bufferbloat conversation closely so I may have missed
>> some of the conversation. But it seems that multiple connections will
>> still cause longer queue times.
>
> We already have a per-device limit, in qdisc.
>
> If you want to monitor several tcp sessions, I urge you to use a controller
> for that, like codel or fq_codel.
>
> Experiments show that limiting to two TSO packets in qdisc per tcp flow
> is enough to stop insane qdisc queueing, without impact on throughput
> for people wanting fast tcp sessions.
>
> That's not solving the more general problem of having 1000 competing
> flows.
Right, AQM (and probably some modifications to the congestion control)
is the more general solution.
I guess I'm just trying to justify in my mind that the case of a small
number of local connections is worth handling in this special way. It
seems like a generally reasonable thing, but it's definitely not a
general solution to minimizing latency. One thing worth noting: on a
system routing traffic, local connections may be at a disadvantage
relative to connections being forwarded, sharing the same interface
queue, if that queue is the bottleneck.
Architecturally, the inconsistency between a local queue and a queue
one hop away bothers me a bit, but it's something I can learn to live
with if it really does improve a common case significantly. ;-)
Thanks,
-John
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH v2] tcp: TCP Small Queues
2012-07-12 16:44 ` John Heffner
@ 2012-07-12 16:54 ` Jim Gettys
0 siblings, 0 replies; 44+ messages in thread
From: Jim Gettys @ 2012-07-12 16:54 UTC (permalink / raw)
To: John Heffner; +Cc: nanditad, netdev, codel, mattmathis, ncardwell, David Miller
On 07/12/2012 12:44 PM, John Heffner wrote:
> On Thu, Jul 12, 2012 at 9:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Thu, 2012-07-12 at 09:33 -0400, John Heffner wrote:
>>> One general question: why a per-connection limit? I haven't been
>>> following the bufferbloat conversation closely so I may have missed
>>> some of the conversation. But it seems that multiple connections will
>>> still cause longer queue times.
>> We already have a per-device limit, in qdisc.
>>
>> If you want to monitor several tcp sessions, I urge you to use a controller
>> for that, like codel or fq_codel.
>>
>> Experiments show that limiting to two TSO packets in qdisc per tcp flow
>> is enough to stop insane qdisc queueing, without impact on throughput
>> for people wanting fast tcp sessions.
>>
>> That's not solving the more general problem of having 1000 competing
>> flows.
> Right, AQM (and probably some modifications to the congestion control)
> is the more general solution.
>
> I guess I'm just trying to justify in my mind that the case of a small
> number of local connections is worth handling in this special way. It
> seems like a generally reasonable thing, but it's definitely not a
> general solution to minimizing latency. One thing worth noting: on a
> system routing traffic, local connections may be at a disadvantage
> relative to connections being forwarded, sharing the same interface
> queue, if that queue is the bottleneck.
Kathy simulated CoDel across a pretty wide range of RTTs seen at the
edge of the network, and things behave pretty well. She did say she
needed to think more and simulate the data center cases; haven't had a
chance to chat with her about that. Of course, you can do some
experiments pretty easily yourself, and we'd love to see whatever
results you get.
- Jim
>
> Architecturally, the inconsistency between a local queue and a queue
> one hop away bothers me a bit, but it's something I can learn to live
> with if it really does improve a common case significantly. ;-)
>
> Thanks,
> -John
> _______________________________________________
> Codel mailing list
> Codel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/codel
^ permalink raw reply [flat|nested] 44+ messages in thread