netdev.vger.kernel.org archive mirror
* [PATCH net-next] tcp: do not pace pure ack packets
From: Eric Dumazet @ 2015-02-04  2:31 UTC
  To: David Miller; +Cc: netdev, Kenneth Klette Jonassen

From: Eric Dumazet <edumazet@google.com>

When we added pacing to TCP, we decided to let sch_fq take care
of actual pacing.

All TCP had to do was compute sk->pacing_rate using a simple formula:

sk->pacing_rate = 2 * cwnd * mss / rtt

It works well for senders (bulk flows), but not so well for receivers
or for RPC workloads:

cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
can end up pacing ACK packets, slowing down the sender.
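
To make the problem concrete (numbers below are illustrative, not
measured): with cwnd = 10, mss = 1460 bytes and rtt = 100 ms, the
formula gives

	sk->pacing_rate = 2 * 10 * 1460 / 0.1 s ~= 292 kB/s

A pure ACK is roughly 66 bytes on the wire, so fq would space ACKs
about 66 / 292000 ~= 0.23 ms apart, i.e. at most ~4,400 ACKs per
second. A sender running at 1 Gbit/s with 1500-byte packets, ACKed
every second packet, needs on the order of 40,000 ACKs per second,
so the paced ACK clock throttles it by almost an order of magnitude.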

Really, only the sender should pace, according to its own logic.

Instead of adding a new bit to the skb, or calling yet another flow
dissection, we tweak skb->truesize to a small value (2), and we
instruct sch_fq to use a new helper and not pace pure ACK packets.
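
Note the value 2 cannot collide with a real skb: truesize accounts
for the skb metadata itself, so even an empty skb is several hundred
bytes. As a reminder, the definition in include/linux/skbuff.h is
essentially (exact sizes depend on the kernel config):

	#define SKB_TRUESIZE(X) ((X) +						\
			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))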

Note this also helps TCP Small Queues, as ACK packets present in the
qdisc/NIC queues do not prevent sending a data packet (RPC workload).

This also helps reduce TX completion overhead: ACK packets can use the
regular sock_wfree() instead of tcp_wfree(), which is a bit more expensive.

This has no impact when packets are sent over the loopback interface,
as we do not coalesce ACK packets (where we would detect the
skb->truesize lie).

When netem (with a delay) is used, skb_orphan_partial() also sets
skb->truesize to 1.

This patch is a combination of two patches we used for about one year at
Google.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h     |   15 +++++++++++++++
 net/ipv4/tcp_output.c |   10 +++++++++-
 net/sched/sch_fq.c    |   10 ++++++++--
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b8fdc6bab3f3..637ee490ec81 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1713,4 +1713,19 @@ static inline struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
 	return dopt;
 }
 
+/* locally generated TCP pure ACKs have skb->truesize == 2
+ * (check tcp_send_ack() in net/ipv4/tcp_output.c )
+ * This is much faster than dissecting the packet to find out.
+ * (Think of GRE encapsulations, IPv4, IPv6, ...)
+ */
+static inline bool skb_is_tcp_pure_ack(const struct sk_buff *skb)
+{
+	return skb->truesize == 2;
+}
+
+static inline void skb_set_tcp_pure_ack(struct sk_buff *skb)
+{
+	skb->truesize = 2;
+}
+
 #endif	/* _TCP_H */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 20ab06b228ac..eae35c1ad0aa 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -948,7 +948,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 
 	skb_orphan(skb);
 	skb->sk = sk;
-	skb->destructor = tcp_wfree;
+	skb->destructor = skb_is_tcp_pure_ack(skb) ? sock_wfree : tcp_wfree;
 	skb_set_hash_from_sk(skb, sk);
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 
@@ -3265,6 +3265,14 @@ void tcp_send_ack(struct sock *sk)
 	skb_reserve(buff, MAX_TCP_HEADER);
 	tcp_init_nondata_skb(buff, tcp_acceptable_seq(sk), TCPHDR_ACK);
 
+	/* We do not want pure acks influencing TCP Small Queues or fq/pacing
+	 * too much.
+	 * SKB_TRUESIZE(max(1 .. 66, MAX_TCP_HEADER)) is unfortunately ~784
+	 * We also avoid tcp_wfree() overhead (cache line miss accessing
+	 * tp->tsq_flags) by using regular sock_wfree()
+	 */
+	skb_set_tcp_pure_ack(buff);
+
 	/* Send it off, this clears delayed acks for us. */
 	skb_mstamp_get(&buff->skb_mstamp);
 	tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 2a50f5c62070..69a3dbf55c60 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -52,6 +52,7 @@
 #include <net/pkt_sched.h>
 #include <net/sock.h>
 #include <net/tcp_states.h>
+#include <net/tcp.h>
 
 /*
  * Per flow structure, dynamically allocated
@@ -445,7 +446,9 @@ begin:
 		goto begin;
 	}
 
-	if (unlikely(f->head && now < f->time_next_packet)) {
+	skb = f->head;
+	if (unlikely(skb && now < f->time_next_packet &&
+		     !skb_is_tcp_pure_ack(skb))) {
 		head->first = f->next;
 		fq_flow_set_throttled(q, f);
 		goto begin;
@@ -464,12 +467,15 @@ begin:
 		goto begin;
 	}
 	prefetch(&skb->end);
-	f->time_next_packet = now;
 	f->credit -= qdisc_pkt_len(skb);
 
 	if (f->credit > 0 || !q->rate_enable)
 		goto out;
 
+	/* Do not pace locally generated ack packets */
+	if (skb_is_tcp_pure_ack(skb))
+		goto out;
+
 	rate = q->flow_max_rate;
 	if (skb->sk)
 		rate = min(skb->sk->sk_pacing_rate, rate);

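For readers not tracking sch_fq closely, the net effect of the fq
hunks on the dequeue path can be summarized as follows (a simplified
sketch, not the literal fq_dequeue() code):

	skb = f->head;
	/* A throttled flow is parked, unless its head packet is a
	 * locally generated pure ACK.
	 */
	if (skb && now < f->time_next_packet &&
	    !skb_is_tcp_pure_ack(skb)) {
		fq_flow_set_throttled(q, f);
		goto begin;
	}
	...
	/* Pure ACKs never consume pacing budget. */
	if (skb_is_tcp_pure_ack(skb))
		goto out;

	rate = q->flow_max_rate;
	if (skb->sk)
		rate = min(skb->sk->sk_pacing_rate, rate);
	if (rate != ~0U) {
		u64 len = (u64)qdisc_pkt_len(skb) * NSEC_PER_SEC;

		do_div(len, rate);
		f->time_next_packet = now + len;
	}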

* Re: [PATCH net-next] tcp: do not pace pure ack packets
From: David Miller @ 2015-02-05  4:36 UTC
  To: eric.dumazet; +Cc: netdev, kennetkl

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Feb 2015 18:31:53 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> When we added pacing to TCP, we decided to let sch_fq take care
> of actual pacing.

[...]

> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.


* Re: [PATCH net-next] tcp: do not pace pure ack packets
From: Or Gerlitz @ 2015-02-14 20:45 UTC
  To: Eric Dumazet
  Cc: David Miller, netdev, Kenneth Klette Jonassen, john fastabend

On Wed, Feb 4, 2015 at 4:31 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> When we added pacing to TCP, we decided to let sch_fq take care
> of actual pacing.

[...]

> This patch is a combination of two patches we used for about one year at Google.

Wow, so what is the black-box impact on systems running without the
fix? Pacing works, but unneeded slowdowns are introduced, right?

Sounds to me like something we really want to push into -stable,
doesn't it?

Do you think we need to let the patch rot for a while in 3.20-rc, or
can we do it right away?

Or


* Re: [PATCH net-next] tcp: do not pace pure ack packets
From: Eric Dumazet @ 2015-02-15 14:25 UTC
  To: Or Gerlitz; +Cc: David Miller, netdev, Kenneth Klette Jonassen, john fastabend

On Sat, 2015-02-14 at 22:45 +0200, Or Gerlitz wrote:

> Wow, so what is the black-box impact on systems running without the
> fix? Pacing works, but unneeded slowdowns are introduced, right?
> 
> Sounds to me like something we really want to push into -stable,
> doesn't it?
> 
> Do you think we need to let the patch rot for a while in 3.20-rc, or
> can we do it right away?

No impact at all. Black boxes are forwarding packets; they are not
pacing them at all.

fq/pacing paces locally generated traffic only.

This works the following way:

TCP sets sk->sk_pacing_rate = 2 * cwnd * mss / srtt

fq then paces packets that are attached to a (local) socket and whose
sk->sk_pacing_rate differs from the default value (~0U).

The only case where a middlebox could pace is when the optional
'maxrate xxx' parameter is given to the fq packet scheduler, and in
that case the intent is to enforce a max rate, ACK packets included.

I really do not think this patch is a stable candidate.

Thanks

