* [PATCH net-next 0/7] tcp: second round for EDT conversion
@ 2018-10-15 16:37 Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 1/7] tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh Eric Dumazet
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
  To: David S . Miller, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Gasper Zejn
  Cc: netdev, Eric Dumazet, Eric Dumazet

The first round of EDT patches left the TCP stack in a non-optimal state.

- High-speed flows suffered a loss of performance, addressed
  by the first patch of this series.

- The second patch brings pacing up to date with the current state of
  networking, since we can now reach ~100 Gbit on a single TCP flow.

- The third patch implements a mitigation for scheduling delays,
  like the one we added to sch_fq in the past.

- The fourth patch removes a special case in sch_fq for ACK packets.

- The fifth patch removes a serious performance cost of TCP internal
  pacing: we should set up the high-resolution timer only when it is
  really needed.

- The sixth patch fixes a typo in BBR.

- The last patch is a minor change in the cdg congestion control.

Neal Cardwell also has a patch series fixing BBR after
EDT adoption.

Eric Dumazet (6):
  tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
  net: extend sk_pacing_rate to unsigned long
  tcp: mitigate scheduling jitter in EDT pacing model
  net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
  tcp: optimize tcp internal pacing
  tcp: cdg: use tcp high resolution clock cache

Neal Cardwell (1):
  tcp_bbr: fix typo in bbr_pacing_margin_percent

 include/linux/tcp.h   |  1 +
 include/net/sock.h    |  4 +--
 net/core/filter.c     |  4 +--
 net/core/sock.c       |  9 +++---
 net/ipv4/tcp.c        | 10 +++---
 net/ipv4/tcp_bbr.c    | 10 +++---
 net/ipv4/tcp_cdg.c    |  2 +-
 net/ipv4/tcp_output.c | 72 ++++++++++++++++++++++++++-----------------
 net/ipv4/tcp_timer.c  |  2 +-
 net/sched/sch_fq.c    | 22 +++++++------
 10 files changed, 78 insertions(+), 58 deletions(-)

-- 
2.19.0.605.g01d371f741-goog

* [PATCH net-next 1/7] tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
  2018-10-15 16:37 [PATCH net-next 0/7] tcp: second round for EDT conversion Eric Dumazet
@ 2018-10-15 16:37 ` Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 2/7] net: extend sk_pacing_rate to unsigned long Eric Dumazet
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
  To: David S . Miller, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Gasper Zejn
  Cc: netdev, Eric Dumazet, Eric Dumazet

In the EDT design, I made the mistake of using tcp_wstamp_ns
both to store the last tcp_clock_ns() sample and to store the
pacing virtual timer.

This causes major regressions for high-speed flows.

Introduce tcp_clock_cache to store the last tcp_clock_ns() value.
This is needed because some arches have a slow high-resolution
kernel time service.

tcp_wstamp_ns is only updated when a packet is sent.

Note that we can remove tcp_mstamp in the future since
tcp_mstamp is essentially tcp_clock_cache/1000, so the
apparent socket size increase is temporary.

Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/linux/tcp.h   | 1 +
 net/ipv4/tcp_output.c | 9 ++++++---
 net/ipv4/tcp_timer.c  | 2 +-
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 848f5b25e178288ce870637b68a692ab88dc7d4d..8ed77bb4ed8636e9294389a011529fd9a667dce4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -249,6 +249,7 @@ struct tcp_sock {
 	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
 
 	u64	tcp_wstamp_ns;	/* departure time for next sent data packet */
+	u64	tcp_clock_cache; /* cache last tcp_clock_ns() (see tcp_mstamp_refresh()) */
 
 /* RTT measurement */
 	u64	tcp_mstamp;	/* most recent packet received/sent */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 059b67af28b137fb9566eaef370b270fc424bffb..f14df66a0c858dcb22b8924b9691c375eb5fcbc5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -52,9 +52,8 @@ void tcp_mstamp_refresh(struct tcp_sock *tp)
 {
 	u64 val = tcp_clock_ns();
 
-	/* departure time for next data packet */
-	if (val > tp->tcp_wstamp_ns)
-		tp->tcp_wstamp_ns = val;
+	if (val > tp->tcp_clock_cache)
+		tp->tcp_clock_cache = val;
 
 	val = div_u64(val, NSEC_PER_USEC);
 	if (val > tp->tcp_mstamp)
@@ -1050,6 +1049,10 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 		if (unlikely(!skb))
 			return -ENOBUFS;
 	}
+
+	/* TODO: might take care of jitter here */
+	tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
+
 	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
 
 	inet = inet_sk(sk);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 61023d50cd604d5e19464a32c33b65d29c75c81e..676020663ce80a79341ad1a05352742cc8dd5850 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -360,7 +360,7 @@ static void tcp_probe_timer(struct sock *sk)
 	 */
 	start_ts = tcp_skb_timestamp(skb);
 	if (!start_ts)
-		skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+		skb->skb_mstamp_ns = tp->tcp_clock_cache;
 	else if (icsk->icsk_user_timeout &&
 		 (s32)(tcp_time_stamp(tp) - start_ts) > icsk->icsk_user_timeout)
 		goto abort;
-- 
2.19.0.605.g01d371f741-goog

* [PATCH net-next 2/7] net: extend sk_pacing_rate to unsigned long
  2018-10-15 16:37 [PATCH net-next 0/7] tcp: second round for EDT conversion Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 1/7] tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh Eric Dumazet
@ 2018-10-15 16:37 ` Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 3/7] tcp: mitigate scheduling jitter in EDT pacing model Eric Dumazet
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
  To: David S . Miller, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Gasper Zejn
  Cc: netdev, Eric Dumazet, Eric Dumazet

sk_pacing_rate was introduced as a u32 field in 2013,
effectively limiting per-flow pacing to 34 Gbit.

We believe it is time to allow TCP to pace high-speed flows
on 64-bit hosts, as we can now reach 100 Gbit on one TCP flow.

This patch adds no cost for 32-bit kernels.

The tcpi_pacing_rate and tcpi_max_pacing_rate fields were already
exported as 64-bit, so the iproute2/ss command requires no changes.

Unfortunately the SO_MAX_PACING_RATE socket option will stay
32-bit, and we will need to add a new option to let applications
control high pacing rates.
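
As an illustration only (not part of this series), here is a minimal
user-space sketch of how an application caps its pacing rate today
through the 32-bit SO_MAX_PACING_RATE option; rates above ~4 GB/s
cannot be expressed until a wider option exists:

#include <sys/socket.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47	/* asm-generic value; arch-specific on some platforms */
#endif

/* Cap pacing to 1.25e9 bytes/sec (~10 Gbit/s). Values larger than
 * UINT32_MAX bytes/sec cannot be passed through this 32-bit option.
 */
static int cap_pacing_rate(int fd)
{
	unsigned int rate = 1250000000u;	/* bytes per second */

	return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
			  &rate, sizeof(rate));
}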

State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                 timer:(on,003ms,0) ino:91863 sk:2 <->
 skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
 ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
 rcvmss:536 advmss:1448
 cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
 segs_in:3916318 data_segs_out:177279175
 bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
 send 28045.5Mbps lastrcv:73333
 pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
 busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
 notsent:2085120 minrtt:0.013

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h    |  4 ++--
 net/core/filter.c     |  4 ++--
 net/core/sock.c       |  9 +++++----
 net/ipv4/tcp.c        | 10 +++++-----
 net/ipv4/tcp_bbr.c    |  6 +++---
 net/ipv4/tcp_output.c | 19 +++++++++++--------
 net/sched/sch_fq.c    | 20 ++++++++++++--------
 7 files changed, 40 insertions(+), 32 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 751549ac0a849144ab0382203ee5c877374523e2..cfaf261936c8787b3a65ce832fd9c871697d00f4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -422,8 +422,8 @@ struct sock {
 	struct timer_list	sk_timer;
 	__u32			sk_priority;
 	__u32			sk_mark;
-	u32			sk_pacing_rate; /* bytes per second */
-	u32			sk_max_pacing_rate;
+	unsigned long		sk_pacing_rate; /* bytes per second */
+	unsigned long		sk_max_pacing_rate;
 	struct page_frag	sk_frag;
 	netdev_features_t	sk_route_caps;
 	netdev_features_t	sk_route_nocaps;
diff --git a/net/core/filter.c b/net/core/filter.c
index 4bbc6567fcb818e91617bfa9a2fd7fbebbd129f8..80da21b097b8d05eb7b9fa92afa86762334ac0ae 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3927,8 +3927,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 			sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
 			sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
 			break;
-		case SO_MAX_PACING_RATE:
-			sk->sk_max_pacing_rate = val;
+		case SO_MAX_PACING_RATE: /* 32bit version */
+			sk->sk_max_pacing_rate = (val == ~0U) ? ~0UL : val;
 			sk->sk_pacing_rate = min(sk->sk_pacing_rate,
 						 sk->sk_max_pacing_rate);
 			break;
diff --git a/net/core/sock.c b/net/core/sock.c
index 7e8796a6a0892efbb7dfce67d12b8062b2d5daa9..fdf9fc7d3f9875f2718575078a0f263674c80b4f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -998,7 +998,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 			cmpxchg(&sk->sk_pacing_status,
 				SK_PACING_NONE,
 				SK_PACING_NEEDED);
-		sk->sk_max_pacing_rate = val;
+		sk->sk_max_pacing_rate = (val == ~0U) ? ~0UL : val;
 		sk->sk_pacing_rate = min(sk->sk_pacing_rate,
 					 sk->sk_max_pacing_rate);
 		break;
@@ -1336,7 +1336,8 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 #endif
 
 	case SO_MAX_PACING_RATE:
-		v.val = sk->sk_max_pacing_rate;
+		/* 32bit version */
+		v.val = min_t(unsigned long, sk->sk_max_pacing_rate, ~0U);
 		break;
 
 	case SO_INCOMING_CPU:
@@ -2810,8 +2811,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_ll_usec		=	sysctl_net_busy_read;
 #endif
 
-	sk->sk_max_pacing_rate = ~0U;
-	sk->sk_pacing_rate = ~0U;
+	sk->sk_max_pacing_rate = ~0UL;
+	sk->sk_pacing_rate = ~0UL;
 	sk->sk_pacing_shift = 10;
 	sk->sk_incoming_cpu = -1;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 43ef83b2330e6238a55c9843580a585d87708e0c..b8ba8fa34effac5138aea76b0d0fc2a9f1c05c4f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3111,10 +3111,10 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 {
 	const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */
 	const struct inet_connection_sock *icsk = inet_csk(sk);
+	unsigned long rate;
 	u32 now;
 	u64 rate64;
 	bool slow;
-	u32 rate;
 
 	memset(info, 0, sizeof(*info));
 	if (sk->sk_type != SOCK_STREAM)
@@ -3124,11 +3124,11 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 
 	/* Report meaningful fields for all TCP states, including listeners */
 	rate = READ_ONCE(sk->sk_pacing_rate);
-	rate64 = rate != ~0U ? rate : ~0ULL;
+	rate64 = (rate != ~0UL) ? rate : ~0ULL;
 	info->tcpi_pacing_rate = rate64;
 
 	rate = READ_ONCE(sk->sk_max_pacing_rate);
-	rate64 = rate != ~0U ? rate : ~0ULL;
+	rate64 = (rate != ~0UL) ? rate : ~0ULL;
 	info->tcpi_max_pacing_rate = rate64;
 
 	info->tcpi_reordering = tp->reordering;
@@ -3254,8 +3254,8 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
 	const struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *stats;
 	struct tcp_info info;
+	unsigned long rate;
 	u64 rate64;
-	u32 rate;
 
 	stats = alloc_skb(tcp_opt_stats_get_size(), GFP_ATOMIC);
 	if (!stats)
@@ -3274,7 +3274,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk)
 			  tp->total_retrans, TCP_NLA_PAD);
 
 	rate = READ_ONCE(sk->sk_pacing_rate);
-	rate64 = rate != ~0U ? rate : ~0ULL;
+	rate64 = (rate != ~0UL) ? rate : ~0ULL;
 	nla_put_u64_64bit(stats, TCP_NLA_PACING_RATE, rate64, TCP_NLA_PAD);
 
 	rate64 = tcp_compute_delivery_rate(tp);
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index a5786e3e2c16ce53a332f29c9a55b9a641eec791..33f4358615e6d63b5c98a30484f12ffae66334a2 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -219,7 +219,7 @@ static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
 }
 
 /* Convert a BBR bw and gain factor to a pacing rate in bytes per second. */
-static u32 bbr_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain)
+static unsigned long bbr_bw_to_pacing_rate(struct sock *sk, u32 bw, int gain)
 {
 	u64 rate = bw;
 
@@ -258,7 +258,7 @@ static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct bbr *bbr = inet_csk_ca(sk);
-	u32 rate = bbr_bw_to_pacing_rate(sk, bw, gain);
+	unsigned long rate = bbr_bw_to_pacing_rate(sk, bw, gain);
 
 	if (unlikely(!bbr->has_seen_rtt && tp->srtt_us))
 		bbr_init_pacing_rate_from_rtt(sk);
@@ -280,7 +280,7 @@ static u32 bbr_tso_segs_goal(struct sock *sk)
 	/* Sort of tcp_tso_autosize() but ignoring
 	 * driver provided sk_gso_max_size.
 	 */
-	bytes = min_t(u32, sk->sk_pacing_rate >> sk->sk_pacing_shift,
+	bytes = min_t(unsigned long, sk->sk_pacing_rate >> sk->sk_pacing_shift,
 		      GSO_MAX_SIZE - 1 - MAX_TCP_HEADER);
 	segs = max_t(u32, bytes / tp->mss_cache, bbr_min_tso_segs(sk));
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f14df66a0c858dcb22b8924b9691c375eb5fcbc5..f4aa4109334a043d02b17b18bef346d805dab501 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -991,14 +991,14 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
 
 	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
 	if (sk->sk_pacing_status != SK_PACING_NONE) {
-		u32 rate = sk->sk_pacing_rate;
+		unsigned long rate = sk->sk_pacing_rate;
 
 		/* Original sch_fq does not pace first 10 MSS
 		 * Note that tp->data_segs_out overflows after 2^32 packets,
 		 * this is a minor annoyance.
 		 */
-		if (rate != ~0U && rate && tp->data_segs_out >= 10) {
-			tp->tcp_wstamp_ns += div_u64((u64)skb->len * NSEC_PER_SEC, rate);
+		if (rate != ~0UL && rate && tp->data_segs_out >= 10) {
+			tp->tcp_wstamp_ns += div64_ul((u64)skb->len * NSEC_PER_SEC, rate);
 
 			tcp_internal_pacing(sk);
 		}
@@ -1704,8 +1704,9 @@ static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
 {
 	u32 bytes, segs;
 
-	bytes = min(sk->sk_pacing_rate >> sk->sk_pacing_shift,
-		    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+	bytes = min_t(unsigned long,
+		      sk->sk_pacing_rate >> sk->sk_pacing_shift,
+		      sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
 
 	/* Goal is to send at least one packet per ms,
 	 * not one big TSO packet every 100 ms.
@@ -2198,10 +2199,12 @@ static bool tcp_pacing_check(const struct sock *sk)
 static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
 				  unsigned int factor)
 {
-	unsigned int limit;
+	unsigned long limit;
 
-	limit = max(2 * skb->truesize, sk->sk_pacing_rate >> sk->sk_pacing_shift);
-	limit = min_t(u32, limit,
+	limit = max_t(unsigned long,
+		      2 * skb->truesize,
+		      sk->sk_pacing_rate >> sk->sk_pacing_shift);
+	limit = min_t(unsigned long, limit,
 		      sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes);
 	limit <<= factor;
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 338222a6c664b1825aaada4355e2fc0a01db9c73..3923d14095335df61c270f69e50cb7cbfde4c796 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -92,8 +92,8 @@ struct fq_sched_data {
 	u32		quantum;
 	u32		initial_quantum;
 	u32		flow_refill_delay;
-	u32		flow_max_rate;	/* optional max rate per flow */
 	u32		flow_plimit;	/* max packets per flow */
+	unsigned long	flow_max_rate;	/* optional max rate per flow */
 	u32		orphan_mask;	/* mask for orphaned skb */
 	u32		low_rate_threshold;
 	struct rb_root	*fq_root;
@@ -416,7 +416,8 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 	struct fq_flow_head *head;
 	struct sk_buff *skb;
 	struct fq_flow *f;
-	u32 rate, plen;
+	unsigned long rate;
+	u32 plen;
 
 	skb = fq_dequeue_head(sch, &q->internal);
 	if (skb)
@@ -485,11 +486,11 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 		if (f->credit > 0)
 			goto out;
 	}
-	if (rate != ~0U) {
+	if (rate != ~0UL) {
 		u64 len = (u64)plen * NSEC_PER_SEC;
 
 		if (likely(rate))
-			do_div(len, rate);
+			len = div64_ul(len, rate);
 		/* Since socket rate can change later,
 		 * clamp the delay to 1 second.
 		 * Really, providers of too big packets should be fixed !
@@ -701,9 +702,11 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
 		pr_warn_ratelimited("sch_fq: defrate %u ignored.\n",
 				    nla_get_u32(tb[TCA_FQ_FLOW_DEFAULT_RATE]));
 
-	if (tb[TCA_FQ_FLOW_MAX_RATE])
-		q->flow_max_rate = nla_get_u32(tb[TCA_FQ_FLOW_MAX_RATE]);
+	if (tb[TCA_FQ_FLOW_MAX_RATE]) {
+		u32 rate = nla_get_u32(tb[TCA_FQ_FLOW_MAX_RATE]);
 
+		q->flow_max_rate = (rate == ~0U) ? ~0UL : rate;
+	}
 	if (tb[TCA_FQ_LOW_RATE_THRESHOLD])
 		q->low_rate_threshold =
 			nla_get_u32(tb[TCA_FQ_LOW_RATE_THRESHOLD]);
@@ -766,7 +769,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt,
 	q->quantum		= 2 * psched_mtu(qdisc_dev(sch));
 	q->initial_quantum	= 10 * psched_mtu(qdisc_dev(sch));
 	q->flow_refill_delay	= msecs_to_jiffies(40);
-	q->flow_max_rate	= ~0U;
+	q->flow_max_rate	= ~0UL;
 	q->time_next_delayed_flow = ~0ULL;
 	q->rate_enable		= 1;
 	q->new_flows.first	= NULL;
@@ -802,7 +805,8 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_QUANTUM, q->quantum) ||
 	    nla_put_u32(skb, TCA_FQ_INITIAL_QUANTUM, q->initial_quantum) ||
 	    nla_put_u32(skb, TCA_FQ_RATE_ENABLE, q->rate_enable) ||
-	    nla_put_u32(skb, TCA_FQ_FLOW_MAX_RATE, q->flow_max_rate) ||
+	    nla_put_u32(skb, TCA_FQ_FLOW_MAX_RATE,
+			min_t(unsigned long, q->flow_max_rate, ~0U)) ||
 	    nla_put_u32(skb, TCA_FQ_FLOW_REFILL_DELAY,
 			jiffies_to_usecs(q->flow_refill_delay)) ||
 	    nla_put_u32(skb, TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
-- 
2.19.0.605.g01d371f741-goog

* [PATCH net-next 3/7] tcp: mitigate scheduling jitter in EDT pacing model
  2018-10-15 16:37 [PATCH net-next 0/7] tcp: second round for EDT conversion Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 1/7] tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 2/7] net: extend sk_pacing_rate to unsigned long Eric Dumazet
@ 2018-10-15 16:37 ` Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 4/7] net_sched: sch_fq: no longer use skb_is_tcp_pure_ack() Eric Dumazet
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
  To: David S . Miller, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Gasper Zejn
  Cc: netdev, Eric Dumazet, Eric Dumazet

In commit fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers
drifts") we added a mitigation for scheduling jitter in the fq packet
scheduler.

This patch does the same in the TCP stack, now that it uses the EDT model.

Note that this mitigation applies to both external (fq packet scheduler)
and internal TCP pacing.

This uses the same strategy as the above commit, allowing
a time credit of half the packet currently being sent.

Consider the following case:

An skb is sent after an idle period of 300 usec.
The air time (skb->len/pacing_rate) is 500 usec.
Instead of setting the pacing timer to now+500 usec,
it will use now+min(500/2, 300) -> now+250 usec.

This is like having a token bucket with a depth of half
an skb.

Tested:

tc qdisc replace dev eth0 root pfifo_fast

Before
netperf -P0 -H remote -- -q 1000000000 # 8000Mbit
540000 262144 262144    10.00    7710.43

After:
netperf -P0 -H remote -- -q 1000000000 # 8000 Mbit
540000 262144 262144    10.00    7999.75   # Much closer to 8000Mbit target

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f4aa4109334a043d02b17b18bef346d805dab501..5474c9854f252e50cdb1136435417873861d7618 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -985,7 +985,8 @@ static void tcp_internal_pacing(struct sock *sk)
 	sock_hold(sk);
 }
 
-static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
+static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
+				      u64 prior_wstamp)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -998,7 +999,12 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
 		 * this is a minor annoyance.
 		 */
 		if (rate != ~0UL && rate && tp->data_segs_out >= 10) {
-			tp->tcp_wstamp_ns += div64_ul((u64)skb->len * NSEC_PER_SEC, rate);
+			u64 len_ns = div64_ul((u64)skb->len * NSEC_PER_SEC, rate);
+			u64 credit = tp->tcp_wstamp_ns - prior_wstamp;
+
+			/* take into account OS jitter */
+			len_ns -= min_t(u64, len_ns / 2, credit);
+			tp->tcp_wstamp_ns += len_ns;
 
 			tcp_internal_pacing(sk);
 		}
@@ -1029,6 +1035,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	struct sk_buff *oskb = NULL;
 	struct tcp_md5sig_key *md5;
 	struct tcphdr *th;
+	u64 prior_wstamp;
 	int err;
 
 	BUG_ON(!skb || !tcp_skb_pcount(skb));
@@ -1050,7 +1057,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 			return -ENOBUFS;
 	}
 
-	/* TODO: might take care of jitter here */
+	prior_wstamp = tp->tcp_wstamp_ns;
 	tp->tcp_wstamp_ns = max(tp->tcp_wstamp_ns, tp->tcp_clock_cache);
 
 	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
@@ -1169,7 +1176,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 		err = net_xmit_eval(err);
 	}
 	if (!err && oskb) {
-		tcp_update_skb_after_send(sk, oskb);
+		tcp_update_skb_after_send(sk, oskb, prior_wstamp);
 		tcp_rate_skb_sent(sk, oskb);
 	}
 	return err;
@@ -2321,7 +2328,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp" is used as a start point for the retransmit timer */
-			tcp_update_skb_after_send(sk, skb);
+			tcp_update_skb_after_send(sk, skb, tp->tcp_wstamp_ns);
 			goto repair; /* Skip network transmission */
 		}
 
@@ -2896,7 +2903,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 		} tcp_skb_tsorted_restore(skb);
 
 		if (!err) {
-			tcp_update_skb_after_send(sk, skb);
+			tcp_update_skb_after_send(sk, skb, tp->tcp_wstamp_ns);
 			tcp_rate_skb_sent(sk, skb);
 		}
 	} else {
-- 
2.19.0.605.g01d371f741-goog

* [PATCH net-next 4/7] net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
  2018-10-15 16:37 [PATCH net-next 0/7] tcp: second round for EDT conversion Eric Dumazet
                   ` (2 preceding siblings ...)
  2018-10-15 16:37 ` [PATCH net-next 3/7] tcp: mitigate scheduling jitter in EDT pacing model Eric Dumazet
@ 2018-10-15 16:37 ` Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 5/7] tcp: optimize tcp internal pacing Eric Dumazet
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
  To: David S . Miller, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Gasper Zejn
  Cc: netdev, Eric Dumazet, Eric Dumazet

With the new EDT model, sch_fq no longer has to special-case
TCP pure ACKs, since their skb->tstamp allows them to be sent
without pacing delay.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_fq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 3923d14095335df61c270f69e50cb7cbfde4c796..4b1af706896c07e5a0fe6d542dfcd530acdcf8f5 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -444,7 +444,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 	}
 
 	skb = f->head;
-	if (skb && !skb_is_tcp_pure_ack(skb)) {
+	if (skb) {
 		u64 time_next_packet = max_t(u64, ktime_to_ns(skb->tstamp),
 					     f->time_next_packet);
 
-- 
2.19.0.605.g01d371f741-goog

* [PATCH net-next 5/7] tcp: optimize tcp internal pacing
  2018-10-15 16:37 [PATCH net-next 0/7] tcp: second round for EDT conversion Eric Dumazet
                   ` (3 preceding siblings ...)
  2018-10-15 16:37 ` [PATCH net-next 4/7] net_sched: sch_fq: no longer use skb_is_tcp_pure_ack() Eric Dumazet
@ 2018-10-15 16:37 ` Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 6/7] tcp_bbr: fix typo in bbr_pacing_margin_percent Eric Dumazet
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
  To: David S . Miller, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Gasper Zejn
  Cc: netdev, Eric Dumazet, Eric Dumazet

When TCP implements its own pacing (when no fq packet scheduler is used),
it arms a high-resolution timer after a packet is sent.

But in many cases (like TCP_RR kinds of workloads), this high-resolution
timer expires before the application attempts to write the following
packet. This overhead also happens when the flow is ACK clocked and
cwnd limited instead of being limited by the pacing rate.

This leads to extra overhead (a high number of IRQs).

Now that tcp_wstamp_ns is reserved for the pacing timer only
(after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"),
we can set up the timer only when a packet is about to be sent,
and only if tcp_wstamp_ns is in the future.

This leads to a ~10% performance increase in TCP_RR workloads.
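
For reference, a TCP_RR style workload can be exercised with something
like the following (hypothetical host name; the exact command behind
the ~10% figure is not part of this changelog):

netperf -H remote -t TCP_RR -l 30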

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5474c9854f252e50cdb1136435417873861d7618..d212e4cbc68902e873afb4a12b43b467ccd6069b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -975,16 +975,6 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void tcp_internal_pacing(struct sock *sk)
-{
-	if (!tcp_needs_internal_pacing(sk))
-		return;
-	hrtimer_start(&tcp_sk(sk)->pacing_timer,
-		      ns_to_ktime(tcp_sk(sk)->tcp_wstamp_ns),
-		      HRTIMER_MODE_ABS_PINNED_SOFT);
-	sock_hold(sk);
-}
-
 static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
 				      u64 prior_wstamp)
 {
@@ -1005,8 +995,6 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb,
 			/* take into account OS jitter */
 			len_ns -= min_t(u64, len_ns / 2, credit);
 			tp->tcp_wstamp_ns += len_ns;
-
-			tcp_internal_pacing(sk);
 		}
 	}
 	list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
@@ -2186,10 +2174,23 @@ static int tcp_mtu_probe(struct sock *sk)
 	return -1;
 }
 
-static bool tcp_pacing_check(const struct sock *sk)
+static bool tcp_pacing_check(struct sock *sk)
 {
-	return tcp_needs_internal_pacing(sk) &&
-	       hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!tcp_needs_internal_pacing(sk))
+		return false;
+
+	if (tp->tcp_wstamp_ns <= tp->tcp_clock_cache)
+		return false;
+
+	if (!hrtimer_is_queued(&tp->pacing_timer)) {
+		hrtimer_start(&tp->pacing_timer,
+			      ns_to_ktime(tp->tcp_wstamp_ns),
+			      HRTIMER_MODE_ABS_PINNED_SOFT);
+		sock_hold(sk);
+	}
+	return true;
 }
 
 /* TCP Small Queues :
-- 
2.19.0.605.g01d371f741-goog

* [PATCH net-next 6/7] tcp_bbr: fix typo in bbr_pacing_margin_percent
  2018-10-15 16:37 [PATCH net-next 0/7] tcp: second round for EDT conversion Eric Dumazet
                   ` (4 preceding siblings ...)
  2018-10-15 16:37 ` [PATCH net-next 5/7] tcp: optimize tcp internal pacing Eric Dumazet
@ 2018-10-15 16:37 ` Eric Dumazet
  2018-10-15 16:37 ` [PATCH net-next 7/7] tcp: cdg: use tcp high resolution clock cache Eric Dumazet
  2018-10-16  5:58 ` [PATCH net-next 0/7] tcp: second round for EDT conversion David Miller
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
  To: David S . Miller, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Gasper Zejn
  Cc: netdev, Eric Dumazet, Eric Dumazet

From: Neal Cardwell <ncardwell@google.com>

There was a typo in this parameter name: rename
bbr_pacing_marging_percent to bbr_pacing_margin_percent.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_bbr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 33f4358615e6d63b5c98a30484f12ffae66334a2..b88081285fd172444a844b6aec5d038c0f882594 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -129,7 +129,7 @@ static const u32 bbr_probe_rtt_mode_ms = 200;
 static const int bbr_min_tso_rate = 1200000;
 
 /* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. */
-static const int bbr_pacing_marging_percent = 1;
+static const int bbr_pacing_margin_percent = 1;
 
 /* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain
  * that will allow a smoothly increasing pacing rate that will double each RTT
@@ -214,7 +214,7 @@ static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
 	rate *= mss;
 	rate *= gain;
 	rate >>= BBR_SCALE;
-	rate *= USEC_PER_SEC / 100 * (100 - bbr_pacing_marging_percent);
+	rate *= USEC_PER_SEC / 100 * (100 - bbr_pacing_margin_percent);
 	return rate >> BW_SCALE;
 }
 
-- 
2.19.0.605.g01d371f741-goog

* [PATCH net-next 7/7] tcp: cdg: use tcp high resolution clock cache
  2018-10-15 16:37 [PATCH net-next 0/7] tcp: second round for EDT conversion Eric Dumazet
                   ` (5 preceding siblings ...)
  2018-10-15 16:37 ` [PATCH net-next 6/7] tcp_bbr: fix typo in bbr_pacing_margin_percent Eric Dumazet
@ 2018-10-15 16:37 ` Eric Dumazet
  2018-10-16  5:58 ` [PATCH net-next 0/7] tcp: second round for EDT conversion David Miller
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2018-10-15 16:37 UTC (permalink / raw)
  To: David S . Miller, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Gasper Zejn
  Cc: netdev, Eric Dumazet, Eric Dumazet

We store in the tcp socket a cache of the most recent high-resolution
clock sample; there is no need to call local_clock() again, since
this cache is good enough.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_cdg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index 06fbe102a425f28b43294925d8d13af4a13ec776..37eebd9103961be4731323cfb4d933b51954e802 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -146,7 +146,7 @@ static void tcp_cdg_hystart_update(struct sock *sk)
 		return;
 
 	if (hystart_detect & HYSTART_ACK_TRAIN) {
-		u32 now_us = div_u64(local_clock(), NSEC_PER_USEC);
+		u32 now_us = tp->tcp_mstamp;
 
 		if (ca->last_ack == 0 || !tcp_is_cwnd_limited(sk)) {
 			ca->last_ack = now_us;
-- 
2.19.0.605.g01d371f741-goog

* Re: [PATCH net-next 0/7] tcp: second round for EDT conversion
  2018-10-15 16:37 [PATCH net-next 0/7] tcp: second round for EDT conversion Eric Dumazet
                   ` (6 preceding siblings ...)
  2018-10-15 16:37 ` [PATCH net-next 7/7] tcp: cdg: use tcp high resolution clock cache Eric Dumazet
@ 2018-10-16  5:58 ` David Miller
  7 siblings, 0 replies; 9+ messages in thread
From: David Miller @ 2018-10-16  5:58 UTC (permalink / raw)
  To: edumazet; +Cc: ncardwell, ycheng, soheil, zelo.zejn, netdev, eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Mon, 15 Oct 2018 09:37:51 -0700

> The first round of EDT patches left the TCP stack in a non-optimal state.
> 
> - High-speed flows suffered a loss of performance, addressed
>   by the first patch of this series.
> 
> - The second patch brings pacing up to date with the current state of
>   networking, since we can now reach ~100 Gbit on a single TCP flow.
> 
> - The third patch implements a mitigation for scheduling delays,
>   like the one we added to sch_fq in the past.
> 
> - The fourth patch removes a special case in sch_fq for ACK packets.
> 
> - The fifth patch removes a serious performance cost of TCP internal
>   pacing: we should set up the high-resolution timer only when it is
>   really needed.
> 
> - The sixth patch fixes a typo in BBR.
> 
> - The last patch is a minor change in the cdg congestion control.
> 
> Neal Cardwell also has a patch series fixing BBR after
> EDT adoption.

Series applied, thanks Eric.
