* [PATCH net-next 0/9] tcp: switch to Early Departure Time model
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

In the early days, pacing was implemented in sch_fq (FQ)
in a generic way:

- SO_MAX_PACING_RATE could be used by any socket.

- TCP would vary the effective pacing rate based on CWND*MSS/SRTT
  (a worked example follows this list).

- FQ would ensure delays between packets based on the current
  sk->sk_pacing_rate, but with some quantum-based artifacts
  (inflating RPC tail latencies).

- BBR then tweaked the pacing rate in its various phases
  (PROBE, DRAIN, ...)
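
For instance, plugging in the "Before" numbers shown further below:
20 (cwnd) * 1448 bytes (mss) * 8 / 2.195 ms (srtt) ~= 105.5 Mbit/s,
which is exactly the "send 105.5Mbps" value that ss reports.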

This worked reasonably well, but had the side effect that TCP RTT
samples would be inflated by the sojourn time of the packets in FQ.

Also note that when FQ is not used and TCP wants pacing, the
internal pacing fallback behaves very differently, since TCP
emits packets at the time they should be sent (with unreasonable
assumptions about scheduling costs).

Van Jacobson gave a talk at Netdev 0x12 in Montreal about letting
TCP (or applications, for UDP messages) decide the Earliest
Departure Time, instead of letting packet schedulers derive it
from the pacing rate.

https://www.netdevconf.org/0x12/session.html?evolving-from-afap-teaching-nics-about-time
https://www.files.netdevconf.org/d/46def75c2ef345809bbe/files/?p=/Evolving%20from%20AFAP%20%E2%80%93%20Teaching%20NICs%20about%20time.pdf

Recent additions in Linux provided SO_TXTIME and a new ETF qdisc
supporting the new skb->tstamp role.

This patch series converts TCP and FQ to the same model.

This might in the future allow us to relax tight TSQ limits
(if FQ is present in the output path), and thus lower the
number of calls to tcp_write_xmit(), thanks to batching.

This will be followed by an FQ change adding SO_TXTIME support,
so that QUIC servers can let pacing be done in FQ (or offloaded
if the network device permits).
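
As a sketch of what such a QUIC server could do once that lands
(helper names and the fallback defines are illustrative, error
handling elided; struct sock_txtime and SCM_TXTIME are the uapi
added for linux-4.19):

#include <linux/net_tstamp.h>	/* struct sock_txtime */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <time.h>

#ifndef SO_TXTIME
#define SO_TXTIME 61		/* from asm-generic/socket.h */
#define SCM_TXTIME SO_TXTIME
#endif
#ifndef CLOCK_TAI
#define CLOCK_TAI 11		/* from uapi/linux/time.h */
#endif

static void enable_txtime(int fd)
{
	struct sock_txtime cfg = { .clockid = CLOCK_TAI, .flags = 0 };

	setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));
}

static ssize_t send_at(int fd, const void *buf, size_t len, uint64_t edt_ns)
{
	char control[CMSG_SPACE(sizeof(edt_ns))] = { 0 };
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;
	cm->cmsg_len = CMSG_LEN(sizeof(edt_ns));
	memcpy(CMSG_DATA(cm), &edt_ns, sizeof(edt_ns));
	/* FQ (or ETF/the NIC) holds the datagram until edt_ns, CLOCK_TAI base */
	return sendmsg(fd, &msg, 0);
}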

For example, a TCP flow rated at 24Mbps now shows a more meaningful RTT

Before :

ESTAB  0  211408 10.246.7.151:41558   10.246.7.152:33723                
	 cubic wscale:8,8 rto:203 rtt:2.195/0.084 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:36897937
  segs_out:25488 segs_in:12454 data_segs_out:25486
  send 105.5Mbps lastsnd:1 lastrcv:12851 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 22.9Mbps
  busy:12851ms unacked:4 rcv_space:29200 notsent:205616 minrtt:0.026

After :

ESTAB  0  192584 10.246.7.151:61612   10.246.7.152:34375                
	 cubic wscale:8,8 rto:201 rtt:0.165/0.129 mss:1448 rcvmss:536 
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:170755401 
  segs_out:117931 segs_in:57651 data_segs_out:117929 
  send 1404.1Mbps lastsnd:1 lastrcv:56915 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 24.2Mbps
  busy:56915ms unacked:4 rcv_space:29200 notsent:186792 minrtt:0.054
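
For scale: at 24 Mbit/s, one full-MSS packet (1448 bytes of payload,
roughly 1514 bytes on the wire) is paced out every 1514 * 8 / 24e6
~= 505 usec, so with unacked:4 packets sitting in FQ the sojourn time
alone accounts for roughly 4 * 505 usec ~= 2 ms of the 2.195 ms rtt
measured before this series.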

A nice side effect of this patch series is a reduction of max/p99
latencies of RPC workloads, since the FQ quantum no longer adds
artifacts.

Eric Dumazet (9):
  tcp: switch tcp_clock_ns() to CLOCK_TAI base
  tcp: introduce tcp_skb_timestamp_us() helper
  net_sched: sch_fq: switch to CLOCK_TAI
  tcp: add tcp_wstamp_ns socket field
  tcp: provide earliest departure time in skb->tstamp
  tcp: switch internal pacing timer to CLOCK_TAI
  tcp: switch tcp and sch_fq to new earliest departure time model
  tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
  net_sched: sch_fq: remove dead code dealing with retransmits

 include/linux/skbuff.h  |  2 +-
 include/linux/tcp.h     |  2 +
 include/net/tcp.h       | 26 ++++++-------
 net/ipv4/syncookies.c   |  2 +-
 net/ipv4/tcp.c          |  2 +-
 net/ipv4/tcp_bbr.c      |  7 ++--
 net/ipv4/tcp_input.c    | 11 +++---
 net/ipv4/tcp_ipv4.c     |  2 +-
 net/ipv4/tcp_output.c   | 68 +++++++++++++++++++++------------
 net/ipv4/tcp_rate.c     | 17 +++++----
 net/ipv4/tcp_recovery.c |  5 ++-
 net/ipv4/tcp_timer.c    |  4 +-
 net/sched/sch_fq.c      | 85 +++++++++--------------------------------
 13 files changed, 104 insertions(+), 129 deletions(-)

-- 
2.19.0.444.g18242da7ef-goog


* [PATCH net-next 1/9] tcp: switch tcp_clock_ns() to CLOCK_TAI base
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

TCP pacing is either implemented in sch_fq or internally.
We have the goal of being able to offload pacing to the NICs.

TCP will soon provide per skb skb->tstamp as early departure time.

Like ETF in commit 25db26a91364 ("net/sched: Introduce the ETF Qdisc"),
we chose CLOCK_TAI as the clock base, so that TCP and pacers can share
a common clock and get better RTT samples (without pacing artificially
inflating these samples).
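
For reference, the chosen base can be inspected from userspace with
plain clock_gettime() (a standalone sketch; the fallback define covers
older libc headers):

#include <stdio.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* from uapi/linux/time.h */
#endif

int main(void)
{
	struct timespec tai, real;

	clock_gettime(CLOCK_TAI, &tai);
	clock_gettime(CLOCK_REALTIME, &real);
	/* The delta is the kernel TAI offset: 37s where NTP has
	 * programmed it, 0 on hosts where it was never set.
	 */
	printf("TAI - REALTIME = %ld sec\n",
	       (long)(tai.tv_sec - real.tv_sec));
	return 0;
}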

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 770917d0caa71896b6adac06a62b150bfdc72836..c6f0bc1dc6782a1976c06932e846b3f6d708ba9f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -732,7 +732,7 @@ void tcp_send_window_probe(struct sock *sk);
 
 static inline u64 tcp_clock_ns(void)
 {
-	return local_clock();
+	return ktime_get_tai_ns();
 }
 
 static inline u64 tcp_clock_us(void)
-- 
2.19.0.444.g18242da7ef-goog


* [PATCH net-next 2/9] tcp: introduce tcp_skb_timestamp_us() helper
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

There are a few places where TCP reads skb->skb_mstamp expecting
a value in usec units.

skb->tstamp (aka skb->skb_mstamp) will soon store a CLOCK_TAI nsec value.

Add tcp_skb_timestamp_us() to provide the proper conversion when needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h       |  8 +++++++-
 net/ipv4/tcp_input.c    | 11 ++++++-----
 net/ipv4/tcp_ipv4.c     |  2 +-
 net/ipv4/tcp_output.c   |  2 +-
 net/ipv4/tcp_rate.c     | 17 +++++++++--------
 net/ipv4/tcp_recovery.c |  5 +++--
 6 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c6f0bc1dc6782a1976c06932e846b3f6d708ba9f..0ca5ea10dc06f3552597c94de31dcd0c8e0ecc32 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -774,6 +774,12 @@ static inline u32 tcp_skb_timestamp(const struct sk_buff *skb)
 	return div_u64(skb->skb_mstamp, USEC_PER_SEC / TCP_TS_HZ);
 }
 
+/* provide the departure time in us unit */
+static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
+{
+	return skb->skb_mstamp;
+}
+
 
 #define tcp_flag_byte(th) (((u_int8_t *)th)[13])
 
@@ -1940,7 +1946,7 @@ static inline s64 tcp_rto_delta_us(const struct sock *sk)
 {
 	const struct sk_buff *skb = tcp_rtx_queue_head(sk);
 	u32 rto = inet_csk(sk)->icsk_rto;
-	u64 rto_time_stamp_us = skb->skb_mstamp + jiffies_to_usecs(rto);
+	u64 rto_time_stamp_us = tcp_skb_timestamp_us(skb) + jiffies_to_usecs(rto);
 
 	return rto_time_stamp_us - tcp_sk(sk)->tcp_mstamp;
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d9034073138ce49c423f7a22143bac415415bc09..d703a0b3b6a2f0efd8607354c1c74ac1a8e78d4f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1305,7 +1305,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *prev,
 	 */
 	tcp_sacktag_one(sk, state, TCP_SKB_CB(skb)->sacked,
 			start_seq, end_seq, dup_sack, pcount,
-			skb->skb_mstamp);
+			tcp_skb_timestamp_us(skb));
 	tcp_rate_skb_delivered(sk, skb, state->rate);
 
 	if (skb == tp->lost_skb_hint)
@@ -1580,7 +1580,7 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
 						TCP_SKB_CB(skb)->end_seq,
 						dup_sack,
 						tcp_skb_pcount(skb),
-						skb->skb_mstamp);
+						tcp_skb_timestamp_us(skb));
 			tcp_rate_skb_delivered(sk, skb, state->rate);
 			if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
 				list_del_init(&skb->tcp_tsorted_anchor);
@@ -3103,7 +3103,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				tp->retrans_out -= acked_pcount;
 			flag |= FLAG_RETRANS_DATA_ACKED;
 		} else if (!(sacked & TCPCB_SACKED_ACKED)) {
-			last_ackt = skb->skb_mstamp;
+			last_ackt = tcp_skb_timestamp_us(skb);
 			WARN_ON_ONCE(last_ackt == 0);
 			if (!first_ackt)
 				first_ackt = last_ackt;
@@ -3121,7 +3121,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 			tp->delivered += acked_pcount;
 			if (!tcp_skb_spurious_retrans(tp, skb))
 				tcp_rack_advance(tp, sacked, scb->end_seq,
-						 skb->skb_mstamp);
+						 tcp_skb_timestamp_us(skb));
 		}
 		if (sacked & TCPCB_LOST)
 			tp->lost_out -= acked_pcount;
@@ -3215,7 +3215,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 			tp->lost_cnt_hint -= min(tp->lost_cnt_hint, delta);
 		}
 	} else if (skb && rtt_update && sack_rtt_us >= 0 &&
-		   sack_rtt_us > tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp)) {
+		   sack_rtt_us > tcp_stamp_us_delta(tp->tcp_mstamp,
+						    tcp_skb_timestamp_us(skb))) {
 		/* Do not re-arm RTO if the sack RTT is measured from data sent
 		 * after when the head was last (re)transmitted. Otherwise the
 		 * timeout may continue to extend in loss recovery.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 09547ef9c4c644fba0f7887afad0a6393e3dd03a..1f2496e8620dd78cecefbb0dceb8570fc92661e5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -544,7 +544,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 		BUG_ON(!skb);
 
 		tcp_mstamp_refresh(tp);
-		delta_us = (u32)(tp->tcp_mstamp - skb->skb_mstamp);
+		delta_us = (u32)(tp->tcp_mstamp - tcp_skb_timestamp_us(skb));
 		remaining = icsk->icsk_rto -
 			    usecs_to_jiffies(delta_us);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 597dbd749f05dc72e53962a5821861fc218774d6..b95aa72d88233dd6376a70ccd7cbb13744444889 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1966,7 +1966,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
 	head = tcp_rtx_queue_head(sk);
 	if (!head)
 		goto send_now;
-	age = tcp_stamp_us_delta(tp->tcp_mstamp, head->skb_mstamp);
+	age = tcp_stamp_us_delta(tp->tcp_mstamp, tcp_skb_timestamp_us(head));
 	/* If next ACK is likely to come too late (half srtt), do not defer */
 	if (age < (tp->srtt_us >> 4))
 		goto send_now;
diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
index 4dff40dad4dc5ccc372f5108b0d6ba38497ab81f..baed2186c7c623737c739cbc1e35a3c772a8b15a 100644
--- a/net/ipv4/tcp_rate.c
+++ b/net/ipv4/tcp_rate.c
@@ -55,8 +55,10 @@ void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb)
 	  * bandwidth estimate.
 	  */
 	if (!tp->packets_out) {
-		tp->first_tx_mstamp  = skb->skb_mstamp;
-		tp->delivered_mstamp = skb->skb_mstamp;
+		u64 tstamp_us = tcp_skb_timestamp_us(skb);
+
+		tp->first_tx_mstamp  = tstamp_us;
+		tp->delivered_mstamp = tstamp_us;
 	}
 
 	TCP_SKB_CB(skb)->tx.first_tx_mstamp	= tp->first_tx_mstamp;
@@ -88,13 +90,12 @@ void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
 		rs->is_app_limited   = scb->tx.is_app_limited;
 		rs->is_retrans	     = scb->sacked & TCPCB_RETRANS;
 
-		/* Find the duration of the "send phase" of this window: */
-		rs->interval_us      = tcp_stamp_us_delta(
-						skb->skb_mstamp,
-						scb->tx.first_tx_mstamp);
-
 		/* Record send time of most recently ACKed packet: */
-		tp->first_tx_mstamp  = skb->skb_mstamp;
+		tp->first_tx_mstamp  = tcp_skb_timestamp_us(skb);
+		/* Find the duration of the "send phase" of this window: */
+		rs->interval_us = tcp_stamp_us_delta(tp->first_tx_mstamp,
+						     scb->tx.first_tx_mstamp);
+
 	}
 	/* Mark off the skb delivered once it's sacked to avoid being
 	 * used again when it's cumulatively acked. For acked packets
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index c81aadff769b2c3eee02e6de3a5545c27e8cbc38..fdb715bdd2d11dd33a1474d02892546bbac66f41 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -50,7 +50,7 @@ static u32 tcp_rack_reo_wnd(const struct sock *sk)
 s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb, u32 reo_wnd)
 {
 	return tp->rack.rtt_us + reo_wnd -
-	       tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
+	       tcp_stamp_us_delta(tp->tcp_mstamp, tcp_skb_timestamp_us(skb));
 }
 
 /* RACK loss detection (IETF draft draft-ietf-tcpm-rack-01):
@@ -91,7 +91,8 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 		    !(scb->sacked & TCPCB_SACKED_RETRANS))
 			continue;
 
-		if (!tcp_rack_sent_after(tp->rack.mstamp, skb->skb_mstamp,
+		if (!tcp_rack_sent_after(tp->rack.mstamp,
+					 tcp_skb_timestamp_us(skb),
 					 tp->rack.end_seq, scb->end_seq))
 			break;
 
-- 
2.19.0.444.g18242da7ef-goog


* [PATCH net-next 3/9] net_sched: sch_fq: switch to CLOCK_TAI
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

TCP will soon provide a per-skb earliest departure time in skb->tstamp,
so that sch_fq does not have to derive the departure time from the
socket's sk_pacing_rate.

In linux-4.19 we chose CLOCK_TAI as the clock base for transports,
qdiscs, and NIC offloads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_fq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index b27ba36a269cc72cd716da19dcfa27018ec01490..d5185c44e9a5f521ca99243b6e9b53ec05b84d49 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -460,7 +460,7 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 {
 	struct fq_sched_data *q = qdisc_priv(sch);
-	u64 now = ktime_get_ns();
+	u64 now = ktime_get_tai_ns();
 	struct fq_flow_head *head;
 	struct sk_buff *skb;
 	struct fq_flow *f;
@@ -823,7 +823,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt,
 	q->fq_trees_log		= ilog2(1024);
 	q->orphan_mask		= 1024 - 1;
 	q->low_rate_threshold	= 550000 / 8;
-	qdisc_watchdog_init(&q->watchdog, sch);
+	qdisc_watchdog_init_clockid(&q->watchdog, sch, CLOCK_TAI);
 
 	if (opt)
 		err = fq_change(sch, opt, extack);
@@ -878,7 +878,7 @@ static int fq_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 	st.flows_plimit		  = q->stat_flows_plimit;
 	st.pkts_too_long	  = q->stat_pkts_too_long;
 	st.allocation_errors	  = q->stat_allocation_errors;
-	st.time_next_delayed_flow = q->time_next_delayed_flow - ktime_get_ns();
+	st.time_next_delayed_flow = q->time_next_delayed_flow - ktime_get_tai_ns();
 	st.flows		  = q->flows;
 	st.inactive_flows	  = q->inactive_flows;
 	st.throttled_flows	  = q->throttled_flows;
-- 
2.19.0.444.g18242da7ef-goog


* [PATCH net-next 4/9] tcp: add tcp_wstamp_ns socket field
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

TCP will soon provide the earliest departure time on TX skbs.
It needs to track this in a new variable.

tcp_mstamp_refresh() needs to update this variable, and has
become too big to stay inline.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/tcp.h   |  2 ++
 include/net/tcp.h     | 12 +-----------
 net/ipv4/tcp_output.c | 16 ++++++++++++++++
 3 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 263e37271afda18f3d61c99272d34da15dfdca29..848f5b25e178288ce870637b68a692ab88dc7d4d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -248,6 +248,8 @@ struct tcp_sock {
 		syn_smc:1;	/* SYN includes SMC */
 	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
 
+	u64	tcp_wstamp_ns;	/* departure time for next sent data packet */
+
 /* RTT measurement */
 	u64	tcp_mstamp;	/* most recent packet received/sent */
 	u32	srtt_us;	/* smoothed round trip time << 3 in usecs */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0ca5ea10dc06f3552597c94de31dcd0c8e0ecc32..370198fdc65d3e863104665e20faefd0e5a09b92 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -752,17 +752,7 @@ static inline u32 tcp_time_stamp_raw(void)
 	return div_u64(tcp_clock_ns(), NSEC_PER_SEC / TCP_TS_HZ);
 }
 
-
-/* Refresh 1us clock of a TCP socket,
- * ensuring monotically increasing values.
- */
-static inline void tcp_mstamp_refresh(struct tcp_sock *tp)
-{
-	u64 val = tcp_clock_us();
-
-	if (val > tp->tcp_mstamp)
-		tp->tcp_mstamp = val;
-}
+void tcp_mstamp_refresh(struct tcp_sock *tp);
 
 static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0)
 {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b95aa72d88233dd6376a70ccd7cbb13744444889..5a8105e84f7c1a876bbd15e8050c2574c1fbe162 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -45,6 +45,22 @@
 
 #include <trace/events/tcp.h>
 
+/* Refresh clocks of a TCP socket,
+ * ensuring monotonically increasing values.
+ */
+void tcp_mstamp_refresh(struct tcp_sock *tp)
+{
+	u64 val = tcp_clock_ns();
+
+	/* departure time for next data packet */
+	if (val > tp->tcp_wstamp_ns)
+		tp->tcp_wstamp_ns = val;
+
+	val = div_u64(val, NSEC_PER_USEC);
+	if (val > tp->tcp_mstamp)
+		tp->tcp_mstamp = val;
+}
+
 static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			   int push_one, gfp_t gfp);
 
-- 
2.19.0.444.g18242da7ef-goog


* [PATCH net-next 5/9] tcp: provide earliest departure time in skb->tstamp
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

Switch the internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
from usec units to nsec units.

Do not clear skb->tstamp before entering the IP stacks on TX,
so that qdiscs or devices can implement pacing based on the
earliest departure time instead of the socket's sk->sk_pacing_rate.

Packets are fed with tcp_wstamp_ns, and a following patch
will update tcp_wstamp_ns when both TCP and sch_fq switch to
the earliest departure time mechanism.
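
On the consumer side, a pacer then only has to compare this stamp with
its own clock; a simplified paraphrase (the helper name is illustrative)
of what sch_fq will do later in this series:

static bool pacer_packet_ready(const struct sk_buff *skb, u64 now_tai_ns)
{
	u64 edt = ktime_to_ns(skb->tstamp);	/* aka skb->skb_mstamp_ns */

	/* A zero stamp means no earliest departure time was set:
	 * the packet may be sent as soon as flow credit allows.
	 */
	return !edt || edt <= now_tai_ns;
}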

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |  2 +-
 include/net/tcp.h      |  6 +++---
 net/ipv4/syncookies.c  |  2 +-
 net/ipv4/tcp.c         |  2 +-
 net/ipv4/tcp_output.c  | 13 ++++++-------
 net/ipv4/tcp_timer.c   |  2 +-
 6 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e3a53ca4a9b51b84b7d75ce87485d4d9109a4cf2..86f337e9a81d5eff360335a19ab09f26ae48fca8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -689,7 +689,7 @@ struct sk_buff {
 
 	union {
 		ktime_t		tstamp;
-		u64		skb_mstamp;
+		u64		skb_mstamp_ns; /* earliest departure time */
 	};
 	/*
 	 * This is the control buffer. It is free to use for every
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 370198fdc65d3e863104665e20faefd0e5a09b92..ff15d8e0d525715b17671e64f6abdead9df0a8f3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -761,13 +761,13 @@ static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0)
 
 static inline u32 tcp_skb_timestamp(const struct sk_buff *skb)
 {
-	return div_u64(skb->skb_mstamp, USEC_PER_SEC / TCP_TS_HZ);
+	return div_u64(skb->skb_mstamp_ns, NSEC_PER_SEC / TCP_TS_HZ);
 }
 
 /* provide the departure time in us unit */
 static inline u64 tcp_skb_timestamp_us(const struct sk_buff *skb)
 {
-	return skb->skb_mstamp;
+	return div_u64(skb->skb_mstamp_ns, NSEC_PER_USEC);
 }
 
 
@@ -813,7 +813,7 @@ struct tcp_skb_cb {
 #define TCPCB_SACKED_RETRANS	0x02	/* SKB retransmitted		*/
 #define TCPCB_LOST		0x04	/* SKB is lost			*/
 #define TCPCB_TAGBITS		0x07	/* All tag bits			*/
-#define TCPCB_REPAIRED		0x10	/* SKB repaired (no skb_mstamp)	*/
+#define TCPCB_REPAIRED		0x10	/* SKB repaired (no skb_mstamp_ns)	*/
 #define TCPCB_EVER_RETRANS	0x80	/* Ever retransmitted frame	*/
 #define TCPCB_RETRANS		(TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS| \
 				TCPCB_REPAIRED)
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index c3387dfd725bf99bcddefb9fb4f1dc98f5dd7f23..606f868d9f3fde1c3140aa7eecde87d2ec32b5f2 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -88,7 +88,7 @@ u64 cookie_init_timestamp(struct request_sock *req)
 		ts <<= TSBITS;
 		ts |= options;
 	}
-	return (u64)ts * (USEC_PER_SEC / TCP_TS_HZ);
+	return (u64)ts * (NSEC_PER_SEC / TCP_TS_HZ);
 }
 
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 67670fac7c8de510df351fe3a835b554cc4759a9..69c236943f56bd0749e5efb18de97e69898f1bde 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1295,7 +1295,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			copy = size_goal;
 
 			/* All packets are restored as if they have
-			 * already been sent. skb_mstamp isn't set to
+			 * already been sent. skb_mstamp_ns isn't set to
 			 * avoid wrong rtt estimation.
 			 */
 			if (tp->repair)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5a8105e84f7c1a876bbd15e8050c2574c1fbe162..957f7a0e21c06cae9f0d3bed57017bbc0a36c880 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1014,7 +1014,7 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
 
 static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
 {
-	skb->skb_mstamp = tp->tcp_mstamp;
+	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
 	list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 }
 
@@ -1061,7 +1061,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 		if (unlikely(!skb))
 			return -ENOBUFS;
 	}
-	skb->skb_mstamp = tp->tcp_mstamp;
+	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
 
 	inet = inet_sk(sk);
 	tcb = TCP_SKB_CB(skb);
@@ -1165,8 +1165,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 	skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb);
 	skb_shinfo(skb)->gso_size = tcp_skb_mss(skb);
 
-	/* Our usage of tstamp should remain private */
-	skb->tstamp = 0;
+	/* Leave earliest departure time in skb->tstamp (skb->skb_mstamp_ns) */
 
 	/* Cleanup our debris for IP stacks */
 	memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
@@ -3221,10 +3220,10 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	memset(&opts, 0, sizeof(opts));
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(req->cookie_ts))
-		skb->skb_mstamp = cookie_init_timestamp(req);
+		skb->skb_mstamp_ns = cookie_init_timestamp(req);
 	else
 #endif
-		skb->skb_mstamp = tcp_clock_us();
+		skb->skb_mstamp_ns = tcp_clock_ns();
 
 #ifdef CONFIG_TCP_MD5SIG
 	rcu_read_lock();
@@ -3440,7 +3439,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 
 	err = tcp_transmit_skb(sk, syn_data, 1, sk->sk_allocation);
 
-	syn->skb_mstamp = syn_data->skb_mstamp;
+	syn->skb_mstamp_ns = syn_data->skb_mstamp_ns;
 
 	/* Now full SYN+DATA was cloned and sent (or not),
 	 * remove the SYN from the original skb (syn_data)
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 7fdf222a0bdfe9775970082f6b5dcdcc82b2ae1a..61023d50cd604d5e19464a32c33b65d29c75c81e 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -360,7 +360,7 @@ static void tcp_probe_timer(struct sock *sk)
 	 */
 	start_ts = tcp_skb_timestamp(skb);
 	if (!start_ts)
-		skb->skb_mstamp = tp->tcp_mstamp;
+		skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
 	else if (icsk->icsk_user_timeout &&
 		 (s32)(tcp_time_stamp(tp) - start_ts) > icsk->icsk_user_timeout)
 		goto abort;
-- 
2.19.0.444.g18242da7ef-goog


* [PATCH net-next 6/9] tcp: switch internal pacing timer to CLOCK_TAI
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

The next patch will use tcp_wstamp_ns to feed the internal
TCP pacing timer, so switch to CLOCK_TAI to share the same base.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 2 +-
 net/ipv4/tcp_timer.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 957f7a0e21c06cae9f0d3bed57017bbc0a36c880..a87068fa9b1aa582310df6371966fd2d6461edb8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1007,7 +1007,7 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
 	len_ns = (u64)skb->len * NSEC_PER_SEC;
 	do_div(len_ns, rate);
 	hrtimer_start(&tcp_sk(sk)->pacing_timer,
-		      ktime_add_ns(ktime_get(), len_ns),
+		      ktime_add_ns(ktime_get_tai_ns(), len_ns),
 		      HRTIMER_MODE_ABS_PINNED_SOFT);
 	sock_hold(sk);
 }
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 61023d50cd604d5e19464a32c33b65d29c75c81e..4f661e178da8465203266ff4dfa3e8743e60ff82 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -758,7 +758,7 @@ void tcp_init_xmit_timers(struct sock *sk)
 {
 	inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
 				  &tcp_keepalive_timer);
-	hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
+	hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_TAI,
 		     HRTIMER_MODE_ABS_PINNED_SOFT);
 	tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
 
-- 
2.19.0.444.g18242da7ef-goog


* [PATCH net-next 7/9] tcp: switch tcp and sch_fq to new earliest departure time model
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.

Thanks to this model, TCP can get more accurate RTT samples,
since pacing no longer inflates them.

This also removes some delays caused by the FQ quantum
mechanism that inflated max/P99 latencies.

We might also relax the tight TCP Small Queues limits in the
future, since this new model allows TCP to build bigger batches:
sch_fq (or a device with earliest departure time offload) ensures
these packets will be delivered on time.

Note that other protocols are not converted (they probably never
will be), so sch_fq still supports SO_MAX_PACING_RATE.
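
For reference, such a socket caps its pacing rate with a single
setsockopt, the value being in bytes per second:

	unsigned int rate = 3000000;	/* 3 MB/s, i.e. 24 Mbit/s */

	setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &rate, sizeof(rate));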

Tested:

Test showing the FQ pacing quantum artifact for low-rate flows,
adding unexpected throttles for RPC flows and inflating max and
P99 latencies.

The parameters chosen here show what typically happens when a TCP
flow has a reduced pacing rate (this can be caused by a reduced cwnd
after a few losses, and/or an RTT above a few ms).

MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
Before :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
 Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02

After :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_bbr.c    |  7 ++++---
 net/ipv4/tcp_output.c | 22 ++++++++++++++++++----
 net/sched/sch_fq.c    | 21 +++++++++++----------
 3 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 02ff2dde96094cf33b662a20994424a7adea509e..a5786e3e2c16ce53a332f29c9a55b9a641eec791 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -128,6 +128,9 @@ static const u32 bbr_probe_rtt_mode_ms = 200;
 /* Skip TSO below the following bandwidth (bits/sec): */
 static const int bbr_min_tso_rate = 1200000;
 
+/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. */
+static const int bbr_pacing_margin_percent = 1;
+
 /* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain
  * that will allow a smoothly increasing pacing rate that will double each RTT
  * and send the same number of packets per RTT that an un-paced, slow-starting
@@ -208,12 +211,10 @@ static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
 {
 	unsigned int mss = tcp_sk(sk)->mss_cache;
 
-	if (!tcp_needs_internal_pacing(sk))
-		mss = tcp_mss_to_mtu(sk, mss);
 	rate *= mss;
 	rate *= gain;
 	rate >>= BBR_SCALE;
-	rate *= USEC_PER_SEC;
+	rate *= USEC_PER_SEC / 100 * (100 - bbr_pacing_margin_percent);
 	return rate >> BW_SCALE;
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a87068fa9b1aa582310df6371966fd2d6461edb8..2adb719e97b89021becfa1243d33c87df6cdf8a5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1012,9 +1012,23 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
 	sock_hold(sk);
 }
 
-static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
+static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
 {
+	struct tcp_sock *tp = tcp_sk(sk);
+
 	skb->skb_mstamp_ns = tp->tcp_wstamp_ns;
+	if (sk->sk_pacing_status != SK_PACING_NONE) {
+		u32 rate = sk->sk_pacing_rate;
+
+		/* Original sch_fq does not pace first 10 MSS
+		 * Note that tp->data_segs_out overflows after 2^32 packets,
+		 * this is a minor annoyance.
+		 */
+		if (rate != ~0U && rate && tp->data_segs_out >= 10) {
+			tp->tcp_wstamp_ns += div_u64((u64)skb->len * NSEC_PER_SEC, rate);
+			/* TODO: update internal pacing here */
+		}
+	}
 	list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
 }
 
@@ -1178,7 +1192,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 		err = net_xmit_eval(err);
 	}
 	if (!err && oskb) {
-		tcp_update_skb_after_send(tp, oskb);
+		tcp_update_skb_after_send(sk, oskb);
 		tcp_rate_skb_sent(sk, oskb);
 	}
 	return err;
@@ -2327,7 +2341,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp" is used as a start point for the retransmit timer */
-			tcp_update_skb_after_send(tp, skb);
+			tcp_update_skb_after_send(sk, skb);
 			goto repair; /* Skip network transmission */
 		}
 
@@ -2902,7 +2916,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 		} tcp_skb_tsorted_restore(skb);
 
 		if (!err) {
-			tcp_update_skb_after_send(tp, skb);
+			tcp_update_skb_after_send(sk, skb);
 			tcp_rate_skb_sent(sk, skb);
 		}
 	} else {
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index d5185c44e9a5f521ca99243b6e9b53ec05b84d49..77692ad6741de14025bd848741604e775742430b 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -491,11 +491,16 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 	}
 
 	skb = f->head;
-	if (unlikely(skb && now < f->time_next_packet &&
-		     !skb_is_tcp_pure_ack(skb))) {
-		head->first = f->next;
-		fq_flow_set_throttled(q, f);
-		goto begin;
+	if (skb && !skb_is_tcp_pure_ack(skb)) {
+		u64 time_next_packet = max_t(u64, ktime_to_ns(skb->tstamp),
+					     f->time_next_packet);
+
+		if (now < time_next_packet) {
+			head->first = f->next;
+			f->time_next_packet = time_next_packet;
+			fq_flow_set_throttled(q, f);
+			goto begin;
+		}
 	}
 
 	skb = fq_dequeue_head(sch, f);
@@ -513,11 +518,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 	prefetch(&skb->end);
 	f->credit -= qdisc_pkt_len(skb);
 
-	if (!q->rate_enable)
-		goto out;
-
-	/* Do not pace locally generated ack packets */
-	if (skb_is_tcp_pure_ack(skb))
+	if (ktime_to_ns(skb->tstamp) || !q->rate_enable)
 		goto out;
 
 	rate = q->flow_max_rate;
-- 
2.19.0.444.g18242da7ef-goog


* [PATCH net-next 8/9] tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

Now that TCP keeps track of tcp_wstamp_ns, the earliest departure
time of the next packet, we can remove duplicate code from
tcp_internal_pacing().

This removes one ktime_get_tai_ns() call and a divide.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2adb719e97b89021becfa1243d33c87df6cdf8a5..fe7855b090e4feed6a7d1ba6ee874cdb23a9bd0c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -993,21 +993,12 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
+static void tcp_internal_pacing(struct sock *sk)
 {
-	u64 len_ns;
-	u32 rate;
-
 	if (!tcp_needs_internal_pacing(sk))
 		return;
-	rate = sk->sk_pacing_rate;
-	if (!rate || rate == ~0U)
-		return;
-
-	len_ns = (u64)skb->len * NSEC_PER_SEC;
-	do_div(len_ns, rate);
 	hrtimer_start(&tcp_sk(sk)->pacing_timer,
-		      ktime_add_ns(ktime_get_tai_ns(), len_ns),
+		      ns_to_ktime(tcp_sk(sk)->tcp_wstamp_ns),
 		      HRTIMER_MODE_ABS_PINNED_SOFT);
 	sock_hold(sk);
 }
@@ -1026,7 +1017,8 @@ static void tcp_update_skb_after_send(struct sock *sk, struct sk_buff *skb)
 		 */
 		if (rate != ~0U && rate && tp->data_segs_out >= 10) {
 			tp->tcp_wstamp_ns += div_u64((u64)skb->len * NSEC_PER_SEC, rate);
-			/* TODO: update internal pacing here */
+
+			tcp_internal_pacing(sk);
 		}
 	}
 	list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
@@ -1167,7 +1159,6 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 		tcp_event_data_sent(tp, sk);
 		tp->data_segs_out += tcp_skb_pcount(skb);
 		tp->bytes_sent += skb->len - tcp_header_size;
-		tcp_internal_pacing(sk, skb);
 	}
 
 	if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq)
-- 
2.19.0.444.g18242da7ef-goog


* [PATCH net-next 9/9] net_sched: sch_fq: remove dead code dealing with retransmits
From: Eric Dumazet @ 2018-09-21 15:51 UTC
  To: David S. Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Willem de Bruijn, Eric Dumazet

With the earliest departure time model, we no longer plan on
special-casing TCP retransmits. We therefore remove dead code
(most compilers could see that skb_is_retransmit() always
returned false).

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_fq.c | 58 ++++------------------------------------------
 1 file changed, 5 insertions(+), 53 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 77692ad6741de14025bd848741604e775742430b..628a2cdcfc6f2fa69d9402f06881949d2e1423d9 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -106,7 +106,6 @@ struct fq_sched_data {
 
 	u64		stat_gc_flows;
 	u64		stat_internal_packets;
-	u64		stat_tcp_retrans;
 	u64		stat_throttled;
 	u64		stat_flows_plimit;
 	u64		stat_pkts_too_long;
@@ -327,62 +326,17 @@ static struct sk_buff *fq_dequeue_head(struct Qdisc *sch, struct fq_flow *flow)
 	return skb;
 }
 
-/* We might add in the future detection of retransmits
- * For the time being, just return false
- */
-static bool skb_is_retransmit(struct sk_buff *skb)
-{
-	return false;
-}
-
-/* add skb to flow queue
- * flow queue is a linked list, kind of FIFO, except for TCP retransmits
- * We special case tcp retransmits to be transmitted before other packets.
- * We rely on fact that TCP retransmits are unlikely, so we do not waste
- * a separate queue or a pointer.
- * head->  [retrans pkt 1]
- *         [retrans pkt 2]
- *         [ normal pkt 1]
- *         [ normal pkt 2]
- *         [ normal pkt 3]
- * tail->  [ normal pkt 4]
- */
 static void flow_queue_add(struct fq_flow *flow, struct sk_buff *skb)
 {
-	struct sk_buff *prev, *head = flow->head;
+	struct sk_buff *head = flow->head;
 
 	skb->next = NULL;
-	if (!head) {
+	if (!head)
 		flow->head = skb;
-		flow->tail = skb;
-		return;
-	}
-	if (likely(!skb_is_retransmit(skb))) {
+	else
 		flow->tail->next = skb;
-		flow->tail = skb;
-		return;
-	}
 
-	/* This skb is a tcp retransmit,
-	 * find the last retrans packet in the queue
-	 */
-	prev = NULL;
-	while (skb_is_retransmit(head)) {
-		prev = head;
-		head = head->next;
-		if (!head)
-			break;
-	}
-	if (!prev) { /* no rtx packet in queue, become the new head */
-		skb->next = flow->head;
-		flow->head = skb;
-	} else {
-		if (prev == flow->tail)
-			flow->tail = skb;
-		else
-			skb->next = prev->next;
-		prev->next = skb;
-	}
+	flow->tail = skb;
 }
 
 static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
@@ -401,8 +355,6 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	}
 
 	f->qlen++;
-	if (skb_is_retransmit(skb))
-		q->stat_tcp_retrans++;
 	qdisc_qstats_backlog_inc(sch, skb);
 	if (fq_flow_is_detached(f)) {
 		struct sock *sk = skb->sk;
@@ -874,7 +826,7 @@ static int fq_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 
 	st.gc_flows		  = q->stat_gc_flows;
 	st.highprio_packets	  = q->stat_internal_packets;
-	st.tcp_retrans		  = q->stat_tcp_retrans;
+	st.tcp_retrans		  = 0;
 	st.throttled		  = q->stat_throttled;
 	st.flows_plimit		  = q->stat_flows_plimit;
 	st.pkts_too_long	  = q->stat_pkts_too_long;
-- 
2.19.0.444.g18242da7ef-goog

