Netdev Archive on lore.kernel.org
From: Eric Dumazet <edumazet@google.com>
To: "David S . Miller" <davem@davemloft.net>
Cc: netdev <netdev@vger.kernel.org>, Van Jacobson <vanj@google.com>,
	Neal Cardwell <ncardwell@google.com>,
	Yuchung Cheng <ycheng@google.com>,
	Soheil Hassas Yeganeh <soheil@google.com>,
	Willem de Bruijn <willemb@google.com>,
	Eric Dumazet <edumazet@google.com>,
	Eric Dumazet <eric.dumazet@gmail.com>
Subject: [PATCH net-next 0/9] tcp: switch to Early Departure Time model
Date: Fri, 21 Sep 2018 08:51:45 -0700
Message-ID: <20180921155154.49489-1-edumazet@google.com>

In the early days, pacing was implemented in sch_fq (FQ)
in a generic way:

- SO_MAX_PACING_RATE could be used by any socket.

- TCP would vary the effective pacing rate based on CWND*MSS/SRTT.

- FQ would enforce delays between packets based on the current
  sk->sk_pacing_rate, but with some quantum-based artifacts
  (inflating RPC tail latencies).

- BBR then tweaked the pacing rate in its various phases
  (PROBE, DRAIN, ...)
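
As a rough illustration of the rate-based model described in the list
above, here is a minimal sketch, assuming the rate is simply one
congestion window per smoothed RTT. The function names are hypothetical
and this is not kernel code; the kernel's real computation applies
additional scaling that is left out here.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Pacing rate in bytes per second: one congestion window (cwnd * mss
 * bytes) per smoothed RTT. srtt_us is in microseconds.
 * Returns 0 when no RTT sample is available. */
uint64_t pacing_rate_bytes_per_sec(uint32_t cwnd, uint32_t mss,
				   uint32_t srtt_us)
{
	if (srtt_us == 0)
		return 0;
	return (uint64_t)cwnd * mss * 1000000ULL / srtt_us;
}

/* Delay, in nanoseconds, that a scheduler leaves after a packet of
 * len bytes so that the flow does not exceed rate bytes per second. */
uint64_t pacing_gap_ns(uint32_t len, uint64_t rate)
{
	if (rate == 0)
		return 0;
	return (uint64_t)len * NSEC_PER_SEC / rate;
}
```

With a 24Mbps rate (3,000,000 bytes per second), a 1500-byte packet is
followed by a 500 usec gap.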

This worked reasonably well, but had the side effect that TCP RTT
samples would be inflated by the sojourn time of the packets in FQ.

Also note that when FQ is not used and TCP wants pacing, the
internal pacing fallback behaves very differently, since TCP
emits packets at the time they should be sent (with unreasonable
assumptions about scheduling costs).

Van Jacobson gave a talk at Netdev 0x12 in Montreal about letting
TCP (or applications, for UDP messages) decide the Earliest
Departure Time, instead of letting packet schedulers derive it
from the pacing rate.

https://www.netdevconf.org/0x12/session.html?evolving-from-afap-teaching-nics-about-time
https://www.files.netdevconf.org/d/46def75c2ef345809bbe/files/?p=/Evolving%20from%20AFAP%20%E2%80%93%20Teaching%20NICs%20about%20time.pdf

Recent additions to Linux provided SO_TXTIME and a new ETF qdisc
supporting the new skb->tstamp role.

This patch series converts TCP and FQ to the same model.
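
To make the model concrete, here is a minimal sketch of how a sender
could compute the earliest departure time of each packet itself. It is
not the code from these patches: struct edt_flow and edt_stamp_packet()
are hypothetical names, and the rate is assumed to be given in bytes per
second.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-flow state: the earliest time (in ns) at which the
 * next packet of this flow is allowed to leave. */
struct edt_flow {
	uint64_t next_departure_ns;
};

/* Return the departure time to stamp on a packet of len bytes, and
 * advance the flow state by the time the packet occupies at the given
 * rate. A flow that was idle restarts from now_ns, so it does not
 * accumulate credit for the time it spent idle. */
uint64_t edt_stamp_packet(struct edt_flow *f, uint64_t now_ns,
			  uint32_t len, uint64_t rate_bytes_per_sec)
{
	uint64_t departure = f->next_departure_ns > now_ns ?
			     f->next_departure_ns : now_ns;

	f->next_departure_ns = departure;
	if (rate_bytes_per_sec > 0)
		f->next_departure_ns += (uint64_t)len * NSEC_PER_SEC /
					rate_bytes_per_sec;
	return departure;
}
```

The scheduler then only has to hold each packet until its stamped time,
instead of deriving the spacing from the pacing rate on its own.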

This might in the future allow us to relax tight TSQ limits
(if FQ is present in the output path), and thus lower the
number of callbacks to tcp_write_xmit(), thanks to batching.

This will be followed by an FQ change allowing SO_TXTIME support,
so that QUIC servers can let pacing be done in FQ (or
offloaded if the network device permits).

For example, a TCP flow paced at 24Mbps now shows a more meaningful RTT:

Before:

ESTAB  0  211408 10.246.7.151:41558   10.246.7.152:33723                
	 cubic wscale:8,8 rto:203 rtt:2.195/0.084 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:36897937
  segs_out:25488 segs_in:12454 data_segs_out:25486
  send 105.5Mbps lastsnd:1 lastrcv:12851 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 22.9Mbps
  busy:12851ms unacked:4 rcv_space:29200 notsent:205616 minrtt:0.026

After:

ESTAB  0  192584 10.246.7.151:61612   10.246.7.152:34375                
	 cubic wscale:8,8 rto:201 rtt:0.165/0.129 mss:1448 rcvmss:536 
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:170755401 
  segs_out:117931 segs_in:57651 data_segs_out:117929 
  send 1404.1Mbps lastsnd:1 lastrcv:56915 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 24.2Mbps
  busy:56915ms unacked:4 rcv_space:29200 notsent:186792 minrtt:0.054
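
The "send" figures in the two ss outputs are consistent with
cwnd * mss / rtt, which shows how strongly the RTT sample alone affects
them. The small sketch below reproduces both values; ss_send_bps() is a
hypothetical helper name, not something taken from ss itself.

```c
#include <stdint.h>

/* One congestion window (cwnd * mss bytes) per RTT, expressed in bits
 * per second. rtt_us is the RTT in microseconds and must be > 0. */
double ss_send_bps(uint32_t cwnd, uint32_t mss, double rtt_us)
{
	return (double)cwnd * (double)mss * 8.0 * 1e6 / rtt_us;
}
```

With cwnd:20 and mss:1448 in both cases, rtt:2.195 (2195 usec) gives
about 105.5Mbps and rtt:0.165 (165 usec) gives about 1404.1Mbps, while
the pacing rate stays at 24.0Mbps.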

A nice side effect of this patch series is a reduction of max/p99
latencies of RPC workloads, since the FQ quantum no longer adds
artifacts.

Eric Dumazet (9):
  tcp: switch tcp_clock_ns() to CLOCK_TAI base
  tcp: introduce tcp_skb_timestamp_us() helper
  net_sched: sch_fq: switch to CLOCK_TAI
  tcp: add tcp_wstamp_ns socket field
  tcp: provide earliest departure time in skb->tstamp
  tcp: switch internal pacing timer to CLOCK_TAI
  tcp: switch tcp and sch_fq to new earliest departure time model
  tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
  net_sched: sch_fq: remove dead code dealing with retransmits

 include/linux/skbuff.h  |  2 +-
 include/linux/tcp.h     |  2 +
 include/net/tcp.h       | 26 ++++++-------
 net/ipv4/syncookies.c   |  2 +-
 net/ipv4/tcp.c          |  2 +-
 net/ipv4/tcp_bbr.c      |  7 ++--
 net/ipv4/tcp_input.c    | 11 +++---
 net/ipv4/tcp_ipv4.c     |  2 +-
 net/ipv4/tcp_output.c   | 68 +++++++++++++++++++++------------
 net/ipv4/tcp_rate.c     | 17 +++++----
 net/ipv4/tcp_recovery.c |  5 ++-
 net/ipv4/tcp_timer.c    |  4 +-
 net/sched/sch_fq.c      | 85 +++++++++--------------------------------
 13 files changed, 104 insertions(+), 129 deletions(-)

-- 
2.19.0.444.g18242da7ef-goog

