* [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer
@ 2019-10-22 23:10 Cong Wang
  2019-10-22 23:10 ` [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS Cong Wang
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Cong Wang @ 2019-10-22 23:10 UTC (permalink / raw)
  To: netdev; +Cc: ycheng, ncardwell, edumazet, Cong Wang

This patchset contains 3 patches: patch 1 is a cleanup, patch 2
is a small change preparing for patch 3, and patch 3 is the one
that does the actual change. Please find details in each of them.

---
Cong Wang (3):
  tcp: get rid of ICSK_TIME_EARLY_RETRANS
  tcp: make tcp_send_loss_probe() boolean
  tcp: decouple TLP timer from RTO timer

 include/net/inet_connection_sock.h | 13 +++++----
 include/net/tcp.h                  |  3 ++-
 net/dccp/timer.c                   |  2 +-
 net/ipv4/inet_connection_sock.c    |  5 +++-
 net/ipv4/inet_diag.c               |  8 ++++--
 net/ipv4/tcp_input.c               |  8 ++++--
 net/ipv4/tcp_ipv4.c                |  6 +++--
 net/ipv4/tcp_output.c              |  8 ++++--
 net/ipv4/tcp_timer.c               | 43 +++++++++++++++++++++++++++---
 net/ipv6/tcp_ipv6.c                |  6 +++--
 10 files changed, 80 insertions(+), 22 deletions(-)

-- 
2.21.0



* [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS
  2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang
@ 2019-10-22 23:10 ` Cong Wang
  2019-10-22 23:10 ` [Patch net-next 2/3] tcp: make tcp_send_loss_probe() boolean Cong Wang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2019-10-22 23:10 UTC (permalink / raw)
  To: netdev; +Cc: ycheng, ncardwell, edumazet, Cong Wang

After commit bec41a11dd3d ("tcp: remove early retransmit"),
ICSK_TIME_EARLY_RETRANS is no longer used, so we can remove
its definition too.

Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/net/inet_connection_sock.h | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 895546058a20..e46958460739 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -142,9 +142,8 @@ struct inet_connection_sock {
 #define ICSK_TIME_RETRANS	1	/* Retransmit timer */
 #define ICSK_TIME_DACK		2	/* Delayed ack timer */
 #define ICSK_TIME_PROBE0	3	/* Zero window probe timer */
-#define ICSK_TIME_EARLY_RETRANS 4	/* Early retransmit timer */
-#define ICSK_TIME_LOSS_PROBE	5	/* Tail loss probe timer */
-#define ICSK_TIME_REO_TIMEOUT	6	/* Reordering timer */
+#define ICSK_TIME_LOSS_PROBE	4	/* Tail loss probe timer */
+#define ICSK_TIME_REO_TIMEOUT	5	/* Reordering timer */
 
 static inline struct inet_connection_sock *inet_csk(const struct sock *sk)
 {
@@ -227,8 +226,7 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 	}
 
 	if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
-	    what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE ||
-	    what == ICSK_TIME_REO_TIMEOUT) {
+	    what == ICSK_TIME_LOSS_PROBE || what == ICSK_TIME_REO_TIMEOUT) {
 		icsk->icsk_pending = what;
 		icsk->icsk_timeout = jiffies + when;
 		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
-- 
2.21.0



* [Patch net-next 2/3] tcp: make tcp_send_loss_probe() boolean
  2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang
  2019-10-22 23:10 ` [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS Cong Wang
@ 2019-10-22 23:10 ` Cong Wang
  2019-10-22 23:10 ` [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer Cong Wang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2019-10-22 23:10 UTC (permalink / raw)
  To: netdev; +Cc: ycheng, ncardwell, edumazet, Cong Wang

Let tcp_send_loss_probe() return whether a TLP has been
sent or not. This is needed by the following patch.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/net/tcp.h     | 2 +-
 net/ipv4/tcp_output.c | 7 +++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index ab4eb5eb5d07..0ee5400e751c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -581,7 +581,7 @@ void tcp_push_one(struct sock *, unsigned int mss_now);
 void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
 void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
-void tcp_send_loss_probe(struct sock *sk);
+bool tcp_send_loss_probe(struct sock *sk);
 bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto);
 void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 			     const struct sk_buff *next_skb);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0488607c5cd3..9822820edca4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2539,12 +2539,13 @@ static bool skb_still_in_host_queue(const struct sock *sk,
 /* When probe timeout (PTO) fires, try send a new segment if possible, else
  * retransmit the last segment.
  */
-void tcp_send_loss_probe(struct sock *sk)
+bool tcp_send_loss_probe(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	int pcount;
 	int mss = tcp_current_mss(sk);
+	bool sent = false;
 
 	skb = tcp_send_head(sk);
 	if (skb && tcp_snd_wnd_test(tp, skb, mss)) {
@@ -2560,7 +2561,7 @@ void tcp_send_loss_probe(struct sock *sk)
 			  "invalid inflight: %u state %u cwnd %u mss %d\n",
 			  tp->packets_out, sk->sk_state, tp->snd_cwnd, mss);
 		inet_csk(sk)->icsk_pending = 0;
-		return;
+		return false;
 	}
 
 	/* At most one outstanding TLP retransmission. */
@@ -2592,11 +2593,13 @@ void tcp_send_loss_probe(struct sock *sk)
 	tp->tlp_high_seq = tp->snd_nxt;
 
 probe_sent:
+	sent = true;
 	NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPLOSSPROBES);
 	/* Reset s.t. tcp_rearm_rto will restart timer from now */
 	inet_csk(sk)->icsk_pending = 0;
 rearm_timer:
 	tcp_rearm_rto(sk);
+	return sent;
 }
 
 /* Push out any pending frames which were held back due to
-- 
2.21.0



* [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer
  2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang
  2019-10-22 23:10 ` [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS Cong Wang
  2019-10-22 23:10 ` [Patch net-next 2/3] tcp: make tcp_send_loss_probe() boolean Cong Wang
@ 2019-10-22 23:10 ` Cong Wang
  2019-10-22 23:24   ` Eric Dumazet
  2019-10-28 18:29 ` [Patch net-next 0/3] " David Miller
  2019-10-30 21:56 ` David Miller
  4 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2019-10-22 23:10 UTC (permalink / raw)
  To: netdev; +Cc: ycheng, ncardwell, edumazet, Cong Wang

Currently RTO, TLP and PROBE0 all share the same timer instance
in the kernel and use icsk->icsk_pending to dispatch the work.
This causes spinlock contention when the timer is reset too
frequently, as clearly shown in the perf report:

   61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
   ...
    - 58.83% tcp_v4_rcv
      - 58.80% tcp_v4_do_rcv
         - 58.80% tcp_rcv_established
            - 52.88% __tcp_push_pending_frames
               - 52.88% tcp_write_xmit
                  - 28.16% tcp_event_new_data_sent
                     - 28.15% sk_reset_timer
                        + mod_timer
                  - 24.68% tcp_schedule_loss_probe
                     - 24.68% sk_reset_timer
                        + 24.68% mod_timer

This patch decouples the TLP timer from the RTO timer by adding
a new timer instance, but still uses icsk->icsk_pending to
dispatch, in order to minimize the risk of the change.

After this patch, the CPU time spent in tcp_write_xmit() dropped
to 10.92%.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/net/inet_connection_sock.h |  9 +++++--
 include/net/tcp.h                  |  1 +
 net/dccp/timer.c                   |  2 +-
 net/ipv4/inet_connection_sock.c    |  5 +++-
 net/ipv4/inet_diag.c               |  8 ++++--
 net/ipv4/tcp_input.c               |  8 ++++--
 net/ipv4/tcp_ipv4.c                |  6 +++--
 net/ipv4/tcp_output.c              |  1 +
 net/ipv4/tcp_timer.c               | 43 +++++++++++++++++++++++++++---
 net/ipv6/tcp_ipv6.c                |  6 +++--
 10 files changed, 73 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index e46958460739..2a129fc6b522 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -121,6 +121,7 @@ struct inet_connection_sock {
 		__u16		  last_seg_size; /* Size of last incoming segment	   */
 		__u16		  rcv_mss;	 /* MSS used for delayed ACK decisions	   */
 	} icsk_ack;
+	struct timer_list	  icsk_tlp_timer;
 	struct {
 		int		  enabled;
 
@@ -170,7 +171,8 @@ enum inet_csk_ack_state_t {
 void inet_csk_init_xmit_timers(struct sock *sk,
 			       void (*retransmit_handler)(struct timer_list *),
 			       void (*delack_handler)(struct timer_list *),
-			       void (*keepalive_handler)(struct timer_list *));
+			       void (*keepalive_handler)(struct timer_list *),
+			       void (*tlp_handler)(struct timer_list *));
 void inet_csk_clear_xmit_timers(struct sock *sk);
 
 static inline void inet_csk_schedule_ack(struct sock *sk)
@@ -226,7 +228,7 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 	}
 
 	if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 ||
-	    what == ICSK_TIME_LOSS_PROBE || what == ICSK_TIME_REO_TIMEOUT) {
+	    what == ICSK_TIME_REO_TIMEOUT) {
 		icsk->icsk_pending = what;
 		icsk->icsk_timeout = jiffies + when;
 		sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout);
@@ -234,6 +236,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 		icsk->icsk_ack.pending |= ICSK_ACK_TIMER;
 		icsk->icsk_ack.timeout = jiffies + when;
 		sk_reset_timer(sk, &icsk->icsk_delack_timer, icsk->icsk_ack.timeout);
+	} else if (what == ICSK_TIME_LOSS_PROBE) {
+		icsk->icsk_pending = what;
+		sk_reset_timer(sk, &icsk->icsk_tlp_timer, jiffies + when);
 	} else {
 		pr_debug("inet_csk BUG: unknown timer value\n");
 	}
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0ee5400e751c..3319d2b6b1c4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -331,6 +331,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
+void tcp_tail_loss_probe_handler(struct sock *sk);
 void tcp_delack_timer_handler(struct sock *sk);
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg);
 int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb);
diff --git a/net/dccp/timer.c b/net/dccp/timer.c
index c0b3672637c4..897a0469e8f1 100644
--- a/net/dccp/timer.c
+++ b/net/dccp/timer.c
@@ -246,7 +246,7 @@ void dccp_init_xmit_timers(struct sock *sk)
 	tasklet_init(&dp->dccps_xmitlet, dccp_write_xmitlet, (unsigned long)sk);
 	timer_setup(&dp->dccps_xmit_timer, dccp_write_xmit_timer, 0);
 	inet_csk_init_xmit_timers(sk, &dccp_write_timer, &dccp_delack_timer,
-				  &dccp_keepalive_timer);
+				  &dccp_keepalive_timer, NULL);
 }
 
 static ktime_t dccp_timestamp_seed;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index eb30fc1770de..4b279a86308e 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -503,12 +503,14 @@ EXPORT_SYMBOL(inet_csk_accept);
 void inet_csk_init_xmit_timers(struct sock *sk,
 			       void (*retransmit_handler)(struct timer_list *t),
 			       void (*delack_handler)(struct timer_list *t),
-			       void (*keepalive_handler)(struct timer_list *t))
+			       void (*keepalive_handler)(struct timer_list *t),
+			       void (*tlp_handler)(struct timer_list *t))
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
 	timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
+	timer_setup(&icsk->icsk_tlp_timer, tlp_handler, 0);
 	timer_setup(&sk->sk_timer, keepalive_handler, 0);
 	icsk->icsk_pending = icsk->icsk_ack.pending = 0;
 }
@@ -522,6 +524,7 @@ void inet_csk_clear_xmit_timers(struct sock *sk)
 
 	sk_stop_timer(sk, &icsk->icsk_retransmit_timer);
 	sk_stop_timer(sk, &icsk->icsk_delack_timer);
+	sk_stop_timer(sk, &icsk->icsk_tlp_timer);
 	sk_stop_timer(sk, &sk->sk_timer);
 }
 EXPORT_SYMBOL(inet_csk_clear_xmit_timers);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 7dc79b973e6e..e87fe87571a1 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -221,8 +221,7 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
 	}
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
-	    icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT ||
-	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
+	    icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT) {
 		r->idiag_timer = 1;
 		r->idiag_retrans = icsk->icsk_retransmits;
 		r->idiag_expires =
@@ -232,6 +231,11 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
 		r->idiag_retrans = icsk->icsk_probes_out;
 		r->idiag_expires =
 			jiffies_to_msecs(icsk->icsk_timeout - jiffies);
+	} else if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
+		r->idiag_timer = 1;
+		r->idiag_retrans = icsk->icsk_retransmits;
+		r->idiag_expires =
+			jiffies_to_msecs(icsk->icsk_tlp_timer.expires - jiffies);
 	} else if (timer_pending(&sk->sk_timer)) {
 		r->idiag_timer = 2;
 		r->idiag_retrans = icsk->icsk_probes_out;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a2e52ad7cdab..71cbb486ef85 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3008,8 +3008,12 @@ void tcp_rearm_rto(struct sock *sk)
 			 */
 			rto = usecs_to_jiffies(max_t(int, delta_us, 1));
 		}
-		tcp_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
-				     TCP_RTO_MAX, tcp_rtx_queue_head(sk));
+		if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
+			tcp_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, rto,
+					     TCP_RTO_MAX, tcp_rtx_queue_head(sk));
+		else
+			tcp_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
+					     TCP_RTO_MAX, tcp_rtx_queue_head(sk));
 	}
 }
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c616f0ad1fa0..f5e34fe7b2e6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2434,13 +2434,15 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i)
 	int state;
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
-	    icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT ||
-	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
+	    icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT) {
 		timer_active	= 1;
 		timer_expires	= icsk->icsk_timeout;
 	} else if (icsk->icsk_pending == ICSK_TIME_PROBE0) {
 		timer_active	= 4;
 		timer_expires	= icsk->icsk_timeout;
+	} else if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
+		timer_active	= 1;
+		timer_expires	= icsk->icsk_tlp_timer.expires;
 	} else if (timer_pending(&sk->sk_timer)) {
 		timer_active	= 2;
 		timer_expires	= sk->sk_timer.expires;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9822820edca4..9038d7d61d0f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -882,6 +882,7 @@ void tcp_release_cb(struct sock *sk)
 
 	if (flags & TCPF_WRITE_TIMER_DEFERRED) {
 		tcp_write_timer_handler(sk);
+		tcp_tail_loss_probe_handler(sk);
 		__sock_put(sk);
 	}
 	if (flags & TCPF_DELACK_TIMER_DEFERRED) {
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index dd5a6317a801..f112aa979e8c 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -591,9 +591,6 @@ void tcp_write_timer_handler(struct sock *sk)
 	case ICSK_TIME_REO_TIMEOUT:
 		tcp_rack_reo_timeout(sk);
 		break;
-	case ICSK_TIME_LOSS_PROBE:
-		tcp_send_loss_probe(sk);
-		break;
 	case ICSK_TIME_RETRANS:
 		icsk->icsk_pending = 0;
 		tcp_retransmit_timer(sk);
@@ -626,6 +623,42 @@ static void tcp_write_timer(struct timer_list *t)
 	sock_put(sk);
 }
 
+void tcp_tail_loss_probe_handler(struct sock *sk)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN) ||
+	    icsk->icsk_pending != ICSK_TIME_LOSS_PROBE)
+		goto out;
+
+	if (timer_pending(&icsk->icsk_tlp_timer))
+		goto out;
+
+	tcp_mstamp_refresh(tcp_sk(sk));
+	if (tcp_send_loss_probe(sk))
+		icsk->icsk_retransmits++;
+out:
+	sk_mem_reclaim(sk);
+}
+
+static void tcp_tail_loss_probe_timer(struct timer_list *t)
+{
+	struct inet_connection_sock *icsk =
+			from_timer(icsk, t, icsk_tlp_timer);
+	struct sock *sk = &icsk->icsk_inet.sk;
+
+	bh_lock_sock(sk);
+	if (!sock_owned_by_user(sk)) {
+		tcp_tail_loss_probe_handler(sk);
+	} else {
+		/* delegate our work to tcp_release_cb() */
+		if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &sk->sk_tsq_flags))
+			sock_hold(sk);
+	}
+	bh_unlock_sock(sk);
+	sock_put(sk);
+}
+
 void tcp_syn_ack_timeout(const struct request_sock *req)
 {
 	struct net *net = read_pnet(&inet_rsk(req)->ireq_net);
@@ -758,7 +791,9 @@ static enum hrtimer_restart tcp_compressed_ack_kick(struct hrtimer *timer)
 void tcp_init_xmit_timers(struct sock *sk)
 {
 	inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
-				  &tcp_keepalive_timer);
+				  &tcp_keepalive_timer,
+				  &tcp_tail_loss_probe_timer);
+
 	hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS_PINNED_SOFT);
 	tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4804b6dc5e65..7cc8dbe412af 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1874,13 +1874,15 @@ static void get_tcp6_sock(struct seq_file *seq, struct sock *sp, int i)
 	srcp  = ntohs(inet->inet_sport);
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS ||
-	    icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT ||
-	    icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
+	    icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT) {
 		timer_active	= 1;
 		timer_expires	= icsk->icsk_timeout;
 	} else if (icsk->icsk_pending == ICSK_TIME_PROBE0) {
 		timer_active	= 4;
 		timer_expires	= icsk->icsk_timeout;
+	} else if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) {
+		timer_active	= 1;
+		timer_expires	= icsk->icsk_tlp_timer.expires;
 	} else if (timer_pending(&sp->sk_timer)) {
 		timer_active	= 2;
 		timer_expires	= sp->sk_timer.expires;
-- 
2.21.0



* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer
  2019-10-22 23:10 ` [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer Cong Wang
@ 2019-10-22 23:24   ` Eric Dumazet
  2019-10-23  1:10     ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-10-22 23:24 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Yuchung Cheng, Neal Cardwell

On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Currently RTO, TLP and PROBE0 all share the same timer instance
> in the kernel and use icsk->icsk_pending to dispatch the work.
> This causes spinlock contention when the timer is reset too
> frequently, as clearly shown in the perf report:
>
>    61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
>    ...
>     - 58.83% tcp_v4_rcv
>       - 58.80% tcp_v4_do_rcv
>          - 58.80% tcp_rcv_established
>             - 52.88% __tcp_push_pending_frames
>                - 52.88% tcp_write_xmit
>                   - 28.16% tcp_event_new_data_sent
>                      - 28.15% sk_reset_timer
>                         + mod_timer
>                   - 24.68% tcp_schedule_loss_probe
>                      - 24.68% sk_reset_timer
>                         + 24.68% mod_timer
>
> This patch decouples the TLP timer from the RTO timer by adding
> a new timer instance, but still uses icsk->icsk_pending to
> dispatch, in order to minimize the risk of the change.
>
> After this patch, the CPU time spent in tcp_write_xmit() dropped
> to 10.92%.

What is the exact benchmark you are running?

We never saw any contention like that, so let's make sure you are not
working around another issue.


* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer
  2019-10-22 23:24   ` Eric Dumazet
@ 2019-10-23  1:10     ` Cong Wang
  2019-10-23  2:15       ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2019-10-23  1:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell

On Tue, Oct 22, 2019 at 4:24 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Currently RTO, TLP and PROBE0 all share the same timer instance
> > in the kernel and use icsk->icsk_pending to dispatch the work.
> > This causes spinlock contention when the timer is reset too
> > frequently, as clearly shown in the perf report:
> >
> >    61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
> >    ...
> >     - 58.83% tcp_v4_rcv
> >       - 58.80% tcp_v4_do_rcv
> >          - 58.80% tcp_rcv_established
> >             - 52.88% __tcp_push_pending_frames
> >                - 52.88% tcp_write_xmit
> >                   - 28.16% tcp_event_new_data_sent
> >                      - 28.15% sk_reset_timer
> >                         + mod_timer
> >                   - 24.68% tcp_schedule_loss_probe
> >                      - 24.68% sk_reset_timer
> >                         + 24.68% mod_timer
> >
> > This patch decouples the TLP timer from the RTO timer by adding
> > a new timer instance, but still uses icsk->icsk_pending to
> > dispatch, in order to minimize the risk of the change.
> >
> > After this patch, the CPU time spent in tcp_write_xmit() dropped
> > to 10.92%.
>
> What is the exact benchmark you are running?
>
> We never saw any contention like that, so let's make sure you are not
> working around another issue.

I simply ran 256 netperf instances in parallel on a 128-CPU machine
to trigger this spinlock contention; it is 100% reproducible here.
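
Roughly like this, with netserver already running on the peer (the
host name and duration below are placeholders):

  for i in $(seq 1 256); do
          netperf -H $SERVER -t TCP_STREAM -l 60 &
  done
  wait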

A single netperf TCP_RR could _also_ confirm the improvement:

Before patch:

$ netperf -H XXX -t TCP_RR -l 20
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
AF_INET to XXX () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

655360 873800 1        1       20.00    17665.59
655360 873800


After patch:

$ netperf -H XXX -t TCP_RR -l 20
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
AF_INET to XXX () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

655360 873800 1        1       20.00    18829.31
655360 873800

(I have run it multiple times and picked a median run here.)

The difference can also be observed by turning off/on TLP without patch.
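
TLP is toggled via the tcp_early_retrans sysctl:

  echo 0 > /proc/sys/net/ipv4/tcp_early_retrans    # TLP off
  echo 3 > /proc/sys/net/ipv4/tcp_early_retrans    # TLP on, the default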

Thanks.


* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer
  2019-10-23  1:10     ` Cong Wang
@ 2019-10-23  2:15       ` Eric Dumazet
  2019-10-23 17:40         ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-10-23  2:15 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Yuchung Cheng, Neal Cardwell

On Tue, Oct 22, 2019 at 6:10 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, Oct 22, 2019 at 4:24 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > Currently RTO, TLP and PROBE0 all share the same timer instance
> > > in the kernel and use icsk->icsk_pending to dispatch the work.
> > > This causes spinlock contention when the timer is reset too
> > > frequently, as clearly shown in the perf report:
> > >
> > >    61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
> > >    ...
> > >     - 58.83% tcp_v4_rcv
> > >       - 58.80% tcp_v4_do_rcv
> > >          - 58.80% tcp_rcv_established
> > >             - 52.88% __tcp_push_pending_frames
> > >                - 52.88% tcp_write_xmit
> > >                   - 28.16% tcp_event_new_data_sent
> > >                      - 28.15% sk_reset_timer
> > >                         + mod_timer
> > >                   - 24.68% tcp_schedule_loss_probe
> > >                      - 24.68% sk_reset_timer
> > >                         + 24.68% mod_timer
> > >
> > > This patch decouples the TLP timer from the RTO timer by adding
> > > a new timer instance, but still uses icsk->icsk_pending to
> > > dispatch, in order to minimize the risk of the change.
> > >
> > > After this patch, the CPU time spent in tcp_write_xmit() dropped
> > > to 10.92%.
> >
> > What is the exact benchmark you are running?
> >
> > We never saw any contention like that, so let's make sure you are not
> > working around another issue.
>
> I simply ran 256 netperf instances in parallel on a 128-CPU machine
> to trigger this spinlock contention; it is 100% reproducible here.

How many TX/RX queues on the NIC ?
What is the qdisc setup ?

>
> A single netperf TCP_RR could _also_ confirm the improvement:
>
> Before patch:
>
> $ netperf -H XXX -t TCP_RR -l 20
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
> AF_INET to XXX () port 0 AF_INET : first burst 0
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
>
> 655360 873800 1        1       20.00    17665.59
> 655360 873800
>
>
> After patch:
>
> $ netperf -H XXX -t TCP_RR -l 20
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
> AF_INET to XXX () port 0 AF_INET : first burst 0
> Local /Remote
> Socket Size   Request  Resp.   Elapsed  Trans.
> Send   Recv   Size     Size    Time     Rate
> bytes  Bytes  bytes    bytes   secs.    per sec
>
> 655360 873800 1        1       20.00    18829.31
> 655360 873800
>
> (I have run it multiple times and picked a median run here.)
>
> The difference can also be observed by turning off/on TLP without patch.

OK thanks for using something I can repro easily :)

I ran the experiment ten times :

lpaa23:/export/hda3/google/edumazet# echo 3
>/proc/sys/net/ipv4/tcp_early_retrans
lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done
  26797
  26850
  25266
  27605
  26586
  26341
  27255
  27532
  26657
  27253


Then disabled tlp, and got no obvious difference

lpaa23:/export/hda3/google/edumazet# echo 0
>/proc/sys/net/ipv4/tcp_early_retrans
lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done
  25311
  24658
  27105
  27421
  27604
  24649
  26259
  27615
  27543
  26217

I tried with 256 concurrent flows, with the same overall observation
about tlp not changing the numbers.
(In fact I am not even sure we arm RTO at all while doing a TCP_RR)
lpaa23:/export/hda3/google/edumazet# echo 3
>/proc/sys/net/ipv4/tcp_early_retrans
lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
./super_netperf 256 -H lpaa24 -t TCP_RR -l 20; done
1578682
1572444
1573490
1536378
1514905
1580854
1575949
1578925
1511164
1568213
lpaa23:/export/hda3/google/edumazet# echo 0
>/proc/sys/net/ipv4/tcp_early_retrans
lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
./super_netperf 256 -H lpaa24 -t TCP_RR -l 20; done
1576228
1578401
1577654
1579506
1570682
1582267
1550069
1530599
1583269
1578830


I wonder if you maybe have some IRQ smp_affinity problem, or some
scheduler strategy constantly migrating your user threads?

TLP is quite subtle; having two timers instead of one is probably
going to trigger various bugs.


* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer
  2019-10-23  2:15       ` Eric Dumazet
@ 2019-10-23 17:40         ` Cong Wang
  2019-10-23 18:14           ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2019-10-23 17:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell

On Tue, Oct 22, 2019 at 7:15 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 22, 2019 at 6:10 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 4:24 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > >
> > > > Currently RTO, TLP and PROBE0 all share the same timer instance
> > > > in the kernel and use icsk->icsk_pending to dispatch the work.
> > > > This causes spinlock contention when the timer is reset too
> > > > frequently, as clearly shown in the perf report:
> > > >
> > > >    61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
> > > >    ...
> > > >     - 58.83% tcp_v4_rcv
> > > >       - 58.80% tcp_v4_do_rcv
> > > >          - 58.80% tcp_rcv_established
> > > >             - 52.88% __tcp_push_pending_frames
> > > >                - 52.88% tcp_write_xmit
> > > >                   - 28.16% tcp_event_new_data_sent
> > > >                      - 28.15% sk_reset_timer
> > > >                         + mod_timer
> > > >                   - 24.68% tcp_schedule_loss_probe
> > > >                      - 24.68% sk_reset_timer
> > > >                         + 24.68% mod_timer
> > > >
> > > > This patch decouples the TLP timer from the RTO timer by adding
> > > > a new timer instance, but still uses icsk->icsk_pending to
> > > > dispatch, in order to minimize the risk of the change.
> > > >
> > > > After this patch, the CPU time spent in tcp_write_xmit() dropped
> > > > to 10.92%.
> > >
> > > What is the exact benchmark you are running?
> > >
> > > We never saw any contention like that, so let's make sure you are not
> > > working around another issue.
> >
> > I simply ran 256 netperf instances in parallel on a 128-CPU machine
> > to trigger this spinlock contention; it is 100% reproducible here.
>
> How many TX/RX queues on the NIC ?

60 queues (default), 25Gbps NIC, mlx5.

> What is the qdisc setup ?

fq_codel, which is default here. Its parameters are default too.
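
For the record, it looks roughly like this (interface name is a
placeholder, output abbreviated):

  $ tc qdisc show dev eth0
  qdisc mq 0: root
  qdisc fq_codel 0: parent :1 limit 10240p flows 1024 quantum 1514
      target 5ms interval 100ms ecn
  ...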

>
> >
> > A single netperf TCP_RR could _also_ confirm the improvement:
> >
> > Before patch:
> >
> > $ netperf -H XXX -t TCP_RR -l 20
> > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
> > AF_INET to XXX () port 0 AF_INET : first burst 0
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate
> > bytes  Bytes  bytes    bytes   secs.    per sec
> >
> > 655360 873800 1        1       20.00    17665.59
> > 655360 873800
> >
> >
> > After patch:
> >
> > $ netperf -H XXX -t TCP_RR -l 20
> > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
> > AF_INET to XXX () port 0 AF_INET : first burst 0
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate
> > bytes  Bytes  bytes    bytes   secs.    per sec
> >
> > 655360 873800 1        1       20.00    18829.31
> > 655360 873800
> >
> > (I have run it multiple times and picked a median run here.)
> >
> > The difference can also be observed by turning off/on TLP without patch.
>
> OK thanks for using something I can repro easily :)
>
> I ran the experiment ten times :

How many CPU's do you have?


>
> lpaa23:/export/hda3/google/edumazet# echo 3
> >/proc/sys/net/ipv4/tcp_early_retrans
> lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
> ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done
>   26797
>   26850
>   25266
>   27605
>   26586
>   26341
>   27255
>   27532
>   26657
>   27253
>
>
> Then disabled tlp, and got no obvious difference
>
> lpaa23:/export/hda3/google/edumazet# echo 0
> >/proc/sys/net/ipv4/tcp_early_retrans
> lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
> ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done
>   25311
>   24658
>   27105
>   27421
>   27604
>   24649
>   26259
>   27615
>   27543
>   26217
>
> I tried with 256 concurrent flows, with the same overall observation
> about tlp not changing the numbers.
> (In fact I am not even sure we arm RTO at all while doing a TCP_RR)

In case you misunderstood: the CPU profiling was captured
during 256 parallel TCP_STREAM tests.
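
The profile was a plain system-wide capture, along the lines of:

  perf record -a -g -- sleep 10
  perf report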


> lpaa23:/export/hda3/google/edumazet# echo 3
> >/proc/sys/net/ipv4/tcp_early_retrans
> lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
> ./super_netperf 256 -H lpaa24 -t TCP_RR -l 20; done
> 1578682
> 1572444
> 1573490
> 1536378
> 1514905
> 1580854
> 1575949
> 1578925
> 1511164
> 1568213
> lpaa23:/export/hda3/google/edumazet# echo 0
> >/proc/sys/net/ipv4/tcp_early_retrans
> lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
> ./super_netperf 256 -H lpaa24 -t TCP_RR -l 20; done
> 1576228
> 1578401
> 1577654
> 1579506
> 1570682
> 1582267
> 1550069
> 1530599
> 1583269
> 1578830
>
>
> I wonder if you maybe have some IRQ smp_affinity problem, or some
> scheduler strategy constantly migrating your user threads?

Scheduler is default too. IRQ smp affinity is statically distributed to
each of the first 60 CPU's.
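
Roughly (the mlx5 match pattern is illustrative):

  cpu=0
  for irq in $(awk -F: '/mlx5/ { print $1+0 }' /proc/interrupts); do
          echo $cpu > /proc/irq/$irq/smp_affinity_list
          cpu=$((cpu + 1))
  done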

>
> TLP is quite subtle; having two timers instead of one is probably
> going to trigger various bugs.

This is exactly why I keep icsk_pending. ;)

Thanks.


* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer
  2019-10-23 17:40         ` Cong Wang
@ 2019-10-23 18:14           ` Eric Dumazet
  2019-10-23 18:30             ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-10-23 18:14 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Yuchung Cheng, Neal Cardwell

On Wed, Oct 23, 2019 at 10:40 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, Oct 22, 2019 at 7:15 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 6:10 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Tue, Oct 22, 2019 at 4:24 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > >
> > > > > Currently RTO, TLP and PROBE0 all share the same timer instance
> > > > > in the kernel and use icsk->icsk_pending to dispatch the work.
> > > > > This causes spinlock contention when the timer is reset too
> > > > > frequently, as clearly shown in the perf report:
> > > > >
> > > > >    61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
> > > > >    ...
> > > > >     - 58.83% tcp_v4_rcv
> > > > >       - 58.80% tcp_v4_do_rcv
> > > > >          - 58.80% tcp_rcv_established
> > > > >             - 52.88% __tcp_push_pending_frames
> > > > >                - 52.88% tcp_write_xmit
> > > > >                   - 28.16% tcp_event_new_data_sent
> > > > >                      - 28.15% sk_reset_timer
> > > > >                         + mod_timer
> > > > >                   - 24.68% tcp_schedule_loss_probe
> > > > >                      - 24.68% sk_reset_timer
> > > > >                         + 24.68% mod_timer
> > > > >
> > > > > This patch decouples the TLP timer from the RTO timer by adding
> > > > > a new timer instance, but still uses icsk->icsk_pending to
> > > > > dispatch, in order to minimize the risk of the change.
> > > > >
> > > > > After this patch, the CPU time spent in tcp_write_xmit() dropped
> > > > > to 10.92%.
> > > >
> > > > What is the exact benchmark you are running?
> > > >
> > > > We never saw any contention like that, so let's make sure you are not
> > > > working around another issue.
> > >
> > > I simply ran 256 netperf instances in parallel on a 128-CPU machine
> > > to trigger this spinlock contention; it is 100% reproducible here.
> >
> > How many TX/RX queues on the NIC ?
>
> 60 queues (default), 25Gbps NIC, mlx5.
>
> > What is the qdisc setup ?
>
> fq_codel, which is default here. Its parameters are default too.
>
> >
> > >
> > > A single netperf TCP_RR could _also_ confirm the improvement:
> > >
> > > Before patch:
> > >
> > > $ netperf -H XXX -t TCP_RR -l 20
> > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
> > > AF_INET to XXX () port 0 AF_INET : first burst 0
> > > Local /Remote
> > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > Send   Recv   Size     Size    Time     Rate
> > > bytes  Bytes  bytes    bytes   secs.    per sec
> > >
> > > 655360 873800 1        1       20.00    17665.59
> > > 655360 873800
> > >
> > >
> > > After patch:
> > >
> > > $ netperf -H XXX -t TCP_RR -l 20
> > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0
> > > AF_INET to XXX () port 0 AF_INET : first burst 0
> > > Local /Remote
> > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > Send   Recv   Size     Size    Time     Rate
> > > bytes  Bytes  bytes    bytes   secs.    per sec
> > >
> > > 655360 873800 1        1       20.00    18829.31
> > > 655360 873800
> > >
> > > (I have run it multiple times and picked a median run here.)
> > >
> > > The difference can also be observed by turning off/on TLP without patch.
> >
> > OK thanks for using something I can repro easily :)
> >
> > I ran the experiment ten times :
>
> How many CPU's do you have?
>
>
> >
> > lpaa23:/export/hda3/google/edumazet# echo 3
> > >/proc/sys/net/ipv4/tcp_early_retrans
> > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
> > ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done
> >   26797
> >   26850
> >   25266
> >   27605
> >   26586
> >   26341
> >   27255
> >   27532
> >   26657
> >   27253
> >
> >
> > Then disabled tlp, and got no obvious difference
> >
> > lpaa23:/export/hda3/google/edumazet# echo 0
> > >/proc/sys/net/ipv4/tcp_early_retrans
> > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do
> > ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done
> >   25311
> >   24658
> >   27105
> >   27421
> >   27604
> >   24649
> >   26259
> >   27615
> >   27543
> >   26217
> >
> > I tried with 256 concurrent flows, with the same overall observation
> > about tlp not changing the numbers.
> > (In fact I am not even sure we arm RTO at all while doing a TCP_RR)
>
> In case you misunderstood: the CPU profiling was captured
> during 256 parallel TCP_STREAM tests.

When I asked you about the workload, you gave me TCP_RR output, not TCP_STREAM :/

"A single netperf TCP_RR could _also_ confirm the improvement:"


* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer
  2019-10-23 18:14           ` Eric Dumazet
@ 2019-10-23 18:30             ` Cong Wang
  2019-10-23 18:55               ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2019-10-23 18:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell

On Wed, Oct 23, 2019 at 11:14 AM Eric Dumazet <edumazet@google.com> wrote:
> > In case you misunderstood: the CPU profiling was captured
> > during 256 parallel TCP_STREAM tests.
>
> When I asked you about the workload, you gave me TCP_RR output, not TCP_STREAM :/
>
> "A single netperf TCP_RR could _also_ confirm the improvement:"

I guess you didn't understand what "also" means. The improvement
can be measured with both TCP_STREAM and TCP_RR; only the
CPU profiling was done with TCP_STREAM.

BTW, I just tested an unpatched kernel on a machine with 64 CPU's;
turning TLP on/off makes no difference there, so this is likely
related to the number of CPU's or to the hardware configuration.
This explains why you can't reproduce it on your side; so far I can
only reproduce it on one particular hardware platform, but it is real.

If you need any other information, please let me know.

Thanks.


* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer
  2019-10-23 18:30             ` Cong Wang
@ 2019-10-23 18:55               ` Eric Dumazet
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2019-10-23 18:55 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Yuchung Cheng, Neal Cardwell

On Wed, Oct 23, 2019 at 11:30 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Oct 23, 2019 at 11:14 AM Eric Dumazet <edumazet@google.com> wrote:
> > > In case you misunderstood: the CPU profiling was captured
> > > during 256 parallel TCP_STREAM tests.
> >
> > When I asked you about the workload, you gave me TCP_RR output, not TCP_STREAM :/
> >
> > "A single netperf TCP_RR could _also_ confirm the improvement:"
>
> I guess you didn't understand what "also" means. The improvement
> can be measured with both TCP_STREAM and TCP_RR; only the
> CPU profiling was done with TCP_STREAM.
>

Except that I could not measure any gain with TCP_RR, which is expected,
since TCP_RR will not use RTO and TLP at the same time.

If you found that we were setting both RTO and TLP when sending a
1-byte message, we need to fix the stack instead of working around it.
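
For what it's worth, which timer is armed is visible from userspace,
since ss reports the idiag_timer value that patch 3 touches: "on" for
RTO/TLP/reo, "persist" for probe0. The expiry below is illustrative:

  $ ss -to state established
  ...  timer:(on,204ms,0)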

> BTW, I just tested an unpatched kernel on a machine with 64 CPU's;
> turning TLP on/off makes no difference there, so this is likely
> related to the number of CPU's or to the hardware configuration.
> This explains why you can't reproduce it on your side; so far I can
> only reproduce it on one particular hardware platform, but it is real.
>

I have hosts with 112 cpus; I can try on them, but it will take some time.


* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer
  2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang
                   ` (2 preceding siblings ...)
  2019-10-22 23:10 ` [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer Cong Wang
@ 2019-10-28 18:29 ` David Miller
  2019-10-28 18:34   ` Eric Dumazet
  2019-10-30 21:56 ` David Miller
  4 siblings, 1 reply; 17+ messages in thread
From: David Miller @ 2019-10-28 18:29 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, ycheng, ncardwell, edumazet

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 22 Oct 2019 16:10:48 -0700

> This patchset contains 3 patches: patch 1 is a cleanup, patch 2
> is a small change preparing for patch 3, and patch 3 is the one
> that does the actual change. Please find details in each of them.

Eric, have you had a chance to test this on a system with
suitable CPU arity?


* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer
  2019-10-28 18:29 ` [Patch net-next 0/3] " David Miller
@ 2019-10-28 18:34   ` Eric Dumazet
  2019-10-28 20:13     ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-10-28 18:34 UTC (permalink / raw)
  To: David Miller; +Cc: Cong Wang, netdev, Yuchung Cheng, Neal Cardwell

On Mon, Oct 28, 2019 at 11:29 AM David Miller <davem@davemloft.net> wrote:
>
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Tue, 22 Oct 2019 16:10:48 -0700
>
> > This patchset contains 3 patches: patch 1 is a cleanup, patch 2
> > is a small change preparing for patch 3, and patch 3 is the one
> > that does the actual change. Please find details in each of them.
>
> Eric, have you had a chance to test this on a system with
> suitable CPU arity?

Yes, and I confirm I could not repro the issues at all.

I got a 100Gbit NIC to increase the pressure a bit, and driving
this NIC at line rate used only 2% of my 96-cpu host, with no
spinlock contention of any sort.

Thanks.


* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer
  2019-10-28 18:34   ` Eric Dumazet
@ 2019-10-28 20:13     ` Cong Wang
  2019-10-28 20:31       ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2019-10-28 20:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell

On Mon, Oct 28, 2019 at 11:34 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Oct 28, 2019 at 11:29 AM David Miller <davem@davemloft.net> wrote:
> >
> > From: Cong Wang <xiyou.wangcong@gmail.com>
> > Date: Tue, 22 Oct 2019 16:10:48 -0700
> >
> > > This patchset contains 3 patches: patch 1 is a cleanup, patch 2
> > > is a small change preparing for patch 3, and patch 3 is the one
> > > that does the actual change. Please find details in each of them.
> >
> > Eric, have you had a chance to test this on a system with
> > suitable CPU arity?
>
> Yes, and I confirm I could not repro the issues at all.
>
> I got a 100Gbit NIC to increase the pressure a bit, and driving
> this NIC at line rate used only 2% of my 96-cpu host, with no
> spinlock contention of any sort.

Please let me know if there is anything else I can provide to help
you to make the decision.

All I can say so far is this only happens on our hosts with 128
AMD CPU's. I don't see anything here related to AMD, so I think
only the number of CPU's (vs. number of TX queues?) matters.
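
For reference, the counts being compared come from something like
this (interface name is a placeholder, output abbreviated):

  $ nproc
  128
  $ ethtool -l eth0 | grep -A5 'Current'
  Current hardware settings:
  ...
  Combined:       60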

Thanks.


* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer
  2019-10-28 20:13     ` Cong Wang
@ 2019-10-28 20:31       ` Eric Dumazet
  2019-10-29  0:49         ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2019-10-28 20:31 UTC (permalink / raw)
  To: Cong Wang; +Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell

On Mon, Oct 28, 2019 at 1:13 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Mon, Oct 28, 2019 at 11:34 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Oct 28, 2019 at 11:29 AM David Miller <davem@davemloft.net> wrote:
> > >
> > > From: Cong Wang <xiyou.wangcong@gmail.com>
> > > Date: Tue, 22 Oct 2019 16:10:48 -0700
> > >
> > > > This patchset contains 3 patches: patch 1 is a cleanup, patch 2
> > > > is a small change preparing for patch 3, and patch 3 is the one
> > > > that does the actual change. Please find details in each of them.
> > >
> > > Eric, have you had a chance to test this on a system with
> > > suitable CPU arity?
> >
> > Yes, and I confirm I could not repro the issues at all.
> >
> > I got a 100Gbit NIC to increase the pressure a bit, and driving
> > this NIC at line rate used only 2% of my 96-cpu host, with no
> > spinlock contention of any sort.
>
> Please let me know if there is anything else I can provide to help
> you to make the decision.
>
> All I can say so far is this only happens on our hosts with 128
> AMD CPU's. I don't see anything here related to AMD, so I think
> only the number of CPU's (vs. number of TX queues?) matters.
>

I also have AMD hosts with 256 cpus; I can try them later (not
today, I am too busy).

But I feel you are trying to work around a more fundamental issue if
this problem only shows up on AMD hosts.


* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer
  2019-10-28 20:31       ` Eric Dumazet
@ 2019-10-29  0:49         ` Cong Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Cong Wang @ 2019-10-29  0:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell

On Mon, Oct 28, 2019 at 1:31 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Oct 28, 2019 at 1:13 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Mon, Oct 28, 2019 at 11:34 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Mon, Oct 28, 2019 at 11:29 AM David Miller <davem@davemloft.net> wrote:
> > > >
> > > > From: Cong Wang <xiyou.wangcong@gmail.com>
> > > > Date: Tue, 22 Oct 2019 16:10:48 -0700
> > > >
> > > > > This patchset contains 3 patches: patch 1 is a cleanup, patch 2
> > > > > is a small change preparing for patch 3, and patch 3 is the one
> > > > > that does the actual change. Please find details in each of them.
> > > >
> > > > Eric, have you had a chance to test this on a system with
> > > > suitable CPU arity?
> > >
> > > Yes, and I confirm I could not repro the issues at all.
> > >
> > > I got a 100Gbit NIC to increase the pressure a bit, and driving
> > > this NIC at line rate used only 2% of my 96-cpu host, with no
> > > spinlock contention of any sort.
> >
> > Please let me know if there is anything else I can provide to help
> > you to make the decision.
> >
> > All I can say so far is this only happens on our hosts with 128
> > AMD CPU's. I don't see anything here related to AMD, so I think
> > only the number of CPU's (vs. number of TX queues?) matters.
> >
>
> I also have AMD hosts with 256 cpus; I can try them later (not
> today, I am too busy).
>
> But I feel you are trying to work around a more fundamental issue if
> this problem only shows up on AMD hosts.

I wish I had Intel hosts with the same number of CPU's, but I don't;
all the Intel ones have fewer, probably 80 at most. This is why I
think it is related to the number of CPU's.

Also, IOMMU is turned off explicitly, I don't see anything else could
be AMD specific along the TCP path.

Thanks.


* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer
  2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang
                   ` (3 preceding siblings ...)
  2019-10-28 18:29 ` [Patch net-next 0/3] " David Miller
@ 2019-10-30 21:56 ` David Miller
  4 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2019-10-30 21:56 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, ycheng, ncardwell, edumazet

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue, 22 Oct 2019 16:10:48 -0700

> This patchset contains 3 patches: patch 1 is a cleanup, patch 2
> is a small change preparing for patch 3, and patch 3 is the one
> that does the actual change. Please find details in each of them.

I'm marking this deferred until someone can drill down why this is
only seen in such a specific configuration, and not to ANY EXTENT
whatsoever with just a slightly lower number of CPUs on other
machines.

It's really hard to justify this set of changes without a full
understanding and detailed analysis.

Thanks.


end of thread (newest: 2019-10-30 21:56 UTC)

Thread overview: 17+ messages
2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang
2019-10-22 23:10 ` [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS Cong Wang
2019-10-22 23:10 ` [Patch net-next 2/3] tcp: make tcp_send_loss_probe() boolean Cong Wang
2019-10-22 23:10 ` [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer Cong Wang
2019-10-22 23:24   ` Eric Dumazet
2019-10-23  1:10     ` Cong Wang
2019-10-23  2:15       ` Eric Dumazet
2019-10-23 17:40         ` Cong Wang
2019-10-23 18:14           ` Eric Dumazet
2019-10-23 18:30             ` Cong Wang
2019-10-23 18:55               ` Eric Dumazet
2019-10-28 18:29 ` [Patch net-next 0/3] " David Miller
2019-10-28 18:34   ` Eric Dumazet
2019-10-28 20:13     ` Cong Wang
2019-10-28 20:31       ` Eric Dumazet
2019-10-29  0:49         ` Cong Wang
2019-10-30 21:56 ` David Miller
