* [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer @ 2019-10-22 23:10 Cong Wang 2019-10-22 23:10 ` [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS Cong Wang ` (4 more replies) 0 siblings, 5 replies; 17+ messages in thread From: Cong Wang @ 2019-10-22 23:10 UTC (permalink / raw) To: netdev; +Cc: ycheng, ncardwell, edumazet, Cong Wang This patchset contains 3 patches: patch 1 is a cleanup, patch 2 is a small change preparing for patch 3, patch 3 is the one that does the actual change. Please find details in each of them. --- Cong Wang (3): tcp: get rid of ICSK_TIME_EARLY_RETRANS tcp: make tcp_send_loss_probe() boolean tcp: decouple TLP timer from RTO timer include/net/inet_connection_sock.h | 13 +++++---- include/net/tcp.h | 3 ++- net/dccp/timer.c | 2 +- net/ipv4/inet_connection_sock.c | 5 +++- net/ipv4/inet_diag.c | 8 ++++-- net/ipv4/tcp_input.c | 8 ++++-- net/ipv4/tcp_ipv4.c | 6 +++-- net/ipv4/tcp_output.c | 8 ++++-- net/ipv4/tcp_timer.c | 43 +++++++++++++++++++++++++++--- net/ipv6/tcp_ipv6.c | 6 +++-- 10 files changed, 80 insertions(+), 22 deletions(-) -- 2.21.0 ^ permalink raw reply [flat|nested] 17+ messages in thread
* [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS 2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang @ 2019-10-22 23:10 ` Cong Wang 2019-10-22 23:10 ` [Patch net-next 2/3] tcp: make tcp_send_loss_probe() boolean Cong Wang ` (3 subsequent siblings) 4 siblings, 0 replies; 17+ messages in thread From: Cong Wang @ 2019-10-22 23:10 UTC (permalink / raw) To: netdev; +Cc: ycheng, ncardwell, edumazet, Cong Wang After commit bec41a11dd3d ("tcp: remove early retransmit") ICSK_TIME_EARLY_RETRANS is no longer effective, so we can remove its definition too. Cc: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> --- include/net/inet_connection_sock.h | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 895546058a20..e46958460739 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -142,9 +142,8 @@ struct inet_connection_sock { #define ICSK_TIME_RETRANS 1 /* Retransmit timer */ #define ICSK_TIME_DACK 2 /* Delayed ack timer */ #define ICSK_TIME_PROBE0 3 /* Zero window probe timer */ -#define ICSK_TIME_EARLY_RETRANS 4 /* Early retransmit timer */ -#define ICSK_TIME_LOSS_PROBE 5 /* Tail loss probe timer */ -#define ICSK_TIME_REO_TIMEOUT 6 /* Reordering timer */ +#define ICSK_TIME_LOSS_PROBE 4 /* Tail loss probe timer */ +#define ICSK_TIME_REO_TIMEOUT 5 /* Reordering timer */ static inline struct inet_connection_sock *inet_csk(const struct sock *sk) { @@ -227,8 +226,7 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what, } if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 || - what == ICSK_TIME_EARLY_RETRANS || what == ICSK_TIME_LOSS_PROBE || - what == ICSK_TIME_REO_TIMEOUT) { + what == ICSK_TIME_LOSS_PROBE || what == ICSK_TIME_REO_TIMEOUT) { icsk->icsk_pending = what; icsk->icsk_timeout = 
jiffies + when; sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout); -- 2.21.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [Patch net-next 2/3] tcp: make tcp_send_loss_probe() boolean 2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang 2019-10-22 23:10 ` [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS Cong Wang @ 2019-10-22 23:10 ` Cong Wang 2019-10-22 23:10 ` [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer Cong Wang ` (2 subsequent siblings) 4 siblings, 0 replies; 17+ messages in thread From: Cong Wang @ 2019-10-22 23:10 UTC (permalink / raw) To: netdev; +Cc: ycheng, ncardwell, edumazet, Cong Wang Let tcp_send_loss_probe() return whether a TLP has been sent or not. This is needed by the following patch. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> --- include/net/tcp.h | 2 +- net/ipv4/tcp_output.c | 7 +++++-- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index ab4eb5eb5d07..0ee5400e751c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -581,7 +581,7 @@ void tcp_push_one(struct sock *, unsigned int mss_now); void __tcp_send_ack(struct sock *sk, u32 rcv_nxt); void tcp_send_ack(struct sock *sk); void tcp_send_delayed_ack(struct sock *sk); -void tcp_send_loss_probe(struct sock *sk); +bool tcp_send_loss_probe(struct sock *sk); bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto); void tcp_skb_collapse_tstamp(struct sk_buff *skb, const struct sk_buff *next_skb); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 0488607c5cd3..9822820edca4 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2539,12 +2539,13 @@ static bool skb_still_in_host_queue(const struct sock *sk, /* When probe timeout (PTO) fires, try send a new segment if possible, else * retransmit the last segment. 
*/ -void tcp_send_loss_probe(struct sock *sk) +bool tcp_send_loss_probe(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; int pcount; int mss = tcp_current_mss(sk); + bool sent = false; skb = tcp_send_head(sk); if (skb && tcp_snd_wnd_test(tp, skb, mss)) { @@ -2560,7 +2561,7 @@ void tcp_send_loss_probe(struct sock *sk) "invalid inflight: %u state %u cwnd %u mss %d\n", tp->packets_out, sk->sk_state, tp->snd_cwnd, mss); inet_csk(sk)->icsk_pending = 0; - return; + return false; } /* At most one outstanding TLP retransmission. */ @@ -2592,11 +2593,13 @@ void tcp_send_loss_probe(struct sock *sk) tp->tlp_high_seq = tp->snd_nxt; probe_sent: + sent = true; NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPLOSSPROBES); /* Reset s.t. tcp_rearm_rto will restart timer from now */ inet_csk(sk)->icsk_pending = 0; rearm_timer: tcp_rearm_rto(sk); + return sent; } /* Push out any pending frames which were held back due to -- 2.21.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer 2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang 2019-10-22 23:10 ` [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS Cong Wang 2019-10-22 23:10 ` [Patch net-next 2/3] tcp: make tcp_send_loss_probe() boolean Cong Wang @ 2019-10-22 23:10 ` Cong Wang 2019-10-22 23:24 ` Eric Dumazet 2019-10-28 18:29 ` [Patch net-next 0/3] " David Miller 2019-10-30 21:56 ` David Miller 4 siblings, 1 reply; 17+ messages in thread From: Cong Wang @ 2019-10-22 23:10 UTC (permalink / raw) To: netdev; +Cc: ycheng, ncardwell, edumazet, Cong Wang Currently RTO, TLP and PROBE0 all share a same timer instance in kernel and use icsk->icsk_pending to dispatch the work. This causes spinlock contention when resetting the timer is too frequent, as clearly shown in the perf report: 61.72% 61.71% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath ... - 58.83% tcp_v4_rcv - 58.80% tcp_v4_do_rcv - 58.80% tcp_rcv_established - 52.88% __tcp_push_pending_frames - 52.88% tcp_write_xmit - 28.16% tcp_event_new_data_sent - 28.15% sk_reset_timer + mod_timer - 24.68% tcp_schedule_loss_probe - 24.68% sk_reset_timer + 24.68% mod_timer This patch decouples TLP timer from RTO timer by adding a new timer instance but still uses icsk->icsk_pending to dispatch, in order to minimize the risk of this patch. After this patch, the CPU time spent in tcp_write_xmit() reduced down to 10.92%. 
Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> --- include/net/inet_connection_sock.h | 9 +++++-- include/net/tcp.h | 1 + net/dccp/timer.c | 2 +- net/ipv4/inet_connection_sock.c | 5 +++- net/ipv4/inet_diag.c | 8 ++++-- net/ipv4/tcp_input.c | 8 ++++-- net/ipv4/tcp_ipv4.c | 6 +++-- net/ipv4/tcp_output.c | 1 + net/ipv4/tcp_timer.c | 43 +++++++++++++++++++++++++++--- net/ipv6/tcp_ipv6.c | 6 +++-- 10 files changed, 73 insertions(+), 16 deletions(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index e46958460739..2a129fc6b522 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -121,6 +121,7 @@ struct inet_connection_sock { __u16 last_seg_size; /* Size of last incoming segment */ __u16 rcv_mss; /* MSS used for delayed ACK decisions */ } icsk_ack; + struct timer_list icsk_tlp_timer; struct { int enabled; @@ -170,7 +171,8 @@ enum inet_csk_ack_state_t { void inet_csk_init_xmit_timers(struct sock *sk, void (*retransmit_handler)(struct timer_list *), void (*delack_handler)(struct timer_list *), - void (*keepalive_handler)(struct timer_list *)); + void (*keepalive_handler)(struct timer_list *), + void (*tlp_handler)(struct timer_list *)); void inet_csk_clear_xmit_timers(struct sock *sk); static inline void inet_csk_schedule_ack(struct sock *sk) @@ -226,7 +228,7 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what, } if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0 || - what == ICSK_TIME_LOSS_PROBE || what == ICSK_TIME_REO_TIMEOUT) { + what == ICSK_TIME_REO_TIMEOUT) { icsk->icsk_pending = what; icsk->icsk_timeout = jiffies + when; sk_reset_timer(sk, &icsk->icsk_retransmit_timer, icsk->icsk_timeout); @@ -234,6 +236,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what, icsk->icsk_ack.pending |= ICSK_ACK_TIMER; icsk->icsk_ack.timeout = jiffies + when; sk_reset_timer(sk, &icsk->icsk_delack_timer, 
icsk->icsk_ack.timeout); + } else if (what == ICSK_TIME_LOSS_PROBE) { + icsk->icsk_pending = what; + sk_reset_timer(sk, &icsk->icsk_tlp_timer, jiffies + when); } else { pr_debug("inet_csk BUG: unknown timer value\n"); } diff --git a/include/net/tcp.h b/include/net/tcp.h index 0ee5400e751c..3319d2b6b1c4 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -331,6 +331,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset, void tcp_release_cb(struct sock *sk); void tcp_wfree(struct sk_buff *skb); void tcp_write_timer_handler(struct sock *sk); +void tcp_tail_loss_probe_handler(struct sock *sk); void tcp_delack_timer_handler(struct sock *sk); int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg); int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb); diff --git a/net/dccp/timer.c b/net/dccp/timer.c index c0b3672637c4..897a0469e8f1 100644 --- a/net/dccp/timer.c +++ b/net/dccp/timer.c @@ -246,7 +246,7 @@ void dccp_init_xmit_timers(struct sock *sk) tasklet_init(&dp->dccps_xmitlet, dccp_write_xmitlet, (unsigned long)sk); timer_setup(&dp->dccps_xmit_timer, dccp_write_xmit_timer, 0); inet_csk_init_xmit_timers(sk, &dccp_write_timer, &dccp_delack_timer, - &dccp_keepalive_timer); + &dccp_keepalive_timer, NULL); } static ktime_t dccp_timestamp_seed; diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index eb30fc1770de..4b279a86308e 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -503,12 +503,14 @@ EXPORT_SYMBOL(inet_csk_accept); void inet_csk_init_xmit_timers(struct sock *sk, void (*retransmit_handler)(struct timer_list *t), void (*delack_handler)(struct timer_list *t), - void (*keepalive_handler)(struct timer_list *t)) + void (*keepalive_handler)(struct timer_list *t), + void (*tlp_handler)(struct timer_list *t)) { struct inet_connection_sock *icsk = inet_csk(sk); timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0); timer_setup(&icsk->icsk_delack_timer, 
delack_handler, 0); + timer_setup(&icsk->icsk_tlp_timer, tlp_handler, 0); timer_setup(&sk->sk_timer, keepalive_handler, 0); icsk->icsk_pending = icsk->icsk_ack.pending = 0; } @@ -522,6 +524,7 @@ void inet_csk_clear_xmit_timers(struct sock *sk) sk_stop_timer(sk, &icsk->icsk_retransmit_timer); sk_stop_timer(sk, &icsk->icsk_delack_timer); + sk_stop_timer(sk, &icsk->icsk_tlp_timer); sk_stop_timer(sk, &sk->sk_timer); } EXPORT_SYMBOL(inet_csk_clear_xmit_timers); diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c index 7dc79b973e6e..e87fe87571a1 100644 --- a/net/ipv4/inet_diag.c +++ b/net/ipv4/inet_diag.c @@ -221,8 +221,7 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk, } if (icsk->icsk_pending == ICSK_TIME_RETRANS || - icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT || - icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { + icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT) { r->idiag_timer = 1; r->idiag_retrans = icsk->icsk_retransmits; r->idiag_expires = @@ -232,6 +231,11 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk, r->idiag_retrans = icsk->icsk_probes_out; r->idiag_expires = jiffies_to_msecs(icsk->icsk_timeout - jiffies); + } else if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { + r->idiag_timer = 1; + r->idiag_retrans = icsk->icsk_retransmits; + r->idiag_expires = + jiffies_to_msecs(icsk->icsk_tlp_timer.expires - jiffies); } else if (timer_pending(&sk->sk_timer)) { r->idiag_timer = 2; r->idiag_retrans = icsk->icsk_probes_out; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index a2e52ad7cdab..71cbb486ef85 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3008,8 +3008,12 @@ void tcp_rearm_rto(struct sock *sk) */ rto = usecs_to_jiffies(max_t(int, delta_us, 1)); } - tcp_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, - TCP_RTO_MAX, tcp_rtx_queue_head(sk)); + if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) + tcp_reset_xmit_timer(sk, ICSK_TIME_LOSS_PROBE, rto, + TCP_RTO_MAX, tcp_rtx_queue_head(sk)); + 
else + tcp_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, + TCP_RTO_MAX, tcp_rtx_queue_head(sk)); } } diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c616f0ad1fa0..f5e34fe7b2e6 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2434,13 +2434,15 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i) int state; if (icsk->icsk_pending == ICSK_TIME_RETRANS || - icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT || - icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { + icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT) { timer_active = 1; timer_expires = icsk->icsk_timeout; } else if (icsk->icsk_pending == ICSK_TIME_PROBE0) { timer_active = 4; timer_expires = icsk->icsk_timeout; + } else if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { + timer_active = 1; + timer_expires = icsk->icsk_tlp_timer.expires; } else if (timer_pending(&sk->sk_timer)) { timer_active = 2; timer_expires = sk->sk_timer.expires; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 9822820edca4..9038d7d61d0f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -882,6 +882,7 @@ void tcp_release_cb(struct sock *sk) if (flags & TCPF_WRITE_TIMER_DEFERRED) { tcp_write_timer_handler(sk); + tcp_tail_loss_probe_handler(sk); __sock_put(sk); } if (flags & TCPF_DELACK_TIMER_DEFERRED) { diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index dd5a6317a801..f112aa979e8c 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -591,9 +591,6 @@ void tcp_write_timer_handler(struct sock *sk) case ICSK_TIME_REO_TIMEOUT: tcp_rack_reo_timeout(sk); break; - case ICSK_TIME_LOSS_PROBE: - tcp_send_loss_probe(sk); - break; case ICSK_TIME_RETRANS: icsk->icsk_pending = 0; tcp_retransmit_timer(sk); @@ -626,6 +623,42 @@ static void tcp_write_timer(struct timer_list *t) sock_put(sk); } +void tcp_tail_loss_probe_handler(struct sock *sk) +{ + struct inet_connection_sock *icsk = inet_csk(sk); + + if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN) || + icsk->icsk_pending 
!= ICSK_TIME_LOSS_PROBE) + goto out; + + if (timer_pending(&icsk->icsk_tlp_timer)) + goto out; + + tcp_mstamp_refresh(tcp_sk(sk)); + if (tcp_send_loss_probe(sk)) + icsk->icsk_retransmits++; +out: + sk_mem_reclaim(sk); +} + +static void tcp_tail_loss_probe_timer(struct timer_list *t) +{ + struct inet_connection_sock *icsk = + from_timer(icsk, t, icsk_tlp_timer); + struct sock *sk = &icsk->icsk_inet.sk; + + bh_lock_sock(sk); + if (!sock_owned_by_user(sk)) { + tcp_tail_loss_probe_handler(sk); + } else { + /* delegate our work to tcp_release_cb() */ + if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &sk->sk_tsq_flags)) + sock_hold(sk); + } + bh_unlock_sock(sk); + sock_put(sk); +} + void tcp_syn_ack_timeout(const struct request_sock *req) { struct net *net = read_pnet(&inet_rsk(req)->ireq_net); @@ -758,7 +791,9 @@ static enum hrtimer_restart tcp_compressed_ack_kick(struct hrtimer *timer) void tcp_init_xmit_timers(struct sock *sk) { inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer, - &tcp_keepalive_timer); + &tcp_keepalive_timer, + &tcp_tail_loss_probe_timer); + hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_SOFT); tcp_sk(sk)->pacing_timer.function = tcp_pace_kick; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 4804b6dc5e65..7cc8dbe412af 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1874,13 +1874,15 @@ static void get_tcp6_sock(struct seq_file *seq, struct sock *sp, int i) srcp = ntohs(inet->inet_sport); if (icsk->icsk_pending == ICSK_TIME_RETRANS || - icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT || - icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { + icsk->icsk_pending == ICSK_TIME_REO_TIMEOUT) { timer_active = 1; timer_expires = icsk->icsk_timeout; } else if (icsk->icsk_pending == ICSK_TIME_PROBE0) { timer_active = 4; timer_expires = icsk->icsk_timeout; + } else if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) { + timer_active = 1; + timer_expires = icsk->icsk_tlp_timer.expires; } else if 
(timer_pending(&sp->sk_timer)) { timer_active = 2; timer_expires = sp->sk_timer.expires; -- 2.21.0 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer 2019-10-22 23:10 ` [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer Cong Wang @ 2019-10-22 23:24 ` Eric Dumazet 2019-10-23 1:10 ` Cong Wang 0 siblings, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2019-10-22 23:24 UTC (permalink / raw) To: Cong Wang; +Cc: netdev, Yuchung Cheng, Neal Cardwell On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > Currently RTO, TLP and PROBE0 all share a same timer instance > in kernel and use icsk->icsk_pending to dispatch the work. > This causes spinlock contention when resetting the timer is > too frequent, as clearly shown in the perf report: > > 61.72% 61.71% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath > ... > - 58.83% tcp_v4_rcv > - 58.80% tcp_v4_do_rcv > - 58.80% tcp_rcv_established > - 52.88% __tcp_push_pending_frames > - 52.88% tcp_write_xmit > - 28.16% tcp_event_new_data_sent > - 28.15% sk_reset_timer > + mod_timer > - 24.68% tcp_schedule_loss_probe > - 24.68% sk_reset_timer > + 24.68% mod_timer > > This patch decouples TLP timer from RTO timer by adding a new > timer instance but still uses icsk->icsk_pending to dispatch, > in order to minimize the risk of this patch. > > After this patch, the CPU time spent in tcp_write_xmit() reduced > down to 10.92%. What is the exact benchmark you are running ? We never saw any contention like that, so lets make sure you are not working around another issue. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer 2019-10-22 23:24 ` Eric Dumazet @ 2019-10-23 1:10 ` Cong Wang 2019-10-23 2:15 ` Eric Dumazet 0 siblings, 1 reply; 17+ messages in thread From: Cong Wang @ 2019-10-23 1:10 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell On Tue, Oct 22, 2019 at 4:24 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > Currently RTO, TLP and PROBE0 all share a same timer instance > > in kernel and use icsk->icsk_pending to dispatch the work. > > This causes spinlock contention when resetting the timer is > > too frequent, as clearly shown in the perf report: > > > > 61.72% 61.71% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath > > ... > > - 58.83% tcp_v4_rcv > > - 58.80% tcp_v4_do_rcv > > - 58.80% tcp_rcv_established > > - 52.88% __tcp_push_pending_frames > > - 52.88% tcp_write_xmit > > - 28.16% tcp_event_new_data_sent > > - 28.15% sk_reset_timer > > + mod_timer > > - 24.68% tcp_schedule_loss_probe > > - 24.68% sk_reset_timer > > + 24.68% mod_timer > > > > This patch decouples TLP timer from RTO timer by adding a new > > timer instance but still uses icsk->icsk_pending to dispatch, > > in order to minimize the risk of this patch. > > > > After this patch, the CPU time spent in tcp_write_xmit() reduced > > down to 10.92%. > > What is the exact benchmark you are running ? > > We never saw any contention like that, so lets make sure you are not > working around another issue. I simply ran 256 parallel netperf with 128 CPU's to trigger this spinlock contention, 100% reproducible here. A single netperf TCP_RR could _also_ confirm the improvement: Before patch: $ netperf -H XXX -t TCP_RR -l 20 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to XXX () port 0 AF_INET : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. 
Send Recv Size Size Time Rate bytes Bytes bytes bytes secs. per sec 655360 873800 1 1 20.00 17665.59 655360 873800 After patch: $ netperf -H XXX -t TCP_RR -l 20 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to XXX () port 0 AF_INET : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. Send Recv Size Size Time Rate bytes Bytes bytes bytes secs. per sec 655360 873800 1 1 20.00 18829.31 655360 873800 (I have run it for multiple times, just pick a median one here.) The difference can also be observed by turning off/on TLP without patch. Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
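[Editorial note: the reproduction steps scattered through this exchange can be collected into one script. The host name and the perf options are placeholders; the sysctl values (3 = TLP enabled, 0 = disabled) are the ones used by both participants later in the thread:]

```shell
#!/bin/sh
# Sketch of the benchmark from this thread; HOST is a placeholder.
HOST=remote-host

# Toggle tail loss probes to compare (3 = enabled, 0 = disabled).
echo 3 > /proc/sys/net/ipv4/tcp_early_retrans

# Single-flow latency comparison, as in the TCP_RR runs above.
netperf -H "$HOST" -t TCP_RR -l 20

# 256 parallel streams on a many-CPU sender to expose the shared-timer
# lock contention, sampling with perf during the run.
perf record -a -g -- sleep 20 &
for i in $(seq 1 256); do
	netperf -H "$HOST" -t TCP_STREAM -l 20 >/dev/null &
done
wait
perf report --stdio | grep -m1 queued_spin_lock_slowpath
```

This requires root (for the sysctl and perf) and a netserver instance on the remote host.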
* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer 2019-10-23 1:10 ` Cong Wang @ 2019-10-23 2:15 ` Eric Dumazet 2019-10-23 17:40 ` Cong Wang 0 siblings, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2019-10-23 2:15 UTC (permalink / raw) To: Cong Wang; +Cc: netdev, Yuchung Cheng, Neal Cardwell On Tue, Oct 22, 2019 at 6:10 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Tue, Oct 22, 2019 at 4:24 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > Currently RTO, TLP and PROBE0 all share a same timer instance > > > in kernel and use icsk->icsk_pending to dispatch the work. > > > This causes spinlock contention when resetting the timer is > > > too frequent, as clearly shown in the perf report: > > > > > > 61.72% 61.71% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath > > > ... > > > - 58.83% tcp_v4_rcv > > > - 58.80% tcp_v4_do_rcv > > > - 58.80% tcp_rcv_established > > > - 52.88% __tcp_push_pending_frames > > > - 52.88% tcp_write_xmit > > > - 28.16% tcp_event_new_data_sent > > > - 28.15% sk_reset_timer > > > + mod_timer > > > - 24.68% tcp_schedule_loss_probe > > > - 24.68% sk_reset_timer > > > + 24.68% mod_timer > > > > > > This patch decouples TLP timer from RTO timer by adding a new > > > timer instance but still uses icsk->icsk_pending to dispatch, > > > in order to minimize the risk of this patch. > > > > > > After this patch, the CPU time spent in tcp_write_xmit() reduced > > > down to 10.92%. > > > > What is the exact benchmark you are running ? > > > > We never saw any contention like that, so lets make sure you are not > > working around another issue. > > I simply ran 256 parallel netperf with 128 CPU's to trigger this > spinlock contention, 100% reproducible here. How many TX/RX queues on the NIC ? What is the qdisc setup ? 
> > A single netperf TCP_RR could _also_ confirm the improvement: > > Before patch: > > $ netperf -H XXX -t TCP_RR -l 20 > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 > AF_INET to XXX () port 0 AF_INET : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. > Send Recv Size Size Time Rate > bytes Bytes bytes bytes secs. per sec > > 655360 873800 1 1 20.00 17665.59 > 655360 873800 > > > After patch: > > $ netperf -H XXX -t TCP_RR -l 20 > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 > AF_INET to XXX () port 0 AF_INET : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. > Send Recv Size Size Time Rate > bytes Bytes bytes bytes secs. per sec > > 655360 873800 1 1 20.00 18829.31 > 655360 873800 > > (I have run it for multiple times, just pick a median one here.) > > The difference can also be observed by turning off/on TLP without patch. OK thanks for using something I can repro easily :) I ran the experiment ten times : lpaa23:/export/hda3/google/edumazet# echo 3 >/proc/sys/net/ipv4/tcp_early_retrans lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done 26797 26850 25266 27605 26586 26341 27255 27532 26657 27253 Then disabled tlp, and got no obvious difference lpaa23:/export/hda3/google/edumazet# echo 0 >/proc/sys/net/ipv4/tcp_early_retrans lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done 25311 24658 27105 27421 27604 24649 26259 27615 27543 26217 I tried with 256 concurrent flows, and same overall observation about tlp not changing the numbers. 
(In fact I am not even sure we arm RTO at all while doing a TCP_RR) lpaa23:/export/hda3/google/edumazet# echo 3 >/proc/sys/net/ipv4/tcp_early_retrans lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do ./super_netperf 256 -H lpaa24 -t TCP_RR -l 20; done 1578682 1572444 1573490 1536378 1514905 1580854 1575949 1578925 1511164 1568213 lpaa23:/export/hda3/google/edumazet# echo 0 >/proc/sys/net/ipv4/tcp_early_retrans lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do ./super_netperf 256 -H lpaa24 -t TCP_RR -l 20; done 1576228 1578401 1577654 1579506 1570682 1582267 1550069 1530599 1583269 1578830 I wonder if you have some IRQ smp_affinity problem maybe, or some scheduler strategy constantly migrating your user threads ? TLP is quite subtle, having two timers instead of one is probably going to trigger various bugs. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer 2019-10-23 2:15 ` Eric Dumazet @ 2019-10-23 17:40 ` Cong Wang 2019-10-23 18:14 ` Eric Dumazet 0 siblings, 1 reply; 17+ messages in thread From: Cong Wang @ 2019-10-23 17:40 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell On Tue, Oct 22, 2019 at 7:15 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Oct 22, 2019 at 6:10 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > On Tue, Oct 22, 2019 at 4:24 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > > > Currently RTO, TLP and PROBE0 all share a same timer instance > > > > in kernel and use icsk->icsk_pending to dispatch the work. > > > > This causes spinlock contention when resetting the timer is > > > > too frequent, as clearly shown in the perf report: > > > > > > > > 61.72% 61.71% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath > > > > ... > > > > - 58.83% tcp_v4_rcv > > > > - 58.80% tcp_v4_do_rcv > > > > - 58.80% tcp_rcv_established > > > > - 52.88% __tcp_push_pending_frames > > > > - 52.88% tcp_write_xmit > > > > - 28.16% tcp_event_new_data_sent > > > > - 28.15% sk_reset_timer > > > > + mod_timer > > > > - 24.68% tcp_schedule_loss_probe > > > > - 24.68% sk_reset_timer > > > > + 24.68% mod_timer > > > > > > > > This patch decouples TLP timer from RTO timer by adding a new > > > > timer instance but still uses icsk->icsk_pending to dispatch, > > > > in order to minimize the risk of this patch. > > > > > > > > After this patch, the CPU time spent in tcp_write_xmit() reduced > > > > down to 10.92%. > > > > > > What is the exact benchmark you are running ? > > > > > > We never saw any contention like that, so lets make sure you are not > > > working around another issue. > > > > I simply ran 256 parallel netperf with 128 CPU's to trigger this > > spinlock contention, 100% reproducible here. 
> > How many TX/RX queues on the NIC ? 60 queues (default), 25Gbps NIC, mlx5. > What is the qdisc setup ? fq_codel, which is default here. Its parameters are default too. > > > > > A single netperf TCP_RR could _also_ confirm the improvement: > > > > Before patch: > > > > $ netperf -H XXX -t TCP_RR -l 20 > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 > > AF_INET to XXX () port 0 AF_INET : first burst 0 > > Local /Remote > > Socket Size Request Resp. Elapsed Trans. > > Send Recv Size Size Time Rate > > bytes Bytes bytes bytes secs. per sec > > > > 655360 873800 1 1 20.00 17665.59 > > 655360 873800 > > > > > > After patch: > > > > $ netperf -H XXX -t TCP_RR -l 20 > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 > > AF_INET to XXX () port 0 AF_INET : first burst 0 > > Local /Remote > > Socket Size Request Resp. Elapsed Trans. > > Send Recv Size Size Time Rate > > bytes Bytes bytes bytes secs. per sec > > > > 655360 873800 1 1 20.00 18829.31 > > 655360 873800 > > > > (I have run it for multiple times, just pick a median one here.) > > > > The difference can also be observed by turning off/on TLP without patch. > > OK thanks for using something I can repro easily :) > > I ran the experiment ten times : How many CPU's do you have? 
> > lpaa23:/export/hda3/google/edumazet# echo 3 > >/proc/sys/net/ipv4/tcp_early_retrans > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do > ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done > 26797 > 26850 > 25266 > 27605 > 26586 > 26341 > 27255 > 27532 > 26657 > 27253 > > > Then disabled tlp, and got no obvious difference > > lpaa23:/export/hda3/google/edumazet# echo 0 > >/proc/sys/net/ipv4/tcp_early_retrans > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do > ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done > 25311 > 24658 > 27105 > 27421 > 27604 > 24649 > 26259 > 27615 > 27543 > 26217 > > I tried with 256 concurrent flows, and same overall observation about > tlp not changing the numbers. > (In fact I am not even sure we arm RTO at all while doing a TCP_RR) In case you misunderstand, the CPU profiling I used is captured during 256 parallel TCP_STREAM. > lpaa23:/export/hda3/google/edumazet# echo 3 > >/proc/sys/net/ipv4/tcp_early_retrans > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do > ./super_netperf 256 -H lpaa24 -t TCP_RR -l 20; done > 1578682 > 1572444 > 1573490 > 1536378 > 1514905 > 1580854 > 1575949 > 1578925 > 1511164 > 1568213 > lpaa23:/export/hda3/google/edumazet# echo 0 > >/proc/sys/net/ipv4/tcp_early_retrans > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do > ./super_netperf 256 -H lpaa24 -t TCP_RR -l 20; done > 1576228 > 1578401 > 1577654 > 1579506 > 1570682 > 1582267 > 1550069 > 1530599 > 1583269 > 1578830 > > > I wonder if you have some IRQ smp_affinity problem maybe, or some > scheduler strategy constantly migrating your user threads ? Scheduler is default too. IRQ smp affinity is statically distributed to each of the first 60 CPU's. > > TLP is quite subtle, having two timers instead of one is probably > going to trigger various bugs. This is exactly why I keep icsk_pending. ;) Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer 2019-10-23 17:40 ` Cong Wang @ 2019-10-23 18:14 ` Eric Dumazet 2019-10-23 18:30 ` Cong Wang 0 siblings, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2019-10-23 18:14 UTC (permalink / raw) To: Cong Wang; +Cc: netdev, Yuchung Cheng, Neal Cardwell On Wed, Oct 23, 2019 at 10:40 AM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Tue, Oct 22, 2019 at 7:15 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, Oct 22, 2019 at 6:10 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > On Tue, Oct 22, 2019 at 4:24 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Tue, Oct 22, 2019 at 4:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > > > > > Currently RTO, TLP and PROBE0 all share a same timer instance > > > > > in kernel and use icsk->icsk_pending to dispatch the work. > > > > > This causes spinlock contention when resetting the timer is > > > > > too frequent, as clearly shown in the perf report: > > > > > > > > > > 61.72% 61.71% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath > > > > > ... > > > > > - 58.83% tcp_v4_rcv > > > > > - 58.80% tcp_v4_do_rcv > > > > > - 58.80% tcp_rcv_established > > > > > - 52.88% __tcp_push_pending_frames > > > > > - 52.88% tcp_write_xmit > > > > > - 28.16% tcp_event_new_data_sent > > > > > - 28.15% sk_reset_timer > > > > > + mod_timer > > > > > - 24.68% tcp_schedule_loss_probe > > > > > - 24.68% sk_reset_timer > > > > > + 24.68% mod_timer > > > > > > > > > > This patch decouples TLP timer from RTO timer by adding a new > > > > > timer instance but still uses icsk->icsk_pending to dispatch, > > > > > in order to minimize the risk of this patch. > > > > > > > > > > After this patch, the CPU time spent in tcp_write_xmit() reduced > > > > > down to 10.92%. > > > > > > > > What is the exact benchmark you are running ? 
> > > > > > > > We never saw any contention like that, so lets make sure you are not > > > > working around another issue. > > > > > > I simply ran 256 parallel netperf with 128 CPU's to trigger this > > > spinlock contention, 100% reproducible here. > > > > How many TX/RX queues on the NIC ? > > 60 queues (default), 25Gbps NIC, mlx5. > > > What is the qdisc setup ? > > fq_codel, which is default here. Its parameters are default too. > > > > > > > > > A single netperf TCP_RR could _also_ confirm the improvement: > > > > > > Before patch: > > > > > > $ netperf -H XXX -t TCP_RR -l 20 > > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 > > > AF_INET to XXX () port 0 AF_INET : first burst 0 > > > Local /Remote > > > Socket Size Request Resp. Elapsed Trans. > > > Send Recv Size Size Time Rate > > > bytes Bytes bytes bytes secs. per sec > > > > > > 655360 873800 1 1 20.00 17665.59 > > > 655360 873800 > > > > > > > > > After patch: > > > > > > $ netperf -H XXX -t TCP_RR -l 20 > > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 > > > AF_INET to XXX () port 0 AF_INET : first burst 0 > > > Local /Remote > > > Socket Size Request Resp. Elapsed Trans. > > > Send Recv Size Size Time Rate > > > bytes Bytes bytes bytes secs. per sec > > > > > > 655360 873800 1 1 20.00 18829.31 > > > 655360 873800 > > > > > > (I have run it for multiple times, just pick a median one here.) > > > > > > The difference can also be observed by turning off/on TLP without patch. > > > > OK thanks for using something I can repro easily :) > > > > I ran the experiment ten times : > > How many CPU's do you have? 
> > > > > > lpaa23:/export/hda3/google/edumazet# echo 3 > > >/proc/sys/net/ipv4/tcp_early_retrans > > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do > > ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done > > 26797 > > 26850 > > 25266 > > 27605 > > 26586 > > 26341 > > 27255 > > 27532 > > 26657 > > 27253 > > > > > > Then disabled tlp, and got no obvious difference > > > > lpaa23:/export/hda3/google/edumazet# echo 0 > > >/proc/sys/net/ipv4/tcp_early_retrans > > lpaa23:/export/hda3/google/edumazet# for f in {1..10}; do > > ./super_netperf 1 -H lpaa24 -t TCP_RR -l 20; done > > 25311 > > 24658 > > 27105 > > 27421 > > 27604 > > 24649 > > 26259 > > 27615 > > 27543 > > 26217 > > > > I tried with 256 concurrent flows, and same overall observation about > > tlp not changing the numbers. > > (In fact I am not even sure we arm RTO at all while doing a TCP_RR) > > In case you misunderstand, the CPU profiling I used is captured > during 256 parallel TCP_STREAM. When I asked you the workload, you gave me TCP_RR output, not TCP_STREAM :/ "A single netperf TCP_RR could _also_ confirm the improvement:" ^ permalink raw reply [flat|nested] 17+ messages in thread
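By contrast, the single-flow TCP_RR numbers Cong quoted earlier in this exchange (17665.59 transactions/sec before the patch, 18829.31 after, each the median of several runs) work out to a gain of roughly 6.6%. A quick check, for illustration only:

```python
# Single-flow TCP_RR transactions/sec quoted earlier in the thread
before = 17665.59  # unpatched kernel
after = 18829.31   # with the TLP timer decoupled from the RTO timer

gain_pct = 100.0 * (after - before) / before
print(round(gain_pct, 1))  # 6.6
```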
* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer 2019-10-23 18:14 ` Eric Dumazet @ 2019-10-23 18:30 ` Cong Wang 2019-10-23 18:55 ` Eric Dumazet 0 siblings, 1 reply; 17+ messages in thread From: Cong Wang @ 2019-10-23 18:30 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev, Yuchung Cheng, Neal Cardwell On Wed, Oct 23, 2019 at 11:14 AM Eric Dumazet <edumazet@google.com> wrote: > > In case you misunderstand, the CPU profiling I used is captured > > during 256 parallel TCP_STREAM. > > When I asked you the workload, you gave me TCP_RR output, not TCP_STREAM :/ > > "A single netperf TCP_RR could _also_ confirm the improvement:" I guess you didn't understand what "also" means? The improvement can be measured with both TCP_STREAM and TCP_RR; only the CPU profiling is done with TCP_STREAM. BTW, I just tested an unpatched kernel on a machine with 64 CPU's, and turning on/off TLP makes no difference there, so this is likely related to the number of CPU's or hardware configurations. This explains why you can't reproduce it on your side; so far I can only reproduce it on one particular hardware platform too, but it is real. If you need any other information, please let me know. Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer 2019-10-23 18:30 ` Cong Wang @ 2019-10-23 18:55 ` Eric Dumazet 0 siblings, 0 replies; 17+ messages in thread From: Eric Dumazet @ 2019-10-23 18:55 UTC (permalink / raw) To: Cong Wang; +Cc: netdev, Yuchung Cheng, Neal Cardwell On Wed, Oct 23, 2019 at 11:30 AM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Wed, Oct 23, 2019 at 11:14 AM Eric Dumazet <edumazet@google.com> wrote: > > > In case you misunderstand, the CPU profiling I used is captured > > > during 256 parallel TCP_STREAM. > > > > When I asked you the workload, you gave me TCP_RR output, not TCP_STREAM :/ > > > > "A single netperf TCP_RR could _also_ confirm the improvement:" > > I guess you didn't understand what "also" mean? The improvement > can be measured with both TCP_STREAM and TCP_RR, only the > CPU profiling is done with TCP_STREAM. > Except that I could not measure any gain with TCP_RR, which is expected, since TCP_RR will not use RTO and TLP at the same time. If you found that we were setting both RTO and TLP when sending 1-byte message, we need to fix the stack, instead of working around it. > BTW, I just tested an unpatched kernel on a machine with 64 CPU's, > turning on/off TLP makes no difference there, so this is likely related > to the number of CPU's or hardware configurations. This explains > why you can't reproduce it on your side, so far I can only reproduce > it on one particular hardware platform too, but it is real. > I have hosts with 112 cpus, I can try on them, but it will take some time. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer 2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang ` (2 preceding siblings ...) 2019-10-22 23:10 ` [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer Cong Wang @ 2019-10-28 18:29 ` David Miller 2019-10-28 18:34 ` Eric Dumazet 2019-10-30 21:56 ` David Miller 4 siblings, 1 reply; 17+ messages in thread From: David Miller @ 2019-10-28 18:29 UTC (permalink / raw) To: xiyou.wangcong; +Cc: netdev, ycheng, ncardwell, edumazet From: Cong Wang <xiyou.wangcong@gmail.com> Date: Tue, 22 Oct 2019 16:10:48 -0700 > This patchset contains 3 patches: patch 1 is a cleanup, > patch 2 is a small change preparing for patch 3, patch 3 is the > one does the actual change. Please find details in each of them. Eric, have you had a chance to test this on a system with suitable CPU arity? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer 2019-10-28 18:29 ` [Patch net-next 0/3] " David Miller @ 2019-10-28 18:34 ` Eric Dumazet 2019-10-28 20:13 ` Cong Wang 0 siblings, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2019-10-28 18:34 UTC (permalink / raw) To: David Miller; +Cc: Cong Wang, netdev, Yuchung Cheng, Neal Cardwell On Mon, Oct 28, 2019 at 11:29 AM David Miller <davem@davemloft.net> wrote: > > From: Cong Wang <xiyou.wangcong@gmail.com> > Date: Tue, 22 Oct 2019 16:10:48 -0700 > > > This patchset contains 3 patches: patch 1 is a cleanup, > > patch 2 is a small change preparing for patch 3, patch 3 is the > > one does the actual change. Please find details in each of them. > > Eric, have you had a chance to test this on a system with > suitable CPU arity? Yes, and I confirm I could not repro the issues at all. I got a 100Gbit NIC, trying to increase the pressure a bit, and driving this NIC at line rate was only using 2% of my 96 cpus host, no spinlock contention of any sort. Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer 2019-10-28 18:34 ` Eric Dumazet @ 2019-10-28 20:13 ` Cong Wang 2019-10-28 20:31 ` Eric Dumazet 0 siblings, 1 reply; 17+ messages in thread From: Cong Wang @ 2019-10-28 20:13 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell On Mon, Oct 28, 2019 at 11:34 AM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Oct 28, 2019 at 11:29 AM David Miller <davem@davemloft.net> wrote: > > > > From: Cong Wang <xiyou.wangcong@gmail.com> > > Date: Tue, 22 Oct 2019 16:10:48 -0700 > > > > > This patchset contains 3 patches: patch 1 is a cleanup, > > > patch 2 is a small change preparing for patch 3, patch 3 is the > > > one does the actual change. Please find details in each of them. > > > > Eric, have you had a chance to test this on a system with > > suitable CPU arity? > > Yes, and I confirm I could not repro the issues at all. > > I got a 100Gbit NIC, trying to increase the pressure a bit, and > driving this NIC at line rate was only using 2% of my 96 cpus host, > no spinlock contention of any sort. Please let me know if there is anything else I can provide to help you to make the decision. All I can say so far is this only happens on our hosts with 128 AMD CPU's. I don't see anything here related to AMD, so I think only the number of CPU's (vs. number of TX queues?) matters. Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer 2019-10-28 20:13 ` Cong Wang @ 2019-10-28 20:31 ` Eric Dumazet 2019-10-29 0:49 ` Cong Wang 0 siblings, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2019-10-28 20:31 UTC (permalink / raw) To: Cong Wang; +Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell On Mon, Oct 28, 2019 at 1:13 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Mon, Oct 28, 2019 at 11:34 AM Eric Dumazet <edumazet@google.com> wrote: > > > > On Mon, Oct 28, 2019 at 11:29 AM David Miller <davem@davemloft.net> wrote: > > > > > > From: Cong Wang <xiyou.wangcong@gmail.com> > > > Date: Tue, 22 Oct 2019 16:10:48 -0700 > > > > > > > This patchset contains 3 patches: patch 1 is a cleanup, > > > > patch 2 is a small change preparing for patch 3, patch 3 is the > > > > one does the actual change. Please find details in each of them. > > > > > > Eric, have you had a chance to test this on a system with > > > suitable CPU arity? > > > > Yes, and I confirm I could not repro the issues at all. > > > > I got a 100Gbit NIC, trying to increase the pressure a bit, and > > driving this NIC at line rate was only using 2% of my 96 cpus host, > > no spinlock contention of any sort. > > Please let me know if there is anything else I can provide to help > you to make the decision. > > All I can say so far is this only happens on our hosts with 128 > AMD CPU's. I don't see anything here related to AMD, so I think > only the number of CPU's (vs. number of TX queues?) matters. > I also have AMD hosts with 256 cpus, I can try them later (not today, I am too busy) But I feel you are trying to work around a more fundamental issue if this problem only shows up on AMD hosts. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer 2019-10-28 20:31 ` Eric Dumazet @ 2019-10-29 0:49 ` Cong Wang 0 siblings, 0 replies; 17+ messages in thread From: Cong Wang @ 2019-10-29 0:49 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell On Mon, Oct 28, 2019 at 1:31 PM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Oct 28, 2019 at 1:13 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > On Mon, Oct 28, 2019 at 11:34 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Mon, Oct 28, 2019 at 11:29 AM David Miller <davem@davemloft.net> wrote: > > > > > > > > From: Cong Wang <xiyou.wangcong@gmail.com> > > > > Date: Tue, 22 Oct 2019 16:10:48 -0700 > > > > > > > > > This patchset contains 3 patches: patch 1 is a cleanup, > > > > > patch 2 is a small change preparing for patch 3, patch 3 is the > > > > > one does the actual change. Please find details in each of them. > > > > > > > > Eric, have you had a chance to test this on a system with > > > > suitable CPU arity? > > > > > > Yes, and I confirm I could not repro the issues at all. > > > > > > I got a 100Gbit NIC, trying to increase the pressure a bit, and > > > driving this NIC at line rate was only using 2% of my 96 cpus host, > > > no spinlock contention of any sort. > > > > Please let me know if there is anything else I can provide to help > > you to make the decision. > > > > All I can say so far is this only happens on our hosts with 128 > > AMD CPU's. I don't see anything here related to AMD, so I think > > only the number of CPU's (vs. number of TX queues?) matters. > > > > I also have AMD hosts with 256 cpus, I can try them later (not today, > I am too busy) > > But I feel you are trying to work around a more fundamental issue if > this problem only shows up on AMD hosts. I wish I have Intel hosts with the same number of CPU's, but I don't, all Intel ones have less, probably 80 at max. 
This is why I think it is related to the number of CPU's. Also, IOMMU is turned off explicitly, and I don't see anything else that could be AMD-specific along the TCP path. Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer 2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang ` (3 preceding siblings ...) 2019-10-28 18:29 ` [Patch net-next 0/3] " David Miller @ 2019-10-30 21:56 ` David Miller 4 siblings, 0 replies; 17+ messages in thread From: David Miller @ 2019-10-30 21:56 UTC (permalink / raw) To: xiyou.wangcong; +Cc: netdev, ycheng, ncardwell, edumazet From: Cong Wang <xiyou.wangcong@gmail.com> Date: Tue, 22 Oct 2019 16:10:48 -0700 > This patchset contains 3 patches: patch 1 is a cleanup, > patch 2 is a small change preparing for patch 3, patch 3 is the > one does the actual change. Please find details in each of them. I'm marking this deferred until someone can drill down why this is only seen in such a specific configuration, and not to ANY EXTENT whatsoever with just a slightly lower number of CPUs on other machines. It's really hard to justify this set of changes without a full understanding and detailed analysis. Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2019-10-30 21:56 UTC | newest] Thread overview: 17+ messages 2019-10-22 23:10 [Patch net-next 0/3] tcp: decouple TLP timer from RTO timer Cong Wang 2019-10-22 23:10 ` [Patch net-next 1/3] tcp: get rid of ICSK_TIME_EARLY_RETRANS Cong Wang 2019-10-22 23:10 ` [Patch net-next 2/3] tcp: make tcp_send_loss_probe() boolean Cong Wang 2019-10-22 23:10 ` [Patch net-next 3/3] tcp: decouple TLP timer from RTO timer Cong Wang 2019-10-22 23:24 ` Eric Dumazet 2019-10-23 1:10 ` Cong Wang 2019-10-23 2:15 ` Eric Dumazet 2019-10-23 17:40 ` Cong Wang 2019-10-23 18:14 ` Eric Dumazet 2019-10-23 18:30 ` Cong Wang 2019-10-23 18:55 ` Eric Dumazet 2019-10-28 18:29 ` [Patch net-next 0/3] " David Miller 2019-10-28 18:34 ` Eric Dumazet 2019-10-28 20:13 ` Cong Wang 2019-10-28 20:31 ` Eric Dumazet 2019-10-29 0:49 ` Cong Wang 2019-10-30 21:56 ` David Miller