netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
@ 2011-03-17  8:06 H.K. Jerry Chu
  2011-03-17  8:42 ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: H.K. Jerry Chu @ 2011-03-17  8:06 UTC (permalink / raw)
  To: netdev; +Cc: Jerry Chu

From: Jerry Chu <hkchu@google.com>

This patch add a number of very useful counters/stats (defined in
tcp_stats.h) to help diagnosing TCP related problems.

create_time     - when the connection was created (in jiffies)
total_inbytes   - total inbytes as consumed by the receiving apps.
total_outbytes  - total outbytes sent down from the transmitting apps.

total_outdatasegs - total data carrying segments sent so far, including
		retransmitted ones.

total_xmit      - total accumulated time (usecs) when the connection
		has something to send.

total_retrans_time - total time (usecs, accumulated) the connection
		spends trying to recover lost packets. For each
		loss event the time is measured from the lost packet
		was first sent till the retransmitted packet was
		eventually ack'ed.

total_cwnd_limit - total time (usecs, excluding time spent on loss
    		recovery) the xmit is stopped due to cwnd limited

total_swnd_limit - total time (usecs) theconnection is swnd limited

The following two counters are for listeners only:

accepted_reqs   - total # of accepted connection requests.
listen_drops    - total # of dropped SYN reqs (SYN cookies excluded) due
    		to listener's queue overflow.

total_retrans_time/total_retrans ratio gives a rough picture of how
quickly in average the connection can recover from a pkt loss. E.g.,
when the network is more congested, or the traffic contains mainly
smaller RPC where tail drop often requires RTO to recover,
the total_retrans_time/total_retrans ratio tends to be higher.

Currently the new counters/stats are exported through /proc/net/tcp.
Some simple, abbreviated field names have been added to the output of
/proc/net/tcp in order to allow backward/forward compatibility in the
future. Obviously the new counters/stats can also be easily exported
through other APIs.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
---
 include/linux/ktime.h    |    3 ++
 include/linux/tcp.h      |    1 +
 include/net/tcp_stats.h  |   65 ++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c           |   30 ++++++++++++++++++---
 net/ipv4/tcp_input.c     |   13 +++++++++
 net/ipv4/tcp_ipv4.c      |   41 ++++++++++++++++++++++++++---
 net/ipv4/tcp_minisocks.c |    9 ++++++
 net/ipv4/tcp_output.c    |   47 +++++++++++++++++++++++++++++++--
 net/ipv6/tcp_ipv6.c      |    8 +++++
 9 files changed, 206 insertions(+), 11 deletions(-)
 create mode 100644 include/net/tcp_stats.h

diff --git a/include/linux/ktime.h b/include/linux/ktime.h
index e1ceaa9..e60e758 100644
--- a/include/linux/ktime.h
+++ b/include/linux/ktime.h
@@ -333,6 +333,9 @@ extern void ktime_get_ts(struct timespec *ts);
 /* Get the real (wall-) time in timespec format: */
 #define ktime_get_real_ts(ts)	getnstimeofday(ts)
 
+#define ktime_since(a)		ktime_to_us(ktime_sub(ktime_get(), (a)))
+#define ktime_zero(a)		ktime_equal((a), ktime_set(0, 0))
+
 static inline ktime_t ns_to_ktime(u64 ns)
 {
 	static const ktime_t ktime_zero = { .tv64 = 0 };
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index e64f4c6..ea5cb5d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -460,6 +460,7 @@ struct tcp_sock {
 	 * contains related tcp_cookie_transactions fields.
 	 */
 	struct tcp_cookie_values  *cookie_values;
+	struct tcp_stats	*conn_stats;
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
diff --git a/include/net/tcp_stats.h b/include/net/tcp_stats.h
new file mode 100644
index 0000000..f17cdac
--- /dev/null
+++ b/include/net/tcp_stats.h
@@ -0,0 +1,65 @@
+#ifndef	_TCP_STATS_H_
+#define	_TCP_STATS_H_
+
+#include <linux/types.h>
+#include <linux/ktime.h>
+
+/*
+ * The following is a list of per connection counters/stats:
+ *
+ * create_time - when the connection was created (in jiffies)
+ * total_inbytes - total inbytes as consumed by the receiving apps.
+ * total_outbytes - total outbytes sent down from the transmitting apps.
+ * total_outdatasegs - total data carrying segments sent so far,
+ *		including retransmitted ones.
+ * total_xmit - total accumulated time (usecs) when the connection has
+		something to send.
+ * total_retrans_time - total time (usecs, accumulated) the connection
+ *		spends trying to recover lost packets. For each	loss
+ *		event the time is measured from the lost packet was
+ *		first sent till the retransmitted packet was eventually
+ *		ack'ed.
+ * total_cwnd_limit - total time (usecs, excluding time spent on loss
+ *		recovery) the xmit is stopped due to cwnd limited
+ * total_swnd_limit - total time (usecs) theconnection is swnd limited
+ *
+ * The following two counters are for listeners only:
+ * accepted_reqs - total # of accepted connection requests.
+ * listen_drops - total # of dropped SYN reqs (SYN cookies excluded) due
+ *		to listener's queue overflow.
+ */
+
+#define	create_time		conn_stats->ts_create_time
+#define	total_inbytes		conn_stats->ts_total_inbytes
+#define	total_outbytes		conn_stats->ts_total_outbytes
+#define	total_outdatasegs	conn_stats->ts_total_outdatasegs
+#define	start_xmit		conn_stats->ts_start_xmit
+#define	total_xmit		conn_stats->ts_total_xmit
+#define	start_cwnd_limit	conn_stats->ts_start_cwnd_limit
+#define	total_cwnd_limit	conn_stats->ts_total_cwnd_limit
+#define	start_swnd_limit	conn_stats->ts_start_swnd_limit
+#define	total_swnd_limit	conn_stats->ts_total_swnd_limit
+#define	start_retrans		conn_stats->ts_start_retrans
+#define	total_retrans_time	conn_stats->ts_total_retrans_time
+#define	accepted_reqs		conn_stats->ts_total_inbytes
+#define	listen_drops		conn_stats->ts_total_outbytes
+/* # of dropped SYN reqs (SYN cookies excluded) due to q/q0 overflow */
+
+struct tcp_stats {
+	u64	ts_create_time;		/* in jiffies */
+	u64	ts_total_inbytes;
+	u64	ts_total_outbytes;
+	u64	ts_total_outdatasegs;
+	ktime_t	ts_start_xmit;
+	u64	ts_total_xmit;		/* in usecs */
+	ktime_t	ts_start_swnd_limit;
+	u64	ts_total_swnd_limit;	/* in usecs */
+	ktime_t	ts_start_cwnd_limit;
+	u64	ts_total_cwnd_limit;	/* in usecs */
+	ktime_t	ts_start_retrans;
+	u64	ts_total_retrans_time;	/* in usecs */
+};
+
+extern struct kmem_cache *tcp_statsp;
+
+#endif /* _TCP_STATS_H_ */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b22d450..9d0d582 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -273,6 +273,7 @@
 #include <net/ip.h>
 #include <net/netdma.h>
 #include <net/sock.h>
+#include <net/tcp_stats.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -317,6 +318,10 @@ struct tcp_splice_state {
 int tcp_memory_pressure __read_mostly;
 EXPORT_SYMBOL(tcp_memory_pressure);
 
+/* Memory cache for TCP stats structure */
+struct kmem_cache *tcp_statsp __read_mostly;
+EXPORT_SYMBOL(tcp_statsp);
+
 void tcp_enter_memory_pressure(struct sock *sk)
 {
 	if (!tcp_memory_pressure) {
@@ -543,6 +548,12 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
 	tcb->flags   = TCPHDR_ACK;
 	tcb->sacked  = 0;
 	skb_header_release(skb);
+
+	if (sk->sk_write_queue.qlen == 0) {
+		WARN_ON(!ktime_zero(tp->start_xmit));
+		tp->start_xmit = ktime_get();
+	}
+
 	tcp_add_write_queue_tail(sk, skb);
 	sk->sk_wmem_queued += skb->truesize;
 	sk_mem_charge(sk, skb->truesize);
@@ -860,8 +871,10 @@ wait_for_memory:
 	}
 
 out:
-	if (copied)
+	if (copied) {
 		tcp_push(sk, flags, mss_now, tp->nonagle);
+		tp->total_outbytes += copied;
+	}
 	return copied;
 
 do_error:
@@ -1108,8 +1121,10 @@ wait_for_memory:
 	}
 
 out:
-	if (copied)
+	if (copied) {
 		tcp_push(sk, flags, mss_now, tp->nonagle);
+		tp->total_outbytes += copied;
+	}
 	release_sock(sk);
 	return copied;
 
@@ -1387,8 +1402,10 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	tcp_rcv_space_adjust(sk);
 
 	/* Clean up data we have read: This will do ACK frames. */
-	if (copied > 0)
+	if (copied > 0) {
+		tp->total_inbytes += copied;
 		tcp_cleanup_rbuf(sk, copied);
+	}
 	return copied;
 }
 EXPORT_SYMBOL(tcp_read_sock);
@@ -1768,7 +1785,8 @@ skip_copy:
 
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
-
+	if (copied > 0)
+		tp->total_inbytes += copied;
 	release_sock(sk);
 	return copied;
 
@@ -2101,6 +2119,8 @@ int tcp_disconnect(struct sock *sk, int flags)
 	inet_csk_delack_init(sk);
 	tcp_init_send_head(sk);
 	memset(&tp->rx_opt, 0, sizeof(tp->rx_opt));
+	memset(tp->conn_stats, 0, sizeof(struct tcp_stats));
+	tp->create_time = get_jiffies_64();
 	__sk_dst_reset(sk);
 
 	WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);
@@ -3314,4 +3334,6 @@ void __init tcp_init(void)
 	tcp_secret_primary = &tcp_secret_one;
 	tcp_secret_retiring = &tcp_secret_two;
 	tcp_secret_secondary = &tcp_secret_two;
+	tcp_statsp = kmem_cache_create("tcp_stats_cache",
+		sizeof(struct tcp_stats), 0,  SLAB_HWCACHE_ALIGN, NULL);
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index da782e7..fa3d8e4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -72,6 +72,7 @@
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <net/netdma.h>
+#include <net/tcp_stats.h>
 
 int sysctl_tcp_timestamps __read_mostly = 1;
 int sysctl_tcp_window_scaling __read_mostly = 1;
@@ -3688,9 +3689,21 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	if (!prior_packets)
 		goto no_queue;
 
+	if (!ktime_zero(tp->start_retrans) &&
+	    !before(tp->snd_una, tp->high_seq)) {
+		tp->total_retrans_time += ktime_since(tp->start_retrans);
+		tp->start_retrans = net_invalid_timestamp();
+	}
+
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
 
+	if (sk->sk_write_queue.qlen == 0 &&
+	    !ktime_zero(tp->start_xmit)) {
+		tp->total_xmit += ktime_since(tp->start_xmit);
+		tp->start_xmit = net_invalid_timestamp();
+	}
+
 	if (tp->frto_counter)
 		frto_cwnd = tcp_process_frto(sk, flag);
 	/* Guarantee sacktag reordering detection against wrap-arounds */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f7e6c2c..bb14df2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -72,6 +72,7 @@
 #include <net/timewait_sock.h>
 #include <net/xfrm.h>
 #include <net/netdma.h>
+#include <net/tcp_stats.h>
 
 #include <linux/inet.h>
 #include <linux/ipv6.h>
@@ -1391,6 +1392,8 @@ drop_and_release:
 drop_and_free:
 	reqsk_free(req);
 drop:
+	if (!want_cookie)
+		tp->listen_drops++;
 	return 0;
 }
 EXPORT_SYMBOL(tcp_v4_conn_request);
@@ -1582,6 +1585,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 				rsk = nsk;
 				goto reset;
 			}
+			tcp_sk(sk)->accepted_reqs++;
 			return 0;
 		}
 	} else
@@ -1830,6 +1834,13 @@ static int tcp_v4_init_sock(struct sock *sk)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	tp->conn_stats = kmem_cache_alloc(tcp_statsp, GFP_KERNEL);
+	if (tp->conn_stats == NULL)
+		return -1;
+
+	memset(tp->conn_stats, 0, sizeof(struct tcp_stats));
+	tp->create_time = get_jiffies_64();
+
 	skb_queue_head_init(&tp->out_of_order_queue);
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
@@ -1937,7 +1948,10 @@ void tcp_v4_destroy_sock(struct sock *sk)
 			 tcp_cookie_values_release);
 		tp->cookie_values = NULL;
 	}
-
+	if (tp->conn_stats) {
+		kmem_cache_free(tcp_statsp, tp->conn_stats);
+		tp->conn_stats = NULL;
+	}
 	percpu_counter_dec(&tcp_sockets_allocated);
 }
 EXPORT_SYMBOL(tcp_v4_destroy_sock);
@@ -2390,6 +2404,8 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
 	__u16 destp = ntohs(inet->inet_dport);
 	__u16 srcp = ntohs(inet->inet_sport);
 	int rx_queue;
+	int len1;
+	unsigned long elapsed;
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS) {
 		timer_active	= 1;
@@ -2413,8 +2429,10 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
 		 */
 		rx_queue = max_t(int, tp->rcv_nxt - tp->copied_seq, 0);
 
+	elapsed = (unsigned long)(get_jiffies_64() - tp->create_time);
+
 	seq_printf(f, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
-			"%08X %5d %8d %lu %d %p %lu %lu %u %u %d%n",
+			"%08X %5d %8d %lu %d %p %lu %lu %u %u %d %u %n",
 		i, src, srcp, dest, destp, sk->sk_state,
 		tp->write_seq - tp->snd_una,
 		rx_queue,
@@ -2430,7 +2448,20 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
 		(icsk->icsk_ack.quick << 1) | icsk->icsk_ack.pingpong,
 		tp->snd_cwnd,
 		tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh,
+		jiffies_to_msecs(elapsed)/1000,
 		len);
+
+	if (sk->sk_state == TCP_LISTEN)
+		seq_printf(f, "%llu %llu%n", tp->accepted_reqs,
+			tp->listen_drops, &len1);
+	else
+		seq_printf(f, "%llu %llu %llu %llu %llu/%u %llu %llu%n",
+			tp->total_inbytes, tp->total_outbytes,
+			tp->total_outdatasegs, tp->total_xmit/1000,
+			tp->total_retrans_time/1000, tp->total_retrans,
+			tp->total_swnd_limit/1000, tp->total_cwnd_limit/1000,
+			&len1);
+	*len += len1;
 }
 
 static void get_timewait4_sock(struct inet_timewait_sock *tw,
@@ -2455,7 +2486,7 @@ static void get_timewait4_sock(struct inet_timewait_sock *tw,
 		atomic_read(&tw->tw_refcnt), tw, len);
 }
 
-#define TMPSZ 150
+#define TMPSZ 160
 
 static int tcp4_seq_show(struct seq_file *seq, void *v)
 {
@@ -2466,7 +2497,9 @@ static int tcp4_seq_show(struct seq_file *seq, void *v)
 		seq_printf(seq, "%-*s\n", TMPSZ - 1,
 			   "  sl  local_address rem_address   st tx_queue "
 			   "rx_queue tr tm->when retrnsmt   uid  timeout "
-			   "inode");
+			   "inode - "
+			   "elps ib ob odsg xmt_bz rtrns_tm/rtrns "
+			   "swn_lmt cwn_lmt");
 		goto out;
 	}
 	st = seq->private;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 80b1f80..abc6e23 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -26,6 +26,7 @@
 #include <net/tcp.h>
 #include <net/inet_common.h>
 #include <net/xfrm.h>
+#include <net/tcp_stats.h>
 
 int sysctl_tcp_syncookies __read_mostly = 1;
 EXPORT_SYMBOL(sysctl_tcp_syncookies);
@@ -433,6 +434,14 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		struct tcp_sock *oldtp = tcp_sk(sk);
 		struct tcp_cookie_values *oldcvp = oldtp->cookie_values;
 
+		newtp->conn_stats = kmem_cache_alloc(tcp_statsp, GFP_ATOMIC);
+		if (newtp->conn_stats == NULL) {
+			sock_put(newsk);
+			return NULL;
+		}
+		memset(newtp->conn_stats, 0, sizeof(struct tcp_stats));
+		newtp->create_time = get_jiffies_64();
+
 		/* TCP Cookie Transactions require space for the cookie pair,
 		 * as it differs for each connection.  There is no need to
 		 * copy any s_data_payload stored at the original socket.
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dfa5beb..ec846d3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -35,6 +35,7 @@
  */
 
 #include <net/tcp.h>
+#include <net/tcp_stats.h>
 
 #include <linux/compiler.h>
 #include <linux/gfp.h>
@@ -826,6 +827,21 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	tcb = TCP_SKB_CB(skb);
 	memset(&opts, 0, sizeof(opts));
 
+	if (!ktime_zero(tp->start_cwnd_limit) ||
+	    !ktime_zero(tp->start_swnd_limit)) {
+		/* update total_xmit periodically */
+		WARN_ON(ktime_zero(tp->start_xmit));
+		tp->total_xmit += ktime_since(tp->start_xmit);
+		tp->start_xmit = ktime_get();
+	}
+	if (!ktime_zero(tp->start_cwnd_limit)) {
+		tp->total_cwnd_limit += ktime_since(tp->start_cwnd_limit);
+		tp->start_cwnd_limit = net_invalid_timestamp();
+	} else if (!ktime_zero(tp->start_swnd_limit)) {
+		tp->total_swnd_limit += ktime_since(tp->start_swnd_limit);
+		tp->start_swnd_limit = net_invalid_timestamp();
+	}
+
 	if (unlikely(tcb->flags & TCPHDR_SYN))
 		tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
 	else
@@ -892,8 +908,10 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	if (likely(tcb->flags & TCPHDR_ACK))
 		tcp_event_ack_sent(sk, tcp_skb_pcount(skb));
 
-	if (skb->len != tcp_header_size)
+	if (skb->len != tcp_header_size) {
 		tcp_event_data_sent(tp, skb, sk);
+		tp->total_outdatasegs += tcp_skb_pcount(skb);
+	}
 
 	if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq)
 		TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS,
@@ -903,6 +921,9 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	if (likely(err <= 0))
 		return err;
 
+	if (ktime_zero(tp->start_retrans))
+		tp->start_retrans = ktime_get();
+
 	tcp_enter_cwr(sk, 1);
 
 	return net_xmit_eval(err);
@@ -1759,11 +1780,19 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		BUG_ON(!tso_segs);
 
 		cwnd_quota = tcp_cwnd_test(tp, skb);
-		if (!cwnd_quota)
+		if (!cwnd_quota) {
+			if (ktime_zero(tp->start_cwnd_limit) &&
+			    inet_csk(sk)->icsk_ca_state == TCP_CA_Open)
+				tp->start_cwnd_limit = ktime_get();
 			break;
+		}
 
-		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
+		if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
+			if (ktime_zero(tp->start_swnd_limit) &&
+			    inet_csk(sk)->icsk_ca_state == TCP_CA_Open)
+				tp->start_swnd_limit = ktime_get();
 			break;
+		}
 
 		if (tso_segs == 1) {
 			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
@@ -2134,6 +2163,18 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 		}
 	}
 
+	if (ktime_zero(tp->start_retrans)) {
+		unsigned int since_last = jiffies_to_usecs(tcp_time_stamp
+		    - TCP_SKB_CB(skb)->when);
+
+		tp->start_cwnd_limit = net_invalid_timestamp();
+		tp->start_swnd_limit = net_invalid_timestamp();
+		/* Stop counting cwnd/swnd_limit when we get into retrans */
+		tp->start_retrans = ktime_get();
+		tp->start_retrans = ktime_sub_us(tp->start_retrans,
+		    since_last);
+	}
+
 	/* Make a copy, if the first transmission SKB clone we made
 	 * is still in somebody's hands, else make a clone.
 	 */
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2b0c186..30db33a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -61,6 +61,7 @@
 #include <net/timewait_sock.h>
 #include <net/netdma.h>
 #include <net/inet_common.h>
+#include <net/tcp_stats.h>
 
 #include <asm/uaccess.h>
 
@@ -1950,6 +1951,13 @@ static int tcp_v6_init_sock(struct sock *sk)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	tp->conn_stats = kmem_cache_alloc(tcp_statsp, GFP_KERNEL);
+	if (tp->conn_stats == NULL)
+		return -1;
+
+	memset(tp->conn_stats, 0, sizeof(struct tcp_stats));
+	tp->create_time = get_jiffies_64();
+
 	skb_queue_head_init(&tp->out_of_order_queue);
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-17  8:06 [PATCH] Add useful per-connection TCP stats for diagnosis purpose H.K. Jerry Chu
@ 2011-03-17  8:42 ` Eric Dumazet
  2011-03-17 20:16   ` Jerry Chu
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2011-03-17  8:42 UTC (permalink / raw)
  To: H.K. Jerry Chu; +Cc: netdev

Le jeudi 17 mars 2011 à 01:06 -0700, H.K. Jerry Chu a écrit :
> From: Jerry Chu <hkchu@google.com>
> 
> This patch add a number of very useful counters/stats (defined in
> tcp_stats.h) to help diagnosing TCP related problems.
> 
> create_time     - when the connection was created (in jiffies)
> total_inbytes   - total inbytes as consumed by the receiving apps.
> total_outbytes  - total outbytes sent down from the transmitting apps.
> 
> total_outdatasegs - total data carrying segments sent so far, including
> 		retransmitted ones.
> 
> total_xmit      - total accumulated time (usecs) when the connection
> 		has something to send.
> 
> total_retrans_time - total time (usecs, accumulated) the connection
> 		spends trying to recover lost packets. For each
> 		loss event the time is measured from the lost packet
> 		was first sent till the retransmitted packet was
> 		eventually ack'ed.
> 
> total_cwnd_limit - total time (usecs, excluding time spent on loss
>     		recovery) the xmit is stopped due to cwnd limited
> 
> total_swnd_limit - total time (usecs) theconnection is swnd limited
> 
> The following two counters are for listeners only:
> 
> accepted_reqs   - total # of accepted connection requests.
> listen_drops    - total # of dropped SYN reqs (SYN cookies excluded) due
>     		to listener's queue overflow.
> 
> total_retrans_time/total_retrans ratio gives a rough picture of how
> quickly in average the connection can recover from a pkt loss. E.g.,
> when the network is more congested, or the traffic contains mainly
> smaller RPC where tail drop often requires RTO to recover,
> the total_retrans_time/total_retrans ratio tends to be higher.
> 
> Currently the new counters/stats are exported through /proc/net/tcp.

Please dont. Use iproute2 instead.

> Some simple, abbreviated field names have been added to the output of
> /proc/net/tcp in order to allow backward/forward compatibility in the
> future. Obviously the new counters/stats can also be easily exported
> through other APIs.
> 

/proc/net/tcp is legacy. You should touch it eventually, but after
"other APIS" are done. It was the old way (quick but a bit ugly)



> Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
> ---
>  include/linux/ktime.h    |    3 ++
>  include/linux/tcp.h      |    1 +
>  include/net/tcp_stats.h  |   65 ++++++++++++++++++++++++++++++++++++++++++++++
>  net/ipv4/tcp.c           |   30 ++++++++++++++++++---
>  net/ipv4/tcp_input.c     |   13 +++++++++
>  net/ipv4/tcp_ipv4.c      |   41 ++++++++++++++++++++++++++---
>  net/ipv4/tcp_minisocks.c |    9 ++++++
>  net/ipv4/tcp_output.c    |   47 +++++++++++++++++++++++++++++++--
>  net/ipv6/tcp_ipv6.c      |    8 +++++
>  9 files changed, 206 insertions(+), 11 deletions(-)
>  create mode 100644 include/net/tcp_stats.h
> 
> diff --git a/include/linux/ktime.h b/include/linux/ktime.h
> index e1ceaa9..e60e758 100644
> --- a/include/linux/ktime.h
> +++ b/include/linux/ktime.h
> @@ -333,6 +333,9 @@ extern void ktime_get_ts(struct timespec *ts);
>  /* Get the real (wall-) time in timespec format: */
>  #define ktime_get_real_ts(ts)	getnstimeofday(ts)

Hmm, this kind of changes are out of netdev scope and should be avoided

>  
> +#define ktime_since(a)		ktime_to_us(ktime_sub(ktime_get(), (a)))

us are implied in ktime_since() ? thats strange.

> +#define ktime_zero(a)		ktime_equal((a), ktime_set(0, 0))

ktime_zero() sounds like : "give me zero time" or "clear the ktime
field". 

> +
>  static inline ktime_t ns_to_ktime(u64 ns)
>  {
>  	static const ktime_t ktime_zero = { .tv64 = 0 };
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index e64f4c6..ea5cb5d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -460,6 +460,7 @@ struct tcp_sock {
>  	 * contains related tcp_cookie_transactions fields.
>  	 */
>  	struct tcp_cookie_values  *cookie_values;
> +	struct tcp_stats	*conn_stats;
>  };

Really, using separate cache lines to store some stats is expensive.
You should add counters in existing structure, to avoid additional cache
line dirties. Carefully placing stats in already dirtied cache lines.

You also should use native ktime_t infrastructure, to make the maths
really fast in fast path.

Only when stats are to be returned to user, you'll have to convert the
native timestamps to user exportable ones.

Quite frankly, using u64 fields allow nanosec resolution.

BTW, we probably could 'export' sk->sk_drops for TCP, like we do for
UDP.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-17  8:42 ` Eric Dumazet
@ 2011-03-17 20:16   ` Jerry Chu
  2011-03-17 21:20     ` Stephen Hemminger
  0 siblings, 1 reply; 15+ messages in thread
From: Jerry Chu @ 2011-03-17 20:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Eric, thanks for the prompt feedback.

On Thu, Mar 17, 2011 at 1:42 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 17 mars 2011 à 01:06 -0700, H.K. Jerry Chu a écrit :
>> From: Jerry Chu <hkchu@google.com>
>>
>> This patch add a number of very useful counters/stats (defined in
>> tcp_stats.h) to help diagnosing TCP related problems.
>>
>> create_time     - when the connection was created (in jiffies)
>> total_inbytes   - total inbytes as consumed by the receiving apps.
>> total_outbytes  - total outbytes sent down from the transmitting apps.
>>
>> total_outdatasegs - total data carrying segments sent so far, including
>>               retransmitted ones.
>>
>> total_xmit      - total accumulated time (usecs) when the connection
>>               has something to send.
>>
>> total_retrans_time - total time (usecs, accumulated) the connection
>>               spends trying to recover lost packets. For each
>>               loss event the time is measured from the lost packet
>>               was first sent till the retransmitted packet was
>>               eventually ack'ed.
>>
>> total_cwnd_limit - total time (usecs, excluding time spent on loss
>>               recovery) the xmit is stopped due to cwnd limited
>>
>> total_swnd_limit - total time (usecs) theconnection is swnd limited
>>
>> The following two counters are for listeners only:
>>
>> accepted_reqs   - total # of accepted connection requests.
>> listen_drops    - total # of dropped SYN reqs (SYN cookies excluded) due
>>               to listener's queue overflow.
>>
>> total_retrans_time/total_retrans ratio gives a rough picture of how
>> quickly in average the connection can recover from a pkt loss. E.g.,
>> when the network is more congested, or the traffic contains mainly
>> smaller RPC where tail drop often requires RTO to recover,
>> the total_retrans_time/total_retrans ratio tends to be higher.
>>
>> Currently the new counters/stats are exported through /proc/net/tcp.
>
> Please dont. Use iproute2 instead.
>
>> Some simple, abbreviated field names have been added to the output of
>> /proc/net/tcp in order to allow backward/forward compatibility in the
>> future. Obviously the new counters/stats can also be easily exported
>> through other APIs.
>>
>
> /proc/net/tcp is legacy. You should touch it eventually, but after
> "other APIS" are done. It was the old way (quick but a bit ugly)

Understood. /proc/net/tcp is a much more expedient way of exporting these
counters because it doesn't requires any additional, special tool to read it,
unless other APIs (e.g., netlink). Note that backward compatibility to
/proc/net/tcp has been ensured by adding field names in the heading.

>
>
>
>> Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
>> ---
>>  include/linux/ktime.h    |    3 ++
>>  include/linux/tcp.h      |    1 +
>>  include/net/tcp_stats.h  |   65 ++++++++++++++++++++++++++++++++++++++++++++++
>>  net/ipv4/tcp.c           |   30 ++++++++++++++++++---
>>  net/ipv4/tcp_input.c     |   13 +++++++++
>>  net/ipv4/tcp_ipv4.c      |   41 ++++++++++++++++++++++++++---
>>  net/ipv4/tcp_minisocks.c |    9 ++++++
>>  net/ipv4/tcp_output.c    |   47 +++++++++++++++++++++++++++++++--
>>  net/ipv6/tcp_ipv6.c      |    8 +++++
>>  9 files changed, 206 insertions(+), 11 deletions(-)
>>  create mode 100644 include/net/tcp_stats.h
>>
>> diff --git a/include/linux/ktime.h b/include/linux/ktime.h
>> index e1ceaa9..e60e758 100644
>> --- a/include/linux/ktime.h
>> +++ b/include/linux/ktime.h
>> @@ -333,6 +333,9 @@ extern void ktime_get_ts(struct timespec *ts);
>>  /* Get the real (wall-) time in timespec format: */
>>  #define ktime_get_real_ts(ts)        getnstimeofday(ts)
>
> Hmm, this kind of changes are out of netdev scope and should be avoided

Ok. (It was moved out of tcp_stats.h only at the last minute.)

>
>>
>> +#define ktime_since(a)               ktime_to_us(ktime_sub(ktime_get(), (a)))
>
> us are implied in ktime_since() ? thats strange.

Ok.

>
>> +#define ktime_zero(a)                ktime_equal((a), ktime_set(0, 0))
>
> ktime_zero() sounds like : "give me zero time" or "clear the ktime
> field".

Yes I actually have been flip-flopping on the name...

>
>> +
>>  static inline ktime_t ns_to_ktime(u64 ns)
>>  {
>>       static const ktime_t ktime_zero = { .tv64 = 0 };
>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> index e64f4c6..ea5cb5d 100644
>> --- a/include/linux/tcp.h
>> +++ b/include/linux/tcp.h
>> @@ -460,6 +460,7 @@ struct tcp_sock {
>>        * contains related tcp_cookie_transactions fields.
>>        */
>>       struct tcp_cookie_values  *cookie_values;
>> +     struct tcp_stats        *conn_stats;
>>  };
>
> Really, using separate cache lines to store some stats is expensive.
> You should add counters in existing structure, to avoid additional cache
> line dirties. Carefully placing stats in already dirtied cache lines.

This was how it was done initially but then we wanted to allow future
extension to include possibly a lot more counters, something like Web100
(RFC4898). For the latter the memory/performance hit will likely require
a config option, and a separate structure will make this easier. Does it
make sense?

>
> You also should use native ktime_t infrastructure, to make the maths
> really fast in fast path.
>
> Only when stats are to be returned to user, you'll have to convert the
> native timestamps to user exportable ones.

Good point! Will do. (I mistakenly thought ktime_t is a larger structure.)

>
> Quite frankly, using u64 fields allow nanosec resolution.

I wish to use less bits because the final report only needs ms or even
sec resolution but the intermediate computation needs to capture usec
resolution.

>
> BTW, we probably could 'export' sk->sk_drops for TCP, like we do for
> UDP.

There are many other potentially useful counters/stats (spurious_retrans,
min_rtt, total_rto,...) but there is a tradeoff against memory/performance hit
so for the first round I'm focusing on what i feel is the most useful set.

Thanks,

Jerry

>
>
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-17 20:16   ` Jerry Chu
@ 2011-03-17 21:20     ` Stephen Hemminger
  2011-03-18  4:33       ` Jerry Chu
  0 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2011-03-17 21:20 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Eric Dumazet, netdev

On Thu, 17 Mar 2011 13:16:15 -0700
Jerry Chu <hkchu@google.com> wrote:

> Eric, thanks for the prompt feedback.
> 
> On Thu, Mar 17, 2011 at 1:42 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le jeudi 17 mars 2011 à 01:06 -0700, H.K. Jerry Chu a écrit :
> >> From: Jerry Chu <hkchu@google.com>
> >>
> >> This patch add a number of very useful counters/stats (defined in
> >> tcp_stats.h) to help diagnosing TCP related problems.
> >>
> >> create_time     - when the connection was created (in jiffies)
> >> total_inbytes   - total inbytes as consumed by the receiving apps.
> >> total_outbytes  - total outbytes sent down from the transmitting apps.
> >>
> >> total_outdatasegs - total data carrying segments sent so far, including
> >>               retransmitted ones.
> >>
> >> total_xmit      - total accumulated time (usecs) when the connection
> >>               has something to send.
> >>
> >> total_retrans_time - total time (usecs, accumulated) the connection
> >>               spends trying to recover lost packets. For each
> >>               loss event the time is measured from the lost packet
> >>               was first sent till the retransmitted packet was
> >>               eventually ack'ed.
> >>
> >> total_cwnd_limit - total time (usecs, excluding time spent on loss
> >>               recovery) the xmit is stopped due to cwnd limited
> >>
> >> total_swnd_limit - total time (usecs) theconnection is swnd limited
> >>
> >> The following two counters are for listeners only:
> >>
> >> accepted_reqs   - total # of accepted connection requests.
> >> listen_drops    - total # of dropped SYN reqs (SYN cookies excluded) due
> >>               to listener's queue overflow.
> >>
> >> total_retrans_time/total_retrans ratio gives a rough picture of how
> >> quickly in average the connection can recover from a pkt loss. E.g.,
> >> when the network is more congested, or the traffic contains mainly
> >> smaller RPC where tail drop often requires RTO to recover,
> >> the total_retrans_time/total_retrans ratio tends to be higher.
> >>
> >> Currently the new counters/stats are exported through /proc/net/tcp.
> >
> > Please dont. Use iproute2 instead.
> >
> >> Some simple, abbreviated field names have been added to the output of
> >> /proc/net/tcp in order to allow backward/forward compatibility in the
> >> future. Obviously the new counters/stats can also be easily exported
> >> through other APIs.
> >>
> >
> > /proc/net/tcp is legacy. You should touch it eventually, but after
> > "other APIS" are done. It was the old way (quick but a bit ugly)
> 
> Understood. /proc/net/tcp is a much more expedient way of exporting these
> counters because it doesn't requires any additional, special tool to read it,
> unless other APIs (e.g., netlink). Note that backward compatibility to
> /proc/net/tcp has been ensured by adding field names in the heading.
> 
> >
> >
> >
> >> Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
> >> ---
> >>  include/linux/ktime.h    |    3 ++
> >>  include/linux/tcp.h      |    1 +
> >>  include/net/tcp_stats.h  |   65 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  net/ipv4/tcp.c           |   30 ++++++++++++++++++---
> >>  net/ipv4/tcp_input.c     |   13 +++++++++
> >>  net/ipv4/tcp_ipv4.c      |   41 ++++++++++++++++++++++++++---
> >>  net/ipv4/tcp_minisocks.c |    9 ++++++
> >>  net/ipv4/tcp_output.c    |   47 +++++++++++++++++++++++++++++++--
> >>  net/ipv6/tcp_ipv6.c      |    8 +++++
> >>  9 files changed, 206 insertions(+), 11 deletions(-)
> >>  create mode 100644 include/net/tcp_stats.h
> >>
> >> diff --git a/include/linux/ktime.h b/include/linux/ktime.h
> >> index e1ceaa9..e60e758 100644
> >> --- a/include/linux/ktime.h
> >> +++ b/include/linux/ktime.h
> >> @@ -333,6 +333,9 @@ extern void ktime_get_ts(struct timespec *ts);
> >>  /* Get the real (wall-) time in timespec format: */
> >>  #define ktime_get_real_ts(ts)        getnstimeofday(ts)
> >
> > Hmm, this kind of changes are out of netdev scope and should be avoided
> 
> Ok. (It was moved out of tcp_stats.h only at the last minute.)
> 
> >
> >>
> >> +#define ktime_since(a)               ktime_to_us(ktime_sub(ktime_get(), (a)))
> >
> > us are implied in ktime_since() ? thats strange.
> 
> Ok.
> 
> >
> >> +#define ktime_zero(a)                ktime_equal((a), ktime_set(0, 0))
> >
> > ktime_zero() sounds like : "give me zero time" or "clear the ktime
> > field".
> 
> Yes I actually have been flip-flopping on the name...
> 
> >
> >> +
> >>  static inline ktime_t ns_to_ktime(u64 ns)
> >>  {
> >>       static const ktime_t ktime_zero = { .tv64 = 0 };
> >> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> >> index e64f4c6..ea5cb5d 100644
> >> --- a/include/linux/tcp.h
> >> +++ b/include/linux/tcp.h
> >> @@ -460,6 +460,7 @@ struct tcp_sock {
> >>        * contains related tcp_cookie_transactions fields.
> >>        */
> >>       struct tcp_cookie_values  *cookie_values;
> >> +     struct tcp_stats        *conn_stats;
> >>  };
> >
> > Really, using separate cache lines to store some stats is expensive.
> > You should add counters in existing structure, to avoid additional cache
> > line dirties. Carefully placing stats in already dirtied cache lines.
> 
> This was how it was done initially but then we wanted to allow future
> extension to include possibly a lot more counters, something like Web100
> (RFC4898). For the latter the memory/performance hit will likely require
> a config option, and a separate structure will make this easier. Does it
> make sense?
> 
> >
> > You also should use native ktime_t infrastructure, to make the maths
> > really fast in fast path.
> >
> > Only when stats are to be returned to user, you'll have to convert the
> > native timestamps to user exportable ones.
> 
> Good point! Will do. (I mistakenly thought ktime_t is a larger structure.)
> 
> >
> > Quite frankly, using u64 fields allow nanosec resolution.
> 
> I wish to use less bits because the final report only needs ms or even
> sec resolution but the intermediate computation needs to capture usec
> resolution.
> 
> >
> > BTW, we probably could 'export' sk->sk_drops for TCP, like we do for
> > UDP.
> 
> There are many other potentially useful counters/stats (spurious_retrans,
> min_rtt, total_rto,...) but there is a tradeoff against memory/performance hit
> so for the first round I'm focusing on what i feel is the most useful set.
> 
> Thanks,
> 
> Jerry

These stats are best added via netlink, the tool ss already prints lots
of similar stats. Look at INET_DIAG_INFO and the output of:
 ss -i -t






-- 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-17 21:20     ` Stephen Hemminger
@ 2011-03-18  4:33       ` Jerry Chu
  2011-03-18  4:51         ` David Miller
  2011-03-18  6:05         ` Eric Dumazet
  0 siblings, 2 replies; 15+ messages in thread
From: Jerry Chu @ 2011-03-18  4:33 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric Dumazet, netdev

On Thu, Mar 17, 2011 at 2:20 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Thu, 17 Mar 2011 13:16:15 -0700
> Jerry Chu <hkchu@google.com> wrote:
>
>> Eric, thanks for the prompt feedback.
>>
>> On Thu, Mar 17, 2011 at 1:42 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > Le jeudi 17 mars 2011 à 01:06 -0700, H.K. Jerry Chu a écrit :
>> >> From: Jerry Chu <hkchu@google.com>
>> >>
>> >> This patch add a number of very useful counters/stats (defined in
>> >> tcp_stats.h) to help diagnosing TCP related problems.
>> >>
>> >> create_time     - when the connection was created (in jiffies)
>> >> total_inbytes   - total inbytes as consumed by the receiving apps.
>> >> total_outbytes  - total outbytes sent down from the transmitting apps.
>> >>
>> >> total_outdatasegs - total data carrying segments sent so far, including
>> >>               retransmitted ones.
>> >>
>> >> total_xmit      - total accumulated time (usecs) when the connection
>> >>               has something to send.
>> >>
>> >> total_retrans_time - total time (usecs, accumulated) the connection
>> >>               spends trying to recover lost packets. For each
>> >>               loss event the time is measured from the lost packet
>> >>               was first sent till the retransmitted packet was
>> >>               eventually ack'ed.
>> >>
>> >> total_cwnd_limit - total time (usecs, excluding time spent on loss
>> >>               recovery) the xmit is stopped due to cwnd limited
>> >>
>> >> total_swnd_limit - total time (usecs) theconnection is swnd limited
>> >>
>> >> The following two counters are for listeners only:
>> >>
>> >> accepted_reqs   - total # of accepted connection requests.
>> >> listen_drops    - total # of dropped SYN reqs (SYN cookies excluded) due
>> >>               to listener's queue overflow.
>> >>
>> >> total_retrans_time/total_retrans ratio gives a rough picture of how
>> >> quickly in average the connection can recover from a pkt loss. E.g.,
>> >> when the network is more congested, or the traffic contains mainly
>> >> smaller RPC where tail drop often requires RTO to recover,
>> >> the total_retrans_time/total_retrans ratio tends to be higher.
>> >>
>> >> Currently the new counters/stats are exported through /proc/net/tcp.
>> >
>> > Please dont. Use iproute2 instead.
>> >
>> >> Some simple, abbreviated field names have been added to the output of
>> >> /proc/net/tcp in order to allow backward/forward compatibility in the
>> >> future. Obviously the new counters/stats can also be easily exported
>> >> through other APIs.
>> >>
>> >
>> > /proc/net/tcp is legacy. You should touch it eventually, but after
>> > "other APIS" are done. It was the old way (quick but a bit ugly)
>>
>> Understood. /proc/net/tcp is a much more expedient way of exporting these
>> counters because it doesn't requires any additional, special tool to read it,
>> unless other APIs (e.g., netlink). Note that backward compatibility to
>> /proc/net/tcp has been ensured by adding field names in the heading.
>>
>> >
>> >
>> >
>> >> Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
>> >> ---
>> >>  include/linux/ktime.h    |    3 ++
>> >>  include/linux/tcp.h      |    1 +
>> >>  include/net/tcp_stats.h  |   65 ++++++++++++++++++++++++++++++++++++++++++++++
>> >>  net/ipv4/tcp.c           |   30 ++++++++++++++++++---
>> >>  net/ipv4/tcp_input.c     |   13 +++++++++
>> >>  net/ipv4/tcp_ipv4.c      |   41 ++++++++++++++++++++++++++---
>> >>  net/ipv4/tcp_minisocks.c |    9 ++++++
>> >>  net/ipv4/tcp_output.c    |   47 +++++++++++++++++++++++++++++++--
>> >>  net/ipv6/tcp_ipv6.c      |    8 +++++
>> >>  9 files changed, 206 insertions(+), 11 deletions(-)
>> >>  create mode 100644 include/net/tcp_stats.h
>> >>
>> >> diff --git a/include/linux/ktime.h b/include/linux/ktime.h
>> >> index e1ceaa9..e60e758 100644
>> >> --- a/include/linux/ktime.h
>> >> +++ b/include/linux/ktime.h
>> >> @@ -333,6 +333,9 @@ extern void ktime_get_ts(struct timespec *ts);
>> >>  /* Get the real (wall-) time in timespec format: */
>> >>  #define ktime_get_real_ts(ts)        getnstimeofday(ts)
>> >
>> > Hmm, this kind of changes are out of netdev scope and should be avoided
>>
>> Ok. (It was moved out of tcp_stats.h only at the last minute.)
>>
>> >
>> >>
>> >> +#define ktime_since(a)               ktime_to_us(ktime_sub(ktime_get(), (a)))
>> >
>> > us are implied in ktime_since() ? thats strange.
>>
>> Ok.
>>
>> >
>> >> +#define ktime_zero(a)                ktime_equal((a), ktime_set(0, 0))
>> >
>> > ktime_zero() sounds like : "give me zero time" or "clear the ktime
>> > field".
>>
>> Yes I actually have been flip-flopping on the name...
>>
>> >
>> >> +
>> >>  static inline ktime_t ns_to_ktime(u64 ns)
>> >>  {
>> >>       static const ktime_t ktime_zero = { .tv64 = 0 };
>> >> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> >> index e64f4c6..ea5cb5d 100644
>> >> --- a/include/linux/tcp.h
>> >> +++ b/include/linux/tcp.h
>> >> @@ -460,6 +460,7 @@ struct tcp_sock {
>> >>        * contains related tcp_cookie_transactions fields.
>> >>        */
>> >>       struct tcp_cookie_values  *cookie_values;
>> >> +     struct tcp_stats        *conn_stats;
>> >>  };
>> >
>> > Really, using separate cache lines to store some stats is expensive.
>> > You should add counters in existing structure, to avoid additional cache
>> > line dirties. Carefully placing stats in already dirtied cache lines.
>>
>> This was how it was done initially but then we wanted to allow future
>> extension to include possibly a lot more counters, something like Web100
>> (RFC4898). For the latter the memory/performance hit will likely require
>> a config option, and a separate structure will make this easier. Does it
>> make sense?
>>
>> >
>> > You also should use native ktime_t infrastructure, to make the maths
>> > really fast in fast path.
>> >
>> > Only when stats are to be returned to user, you'll have to convert the
>> > native timestamps to user exportable ones.
>>
>> Good point! Will do. (I mistakenly thought ktime_t is a larger structure.)
>>
>> >
>> > Quite frankly, using u64 fields allow nanosec resolution.
>>
>> I wish to use less bits because the final report only needs ms or even
>> sec resolution but the intermediate computation needs to capture usec
>> resolution.
>>
>> >
>> > BTW, we probably could 'export' sk->sk_drops for TCP, like we do for
>> > UDP.
>>
>> There are many other potentially useful counters/stats (spurious_retrans,
>> min_rtt, total_rto,...) but there is a tradeoff against memory/performance hit
>> so for the first round I'm focusing on what i feel is the most useful set.
>>
>> Thanks,
>>
>> Jerry
>
> These stats are best added via netlink, the tool ss already prints lots
> of similar stats. Look at INET_DIAG_INFO and the output of:
>  ss -i -t

Yes I'm familiar with ss and how inet_csk_diag_fill() calls into
tcp_diag_get_info(),..., etc. Like I mentioned previously the netlink
interface
requires a new version of reader app (e.g., ss) to ship every time a new
counter is added and exported by the kernel, whereas /proc/net/tcp does
not have such a problem.

Anyway I can add a INET_DIAG_INFO_EXT of some sort to netlink in
addition to /proc/net/tcp. That's easy to do.

Thanks,

Jerry

>
>
>
>
>
>
> --
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-18  4:33       ` Jerry Chu
@ 2011-03-18  4:51         ` David Miller
  2011-03-18  6:05         ` Eric Dumazet
  1 sibling, 0 replies; 15+ messages in thread
From: David Miller @ 2011-03-18  4:51 UTC (permalink / raw)
  To: hkchu; +Cc: shemminger, eric.dumazet, netdev

From: Jerry Chu <hkchu@google.com>
Date: Thu, 17 Mar 2011 21:33:50 -0700

> Yes I'm familiar with ss and how inet_csk_diag_fill() calls into
> tcp_diag_get_info(),..., etc. Like I mentioned previously the netlink
> interface
> requires a new version of reader app (e.g., ss) to ship every time a new
> counter is added and exported by the kernel, whereas /proc/net/tcp does
> not have such a problem.

You can add new netlink attributes to portably extend the existing
info, and also to add a mechanism by which to make 'ss' utility
upgrades unnecessary.

Please, just use modern mechanisms like netlink and leave legacy
bits like /proc/net/tcp alone, thank you.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-18  4:33       ` Jerry Chu
  2011-03-18  4:51         ` David Miller
@ 2011-03-18  6:05         ` Eric Dumazet
  2011-03-19  4:33           ` Jerry Chu
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2011-03-18  6:05 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Stephen Hemminger, netdev

Le jeudi 17 mars 2011 à 21:33 -0700, Jerry Chu a écrit :

> Yes I'm familiar with ss and how inet_csk_diag_fill() calls into
> tcp_diag_get_info(),..., etc. Like I mentioned previously the netlink
> interface
> requires a new version of reader app (e.g., ss) to ship every time a new
> counter is added and exported by the kernel, whereas /proc/net/tcp does
> not have such a problem.
> 
> Anyway I can add a INET_DIAG_INFO_EXT of some sort to netlink in
> addition to /proc/net/tcp. That's easy to do.
> 

I suggest adding a getsockopt() or something to gather stats for one
socket. On a loaded machine, /proc/net/tcp is damn slow.

We have TCP_INFO, we could add TCP_INFO_EXT, but taking care of choosing
an interface allowing for extension of the structure.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-18  6:05         ` Eric Dumazet
@ 2011-03-19  4:33           ` Jerry Chu
  2011-03-19  5:02             ` Eric Dumazet
  2011-03-21 21:18             ` Stephen Hemminger
  0 siblings, 2 replies; 15+ messages in thread
From: Jerry Chu @ 2011-03-19  4:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, netdev

On Thu, Mar 17, 2011 at 11:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Le jeudi 17 mars 2011 à 21:33 -0700, Jerry Chu a écrit :
>
> > Yes I'm familiar with ss and how inet_csk_diag_fill() calls into
> > tcp_diag_get_info(),..., etc. Like I mentioned previously the netlink
> > interface
> > requires a new version of reader app (e.g., ss) to ship every time a new
> > counter is added and exported by the kernel, whereas /proc/net/tcp does
> > not have such a problem.
> >
> > Anyway I can add a INET_DIAG_INFO_EXT of some sort to netlink in
> > addition to /proc/net/tcp. That's easy to do.
> >
>
> I suggest adding a getsockopt() or something to gather stats for one
> socket. On a loaded machine, /proc/net/tcp is damn slow.

I'm well aware of the past, hideous O(n**2) problem of reading
/proc/net/tcp but I
thought the problem has been fixed by Tom Herbert a while back, no?
(See http://marc.info/?l=linux-netdev&m=127588123429437&w=2)

>
> We have TCP_INFO, we could add TCP_INFO_EXT, but taking care of choosing
> an interface allowing for extension of the structure.

Ok.

Jerry

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-19  4:33           ` Jerry Chu
@ 2011-03-19  5:02             ` Eric Dumazet
  2011-03-19  5:38               ` Jerry Chu
  2011-03-21 21:18             ` Stephen Hemminger
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2011-03-19  5:02 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Stephen Hemminger, netdev

Le vendredi 18 mars 2011 à 21:33 -0700, Jerry Chu a écrit :

> I'm well aware of the past, hideous O(n**2) problem of reading
> /proc/net/tcp but I
> thought the problem has been fixed by Tom Herbert a while back, no?
> (See http://marc.info/?l=linux-netdev&m=127588123429437&w=2)
> 

O(N) is still too slow, to gather stats for your socket, if you want to
gather stats once per second for example.

AFAIK, I am not sure we want to allow any user to access all these data
for all tcp sockets on the machine. This might have security impacts.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-19  5:02             ` Eric Dumazet
@ 2011-03-19  5:38               ` Jerry Chu
  2011-03-19  6:03                 ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jerry Chu @ 2011-03-19  5:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, netdev

On Fri, Mar 18, 2011 at 10:02 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 18 mars 2011 à 21:33 -0700, Jerry Chu a écrit :
>
>> I'm well aware of the past, hideous O(n**2) problem of reading
>> /proc/net/tcp but I
>> thought the problem has been fixed by Tom Herbert a while back, no?
>> (See http://marc.info/?l=linux-netdev&m=127588123429437&w=2)
>>
>
> O(N) is still too slow, to gather stats for your socket, if you want to
> gather stats once per second for example.

I don't see anyway around O(n). The netlink/inet_diag_dump() will take at
best O(n) too.

>
> AFAIK, I am not sure we want to allow any user to access all these data
> for all tcp sockets on the machine. This might have security impacts.

This seems to be an orthogonal issue to netlink or /proc/net/tcp.

Jerry

>
>
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-19  5:38               ` Jerry Chu
@ 2011-03-19  6:03                 ` Eric Dumazet
  2011-03-20  7:30                   ` Jerry Chu
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2011-03-19  6:03 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Stephen Hemminger, netdev

Le vendredi 18 mars 2011 à 22:38 -0700, Jerry Chu a écrit :
> On Fri, Mar 18, 2011 at 10:02 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le vendredi 18 mars 2011 à 21:33 -0700, Jerry Chu a écrit :
> >
> >> I'm well aware of the past, hideous O(n**2) problem of reading
> >> /proc/net/tcp but I
> >> thought the problem has been fixed by Tom Herbert a while back, no?
> >> (See http://marc.info/?l=linux-netdev&m=127588123429437&w=2)
> >>
> >
> > O(N) is still too slow, to gather stats for your socket, if you want to
> > gather stats once per second for example.
> 
> I don't see anyway around O(n). The netlink/inet_diag_dump() will take at
> best O(n) too.

1) netlink is extensible, without breaking old apps.

2) netlink is able to dump part of the information, while
whith /proc/net/tcp you have to cat all the file and grep the needed
info. O(1) is better than O(10)

Example :

I only want info of socket identified by dst ip 192.168.0.144 and dst
port 58196

# ss -emoi dst 192.168.20.144:58196
State       Recv-Q Send-Q     Local Address:Port         Peer Address:Port   
ESTAB       0      0         192.168.20.108:65111      192.168.20.144:58196    ino:3302 sk:ffff88011c278c80
	 mem:(r0,w0,f0,t0) sack cubic wscale:7,7 rto:201 rtt:1.75/0.75 cwnd:10 send 66.7Mbps rcv_space:14600

I want all sockets to remote 192.168.20.110
# ss -emoi dst 192.168.20.110
State       Recv-Q Send-Q     Local Address:Port         Peer Address:Port   
ESTAB       0      0      ::ffff:192.168.20.108:ssh     ::ffff:192.168.20.110:47633    timer:(keepalive,120min,0) ino:70342 sk:ffff88011c381500
	 mem:(r0,w0,f0,t0) ts sack ecn cubic wscale:8,7 rto:207 rtt:7.625/11.5 ato:40 cwnd:10 send 15.2Mbps rcv_rtt:1 rcv_space:14480



3) getsockopt(fd) is O(1)

> 
> >
> > AFAIK, I am not sure we want to allow any user to access all these data
> > for all tcp sockets on the machine. This might have security impacts.
> 
> This seems to be an orthogonal issue to netlink or /proc/net/tcp.

getsockopt(fd) only finds information about your own socket.

An application might want to log tcp information only right before
closing one socket. This sounds more practical than having a separate
spy thread.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-19  6:03                 ` Eric Dumazet
@ 2011-03-20  7:30                   ` Jerry Chu
  2011-03-20  9:18                     ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jerry Chu @ 2011-03-20  7:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, netdev

On Fri, Mar 18, 2011 at 11:03 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Le vendredi 18 mars 2011 à 22:38 -0700, Jerry Chu a écrit :
> > On Fri, Mar 18, 2011 at 10:02 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Le vendredi 18 mars 2011 à 21:33 -0700, Jerry Chu a écrit :
> > >
> > >> I'm well aware of the past, hideous O(n**2) problem of reading
> > >> /proc/net/tcp but I
> > >> thought the problem has been fixed by Tom Herbert a while back, no?
> > >> (See http://marc.info/?l=linux-netdev&m=127588123429437&w=2)
> > >>
> > >
> > > O(N) is still too slow, to gather stats for your socket, if you want to
> > > gather stats once per second for example.
> >
> > I don't see anyway around O(n). The netlink/inet_diag_dump() will take at
> > best O(n) too.
>
> 1) netlink is extensible, without breaking old apps.
>
> 2) netlink is able to dump part of the information, while
> whith /proc/net/tcp you have to cat all the file and grep the needed
> info. O(1) is better than O(10)
>
> Example :
>
> I only want info of socket identified by dst ip 192.168.0.144 and dst
> port 58196
>
> # ss -emoi dst 192.168.20.144:58196
> State       Recv-Q Send-Q     Local Address:Port         Peer Address:Port
> ESTAB       0      0         192.168.20.108:65111      192.168.20.144:58196    ino:3302 sk:ffff88011c278c80
>         mem:(r0,w0,f0,t0) sack cubic wscale:7,7 rto:201 rtt:1.75/0.75 cwnd:10 send 66.7Mbps rcv_space:14600
>
> I want all sockets to remote 192.168.20.110
> # ss -emoi dst 192.168.20.110
> State       Recv-Q Send-Q     Local Address:Port         Peer Address:Port
> ESTAB       0      0      ::ffff:192.168.20.108:ssh     ::ffff:192.168.20.110:47633    timer:(keepalive,120min,0) ino:70342 sk:ffff88011c381500
>         mem:(r0,w0,f0,t0) ts sack ecn cubic wscale:8,7 rto:207 rtt:7.625/11.5 ato:40 cwnd:10 send 15.2Mbps rcv_rtt:1 rcv_space:14480

I presume the filtering of specific IP addrs/ports are done in the
kernel to short-circuit the
O(n) loop. Otherwise it seems moot (i.e., it still costs O(n)).

A quick look at inet_csk_diag_dump() does not reveal anything like
that. But I haven't
looked hard enough.

>
>
> 3) getsockopt(fd) is O(1)
>
> >
> > >
> > > AFAIK, I am not sure we want to allow any user to access all these data
> > > for all tcp sockets on the machine. This might have security impacts.
> >
> > This seems to be an orthogonal issue to netlink or /proc/net/tcp.
>
> getsockopt(fd) only finds information about your own socket.

Make no mistake if my usage case was driven by specific apps why would I
waste my time chasing /proc/net/tcp? In fact we've had a private
TCP_INFO_EXT socket option for years used by some specific apps. (We
have not tried to pushed it upstreams because some of the counters we collect
are too Google specific.)

But my immediate use of the counters/stats proposed here is for our internal
monitor programs to take samples occasionally from ALL CONNECTIONS, or for
diagnosis purpose from the TCP layer when we don't have specific info about the
apps' IP or port# (so we just want to dump info from all connections).

>
> An application might want to log tcp information only right before
> closing one socket. This sounds more practical than having a separate
> spy thread.

I'm not trying to use /proc/net/tcp (or netlink) because I don't know about
other, more efficient way to retrieve counters from a subset, or even just one
specific connection as explained above.

Jerry

>
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-20  7:30                   ` Jerry Chu
@ 2011-03-20  9:18                     ` Eric Dumazet
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2011-03-20  9:18 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Stephen Hemminger, netdev

Le dimanche 20 mars 2011 à 00:30 -0700, Jerry Chu a écrit :

> I presume the filtering of specific IP addrs/ports are done in the
> kernel to short-circuit the
> O(n) loop. Otherwise it seems moot (i.e., it still costs O(n)).
> 

Filtering is done in kernel, and we can extend filtering capa if
necessary.

/proc/net/tcp is a flat file, you cannot do this


> A quick look at inet_csk_diag_dump() does not reveal anything like
> that. But I haven't
> looked hard enough.

You should ;)



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-19  4:33           ` Jerry Chu
  2011-03-19  5:02             ` Eric Dumazet
@ 2011-03-21 21:18             ` Stephen Hemminger
  2011-03-21 22:19               ` Ben Greear
  1 sibling, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2011-03-21 21:18 UTC (permalink / raw)
  To: Jerry Chu; +Cc: Eric Dumazet, netdev

In case it isn't clear. /proc/net/tcp is a legacy interface and any
changes to it are certain to be rejected. netlink stats are an extensible
interface and likely to be accepted.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add useful per-connection TCP stats for diagnosis purpose.
  2011-03-21 21:18             ` Stephen Hemminger
@ 2011-03-21 22:19               ` Ben Greear
  0 siblings, 0 replies; 15+ messages in thread
From: Ben Greear @ 2011-03-21 22:19 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Jerry Chu, Eric Dumazet, netdev

On 03/21/2011 02:18 PM, Stephen Hemminger wrote:
> In case it isn't clear. /proc/net/tcp is a legacy interface and any
> changes to it are certain to be rejected. netlink stats are an extensible
> interface and likely to be accepted.

Might be nice to have this available via

getsockopt(s, 6, /*SOL_TCP,*/ TCP_INFO, (void*)&info, &argsz)
as well.

Thanks,
Ben

>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-03-21 22:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-17  8:06 [PATCH] Add useful per-connection TCP stats for diagnosis purpose H.K. Jerry Chu
2011-03-17  8:42 ` Eric Dumazet
2011-03-17 20:16   ` Jerry Chu
2011-03-17 21:20     ` Stephen Hemminger
2011-03-18  4:33       ` Jerry Chu
2011-03-18  4:51         ` David Miller
2011-03-18  6:05         ` Eric Dumazet
2011-03-19  4:33           ` Jerry Chu
2011-03-19  5:02             ` Eric Dumazet
2011-03-19  5:38               ` Jerry Chu
2011-03-19  6:03                 ` Eric Dumazet
2011-03-20  7:30                   ` Jerry Chu
2011-03-20  9:18                     ` Eric Dumazet
2011-03-21 21:18             ` Stephen Hemminger
2011-03-21 22:19               ` Ben Greear

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).