netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] Tracepoint for tcp retransmission
@ 2012-01-20 18:07 Satoru Moriya
  2012-01-20 18:08 ` [PATCH v2 1/2] tcp: refactor tcp_retransmit_skb() for a single return point Satoru Moriya
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Satoru Moriya @ 2012-01-20 18:07 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, tgraf, Stephen Hemminger, Hagen Paul Pfeifer,
	eric.dumazet, Seiji Aguchi

Change log v1 -> v2
- rewrite a patch description based on replies to v1 patchset
- add local port number to tracedata

Sometimes network packets are dropped for some reason. In enterprise
systems which require strict RAS functionality, we must know the
reason why it happened and explain it to our customers even if using
TCP. When we investigate the incidents, at first we try to find out
whether the packet drop is in the server(kernel, application) or else
(router, hub etc). Once we find it happened in the kernel, we try to
get more details.

Currently, there are some tools/interfaces, e.g. tcpdump,
dropwatch/skb:kfree_skb(tracepoint), netstat, /proc, systemtap etc, 
which help us analyze situations.
But unfortunately, they are too much for one, not enough for
the other. tcpdump captures all the packet but it's overkill because
we don't need all the packets' data but just dropped one. We can
get statistics via netstat and/or /proc but we need more information
to analyze the situation. skb:kfree_skb tracepoint is very useful
for detecting packet drop and analyzing it. In addition to it, if
we have tracepoints in TCP layer in particular retransmit path,
it is very helpful for us to dig into situations because with TCP
the kernel tries to resend packets before dropping them.

With this tracepoint, we can know whether the packet drop occurred
in the server (moreover in the kernel) or not. For example, if we
finds that retransmission failed (tcp_retransmit_skb() returned
negative value), it means the kernel may have some troubles at
that time and we can drill down on issues in the kernel based on
trace data. OTOH, if retransmission succeeded, packet is dropped
outside the kernel/server.


Satoru Moriya (2):
  tcp: refactor tcp_retransmit_skb() for a single return point
  tcp: add tracepoint for tcp retransmission

 include/trace/events/tcp.h |   38 ++++++++++++++++++++++++++++++++++++++
 net/core/net-traces.c      |    1 +
 net/ipv4/tcp_output.c      |   34 ++++++++++++++++++++++++----------
 3 files changed, 63 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/events/tcp.h

-- 
1.7.6.4

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 1/2] tcp: refactor tcp_retransmit_skb() for a single return point
  2012-01-20 18:07 [PATCH v2 0/2] Tracepoint for tcp retransmission Satoru Moriya
@ 2012-01-20 18:08 ` Satoru Moriya
  2012-01-20 18:09 ` [PATCH v2 2/2] tcp: add tracepoint for tcp retransmission Satoru Moriya
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Satoru Moriya @ 2012-01-20 18:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, tgraf, Stephen Hemminger, Hagen Paul Pfeifer,
	eric.dumazet, Seiji Aguchi

This is just a cleanup patch for making easy to hook return value
with a tracepoint.

Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
---
 net/ipv4/tcp_output.c |   31 +++++++++++++++++++++----------
 1 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8c8de27..7bd23a8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2092,18 +2092,24 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	 * copying overhead: fragmentation, tunneling, mangling etc.
 	 */
 	if (atomic_read(&sk->sk_wmem_alloc) >
-	    min(sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf))
-		return -EAGAIN;
+	    min(sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf)) {
+		err = -EAGAIN;
+		goto out;
+	}
 
 	if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
 		if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
 			BUG();
-		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
-			return -ENOMEM;
+		if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq)) {
+			err = -ENOMEM;
+			goto out;
+		}
 	}
 
-	if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
-		return -EHOSTUNREACH; /* Routing failure or similar. */
+	if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk)) {
+		err = -EHOSTUNREACH; /* Routing failure or similar. */
+		goto out;
+	}
 
 	cur_mss = tcp_current_mss(sk);
 
@@ -2113,12 +2119,16 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	 * our retransmit serves as a zero window probe.
 	 */
 	if (!before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(tp)) &&
-	    TCP_SKB_CB(skb)->seq != tp->snd_una)
-		return -EAGAIN;
+	    TCP_SKB_CB(skb)->seq != tp->snd_una) {
+		err = -EAGAIN;
+		goto out;
+	}
 
 	if (skb->len > cur_mss) {
-		if (tcp_fragment(sk, skb, cur_mss, cur_mss))
-			return -ENOMEM; /* We'll try again later. */
+		if (tcp_fragment(sk, skb, cur_mss, cur_mss)) {
+			err = -ENOMEM; /* We'll try again later. */
+			goto out;
+		}
 	} else {
 		int oldpcount = tcp_skb_pcount(skb);
 
@@ -2188,6 +2198,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 		 */
 		TCP_SKB_CB(skb)->ack_seq = tp->snd_nxt;
 	}
+out:
 	return err;
 }
 
-- 
1.7.6.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 2/2] tcp: add tracepoint for tcp retransmission
  2012-01-20 18:07 [PATCH v2 0/2] Tracepoint for tcp retransmission Satoru Moriya
  2012-01-20 18:08 ` [PATCH v2 1/2] tcp: refactor tcp_retransmit_skb() for a single return point Satoru Moriya
@ 2012-01-20 18:09 ` Satoru Moriya
  2012-01-20 18:50 ` [PATCH v2 0/2] Tracepoint " David Miller
  2012-01-25 13:27 ` Hagen Paul Pfeifer
  3 siblings, 0 replies; 30+ messages in thread
From: Satoru Moriya @ 2012-01-20 18:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, tgraf, Stephen Hemminger, Hagen Paul Pfeifer,
	eric.dumazet, Seiji Aguchi

This patch adds a tracepoint to tcp_retransmit_skb() to get the
pointer to sk and skb, local port number  and return value. This helps
one understand whether problems are in a server or not.

Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
---
 include/trace/events/tcp.h |   38 ++++++++++++++++++++++++++++++++++++++
 net/core/net-traces.c      |    1 +
 net/ipv4/tcp_output.c      |    3 +++
 3 files changed, 42 insertions(+), 0 deletions(-)
 create mode 100644 include/trace/events/tcp.h

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
new file mode 100644
index 0000000..311e783
--- /dev/null
+++ b/include/trace/events/tcp.h
@@ -0,0 +1,38 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tcp
+
+#if !defined(_TRACE_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TCP_H
+
+#include <linux/skbuff.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(tcp_retransmit_skb,
+
+	TP_PROTO(struct sock *sk, struct sk_buff *skb, int err),
+
+	TP_ARGS(sk, skb, err),
+
+	TP_STRUCT__entry(
+		__field(void *, skaddr)
+		__field(void *, skbaddr)
+		__field(__u16, lport)
+		__field(int, err)
+	),
+
+	TP_fast_assign(
+		__entry->skaddr = sk;
+		__entry->skbaddr = skb;
+		__entry->lport = inet_sk(sk)->inet_num;
+		__entry->err = err;
+	),
+
+	TP_printk("sk=%p skb=%p port=%hu err=%d",
+		__entry->skaddr, __entry->skbaddr,
+		__entry->lport, __entry->err)
+);
+
+#endif /* _TRACE_TCP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index ba3c012..63f966b 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -31,6 +31,7 @@
 #include <trace/events/napi.h>
 #include <trace/events/sock.h>
 #include <trace/events/udp.h>
+#include <trace/events/tcp.h>
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7bd23a8..483cc17 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -40,6 +40,8 @@
 #include <linux/gfp.h>
 #include <linux/module.h>
 
+#include <trace/events/tcp.h>
+
 /* People can turn this off for buggy TCP's found in printers etc. */
 int sysctl_tcp_retrans_collapse __read_mostly = 1;
 
@@ -2199,6 +2201,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 		TCP_SKB_CB(skb)->ack_seq = tp->snd_nxt;
 	}
 out:
+	trace_tcp_retransmit_skb(sk, skb, err);
 	return err;
 }
 
-- 
1.7.6.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-01-20 18:07 [PATCH v2 0/2] Tracepoint for tcp retransmission Satoru Moriya
  2012-01-20 18:08 ` [PATCH v2 1/2] tcp: refactor tcp_retransmit_skb() for a single return point Satoru Moriya
  2012-01-20 18:09 ` [PATCH v2 2/2] tcp: add tracepoint for tcp retransmission Satoru Moriya
@ 2012-01-20 18:50 ` David Miller
  2012-02-03 21:47   ` Satoru Moriya
  2012-01-25 13:27 ` Hagen Paul Pfeifer
  3 siblings, 1 reply; 30+ messages in thread
From: David Miller @ 2012-01-20 18:50 UTC (permalink / raw)
  To: satoru.moriya
  Cc: netdev, nhorman, tgraf, stephen.hemminger, hagen, eric.dumazet,
	seiji.aguchi

From: Satoru Moriya <satoru.moriya@hds.com>
Date: Fri, 20 Jan 2012 13:07:02 -0500

> Sometimes network packets are dropped for some reason. In enterprise
> systems which require strict RAS functionality, we must know the
> reason why it happened and explain it to our customers even if using
> TCP.

You were given an alternative way to trace these kinds of events,
and you have yet to give us a solid reason why that cannot work
for you.

I am not applying these patches, sorry.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-01-20 18:07 [PATCH v2 0/2] Tracepoint for tcp retransmission Satoru Moriya
                   ` (2 preceding siblings ...)
  2012-01-20 18:50 ` [PATCH v2 0/2] Tracepoint " David Miller
@ 2012-01-25 13:27 ` Hagen Paul Pfeifer
  2012-01-25 14:44   ` Eric Dumazet
  2012-02-03 20:43   ` Satoru Moriya
  3 siblings, 2 replies; 30+ messages in thread
From: Hagen Paul Pfeifer @ 2012-01-25 13:27 UTC (permalink / raw)
  To: Satoru Moriya
  Cc: netdev, davem, nhorman, tgraf, Stephen Hemminger, eric.dumazet,
	Seiji Aguchi


On Fri, 20 Jan 2012 13:07:02 -0500, Satoru Moriya wrote:



> With this tracepoint, we can know whether the packet drop occurred

> in the server (moreover in the kernel) or not. For example, if we

> finds that retransmission failed (tcp_retransmit_skb() returned

> negative value), it means the kernel may have some troubles at

> that time and we can drill down on issues in the kernel based on

> trace data. OTOH, if retransmission succeeded, packet is dropped

> outside the kernel/server.





This is an equivalent replacement and by means not so intrusive:



probe kernel.function("tcp_retransmit_skb").return {

    printf("%s -> %d", probefunc(), retval())

}



and print inet_sk($sk)->inet_num for

kernel.function("tcp_retransmit_skb").



It is crazy to add everywhere new tracepoints. Systemtap is far from being

perfect and as smooth as dtrace. But this is an example where systemtap is

suitable and should be used.



Satoru, you wrote systemtap is not suitable - why?



Hagen

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-01-25 13:27 ` Hagen Paul Pfeifer
@ 2012-01-25 14:44   ` Eric Dumazet
  2012-01-26 18:51     ` David Miller
  2012-02-03 20:31     ` Satoru Moriya
  2012-02-03 20:43   ` Satoru Moriya
  1 sibling, 2 replies; 30+ messages in thread
From: Eric Dumazet @ 2012-01-25 14:44 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Satoru Moriya, netdev, davem, nhorman, tgraf, Stephen Hemminger,
	Seiji Aguchi

Le mercredi 25 janvier 2012 à 14:27 +0100, Hagen Paul Pfeifer a écrit :


> It is crazy to add everywhere new tracepoints. Systemtap is far from being
> perfect and as smooth as dtrace. But this is an example where systemtap is
> suitable and should be used.

Agreed, but last time I tried systemtap I failed miserably.
It was on a debian distro.

Anyway, it seems we lack a LINUX_MIB_TCPRETRANSFAIL counter.

"netstat -s" is an incredible universal tool.

[PATCH net-next] tcp: add LINUX_MIB_TCPRETRANSFAIL counter

It might be useful to get a counter of failed tcp_retransmit_skb()
calls.

Reported-by: Satoru Moriya <satoru.moriya@hds.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/snmp.h  |    1 +
 net/ipv4/proc.c       |    1 +
 net/ipv4/tcp_output.c |    4 +++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index c1241c42..8ee8af4 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -232,6 +232,7 @@ enum
 	LINUX_MIB_TCPTIMEWAITOVERFLOW,		/* TCPTimeWaitOverflow */
 	LINUX_MIB_TCPREQQFULLDOCOOKIES,		/* TCPReqQFullDoCookies */
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
+	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 6afc807..02d6107 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -256,6 +256,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPTimeWaitOverflow", LINUX_MIB_TCPTIMEWAITOVERFLOW),
 	SNMP_MIB_ITEM("TCPReqQFullDoCookies", LINUX_MIB_TCPREQQFULLDOCOOKIES),
 	SNMP_MIB_ITEM("TCPReqQFullDrop", LINUX_MIB_TCPREQQFULLDROP),
+	SNMP_MIB_ITEM("TCPRetransFail", LINUX_MIB_TCPRETRANSFAIL),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8c8de27..561550a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2308,8 +2308,10 @@ begin_fwd:
 		if (sacked & (TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS))
 			continue;
 
-		if (tcp_retransmit_skb(sk, skb))
+		if (tcp_retransmit_skb(sk, skb)) {
+			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
 			return;
+		}
 		NET_INC_STATS_BH(sock_net(sk), mib_idx);
 
 		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-01-25 14:44   ` Eric Dumazet
@ 2012-01-26 18:51     ` David Miller
  2012-02-03 20:31     ` Satoru Moriya
  1 sibling, 0 replies; 30+ messages in thread
From: David Miller @ 2012-01-26 18:51 UTC (permalink / raw)
  To: eric.dumazet
  Cc: hagen, satoru.moriya, netdev, nhorman, tgraf, stephen.hemminger,
	seiji.aguchi

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 25 Jan 2012 15:44:20 +0100

> [PATCH net-next] tcp: add LINUX_MIB_TCPRETRANSFAIL counter
> 
> It might be useful to get a counter of failed tcp_retransmit_skb()
> calls.
> 
> Reported-by: Satoru Moriya <satoru.moriya@hds.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-01-25 14:44   ` Eric Dumazet
  2012-01-26 18:51     ` David Miller
@ 2012-02-03 20:31     ` Satoru Moriya
  1 sibling, 0 replies; 30+ messages in thread
From: Satoru Moriya @ 2012-02-03 20:31 UTC (permalink / raw)
  To: Eric Dumazet, Hagen Paul Pfeifer
  Cc: netdev, davem, nhorman, tgraf, Stephen Hemminger, Seiji Aguchi

On 01/25/2012 09:44 AM, Eric Dumazet wrote:
> Agreed, but last time I tried systemtap I failed miserably.
> It was on a debian distro.
> 
> Anyway, it seems we lack a LINUX_MIB_TCPRETRANSFAIL counter.
> 
> "netstat -s" is an incredible universal tool.
> 
> [PATCH net-next] tcp: add LINUX_MIB_TCPRETRANSFAIL counter
> 
> It might be useful to get a counter of failed tcp_retransmit_skb() 
> calls.

Thank you, Eric.

This is a good first step for us to grab the failure of
retransmission. But unfortunately, it's not enough because
when we analyze incidents and report to our customers,
we need more detailed information such as when it happened,
why it failed etc.

Therefore I proposed this tracepoint.

Regards,
Satoru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-01-25 13:27 ` Hagen Paul Pfeifer
  2012-01-25 14:44   ` Eric Dumazet
@ 2012-02-03 20:43   ` Satoru Moriya
  2012-02-03 20:55     ` Hagen Paul Pfeifer
  1 sibling, 1 reply; 30+ messages in thread
From: Satoru Moriya @ 2012-02-03 20:43 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: netdev, davem, nhorman, tgraf, Stephen Hemminger, eric.dumazet,
	Seiji Aguchi

On 01/25/2012 08:27 AM, Hagen Paul Pfeifer wrote:
> It is crazy to add everywhere new tracepoints. Systemtap is far from 
> being perfect and as smooth as dtrace. But this is an example where 
> systemtap is suitable and should be used.
> 
> Satoru, you wrote systemtap is not suitable - why?

Actually, we've already used systemtap in our flight recorder.
But we believe that tcp retransmission is one of the fundamental
function in tcp stack and so kernel itself should provide the
instruments from which we can get enough information without
tools which is not included in kernel.

Regards,
Satoru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-03 20:43   ` Satoru Moriya
@ 2012-02-03 20:55     ` Hagen Paul Pfeifer
  0 siblings, 0 replies; 30+ messages in thread
From: Hagen Paul Pfeifer @ 2012-02-03 20:55 UTC (permalink / raw)
  To: Satoru Moriya
  Cc: netdev, davem, nhorman, tgraf, Stephen Hemminger, eric.dumazet,
	Seiji Aguchi


On Fri, 3 Feb 2012 15:43:30 -0500, Satoru Moriya wrote:



> Actually, we've already used systemtap in our flight recorder.

> But we believe that tcp retransmission is one of the fundamental

> function in tcp stack and so kernel itself should provide the

> instruments from which we can get enough information without

> tools which is not included in kernel.



Mhh, that's no real reason to add tracepoints all over. There where some

lengthy debates on lkml about inflationary use of tracepoints. Especially

this case where systemtap can provide the same information (btw: for

tracepoints you may also need some userspace tools). Maybe you should start

to improve systemtap. ;) But hey, it is David who decides if he is fine

with your patch.



Hagen

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-01-20 18:50 ` [PATCH v2 0/2] Tracepoint " David Miller
@ 2012-02-03 21:47   ` Satoru Moriya
  2012-02-04  4:40     ` Yuchung Cheng
  2012-02-04 14:28     ` Neil Horman
  0 siblings, 2 replies; 30+ messages in thread
From: Satoru Moriya @ 2012-02-03 21:47 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, nhorman, tgraf, stephen.hemminger, hagen, eric.dumazet,
	Seiji Aguchi

On 01/20/2012 01:50 PM, David Miller wrote:
> You were given an alternative way to trace these kinds of events, and 
> you have yet to give us a solid reason why that cannot work for you.

OK. I'll try to explain it.

First of all, we'd like to use this tracepoint with our
flight recorder.

tcpdump:
 tcpdump captures all the packets and so its overhead is not
 acceptable. Also we can't keep the data on memory but must
 write the data to file for each time. It introduce other
 overhead which we can't accept.

commit 63e03724b51, dropwatch, skb:kfree_skb:
 With this tracepoint, we can detect packet drop.
 But it may be too late because with tcp kernel retransmits
 packets repeatedly if it can't get ack and after that it
 may drops packets in a no-win situation.
 Also sometimes customer finds delays which is caused by
 temporal packet drop and retransmission. With this tracepoint
 we can explain it based on the real data.

netstat:
 This is a good tool for the first step to analyze what
 happened. But it shows only statistics and it's not enough
 for us to analyze incidents and explain it to our customers.
 We need each packet drop data(when it happen, whether it
 succeeded or not etc.)

systemtap:
 Actually, we've already used systemtap in our flight recorder.
 But we believe that tcp retransmission is one of the fundamental
 function in tcp stack and so kernel itself should provide the
 instruments from which we can get enough information.

Regards,
Satoru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-03 21:47   ` Satoru Moriya
@ 2012-02-04  4:40     ` Yuchung Cheng
  2012-02-06 18:32       ` Satoru Moriya
  2012-02-04 14:28     ` Neil Horman
  1 sibling, 1 reply; 30+ messages in thread
From: Yuchung Cheng @ 2012-02-04  4:40 UTC (permalink / raw)
  To: Satoru Moriya
  Cc: David Miller, netdev, nhorman, tgraf, stephen.hemminger, hagen,
	eric.dumazet, Seiji Aguchi

Hi Satoru,

I totally understand that you need deeper instrumentation for *your*
business. At Google, we instrument a lot more in TCP as well. But we
have yet to upstream these changes because it's not universally useful
for default Linux kernel. I failed to see how different failures of a
TCP retransmission must be recorded and exported. Could you elaborate
your last point?

On Fri, Feb 3, 2012 at 4:47 PM, Satoru Moriya <satoru.moriya@hds.com> wrote:
>
> On 01/20/2012 01:50 PM, David Miller wrote:
> > You were given an alternative way to trace these kinds of events, and
> > you have yet to give us a solid reason why that cannot work for you.
>
> OK. I'll try to explain it.
>
> First of all, we'd like to use this tracepoint with our
> flight recorder.
>
> tcpdump:
>  tcpdump captures all the packets and so its overhead is not
>  acceptable. Also we can't keep the data on memory but must
>  write the data to file for each time. It introduce other
>  overhead which we can't accept.
>
> commit 63e03724b51, dropwatch, skb:kfree_skb:
>  With this tracepoint, we can detect packet drop.
>  But it may be too late because with tcp kernel retransmits
>  packets repeatedly if it can't get ack and after that it
>  may drops packets in a no-win situation.
>  Also sometimes customer finds delays which is caused by
>  temporal packet drop and retransmission. With this tracepoint
>  we can explain it based on the real data.
>
> netstat:
>  This is a good tool for the first step to analyze what
>  happened. But it shows only statistics and it's not enough
>  for us to analyze incidents and explain it to our customers.
>  We need each packet drop data(when it happen, whether it
>  succeeded or not etc.)
>
> systemtap:
>  Actually, we've already used systemtap in our flight recorder.
>  But we believe that tcp retransmission is one of the fundamental
>  function in tcp stack and so kernel itself should provide the
>  instruments from which we can get enough information.
>
> Regards,
> Satoru
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-03 21:47   ` Satoru Moriya
  2012-02-04  4:40     ` Yuchung Cheng
@ 2012-02-04 14:28     ` Neil Horman
  2012-02-04 15:58       ` Hagen Paul Pfeifer
  1 sibling, 1 reply; 30+ messages in thread
From: Neil Horman @ 2012-02-04 14:28 UTC (permalink / raw)
  To: Satoru Moriya
  Cc: David Miller, netdev, tgraf, stephen.hemminger, hagen,
	eric.dumazet, Seiji Aguchi

On Fri, Feb 03, 2012 at 04:47:04PM -0500, Satoru Moriya wrote:
> On 01/20/2012 01:50 PM, David Miller wrote:
> > You were given an alternative way to trace these kinds of events, and 
> > you have yet to give us a solid reason why that cannot work for you.
> 
> OK. I'll try to explain it.
> 
> First of all, we'd like to use this tracepoint with our
> flight recorder.
> 
> tcpdump:
>  tcpdump captures all the packets and so its overhead is not
>  acceptable. Also we can't keep the data on memory but must
>  write the data to file for each time. It introduce other
>  overhead which we can't accept.
> 
So, I hadn't really considered this approach (missed the suggestion previously).
Its not really accurate to disregard this solution entirely.  While the overhead
of tcpdump (or libpcap more specifically) might be too much, it speaks to a
possible solution that doesn't require adding additional tracepoints: a
netfilter hook.  What about writing a kernel module to hook at various points
(I'd guess IP_PREROUTE would be best), to detect duplicate sequence numbers on a
particular connection.  You could export the information via a proc file, or do
it asynchronously with a netlink socket or some such.  Thats the sort of module
that pleasantly isolated (allow those not interested to not have to include it
in their builds or see it in the source), efficiently provides the information
you need, and can be expanded to other types of traffic should you need it in
the future.

Thoughts?
Neil

> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-04 14:28     ` Neil Horman
@ 2012-02-04 15:58       ` Hagen Paul Pfeifer
  2012-02-04 20:09         ` Neil Horman
  0 siblings, 1 reply; 30+ messages in thread
From: Hagen Paul Pfeifer @ 2012-02-04 15:58 UTC (permalink / raw)
  To: Neil Horman
  Cc: Satoru Moriya, David Miller, netdev, tgraf, stephen.hemminger,
	eric.dumazet, Seiji Aguchi, fche

* Neil Horman | 2012-02-04 09:28:23 [-0500]:

>So, I hadn't really considered this approach (missed the suggestion previously).
>Its not really accurate to disregard this solution entirely.  While the overhead
>of tcpdump (or libpcap more specifically) might be too much, it speaks to a
>possible solution that doesn't require adding additional tracepoints: a
>netfilter hook.  What about writing a kernel module to hook at various points
>(I'd guess IP_PREROUTE would be best), to detect duplicate sequence numbers on a
>particular connection.  You could export the information via a proc file, or do
>it asynchronously with a netlink socket or some such.  Thats the sort of module
>that pleasantly isolated (allow those not interested to not have to include it
>in their builds or see it in the source), efficiently provides the information
>you need, and can be expanded to other types of traffic should you need it in
>the future.

That sounds broken by design. Copy TCP sequence number mechanism one more time
as a netfilter module? You probably catch the retransmission drop with this,
but users *want* more: in one week Satoru realize that the retransmission
tracepoint doesn't cover all required information for their customer. Other
user may need information about TCP rate-halfing-SACK-rto-foo drops. You will
and up realizing that in-packet data (like sequence number) is not sufficient
for correct analysis. Ask Yuchung how many tracepoints they use. Maybe Google
guys should publish their git network-stack-trace topic branch, like the rt
branch: everybody who want more tracing should pull from this branch. (ok, too
revolutionary ;-)

What is required is a context aware tracing mechanism with no overhead and no
code modification ... Mmhh ... systemtap! ;-)

The problem is the usability of systemtap. One problem is that the scripts are
maintained out of kernel development (maybe) and thus source code
modifications are not aligned (kernel and systemtap scripts). IMO that's one
of the drawbacks of systemap. But maybe this is no problem in praxis:
end-customers normally use stable kernels for a longer time.

Hagen

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-04 15:58       ` Hagen Paul Pfeifer
@ 2012-02-04 20:09         ` Neil Horman
  2012-02-05 12:53           ` Frank Ch. Eigler
  0 siblings, 1 reply; 30+ messages in thread
From: Neil Horman @ 2012-02-04 20:09 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Satoru Moriya, David Miller, netdev, tgraf, stephen.hemminger,
	eric.dumazet, Seiji Aguchi, fche

On Sat, Feb 04, 2012 at 04:58:07PM +0100, Hagen Paul Pfeifer wrote:
> * Neil Horman | 2012-02-04 09:28:23 [-0500]:
> 
> >So, I hadn't really considered this approach (missed the suggestion previously).
> >Its not really accurate to disregard this solution entirely.  While the overhead
> >of tcpdump (or libpcap more specifically) might be too much, it speaks to a
> >possible solution that doesn't require adding additional tracepoints: a
> >netfilter hook.  What about writing a kernel module to hook at various points
> >(I'd guess IP_PREROUTE would be best), to detect duplicate sequence numbers on a
> >particular connection.  You could export the information via a proc file, or do
> >it asynchronously with a netlink socket or some such.  Thats the sort of module
> >that pleasantly isolated (allow those not interested to not have to include it
> >in their builds or see it in the source), efficiently provides the information
> >you need, and can be expanded to other types of traffic should you need it in
> >the future.
> 
> That sounds broken by design. Copy TCP sequence number mechanism one more time
> as a netfilter module? You probably catch the retransmission drop with this,
Only enough of it to do some reasonable connection tracking, so that you can
detect retransmissions (or whatever other info you want).

Note I'm not strictly suggesting that this is an upstreamable approach, but it
is a solution that lets them get the info they want without having to modify the
kernel code, or use systemtap (which Satoru has previously said they wish to
stop doing).  The cost is maintaining a kernel module (either by getting it
upstream or maintaining it themselves).

> but users *want* more: in one week Satoru realize that the retransmission
> tracepoint doesn't cover all required information for their customer. Other
> user may need information about TCP rate-halfing-SACK-rto-foo drops. You will
> and up realizing that in-packet data (like sequence number) is not sufficient
> for correct analysis. Ask Yuchung how many tracepoints they use. Maybe Google
> guys should publish their git network-stack-trace topic branch, like the rt
> branch: everybody who want more tracing should pull from this branch. (ok, too
> revolutionary ;-)
> 
> What is required is a context aware tracing mechanism with no overhead and no
> code modification ... Mmhh ... systemtap! ;-)
> 
Systemtap is fine for development.  Lots of application/product vendors really
don't like it.  On top of the difficult to use aspect, it requires that
debugging stabs info be kept with the kernel, which is a real pain, especially
if you're running on a stable kernel.  Not saying you can't do it, but I
understand why its undesireable.

> The problem is the usability of systemtap. One problem is that the scripts are
> maintained out of kernel development (maybe) and thus source code
> modifications are not aligned (kernel and systemtap scripts). IMO that's one
> of the drawbacks of systemap. But maybe this is no problem in praxis:
> end-customers normally use stable kernels for a longer time.
> 
Again, systemtap is fine, and if Satoru and company want to go that route, fine,
this conversation is done, but he's already said they've gone that route and
want to discontinue that approach.  I'm just offering them another avenue,
maintain your own kernel module that filters on the kind of frames you're after
and report them in a useful format to you.
Neil

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-04 20:09         ` Neil Horman
@ 2012-02-05 12:53           ` Frank Ch. Eigler
  2012-02-05 19:17             ` Neil Horman
  0 siblings, 1 reply; 30+ messages in thread
From: Frank Ch. Eigler @ 2012-02-05 12:53 UTC (permalink / raw)
  To: Neil Horman
  Cc: Hagen Paul Pfeifer, Satoru Moriya, David Miller, netdev, tgraf,
	stephen.hemminger, eric.dumazet, Seiji Aguchi, fche

Hi -

On Sat, Feb 04, 2012 at 03:09:37PM -0500, Neil Horman wrote:
> [...]

> Systemtap is fine for development.  Lots of application/product
> vendors really don't like it.  On top of the difficult to use
> aspect

(Just curious, what difficulty-to-use aspect have you more recently
come across?)

> it requires that debugging stabs info be kept with the kernel, which
> is a real pain, especially if you're running on a stable kernel.
> [...]

You probably mean when running on an unstable kernel, no?  Installing
debuginfo for a stable kernel is a one-time event.  It also enables
"perf probe" and crash(8) and other tools.  Plus, with systemtap,
there are two separate network-remoting mechanisms (--use-server
compilation and --remote execution) that make local debuginfo
unnecessary.

Anyway, a reasonable way to go may be to prototype in stap whatever
hard-coded kernel module y'all envision finally doing this work.

- FChE

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-05 12:53           ` Frank Ch. Eigler
@ 2012-02-05 19:17             ` Neil Horman
  2012-02-05 20:04               ` Hagen Paul Pfeifer
  0 siblings, 1 reply; 30+ messages in thread
From: Neil Horman @ 2012-02-05 19:17 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Hagen Paul Pfeifer, Satoru Moriya, David Miller, netdev, tgraf,
	stephen.hemminger, eric.dumazet, Seiji Aguchi, fche

On Sun, Feb 05, 2012 at 07:53:25AM -0500, Frank Ch. Eigler wrote:
> Hi -
> 
> On Sat, Feb 04, 2012 at 03:09:37PM -0500, Neil Horman wrote:
> > [...]
> 
> > Systemtap is fine for development.  Lots of application/product
> > vendors really don't like it.  On top of the difficult to use
> > aspect
> 
> (Just curious, what difficulty-to-use aspect have you more recently
> come across?)
> 
Generally the fact that probe points may shift underneath a script between
kernels.  While its often ok, theres no guaranteed fixed set of probe points.
While it often works, theres always the possibility that functional changes,
even if they don't reults in behavioral changes to userspace, require re-working
of stap scripts.  Its not the sort of thing that works will for people trying to
build applications.

> > it requires that debugging stabs info be kept with the kernel, which
> > is a real pain, especially if you're running on a stable kernel.
> > [...]
> 
> You probably mean when running on an unstable kernel, no?  Installing
> debuginfo for a stable kernel is a one-time event.  It also enables
No its a one time-per kernel event.  Unless you plan on never updating your
kernel, you still have to install the corresponding debuginfo with each kernel
that you update to.

> "perf probe" and crash(8) and other tools.  Plus, with systemtap,
> there are two separate network-remoting mechanisms (--use-server
> compilation and --remote execution) that make local debuginfo
> unnecessary.
> 
Yes, but as I noted above, these are development and problem determination
tools, things that either a developer will use or a sysadmin will install on
demand when trying to diagnose a problem in the system.  People balk when buying
an application and find out that to run it requires the addition of a few Gigs
of debuginfo information, even if it enables other tools too.

> Anyway, a reasonable way to go may be to prototype in stap whatever
> hard-coded kernel module y'all envision finally doing this work.
> 
Absolutely, and from what I understand this is what Satoru has already done.
Their flight recorder does tcp retransmit stat counting using systemtap
currently, and he's now looking to codify the feature as they currently have it
using a slightly more stable API.  The tracepoint has been balked at, so I was
suggesting using netfilter as an alternate solution

Neil

> - FChE
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-05 19:17             ` Neil Horman
@ 2012-02-05 20:04               ` Hagen Paul Pfeifer
  2012-02-05 21:48                 ` David Miller
  2012-02-06  1:32                 ` Neil Horman
  0 siblings, 2 replies; 30+ messages in thread
From: Hagen Paul Pfeifer @ 2012-02-05 20:04 UTC (permalink / raw)
  To: Neil Horman
  Cc: Frank Ch. Eigler, Satoru Moriya, David Miller, netdev, tgraf,
	stephen.hemminger, eric.dumazet, Seiji Aguchi, fche,
	mathieu.desnoyers, rostedt

* Neil Horman | 2012-02-05 14:17:08 [-0500]:

>Generally the fact that probe points may shift underneath a script between
>kernels.  While its often ok, theres no guaranteed fixed set of probe points.
>While it often works, theres always the possibility that functional changes,
>even if they don't reults in behavioral changes to userspace, require re-working
>of stap scripts.  Its not the sort of thing that works will for people trying to
>build applications.

Hey Neil, I though about this more deeply. In my first iteration I had the
same answer. But IMHO if you think about this more deeply, I think this is not
a scalable approach (implementing in userspace as systemtap scripts compared
to implement them in kernelspace (tracepoints). Why? Because at a deep TCP
level the API (and thus the context) is not static and the meaning of return
values (or arguments) change over time. No matter _where_ it is implemented or
coded.

Lets imaging the following (I construct one example, but this may universal):
imagine we have two retransmission functions, one for RTO triggered
retransmission and one for dup-ack fast-retransmission. Both functions where
instrumented as tracepoints for years. After some years someone say "hey both
functions are similar, lets fix this and make one function". What should
happened with the tracepoints? Should they merged into one? What should be the
name of the tracepoint?

In the end you will end up fixing userspace analysis tools as well! IMHO there
is - in my opinion - no clean way to add tracepoints and think you are free of
maintaining the whole chain, from kernelspace to userspace.

Only "stable" in kernel API's may be a candidate for tracepoints (like kmalloc
and friends). Highly dynamic and often changed code not.

The biggest advantage from tracepoints is that the function is "annotated"
with something like "this is a tracepoint, please be nice and don't break the
behavioral meaning of this function". On the other side this is also a
disadvantage because it elevates a kernel function to a "userspace API": Tools
will start using this tracepoint, you may end up in a situation where you
can't change/delete tracepoints.

Hagen

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-05 20:04               ` Hagen Paul Pfeifer
@ 2012-02-05 21:48                 ` David Miller
  2012-02-06  1:32                 ` Neil Horman
  1 sibling, 0 replies; 30+ messages in thread
From: David Miller @ 2012-02-05 21:48 UTC (permalink / raw)
  To: hagen
  Cc: nhorman, fche, satoru.moriya, netdev, tgraf, stephen.hemminger,
	eric.dumazet, seiji.aguchi, fche, mathieu.desnoyers, rostedt


Let me put an end to this dicussion right now, I'm not applying
any patches which just add tracepoints into the TCP stack.

That's not the solution, and you can hoot and hollar all you want
about it, I simply do not care.

The solution is a way to manage socket events, and be able to choose
which sockets emit those events, in a useful way, via the perf event
infrastructure.

This means we need a perf event socket layer that allows filtering on
socket flow ID, which enables per-socket event and statistics
gathering, and so forth.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-05 20:04               ` Hagen Paul Pfeifer
  2012-02-05 21:48                 ` David Miller
@ 2012-02-06  1:32                 ` Neil Horman
  2012-02-06 15:20                   ` Frank Ch. Eigler
  1 sibling, 1 reply; 30+ messages in thread
From: Neil Horman @ 2012-02-06  1:32 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Frank Ch. Eigler, Satoru Moriya, David Miller, netdev, tgraf,
	stephen.hemminger, eric.dumazet, Seiji Aguchi, fche,
	mathieu.desnoyers, rostedt

On Sun, Feb 05, 2012 at 09:04:29PM +0100, Hagen Paul Pfeifer wrote:
> * Neil Horman | 2012-02-05 14:17:08 [-0500]:
> 
> >Generally the fact that probe points may shift underneath a script between
> >kernels.  While its often ok, theres no guaranteed fixed set of probe points.
> >While it often works, theres always the possibility that functional changes,
> >even if they don't reults in behavioral changes to userspace, require re-working
> >of stap scripts.  Its not the sort of thing that works will for people trying to
> >build applications.
> 
> Hey Neil, I though about this more deeply. In my first iteration I had the
> same answer. But IMHO if you think about this more deeply, I think this is not
> a scalable approach (implementing in userspace as systemtap scripts compared
> to implement them in kernelspace (tracepoints). Why? Because at a deep TCP
> level the API (and thus the context) is not static and the meaning of return
> values (or arguments) change over time. No matter _where_ it is implemented or
> coded.
> 
Yes, this is why I'm suggesting alternatives to the tracepoint.

> Lets imaging the following (I construct one example, but this may universal):
> imagine we have two retransmission functions, one for RTO triggered
> retransmission and one for dup-ack fast-retransmission. Both functions where
> instrumented as tracepoints for years. After some years someone say "hey both
> functions are similar, lets fix this and make one function". What should
> happened with the tracepoints? Should they merged into one? What should be the
> name of the tracepoint?
> 
Yes, this is a perenial problem with tracepoints (and, in a simmilar fashion,
stap scripts).

> In the end you will end up fixing userspace analysis tools as well! IMHO there
> is - in my opinion - no clean way to add tracepoints and think you are free of
> maintaining the whole chain, from kernelspace to userspace.
> 
Yes.

> Only "stable" in kernel API's may be a candidate for tracepoints (like kmalloc
> and friends). Highly dynamic and often changed code not.
> 
Yes.

> The biggest advantage from tracepoints is that the function is "annotated"
> with something like "this is a tracepoint, please be nice and don't break the
> behavioral meaning of this function". On the other side this is also a
> disadvantage because it elevates a kernel function to a "userspace API": Tools
> will start using this tracepoint, you may end up in a situation where you
> can't change/delete tracepoints.
> 
Yes, tracepoints only make sense for API's that are clearly stable and well
defined.  This doesn't fit that bill.
 
The long and the short of it is, a tracepoint isn't going to be the solution for
the problem Satoru needs to solve.  Dave just made that abundantly clear.
Another solution needs to be found.  A perf socket layer seems to be a good long
term solution.  If something is needed in the short term for Satoru, I still
think a privately maintained module that registers a netfilter hook to watch
outgoing packets for retransmits offers a good solution.

Regards
Neil

> Hagen
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-06  1:32                 ` Neil Horman
@ 2012-02-06 15:20                   ` Frank Ch. Eigler
  2012-02-06 15:24                     ` Eric Dumazet
  2012-02-06 15:53                     ` Neil Horman
  0 siblings, 2 replies; 30+ messages in thread
From: Frank Ch. Eigler @ 2012-02-06 15:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: Hagen Paul Pfeifer, Frank Ch. Eigler, Satoru Moriya,
	David Miller, netdev, tgraf, stephen.hemminger, eric.dumazet,
	Seiji Aguchi, fche, mathieu.desnoyers, rostedt

Hi, Neil -

On Sun, Feb 05, 2012 at 08:32:47PM -0500, Neil Horman wrote:

> [...]
> > Hey Neil, I though about this more deeply. In my first iteration I
> > had the same answer. But IMHO if you think about this more deeply,
> > I think this is not a scalable approach (implementing in userspace
> > as systemtap scripts compared to implement them in kernelspace
> > (tracepoints). Why? Because at a deep TCP level the API (and thus
> > the context) is not static and the meaning of return values (or
> > arguments) change over time. No matter _where_ it is implemented
> > or coded.

> Yes, this is why I'm suggesting alternatives to the tracepoint.

I believe Hagen's point was that alternatives to tracepoints would
have substantially the same difficulty.  At some point in the stack,
you have to standardize/abstract.  What you hook up *at that point* is
not too important: tracepoints, or function calls, or expanded perf
machinery all seem to have the same burden.


> [...] If something is needed in the short term for Satoru, I still
> think a privately maintained module that registers a netfilter hook
> to watch outgoing packets for retransmits offers a good solution.

Does this mean that this netfilter hook mechanism is sufficient to
adapt to the current/future diversity of behaviors?  Why not make
*that* into a tracepoint then, so perf/stap scripts could get at it in
userspace?


- FChE

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-06 15:20                   ` Frank Ch. Eigler
@ 2012-02-06 15:24                     ` Eric Dumazet
  2012-02-06 15:38                       ` Neil Horman
  2012-02-06 15:53                     ` Neil Horman
  1 sibling, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2012-02-06 15:24 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Neil Horman, Hagen Paul Pfeifer, Frank Ch. Eigler, Satoru Moriya,
	David Miller, netdev, tgraf, stephen.hemminger, Seiji Aguchi,
	fche, mathieu.desnoyers, rostedt

Le lundi 06 février 2012 à 10:20 -0500, Frank Ch. Eigler a écrit :

> > [...] If something is needed in the short term for Satoru, I still
> > think a privately maintained module that registers a netfilter hook
> > to watch outgoing packets for retransmits offers a good solution.
> 
> Does this mean that this netfilter hook mechanism is sufficient to
> adapt to the current/future diversity of behaviors?  Why not make
> *that* into a tracepoint then, so perf/stap scripts could get at it in
> userspace?

Anyway, this netfilter module wont catch the failed retransmits (because
of memory allocation errors or other checks done in our stack)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-06 15:24                     ` Eric Dumazet
@ 2012-02-06 15:38                       ` Neil Horman
  0 siblings, 0 replies; 30+ messages in thread
From: Neil Horman @ 2012-02-06 15:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Frank Ch. Eigler, Hagen Paul Pfeifer, Frank Ch. Eigler,
	Satoru Moriya, David Miller, netdev, tgraf, stephen.hemminger,
	Seiji Aguchi, fche, mathieu.desnoyers, rostedt

On Mon, Feb 06, 2012 at 04:24:19PM +0100, Eric Dumazet wrote:
> Le lundi 06 février 2012 à 10:20 -0500, Frank Ch. Eigler a écrit :
> 
> > > [...] If something is needed in the short term for Satoru, I still
> > > think a privately maintained module that registers a netfilter hook
> > > to watch outgoing packets for retransmits offers a good solution.
> > 
> > Does this mean that this netfilter hook mechanism is sufficient to
> > adapt to the current/future diversity of behaviors?  Why not make
> > *that* into a tracepoint then, so perf/stap scripts could get at it in
> > userspace?
> 
> Anyway, this netfilter module wont catch the failed retransmits (because
> of memory allocation errors or other checks done in our stack)
> 
But the existing kfree_skb tracepoint will.  As will the TCPRETANSFAIL counter.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-06 15:20                   ` Frank Ch. Eigler
  2012-02-06 15:24                     ` Eric Dumazet
@ 2012-02-06 15:53                     ` Neil Horman
  2012-02-06 16:18                       ` Steven Rostedt
                                         ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Neil Horman @ 2012-02-06 15:53 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Hagen Paul Pfeifer, Frank Ch. Eigler, Satoru Moriya,
	David Miller, netdev, tgraf, stephen.hemminger, eric.dumazet,
	Seiji Aguchi, fche, mathieu.desnoyers, rostedt

On Mon, Feb 06, 2012 at 10:20:19AM -0500, Frank Ch. Eigler wrote:
> Hi, Neil -
> 
> On Sun, Feb 05, 2012 at 08:32:47PM -0500, Neil Horman wrote:
> 
> > [...]
> > > Hey Neil, I though about this more deeply. In my first iteration I
> > > had the same answer. But IMHO if you think about this more deeply,
> > > I think this is not a scalable approach (implementing in userspace
> > > as systemtap scripts compared to implement them in kernelspace
> > > (tracepoints). Why? Because at a deep TCP level the API (and thus
> > > the context) is not static and the meaning of return values (or
> > > arguments) change over time. No matter _where_ it is implemented
> > > or coded.
> 
> > Yes, this is why I'm suggesting alternatives to the tracepoint.
> 
> I believe Hagen's point was that alternatives to tracepoints would
> have substantially the same difficulty.  At some point in the stack,
> you have to standardize/abstract.  What you hook up *at that point* is
> not too important: tracepoints, or function calls, or expanded perf
> machinery all seem to have the same burden.
> 
Yes, that was his point.  Mine was that, given Satoru's needs, a netfilter hook
provides 90% of what they need (as its been described here).  If you're
unfamiliar with them, netfilter hooks are those standard points in the network
input/output/forwarding paths at which we can watch and maniupulate network
traffic.  They're what iptables/ip6tables/ebtables/etc use to do everything they
do.  I'm suggesting that they use those existing hooks to monitor outgoing
traffic for whatever information they want (in this case retransmitted unacked
tcp sequence numbers).  Or whatever else their interested in.

As for the other 10% (recording failed retransmits, which don't generate
traffic), we have existing tracepoints that can be used.

> 
> > [...] If something is needed in the short term for Satoru, I still
> > think a privately maintained module that registers a netfilter hook
> > to watch outgoing packets for retransmits offers a good solution.
> 
> Does this mean that this netfilter hook mechanism is sufficient to
> adapt to the current/future diversity of behaviors?  Why not make
> *that* into a tracepoint then, so perf/stap scripts could get at it in
> userspace?
> 
I suppose you could.  In fact there is a queue target that lets you shuttle
frames from netfilter hook points up to userspace already, so you could do that.
I'm suggesting a kernel module because then you wouldn't have the performance
overhead of moving all those frames to user space.
Neil

> 
> - FChE
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-06 15:53                     ` Neil Horman
@ 2012-02-06 16:18                       ` Steven Rostedt
  2012-02-06 17:02                         ` Eric Dumazet
  2012-02-06 16:21                       ` Frank Ch. Eigler
  2012-02-06 18:21                       ` Satoru Moriya
  2 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2012-02-06 16:18 UTC (permalink / raw)
  To: Neil Horman
  Cc: Frank Ch. Eigler, Hagen Paul Pfeifer, Frank Ch. Eigler,
	Satoru Moriya, David Miller, netdev, tgraf, stephen.hemminger,
	eric.dumazet, Seiji Aguchi, fche, mathieu.desnoyers

On Mon, 2012-02-06 at 10:53 -0500, Neil Horman wrote:

> Yes, that was his point.  Mine was that, given Satoru's needs, a netfilter hook
> provides 90% of what they need (as its been described here).  If you're
> unfamiliar with them, netfilter hooks are those standard points in the network
> input/output/forwarding paths at which we can watch and maniupulate network
> traffic.  They're what iptables/ip6tables/ebtables/etc use to do everything they
> do.  I'm suggesting that they use those existing hooks to monitor outgoing
> traffic for whatever information they want (in this case retransmitted unacked
> tcp sequence numbers).  Or whatever else their interested in.
> 
> As for the other 10% (recording failed retransmits, which don't generate
> traffic), we have existing tracepoints that can be used.

I haven't looked at the details here, but I'm wondering if netfilter
could benefit from static_branch() calls (aka jump-labels). I'm sure
selinux could too. Basically, when netfilter is disabled, the fast path
would just include a nop, with no if() branch testing at all. When you
enable netfilters, it would then add a jmp to the code that actually
does the tests.

-- Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-06 15:53                     ` Neil Horman
  2012-02-06 16:18                       ` Steven Rostedt
@ 2012-02-06 16:21                       ` Frank Ch. Eigler
  2012-02-06 18:21                       ` Satoru Moriya
  2 siblings, 0 replies; 30+ messages in thread
From: Frank Ch. Eigler @ 2012-02-06 16:21 UTC (permalink / raw)
  To: Neil Horman
  Cc: Hagen Paul Pfeifer, Frank Ch. Eigler, Satoru Moriya,
	David Miller, netdev, tgraf, stephen.hemminger, eric.dumazet,
	Seiji Aguchi, fche, mathieu.desnoyers, rostedt

Hi -

On Mon, Feb 06, 2012 at 10:53:05AM -0500, Neil Horman wrote:

> [...] If you're unfamiliar with them, netfilter hooks are those
> standard points in the network input/output/forwarding paths at
> which we can watch and maniupulate network traffic.  [...]

Aha, nf_register_hook and friends.  Thanks for the pointer.

> > Does this mean that this netfilter hook mechanism is sufficient to
> > adapt to the current/future diversity of behaviors?  Why not make
> > *that* into a tracepoint then, so perf/stap scripts could get at it in
> > userspace?

> I suppose you could.  [...]  I'm suggesting a kernel module because
> then you wouldn't have the performance overhead of moving all those
> frames to user space.

(For what it's worth, if there were a tracepoint in the netfilter 
hook machinery, stap would be able to process the packets likewise,
in-kernel, with no kernel debuginfo installed.)

- FChE

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-06 16:18                       ` Steven Rostedt
@ 2012-02-06 17:02                         ` Eric Dumazet
  2012-02-06 17:18                           ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Dumazet @ 2012-02-06 17:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Neil Horman, Frank Ch. Eigler, Hagen Paul Pfeifer,
	Frank Ch. Eigler, Satoru Moriya, David Miller, netdev, tgraf,
	stephen.hemminger, Seiji Aguchi, fche, mathieu.desnoyers

Le lundi 06 février 2012 à 11:18 -0500, Steven Rostedt a écrit :
> On Mon, 2012-02-06 at 10:53 -0500, Neil Horman wrote:
> 
> > Yes, that was his point.  Mine was that, given Satoru's needs, a netfilter hook
> > provides 90% of what they need (as its been described here).  If you're
> > unfamiliar with them, netfilter hooks are those standard points in the network
> > input/output/forwarding paths at which we can watch and maniupulate network
> > traffic.  They're what iptables/ip6tables/ebtables/etc use to do everything they
> > do.  I'm suggesting that they use those existing hooks to monitor outgoing
> > traffic for whatever information they want (in this case retransmitted unacked
> > tcp sequence numbers).  Or whatever else their interested in.
> > 
> > As for the other 10% (recording failed retransmits, which don't generate
> > traffic), we have existing tracepoints that can be used.
> 
> I haven't looked at the details here, but I'm wondering if netfilter
> could benefit from static_branch() calls (aka jump-labels). I'm sure
> selinux could too. Basically, when netfilter is disabled, the fast path
> would just include a nop, with no if() branch testing at all. When you
> enable netfilters, it would then add a jmp to the code that actually
> does the tests.

Hmm... I thought I already did that.

Or maybe I intended so and forgot.

Ah no, its there in commit a2d7ec58ac09
(netfilter: use jump_label for nf_hooks)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-06 17:02                         ` Eric Dumazet
@ 2012-02-06 17:18                           ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2012-02-06 17:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neil Horman, Frank Ch. Eigler, Hagen Paul Pfeifer,
	Frank Ch. Eigler, Satoru Moriya, David Miller, netdev, tgraf,
	stephen.hemminger, Seiji Aguchi, fche, mathieu.desnoyers

On Mon, 2012-02-06 at 18:02 +0100, Eric Dumazet wrote:
>  
> > I haven't looked at the details here, but I'm wondering if netfilter
> > could benefit from static_branch() calls (aka jump-labels). I'm sure
> > selinux could too. Basically, when netfilter is disabled, the fast path
> > would just include a nop, with no if() branch testing at all. When you
> > enable netfilters, it would then add a jmp to the code that actually
> > does the tests.
> 
> Hmm... I thought I already did that.
> 
> Or maybe I intended so and forgot.
> 
> Ah no, its there in commit a2d7ec58ac09
> (netfilter: use jump_label for nf_hooks)

hehe, this shows the fact that "I haven't looked at the details here".

;-)

Great to hear!

-- Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-06 15:53                     ` Neil Horman
  2012-02-06 16:18                       ` Steven Rostedt
  2012-02-06 16:21                       ` Frank Ch. Eigler
@ 2012-02-06 18:21                       ` Satoru Moriya
  2 siblings, 0 replies; 30+ messages in thread
From: Satoru Moriya @ 2012-02-06 18:21 UTC (permalink / raw)
  To: Neil Horman, Frank Ch. Eigler
  Cc: Hagen Paul Pfeifer, Frank Ch. Eigler, David Miller, netdev,
	tgraf, stephen.hemminger, eric.dumazet, Seiji Aguchi, fche,
	mathieu.desnoyers, rostedt

On 02/06/2012 10:53 AM, Neil Horman wrote:
> On Mon, Feb 06, 2012 at 10:20:19AM -0500, Frank Ch. Eigler wrote:
>> On Sun, Feb 05, 2012 at 08:32:47PM -0500, Neil Horman wrote:
>>
> Yes, that was his point.  Mine was that, given Satoru's needs, a 
> netfilter hook provides 90% of what they need (as its been described 
> here).  If you're unfamiliar with them, netfilter hooks are those 
> standard points in the network input/output/forwarding paths at which 
> we can watch and maniupulate network traffic.  They're what 
> iptables/ip6tables/ebtables/etc use to do everything they do.  I'm 
> suggesting that they use those existing hooks to monitor outgoing 
> traffic for whatever information they want (in this case
> retransmitted unacked tcp sequence numbers).  Or whatever else their
> interested in.
> 
> As for the other 10% (recording failed retransmits, which don't 
> generate traffic), we have existing tracepoints that can be used.

Thank you so much, Neil and other guys.

For now it seems to be difficult to move into netfilter approach
because "other 10%" is so important for us. But first, I'll consider
it more deeply.

Regards,
Satoru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2 0/2] Tracepoint for tcp retransmission
  2012-02-04  4:40     ` Yuchung Cheng
@ 2012-02-06 18:32       ` Satoru Moriya
  0 siblings, 0 replies; 30+ messages in thread
From: Satoru Moriya @ 2012-02-06 18:32 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, netdev, nhorman, tgraf, stephen.hemminger, hagen,
	eric.dumazet, Seiji Aguchi

On 02/03/2012 11:40 PM, Yuchung Cheng wrote:
> I totally understand that you need deeper instrumentation for *your* 
> business. At Google, we instrument a lot more in TCP as well. But we 
> have yet to upstream these changes because it's not universally useful 
> for default Linux kernel. I failed to see how different failures of a 
> TCP retransmission must be recorded and exported. Could you elaborate 
> your last point?

Hi Yuchung,

Ultimately, we'd like to record all the tcp retransmission failures
but for the first step, we'd like to know whether kernel succeeded
retransmission or not (kernel sent a packet or not).

Does it make sense?

Regards,
Satoru

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2012-02-06 18:32 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-20 18:07 [PATCH v2 0/2] Tracepoint for tcp retransmission Satoru Moriya
2012-01-20 18:08 ` [PATCH v2 1/2] tcp: refactor tcp_retransmit_skb() for a single return point Satoru Moriya
2012-01-20 18:09 ` [PATCH v2 2/2] tcp: add tracepoint for tcp retransmission Satoru Moriya
2012-01-20 18:50 ` [PATCH v2 0/2] Tracepoint " David Miller
2012-02-03 21:47   ` Satoru Moriya
2012-02-04  4:40     ` Yuchung Cheng
2012-02-06 18:32       ` Satoru Moriya
2012-02-04 14:28     ` Neil Horman
2012-02-04 15:58       ` Hagen Paul Pfeifer
2012-02-04 20:09         ` Neil Horman
2012-02-05 12:53           ` Frank Ch. Eigler
2012-02-05 19:17             ` Neil Horman
2012-02-05 20:04               ` Hagen Paul Pfeifer
2012-02-05 21:48                 ` David Miller
2012-02-06  1:32                 ` Neil Horman
2012-02-06 15:20                   ` Frank Ch. Eigler
2012-02-06 15:24                     ` Eric Dumazet
2012-02-06 15:38                       ` Neil Horman
2012-02-06 15:53                     ` Neil Horman
2012-02-06 16:18                       ` Steven Rostedt
2012-02-06 17:02                         ` Eric Dumazet
2012-02-06 17:18                           ` Steven Rostedt
2012-02-06 16:21                       ` Frank Ch. Eigler
2012-02-06 18:21                       ` Satoru Moriya
2012-01-25 13:27 ` Hagen Paul Pfeifer
2012-01-25 14:44   ` Eric Dumazet
2012-01-26 18:51     ` David Miller
2012-02-03 20:31     ` Satoru Moriya
2012-02-03 20:43   ` Satoru Moriya
2012-02-03 20:55     ` Hagen Paul Pfeifer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).