* net-timestamp: MSG_TSTAMP flags and bytestream support
@ 2014-06-24 15:43 Willem de Bruijn
  2014-06-24 15:43 ` [PATCH net-next 1/7] net-timestamp: explicit SO_TIMESTAMPING ancillary data struct Willem de Bruijn
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-24 15:43 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, richardcochran, davem, Willem de Bruijn

This patchset extends socket timestamping in a number of related ways.
Most notably:

2 MSG_TSTAMP: request a single tx timestamp by passing a flag on send
6 MSG_TSTAMP_ENQ: request a tx timestamp before traffic shaping
5 MSG_TSTAMP_ACK: request a tx timestamp after acknowledgements (TCP)
4 TCP support for all three flags

Each individual patch commit message gives more detail about the
specific feature.

The other patches support the main feature:
1 explicitly define the timestamp response API
3 optionally avoid looping large packets onto the socket error queue.
7 documentation and an example test.

This initial patchset is mostly to request feedback. Though the
patches are somewhat interdependent, I can resubmit them independently
or drop optional features (most notably, patches 1 and 3).

While rebasing, I found a few possible improvements myself. These are
debatable, so I will send the set first and optionally integrate these
in v2:

- The struct in patch 1 was informally called scm_timestamping in the
  documentation. I prefer that name over the long one I came up with.
- We can actually avoid extending that structure, because ts_type and
  ts_key can be passed in the currently unused sock_extended_err
  fields ee_info and ee_data.
- The new functionality of MSG_TSTAMP_* should also be extended to
  the socket option interface SO_TIMESTAMPING.

Tested:
  I ran the msg_tstamp binary for various settings:
  - IPv4 and IPv6
  - UDP and TCP
  - 1 B and 20 KB payload
  - GSO, TSO, neither
  - with and without the no-payload feature (patch 3)

  Example output from one IPv4/TCP/1B/payload run:
  (on a bonded machine, resulting in 2 ENQ timestamps per send)

proto INET
  test SND
      USR: 1400265321 s 167847 us (seq=0, len=0)
      SND: 1400265321 s 167854 us (seq=408779523, len=7)  (+7 us)
  test ENQ
      USR: 1400265321 s 768728 us (seq=0, len=7)
      ENQ: 1400265321 s 768732 us (seq=3113669987, len=7)  (+4 us)
      ENQ: 1400265321 s 768734 us (seq=3113669987, len=7)  (+2 us)
  test ENQ + SND
      USR: 1400265322 s 369747 us (seq=0, len=7)
      ENQ: 1400265322 s 369750 us (seq=2305548511, len=7)  (+3 us)
      ENQ: 1400265322 s 369751 us (seq=2305548511, len=7)  (+1 us)
      SND: 1400265322 s 369752 us (seq=2305548511, len=7)  (+1 us)
  test ACK
      USR: 1400265322 s 970717 us (seq=0, len=7)
      ACK: 1400265322 s 970752 us (seq=2324323855, len=7)  (+35 us)
  test SND + ACK
      USR: 1400265323 s 571662 us (seq=0, len=7)
      SND: 1400265323 s 571681 us (seq=872301729, len=7)  (+19 us)
      ACK: 1400265323 s 571708 us (seq=872301729, len=7)  (+27 us)
  test ENQ + SND + ACK
      USR: 1400265324 s 172558 us (seq=0, len=7)
      ENQ: 1400265324 s 172561 us (seq=2135092223, len=7)  (+3 us)
      ENQ: 1400265324 s 172565 us (seq=2135092223, len=7)  (+4 us)
      SND: 1400265324 s 172581 us (seq=2135092223, len=7)  (+16 us)
      ACK: 1400265324 s 172624 us (seq=2135092223, len=7)  (+43 us)
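
The deltas in parentheses are relative to the previous line of the same
test, not to the USR line. A minimal sketch of that arithmetic follows;
the helper name tstamp_delta_us() is hypothetical and is not taken from
the msg_tstamp binary:

```c
#include <stdint.h>

/*
 * Microseconds elapsed between two (seconds, microseconds) timestamps,
 * as printed in the example output. Hypothetical helper for illustration.
 */
int64_t tstamp_delta_us(int64_t prev_s, int64_t prev_us,
			int64_t cur_s, int64_t cur_us)
{
	return (cur_s - prev_s) * 1000000 + (cur_us - prev_us);
}
```

For the first test above, tstamp_delta_us(1400265321, 167847,
1400265321, 167854) matches the printed "+7 us".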


Signed-off-by: Willem de Bruijn <willemb@google.com>


* [PATCH net-next 1/7] net-timestamp: explicit SO_TIMESTAMPING ancillary data struct
  2014-06-24 15:43 net-timestamp: MSG_TSTAMP flags and bytestream support Willem de Bruijn
@ 2014-06-24 15:43 ` Willem de Bruijn
  2014-06-25  4:56   ` Richard Cochran
  2014-06-24 15:43 ` [PATCH net-next 2/7] net-timestamp: MSG_TSTAMP one-shot tx timestamps Willem de Bruijn
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-24 15:43 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, richardcochran, davem, Willem de Bruijn

Applications that request kernel transmit timestamps with
SO_TIMESTAMPING read timestamps using recvmsg() ancillary data.
The existing timestamp types (hardware raw, hardware sys, software)
are not explicitly documented. Instead, an array of anonymous
timespecs is passed. With upcoming additional types, this becomes
fragile.

This patch replaces the array with a struct with explicit fields for
each timestamp type. It also extends it with two new fields for
upcoming features:
  @ts_type: the location in the kernel where the tstamp was taken
  @ts_key:  the location in the flow that a tstamp corresponds to

The code is backward compatible with legacy applications that treat
the ancillary data as an anonymous array 'struct timespec data[3]'.
It will break applications that test the size of the cmsg data.
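
To illustrate the compatibility claim, here is a minimal userspace
sketch. It assumes a local mirror of the proposed struct (named
tstamp_cmsg here, with fixed-width types in place of __u32/__u16) and a
hypothetical helper tstamp_cmsg_parse(). Neither name is part of this
patch. The helper accepts both the legacy 3-timespec payload and the
extended one instead of testing for one exact size:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Local mirror of struct sock_errqueue_timestamping (assumed layout). */
struct tstamp_cmsg {
	struct timespec ts_sw;
	struct timespec ts_hw_sys;
	struct timespec ts_hw_raw;
	uint32_t ts_key;
	uint16_t ts_type;
	uint16_t ts_padding;
};

/*
 * Copy SCM_TIMESTAMPING cmsg data into *out.
 * Returns 0 on success, -1 if the data is shorter than the legacy array.
 * ts_key and ts_type stay zero when only the legacy array is present.
 */
int tstamp_cmsg_parse(const void *data, size_t len, struct tstamp_cmsg *out)
{
	const size_t legacy_len = 3 * sizeof(struct timespec);

	if (len < legacy_len)
		return -1;

	memset(out, 0, sizeof(*out));
	if (len >= sizeof(*out))
		memcpy(out, data, sizeof(*out));	/* extended layout */
	else
		memcpy(out, data, legacy_len);		/* legacy array only */
	return 0;
}
```

Because the three timespecs come first, a legacy reader that casts the
data to 'struct timespec data[3]' sees the same values at the same
offsets.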

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/uapi/linux/errqueue.h | 32 ++++++++++++++++++++++++++++++++
 net/compat.c                  | 38 +++++++++++++++++++++++++++-----------
 net/socket.c                  | 24 +++++++++++++++---------
 3 files changed, 74 insertions(+), 20 deletions(-)

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index aacd4fb..d80b866 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -22,5 +22,37 @@ struct sock_extended_err {
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
 
+/**
+ *	struct sock_errqueue_timestamping - timestamps exposed through cmsg
+ *
+ *	The timestamping interfaces SO_TIMESTAMPING, MSG_TSTAMP_*
+ *	communicate network timestamps to userspace by passing this struct
+ *	through a cmsg in recvmsg().
+ *
+ *	@ts_sw:     the sw timestamp: its contents depend on ts_type.
+ *	@ts_hw_sys: a hardware generated timestamp converted to system time.
+ *	@ts_hw_raw: a hardware generated timestamp in its raw format.
+ *	@ts_type:   the type of timestamp ts_sw. One of SCM_TSTAMP_*
+ *	@ts_key:    socket flow index that the timestamps correspond to
+ *		    (stream transport protocols only, e.g., TCP seqno)
+ *
+ *	The first three fields are dictated by historical use. The hardware
+ *	timestamps are empty unless hardware timestamping is enabled, but
+ *	they have to be present in each message.
+ */
+struct sock_errqueue_timestamping {
+	struct timespec ts_sw;
+	struct timespec ts_hw_sys;
+	struct timespec ts_hw_raw;
+	__u32 ts_key;
+	__u16 ts_type;
+	__u16 ts_padding;
+};
+
+enum {
+	SCM_TSTAMP_SND = 1,
+	SCM_TSTAMP_ACK = 2,
+	SCM_TSTAMP_ENQ = 3
+};
 
 #endif /* _UAPI_LINUX_ERRQUEUE_H */
diff --git a/net/compat.c b/net/compat.c
index 9a76eaf..fff6d76 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -23,6 +23,7 @@
 #include <linux/compat.h>
 #include <linux/security.h>
 #include <linux/export.h>
+#include <linux/errqueue.h>
 
 #include <net/scm.h>
 #include <net/sock.h>
@@ -225,7 +226,15 @@ int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *dat
 	struct compat_cmsghdr __user *cm = (struct compat_cmsghdr __user *) kmsg->msg_control;
 	struct compat_cmsghdr cmhdr;
 	struct compat_timeval ctv;
-	struct compat_timespec cts[3];
+	struct compat_timespec cts;
+	struct compat_errqueue_timestamping {
+		struct compat_timespec ts_sw;
+		struct compat_timespec ts_hw_sys;
+		struct compat_timespec ts_hw_raw;
+		__u32 ts_key;
+		__u16 ts_type;
+		__u16 ts_padding;
+	} ctss;
 	int cmlen;
 
 	if (cm == NULL || kmsg->msg_controllen < sizeof(*cm)) {
@@ -240,18 +249,25 @@ int put_cmsg_compat(struct msghdr *kmsg, int level, int type, int len, void *dat
 			ctv.tv_usec = tv->tv_usec;
 			data = &ctv;
 			len = sizeof(ctv);
-		}
-		if (level == SOL_SOCKET &&
-		    (type == SCM_TIMESTAMPNS || type == SCM_TIMESTAMPING)) {
-			int count = type == SCM_TIMESTAMPNS ? 1 : 3;
-			int i;
+		} else if (level == SOL_SOCKET && type == SCM_TIMESTAMPNS) {
 			struct timespec *ts = (struct timespec *)data;
-			for (i = 0; i < count; i++) {
-				cts[i].tv_sec = ts[i].tv_sec;
-				cts[i].tv_nsec = ts[i].tv_nsec;
-			}
+			cts.tv_sec = ts->tv_sec;
+			cts.tv_nsec = ts->tv_nsec;
 			data = &cts;
-			len = sizeof(cts[0]) * count;
+			len = sizeof(cts);
+		} else if (level == SOL_SOCKET && type == SCM_TIMESTAMPING) {
+			struct sock_errqueue_timestamping *tss = data;
+			ctss.ts_sw.tv_sec = tss->ts_sw.tv_sec;
+			ctss.ts_sw.tv_nsec = tss->ts_sw.tv_nsec;
+			ctss.ts_hw_sys.tv_sec = tss->ts_hw_sys.tv_sec;
+			ctss.ts_hw_sys.tv_nsec = tss->ts_hw_sys.tv_nsec;
+			ctss.ts_hw_raw.tv_sec = tss->ts_hw_raw.tv_sec;
+			ctss.ts_hw_raw.tv_nsec = tss->ts_hw_raw.tv_nsec;
+			ctss.ts_type = tss->ts_type;
+			ctss.ts_key = tss->ts_key;
+			ctss.ts_padding = 0;
+			data = &ctss;
+			len = sizeof(ctss);
 		}
 	}
 
diff --git a/net/socket.c b/net/socket.c
index abf56b2..c001746 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -106,6 +106,7 @@
 #include <linux/sockios.h>
 #include <linux/atalk.h>
 #include <net/busy_poll.h>
+#include <linux/errqueue.h>
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 unsigned int sysctl_net_busy_read __read_mostly;
@@ -697,7 +698,7 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	struct sk_buff *skb)
 {
 	int need_software_tstamp = sock_flag(sk, SOCK_RCVTSTAMP);
-	struct timespec ts[3];
+	struct sock_errqueue_timestamping tss;
 	int empty = 1;
 	struct skb_shared_hwtstamps *shhwtstamps =
 		skb_hwtstamps(skb);
@@ -714,28 +715,33 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 			put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
 				 sizeof(tv), &tv);
 		} else {
-			skb_get_timestampns(skb, &ts[0]);
+			struct timespec ts;
+			skb_get_timestampns(skb, &ts);
 			put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPNS,
-				 sizeof(ts[0]), &ts[0]);
+				 sizeof(ts), &ts);
 		}
 	}
 
-
-	memset(ts, 0, sizeof(ts));
+	memset(&tss, 0, sizeof(tss));
 	if (sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE) &&
-	    ktime_to_timespec_cond(skb->tstamp, ts + 0))
+	    ktime_to_timespec_cond(skb->tstamp, &tss.ts_sw)) {
 		empty = 0;
+		tss.ts_type = SCM_TSTAMP_SND;
+	}
 	if (shhwtstamps) {
 		if (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE) &&
-		    ktime_to_timespec_cond(shhwtstamps->syststamp, ts + 1))
+		    ktime_to_timespec_cond(shhwtstamps->syststamp,
+					   &tss.ts_hw_sys))
 			empty = 0;
 		if (sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE) &&
-		    ktime_to_timespec_cond(shhwtstamps->hwtstamp, ts + 2))
+		    ktime_to_timespec_cond(shhwtstamps->hwtstamp,
+					   &tss.ts_hw_raw))
 			empty = 0;
 	}
+
 	if (!empty)
 		put_cmsg(msg, SOL_SOCKET,
-			 SCM_TIMESTAMPING, sizeof(ts), &ts);
+			 SCM_TIMESTAMPING, sizeof(tss), &tss);
 }
 EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
 
-- 
2.0.0.526.g5318336


* [PATCH net-next 2/7] net-timestamp: MSG_TSTAMP one-shot tx timestamps
  2014-06-24 15:43 net-timestamp: MSG_TSTAMP flags and bytestream support Willem de Bruijn
  2014-06-24 15:43 ` [PATCH net-next 1/7] net-timestamp: explicit SO_TIMESTAMPING ancillary data struct Willem de Bruijn
@ 2014-06-24 15:43 ` Willem de Bruijn
  2014-06-25  5:01   ` Richard Cochran
  2014-06-24 15:43 ` [PATCH net-next 3/7] net-timestamp: tx timestamp without payload Willem de Bruijn
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-24 15:43 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, richardcochran, davem, Willem de Bruijn

The kernel supports datagram tx timestamping through socket option
SO_TIMESTAMPING. This patch adds send() flag MSG_TSTAMP to allow
selectively requesting a timestamp for a single packet.

MSG_TSTAMP does not depend on SO_TIMESTAMPING. Enabling both
concurrently is redundant, but safe.

This patch adds support for IPv4 and IPv6 UDP sendmsg().

The feature maintains the semantics of the socket option: with IP
fragmentation, only the first fragment is timestamped. Fragmentation
should not normally happen, but can be forced, e.g., by disabling
path MTU discovery on a UDP socket.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h | 10 ++++++++++
 include/linux/socket.h |  1 +
 include/net/sock.h     |  5 +++--
 net/ipv4/udp.c         |  1 +
 net/ipv6/ip6_output.c  |  4 +++-
 net/socket.c           |  3 ++-
 6 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ec89301..bec3ded 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2736,6 +2736,16 @@ static inline void skb_tx_timestamp(struct sk_buff *skb)
 	sw_tx_timestamp(skb);
 }
 
+static inline u8 skbflags_tx_tstamp(int flags)
+{
+	u8 tx_flags = 0;
+
+	if (unlikely(flags & MSG_TSTAMP))
+		tx_flags |= SKBTX_SW_TSTAMP;
+
+	return tx_flags;
+}
+
 /**
  * skb_complete_wifi_ack - deliver skb with wifi status
  *
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8e98297..ce4101e 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -253,6 +253,7 @@ struct ucred {
 #define MSG_MORE	0x8000	/* Sender will send more */
 #define MSG_WAITFORONE	0x10000	/* recvmmsg(): block until 1+ packets avail */
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
+#define MSG_TSTAMP	0x100000
 #define MSG_EOF         MSG_FIN
 
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
diff --git a/include/net/sock.h b/include/net/sock.h
index 07b7fcd..32cd1be 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2138,14 +2138,15 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
 	 * - receive time stamping in software requested (SOCK_RCVTSTAMP
 	 *   or SOCK_TIMESTAMPING_RX_SOFTWARE)
 	 * - software time stamp available and wanted
-	 *   (SOCK_TIMESTAMPING_SOFTWARE)
+	 *   (SOCK_TIMESTAMPING_SOFTWARE || SKBTX_SW_TSTAMP)
 	 * - hardware time stamps available and wanted
 	 *   (SOCK_TIMESTAMPING_SYS_HARDWARE or
 	 *   SOCK_TIMESTAMPING_RAW_HARDWARE)
 	 */
 	if (sock_flag(sk, SOCK_RCVTSTAMP) ||
 	    sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE) ||
-	    (kt.tv64 && sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE)) ||
+	    (kt.tv64 && (sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE) ||
+	     skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)) ||
 	    (hwtstamps->hwtstamp.tv64 &&
 	     sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE)) ||
 	    (hwtstamps->syststamp.tv64 &&
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d92f94b..9127aab 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -967,6 +967,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	ipc.oif = sk->sk_bound_dev_if;
 
 	sock_tx_timestamp(sk, &ipc.tx_flags);
+	ipc.tx_flags |= skbflags_tx_tstamp(msg->msg_flags);
 
 	if (msg->msg_controllen) {
 		err = ip_cmsg_send(sock_net(sk), msg, &ipc,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index cb9df0e..7306c5d 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1270,8 +1270,10 @@ emsgsize:
 	}
 
 	/* For UDP, check if TX timestamp is enabled */
-	if (sk->sk_type == SOCK_DGRAM)
+	if (sk->sk_type == SOCK_DGRAM) {
 		sock_tx_timestamp(sk, &tx_flags);
+		tx_flags |= skbflags_tx_tstamp(flags);
+	}
 
 	/*
 	 * Let's try using as much space as possible.
diff --git a/net/socket.c b/net/socket.c
index c001746..18ab44a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -723,7 +723,8 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	}
 
 	memset(&tss, 0, sizeof(tss));
-	if (sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE) &&
+	if ((sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE) ||
+	     skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP) &&
 	    ktime_to_timespec_cond(skb->tstamp, &tss.ts_sw)) {
 		empty = 0;
 		tss.ts_type = SCM_TSTAMP_SND;
-- 
2.0.0.526.g5318336


* [PATCH net-next 3/7] net-timestamp: tx timestamp without payload
  2014-06-24 15:43 net-timestamp: MSG_TSTAMP flags and bytestream support Willem de Bruijn
  2014-06-24 15:43 ` [PATCH net-next 1/7] net-timestamp: explicit SO_TIMESTAMPING ancillary data struct Willem de Bruijn
  2014-06-24 15:43 ` [PATCH net-next 2/7] net-timestamp: MSG_TSTAMP one-shot tx timestamps Willem de Bruijn
@ 2014-06-24 15:43 ` Willem de Bruijn
  2014-06-25  5:16   ` Richard Cochran
  2014-06-24 15:43 ` [PATCH net-next 4/7] net-timestamp: TCP timestamping Willem de Bruijn
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-24 15:43 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, richardcochran, davem, Willem de Bruijn

Applications receive tx timestamps from the kernel by reading the
original packet from the socket error queue with recvmsg() and
processing an ancillary data item that holds the timestamps.

If the application is only interested in the timestamp, then looping
the whole packet back up to userspace wastes socket buffer space
(SO_RCVBUF). This is especially important when the same packet is
enqueued repeatedly with multiple timestamps.

This patch adds a socket option to loop the timestamp on top of an
empty packet instead of a clone of the original.

The option is only implemented for tx timestamps. Code that dequeues
from an sk_error_queue onto which skb_tstamp_tx enqueues has to be
able to handle zero-length packets.  Common implementations peek into
the packet headers, for instance to learn the address for msg_name.
When the queued skb has no payload, this data is unavailable and thus
not returned. skb_dequeue(sk->sk_error_queue) callers have been
audited to avoid accessing packet contents and fixed where needed
(IP, IPv6, RxRPC).
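
On the application side, a reader that only wants the timestamp needs
nothing but the control data returned by recvmsg(fd, &msg,
MSG_ERRQUEUE). The sketch below uses the standard CMSG_* macros to
locate the SCM_TIMESTAMPING item. The helper name find_sw_tstamp() is
hypothetical, and the fallback value for SCM_TIMESTAMPING is an
assumption that holds for the generic Linux socket ABI:

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING 37	/* assumption: generic SO_TIMESTAMPING value */
#endif

/*
 * Walk the control messages of a msghdr that recvmsg() has filled in
 * and copy out the first (software) timestamp of an SCM_TIMESTAMPING
 * item. Returns 1 if one was found, 0 otherwise. The payload iovec is
 * never inspected, so an empty looped packet is handled the same way.
 */
int find_sw_tstamp(struct msghdr *msg, struct timespec *ts_sw)
{
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		    cm->cmsg_type != SCM_TIMESTAMPING)
			continue;
		if (cm->cmsg_len < CMSG_LEN(sizeof(*ts_sw)))
			continue;	/* too short to hold a timespec */
		memcpy(ts_sw, CMSG_DATA(cm), sizeof(*ts_sw));
		return 1;
	}
	return 0;
}
```

With the no-payload option set, recvmsg() is expected to return zero
payload bytes for such a message, so callers should not treat a zero
return value as an error on the error queue.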

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/sock.h              |  1 +
 include/uapi/linux/net_tstamp.h |  5 +++--
 net/core/skbuff.c               | 16 ++++++++++++----
 net/core/sock.c                 |  4 ++++
 net/ipv4/ip_sockglue.c          |  4 ++--
 net/ipv6/datagram.c             |  4 ++--
 net/rxrpc/ar-error.c            |  5 +++++
 7 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 32cd1be..df7bde0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -690,6 +690,7 @@ enum sock_flags {
 	SOCK_TIMESTAMPING_SOFTWARE,     /* %SOF_TIMESTAMPING_SOFTWARE */
 	SOCK_TIMESTAMPING_RAW_HARDWARE, /* %SOF_TIMESTAMPING_RAW_HARDWARE */
 	SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
+	SOCK_TIMESTAMPING_OPT_TX_NO_PAYLOAD, /* %SOF_TIMESTAMPING_OPT_TX_NO_PAYLOAD */
 	SOCK_FASYNC, /* fasync() active */
 	SOCK_RXQ_OVFL,
 	SOCK_ZEROCOPY, /* buffers from userspace */
diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
index f53879c..0b4a2b0 100644
--- a/include/uapi/linux/net_tstamp.h
+++ b/include/uapi/linux/net_tstamp.h
@@ -20,9 +20,10 @@ enum {
 	SOF_TIMESTAMPING_SOFTWARE = (1<<4),
 	SOF_TIMESTAMPING_SYS_HARDWARE = (1<<5),
 	SOF_TIMESTAMPING_RAW_HARDWARE = (1<<6),
+	SOF_TIMESTAMPING_OPT_TX_NO_PAYLOAD = (1<<7),
 	SOF_TIMESTAMPING_MASK =
-	(SOF_TIMESTAMPING_RAW_HARDWARE - 1) |
-	SOF_TIMESTAMPING_RAW_HARDWARE
+	(SOF_TIMESTAMPING_OPT_TX_NO_PAYLOAD - 1) |
+	SOF_TIMESTAMPING_OPT_TX_NO_PAYLOAD
 };
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9cd5344..bc653c4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3501,6 +3501,13 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
 	if (!sk)
 		return;
 
+	if (sock_flag(sk, SOCK_TIMESTAMPING_OPT_TX_NO_PAYLOAD))
+		skb = alloc_skb(0, GFP_ATOMIC);
+	else
+		skb = skb_clone(orig_skb, GFP_ATOMIC);
+	if (!skb)
+		return;
+
 	if (hwtstamps) {
 		*skb_hwtstamps(orig_skb) =
 			*hwtstamps;
@@ -3510,12 +3517,13 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
 		 * so keep the shared tx_flags and only
 		 * store software time stamp
 		 */
-		orig_skb->tstamp = ktime_get_real();
+		skb->tstamp = ktime_get_real();
 	}
 
-	skb = skb_clone(orig_skb, GFP_ATOMIC);
-	if (!skb)
-		return;
+	if (!skb->len) {
+		skb_shinfo(skb)->tx_flags = skb_shinfo(orig_skb)->tx_flags;
+		*skb_hwtstamps(skb) = *skb_hwtstamps(orig_skb);
+	}
 
 	serr = SKB_EXT_ERR(skb);
 	memset(serr, 0, sizeof(*serr));
diff --git a/net/core/sock.c b/net/core/sock.c
index 026e01f..0e8b518 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -866,6 +866,8 @@ set_rcvbuf:
 				  val & SOF_TIMESTAMPING_SYS_HARDWARE);
 		sock_valbool_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE,
 				  val & SOF_TIMESTAMPING_RAW_HARDWARE);
+		sock_valbool_flag(sk, SOCK_TIMESTAMPING_OPT_TX_NO_PAYLOAD,
+				  val & SOF_TIMESTAMPING_OPT_TX_NO_PAYLOAD);
 		break;
 
 	case SO_RCVLOWAT:
@@ -1106,6 +1108,8 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 			v.val |= SOF_TIMESTAMPING_SYS_HARDWARE;
 		if (sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE))
 			v.val |= SOF_TIMESTAMPING_RAW_HARDWARE;
+		if (sock_flag(sk, SOCK_TIMESTAMPING_OPT_TX_NO_PAYLOAD))
+			v.val |= SOF_TIMESTAMPING_OPT_TX_NO_PAYLOAD;
 		break;
 
 	case SO_RCVTIMEO:
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 64741b9..f17f34f 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -432,7 +432,7 @@ int ip_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 
 	serr = SKB_EXT_ERR(skb);
 
-	if (sin) {
+	if (sin && skb->len) {
 		sin->sin_family = AF_INET;
 		sin->sin_addr.s_addr = *(__be32 *)(skb_network_header(skb) +
 						   serr->addr_offset);
@@ -444,7 +444,7 @@ int ip_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 	memcpy(&errhdr.ee, &serr->ee, sizeof(struct sock_extended_err));
 	sin = &errhdr.offender;
 	sin->sin_family = AF_UNSPEC;
-	if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
+	if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP && skb->len) {
 		struct inet_sock *inet = inet_sk(sk);
 
 		sin->sin_family = AF_INET;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index c3bf2d2..391e6e0 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -358,7 +358,7 @@ int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 
 	serr = SKB_EXT_ERR(skb);
 
-	if (sin) {
+	if (sin && skb->len) {
 		const unsigned char *nh = skb_network_header(skb);
 		sin->sin6_family = AF_INET6;
 		sin->sin6_flowinfo = 0;
@@ -383,7 +383,7 @@ int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 	memcpy(&errhdr.ee, &serr->ee, sizeof(struct sock_extended_err));
 	sin = &errhdr.offender;
 	sin->sin6_family = AF_UNSPEC;
-	if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
+	if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL && skb->len) {
 		sin->sin6_family = AF_INET6;
 		sin->sin6_flowinfo = 0;
 		sin->sin6_port = 0;
diff --git a/net/rxrpc/ar-error.c b/net/rxrpc/ar-error.c
index db57458..f9a65c0 100644
--- a/net/rxrpc/ar-error.c
+++ b/net/rxrpc/ar-error.c
@@ -42,6 +42,11 @@ void rxrpc_UDP_error_report(struct sock *sk)
 		_leave("UDP socket errqueue empty");
 		return;
 	}
+	if (!skb->len) {
+		_leave("UDP empty message");
+		kfree_skb(skb);
+		return;
+	}
 
 	rxrpc_new_skb(skb);
 
-- 
2.0.0.526.g5318336


* [PATCH net-next 4/7] net-timestamp: TCP timestamping
  2014-06-24 15:43 net-timestamp: MSG_TSTAMP flags and bytestream support Willem de Bruijn
                   ` (2 preceding siblings ...)
  2014-06-24 15:43 ` [PATCH net-next 3/7] net-timestamp: tx timestamp without payload Willem de Bruijn
@ 2014-06-24 15:43 ` Willem de Bruijn
  2014-06-24 15:43 ` [PATCH net-next 5/7] net-timestamp: ACK timestamp for bytestreams Willem de Bruijn
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-24 15:43 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, richardcochran, davem, Willem de Bruijn

TCP timestamping extends datagram MSG_TSTAMP support to bytestreams.

Bytestreams do not have a 1:1 relationship between send() buffers and
network packets. The feature interprets a send call with MSG_TSTAMP on
a bytestream as a request for a timestamp for the last byte in the
buffer.

The choice corresponds to a request for a timestamp when all bytes in
the buffer have been sent. That assumption depends on in-order kernel
transmission. This is the common case. That said, it is possible to
construct a traffic shaping tree that would result in reordering.
The guarantee is strong, then, but not ironclad.

This implementation supports send and sendpages (splice). GSO replaces
one large packet with multiple smaller packets. This patch also copies
the option into the correct smaller packet.

This patch does not yet support timestamping on data in an initial TCP
Fast Open SYN, because that takes a very different data path.

The implementation supports a single timestamp per packet. To avoid
having multiple timestamp requests per sk_buff, the skb is locked
against extension once the flag is set.
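
When the key cannot be read from the TCP control block (the post-GSO
case in skb_tstamp_tx() below), it is derived from the packet headers.
The arithmetic can be sketched as a standalone function. The function
name is hypothetical, and the sketch assumes IPv4 and host-order
inputs, as the ip_hdr()-based expression in the diff does:

```c
#include <stdint.h>

/*
 * Sequence number one past the last payload byte of a TCP segment,
 * modulo 2^32. Inputs are in host byte order:
 *   seq          TCP sequence number of the first payload byte
 *   ip_tot_len   IPv4 total length (headers plus payload)
 *   ip_hdr_len   IPv4 header length in bytes
 *   tcp_hdr_len  TCP header length in bytes
 */
uint32_t tcp_payload_end_seq(uint32_t seq, uint16_t ip_tot_len,
			     uint16_t ip_hdr_len, uint16_t tcp_hdr_len)
{
	uint32_t payload_len = (uint32_t)ip_tot_len - ip_hdr_len - tcp_hdr_len;

	return seq + payload_len;	/* unsigned add wraps as TCP seqnos do */
}
```

A 47-byte IPv4 packet with 20-byte IP and TCP headers carries 7 payload
bytes, so a segment starting at seq 1000 yields 1007.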

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/core/skbuff.c      | 12 ++++++++++++
 net/ipv4/tcp.c         | 21 +++++++++++++++++----
 net/ipv4/tcp_offload.c |  4 ++++
 net/socket.c           |  2 ++
 4 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bc653c4..07ba98d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,6 +69,7 @@
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
 #include <net/xfrm.h>
+#include <net/tcp.h>
 
 #include <asm/uaccess.h>
 #include <trace/events/skb.h>
@@ -3496,6 +3497,7 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
 	struct sock *sk = orig_skb->sk;
 	struct sock_exterr_skb *serr;
 	struct sk_buff *skb;
+	__u32 key = 0;
 	int err;
 
 	if (!sk)
@@ -3525,10 +3527,20 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
 		*skb_hwtstamps(skb) = *skb_hwtstamps(orig_skb);
 	}
 
+	if (orig_skb->sk && orig_skb->sk->sk_protocol == IPPROTO_TCP) {
+		if (orig_skb->fclone == SKB_FCLONE_CLONE)
+			key = TCP_SKB_CB(orig_skb - 1)->end_seq;
+		else /* after GSO segmentation, clone no longer works */
+			key = ntohl(tcp_hdr(skb)->seq) +
+			      ntohs(ip_hdr(skb)->tot_len) -
+			      ip_hdrlen(skb) - tcp_hdrlen(skb);
+	}
+
 	serr = SKB_EXT_ERR(skb);
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = ENOMSG;
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
+	serr->ee.ee_data = key;
 
 	err = sock_queue_err_skb(sk, skb);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index eb1dde3..4ceecd9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -878,6 +878,11 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
+static bool tcp_skb_can_extend(struct sk_buff *skb)
+{
+	return !(skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP);
+}
+
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 				size_t size, int flags)
 {
@@ -911,7 +916,8 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		int copy, i;
 		bool can_coalesce;
 
-		if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
+		if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0 ||
+		    !tcp_skb_can_extend(skb)) {
 new_segment:
 			if (!sk_stream_memory_free(sk))
 				goto wait_for_sndbuf;
@@ -959,8 +965,10 @@ new_segment:
 
 		copied += copy;
 		offset += copy;
-		if (!(size -= copy))
+		if (!(size -= copy)) {
+			skb_shinfo(skb)->tx_flags |= skbflags_tx_tstamp(flags);
 			goto out;
+		}
 
 		if (skb->len < size_goal || (flags & MSG_OOB))
 			continue;
@@ -1160,7 +1168,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 				copy = max - skb->len;
 			}
 
-			if (copy <= 0) {
+			if (copy <= 0 || !tcp_skb_can_extend(skb)) {
 new_segment:
 				/* Allocate new segment. If the interface is SG,
 				 * allocate skb fitting to single page.
@@ -1252,8 +1260,10 @@ new_segment:
 
 			from += copy;
 			copied += copy;
-			if ((seglen -= copy) == 0 && iovlen == 0)
+			if ((seglen -= copy) == 0 && iovlen == 0) {
+				skb_shinfo(skb)->tx_flags |= skbflags_tx_tstamp(flags);
 				goto out;
+			}
 
 			if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
 				continue;
@@ -1616,6 +1626,9 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	struct sk_buff *skb;
 	u32 urg_hole = 0;
 
+	if (unlikely(flags & MSG_ERRQUEUE))
+		return ip_recv_error(sk, msg, len, addr_len);
+
 	if (sk_can_busy_loop(sk) && skb_queue_empty(&sk->sk_receive_queue) &&
 	    (sk->sk_state == TCP_ESTABLISHED))
 		sk_busy_loop(sk, nonblock);
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 4e86c59..730c2f6 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -134,6 +134,10 @@ struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 				(__force u32)delta));
 	if (skb->ip_summed != CHECKSUM_PARTIAL)
 		th->check = gso_make_checksum(skb, ~th->check);
+
+	if (unlikely(skb_shinfo(gso_skb)->tx_flags & SKBTX_SW_TSTAMP))
+		skb_shinfo(skb)->tx_flags |= SKBTX_SW_TSTAMP;
+
 out:
 	return segs;
 }
diff --git a/net/socket.c b/net/socket.c
index 18ab44a..d01e323 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -697,6 +697,7 @@ EXPORT_SYMBOL(kernel_sendmsg);
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	struct sk_buff *skb)
 {
+	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
 	int need_software_tstamp = sock_flag(sk, SOCK_RCVTSTAMP);
 	struct sock_errqueue_timestamping tss;
 	int empty = 1;
@@ -727,6 +728,7 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	     skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP) &&
 	    ktime_to_timespec_cond(skb->tstamp, &tss.ts_sw)) {
 		empty = 0;
+		tss.ts_key = serr->ee.ee_data;
 		tss.ts_type = SCM_TSTAMP_SND;
 	}
 	if (shhwtstamps) {
-- 
2.0.0.526.g5318336


* [PATCH net-next 5/7] net-timestamp: ACK timestamp for bytestreams
  2014-06-24 15:43 net-timestamp: MSG_TSTAMP flags and bytestream support Willem de Bruijn
                   ` (3 preceding siblings ...)
  2014-06-24 15:43 ` [PATCH net-next 4/7] net-timestamp: TCP timestamping Willem de Bruijn
@ 2014-06-24 15:43 ` Willem de Bruijn
  2014-06-24 15:43 ` [PATCH net-next 6/7] net-timestamp: ENQ timestamp on enqueue to traffic shaping layer Willem de Bruijn
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-24 15:43 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, richardcochran, davem, Willem de Bruijn

This patch adds send() flag MSG_TSTAMP_ACK, a request for a timestamp
when the last byte in the send buffer is acknowledged. It implements
the feature for TCP.

The timestamp is generated when the TCP socket cumulative ACK is
moved beyond the tracked seqno for the first time. This corresponds
to the other peer having received all data up until this byte. The
feature ignores SACK and FACK, because those acknowledge the
specific byte, but not necessarily the entire contents of the buffer
passed in send().
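
The "moved beyond the tracked seqno" test has to survive 32-bit
sequence wraparound. The sketch below shows the usual signed-difference
comparison, in the style of the kernel's before()/after() helpers. The
names seq_before() and ack_covers() are hypothetical, and this is an
illustration rather than the code in tcp_input.c:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a precedes b in 32-bit sequence space (modulo 2^32). */
bool seq_before(uint32_t a, uint32_t b)
{
	/* relies on two's complement conversion, as the kernel does */
	return (int32_t)(a - b) < 0;
}

/*
 * True once the cumulative ACK has moved past last_seq, the sequence
 * number of the last byte passed to send(). SACK blocks are not
 * considered here.
 */
bool ack_covers(uint32_t cumulative_ack, uint32_t last_seq)
{
	return seq_before(last_seq, cumulative_ack);
}
```

A caller would generate the timestamp only on the first transition of
ack_covers() from false to true for a given tracked byte.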

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h | 24 ++++++++++++++++++++----
 include/linux/socket.h |  2 ++
 include/net/sock.h     |  4 ++--
 net/core/skbuff.c      | 11 +++++++----
 net/ipv4/tcp.c         |  2 +-
 net/ipv4/tcp_input.c   |  3 +++
 net/socket.c           | 11 +++++++++--
 7 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bec3ded..ee86654 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -258,8 +258,12 @@ enum {
 	 * all frags to avoid possible bad checksum
 	 */
 	SKBTX_SHARED_FRAG = 1 << 5,
+
+	SKBTX_ACK_TSTAMP = 1 << 6,
 };
 
+#define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | SKBTX_ACK_TSTAMP)
+
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
  * lower device, the skb last reference should be 0 when calling this.
@@ -2697,6 +2701,10 @@ static inline bool skb_defer_rx_timestamp(struct sk_buff *skb)
 void skb_complete_tx_timestamp(struct sk_buff *skb,
 			       struct skb_shared_hwtstamps *hwtstamps);
 
+void __skb_tstamp_tx(struct sk_buff *orig_skb,
+		     struct skb_shared_hwtstamps *hwtstamps,
+		     struct sock *sk, int tstype);
+
 /**
  * skb_tstamp_tx - queue clone of skb with send time stamps
  * @orig_skb:	the original outgoing packet
@@ -2708,8 +2716,12 @@ void skb_complete_tx_timestamp(struct sk_buff *skb,
  * generates a software time stamp (otherwise), then queues the clone
  * to the error queue of the socket.  Errors are silently ignored.
  */
-void skb_tstamp_tx(struct sk_buff *orig_skb,
-		   struct skb_shared_hwtstamps *hwtstamps);
+static inline void skb_tstamp_tx(struct sk_buff *orig_skb,
+				 struct skb_shared_hwtstamps *hwtstamps)
+{
+	__skb_tstamp_tx(orig_skb, hwtstamps, orig_skb->sk,
+			hwtstamps ? 0 : SKBTX_SW_TSTAMP);
+}
 
 static inline void sw_tx_timestamp(struct sk_buff *skb)
 {
@@ -2740,8 +2752,12 @@ static inline u8 skbflags_tx_tstamp(int flags)
 {
 	u8 tx_flags = 0;
 
-	if (unlikely(flags & MSG_TSTAMP))
-		tx_flags |= SKBTX_SW_TSTAMP;
+	if (unlikely(flags & MSG_TSTAMP_MASK)) {
+		if (flags & MSG_TSTAMP)
+			tx_flags |= SKBTX_SW_TSTAMP;
+		if (flags & MSG_TSTAMP_ACK)
+			tx_flags |= SKBTX_ACK_TSTAMP;
+	}
 
 	return tx_flags;
 }
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ce4101e..68d5f48 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -254,6 +254,8 @@ struct ucred {
 #define MSG_WAITFORONE	0x10000	/* recvmmsg(): block until 1+ packets avail */
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
 #define MSG_TSTAMP	0x100000
+#define MSG_TSTAMP_ACK	0x200000
+#define MSG_TSTAMP_MASK	(MSG_TSTAMP | MSG_TSTAMP_ACK)
 #define MSG_EOF         MSG_FIN
 
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
diff --git a/include/net/sock.h b/include/net/sock.h
index df7bde0..0fcd12e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2139,7 +2139,7 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
 	 * - receive time stamping in software requested (SOCK_RCVTSTAMP
 	 *   or SOCK_TIMESTAMPING_RX_SOFTWARE)
 	 * - software time stamp available and wanted
-	 *   (SOCK_TIMESTAMPING_SOFTWARE || SKBTX_SW_TSTAMP)
+	 *   (SOCK_TIMESTAMPING_SOFTWARE || SKBTX_ANY_SW_TSTAMP)
 	 * - hardware time stamps available and wanted
 	 *   (SOCK_TIMESTAMPING_SYS_HARDWARE or
 	 *   SOCK_TIMESTAMPING_RAW_HARDWARE)
@@ -2147,7 +2147,7 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
 	if (sock_flag(sk, SOCK_RCVTSTAMP) ||
 	    sock_flag(sk, SOCK_TIMESTAMPING_RX_SOFTWARE) ||
 	    (kt.tv64 && (sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE) ||
-	     skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP)) ||
+	     skb_shinfo(skb)->tx_flags & SKBTX_ANY_SW_TSTAMP)) ||
 	    (hwtstamps->hwtstamp.tv64 &&
 	     sock_flag(sk, SOCK_TIMESTAMPING_RAW_HARDWARE)) ||
 	    (hwtstamps->syststamp.tv64 &&
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 07ba98d..7b0823a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3491,10 +3491,10 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_queue_err_skb);
 
-void skb_tstamp_tx(struct sk_buff *orig_skb,
-		struct skb_shared_hwtstamps *hwtstamps)
+void __skb_tstamp_tx(struct sk_buff *orig_skb,
+		     struct skb_shared_hwtstamps *hwtstamps,
+		     struct sock *sk, int tstype)
 {
-	struct sock *sk = orig_skb->sk;
 	struct sock_exterr_skb *serr;
 	struct sk_buff *skb;
 	__u32 key = 0;
@@ -3534,6 +3534,8 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
 			key = ntohl(tcp_hdr(skb)->seq) +
 			      ntohs(ip_hdr(skb)->tot_len) -
 			      ip_hdrlen(skb) - tcp_hdrlen(skb);
+	} else if (tstype == SKBTX_ACK_TSTAMP) {
+		key = TCP_SKB_CB(orig_skb)->end_seq;
 	}
 
 	serr = SKB_EXT_ERR(skb);
@@ -3541,13 +3543,14 @@ void skb_tstamp_tx(struct sk_buff *orig_skb,
 	serr->ee.ee_errno = ENOMSG;
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
 	serr->ee.ee_data = key;
+	serr->ee.ee_info = tstype;
 
 	err = sock_queue_err_skb(sk, skb);
 
 	if (err)
 		kfree_skb(skb);
 }
-EXPORT_SYMBOL_GPL(skb_tstamp_tx);
+EXPORT_SYMBOL_GPL(__skb_tstamp_tx);
 
 void skb_complete_wifi_ack(struct sk_buff *skb, bool acked)
 {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4ceecd9..b792642 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -880,7 +880,7 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 
 static bool tcp_skb_can_extend(struct sk_buff *skb)
 {
-	return !(skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP);
+	return !(skb_shinfo(skb)->tx_flags & SKBTX_ANY_SW_TSTAMP);
 }
 
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 40661fc..90fa2df 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3103,6 +3103,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 		if (!fully_acked)
 			break;
 
+		if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_ACK_TSTAMP))
+			__skb_tstamp_tx(skb, NULL, sk, SKBTX_ACK_TSTAMP);
+
 		tcp_unlink_write_queue(skb, sk);
 		sk_wmem_free_skb(sk, skb);
 		if (skb == tp->retransmit_skb_hint)
diff --git a/net/socket.c b/net/socket.c
index d01e323..b71001b 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -725,11 +725,18 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 
 	memset(&tss, 0, sizeof(tss));
 	if ((sock_flag(sk, SOCK_TIMESTAMPING_SOFTWARE) ||
-	     skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP) &&
+	     skb_shinfo(skb)->tx_flags & SKBTX_ANY_SW_TSTAMP) &&
 	    ktime_to_timespec_cond(skb->tstamp, &tss.ts_sw)) {
 		empty = 0;
 		tss.ts_key = serr->ee.ee_data;
-		tss.ts_type = SCM_TSTAMP_SND;
+		switch (serr->ee.ee_info) {
+		case SKBTX_SW_TSTAMP:
+			tss.ts_type = SCM_TSTAMP_SND;
+			break;
+		case SKBTX_ACK_TSTAMP:
+			tss.ts_type = SCM_TSTAMP_ACK;
+			break;
+		}
 	}
 	if (shhwtstamps) {
 		if (sock_flag(sk, SOCK_TIMESTAMPING_SYS_HARDWARE) &&
-- 
2.0.0.526.g5318336
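
Reviewer note, not part of the patch: the trigger condition described in
the commit message can be modelled in user space. The sketch below assumes
the usual wraparound-safe comparison of 32-bit TCP sequence numbers;
seq_before() and ack_tstamp_due() are hypothetical names used only for
illustration and do not appear in the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is before b" for 32-bit TCP sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Hypothetical helper: true once the cumulative ACK point (snd_una) has
 * reached or passed end_seq, the sequence number just past the last byte
 * of the buffer handed to send(). SACK blocks are deliberately not
 * considered, mirroring the commit message.
 */
static bool ack_tstamp_due(uint32_t snd_una, uint32_t end_seq)
{
	return !seq_before(snd_una, end_seq);
}
```

With end_seq as the key that the patch reports in ee_data, the timestamp
becomes due the first time snd_una reaches end_seq, including across a
sequence number wraparound.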

* [PATCH net-next 6/7] net-timestamp: ENQ timestamp on enqueue to traffic shaping layer
  2014-06-24 15:43 net-timestamp: MSG_TSTAMP flags and bytestream support Willem de Bruijn
                   ` (4 preceding siblings ...)
  2014-06-24 15:43 ` [PATCH net-next 5/7] net-timestamp: ACK timestamp for bytestreams Willem de Bruijn
@ 2014-06-24 15:43 ` Willem de Bruijn
  2014-06-24 15:43 ` [PATCH net-next 7/7] net-timestamp: expand documentation Willem de Bruijn
  2014-06-25  7:32 ` net-timestamp: MSG_TSTAMP flags and bytestream support Richard Cochran
  7 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-24 15:43 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, richardcochran, davem, Willem de Bruijn

Kernel transmit latency is often incurred in the traffic shaping
layer. This patch adds a new timestamp on transmission just before
entering traffic shaping. When data travels through multiple devices
(bonding, tunneling, ...) each device will export an individual
timestamp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h | 8 +++++++-
 include/linux/socket.h | 3 ++-
 net/core/dev.c         | 3 +++
 net/socket.c           | 3 +++
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ee86654..b6c2926 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -260,9 +260,13 @@ enum {
 	SKBTX_SHARED_FRAG = 1 << 5,
 
 	SKBTX_ACK_TSTAMP = 1 << 6,
+
+	SKBTX_ENQ_TSTAMP = 1 << 7,
 };
 
-#define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | SKBTX_ACK_TSTAMP)
+#define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | \
+			     SKBTX_ACK_TSTAMP | \
+			     SKBTX_ENQ_TSTAMP)
 
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
@@ -2757,6 +2761,8 @@ static inline u8 skbflags_tx_tstamp(int flags)
 			tx_flags |= SKBTX_SW_TSTAMP;
 		if (flags & MSG_TSTAMP_ACK)
 			tx_flags |= SKBTX_ACK_TSTAMP;
+		if (flags & MSG_TSTAMP_ENQ)
+			tx_flags |= SKBTX_ENQ_TSTAMP;
 	}
 
 	return tx_flags;
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 68d5f48..6d21582 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -255,7 +255,8 @@ struct ucred {
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
 #define MSG_TSTAMP	0x100000
 #define MSG_TSTAMP_ACK	0x200000
-#define MSG_TSTAMP_MASK	(MSG_TSTAMP | MSG_TSTAMP_ACK)
+#define MSG_TSTAMP_ENQ	0x400000
+#define MSG_TSTAMP_MASK	(MSG_TSTAMP | MSG_TSTAMP_ACK | MSG_TSTAMP_ENQ)
 #define MSG_EOF         MSG_FIN
 
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
diff --git a/net/core/dev.c b/net/core/dev.c
index a04b12f..8df522b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2859,6 +2859,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 
 	skb_reset_mac_header(skb);
 
+	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_ENQ_TSTAMP))
+		__skb_tstamp_tx(skb, NULL, skb->sk, SKBTX_ENQ_TSTAMP);
+
 	/* Disable soft irqs for various locks below. Also
 	 * stops preemption for RCU.
 	 */
diff --git a/net/socket.c b/net/socket.c
index b71001b..9b8deaf 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -736,6 +736,9 @@ void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 		case SKBTX_ACK_TSTAMP:
 			tss.ts_type = SCM_TSTAMP_ACK;
 			break;
+		case SKBTX_ENQ_TSTAMP:
+			tss.ts_type = SCM_TSTAMP_ENQ;
+			break;
 		}
 	}
 	if (shhwtstamps) {
-- 
2.0.0.526.g5318336
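
Reviewer note, not part of the patch: after patches 5 and 6 a send() flag
travels through two translations, first to an skb tx_flags bit and later
(via ee_info) to the ts_type reported to user space. The following is a
user-space model of that path, a sketch only. The MSG_TSTAMP_* and
SCM_TSTAMP_* values are the ones proposed in this series, the value of
SKBTX_SW_TSTAMP is assumed to be 1 << 1 as in the existing enum, and
model_tx_flags() and model_ts_type() are hypothetical names.

```c
#include <stdint.h>

/* Values proposed in this series; not a released kernel ABI. */
#define MSG_TSTAMP	0x100000
#define MSG_TSTAMP_ACK	0x200000
#define MSG_TSTAMP_ENQ	0x400000
#define MSG_TSTAMP_MASK	(MSG_TSTAMP | MSG_TSTAMP_ACK | MSG_TSTAMP_ENQ)

enum {
	SKBTX_SW_TSTAMP  = 1 << 1,	/* assumption: existing enum value */
	SKBTX_ACK_TSTAMP = 1 << 6,
	SKBTX_ENQ_TSTAMP = 1 << 7,
};

enum {
	SCM_TSTAMP_SND = 1,
	SCM_TSTAMP_ACK = 2,
	SCM_TSTAMP_ENQ = 3,
};

/* Model of skbflags_tx_tstamp(): send() flags to skb tx_flags bits. */
static uint8_t model_tx_flags(int flags)
{
	uint8_t tx_flags = 0;

	if (flags & MSG_TSTAMP_MASK) {
		if (flags & MSG_TSTAMP)
			tx_flags |= SKBTX_SW_TSTAMP;
		if (flags & MSG_TSTAMP_ACK)
			tx_flags |= SKBTX_ACK_TSTAMP;
		if (flags & MSG_TSTAMP_ENQ)
			tx_flags |= SKBTX_ENQ_TSTAMP;
	}
	return tx_flags;
}

/* Model of the switch in __sock_recv_timestamp(): ee_info to ts_type. */
static int model_ts_type(int tstype)
{
	switch (tstype) {
	case SKBTX_SW_TSTAMP:
		return SCM_TSTAMP_SND;
	case SKBTX_ACK_TSTAMP:
		return SCM_TSTAMP_ACK;
	case SKBTX_ENQ_TSTAMP:
		return SCM_TSTAMP_ENQ;
	default:
		return 0;	/* ts_type stays zeroed, e.g. tstype 0 for hw */
	}
}
```

Each record carries exactly one tstype, so a send() with several flags set
yields several error queue messages, one per timestamping point.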

* [PATCH net-next 7/7] net-timestamp: expand documentation
  2014-06-24 15:43 net-timestamp: MSG_TSTAMP flags and bytestream support Willem de Bruijn
                   ` (5 preceding siblings ...)
  2014-06-24 15:43 ` [PATCH net-next 6/7] net-timestamp: ENQ timestamp on enqueue to traffic shaping layer Willem de Bruijn
@ 2014-06-24 15:43 ` Willem de Bruijn
  2014-06-25  7:32 ` net-timestamp: MSG_TSTAMP flags and bytestream support Richard Cochran
  7 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-24 15:43 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, richardcochran, davem, Willem de Bruijn

Expand Documentation/networking/timestamping.txt with interface
details of MSG_TSTAMP and bytestream timestamping. Also minor
cleanup of the other text.

Add Documentation/networking/msg_tstamp.c example application
to demonstrate the implementation.

Signed-off-by: Willem de Bruijn <willemb@google.com>

--

I included msg_tstamp.c for reference during review, mostly. I can
remove it for v2.
---
 Documentation/networking/timestamping.txt          | 176 +++++++--
 Documentation/networking/timestamping/msg_tstamp.c | 409 +++++++++++++++++++++
 2 files changed, 561 insertions(+), 24 deletions(-)
 create mode 100644 Documentation/networking/timestamping/msg_tstamp.c

diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index bc35541..21c5410 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -1,4 +1,4 @@
-The existing interfaces for getting network packages time stamped are:
+The interfaces for getting network packets time stamped are:
 
 * SO_TIMESTAMP
   Generate time stamp for each incoming packet using the (not necessarily
@@ -13,21 +13,47 @@ The existing interfaces for getting network packages time stamped are:
   Only for multicasts: approximate send time stamp by receiving the looped
   packet and using its receive time stamp.
 
-The following interface complements the existing ones: receive time
-stamps can be generated and returned for arbitrary packets and much
-closer to the point where the packet is really sent. Time stamps can
-be generated in software (as before) or in hardware (if the hardware
-has such a feature).
+* SO_TIMESTAMPING
+  Request timestamps on reception, transmission or both. Request hardware,
+  software or both timestamps.
+
+* MSG_TSTAMP..
+  Like SO_TIMESTAMPING, but unlike that socket option, request a timestamp
+  for the payload of one specific send() call only. Currently supports
+  only timestamping on transmission.
+
+
+SO_TIMESTAMP:
+
+This socket option enables timestamping of datagrams on the network reception
+path. Because the destination socket, if any, is not known early in the
+network stack, the feature has to be enabled for all possibly matching packets
+(i.e., datagrams). The same is true for all subsequent reception timestamp
+options, too.
+
+For interface details, see `man 7 socket`.
+
+
+SO_TIMESTAMPNS:
+
+This option is identical to SO_TIMESTAMP except for the returned data type.
+Its struct timespec allows for higher resolution (ns) timestamps than the
+timeval of SO_TIMESTAMP (us).
+
 
 SO_TIMESTAMPING:
 
 Instructs the socket layer which kind of information should be collected
-and/or reported.  The parameter is an integer with some of the following
-bits set. Setting other bits is an error and doesn't change the current
-state.
+and/or reported. Unlike SO_TIMESTAMP(NS), the socket option is not a boolean,
+but a bitmap. In an expression
+
+  err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) &val, sizeof(val));
+
+The parameter val is an integer with some of the following bits set. Setting
+other bits returns EINVAL and does not change the current state.
 
 Four of the bits are requests to the stack to try to generate
-timestamps.  Any combination of them is valid.
+timestamps. Any combination of them is valid.
 
 SOF_TIMESTAMPING_TX_HARDWARE:  try to obtain send time stamps in hardware
 SOF_TIMESTAMPING_TX_SOFTWARE:  try to obtain send time stamps in software
@@ -50,27 +76,129 @@ can generate hardware receive timestamps ignore
 SOF_TIMESTAMPING_RX_HARDWARE.  It is still a good idea to set that flag
 in case future drivers pay attention.
 
-If timestamps are reported, they will appear in a control message with
-cmsg_level==SOL_SOCKET, cmsg_type==SO_TIMESTAMPING, and a payload like
-this:
 
-struct scm_timestamping {
+MSG_TSTAMP:
+
+The socket options enable timestamps for all datagrams on a socket
+until the configuration is again updated. Timestamps are often of
+interest only selectively, for instance for sampled monitoring or
+to instrument outliers. In these cases, continuous monitoring imposes
+unnecessary cost.
+
+MSG_TSTAMP and the MSG_TSTAMP_* flags are passed immediately with
+a send() call and request a timestamp only for the data in that
+buffer. They do not change socket state, nor do they depend on any
+of the socket options. Both can be used independently. Enabling
+both concurrently is safe, but redundant.
+
+MSG_TSTAMP:
+  generates the same timestamp as
+  SOF_TIMESTAMPING_TX_SOFTWARE | SOF_TIMESTAMPING_SOFTWARE: a transmit
+  timestamp in the device driver prior to handing to the NIC. As such
+  support for this timestamp is device driver specific.
+
+MSG_TSTAMP_ENQ:
+  generates a timestamp in the traffic shaping layer, prior to queuing
+  a packet. Kernel transmit latency is, if long, often dominated by
+  queueing delay. The difference between MSG_TSTAMP_ENQ and MSG_TSTAMP
+  will expose this delay independently from protocol processing. On
+  machines with virtual devices where a transmitted packet travels
+  through multiple devices and, hence, multiple traffic shaping
+  layers, a timestamp is returned for each layer. This enables fine
+  grained measurement of queueing delay.
+
+MSG_TSTAMP_ACK:
+  generates a timestamp when all data in the send buffer has been
+  acknowledged. This only makes sense for reliable protocols. It is
+  currently only implemented for TCP. For that protocol, it may
+  over-report measurement, because it defines when all data up to
+  and including the buffer was acknowledged (a cumulative ACK). It
+  ignores SACK and FACK.
+
+Bytestream Timestamps
+
+Unlike the socket options, the MSG_TSTAMP_.. interface supports
+timestamping of data in a bytestream. Each request is interpreted
+as a request for when the entire content of the buffer has passed a
+defined timestamping point. That is, a MSG_TSTAMP request records
+when all bytes have reached the device driver, regardless of how
+many packets the data has been converted into.
+
+In general, bytestreams have no natural delimiters and therefore
+correlating a timestamp with data is non-trivial. A range of bytes
+may be split across packets, packets may be merged (possibly merging
+two halves of two previously split, otherwise independent, buffers).
+These segments may be reordered and can even coexist for reliable
+protocols that implement retransmissions.
+
+It is essential that all timestamps implement the same semantics,
+regardless of all possible transformations, as otherwise they are
+incomparable. Handling "rare" corner cases differently from the
+simple case (a 1:1 mapping from buffer to skb) is insufficient
+because performance debugging often needs to focus on such outliers.
+
+In practice, timestamps can be correlated with segments of a
+bytestream consistently, if both semantics of the timestamp and the
+timing of measurement are chosen correctly. This challenge is no
+different from deciding on a strategy for IP fragmentation. There, the
+definition is that only the first fragment is timestamped. For
+bytestreams, we chose that a timestamp is generated only when all
+bytes have passed a point. The MSG_TSTAMP_ACK as defined is easy to
+implement and reason about. An implementation that has to take into
+account SACK would be more complex due to possible transmission holes
+and out of order arrival.
+
+On the host, TCP can also break the simple 1:1 mapping from buffer to
+skb by
+- appending a buffer to an existing skb (e.g., Nagle, cork and autocork)
+- MSS-based segmentation
+- generic segmentation offload (GSO)
+
+The implementation avoids the first by effectively closing an skb
+for appends once a timestamp flag is set. The stack avoids
+segmentation due to MSS. GSO is supported by copying the relevant
+flag from the original large packet into the last of the segmented
+MTU or smaller sized packets.
+
+This ensures that the timestamp is generated only when all bytes have
+passed a timestamp point, if the network stack does not reorder the
+packets. The stack indeed tries to avoid reordering. The one exception
+is under administrator control: it is possible to construct a traffic
+shaping setup that delays segments differently. Such a setup would be
+unusual.
+
+
+Reading TIMESTAMPING and MSG_TSTAMP records
+
+Timestamps can be read using the ancillary data feature of recvmsg().
+See `man 3 cmsg` for details of this interface. Timestamps are
+returned in a control message with cmsg_level SOL_SOCKET, cmsg_type
+SO_TIMESTAMPING, and payload of type
+
+struct sock_errqueue_timestamping {
 	struct timespec systime;
 	struct timespec hwtimetrans;
 	struct timespec hwtimeraw;
+	__u32 ts_key;
+	__u16 ts_type;
+	__u16 ts_padding;
 };
 
-recvmsg() can be used to get this control message for regular incoming
-packets. For send time stamps the outgoing packet is looped back to
+For send timestamps the outgoing packet is looped back to
 the socket's error queue with the send time stamp(s) attached. It can
 be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the
-original outgoing packet data including all headers preprended down to
-and including the link layer, the scm_timestamping control message and
+original outgoing packet data including all headers prefixed down to
+and including the link layer, the timestamping control message and
 a sock_extended_err control message with ee_errno==ENOMSG and
-ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
-bounced packet is ready for reading as far as select() is concerned.
-If the outgoing packet has to be fragmented, then only the first
-fragment is time stamped and returned to the sending socket.
+ee_origin==SO_EE_ORIGIN_TIMESTAMPING. Reading from the error queue is
+always a non-blocking operation. The process can block for data using
+poll or select. In that case, the socket is ready for reading on POLLIN
+(not POLLERR).
+
+Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
+explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
+then only the first fragment is timestamped and returned to the sending
+socket.
 
 All three values correspond to the same event in time, but were
 generated in different ways. Each of these values may be empty (= all
@@ -97,7 +225,7 @@ Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support
 by the network device and will be empty without that support.
 
 
-SIOCSHWTSTAMP, SIOCGHWTSTAMP:
+Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
 
 Hardware time stamping must also be initialized for each device driver
 that is expected to do hardware time stamping. The parameter is defined in
@@ -169,7 +297,7 @@ enum {
 };
 
 
-DEVICE IMPLEMENTATION
+Hardware Timestamping Implementation: Device Drivers
 
 A driver which supports hardware time stamping must support the
 SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
diff --git a/Documentation/networking/timestamping/msg_tstamp.c b/Documentation/networking/timestamping/msg_tstamp.c
new file mode 100644
index 0000000..0c85133
--- /dev/null
+++ b/Documentation/networking/timestamping/msg_tstamp.c
@@ -0,0 +1,409 @@
+/*
+ * Conformance tests for MSG_TSTAMP, including
+ *
+ * - UDP MSG_TSTAMP
+ * - TCP MSG_TSTAMP, MSG_TSTAMP_ENQ and MSG_TSTAMP_ACK
+ * - IPv4 and IPv6
+ * - various packet sizes (to test GSO and TSO)
+ *
+ * Consult the command line arguments for help on running
+ * the various testcases.
+ *
+ * This test requires a dummy TCP server.
+ * A simple `nc6 [-u] -l -p $DESTPORT` will do
+ *
+ * Tested against Linux 3.16-rc1 (7171511eaec5)
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <arpa/inet.h>
+#include <asm/types.h>
+#include <error.h>
+#include <errno.h>
+#include <linux/errqueue.h>
+#include <linux/if_ether.h>
+#include <linux/net_tstamp.h>
+#include <netdb.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/udp.h>
+#include <netinet/tcp.h>
+#include <netpacket/packet.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/select.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+/* should be defined in include/uapi/linux/socket.h */
+#define MSG_TSTAMP	0x100000
+#define MSG_TSTAMP_ACK	0x200000
+#define MSG_TSTAMP_ENQ	0x400000
+
+#define NUM_RUNS	4
+
+/* command line parameters */
+static int do_udp;
+static int do_ipv4 = 1;
+static int do_ipv6 = 1;
+static int payload_len = 1;
+static int tstamp_no_payload;
+static uint16_t dest_port = 9000;
+
+struct sockaddr_in daddr;
+struct sockaddr_in6 daddr6;
+
+/* random globals */
+static struct timeval tv;
+static struct timespec ts_prev;
+static int tstamp_payload_len;
+
+static void __print_timestamp(const char *name, struct timespec *cur,
+			      uint32_t key)
+{
+	if (!(cur->tv_sec | cur->tv_nsec))
+		return;
+
+	fprintf(stderr, "  %s: %lu s %lu us (seq=%u, len=%u)",
+			name, cur->tv_sec, cur->tv_nsec / 1000,
+			key, tstamp_payload_len);
+
+	if ((ts_prev.tv_sec | ts_prev.tv_nsec)) {
+		int64_t cur_us, prev_us;
+
+		cur_us = (int64_t) cur->tv_sec * 1000 * 1000;
+		cur_us += cur->tv_nsec / 1000;
+
+		prev_us = (int64_t) ts_prev.tv_sec * 1000 * 1000;
+		prev_us += ts_prev.tv_nsec / 1000;
+
+		fprintf(stderr, "  (%+lld us)", (long long) (cur_us - prev_us));
+	}
+
+	ts_prev = *cur;
+	fprintf(stderr, "\n");
+}
+
+static void print_timestamp_usr(void)
+{
+	struct timespec ts;
+
+	ts.tv_sec = tv.tv_sec;
+	ts.tv_nsec = tv.tv_usec * 1000;
+	__print_timestamp("  USR", &ts, 0);
+
+}
+
+static void print_timestamp(struct sock_errqueue_timestamping *tss)
+{
+	const char *tsname;
+
+	switch (tss->ts_type) {
+	case SCM_TSTAMP_ENQ:
+		tsname = "  ENQ";
+		break;
+	case SCM_TSTAMP_SND:
+		tsname = "  SND";
+		break;
+	case SCM_TSTAMP_ACK:
+		tsname = "  ACK";
+		break;
+	default:
+		error(1, 0, "unknown timestamp type: %u",
+		tss->ts_type);
+	}
+	__print_timestamp(tsname, &tss->ts_sw, tss->ts_key);
+}
+
+static void __recv_errmsg_cmsg(struct msghdr *msg)
+{
+	struct cmsghdr *cm;
+
+	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
+		if (cm->cmsg_level == SOL_SOCKET &&
+		    cm->cmsg_type == SCM_TIMESTAMPING) {
+			print_timestamp((void *) CMSG_DATA(cm));
+			continue;
+		}
+
+		if ((cm->cmsg_level == SOL_IP &&
+		     cm->cmsg_type == IP_RECVERR) ||
+		    (cm->cmsg_level == SOL_IPV6 &&
+		     cm->cmsg_type == IPV6_RECVERR)) {
+			struct sock_extended_err *serr;
+
+			serr = (void *) CMSG_DATA(cm);
+			if (serr->ee_errno != ENOMSG ||
+			    serr->ee_origin != SO_EE_ORIGIN_TIMESTAMPING) {
+				fprintf(stderr, "unknown ip error %d %d\n",
+						serr->ee_errno,
+						serr->ee_origin);
+			}
+			continue;
+		}
+
+		fprintf(stderr, "%d, %d\n", cm->cmsg_level, cm->cmsg_type);
+	}
+
+}
+
+static int recv_errmsg(int fd)
+{
+	static char ctrl[1024 /* overcommit */];
+	static struct msghdr msg;
+	struct iovec entry;
+	static char *data;
+	int ret = 0;
+
+	data = malloc(payload_len);
+	if (!data)
+		error(1, 0, "malloc");
+
+	memset(&msg, 0, sizeof(msg));
+	memset(&entry, 0, sizeof(entry));
+	memset(ctrl, 0, sizeof(ctrl));
+	memset(data, 0, payload_len);
+
+	entry.iov_base = data;
+	/* for TCP we specify payload length to read one packet at a time. */
+	entry.iov_len = payload_len;
+	msg.msg_iov = &entry;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = ctrl;
+	msg.msg_controllen = sizeof(ctrl);
+
+	ret = recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT);
+	if (ret == -1 && (errno == EINTR || errno == EWOULDBLOCK))
+		goto done;
+	if (ret == -1)
+		error(1, errno, "recvmsg");
+
+	tstamp_payload_len = ret;
+	if (tstamp_no_payload && tstamp_payload_len)
+		error(1, 0, "recv: payload when configured without");
+	else if (!tstamp_no_payload && !tstamp_payload_len)
+		error(1, 0, "recv: no payload when configured with");
+
+	__recv_errmsg_cmsg(&msg);
+
+done:
+	free(data);
+	return ret == -1;
+}
+
+static void do_test(int family, unsigned int flags)
+{
+	char *buf;
+	int fd, i, val;
+
+	buf = malloc(payload_len);
+	if (!buf)
+		error(1, 0, "malloc");
+
+	if (do_udp)
+		fd = socket(family, SOCK_DGRAM, IPPROTO_UDP);
+	else
+		fd = socket(family, SOCK_STREAM, IPPROTO_TCP);
+	if (fd < 0)
+		error(1, errno, "socket");
+
+	if (!do_udp) {
+		val = 1;
+		if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
+			       (char*) &val, sizeof(val)))
+			error(1, 0, "setsockopt no nagle");
+
+		if (family == PF_INET) {
+			if (connect(fd, (void *) &daddr, sizeof(daddr)))
+				error(1, errno, "connect ipv4");
+		} else {
+			if (connect(fd, (void *) &daddr6, sizeof(daddr6)))
+				error(1, errno, "connect ipv6");
+		}
+	}
+
+	if (tstamp_no_payload) {
+		val = SOF_TIMESTAMPING_OPT_TX_NO_PAYLOAD;
+		if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
+			       (char *) &val, sizeof(val)))
+			error(1, 0, "setsockopt no payload");
+	}
+
+	for (i = 0; i < NUM_RUNS; i++) {
+		memset(&ts_prev, 0, sizeof(ts_prev));
+		memset(buf, 'a' + i, payload_len);
+		buf[payload_len - 1] = '\n';
+
+		gettimeofday(&tv, NULL);
+		if (do_udp) {
+			if (family == PF_INET)
+				val = sendto(fd, buf, payload_len, flags, (void *) &daddr, sizeof(daddr));
+			else
+				val = sendto(fd, buf, payload_len, flags, (void *) &daddr6, sizeof(daddr6));
+		} else {
+			val = send(fd, buf, payload_len, flags);
+		}
+		if (val != payload_len)
+			error(1, errno, "send");
+
+		usleep(50 * 1000);
+
+		print_timestamp_usr();
+		while (!recv_errmsg(fd)) {}
+	}
+
+	if (close(fd))
+		error(1, errno, "close");
+
+	free(buf);
+	usleep(400 * 1000);
+}
+
+static void __attribute__((noreturn)) usage(const char *filepath)
+{
+	fprintf(stderr, "\nUsage: %s [options] hostname\n"
+			"\nwhere options are:\n"
+			"  -4:   only IPv4\n"
+			"  -6:   only IPv6\n"
+			"  -h:   show this message\n"
+			"  -l N: send N bytes at a time\n"
+			"  -n:   no payload on tstamp\n"
+			"  -p N: connect to port N\n"
+			"  -u:   use udp\n",
+			filepath);
+	exit(1);
+}
+
+static void parse_opt(int argc, char **argv)
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "46hl:np:u")) != -1) {
+		switch (c) {
+		case '4':
+			do_ipv6 = 0;
+			break;
+		case '6':
+			do_ipv4 = 0;
+			break;
+		case 'u':
+			do_udp = 1;
+			break;
+		case 'l':
+			payload_len = strtoul(optarg, NULL, 10);
+			break;
+		case 'n':
+			tstamp_no_payload = 1;
+			break;
+		case 'p':
+			dest_port = strtoul(optarg, NULL, 10);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+		}
+	}
+
+	if (do_udp && payload_len > 1472)
+		error(1, 0, "udp packet might exceed expected MTU");
+	if (!do_ipv4 && !do_ipv6)
+		error(1, 0, "pass -4 or -6, not both");
+
+	if (optind != argc - 1)
+		error(1, 0, "missing required hostname argument");
+}
+
+static void resolve_hostname(const char *hostname)
+{
+	struct addrinfo *addrs, *cur;
+	int have_ipv4 = 0, have_ipv6 = 0;
+
+	if (getaddrinfo(hostname, NULL, NULL, &addrs))
+		error(1, errno, "getaddrinfo");
+
+	cur = addrs;
+	while (cur && !(have_ipv4 && have_ipv6)) {
+		if (!have_ipv4 && cur->ai_family == AF_INET) {
+			memcpy(&daddr, cur->ai_addr, sizeof(daddr));
+			daddr.sin_port = htons(dest_port);
+			have_ipv4 = 1;
+		}
+		else if (!have_ipv6 && cur->ai_family == AF_INET6) {
+			memcpy(&daddr6, cur->ai_addr, sizeof(daddr6));
+			daddr6.sin6_port = htons(dest_port);
+			have_ipv6 = 1;
+		}
+		cur = cur->ai_next;
+	}
+	if (addrs)
+		freeaddrinfo(addrs);
+
+	do_ipv4 &= have_ipv4;
+	do_ipv6 &= have_ipv6;
+}
+
+static void do_main(int family)
+{
+	fprintf(stderr, "family:       %s\n",
+			family == PF_INET ? "INET" : "INET6");
+
+	fprintf(stderr, "test SND\n");
+	do_test(family, MSG_TSTAMP);
+
+	fprintf(stderr, "test ENQ\n");
+	do_test(family, MSG_TSTAMP_ENQ);
+
+	fprintf(stderr, "test ENQ + SND\n");
+	do_test(family, MSG_TSTAMP_ENQ | MSG_TSTAMP);
+
+	if (!do_udp) {
+		fprintf(stderr, "\ntest ACK\n");
+		do_test(family, MSG_TSTAMP_ACK);
+
+		fprintf(stderr, "\ntest SND + ACK\n");
+		do_test(family, MSG_TSTAMP | MSG_TSTAMP_ACK);
+
+		fprintf(stderr, "\ntest ENQ + SND + ACK\n");
+		do_test(family, MSG_TSTAMP_ENQ | MSG_TSTAMP | MSG_TSTAMP_ACK);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	parse_opt(argc, argv);
+	resolve_hostname(argv[argc - 1]);
+
+	fprintf(stderr, "protocol:     %s\n", do_udp ? "udp" : "tcp");
+	fprintf(stderr, "payload:      %u\n", payload_len);
+	fprintf(stderr, "server port:  %u\n", dest_port);
+	fprintf(stderr, "\n");
+
+	if (do_ipv4)
+		do_main(PF_INET);
+	if (do_ipv6)
+		do_main(PF_INET6);
+
+	return 0;
+}
-- 
2.0.0.526.g5318336
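
Reviewer note, not part of the patch: the "Reading TIMESTAMPING and
MSG_TSTAMP records" section of the documentation can be condensed into a
small helper. This is a sketch that treats the payload as the historical
array of three struct timespec and reads only the first (software) entry,
so it should work with either struct layout discussed in this thread.
find_sw_timestamp() and fill_fake_record() are hypothetical names; the
second exists only so the parser can be exercised without a socket.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

/*
 * Walk the control messages of msg and copy the software timestamp (the
 * first of three timespecs) into *sw. Returns 0 on success, -1 if no
 * SO_TIMESTAMPING record of sufficient size is present.
 */
static int find_sw_timestamp(struct msghdr *msg, struct timespec *sw)
{
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		    cm->cmsg_type != SCM_TIMESTAMPING)
			continue;
		if (cm->cmsg_len < CMSG_LEN(3 * sizeof(struct timespec)))
			return -1;
		memcpy(sw, CMSG_DATA(cm), sizeof(*sw));
		return 0;
	}
	return -1;
}

/* Hypothetical helper: fabricate one timestamping record in buf. */
static void fill_fake_record(struct msghdr *msg, void *buf, size_t buflen,
			     const struct timespec ts[3])
{
	struct cmsghdr *cm;

	memset(msg, 0, sizeof(*msg));
	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(3 * sizeof(struct timespec));

	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TIMESTAMPING;
	cm->cmsg_len = CMSG_LEN(3 * sizeof(struct timespec));
	memcpy(CMSG_DATA(cm), ts, 3 * sizeof(struct timespec));
}
```

In a real program the msghdr would come from
recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT), as in recv_errmsg() of
msg_tstamp.c, and the control buffer must be suitably aligned for
struct cmsghdr.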

* Re: [PATCH net-next 1/7] net-timestamp: explicit SO_TIMESTAMPING ancillary data struct
  2014-06-24 15:43 ` [PATCH net-next 1/7] net-timestamp: explicit SO_TIMESTAMPING ancillary data struct Willem de Bruijn
@ 2014-06-25  4:56   ` Richard Cochran
  2014-06-25 21:18     ` Willem de Bruijn
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Cochran @ 2014-06-25  4:56 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, eric.dumazet, davem

On Tue, Jun 24, 2014 at 11:43:46AM -0400, Willem de Bruijn wrote:

> The code is backward compatible with legacy applications that treat
> the ancillary data as an anonymous array 'struct timespec data[3]'.
> It will break applications that test the size of the cmsg data.

I think this introduces an unacceptable ABI change.
In linuxptp we have

		if (SOL_SOCKET == level && SO_TIMESTAMPING == type) {
			if (cm->cmsg_len < sizeof(*ts) * 3) {
				pr_warning("short SO_TIMESTAMPING message");
				return -1;
			}
			ts = (struct timespec *) CMSG_DATA(cm);
		}

but other applications might barf if the length isn't exactly right.
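
To make the compatibility concern concrete, here is a minimal sketch of the
two styles of length check. The struct name ext_timestamping and both helper
functions are hypothetical; the layout only mirrors the struct proposed in
the quoted patch. A lower-bound check in the spirit of the linuxptp snippet
accepts the larger payload, while an exact-size check rejects it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <time.h>

/* Hypothetical mirror of the extended layout proposed in the patch. */
struct ext_timestamping {
	struct timespec ts[3];	/* legacy part: sw, hw_sys, hw_raw */
	uint32_t ts_key;
	uint16_t ts_type;
	uint16_t ts_padding;
};

/* Legacy payload size: an anonymous 'struct timespec data[3]'. */
#define LEGACY_PAYLOAD (3 * sizeof(struct timespec))

/* Lenient check: the payload must be at least the legacy size. */
static bool len_ok_at_least(size_t cmsg_len)
{
	return cmsg_len >= CMSG_LEN(LEGACY_PAYLOAD);
}

/* Strict check: the payload must be exactly the legacy size. */
static bool len_ok_exact(size_t cmsg_len)
{
	return cmsg_len == CMSG_LEN(LEGACY_PAYLOAD);
}
```

An application using the strict variant would start dropping timestamps as
soon as the kernel appends fields to the struct.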

> +/**
> + *	struct sock_errqueue_timestamping - timestamps exposed through cmsg
> + *
> + *	The timestamping interfaces SO_TIMESTAMPING, MSG_TSTAMP_*
> + *	communicate network timestamps to userspace by passing this struct
> + *	through a cmsg in recvmsg().
> + *
> + *	@ts_sw:     the sw timestamp: the contents depend on ts_type.

This would overload the field. I don't like that.

> + *	@ts_hw_sys: a hardware generated timestamp converted to system time.
> + *	@ts_hw_raw: a hardware generated timestamp in its raw format.
> + *	@ts_type:   the type of timestamp ts_sw. One of SCM_TSTAMP_*
> + *	@ts_key:    socket flow index that the timestamps correspond to
> + *		    (stream transport protocols only, e.g., TCP seqno)
> + *
> + *	The first three fields are dictated by historical use. The hardware
> + *	timestamps are empty unless hardware timestamping is enabled, but
> + *	they have to be present in each message.
> + */
> +struct sock_errqueue_timestamping {
> +	struct timespec ts_sw;
> +	struct timespec ts_hw_sys;
> +	struct timespec ts_hw_raw;
> +	__u32 ts_key;
> +	__u16 ts_type;
> +	__u16 ts_padding;
> +};
> +
> +enum {
> +	SCM_TSTAMP_SND = 1,
> +	SCM_TSTAMP_ACK = 2,
> +	SCM_TSTAMP_ENQ = 3
> +};

So why not simply introduce a new kind of CMSG for these new time
stamps? It appears that the use case for these is totally different
than for SO_TIMESTAMPING. I can't imagine why you would want to mix
them together.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 2/7] net-timestamp: MSG_TSTAMP one-shot tx timestamps
  2014-06-24 15:43 ` [PATCH net-next 2/7] net-timestamp: MSG_TSTAMP one-shot tx timestamps Willem de Bruijn
@ 2014-06-25  5:01   ` Richard Cochran
  2014-06-25 21:20     ` Willem de Bruijn
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Cochran @ 2014-06-25  5:01 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, eric.dumazet, davem

On Tue, Jun 24, 2014 at 11:43:47AM -0400, Willem de Bruijn wrote:
> The kernel supports datagram tx timestamping through socket option
> SO_TIMESTAMPING. This patch adds send() flag MSG_TSTAMP to allow
> selectively requesting a timestamp for a single packet.

This is a nice idea. I wonder whether the drivers will react
correctly, need to think about it.

> MSG_TSTAMP does not depend on SO_TIMESTAMPING. Enabling both
> concurrently is redundant, but safe.
> 
> This patch adds support for IPv4 and IPv6 UDP sendmsg().

What about raw?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/7] net-timestamp: tx timestamp without payload
  2014-06-24 15:43 ` [PATCH net-next 3/7] net-timestamp: tx timestamp without payload Willem de Bruijn
@ 2014-06-25  5:16   ` Richard Cochran
  2014-06-25 21:22     ` Willem de Bruijn
  0 siblings, 1 reply; 17+ messages in thread
From: Richard Cochran @ 2014-06-25  5:16 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, eric.dumazet, davem

On Tue, Jun 24, 2014 at 11:43:48AM -0400, Willem de Bruijn wrote:
> Applications receive tx timestamps from the kernel by reading the
> original packet from the socket error queue with recvmsg() and
> processing an ancillary data item that holds the timestamps.
> 
> If the application is only interested in the timestamp, then looping
> the whole packet back up to userspace wastes socket buffer space
> (SO_RCVBUF). This is especially important when the same packet is
> enqueued repeatedly with multiple timestamps.
> 
> This patch adds a socket option to loop the timestamp on top of an
> empty packet instead of a clone of the original.

This makes sense. In practice the looped buffer is totally useless,
due to the fact that many NICs can only handle one outstanding
transmit time stamp. Applications must make sure they only send one
packet at a time if they want every packet time stamped.

> diff --git a/include/net/sock.h b/include/net/sock.h
> index 32cd1be..df7bde0 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -690,6 +690,7 @@ enum sock_flags {
>  	SOCK_TIMESTAMPING_SOFTWARE,     /* %SOF_TIMESTAMPING_SOFTWARE */
>  	SOCK_TIMESTAMPING_RAW_HARDWARE, /* %SOF_TIMESTAMPING_RAW_HARDWARE */
>  	SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
> +	SOCK_TIMESTAMPING_OPT_TX_NO_PAYLOAD, /* %SOF_TIMESTAMPING_OPT_TX_NO_PAYLOAD */

That is a bit of a mouthful. How about something like:

 SOCK_TIMESTAMPING_PLAIN_TS
 SOCK_TIMESTAMPING_BARE_TS
 SOCK_TIMESTAMPING_TSONLY


Thanks,
Richard

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: net-timestamp: MSG_TSTAMP flags and bytestream support
  2014-06-24 15:43 net-timestamp: MSG_TSTAMP flags and bytestream support Willem de Bruijn
                   ` (6 preceding siblings ...)
  2014-06-24 15:43 ` [PATCH net-next 7/7] net-timestamp: expand documentation Willem de Bruijn
@ 2014-06-25  7:32 ` Richard Cochran
  2014-06-25 21:11   ` Willem de Bruijn
  7 siblings, 1 reply; 17+ messages in thread
From: Richard Cochran @ 2014-06-25  7:32 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, eric.dumazet, davem

On Tue, Jun 24, 2014 at 11:43:45AM -0400, Willem de Bruijn wrote:
> This patchset extends socket timestamping in a number of related ways.
> Most notably:
> 
> 2 MSG_TSTAMP: request a single tx timestamp by passing a flag on send
> 6 MSG_TSTAMP_ENQ: request a tx timestamp before traffic shaping.
> 5 MSG_TSTAMP_ACK: request a tx timestamp after acknowledgements (TCP)
> 4 TCP support for all three flags

Can you tell us a bit about the use case?  It sounds like this is
for performance monitoring.

> Each individual patch commit message gives more detail about the
> specific feature.
> 
> The other patches support the main feature:
> 1 explicitly define the timestamp response API
> 3 optionally avoid looping large packets onto the socket error queue.
> 7 documentation and an example test.

I think #2 and #3 could be improvements to the so_timestamping api.

The others probably need their own, separate api. I can't imagine
wanting to mix so_timestamping with these new tcp time stamps, but
maybe you want to explain the expected scenario?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: net-timestamp: MSG_TSTAMP flags and bytestream support
  2014-06-25  7:32 ` net-timestamp: MSG_TSTAMP flags and bytestream support Richard Cochran
@ 2014-06-25 21:11   ` Willem de Bruijn
  0 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-25 21:11 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev, Eric Dumazet, David Miller

Thanks for taking a look, Richard. I will reply inline to the individual emails.

On Wed, Jun 25, 2014 at 3:32 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Tue, Jun 24, 2014 at 11:43:45AM -0400, Willem de Bruijn wrote:
>> This patchset extends socket timestamping in a number of related ways.
>> Most notably:
>>
>> 2 MSG_TSTAMP: request a single tx timestamp by passing a flag on send
>> 6 MSG_TSTAMP_ENQ: request a tx timestamp before traffic shaping.
>> 5 MSG_TSTAMP_ACK: request a tx timestamp after acknowledgements (TCP)
>> 4 TCP support for all three flags
>
> Can you tell us a bit about the use case?  It sounds like this is
> for performance monitoring.

There are a couple of use cases: 24/7 sampled background monitoring,
detailed analysis of (tail) latency and application delivery
notifications, for example.

The MSG flags enable sampled measurement, both random sampling for the
first use case, and explicit communication of application data unit
delimiters to the kernel in the case of bytestreams. It is quite
common to embed discrete message protocols in a TCP stream to gain
reliability, flow control, ... In that case, only the send() call that
writes the final byte of a message will generate a timestamp that is
relevant to the application.
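
A minimal sketch of that bytestream case, under stated assumptions: the
helper send_message(), the send_fn type and the recording stub are all
hypothetical, and the timestamp flag is passed in as a parameter because
MSG_TSTAMP is only proposed by this patchset and has no value in released
headers. The helper writes one application message in chunks and sets the
flag only on the call that is expected to write the final byte.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Same signature as send(2), so the real send() can be passed directly. */
typedef ssize_t (*send_fn)(int fd, const void *buf, size_t len, int flags);

/*
 * Write one application message in pieces of at most 'chunk' bytes.
 * 'final_flags' (e.g. the proposed MSG_TSTAMP) is passed only on the call
 * that covers the last byte. Note: if that call is a short write, the
 * flag is passed again on the retry.
 */
static int send_message(send_fn do_send, int fd, const char *buf, size_t len,
			size_t chunk, int final_flags)
{
	size_t off = 0;

	if (chunk == 0)
		return -1;

	while (off < len) {
		size_t n = len - off < chunk ? len - off : chunk;
		int flags = (off + n == len) ? final_flags : 0;
		ssize_t ret = do_send(fd, buf + off, n, flags);

		if (ret <= 0)
			return -1;
		off += (size_t)ret;
	}
	return 0;
}

/* Recording stub for dry runs: accepts every byte and logs the flags. */
#define MAX_CALLS 16
static int stub_flags[MAX_CALLS];
static int stub_calls;

static ssize_t stub_send(int fd, const void *buf, size_t len, int flags)
{
	(void)fd;
	(void)buf;
	if (stub_calls < MAX_CALLS)
		stub_flags[stub_calls] = flags;
	stub_calls++;
	return (ssize_t)len;
}
```

With the stub, a 10 byte message sent in 4 byte chunks results in three
calls, and only the third carries the flag.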

The new measurement points are indeed for performance monitoring,
specifically for debugging of tail latency root causes. Traffic
shaping, TCP pacing, byte queue limits and other queuing can affect
latency. Measurement points along the skb lifecycle help pinpoint the
specific root cause.

>> Each individual patch commit message gives more detail about the
>> specific feature.
>>
>> The other patches support the main feature:
>> 1 explicitly define the timestamp response API
>> 3 optionally avoid looping large packets onto the socket error queue.
>> 7 documentation and an example test.
>
> I think #2 and #3 could be improvements to the so_timestamping api.
>
> The others probably need their own, separate api. I can't imagine
> wanting to mix so_timestamping with these new tcp time stamps, but
> maybe you want to explain the expected scenario?

I think that a single coherent timestamping API would be better than a
fragmented experience with disjoint feature sets. Both depend
internally on the same SKBTX_.. flags and the same skb_tstamp_tx
plumbing. I think that they should also implement the same feature
set and return timestamps in the same format (where possible given
legacy API constraints).

This requires extensions to both to cover the other's feature set,
e.g., MSG_TSTAMP_HW and SOF_TIMESTAMPING_TX_ENQUEUE. But those are
minor changes. I'd be happy to work on those as follow on work.




>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 1/7] net-timestamp: explicit SO_TIMESTAMPING ancillary data struct
  2014-06-25  4:56   ` Richard Cochran
@ 2014-06-25 21:18     ` Willem de Bruijn
  0 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-25 21:18 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev, Eric Dumazet, David Miller

On Wed, Jun 25, 2014 at 12:56 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Tue, Jun 24, 2014 at 11:43:46AM -0400, Willem de Bruijn wrote:
>
>> The code is backward compatible with legacy applications that treat
>> the ancillary data as an anonymous array 'struct timespec data[3]'.
>> It will break applications that test the size of the cmsg data.
>
> I think this introduces an unacceptable ABI change.
> In linuxptp we have
>
>                 if (SOL_SOCKET == level && SO_TIMESTAMPING == type) {
>                         if (cm->cmsg_len < sizeof(*ts) * 3) {
>                                 pr_warning("short SO_TIMESTAMPING message");
>                                 return -1;
>                         }
>                         ts = (struct timespec *) CMSG_DATA(cm);
>                 }
>
> but other applications might barf if the length isn't exactly right.

Fair point.

>> +/**
>> + *   struct sock_errqueue_timestamping - timestamps exposed through cmsg
>> + *
>> + *   The timestamping interfaces SO_TIMESTAMPING, MSG_TSTAMP_*
>> + *   communicate network timestamps to userspace by passing this struct
>> + *   through a cmsg in recvmsg().
>> + *
>> + *   @ts_sw:     the sw timestamp: the contents depend on ts_type.
>
> This would overload the field. I don't like that.
>
>> + *   @ts_hw_sys: a hardware generated timestamp converted to system time.
>> + *   @ts_hw_raw: a hardware generated timestamp in its raw format.
>> + *   @ts_type:   the type of timestamp ts_sw. One of SCM_TSTAMP_*
>> + *   @ts_key:    socket flow index that the timestamps correspond to
>> + *               (stream transport protocols only, e.g., TCP seqno)
>> + *
>> + *   The first three fields are dictated by historical use. The hardware
>> + *   timestamps are empty unless hardware timestamping is enabled, but
>> + *   they have to be present in each message.
>> + */
>> +struct sock_errqueue_timestamping {
>> +     struct timespec ts_sw;
>> +     struct timespec ts_hw_sys;
>> +     struct timespec ts_hw_raw;
>> +     __u32 ts_key;
>> +     __u16 ts_type;
>> +     __u16 ts_padding;
>> +};
>> +
>> +enum {
>> +     SCM_TSTAMP_SND = 1,
>> +     SCM_TSTAMP_ACK = 2,
>> +     SCM_TSTAMP_ENQ = 3
>> +};
>
> So why not simply introduce a new kind of CMSG for these new time
> stamps? It appears that the use case for these is totally different
> than for SO_TIMESTAMPING. I can't imagine why you would want to mix
> them together.

See also my reply in the patchset cover message (0/7). I do not agree
that the use case for the two interfaces is totally different. Both aim to
return a software timestamp on device transmission, for instance. The
only difference is whether this should occur on every datagram from a
socket or on a per-datagram basis.

Because of the legacy issue you raised, I agree that a new CMSG may
be in order. The simpler solution is to store ts_key and ts_type in
ee_data and ee_info, because these are so far undefined for error
origin SO_EE_ORIGIN_TIMESTAMPING.

> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 2/7] net-timestamp: MSG_TSTAMP one-shot tx timestamps
  2014-06-25  5:01   ` Richard Cochran
@ 2014-06-25 21:20     ` Willem de Bruijn
  0 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-25 21:20 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev, Eric Dumazet, David Miller

On Wed, Jun 25, 2014 at 1:01 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Tue, Jun 24, 2014 at 11:43:47AM -0400, Willem de Bruijn wrote:
>> The kernel supports datagram tx timestamping through socket option
>> SO_TIMESTAMPING. This patch adds send() flag MSG_TSTAMP to allow
>> selectively requesting a timestamp for a single packet.
>
> This is a nice idea. I wonder whether the drivers will react
> correctly, need to think about it.
>
>> MSG_TSTAMP does not depend on SO_TIMESTAMPING. Enabling both
>> concurrently is redundant, but safe.
>>
>> This patch adds support for IPv4 and IPv6 UDP sendmsg().
>
> What about raw?

I saw that bug/feature request earlier this month, too. I can add that
as future work to this patch set. It seems like a straightforward
extension.

>
> Thanks,
> Richard
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/7] net-timestamp: tx timestamp without payload
  2014-06-25  5:16   ` Richard Cochran
@ 2014-06-25 21:22     ` Willem de Bruijn
  0 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-06-25 21:22 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev, Eric Dumazet, David Miller

On Wed, Jun 25, 2014 at 1:16 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Tue, Jun 24, 2014 at 11:43:48AM -0400, Willem de Bruijn wrote:
>> Applications receive tx timestamps from the kernel by reading the
>> original packet from the socket error queue with recvmsg() and
>> processing an ancillary data item that holds the timestamps.
>>
>> If the application is only interested in the timestamp, then looping
>> the whole packet back up to userspace wastes socket buffer space
>> (SO_RCVBUF). This is especially important when the same packet is
>> enqueued repeatedly with multiple timestamps.
>>
>> This patch adds a socket option to loop the timestamp on top of an
>> empty packet instead of a clone of the original.
>
> This makes sense. In practice the looped buffer is totally useless,
> due to the fact that many NICs can only handle one outstanding
> transmit time stamp. Applications must make sure they only send one
> packet at a time if they want every packet time stamped.
>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 32cd1be..df7bde0 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -690,6 +690,7 @@ enum sock_flags {
>>       SOCK_TIMESTAMPING_SOFTWARE,     /* %SOF_TIMESTAMPING_SOFTWARE */
>>       SOCK_TIMESTAMPING_RAW_HARDWARE, /* %SOF_TIMESTAMPING_RAW_HARDWARE */
>>       SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
>> +     SOCK_TIMESTAMPING_OPT_TX_NO_PAYLOAD, /* %SOF_TIMESTAMPING_OPT_TX_NO_PAYLOAD */
>
> That is a bit of a mouthful. How about something like:
>
>  SOCK_TIMESTAMPING_PLAIN_TS
>  SOCK_TIMESTAMPING_BARE_TS
>  SOCK_TIMESTAMPING_TSONLY

Ack. I'll pick a shorter name. This also exceeded 80 chars.
>
>
> Thanks,
> Richard

^ permalink raw reply	[flat|nested] 17+ messages in thread

* net-timestamp: MSG_TSTAMP flags and bytestream support
@ 2014-07-03 19:39 Willem de Bruijn
  0 siblings, 0 replies; 17+ messages in thread
From: Willem de Bruijn @ 2014-07-03 19:39 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, richardcochran, stephen

This patchset extends socket timestamping for continuous sampled
monitoring of a wider set of protocols, including discrete
messages layered over bytestreams.

Patches 5 and 6 add additional measurement points in the data
lifecycle, to be able to detect the cause of latency (e.g.,
which tc queue).

Patch 4 extends timestamping to bytestreams, because many
application protocols, including those with discrete payloads,
use TCP as a transport layer for reliability, flow control, etc.

Patch 2 adds a send() flag-based control interface to request a
timestamp for a payload. With bytestream-based timestamping, this
allows an application to communicate the boundaries of discrete
application payloads, which often span multiple send() calls.

A second use of this feature is in sampled background monitoring.
For high volume production server workloads, high sampling rates
may be the only feasible approach to 24/7 monitoring.
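
As an illustration of sampled monitoring, here is a minimal sketch that
requests a timestamp on one in every N sends. The helper sampled_flags() is
hypothetical, it uses a simple counter instead of a random draw, and the
flag value is a parameter because MSG_TSTAMP is only defined by this
patchset.

```c
#include <stdint.h>

/*
 * Return 'tstamp_flag' for one in every 'period' calls and 0 otherwise.
 * The caller ORs the result into the flags argument of send().
 */
static int sampled_flags(uint64_t *counter, uint64_t period, int tstamp_flag)
{
	if (period == 0)
		return 0;	/* sampling disabled */
	return ((*counter)++ % period == 0) ? tstamp_flag : 0;
}
```

A caller would keep one counter per socket (or per thread) and pick the
period to bound the error queue load.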

The other patches support these core features:

Patch 1 explicitly defines a previously implicitly defined struct
Patch 3 optionally conserves SO_RCVBUF by not looping payloads
        back onto the error queue
Patch 7 documents the new mechanisms
Patch 8 fixes SO_TIMESTAMPING for SOCK_RAW and ping sockets
        and adds support for the new MSG_TSTAMP interface.

The individual patch commit messages give more detail about the
specific features.

Changelog:
  v1->v2
  - expand timestamping (existing and new) to SOCK_RAW and ping sockets
  - rename sock_errqueue_timestamping to scm_timestamping
  - change timestamp data format: do not add fields to scm_timestamping.
      Doing so could break legacy applications. Instead, communicate
      through an existing, but unused, field in the error message.
  - rename SOF_.._OPT_TX_NO_PAYLOAD to shorter SOF_.._OPT_TSONLY
  - move msg_tstamp test app out of patchset and to github
      git://github.com/wdebruij/kerneltools.git
 
Tested:
  msg_tstamp with various settings:
  - IPv4 and IPv6
  - UDP and TCP
  - 1 B and 20 KB payload
  - GSO, TSO, neither
  - with and without the no-payload feature (patch 3)

  Example output from one IPv4/TCP/1B/payload run:
  (on a bonded machine, resulting in 2 ENQ timestamps per send)

proto INET
  test SND
      USR: 1400265321 s 167847 us (seq=0, len=0)
      SND: 1400265321 s 167854 us (seq=408779523, len=7)  (+7 us)
  test ENQ
      USR: 1400265321 s 768728 us (seq=0, len=7)
      ENQ: 1400265321 s 768732 us (seq=3113669987, len=7)  (+4 us)
      ENQ: 1400265321 s 768734 us (seq=3113669987, len=7)  (+2 us)
  test ENQ + SND
      USR: 1400265322 s 369747 us (seq=0, len=7)
      ENQ: 1400265322 s 369750 us (seq=2305548511, len=7)  (+3 us)
      ENQ: 1400265322 s 369751 us (seq=2305548511, len=7)  (+1 us)
      SND: 1400265322 s 369752 us (seq=2305548511, len=7)  (+1 us)
  test ACK
      USR: 1400265322 s 970717 us (seq=0, len=7)
      ACK: 1400265322 s 970752 us (seq=2324323855, len=7)  (+35 us)
  test SND + ACK
      USR: 1400265323 s 571662 us (seq=0, len=7)
      SND: 1400265323 s 571681 us (seq=872301729, len=7)  (+19 us)
      ACK: 1400265323 s 571708 us (seq=872301729, len=7)  (+27 us)
  test ENQ + SND + ACK
      USR: 1400265324 s 172558 us (seq=0, len=7)
      ENQ: 1400265324 s 172561 us (seq=2135092223, len=7)  (+3 us)
      ENQ: 1400265324 s 172565 us (seq=2135092223, len=7)  (+4 us)
      SND: 1400265324 s 172581 us (seq=2135092223, len=7)  (+16 us)
      ACK: 1400265324 s 172624 us (seq=2135092223, len=7)  (+43 us)

  also regression tested existing SO_TIMESTAMP(NS|ING) interfaces
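
  The (+N us) values in the output above are deltas relative to the
  previous line. A minimal sketch of that computation, assuming the
  timestamps are available as struct timespec; the helper name
  ts_delta_us() is hypothetical and not taken from the test program:

```c
#include <stdint.h>
#include <time.h>

/* Difference b - a in microseconds, truncated toward zero. */
static int64_t ts_delta_us(const struct timespec *a, const struct timespec *b)
{
	int64_t sec = (int64_t)b->tv_sec - (int64_t)a->tv_sec;
	int64_t nsec = (int64_t)b->tv_nsec - (int64_t)a->tv_nsec;

	/* combine before dividing so a borrow across seconds is handled */
	return (sec * 1000000000LL + nsec) / 1000;
}
```

  For the first test, USR at 167847 us and SND at 167854 us give +7 us.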
 
Comments:

# overloading SCM_TIMESTAMPING

This patchset reuses SCM_TIMESTAMPING to return new kinds of
timestamps. It is possible to create a new SCM_.. and struct,
but both socket option and flags-based interface can in principle
be equivalent feature-wise. In that case, it is preferable if
they return information in the same format.

# SO_TIMESTAMPING / MSG_TSTAMP feature parity

This patchset does not implement a fully equivalent feature set yet.
Not included is a draft patch that adds the functionality of
MSG_TSTAMP_ENQ and MSG_TSTAMP_ACK to SO_TIMESTAMPING. It is trivial,
save for the fact that these new options overflow sk_flags. The
patch introduces a separate timestamping sk_tsflags.

^ permalink raw reply	[flat|nested] 17+ messages in thread
