linux-kernel.vger.kernel.org archive mirror
* [PATCH net 0/4] Allow TX timestamp with UDP GSO
@ 2019-05-23 21:06 Fred Klassen
  2019-05-23 21:06 ` [PATCH net 1/4] net/udp_gso: " Fred Klassen
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Fred Klassen @ 2019-05-23 21:06 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	netdev, linux-kernel, linux-kselftest, Willem de Bruijn
  Cc: Fred Klassen

Fixes an issue where TX timestamps do not arrive on the error queue
when the UDP_SEGMENT cmsg type is combined with the SO_TIMESTAMPING
cmsg type.

This also updates the UDP GSO selftests to optionally stress test
this condition and to report the reliability and performance of both
TX timestamping and zero-copy messages.

Fred Klassen (4):
  net/udp_gso: Allow TX timestamp with UDP GSO
  net/udpgso_bench_tx: options to exercise TX CMSG
  net/udpgso_bench_tx: fix sendmmsg on unconnected socket
  net/udpgso_bench_tx: audit error queue

 net/ipv4/udp_offload.c                        |   4 +
 tools/testing/selftests/net/udpgso_bench_tx.c | 376 ++++++++++++++++++++++++--
 2 files changed, 358 insertions(+), 22 deletions(-)

-- 
2.11.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-23 21:06 [PATCH net 0/4] Allow TX timestamp with UDP GSO Fred Klassen
@ 2019-05-23 21:06 ` Fred Klassen
  2019-05-23 21:39   ` Willem de Bruijn
  2019-05-23 21:59   ` Willem de Bruijn
  2019-05-23 21:06 ` [PATCH net 2/4] net/udpgso_bench_tx: options to exercise TX CMSG Fred Klassen
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 34+ messages in thread
From: Fred Klassen @ 2019-05-23 21:06 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	netdev, linux-kernel, linux-kselftest, Willem de Bruijn
  Cc: Fred Klassen

Fixes an issue where TX timestamps do not arrive on the error queue
when the UDP_SEGMENT cmsg type is combined with the SO_TIMESTAMPING
cmsg type. This can be illustrated with an updated udpgso_bench_tx
program which includes the '-T' option to test for this condition.

    ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
    poll timeout
    udp tx:      0 MB/s        1 calls/s      1 msg/s

The "poll timeout" message above indicates that the TX timestamp never
arrived.

It also appears that other TX CMSG types cause similar issues, for
example trying to set SOL_IP/IP_TOS.

    ./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
    poll timeout
    udp tx:      0 MB/s        1 calls/s      1 msg/s

This patch preserves tx_flags for the first UDP GSO segment. This
mirrors the stack's behaviour for IPv4 fragments.

Fixes: ee80d1ebe5ba ("udp: add udp gso")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 net/ipv4/udp_offload.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 065334b41d57..33de347695ae 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -228,6 +228,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	seg = segs;
 	uh = udp_hdr(seg);
 
+	/* preserve TX timestamp and zero-copy info for first segment */
+	skb_shinfo(seg)->tskey = skb_shinfo(gso_skb)->tskey;
+	skb_shinfo(seg)->tx_flags = skb_shinfo(gso_skb)->tx_flags;
+
 	/* compute checksum adjustment based on old length versus new */
 	newlen = htons(sizeof(*uh) + mss);
 	check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
-- 
2.11.0



* [PATCH net 2/4] net/udpgso_bench_tx: options to exercise TX CMSG
  2019-05-23 21:06 [PATCH net 0/4] Allow TX timestamp with UDP GSO Fred Klassen
  2019-05-23 21:06 ` [PATCH net 1/4] net/udp_gso: " Fred Klassen
@ 2019-05-23 21:06 ` Fred Klassen
  2019-05-23 21:45   ` Willem de Bruijn
  2019-05-23 21:52   ` Willem de Bruijn
  2019-05-23 21:06 ` [PATCH net 3/4] net/udpgso_bench_tx: fix sendmmsg on unconnected socket Fred Klassen
  2019-05-23 21:06 ` [PATCH net 4/4] net/udpgso_bench_tx: audit error queue Fred Klassen
  3 siblings, 2 replies; 34+ messages in thread
From: Fred Klassen @ 2019-05-23 21:06 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	netdev, linux-kernel, linux-kselftest, Willem de Bruijn
  Cc: Fred Klassen

This enhancement adds options that facilitate load testing with
additional TX CMSG options and that optionally print the results of
various send CMSG operations.

These options are especially useful in isolating situations
where error-queue messages are lost when combined with other
CMSG operations (e.g. SO_ZEROCOPY).

New options:

    -T - add TX CMSG that requests TX software timestamps
    -H - similar to -T except request TX hardware timestamps
    -q - add IP_TOS/IPV6_TCLASS TX CMSG
    -P - call poll() before reading error queue
    -v - print detailed results

Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 tools/testing/selftests/net/udpgso_bench_tx.c | 290 ++++++++++++++++++++++++--
 1 file changed, 273 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
index 4074538b5df5..a900f016b9e7 100644
--- a/tools/testing/selftests/net/udpgso_bench_tx.c
+++ b/tools/testing/selftests/net/udpgso_bench_tx.c
@@ -5,6 +5,8 @@
 #include <arpa/inet.h>
 #include <errno.h>
 #include <error.h>
+#include <linux/errqueue.h>
+#include <linux/net_tstamp.h>
 #include <netinet/if_ether.h>
 #include <netinet/in.h>
 #include <netinet/ip.h>
@@ -19,6 +21,7 @@
 #include <string.h>
 #include <sys/socket.h>
 #include <sys/time.h>
+#include <sys/poll.h>
 #include <sys/types.h>
 #include <unistd.h>
 
@@ -34,6 +37,10 @@
 #define SO_ZEROCOPY	60
 #endif
 
+#ifndef SO_EE_ORIGIN_ZEROCOPY
+#define SO_EE_ORIGIN_ZEROCOPY 5
+#endif
+
 #ifndef MSG_ZEROCOPY
 #define MSG_ZEROCOPY	0x4000000
 #endif
@@ -48,9 +55,14 @@ static uint16_t	cfg_mss;
 static int	cfg_payload_len	= (1472 * 42);
 static int	cfg_port	= 8000;
 static int	cfg_runtime_ms	= -1;
+static bool	cfg_poll;
 static bool	cfg_segment;
 static bool	cfg_sendmmsg;
 static bool	cfg_tcp;
+static uint32_t	cfg_tx_ts = SOF_TIMESTAMPING_TX_SOFTWARE;
+static bool	cfg_tx_tstamp;
+static uint32_t	cfg_tos;
+static bool	cfg_verbose;
 static bool	cfg_zerocopy;
 static int	cfg_msg_nr;
 static uint16_t	cfg_gso_size;
@@ -58,6 +70,10 @@ static uint16_t	cfg_gso_size;
 static socklen_t cfg_alen;
 static struct sockaddr_storage cfg_dst_addr;
 
+struct my_scm_timestamping {
+	struct timespec ts[3];
+};
+
 static bool interrupted;
 static char buf[NUM_PKT][ETH_MAX_MTU];
 
@@ -89,20 +105,20 @@ static int set_cpu(int cpu)
 
 static void setup_sockaddr(int domain, const char *str_addr, void *sockaddr)
 {
-	struct sockaddr_in6 *addr6 = (void *) sockaddr;
-	struct sockaddr_in *addr4 = (void *) sockaddr;
+	struct sockaddr_in6 *addr6 = (void *)sockaddr;
+	struct sockaddr_in *addr4 = (void *)sockaddr;
 
 	switch (domain) {
 	case PF_INET:
 		addr4->sin_family = AF_INET;
 		addr4->sin_port = htons(cfg_port);
-		if (inet_pton(AF_INET, str_addr, &(addr4->sin_addr)) != 1)
+		if (inet_pton(AF_INET, str_addr, &addr4->sin_addr) != 1)
 			error(1, 0, "ipv4 parse error: %s", str_addr);
 		break;
 	case PF_INET6:
 		addr6->sin6_family = AF_INET6;
 		addr6->sin6_port = htons(cfg_port);
-		if (inet_pton(AF_INET6, str_addr, &(addr6->sin6_addr)) != 1)
+		if (inet_pton(AF_INET6, str_addr, &addr6->sin6_addr) != 1)
 			error(1, 0, "ipv6 parse error: %s", str_addr);
 		break;
 	default:
@@ -110,23 +126,143 @@ static void setup_sockaddr(int domain, const char *str_addr, void *sockaddr)
 	}
 }
 
-static void flush_zerocopy(int fd)
+static void flush_cmsg(struct cmsghdr *cmsg)
+{
+	int i;
+
+	switch (cmsg->cmsg_level) {
+	case SOL_SOCKET:
+		if (cmsg->cmsg_type == SO_TIMESTAMPING) {
+			i = (cfg_tx_ts == SOF_TIMESTAMPING_TX_HARDWARE) ? 2 : 0;
+			struct my_scm_timestamping *tss;
+
+			tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg);
+			fprintf(stderr, "tx timestamp = %lu.%09lu\n",
+				tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
+		} else {
+			error(1, 0,
+			      "unknown SOL_SOCKET cmsg type=%u level=%u\n",
+			      cmsg->cmsg_type, cmsg->cmsg_level);
+		}
+		break;
+	case SOL_IP:
+	case SOL_IPV6:
+		switch (cmsg->cmsg_type) {
+		case IP_RECVERR:
+		case IPV6_RECVERR:
+		{
+			struct sock_extended_err *err;
+
+			err = (struct sock_extended_err *)CMSG_DATA(cmsg);
+			switch (err->ee_origin) {
+			case SO_EE_ORIGIN_TIMESTAMPING:
+				// Got a TX timestamp from error queue
+				fprintf(stderr,
+					"got SO_EE_ORIGIN_TIMESTAMPING\n");
+				break;
+			case SO_EE_ORIGIN_ICMP:
+			case SO_EE_ORIGIN_ICMP6:
+				fprintf(stderr,
+					"received ICMP error: type=%u, code=%u\n",
+					err->ee_type, err->ee_code);
+				break;
+			case SO_EE_ORIGIN_ZEROCOPY:
+			{
+				__u32 lo = err->ee_info;
+				__u32 hi = err->ee_data;
+
+				if (hi == lo - 1)
+					// TX was aborted
+					fprintf(stderr,
+						"Zerocopy TX aborted: lo=%u hi=%u\n",
+						lo, hi);
+				if (hi == lo)
+					// single ID acknowledged
+					fprintf(stderr,
+						"Zerocopy TX ack ID: %u\n",
+						lo);
+				else
+					// range of IDs acknowledged
+					fprintf(stderr,
+						"Zerocopy TX ack %u IDs %u to %u\n",
+						hi - lo + 1, lo, hi);
+				break;
+			}
+			case SO_EE_ORIGIN_LOCAL:
+				fprintf(stderr,
+					"received packet with local origin: %u\n",
+					err->ee_origin);
+				break;
+			default:
+				error(0, 1,
+				      "received packet with origin: %u\n",
+				      err->ee_origin);
+			}
+
+			break;
+		}
+		default:
+			error(0, 1, "unknown IP msg type=%u level=%u\n",
+			      cmsg->cmsg_type, cmsg->cmsg_level);
+			break;
+		}
+		break;
+	default:
+		error(0, 1, "unknown cmsg type=%u level=%u\n",
+		      cmsg->cmsg_type, cmsg->cmsg_level);
+	}
+}
+
+static void flush_errqueue_recv(int fd)
 {
 	struct msghdr msg = {0};	/* flush */
+	struct cmsghdr *cmsg;
+	struct iovec entry;
+	char control[1024];
+	char buf[1500];
 	int ret;
 
+	entry.iov_base = buf;
+	entry.iov_len = sizeof(buf);
+	msg.msg_iovlen = 1;
+	msg.msg_iov = &entry;
+	msg.msg_control = control;
+	msg.msg_controllen = sizeof(control);
+
 	while (1) {
 		ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
 		if (ret == -1 && errno == EAGAIN)
 			break;
 		if (ret == -1)
 			error(1, errno, "errqueue");
-		if (msg.msg_flags != (MSG_ERRQUEUE | MSG_CTRUNC))
-			error(1, 0, "errqueue: flags 0x%x\n", msg.msg_flags);
 		msg.msg_flags = 0;
+		if (cfg_verbose) {
+			for (cmsg = CMSG_FIRSTHDR(&msg);
+					cmsg;
+					cmsg = CMSG_NXTHDR(&msg, cmsg))
+				flush_cmsg(cmsg);
+		}
 	}
 }
 
+static void flush_errqueue(int fd)
+{
+	if (cfg_poll) {
+		struct pollfd fds = { 0 };
+		int ret;
+
+		fds.fd = fd;
+		fds.events = POLLERR;
+		ret = poll(&fds, 1, 1000);
+		if (ret == 0)
+			error(1, 0, "poll timeout");
+		else if (ret < 0)
+			error(1, errno, "poll");
+	}
+
+	flush_errqueue_recv(fd);
+}
+
 static int send_tcp(int fd, char *data)
 {
 	int ret, done = 0, count = 0;
@@ -168,16 +304,70 @@ static int send_udp(int fd, char *data)
 	return count;
 }
 
+static void send_ts_cmsg(struct cmsghdr *cm)
+{
+	uint32_t *valp;
+
+	cm->cmsg_level = SOL_SOCKET;
+	cm->cmsg_type = SO_TIMESTAMPING;
+	cm->cmsg_len = CMSG_LEN(sizeof(cfg_tx_ts));
+	valp = (void *)CMSG_DATA(cm);
+	*valp = cfg_tx_ts;
+}
+
+static void send_tos_cmsg(struct cmsghdr *cm)
+{
+	uint32_t *valp;
+	int level, type;
+
+	if (cfg_family == PF_INET) {
+		level = SOL_IP;
+		type  = IP_TOS;
+	} else {
+		level = SOL_IPV6;
+		type  = IPV6_TCLASS;
+	}
+
+	cm->cmsg_level = level;
+	cm->cmsg_type = type;
+	cm->cmsg_len = CMSG_LEN(sizeof(cfg_tos));
+	valp = (void *)CMSG_DATA(cm);
+	*valp = cfg_tos;
+}
+
 static int send_udp_sendmmsg(int fd, char *data)
 {
+	char control[CMSG_SPACE(sizeof(cfg_tos)) +
+		     CMSG_SPACE(sizeof(cfg_tx_ts))] = {0};
 	const int max_nr_msg = ETH_MAX_MTU / ETH_DATA_LEN;
 	struct mmsghdr mmsgs[max_nr_msg];
 	struct iovec iov[max_nr_msg];
 	unsigned int off = 0, left;
+	size_t msg_controllen = 0;
 	int i = 0, ret;
 
 	memset(mmsgs, 0, sizeof(mmsgs));
 
+	if (cfg_tx_tstamp || cfg_tos) {
+		struct msghdr msg = {0};
+		struct cmsghdr *cmsg;
+
+		msg.msg_control = control;
+		msg.msg_controllen = sizeof(control);
+		cmsg = CMSG_FIRSTHDR(&msg);
+		if (cfg_tos) {
+			send_tos_cmsg(cmsg);
+			msg_controllen += CMSG_SPACE(sizeof(cfg_tos));
+		}
+
+		if (cfg_tx_tstamp) {
+			if (msg_controllen)
+				cmsg = CMSG_NXTHDR(&msg, cmsg);
+			send_ts_cmsg(cmsg);
+			msg_controllen += CMSG_SPACE(sizeof(cfg_tx_ts));
+		}
+	}
+
 	left = cfg_payload_len;
 	while (left) {
 		if (i == max_nr_msg)
@@ -188,6 +378,10 @@ static int send_udp_sendmmsg(int fd, char *data)
 
 		mmsgs[i].msg_hdr.msg_iov = iov + i;
 		mmsgs[i].msg_hdr.msg_iovlen = 1;
+		if (msg_controllen) {
+			mmsgs[i].msg_hdr.msg_control = control;
+			mmsgs[i].msg_hdr.msg_controllen = msg_controllen;
+		}
 
 		off += iov[i].iov_len;
 		left -= iov[i].iov_len;
@@ -214,9 +408,13 @@ static void send_udp_segment_cmsg(struct cmsghdr *cm)
 
 static int send_udp_segment(int fd, char *data)
 {
-	char control[CMSG_SPACE(sizeof(cfg_gso_size))] = {0};
+	char control[CMSG_SPACE(sizeof(cfg_gso_size)) +
+		     CMSG_SPACE(sizeof(cfg_tos)) +
+		     CMSG_SPACE(sizeof(cfg_tx_ts))] = {0};
 	struct msghdr msg = {0};
 	struct iovec iov = {0};
+	size_t msg_controllen;
+	struct cmsghdr *cmsg;
 	int ret;
 
 	iov.iov_base = data;
@@ -227,8 +425,22 @@ static int send_udp_segment(int fd, char *data)
 
 	msg.msg_control = control;
 	msg.msg_controllen = sizeof(control);
-	send_udp_segment_cmsg(CMSG_FIRSTHDR(&msg));
+	cmsg = CMSG_FIRSTHDR(&msg);
+	send_udp_segment_cmsg(cmsg);
+	msg_controllen = CMSG_SPACE(sizeof(cfg_mss));
+	if (cfg_tx_tstamp) {
+		cmsg = CMSG_NXTHDR(&msg, cmsg);
+		send_ts_cmsg(cmsg);
+		msg_controllen += CMSG_SPACE(sizeof(cfg_tx_ts));
+	}
 
+	if (cfg_tos) {
+		cmsg = CMSG_NXTHDR(&msg, cmsg);
+		send_tos_cmsg(cmsg);
+		msg_controllen += CMSG_SPACE(sizeof(cfg_tos));
+	}
+
+	msg.msg_controllen = msg_controllen;
 	msg.msg_name = (void *)&cfg_dst_addr;
 	msg.msg_namelen = cfg_alen;
 
@@ -236,15 +448,16 @@ static int send_udp_segment(int fd, char *data)
 	if (ret == -1)
 		error(1, errno, "sendmsg");
 	if (ret != iov.iov_len)
-		error(1, 0, "sendmsg: %u != %lu\n", ret, iov.iov_len);
+		error(1, 0, "sendmsg: %u != %lu", ret, iov.iov_len);
 
 	return 1;
 }
 
 static void usage(const char *filepath)
 {
-	error(1, 0, "Usage: %s [-46cmtuz] [-C cpu] [-D dst ip] [-l secs] [-m messagenr] [-p port] [-s sendsize] [-S gsosize]",
-		    filepath);
+	error(1, 0,
+	      "Usage: %s [-46cmHPtTuvz] [-C cpu] [-D dst ip] [-l secs] [-M messagenr] [-p port] [-q tos] [-s sendsize] [-S gsosize]",
+	      filepath);
 }
 
 static void parse_opts(int argc, char **argv)
@@ -252,7 +465,7 @@ static void parse_opts(int argc, char **argv)
 	int max_len, hdrlen;
 	int c;
 
-	while ((c = getopt(argc, argv, "46cC:D:l:mM:p:s:S:tuz")) != -1) {
+	while ((c = getopt(argc, argv, "46cC:D:Hl:mM:p:s:q:PS:tTuvz")) != -1) {
 		switch (c) {
 		case '4':
 			if (cfg_family != PF_UNSPEC)
@@ -287,19 +500,35 @@ static void parse_opts(int argc, char **argv)
 		case 'p':
 			cfg_port = strtoul(optarg, NULL, 0);
 			break;
+		case 'P':
+			cfg_poll = true;
+			break;
 		case 's':
 			cfg_payload_len = strtoul(optarg, NULL, 0);
 			break;
+		case 'q':
+			cfg_tos = strtoul(optarg, NULL, 0);
+			break;
 		case 'S':
 			cfg_gso_size = strtoul(optarg, NULL, 0);
 			cfg_segment = true;
 			break;
+		case 'H':
+			cfg_tx_ts = SOF_TIMESTAMPING_TX_HARDWARE;
+			cfg_tx_tstamp = true;
+			break;
 		case 't':
 			cfg_tcp = true;
 			break;
+		case 'T':
+			cfg_tx_tstamp = true;
+			break;
 		case 'u':
 			cfg_connected = false;
 			break;
+		case 'v':
+			cfg_verbose = true;
+			break;
 		case 'z':
 			cfg_zerocopy = true;
 			break;
@@ -315,6 +544,12 @@ static void parse_opts(int argc, char **argv)
 		error(1, 0, "connectionless tcp makes no sense");
 	if (cfg_segment && cfg_sendmmsg)
 		error(1, 0, "cannot combine segment offload and sendmmsg");
+	if (cfg_tx_tstamp && !(cfg_segment || cfg_sendmmsg))
+		error(1, 0, "Options -T and -H require either -S or -m option");
+	if (cfg_tos && !(cfg_segment || cfg_sendmmsg))
+		error(1, 0, "Option -q requires either -S or -m option");
+	if (cfg_poll && !(cfg_segment || cfg_sendmmsg))
+		error(1, 0, "Poll option -P requires either -S or -m option");
 
 	if (cfg_family == PF_INET)
 		hdrlen = sizeof(struct iphdr) + sizeof(struct udphdr);
@@ -349,6 +584,19 @@ static void set_pmtu_discover(int fd, bool is_ipv4)
 		error(1, errno, "setsockopt path mtu");
 }
 
+static void set_tx_timestamping(int fd)
+{
+	int val = SOF_TIMESTAMPING_OPT_CMSG | SOF_TIMESTAMPING_OPT_ID;
+
+	if (cfg_tx_ts == SOF_TIMESTAMPING_TX_SOFTWARE)
+		val |= SOF_TIMESTAMPING_SOFTWARE;
+	else
+		val |= SOF_TIMESTAMPING_RAW_HARDWARE;
+
+	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)))
+		error(1, errno, "setsockopt tx timestamping");
+}
+
 int main(int argc, char **argv)
 {
 	unsigned long num_msgs, num_sends;
@@ -384,7 +632,11 @@ int main(int argc, char **argv)
 	if (cfg_segment)
 		set_pmtu_discover(fd, cfg_family == PF_INET);
 
-	num_msgs = num_sends = 0;
+	if (cfg_tx_tstamp)
+		set_tx_timestamping(fd);
+
+	num_msgs = 0;
+	num_sends = 0;
 	tnow = gettimeofday_ms();
 	tstop = tnow + cfg_runtime_ms;
 	treport = tnow + 1000;
@@ -400,8 +652,8 @@ int main(int argc, char **argv)
 		else
 			num_sends += send_udp(fd, buf[i]);
 		num_msgs++;
-		if (cfg_zerocopy && ((num_msgs & 0xF) == 0))
-			flush_zerocopy(fd);
+		if ((cfg_zerocopy && (num_msgs & 0xF) == 0) || cfg_tx_tstamp)
+			flush_errqueue(fd);
 
 		if (cfg_msg_nr && num_msgs >= cfg_msg_nr)
 			break;
@@ -413,7 +665,8 @@ int main(int argc, char **argv)
 				cfg_tcp ? "tcp" : "udp",
 				(num_msgs * cfg_payload_len) >> 20,
 				num_sends, num_msgs);
-			num_msgs = num_sends = 0;
+			num_msgs = 0;
+			num_sends = 0;
 			treport = tnow + 1000;
 		}
 
@@ -423,6 +676,9 @@ int main(int argc, char **argv)
 
 	} while (!interrupted && (cfg_runtime_ms == -1 || tnow < tstop));
 
+	if (cfg_zerocopy || cfg_tx_tstamp)
+		flush_errqueue(fd);
+
 	if (close(fd))
 		error(1, errno, "close");
 
-- 
2.11.0



* [PATCH net 3/4] net/udpgso_bench_tx: fix sendmmsg on unconnected socket
  2019-05-23 21:06 [PATCH net 0/4] Allow TX timestamp with UDP GSO Fred Klassen
  2019-05-23 21:06 ` [PATCH net 1/4] net/udp_gso: " Fred Klassen
  2019-05-23 21:06 ` [PATCH net 2/4] net/udpgso_bench_tx: options to exercise TX CMSG Fred Klassen
@ 2019-05-23 21:06 ` Fred Klassen
  2019-05-23 21:06 ` [PATCH net 4/4] net/udpgso_bench_tx: audit error queue Fred Klassen
  3 siblings, 0 replies; 34+ messages in thread
From: Fred Klassen @ 2019-05-23 21:06 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	netdev, linux-kernel, linux-kselftest, Willem de Bruijn
  Cc: Fred Klassen

Fixes the following error if using both -m and -u options:

    # ./udpgso_bench_tx -4um -D 172.16.120.189
    ./udpgso_bench_tx: sendmmsg: Destination address required

Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 tools/testing/selftests/net/udpgso_bench_tx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
index a900f016b9e7..56e0d890b066 100644
--- a/tools/testing/selftests/net/udpgso_bench_tx.c
+++ b/tools/testing/selftests/net/udpgso_bench_tx.c
@@ -378,6 +378,8 @@ static int send_udp_sendmmsg(int fd, char *data)
 
 		mmsgs[i].msg_hdr.msg_iov = iov + i;
 		mmsgs[i].msg_hdr.msg_iovlen = 1;
+		mmsgs[i].msg_hdr.msg_name = (void *)&cfg_dst_addr;
+		mmsgs[i].msg_hdr.msg_namelen = cfg_alen;
 		if (msg_controllen) {
 			mmsgs[i].msg_hdr.msg_control = control;
 			mmsgs[i].msg_hdr.msg_controllen = msg_controllen;
-- 
2.11.0



* [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-23 21:06 [PATCH net 0/4] Allow TX timestamp with UDP GSO Fred Klassen
                   ` (2 preceding siblings ...)
  2019-05-23 21:06 ` [PATCH net 3/4] net/udpgso_bench_tx: fix sendmmsg on unconnected socket Fred Klassen
@ 2019-05-23 21:06 ` Fred Klassen
  2019-05-23 21:56   ` Willem de Bruijn
  3 siblings, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-23 21:06 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	netdev, linux-kernel, linux-kselftest, Willem de Bruijn
  Cc: Fred Klassen

This enhancement adds the '-a' option, which will count all CMSG
messages on the error queue and print a summary report.

Example:

    # ./udpgso_bench_tx -4uT -a -l5 -S 1472 -D 172.16.120.189
    udp tx:    492 MB/s     8354 calls/s   8354 msg/s
    udp tx:    477 MB/s     8106 calls/s   8106 msg/s
    udp tx:    488 MB/s     8288 calls/s   8288 msg/s
    udp tx:    882 MB/s    14975 calls/s  14975 msg/s
    Summary over 5.000 seconds ...
    sum udp tx:    696 MB/s      57696 calls (11539/s)  57696 msgs (11539/s)
    Tx Timestamps: received:     57696   errors: 0

This can be useful in tracking message loss under load. For example,
adding the '-z' option results in loss of TX timestamp messages:

    # ./udpgso_bench_tx -4ucT -a -l5 -S 1472 -D 172.16.120.189 -p 3239 -z
    udp tx:    490 MB/s     8325 calls/s   8325 msg/s
    udp tx:    500 MB/s     8492 calls/s   8492 msg/s
    udp tx:    883 MB/s    14985 calls/s  14985 msg/s
    udp tx:    756 MB/s    12823 calls/s  12823 msg/s
    Summary over 5.000 seconds ...
    sum udp tx:    657 MB/s      54429 calls (10885/s)  54429 msgs (10885/s)
    Tx Timestamps: received:     34046   errors: 0
    Zerocopy acks: received:     54422   errors: 0

Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 tools/testing/selftests/net/udpgso_bench_tx.c | 152 +++++++++++++++++++-------
 1 file changed, 113 insertions(+), 39 deletions(-)

diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
index 56e0d890b066..9924342a0b03 100644
--- a/tools/testing/selftests/net/udpgso_bench_tx.c
+++ b/tools/testing/selftests/net/udpgso_bench_tx.c
@@ -62,10 +62,19 @@ static bool	cfg_tcp;
 static uint32_t	cfg_tx_ts = SOF_TIMESTAMPING_TX_SOFTWARE;
 static bool	cfg_tx_tstamp;
 static uint32_t	cfg_tos;
+static bool	cfg_audit;
 static bool	cfg_verbose;
 static bool	cfg_zerocopy;
 static int	cfg_msg_nr;
 static uint16_t	cfg_gso_size;
+static unsigned long total_num_msgs;
+static unsigned long total_num_sends;
+static unsigned long stat_tx_ts;
+static unsigned long stat_tx_ts_errors;
+static unsigned long tstart;
+static unsigned long tend;
+static unsigned long stat_zcopies;
+static unsigned long stat_zcopy_errors;
 
 static socklen_t cfg_alen;
 static struct sockaddr_storage cfg_dst_addr;
@@ -137,8 +146,11 @@ static void flush_cmsg(struct cmsghdr *cmsg)
 			struct my_scm_timestamping *tss;
 
 			tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg);
-			fprintf(stderr, "tx timestamp = %lu.%09lu\n",
-				tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
+			if (tss->ts[i].tv_sec == 0)
+				stat_tx_ts_errors++;
+			if (cfg_verbose)
+				fprintf(stderr, "tx timestamp = %lu.%09lu\n",
+					tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
 		} else {
 			error(1, 0,
 			      "unknown SOL_SOCKET cmsg type=%u level=%u\n",
@@ -157,41 +169,52 @@ static void flush_cmsg(struct cmsghdr *cmsg)
 			switch (err->ee_origin) {
 			case SO_EE_ORIGIN_TIMESTAMPING:
 				// Got a TX timestamp from error queue
-				fprintf(stderr,
-					"got SO_EE_ORIGIN_TIMESTAMPING\n");
+				stat_tx_ts++;
+				if (cfg_verbose)
+					fprintf(stderr,
+						"got SO_EE_ORIGIN_TIMESTAMPING\n");
 				break;
 			case SO_EE_ORIGIN_ICMP:
 			case SO_EE_ORIGIN_ICMP6:
-				fprintf(stderr,
-					"received ICMP error: type=%u, code=%u\n",
-					err->ee_type, err->ee_code);
+				if (cfg_verbose)
+					fprintf(stderr,
+						"received ICMP error: type=%u, code=%u\n",
+						err->ee_type, err->ee_code);
 				break;
 			case SO_EE_ORIGIN_ZEROCOPY:
 			{
 				__u32 lo = err->ee_info;
 				__u32 hi = err->ee_data;
 
-				if (hi == lo - 1)
+				if (hi == lo - 1) {
 					// TX was aborted
-					fprintf(stderr,
-						"Zerocopy TX aborted: lo=%u hi=%u\n",
-						lo, hi);
-				if (hi == lo)
+					stat_zcopy_errors++;
+					if (cfg_verbose)
+						fprintf(stderr,
+							"Zerocopy TX aborted: lo=%u hi=%u\n",
+							lo, hi);
+				} else if (hi == lo) {
 					// single ID acknowledged
-					fprintf(stderr,
-						"Zerocopy TX ack ID: %u\n",
-						lo);
-				else
+					stat_zcopies++;
+					if (cfg_verbose)
+						fprintf(stderr,
+							"Zerocopy TX ack ID: %u\n",
+							lo);
+				} else {
 					// range of IDs acknowledged
-					fprintf(stderr,
-						"Zerocopy TX ack %u IDs %u to %u\n",
-						hi - lo + 1, lo, hi);
+					stat_zcopies += hi - lo + 1;
+					if (cfg_verbose)
+						fprintf(stderr,
+							"Zerocopy TX ack %u IDs %u to %u\n",
+							hi - lo + 1, lo, hi);
+				}
 				break;
 			}
 			case SO_EE_ORIGIN_LOCAL:
-				fprintf(stderr,
-					"received packet with local origin: %u\n",
-					err->ee_origin);
+				if (cfg_verbose)
+					fprintf(stderr,
+						"received packet with local origin: %u\n",
+						err->ee_origin);
 				break;
 			default:
 				error(0, 1,
@@ -236,7 +259,7 @@ static void flush_errqueue_recv(int fd)
 		if (ret == -1)
 			error(1, errno, "errqueue");
 		msg.msg_flags = 0;
-		if (cfg_verbose) {
+		if (cfg_audit || cfg_verbose) {
 			for (cmsg = CMSG_FIRSTHDR(&msg);
 					cmsg;
 					cmsg = CMSG_NXTHDR(&msg, cmsg))
@@ -245,19 +268,21 @@ static void flush_errqueue_recv(int fd)
 	}
 }
 
-static void flush_errqueue(int fd)
+static void flush_errqueue(int fd, const bool do_poll)
 {
-	if (cfg_poll) {
+	if (do_poll) {
 		struct pollfd fds = { 0 };
 		int ret;
 
 		fds.fd = fd;
 		fds.events = POLLERR;
-		ret = poll(&fds, 1, 1000);
-		if (ret == 0)
-			error(1, 0, "poll timeout");
-		else if (ret < 0)
+		ret = poll(&fds, 1, 500);
+		if (ret == 0) {
+			if (cfg_verbose)
+				fprintf(stderr, "poll timeout\n");
+		} else if (ret < 0) {
 			error(1, errno, "poll");
+		}
 	}
 
 	flush_errqueue_recv(fd);
@@ -458,7 +483,7 @@ static int send_udp_segment(int fd, char *data)
 static void usage(const char *filepath)
 {
 	error(1, 0,
-	      "Usage: %s [-46cmHPtTuvz] [-C cpu] [-D dst ip] [-l secs] [-M messagenr] [-p port] [-q tos] [-s sendsize] [-S gsosize]",
+	      "Usage: %s [-46acmHPtTuvz] [-C cpu] [-D dst ip] [-l secs] [-M messagenr] [-p port] [-q tos] [-s sendsize] [-S gsosize]",
 	      filepath);
 }
 
@@ -467,7 +492,7 @@ static void parse_opts(int argc, char **argv)
 	int max_len, hdrlen;
 	int c;
 
-	while ((c = getopt(argc, argv, "46cC:D:Hl:mM:p:s:q:PS:tTuvz")) != -1) {
+	while ((c = getopt(argc, argv, "46acC:D:Hl:mM:p:s:q:PS:tTuvz")) != -1) {
 		switch (c) {
 		case '4':
 			if (cfg_family != PF_UNSPEC)
@@ -481,6 +506,9 @@ static void parse_opts(int argc, char **argv)
 			cfg_family = PF_INET6;
 			cfg_alen = sizeof(struct sockaddr_in6);
 			break;
+		case 'a':
+			cfg_audit = true;
+			break;
 		case 'c':
 			cfg_cache_trash = true;
 			break;
@@ -599,6 +627,51 @@ static void set_tx_timestamping(int fd)
 		error(1, errno, "setsockopt tx timestamping");
 }
 
+static void print_final_report(unsigned long num_msgs, unsigned long num_sends)
+{
+	unsigned long tdelta;
+
+	tdelta = tend - tstart;
+	if (!tdelta)
+		return;
+
+	fprintf(stderr, "Summary over %lu.%03lu seconds ...\n", tdelta / 1000,
+		tdelta % 1000);
+	fprintf(stderr,
+		"sum %s tx: %6lu MB/s %10lu calls (%lu/s) %10lu msgs (%lu/s)\n",
+		cfg_tcp ? "tcp" : "udp",
+		((num_msgs * cfg_payload_len) >> 10) / tdelta,
+		num_sends, num_sends * 1000 / tdelta,
+		num_msgs, num_msgs * 1000 / tdelta);
+	if (cfg_tx_tstamp)
+		fprintf(stderr,
+			"Tx Timestamps: received: %9lu   errors: %lu\n",
+			stat_tx_ts, stat_tx_ts_errors);
+
+	if (cfg_zerocopy)
+		fprintf(stderr,
+			"Zerocopy acks: received: %9lu   errors: %lu\n",
+			stat_zcopies, stat_zcopy_errors);
+}
+
+static void print_report(unsigned long num_msgs, unsigned long num_sends,
+			 const bool final)
+{
+	if (!final)
+		fprintf(stderr,
+			"%s tx: %6lu MB/s %8lu calls/s %6lu msg/s\n",
+			cfg_tcp ? "tcp" : "udp",
+			(num_msgs * cfg_payload_len) >> 20,
+			num_sends, num_msgs);
+
+	if (cfg_audit) {
+		total_num_msgs += num_msgs;
+		total_num_sends += num_sends;
+		if (final)
+			print_final_report(total_num_msgs, total_num_sends);
+	}
+}
+
 int main(int argc, char **argv)
 {
 	unsigned long num_msgs, num_sends;
@@ -640,6 +713,8 @@ int main(int argc, char **argv)
 	num_msgs = 0;
 	num_sends = 0;
 	tnow = gettimeofday_ms();
+	tstart = tnow;
+	tend = tnow;
 	tstop = tnow + cfg_runtime_ms;
 	treport = tnow + 1000;
 
@@ -654,19 +729,15 @@ int main(int argc, char **argv)
 		else
 			num_sends += send_udp(fd, buf[i]);
 		num_msgs++;
-		if ((cfg_zerocopy && (num_msgs & 0xF) == 0) || cfg_tx_tstamp)
-			flush_errqueue(fd);
+		if (cfg_tx_tstamp || (cfg_zerocopy && (num_msgs & 0xF) == 0))
+			flush_errqueue(fd, cfg_poll);
 
 		if (cfg_msg_nr && num_msgs >= cfg_msg_nr)
 			break;
 
 		tnow = gettimeofday_ms();
 		if (tnow > treport) {
-			fprintf(stderr,
-				"%s tx: %6lu MB/s %8lu calls/s %6lu msg/s\n",
-				cfg_tcp ? "tcp" : "udp",
-				(num_msgs * cfg_payload_len) >> 20,
-				num_sends, num_msgs);
+			print_report(num_msgs, num_sends, false);
 			num_msgs = 0;
 			num_sends = 0;
 			treport = tnow + 1000;
@@ -679,10 +750,13 @@ int main(int argc, char **argv)
 	} while (!interrupted && (cfg_runtime_ms == -1 || tnow < tstop));
 
 	if (cfg_zerocopy || cfg_tx_tstamp)
-		flush_errqueue(fd);
+		flush_errqueue(fd, true);
 
 	if (close(fd))
 		error(1, errno, "close");
 
+	tend = tnow;
+	print_report(num_msgs, num_sends, true);
+
 	return 0;
 }
-- 
2.11.0



* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-23 21:06 ` [PATCH net 1/4] net/udp_gso: " Fred Klassen
@ 2019-05-23 21:39   ` Willem de Bruijn
  2019-05-24  1:38     ` Fred Klassen
  2019-05-25 20:46     ` Fred Klassen
  2019-05-23 21:59   ` Willem de Bruijn
  1 sibling, 2 replies; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-23 21:39 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Thu, May 23, 2019 at 5:09 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> Fixes an issue where TX timestamps do not arrive on the error queue
> when the UDP_SEGMENT cmsg type is combined with the SO_TIMESTAMPING
> cmsg type. This can be illustrated with an updated udpgso_bench_tx
> program which includes the '-T' option to test for this condition.
>
>     ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
>     poll timeout
>     udp tx:      0 MB/s        1 calls/s      1 msg/s
>
> The "poll timeout" message above indicates that TX timestamp never
> arrived.
>
> It also appears that other TX CMSG types cause similar issues, for
> example trying to set SOL_IP/IP_TOS.
>
>     ./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
>     poll timeout
>     udp tx:      0 MB/s        1 calls/s      1 msg/s
>
> This patch preserves tx_flags for the first UDP GSO segment. This
> mirrors the stack's behaviour for IPv4 fragments.
>
> Fixes: ee80d1ebe5ba ("udp: add udp gso")
> Signed-off-by: Fred Klassen <fklassen@appneta.com>
> ---
>  net/ipv4/udp_offload.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 065334b41d57..33de347695ae 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -228,6 +228,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>         seg = segs;
>         uh = udp_hdr(seg);
>
> +       /* preserve TX timestamp and zero-copy info for first segment */
> +       skb_shinfo(seg)->tskey = skb_shinfo(gso_skb)->tskey;
> +       skb_shinfo(seg)->tx_flags = skb_shinfo(gso_skb)->tx_flags;
> +

Thanks for the report.

Zerocopy notification reference count is managed in skb_segment. That
should work.

Support for timestamping with the new GSO feature is indeed an
oversight. The solution is similar to how TCP associates the timestamp
with the right segment in tcp_gso_tstamp.

Only, I think we want to transfer the timestamp request to the last
datagram, not the first. For send timestamp, the final byte leaving
the host is usually more interesting.


* Re: [PATCH net 2/4] net/udpgso_bench_tx: options to exercise TX CMSG
  2019-05-23 21:06 ` [PATCH net 2/4] net/udpgso_bench_tx: options to exercise TX CMSG Fred Klassen
@ 2019-05-23 21:45   ` Willem de Bruijn
  2019-05-23 21:52   ` Willem de Bruijn
  1 sibling, 0 replies; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-23 21:45 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Thu, May 23, 2019 at 5:11 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> This enhancement adds options that facilitate load testing with
> additional TX CMSG options, and to optionally print results of
> various send CMSG operations.
>
> These options are especially useful in isolating situations
> where error-queue messages are lost when combined with other
> CMSG operations (e.g. SO_ZEROCOPY).
>
> New options:
>
>     -T - add TX CMSG that requests TX software timestamps
>     -H - similar to -T except request TX hardware timestamps
>     -q - add IP_TOS/IPV6_TCLASS TX CMSG
>     -P - call poll() before reading error queue
>     -v - print detailed results
>
> Fixes: 3a687bef148d ("selftests: udp gso benchmark")

This is not a fix, but an extension. Fixes tags help with backporting
to stable kernels. There is something to be said for backporting the
main change and supporting SO_TIMESTAMPING + UDP_GSO on older kernels,
especially since it is very concise. But the tests should probably be
in a separate patch set targeting net-next.


> Signed-off-by: Fred Klassen <fklassen@appneta.com>
> ---
>  tools/testing/selftests/net/udpgso_bench_tx.c | 290 ++++++++++++++++++++++++--
>  1 file changed, 273 insertions(+), 17 deletions(-)
>
> diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
> index 4074538b5df5..a900f016b9e7 100644
> --- a/tools/testing/selftests/net/udpgso_bench_tx.c
> +++ b/tools/testing/selftests/net/udpgso_bench_tx.c
> @@ -5,6 +5,8 @@
>  #include <arpa/inet.h>
>  #include <errno.h>
>  #include <error.h>
> +#include <linux/errqueue.h>
> +#include <linux/net_tstamp.h>
>  #include <netinet/if_ether.h>
>  #include <netinet/in.h>
>  #include <netinet/ip.h>
> @@ -19,6 +21,7 @@
>  #include <string.h>
>  #include <sys/socket.h>
>  #include <sys/time.h>
> +#include <sys/poll.h>
>  #include <sys/types.h>
>  #include <unistd.h>
>
> @@ -34,6 +37,10 @@
>  #define SO_ZEROCOPY    60
>  #endif
>
> +#ifndef SO_EE_ORIGIN_ZEROCOPY
> +#define SO_EE_ORIGIN_ZEROCOPY 5
> +#endif
> +
>  #ifndef MSG_ZEROCOPY
>  #define MSG_ZEROCOPY   0x4000000
>  #endif
> @@ -48,9 +55,14 @@ static uint16_t      cfg_mss;
>  static int     cfg_payload_len = (1472 * 42);
>  static int     cfg_port        = 8000;
>  static int     cfg_runtime_ms  = -1;
> +static bool    cfg_poll;
>  static bool    cfg_segment;
>  static bool    cfg_sendmmsg;
>  static bool    cfg_tcp;
> +static uint32_t        cfg_tx_ts = SOF_TIMESTAMPING_TX_SOFTWARE;
> +static bool    cfg_tx_tstamp;
> +static uint32_t        cfg_tos;
> +static bool    cfg_verbose;
>  static bool    cfg_zerocopy;
>  static int     cfg_msg_nr;
>  static uint16_t        cfg_gso_size;
> @@ -58,6 +70,10 @@ static uint16_t      cfg_gso_size;
>  static socklen_t cfg_alen;
>  static struct sockaddr_storage cfg_dst_addr;
>
> +struct my_scm_timestamping {
> +       struct timespec ts[3];
> +};
> +

This and the above should not be needed if including <linux/errqueue.h>

It may be absent if relying on the host header files, but the
kselftest build system should correctly use the files from the kernel
source tree.

>  static bool interrupted;
>  static char buf[NUM_PKT][ETH_MAX_MTU];
>
> @@ -89,20 +105,20 @@ static int set_cpu(int cpu)
>
>  static void setup_sockaddr(int domain, const char *str_addr, void *sockaddr)
>  {
> -       struct sockaddr_in6 *addr6 = (void *) sockaddr;
> -       struct sockaddr_in *addr4 = (void *) sockaddr;
> +       struct sockaddr_in6 *addr6 = (void *)sockaddr;
> +       struct sockaddr_in *addr4 = (void *)sockaddr;
>
>         switch (domain) {
>         case PF_INET:
>                 addr4->sin_family = AF_INET;
>                 addr4->sin_port = htons(cfg_port);
> -               if (inet_pton(AF_INET, str_addr, &(addr4->sin_addr)) != 1)
> +               if (inet_pton(AF_INET, str_addr, &addr4->sin_addr) != 1)
>                         error(1, 0, "ipv4 parse error: %s", str_addr);
>                 break;
>         case PF_INET6:
>                 addr6->sin6_family = AF_INET6;
>                 addr6->sin6_port = htons(cfg_port);
> -               if (inet_pton(AF_INET6, str_addr, &(addr6->sin6_addr)) != 1)
> +               if (inet_pton(AF_INET6, str_addr, &addr6->sin6_addr) != 1)

Please do not include style changes like these. Try to minimize
changes required to add the new feature.


* Re: [PATCH net 2/4] net/udpgso_bench_tx: options to exercise TX CMSG
  2019-05-23 21:06 ` [PATCH net 2/4] net/udpgso_bench_tx: options to exercise TX CMSG Fred Klassen
  2019-05-23 21:45   ` Willem de Bruijn
@ 2019-05-23 21:52   ` Willem de Bruijn
  2019-05-24  2:10     ` Fred Klassen
  1 sibling, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-23 21:52 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Thu, May 23, 2019 at 5:11 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> This enhancement adds options that facilitate load testing with
> additional TX CMSG options, and to optionally print results of
> various send CMSG operations.
>
> These options are especially useful in isolating situations
> where error-queue messages are lost when combined with other
> CMSG operations (e.g. SO_ZEROCOPY).
>
> New options:
>
>     -T - add TX CMSG that requests TX software timestamps
>     -H - similar to -T except request TX hardware timestamps
>     -q - add IP_TOS/IPV6_TCLASS TX CMSG

To ensure that we do not regress, when adding options, please consider
(just a general suggestion, not a strong request for this patch set)
updating the kselftest to run a variant of the test with the new code
coverage. In this case, make the test pass/fail instead of only
user-interpretable, and add variants to udpgso.sh.

>     -P - call poll() before reading error queue
>     -v - print detailed results
>
> Fixes: 3a687bef148d ("selftests: udp gso benchmark")
> Signed-off-by: Fred Klassen <fklassen@appneta.com>

> +static void flush_errqueue_recv(int fd)
>  {
>         struct msghdr msg = {0};        /* flush */
> +       struct cmsghdr *cmsg;
> +       struct iovec entry;
> +       char control[1024];

can use more precise CMSG_SPACE based on worst case expectations, like
in udp_sendmmsg

> +       char buf[1500];

no need for payload

> +static void flush_errqueue(int fd)
> +{
> +       if (cfg_poll) {
> +               struct pollfd fds = { 0 };
> +               int ret;
> +
> +               fds.fd = fd;
> +               fds.events = POLLERR;

no need to pass POLLERR, it is always returned in revents.


* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-23 21:06 ` [PATCH net 4/4] net/udpgso_bench_tx: audit error queue Fred Klassen
@ 2019-05-23 21:56   ` Willem de Bruijn
  2019-05-24  1:27     ` Fred Klassen
  2019-05-27 21:30     ` Fred Klassen
  0 siblings, 2 replies; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-23 21:56 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Thu, May 23, 2019 at 5:11 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> This enhancement adds the '-a' option, which will count all CMSG
> messages on the error queue and print a summary report.
>
> Fixes: 3a687bef148d ("selftests: udp gso benchmark")

Also not a fix, but an extension.

>
> Example:
>
>     # ./udpgso_bench_tx -4uT -a -l5 -S 1472 -D 172.16.120.189
>     udp tx:    492 MB/s     8354 calls/s   8354 msg/s
>     udp tx:    477 MB/s     8106 calls/s   8106 msg/s
>     udp tx:    488 MB/s     8288 calls/s   8288 msg/s
>     udp tx:    882 MB/s    14975 calls/s  14975 msg/s
>     Summary over 5.000 seconds ...
>     sum udp tx:    696 MB/s      57696 calls (11539/s)  57696 msgs (11539/s)
>     Tx Timestamps: received:     57696   errors: 0
>
> This can be useful in tracking loss of messages when under load. For example,
> adding the '-z' option results in loss of TX timestamp messages:
>
>     # ./udpgso_bench_tx -4ucT -a -l5 -S 1472 -D 172.16.120.189 -p 3239 -z
>     udp tx:    490 MB/s     8325 calls/s   8325 msg/s
>     udp tx:    500 MB/s     8492 calls/s   8492 msg/s
>     udp tx:    883 MB/s    14985 calls/s  14985 msg/s
>     udp tx:    756 MB/s    12823 calls/s  12823 msg/s
>     Summary over 5.000 seconds ...
>     sum udp tx:    657 MB/s      54429 calls (10885/s)  54429 msgs (10885/s)
>     Tx Timestamps: received:     34046   errors: 0
>     Zerocopy acks: received:     54422   errors: 0

This would probably also be more useful as a regression test if it is in
the form of a pass/fail test: if timestamps are requested and the total
count is zero, then the feature is broken and the process should exit
with an error.

>
> Fixes: 3a687bef148d ("selftests: udp gso benchmark")

Repeated

> Signed-off-by: Fred Klassen <fklassen@appneta.com>
> ---
>  tools/testing/selftests/net/udpgso_bench_tx.c | 152 +++++++++++++++++++-------
>  1 file changed, 113 insertions(+), 39 deletions(-)
>
> diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
> index 56e0d890b066..9924342a0b03 100644
> --- a/tools/testing/selftests/net/udpgso_bench_tx.c
> +++ b/tools/testing/selftests/net/udpgso_bench_tx.c
> @@ -62,10 +62,19 @@ static bool cfg_tcp;
>  static uint32_t        cfg_tx_ts = SOF_TIMESTAMPING_TX_SOFTWARE;
>  static bool    cfg_tx_tstamp;
>  static uint32_t        cfg_tos;
> +static bool    cfg_audit;
>  static bool    cfg_verbose;
>  static bool    cfg_zerocopy;
>  static int     cfg_msg_nr;
>  static uint16_t        cfg_gso_size;
> +static unsigned long total_num_msgs;
> +static unsigned long total_num_sends;
> +static unsigned long stat_tx_ts;
> +static unsigned long stat_tx_ts_errors;
> +static unsigned long tstart;
> +static unsigned long tend;
> +static unsigned long stat_zcopies;
> +static unsigned long stat_zcopy_errors;
>
>  static socklen_t cfg_alen;
>  static struct sockaddr_storage cfg_dst_addr;
> @@ -137,8 +146,11 @@ static void flush_cmsg(struct cmsghdr *cmsg)
>                         struct my_scm_timestamping *tss;
>
>                         tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg);
> -                       fprintf(stderr, "tx timestamp = %lu.%09lu\n",
> -                               tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
> +                       if (tss->ts[i].tv_sec == 0)
> +                               stat_tx_ts_errors++;
> +                       if (cfg_verbose)
> +                               fprintf(stderr, "tx timestamp = %lu.%09lu\n",
> +                                       tss->ts[i].tv_sec, tss->ts[i].tv_nsec);

changes unrelated to this feature?


* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-23 21:06 ` [PATCH net 1/4] net/udp_gso: " Fred Klassen
  2019-05-23 21:39   ` Willem de Bruijn
@ 2019-05-23 21:59   ` Willem de Bruijn
  2019-05-25 20:09     ` Fred Klassen
  2019-05-25 20:47     ` Fred Klassen
  1 sibling, 2 replies; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-23 21:59 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Thu, May 23, 2019 at 5:09 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> Fixes an issue where TX Timestamps are not arriving on the error queue
> when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
> This can be illustrated with an updated udpgso_bench_tx program which
> includes the '-T' option to test for this condition.
>
>     ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
>     poll timeout
>     udp tx:      0 MB/s        1 calls/s      1 msg/s
>
> The "poll timeout" message above indicates that TX timestamp never
> arrived.
>
> It also appears that other TX CMSG types cause similar issues, for
> example trying to set SOL_IP/IP_TOS.
>
>     ./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
>     poll timeout
>     udp tx:      0 MB/s        1 calls/s      1 msg/s

what exactly is the issue with IP_TOS?

If I understand correctly, the issue here is that the new 'P' option
that polls on the error queue times out. This is unrelated to
specifying TOS bits? Without zerocopy or timestamps, no message is
expected on the error queue.


* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-23 21:56   ` Willem de Bruijn
@ 2019-05-24  1:27     ` Fred Klassen
  2019-05-24  5:02       ` Willem de Bruijn
  2019-05-27 21:30     ` Fred Klassen
  1 sibling, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-24  1:27 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

Willem, this is only my 2nd patch, and my last one was a one-liner.
I’ll try to work through this, but let me know if I am making a rookie
mistake (learning curve and all).


> On May 23, 2019, at 2:56 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 
> On Thu, May 23, 2019 at 5:11 PM Fred Klassen <fklassen@appneta.com> wrote:
>> 
>> This enhancement adds the '-a' option, which will count all CMSG
>> messages on the error queue and print a summary report.
>> 
>> Fixes: 3a687bef148d ("selftests: udp gso benchmark")
> 
> Also not a fix, but an extension.

I’ll make a v2 patch and remove the "Fixes:" tag.

> 
>> 
>> Example:
>> 
>>    # ./udpgso_bench_tx -4uT -a -l5 -S 1472 -D 172.16.120.189
>>    udp tx:    492 MB/s     8354 calls/s   8354 msg/s
>>    udp tx:    477 MB/s     8106 calls/s   8106 msg/s
>>    udp tx:    488 MB/s     8288 calls/s   8288 msg/s
>>    udp tx:    882 MB/s    14975 calls/s  14975 msg/s
>>    Summary over 5.000 seconds ...
>>    sum udp tx:    696 MB/s      57696 calls (11539/s)  57696 msgs (11539/s)
>>    Tx Timestamps: received:     57696   errors: 0
>> 
>> This can be useful in tracking loss of messages when under load. For example,
>> adding the '-z' option results in loss of TX timestamp messages:
>> 
>>    # ./udpgso_bench_tx -4ucT -a -l5 -S 1472 -D 172.16.120.189 -p 3239 -z
>>    udp tx:    490 MB/s     8325 calls/s   8325 msg/s
>>    udp tx:    500 MB/s     8492 calls/s   8492 msg/s
>>    udp tx:    883 MB/s    14985 calls/s  14985 msg/s
>>    udp tx:    756 MB/s    12823 calls/s  12823 msg/s
>>    Summary over 5.000 seconds ...
>>    sum udp tx:    657 MB/s      54429 calls (10885/s)  54429 msgs (10885/s)
>>    Tx Timestamps: received:     34046   errors: 0
>>    Zerocopy acks: received:     54422   errors: 0
> 
> This would probably also be more useful as a regression test if it is in
> the form of a pass/fail test: if timestamps are requested and the total
> count is zero, then the feature is broken and the process should exit
> with an error.
> 

I’ll add a hard failure for zero response for TX Timestamps or Zerocopy,
or if any errors occur.


>> 
>> Fixes: 3a687bef148d ("selftests: udp gso benchmark")
> 
> Repeated

Will fix.
> 
>> Signed-off-by: Fred Klassen <fklassen@appneta.com>
>> ---
>> tools/testing/selftests/net/udpgso_bench_tx.c | 152 +++++++++++++++++++-------
>> 1 file changed, 113 insertions(+), 39 deletions(-)
>> 
>> diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c b/tools/testing/selftests/net/udpgso_bench_tx.c
>> index 56e0d890b066..9924342a0b03 100644
>> --- a/tools/testing/selftests/net/udpgso_bench_tx.c
>> +++ b/tools/testing/selftests/net/udpgso_bench_tx.c
>> @@ -62,10 +62,19 @@ static bool cfg_tcp;
>> static uint32_t        cfg_tx_ts = SOF_TIMESTAMPING_TX_SOFTWARE;
>> static bool    cfg_tx_tstamp;
>> static uint32_t        cfg_tos;
>> +static bool    cfg_audit;
>> static bool    cfg_verbose;
>> static bool    cfg_zerocopy;
>> static int     cfg_msg_nr;
>> static uint16_t        cfg_gso_size;
>> +static unsigned long total_num_msgs;
>> +static unsigned long total_num_sends;
>> +static unsigned long stat_tx_ts;
>> +static unsigned long stat_tx_ts_errors;
>> +static unsigned long tstart;
>> +static unsigned long tend;
>> +static unsigned long stat_zcopies;
>> +static unsigned long stat_zcopy_errors;
>> 
>> static socklen_t cfg_alen;
>> static struct sockaddr_storage cfg_dst_addr;
>> @@ -137,8 +146,11 @@ static void flush_cmsg(struct cmsghdr *cmsg)
>>                        struct my_scm_timestamping *tss;
>> 
>>                        tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg);
>> -                       fprintf(stderr, "tx timestamp = %lu.%09lu\n",
>> -                               tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
>> +                       if (tss->ts[i].tv_sec == 0)
>> +                               stat_tx_ts_errors++;
>> +                       if (cfg_verbose)
>> +                               fprintf(stderr, "tx timestamp = %lu.%09lu\n",
>> +                                       tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
> 
> changes unrelated to this feature?

I’ll remove. Do you think that I should pull out any messages related
to “cfg_verbose”? 



* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-23 21:39   ` Willem de Bruijn
@ 2019-05-24  1:38     ` Fred Klassen
  2019-05-24  4:53       ` Willem de Bruijn
  2019-05-25 20:46     ` Fred Klassen
  1 sibling, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-24  1:38 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

> Thanks for the report.
> 
> Zerocopy notification reference count is managed in skb_segment. That
> should work.
> 
> Support for timestamping with the new GSO feature is indeed an
> oversight. The solution is similar to how TCP associates the timestamp
> with the right segment in tcp_gso_tstamp.
> 
> Only, I think we want to transfer the timestamp request to the last
> datagram, not the first. For send timestamp, the final byte leaving
> the host is usually more interesting.

TX timestamping the last packet of a datagram is something that would
work poorly for our application. We need to measure the time from when
the first bit is sent until the first bit of the last packet is received.
Timestamping the last packet of a burst seems somewhat random to me
and would not be useful. Essentially we would be timestamping a
random byte in a UDP GSO buffer.

I believe there is a precedent for timestamping the first packet. With
IPv4 packets, the first packet is timestamped and the remaining fragments
are not.


* Re: [PATCH net 2/4] net/udpgso_bench_tx: options to exercise TX CMSG
  2019-05-23 21:52   ` Willem de Bruijn
@ 2019-05-24  2:10     ` Fred Klassen
  0 siblings, 0 replies; 34+ messages in thread
From: Fred Klassen @ 2019-05-24  2:10 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

> 
> To ensure that we do not regress, when adding options, please consider
> (just a general suggestion, not a strong request for this patch set)
> updating the kselftest to run a variant of the test with the new code
> coverage. In this case, make the code pass/fail instead of only user
> interpretable and add variants to udpgso.sh.

I had a look at how kselftest works, and I absolutely want to see this
work with these changes. I’ll investigate and implement it in a v2 patch.

I will most likely do a 5 second test. This seems to be sufficient to get
meaningful results.

> can use more precise CMSG_SPACE based on worst case expectations, like
> in udp_sendmmsg
> 
>> +       char buf[1500];
> 
> no need for payload
> 
>> +static void flush_errqueue(int fd)
>> +{
>> +       if (cfg_poll) {
>> +               struct pollfd fds = { 0 };
>> +               int ret;
>> +
>> +               fds.fd = fd;
>> +               fds.events = POLLERR;
> 
> no need to pass POLLERR, it is always returned in revents.



* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-24  1:38     ` Fred Klassen
@ 2019-05-24  4:53       ` Willem de Bruijn
  2019-05-24 16:34         ` Fred Klassen
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-24  4:53 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Thu, May 23, 2019 at 9:38 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> > Thanks for the report.
> >
> > Zerocopy notification reference count is managed in skb_segment. That
> > should work.
> >
> > Support for timestamping with the new GSO feature is indeed an
> > oversight. The solution is similar to how TCP associates the timestamp
> > with the right segment in tcp_gso_tstamp.
> >
> > Only, I think we want to transfer the timestamp request to the last
> > datagram, not the first. For send timestamp, the final byte leaving
> > the host is usually more interesting.
>
> TX timestamping the last packet of a datagram is something that would
> work poorly for our application. We need to measure the time from when
> the first bit is sent until the first bit of the last packet is received.
> Timestamping the last packet of a burst seems somewhat random to me
> and would not be useful. Essentially we would be timestamping a
> random byte in a UDP GSO buffer.
>
> I believe there is a precedent for timestamping the first packet. With
> IPv4 packets, the first packet is timestamped and the remaining fragments
> are not.

Interesting. TCP timestamping takes the opposite choice and does
timestamp the last byte in the sendmsg request.

It sounds like it depends on the workload. Perhaps this then needs to
be configurable with an SOF_.. flag.

Another option would be to return a timestamp for every segment. But
they would all return the same tskey. And it causes different behavior
with and without hardware offload.


* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-24  1:27     ` Fred Klassen
@ 2019-05-24  5:02       ` Willem de Bruijn
  0 siblings, 0 replies; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-24  5:02 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Thu, May 23, 2019 at 9:27 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> Willem, this is only my 2nd patch, and my last one was a one-liner.
> I’ll try to work through this, but let me know if I am making a rookie
> mistake (learning curve and all).

Not at all. The fix makes perfect sense.

The test patches 2 and 4 are not fixes, so are better suited to
net-next. Perhaps the changes to the test can also be more concise:
just the minimal changes needed to demonstrate the bug and fix.

> >>                        tss = (struct my_scm_timestamping *)CMSG_DATA(cmsg);
> >> -                       fprintf(stderr, "tx timestamp = %lu.%09lu\n",
> >> -                               tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
> >> +                       if (tss->ts[i].tv_sec == 0)
> >> +                               stat_tx_ts_errors++;
> >> +                       if (cfg_verbose)
> >> +                               fprintf(stderr, "tx timestamp = %lu.%09lu\n",
> >> +                                       tss->ts[i].tv_sec, tss->ts[i].tv_nsec);
> >
> > changes unrelated to this feature?
>
> I’ll remove. Do you think that I should pull out any messages related
> to “cfg_verbose”?

This change did not seem relevant to the main feature of the patch.


* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-24  4:53       ` Willem de Bruijn
@ 2019-05-24 16:34         ` Fred Klassen
  2019-05-24 19:29           ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-24 16:34 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

> Interesting. TCP timestamping takes the opposite choice and does
> timestamp the last byte in the sendmsg request.
> 

I have a difficult time with the philosophy of TX timestamping the last
segment. The actual timestamp occurs just before the last segment
is sent. This is neither the start nor the end of a GSO packet, which
to me seems somewhat arbitrary. It is even more arbitrary when using
software TX timestamping. These timestamps represent the
time that the packet is queued onto the NIC’s buffer, not the actual
time leaving the wire. Queuing to a ring buffer is usually much faster
than wire rate. Therefore the timestamp of the last 1500 byte
segment of a 64K GSO packet may in reality represent a time
about halfway through the burst.

Since the timestamp of a TX packet occurs just before any data is sent,
I have found it most valuable to timestamp just before the first byte of
the packet or burst. Conversely, I find it most valuable to get an RX
timestamp after the last byte arrives.

> It sounds like it depends on the workload. Perhaps this then needs to
> be configurable with an SOF_.. flag.
> 

It would be interesting if a practical case can be made for timestamping
the last segment. In my mind, I don’t see how that would be valuable.

> Another option would be to return a timestamp for every segment. But
> they would all return the same tskey. And it causes different behavior
> with and without hardware offload.

When it comes to RX packets, getting per-packet (or per segment)
timestamps is invaluable. They represent actual wire times. However
my previous research into TX timestamping has led me to conclude
that there is no practical value when timestamping every packet of 
a back-to-back burst.

When using software TX timestamping, the inter-packet timestamps
typically imply rates much faster than line rate. Although you may be
sending on a GigE link, you may measure 20 Gbps. At higher rates, I have
found that the overhead of per-packet software timestamping can produce
gaps between packets.

When using hardware timestamping, I think you will find that nearly all
adapters only allow one timestamp at a time. Therefore only one
packet in a burst would get timestamped. There are exceptions; for
example, I am playing with a 100G Mellanox adapter that has
per-packet TX timestamping. However, I suspect that when I am
done testing, all I will see is timestamps that represent wire
rate (e.g. 123 ns per 1500-byte packet).

Beyond testing the accuracy of a NIC’s timestamping capabilities, I
see very little value in doing per-segment timestamping.




* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-24 16:34         ` Fred Klassen
@ 2019-05-24 19:29           ` Willem de Bruijn
  2019-05-24 22:01             ` Fred Klassen
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-24 19:29 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Fri, May 24, 2019 at 12:34 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> > Interesting. TCP timestamping takes the opposite choice and does
> > timestamp the last byte in the sendmsg request.
> >
>
> I have a difficult time with the philosophy of TX timestamping the last
> segment. The actual timestamp occurs just before the last segment
> is sent. This is neither the start nor the end of a GSO packet, which
> to me seems somewhat arbitrary. It is even more arbitrary when using
> software TX timestamping. These timestamps represent the
> time that the packet is queued onto the NIC’s buffer, not the actual
> time leaving the wire.

It is the last moment that a timestamp can be generated for the last
byte; I don't see how that is "neither the start nor the end of a GSO
packet".

> Queuing to a ring buffer is usually much faster
> than wire rate. Therefore the timestamp of the last 1500 byte
> segment of a 64K GSO packet may in reality represent a time
> about halfway through the burst.
>
> Since the timestamp of a TX packet occurs just before any data is sent,
> I have found it most valuable to timestamp just before the first byte of
> the packet or burst. Conversely, I find it most valuable to get an RX
> timestamp  after the last byte arrives.
>
> > It sounds like it depends on the workload. Perhaps this then needs to
> > be configurable with an SOF_.. flag.
> >
>
> It would be interesting if a practical case can be made for timestamping
> the last segment. In my mind, I don’t see how that would be valuable.

It depends on whether you are interested in measuring network latency or
host transmit path latency.

For the latter, knowing the time from the start of the sendmsg call to
the moment the last byte hits the wire is most relevant. Or in absence
of (well defined) hardware support, the last byte being queued to the
device is the next best thing.

It would make sense for this software implementation to follow
established hardware behavior. But as far as I know, the exact time a
hardware timestamp is taken is not consistent across devices, either.

For fine grained timestamped data, perhaps GSO is simply not a good
mechanism. That said, it still has to queue a timestamp if requested.

> > Another option would be to return a timestamp for every segment. But
> > they would all return the same tskey. And it causes different behavior
> > with and without hardware offload.
>
> When it comes to RX packets, getting per-packet (or per segment)
> timestamps is invaluable. They represent actual wire times. However
> my previous research into TX timestamping has led me to conclude
> that there is no practical value when timestamping every packet of
> a back-to-back burst.
>
> When using software TX timestamping, the inter-packet timestamps
> typically indicate rates much faster than line rate. Although you may be
> sending on a GigE link, you may measure 20Gbps. At higher rates, I have found
> that the overhead of per-packet software timestamping can produce
> gaps in packets.
>
> When using hardware timestamping, I think you will find that nearly all
> adapters only allow one timestamp at a time. Therefore only one
> packet in a burst would get timestamped.

Can you elaborate? When the host queues N packets all with hardware
timestamps requested, all N completions will have a timestamp? Or is
that not guaranteed?

> There are exceptions, for
> example I am playing with a 100G Mellanox adapter that has
> per-packet TX timestamping. However, I suspect that when I am
> done testing, all I will see is timestamps that are representing wire
> rate (e.g. 123nsec per 1500 byte packet).
>
> Beyond testing the accuracy of a NIC’s timestamping capabilities, I
> see very little value in doing per-segment timestamping.

Ack. Great detailed argument, thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-24 19:29           ` Willem de Bruijn
@ 2019-05-24 22:01             ` Fred Klassen
  2019-05-25 15:20               ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-24 22:01 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn



> On May 24, 2019, at 12:29 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 
> It is the last moment that a timestamp can be generated for the last
> byte, I don't see how that is "neither the start nor the end of a GSO
> packet”.

My misunderstanding. I thought TCP did last segment timestamping, not
last byte. In that case, your statements make sense.

>> It would be interesting if a practical case can be made for timestamping
>> the last segment. In my mind, I don’t see how that would be valuable.
> 
> It depends whether you are interested in measuring network latency or
> host transmit path latency.
> 
> For the latter, knowing the time from the start of the sendmsg call to
> the moment the last byte hits the wire is most relevant. Or in absence
> of (well defined) hardware support, the last byte being queued to the
> device is the next best thing.
> 
> It would make sense for this software implementation to follow
> established hardware behavior. But as far as I know, the exact time a
> hardware timestamp is taken is not consistent across devices, either.
> 
> For fine grained timestamped data, perhaps GSO is simply not a good
> mechanism. That said, it still has to queue a timestamp if requested.

I see your point. Makes sense to me.

>> When using hardware timestamping, I think you will find that nearly all
>> adapters only allow one timestamp at a time. Therefore only one
>> packet in a burst would get timestamped.
> 
> Can you elaborate? When the host queues N packets all with hardware
> timestamps requested, all N completions will have a timestamp? Or is
> that not guaranteed?
> 

It is not guaranteed. The best example is in ixgbe_main.c; search for
‘SKBTX_HW_TSTAMP’. If there is a PTP TX timestamp in progress,
‘__IXGBE_PTP_TX_IN_PROGRESS’ is set and no other timestamps
are possible. The flag is cleared after transmit softirq, and only then
can another TX timestamp be taken.

>> There are exceptions, for
>> example I am playing with a 100G Mellanox adapter that has
>> per-packet TX timestamping. However, I suspect that when I am
>> done testing, all I will see is timestamps that are representing wire
>> rate (e.g. 123nsec per 1500 byte packet).
>> 
>> Beyond testing the accuracy of a NIC’s timestamping capabilities, I
>> see very little value in doing per-segment timestamping.
> 
> Ack. Great detailed argument, thanks.

Thanks. I’m a timestamping nerd and have learned a lot from this
discussion.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-24 22:01             ` Fred Klassen
@ 2019-05-25 15:20               ` Willem de Bruijn
  2019-05-25 18:47                 ` Fred Klassen
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-25 15:20 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest

On Fri, May 24, 2019 at 6:01 PM Fred Klassen <fklassen@appneta.com> wrote:
>
>
>
> > On May 24, 2019, at 12:29 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> >
> > It is the last moment that a timestamp can be generated for the last
> > byte, I don't see how that is "neither the start nor the end of a GSO
> > packet”.
>
> My misunderstanding. I thought TCP did last segment timestamping, not
> last byte. In that case, your statements make sense.
>
> >> It would be interesting if a practical case can be made for timestamping
> >> the last segment. In my mind, I don’t see how that would be valuable.
> >
> > It depends whether you are interested in measuring network latency or
> > host transmit path latency.
> >
> > For the latter, knowing the time from the start of the sendmsg call to
> > the moment the last byte hits the wire is most relevant. Or in absence
> > of (well defined) hardware support, the last byte being queued to the
> > device is the next best thing.

Sounds to me like both cases have a legitimate use case, and we want
to support both.

Implementation constraints are that storage for this timestamp
information is scarce and we cannot add new cold cacheline accesses in
the datapath.

The simplest approach would be to unconditionally timestamp both the
first and last segment. With the same ID. Not terribly elegant. But it
works.

If conditional, tx_flags has only one bit left. I think we can harvest
some, as not all defined bits are in use at the same stages in the
datapath, but that is not a trivial change. Some might also better be
set in the skb, instead of skb_shinfo, which would also avoid
touching that cacheline. We could possibly repurpose bits from u32
tskey.

All that can come later. Initially, unless we can come up with
something more elegant, I would suggest that UDP follows the rule
established by TCP and timestamps the last byte. And we add an
explicit SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only
supported for UDP, sets a new SKBTX_TX_FB_TSTAMP bit in
__sock_tx_timestamp and is interpreted in __udp_gso_segment.

> >
> > It would make sense for this software implementation to follow
> > established hardware behavior. But as far as I know, the exact time a
> > hardware timestamp is taken is not consistent across devices, either.
> >
> > For fine grained timestamped data, perhaps GSO is simply not a good
> > mechanism. That said, it still has to queue a timestamp if requested.
>
> I see your point. Makes sense to me.
>
> >> When using hardware timestamping, I think you will find that nearly all
> >> adapters only allow one timestamp at a time. Therefore only one
> >> packet in a burst would get timestamped.
> >
> > Can you elaborate? When the host queues N packets all with hardware
> > timestamps requested, all N completions will have a timestamp? Or is
> > that not guaranteed?
> >
>
> It is not guaranteed. The best example is in ixgbe_main.c and search for
> ‘SKBTX_HW_TSTAMP’.  If there is a PTP TX timestamp in progress,
> ‘__IXGBE_PTP_TX_IN_PROGRESS’ is set and no other timestamps
> are possible. The flag is cleared after transmit softirq, and only then
> can another TX timestamp be taken.

Interesting, thanks. I had no idea.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-25 15:20               ` Willem de Bruijn
@ 2019-05-25 18:47                 ` Fred Klassen
  2019-05-27  1:30                   ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-25 18:47 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest



> On May 25, 2019, at 8:20 AM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 
> On Fri, May 24, 2019 at 6:01 PM Fred Klassen <fklassen@appneta.com> wrote:
>> 
>> 
>> 
>>> On May 24, 2019, at 12:29 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>>> 
>>> It is the last moment that a timestamp can be generated for the last
>>> byte, I don't see how that is "neither the start nor the end of a GSO
>>> packet”.
>> 
>> My misunderstanding. I thought TCP did last segment timestamping, not
>> last byte. In that case, your statements make sense.
>> 
>>>> It would be interesting if a practical case can be made for timestamping
>>>> the last segment. In my mind, I don’t see how that would be valuable.
>>> 
>>> It depends whether you are interested in measuring network latency or
>>> host transmit path latency.
>>> 
>>> For the latter, knowing the time from the start of the sendmsg call to
>>> the moment the last byte hits the wire is most relevant. Or in absence
>>> of (well defined) hardware support, the last byte being queued to the
>>> device is the next best thing.
> 
> Sounds to me like both cases have a legitimate use case, and we want
> to support both.
> 
> Implementation constraints are that storage for this timestamp
> information is scarce and we cannot add new cold cacheline accesses in
> the datapath.
> 
> The simplest approach would be to unconditionally timestamp both the
> first and last segment. With the same ID. Not terribly elegant. But it
> works.
> 
> If conditional, tx_flags has only one bit left. I think we can harvest
> some, as not all defined bits are in use at the same stages in the
> datapath, but that is not a trivial change. Some might also better be
> set in the skb, instead of skb_shinfo, which would also avoid
> touching that cacheline. We could possibly repurpose bits from u32
> tskey.
> 
> All that can come later. Initially, unless we can come up with
> something more elegant, I would suggest that UDP follows the rule
> established by TCP and timestamps the last byte. And we add an
> explicit SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only
> supported for UDP, sets a new SKBTX_TX_FB_TSTAMP bit in
> __sock_tx_timestamp and is interpreted in __udp_gso_segment.
> 

I don’t see how to practically TX timestamp the last byte of any packet
(UDP GSO or otherwise). The best we could do is timestamp the last
segment, or rather the time that the last segment is queued. Let me
attempt to explain.

First let’s look at software TX timestamps, which are generated
by skb_tx_timestamp() in nearly every network driver’s xmit routine. It
states:

—————————— cut ————————————
 * Ethernet MAC Drivers should call this function in their hard_xmit()
 * function immediately before giving the sk_buff to the MAC hardware.
—————————— cut ————————————

That means that the sk_buff will get timestamped just before rather
than just after it is sent. To truly capture the timestamp of the last
byte, this routine would have to be called a second time, right
after sending to MAC hardware. Then the user program would have to
sort out the 2 timestamps. My guess is that this isn’t something that
NIC vendors would be willing to implement in their drivers.

So, the best we can do is timestamp just before the last segment.
Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter.
If we set SKBTX_HW_TSTAMP flag on the last segment, the timestamp
occurs half way through the burst. But it may not be exactly half way
because the segments may get queued much faster than wire rate.
Therefore the time between segment 1 and segment 2 may be much
much smaller than their spacing on the wire. I would not find this
useful.

I propose that we stick with the method used for IP fragments, which
is timestamping just before the first byte is sent. Put another way, I 
propose that we start the clock in an automobile race just before the
front of the first car crosses the start line rather than when the front
of the last car crosses the start line.

TX timestamping in hardware has even more limitations. For the most
part, we can only do one timestamp per packet or burst.  If we requested
a timestamp of only the last segment of a packet, we would have to work
backwards to calculate the start time of the packet, but that would
only be a best guess. For extremely time-sensitive applications
(such as the one we develop), this would not be practical.

We could still consider setting a flag that would allow timestamping of
the last segment rather than the first. However, since we cannot
truly measure the timestamp of the last byte, I would question the value
in doing so.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-23 21:59   ` Willem de Bruijn
@ 2019-05-25 20:09     ` Fred Klassen
  2019-05-25 20:47     ` Fred Klassen
  1 sibling, 0 replies; 34+ messages in thread
From: Fred Klassen @ 2019-05-25 20:09 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn



> On May 23, 2019, at 2:59 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> what exactly is the issue with IP_TOS?
> 
> If I understand correctly, the issue here is that the new 'P' option
> that polls on the error queue times out. This is unrelated to
> specifying TOS bits? Without zerocopy or timestamps, no message is
> expected on the error queue.

I was not able to get to the root cause, but I noticed that IP_TOS
CMSG was lost until I applied this fix. I also found it confusing as to
why that may be the case.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-23 21:39   ` Willem de Bruijn
  2019-05-24  1:38     ` Fred Klassen
@ 2019-05-25 20:46     ` Fred Klassen
  1 sibling, 0 replies; 34+ messages in thread
From: Fred Klassen @ 2019-05-25 20:46 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn



> On May 23, 2019, at 2:39 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> Zerocopy notification reference count is managed in skb_segment. That
> should work.
> 

I’m trying to understand the context of reference counting in skb_segment. I assume that
there is an opportunity to optimize the count of outstanding zerocopy buffers, but I 
can’t see it. Please clarify.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-23 21:59   ` Willem de Bruijn
  2019-05-25 20:09     ` Fred Klassen
@ 2019-05-25 20:47     ` Fred Klassen
  1 sibling, 0 replies; 34+ messages in thread
From: Fred Klassen @ 2019-05-25 20:47 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn



> On May 23, 2019, at 2:59 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> what exactly is the issue with IP_TOS?
> 
> If I understand correctly, the issue here is that the new 'P' option
> that polls on the error queue times out. This is unrelated to
> specifying TOS bits? Without zerocopy or timestamps, no message is
> expected on the error queue.

Please disregard last message. I think I was chasing a non-issue with
TOS bits. I will remove all references to TOS.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-25 18:47                 ` Fred Klassen
@ 2019-05-27  1:30                   ` Willem de Bruijn
  2019-05-27  2:09                     ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-27  1:30 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest

On Sat, May 25, 2019 at 1:47 PM Fred Klassen <fklassen@appneta.com> wrote:
>
>
>
> > On May 25, 2019, at 8:20 AM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> >
> > On Fri, May 24, 2019 at 6:01 PM Fred Klassen <fklassen@appneta.com> wrote:
> >>
> >>
> >>
> >>> On May 24, 2019, at 12:29 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> >>>
> >>> It is the last moment that a timestamp can be generated for the last
> >>> byte, I don't see how that is "neither the start nor the end of a GSO
> >>> packet”.
> >>
> >> My misunderstanding. I thought TCP did last segment timestamping, not
> >> last byte. In that case, your statements make sense.
> >>
> >>>> It would be interesting if a practical case can be made for timestamping
> >>>> the last segment. In my mind, I don’t see how that would be valuable.
> >>>
> >>> It depends whether you are interested in measuring network latency or
> >>> host transmit path latency.
> >>>
> >>> For the latter, knowing the time from the start of the sendmsg call to
> >>> the moment the last byte hits the wire is most relevant. Or in absence
> >>> of (well defined) hardware support, the last byte being queued to the
> >>> device is the next best thing.
> >
> > Sounds to me like both cases have a legitimate use case, and we want
> > to support both.
> >
> > Implementation constraints are that storage for this timestamp
> > information is scarce and we cannot add new cold cacheline accesses in
> > the datapath.
> >
> > The simplest approach would be to unconditionally timestamp both the
> > first and last segment. With the same ID. Not terribly elegant. But it
> > works.
> >
> > If conditional, tx_flags has only one bit left. I think we can harvest
> > some, as not all defined bits are in use at the same stages in the
> > datapath, but that is not a trivial change. Some might also better be
> > set in the skb, instead of skb_shinfo, which would also avoid
> > touching that cacheline. We could possibly repurpose bits from u32
> > tskey.
> >
> > All that can come later. Initially, unless we can come up with
> > something more elegant, I would suggest that UDP follows the rule
> > established by TCP and timestamps the last byte. And we add an
> > explicit SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only
> > supported for UDP, sets a new SKBTX_TX_FB_TSTAMP bit in
> > __sock_tx_timestamp and is interpreted in __udp_gso_segment.
> >
>
> I don’t see how to practically TX timestamp the last byte of any packet
> (UDP GSO or otherwise). The best we could do is timestamp the last
> segment,  or rather the time that the last segment is queued. Let me
> attempt to explain.
>
> First let’s look at software TX timestamps, which are generated
> by skb_tx_timestamp() in nearly every network driver’s xmit routine. It
> states:
>
> —————————— cut ————————————
>  * Ethernet MAC Drivers should call this function in their hard_xmit()
>  * function immediately before giving the sk_buff to the MAC hardware.
> —————————— cut ————————————
>
> That means that the sk_buff will get timestamped just before rather
> than just after it is sent. To truly capture the timestamp of the last
> byte, this routine would have to be called a second time, right
> after sending to MAC hardware. Then the user program would have to
> sort out the 2 timestamps. My guess is that this isn’t something that
> NIC vendors would be willing to implement in their drivers.
>
> So, the best we can do is timestamp just before the last segment.
> Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter.
> If we set SKBTX_HW_TSTAMP flag on the last segment, the timestamp
> occurs half way through the burst. But it may not be exactly half way
> because the segments may get queued much faster than wire rate.
> Therefore the time between segment 1 and segment 2 may be much
> much smaller than their spacing on the wire. I would not find this
> useful.

For measuring host queueing latency, a timestamp at the existing
skb_tx_timestamp() for the last segment is perfectly informative.

> I propose that we stick with the method used for IP fragments, which
> is timestamping just before the first byte is sent.

I understand that this addresses your workload. It simply ignores the
other use case identified earlier in this thread.

> Put another way, I
> propose that we start the clock in an automobile race just before the
> front of the first car crosses the start line rather than when the front
> of the last car crosses the start line.
>
> TX timestamping in hardware has even more limitations. For the most
> part, we can only do one timestamp per packet or burst.  If we requested
> a timestamp of only the last segment of a packet, we would have to work
> backwards to calculate the start time of the packet, but that would
> only be a best guess. For extremely time-sensitive applications
> (such as the one we develop), this would not be practical.

Note that for any particularly sensitive measurements, a segment can
always be sent separately.

> We could still consider setting a flag that would allow timestamping of
> the last segment rather than the first. However, since we cannot
> truly measure the timestamp of the last byte, I would question the value
> in doing so.
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 1/4] net/udp_gso: Allow TX timestamp with UDP GSO
  2019-05-27  1:30                   ` Willem de Bruijn
@ 2019-05-27  2:09                     ` Willem de Bruijn
  0 siblings, 0 replies; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-27  2:09 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest

On Sun, May 26, 2019 at 8:30 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Sat, May 25, 2019 at 1:47 PM Fred Klassen <fklassen@appneta.com> wrote:
> >
> >
> >
> > > On May 25, 2019, at 8:20 AM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > On Fri, May 24, 2019 at 6:01 PM Fred Klassen <fklassen@appneta.com> wrote:
> > >>
> > >>
> > >>
> > >>> On May 24, 2019, at 12:29 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > >>>
> > >>> It is the last moment that a timestamp can be generated for the last
> > >>> byte, I don't see how that is "neither the start nor the end of a GSO
> > >>> packet”.
> > >>
> > >> My misunderstanding. I thought TCP did last segment timestamping, not
> > >> last byte. In that case, your statements make sense.
> > >>
> > >>>> It would be interesting if a practical case can be made for timestamping
> > >>>> the last segment. In my mind, I don’t see how that would be valuable.
> > >>>
> > >>> It depends whether you are interested in measuring network latency or
> > >>> host transmit path latency.
> > >>>
> > >>> For the latter, knowing the time from the start of the sendmsg call to
> > >>> the moment the last byte hits the wire is most relevant. Or in absence
> > >>> of (well defined) hardware support, the last byte being queued to the
> > >>> device is the next best thing.
> > >
> > > Sounds to me like both cases have a legitimate use case, and we want
> > > to support both.
> > >
> > > Implementation constraints are that storage for this timestamp
> > > information is scarce and we cannot add new cold cacheline accesses in
> > > the datapath.
> > >
> > > The simplest approach would be to unconditionally timestamp both the
> > > first and last segment. With the same ID. Not terribly elegant. But it
> > > works.
> > >
> > > If conditional, tx_flags has only one bit left. I think we can harvest
> > > some, as not all defined bits are in use at the same stages in the
> > > datapath, but that is not a trivial change. Some might also better be
> > > set in the skb, instead of skb_shinfo, which would also avoid
> > > touching that cacheline. We could possibly repurpose bits from u32
> > > tskey.
> > >
> > > All that can come later. Initially, unless we can come up with
> > > something more elegant, I would suggest that UDP follows the rule
> > > established by TCP and timestamps the last byte. And we add an
> > > explicit SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only
> > > supported for UDP, sets a new SKBTX_TX_FB_TSTAMP bit in
> > > __sock_tx_timestamp and is interpreted in __udp_gso_segment.
> > >
> >
> > I don’t see how to practically TX timestamp the last byte of any packet
> > (UDP GSO or otherwise). The best we could do is timestamp the last
> > segment,  or rather the time that the last segment is queued. Let me
> > attempt to explain.
> >
> > First let’s look at software TX timestamps, which are generated
> > by skb_tx_timestamp() in nearly every network driver’s xmit routine. It
> > states:
> >
> > —————————— cut ————————————
> >  * Ethernet MAC Drivers should call this function in their hard_xmit()
> >  * function immediately before giving the sk_buff to the MAC hardware.
> > —————————— cut ————————————
> >
> > That means that the sk_buff will get timestamped just before rather
> > than just after it is sent. To truly capture the timestamp of the last
> > byte, this routine would have to be called a second time, right
> > after sending to MAC hardware. Then the user program would have to
> > sort out the 2 timestamps. My guess is that this isn’t something that
> > NIC vendors would be willing to implement in their drivers.
> >
> > So, the best we can do is timestamp just before the last segment.
> > Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter.
> > If we set SKBTX_HW_TSTAMP flag on the last segment, the timestamp
> > occurs half way through the burst. But it may not be exactly half way
> > because the segments may get queued much faster than wire rate.
> > Therefore the time between segment 1 and segment 2 may be much
> > much smaller than their spacing on the wire. I would not find this
> > useful.
>
> For measuring host queueing latency, a timestamp at the existing
> skb_tx_timestamp() for the last segment is perfectly informative.

In most cases all segments will be sent in a single xmit_more train,
in which case the device doorbell is rung when the last segment is
queued.

A device may also pause in the middle of a train, causing the rest of
the list to be requeued and resent after a tx completion frees up
descriptors and wakes the device. This seems like a relevant exception
to be able to measure.

That said, I am not opposed to the first segment, if we have to make a
binary choice for a default. Either option has cons. See more specific
revision requests in the v2 patch.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-23 21:56   ` Willem de Bruijn
  2019-05-24  1:27     ` Fred Klassen
@ 2019-05-27 21:30     ` Fred Klassen
  2019-05-27 21:46       ` Willem de Bruijn
  1 sibling, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-27 21:30 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

Willem,

Thanks for spending so much time with me on this patch. I’m pretty
new to this, so I know I am making lots of mistakes. I have been
working on a v2 of the selftests in net-next, but want to review the
layout of the report before I submit (see below).

Also, my v2 fix in net is still up for debate. In its current state, it
meets my application’s requirements, but may not meet all of yours.
I am still open to suggestions, but so far I don’t have an alternate
solution that doesn’t break what I need working.

I also have a question about submitting a fix in net and a related
enhancement in net-next. I feel I should be referencing the fix
in net in my net-next commit, but I don’t know how it should be
done. Any suggestions?    

> On May 23, 2019, at 2:56 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>> 
>> Example:
>> 
>>    # ./udpgso_bench_tx -4uT -a -l5 -S 1472 -D 172.16.120.189
>>    udp tx:    492 MB/s     8354 calls/s   8354 msg/s
>>    udp tx:    477 MB/s     8106 calls/s   8106 msg/s
>>    udp tx:    488 MB/s     8288 calls/s   8288 msg/s
>>    udp tx:    882 MB/s    14975 calls/s  14975 msg/s
>>    Summary over 5.000 seconds ...
>>    sum udp tx:    696 MB/s      57696 calls (11539/s)  57696 msgs (11539/s)
>>    Tx Timestamps: received:     57696   errors: 0
>> 
>> This can be useful in tracking loss of messages when under load. For example,
>> adding the '-z' option results in loss of TX timestamp messages:
>> 
>>    # ./udpgso_bench_tx -4ucT -a -l5 -S 1472 -D 172.16.120.189 -p 3239 -z
>>    udp tx:    490 MB/s     8325 calls/s   8325 msg/s
>>    udp tx:    500 MB/s     8492 calls/s   8492 msg/s
>>    udp tx:    883 MB/s    14985 calls/s  14985 msg/s
>>    udp tx:    756 MB/s    12823 calls/s  12823 msg/s
>>    Summary over 5.000 seconds ...
>>    sum udp tx:    657 MB/s      54429 calls (10885/s)  54429 msgs (10885/s)
>>    Tx Timestamps: received:     34046   errors: 0
>>    Zerocopy acks: received:     54422   errors: 0
> 
> This would probably also be more useful as regression test if it is in
> the form of a pass/fail test: if timestamps are requested and total
> count is zero, then the feature is broken and the process should exit
> with an error.

I have it set up as a pass/fail test. Below is a sample output of the
test, including a failure on IPv6 TCP Zerocopy audit (a failure that may
lead to a memory leak). I wanted to review the report with you before
I push up the v2 patch into net-next.

Are these extra tests what you were expecting? Is it OK that it doesn’t
flow well? Also, there is a failure about every 3rd time I run it,
indicating that some TX or Zerocopy messages are lost. Is that OK? 

ipv4
tcp
tcp rx:  11665 MB/s   189014 calls/s
tcp tx:  11665 MB/s   197859 calls/s 197859 msg/s
tcp rx:  11706 MB/s   190749 calls/s
tcp tx:  11706 MB/s   198545 calls/s 198545 msg/s
tcp tx:  11653 MB/s   197645 calls/s 197645 msg/s
tcp rx:  11653 MB/s   189292 calls/s
tcp zerocopy
tcp rx:   6658 MB/s   111849 calls/s
tcp tx:   6658 MB/s   112940 calls/s 112940 msg/s
tcp tx:   6824 MB/s   115742 calls/s 115742 msg/s
tcp rx:   6824 MB/s   115711 calls/s
tcp rx:   6865 MB/s   116402 calls/s
tcp tx:   6865 MB/s   116440 calls/s 116440 msg/s
tcp zerocopy audit
tcp tx:   6646 MB/s   112729 calls/s 112729 msg/s
tcp rx:   6646 MB/s   111445 calls/s
tcp rx:   6672 MB/s   112833 calls/s
tcp tx:   6672 MB/s   113164 calls/s 113164 msg/s
tcp tx:   6624 MB/s   112355 calls/s 112355 msg/s
tcp rx:   6624 MB/s   110877 calls/s
Summary over 4.000 seconds...
sum tcp tx:   6813 MB/s     451402 calls (112850/s)     451402 msgs (112850/s)
Zerocopy acks:              451402 received                 0 errors
udp
udp rx:    914 MB/s   651407 calls/s
udp tx:    925 MB/s   659274 calls/s  15697 msg/s
udp rx:    919 MB/s   654859 calls/s
udp tx:    919 MB/s   654864 calls/s  15592 msg/s
udp rx:    914 MB/s   651668 calls/s
udp tx:    914 MB/s   651630 calls/s  15515 msg/s
udp rx:    918 MB/s   654642 calls/s
udp gso
udp rx:   2271 MB/s  1618319 calls/s
udp tx:   2515 MB/s    42658 calls/s  42658 msg/s
udp rx:   2337 MB/s  1665280 calls/s
udp tx:   2551 MB/s    43276 calls/s  43276 msg/s
udp rx:   2338 MB/s  1665792 calls/s
udp tx:   2557 MB/s    43384 calls/s  43384 msg/s
udp rx:   2339 MB/s  1666560 calls/s
udp gso zerocopy
udp rx:   1725 MB/s  1229087 calls/s
udp tx:   1840 MB/s    31219 calls/s  31219 msg/s
udp rx:   1850 MB/s  1318460 calls/s
udp tx:   1850 MB/s    31388 calls/s  31388 msg/s
udp rx:   1845 MB/s  1314428 calls/s
udp tx:   1845 MB/s    31299 calls/s  31299 msg/s
udp gso timestamp
udp rx:   2286 MB/s  1628928 calls/s
udp tx:   2446 MB/s    41499 calls/s  41499 msg/s
udp rx:   2354 MB/s  1677312 calls/s
udp tx:   2473 MB/s    41952 calls/s  41952 msg/s
udp rx:   2342 MB/s  1668864 calls/s
udp tx:   2471 MB/s    41925 calls/s  41925 msg/s
udp rx:   2333 MB/s  1662464 calls/s
udp gso zerocopy audit
udp rx:   1787 MB/s  1273481 calls/s
udp tx:   1832 MB/s    31082 calls/s  31082 msg/s
udp rx:   1881 MB/s  1340476 calls/s
udp tx:   1881 MB/s    31916 calls/s  31916 msg/s
udp rx:   1880 MB/s  1339880 calls/s
udp tx:   1881 MB/s    31904 calls/s  31904 msg/s
udp rx:   1881 MB/s  1340010 calls/s
Summary over 4.000 seconds...
sum udp tx:   1912 MB/s     126694 calls (31673/s)     126694 msgs (31673/s)
Zerocopy acks:              126694 received                 0 errors
udp gso timestamp audit
udp rx:   2259 MB/s  1609327 calls/s
udp tx:   2410 MB/s    40879 calls/s  40879 msg/s
udp rx:   2346 MB/s  1671168 calls/s
udp tx:   2446 MB/s    41491 calls/s  41491 msg/s
udp rx:   2358 MB/s  1680128 calls/s
udp tx:   2448 MB/s    41522 calls/s  41522 msg/s
udp rx:   2356 MB/s  1678848 calls/s
Summary over 4.000 seconds...
sum udp tx:   2494 MB/s     165276 calls (41319/s)     165276 msgs (41319/s)
Tx Timestamps:              165276 received                 0 errors
udp gso zerocopy timestamp audit
udp rx:   1658 MB/s  1181647 calls/s
udp tx:   1678 MB/s    28466 calls/s  28466 msg/s
udp rx:   1718 MB/s  1224010 calls/s
udp tx:   1718 MB/s    29146 calls/s  29146 msg/s
udp rx:   1718 MB/s  1224086 calls/s
udp tx:   1718 MB/s    29144 calls/s  29144 msg/s
udp rx:   1717 MB/s  1223464 calls/s
Summary over 4.000 seconds...
sum udp tx:   1747 MB/s     115771 calls (28942/s)     115771 msgs (28942/s)
Tx Timestamps:              115771 received                 0 errors
Zerocopy acks:              115771 received                 0 errors
ipv6
tcp
tcp tx:  11470 MB/s   194547 calls/s 194547 msg/s
tcp rx:  11470 MB/s   188442 calls/s
tcp tx:  11803 MB/s   200193 calls/s 200193 msg/s
tcp rx:  11803 MB/s   194526 calls/s
tcp tx:  11832 MB/s   200681 calls/s 200681 msg/s
tcp rx:  11832 MB/s   194459 calls/s
tcp zerocopy
tcp rx:   7482 MB/s    80176 calls/s
tcp tx:   7482 MB/s   126908 calls/s 126908 msg/s
tcp rx:   6641 MB/s   112609 calls/s
tcp tx:   6641 MB/s   112649 calls/s 112649 msg/s
tcp tx:   6584 MB/s   111683 calls/s 111683 msg/s
tcp rx:   6584 MB/s   111617 calls/s
tcp zerocopy audit
tcp tx:   6719 MB/s   113968 calls/s 113968 msg/s
tcp rx:   6719 MB/s   113893 calls/s
tcp rx:   6772 MB/s   114831 calls/s
tcp tx:   6772 MB/s   114872 calls/s 114872 msg/s
tcp tx:   6793 MB/s   115226 calls/s 115226 msg/s
tcp rx:   7075 MB/s   116595 calls/s
Summary over 4.000 seconds...
sum tcp tx:   6921 MB/s     458580 calls (114645/s)     458580 msgs (114645/s)
./udpgso_bench_tx: Unexpected number of Zerocopy completions:    458580 expected    458578 received
udp
udp rx:    833 MB/s   607639 calls/s
udp tx:    851 MB/s   621264 calls/s  14448 msg/s
udp rx:    860 MB/s   627246 calls/s
udp tx:    860 MB/s   627284 calls/s  14588 msg/s
udp rx:    862 MB/s   628718 calls/s
udp tx:    861 MB/s   628574 calls/s  14618 msg/s
udp rx:    863 MB/s   630058 calls/s
udp gso
udp rx:   2310 MB/s  1683314 calls/s
udp tx:   2472 MB/s    41931 calls/s  41931 msg/s
udp rx:   2343 MB/s  1708032 calls/s
udp tx:   2493 MB/s    42298 calls/s  42298 msg/s
udp rx:   2322 MB/s  1692160 calls/s
udp tx:   2496 MB/s    42347 calls/s  42347 msg/s
udp gso zerocopy
udp rx:   1752 MB/s  1278423 calls/s
udp tx:   1778 MB/s    30162 calls/s  30162 msg/s
udp rx:   1804 MB/s  1316058 calls/s
udp tx:   1804 MB/s    30605 calls/s  30605 msg/s
udp rx:   1807 MB/s  1318120 calls/s
udp tx:   1808 MB/s    30681 calls/s  30681 msg/s
udp rx:   1730 MB/s  1261819 calls/s
udp gso timestamp
udp rx:   2296 MB/s  1673728 calls/s
udp tx:   2375 MB/s    40284 calls/s  40284 msg/s
udp rx:   2334 MB/s  1701632 calls/s
udp tx:   2377 MB/s    40319 calls/s  40319 msg/s
udp rx:   2335 MB/s  1702093 calls/s
udp tx:   2377 MB/s    40319 calls/s  40319 msg/s
udp rx:   2336 MB/s  1702656 calls/s
udp gso zerocopy audit
udp rx:   1717 MB/s  1252554 calls/s
udp tx:   1759 MB/s    29839 calls/s  29839 msg/s
udp rx:   1811 MB/s  1321003 calls/s
udp tx:   1811 MB/s    30722 calls/s  30722 msg/s
udp rx:   1811 MB/s  1320917 calls/s
udp tx:   1811 MB/s    30719 calls/s  30719 msg/s
udp rx:   1810 MB/s  1320606 calls/s
Summary over 4.000 seconds...
sum udp tx:   1839 MB/s     121868 calls (30467/s)     121868 msgs (30467/s)
Zerocopy acks:              121868 received                 0 errors
udp gso timestamp audit
udp rx:   2231 MB/s  1626112 calls/s
udp tx:   2337 MB/s    39646 calls/s  39646 msg/s
udp rx:   2292 MB/s  1670400 calls/s
udp tx:   2365 MB/s    40122 calls/s  40122 msg/s
udp rx:   2287 MB/s  1666816 calls/s
udp tx:   2363 MB/s    40084 calls/s  40084 msg/s
udp rx:   2288 MB/s  1667840 calls/s
Summary over 4.000 seconds...
sum udp tx:   2412 MB/s     159818 calls (39954/s)     159818 msgs (39954/s)
Tx Timestamps:              159818 received                 0 errors
udp gso zerocopy timestamp audit
udp rx:   1592 MB/s  1161479 calls/s
udp tx:   1624 MB/s    27560 calls/s  27560 msg/s
udp rx:   1657 MB/s  1208472 calls/s
udp tx:   1656 MB/s    28103 calls/s  28103 msg/s
udp rx:   1653 MB/s  1206064 calls/s
udp tx:   1653 MB/s    28047 calls/s  28047 msg/s
udp rx:   1648 MB/s  1202151 calls/s
Summary over 4.000 seconds...
sum udp tx:   1683 MB/s     111556 calls (27889/s)     111556 msgs (27889/s)
Tx Timestamps:              111556 received                 0 errors
Zerocopy acks:              111556 received                 0 errors

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-27 21:30     ` Fred Klassen
@ 2019-05-27 21:46       ` Willem de Bruijn
  2019-05-27 22:56         ` Fred Klassen
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-27 21:46 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Mon, May 27, 2019 at 5:30 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> Willem,
>
> Thanks for spending so much time with me on this patch. I’m pretty
> new to this, so I know I am making lots of mistakes. I have been
> working on a v2 of the selftests in net-next, but want to review the
> layout of the report before I submit (see below).
>
Also, my v2 fix in net is still up for debate. In its current state, it
> meets my application’s requirements, but may not meet all of yours.

I gave more specific feedback on issues with it (referencing zerocopy
and IP_TOS, say).

Also, it is safer to update only the relevant timestamp bits in
tx_flags, rather than a blanket overwrite, given that some bits are
already set in skb_segment. I have not checked whether this is
absolutely necessary.

> I am still open to suggestions, but so far I don’t have an alternate
> solution that doesn’t break what I need working.

Did you see my response yesterday? I can live with the first segment.
Even if I don't think that it buys much in practice given xmit_more
(and it does cost something, e.g., during requeueing).

> I also have a question about submitting a fix in net and a related
> enhancement in net-next. I feel I should be referencing the fix
> in net in my net-next commit, but I don’t know how it should be
> done. Any suggestions?

It is not strictly necessary, but indeed often a nice to have. We
generally reference by SHA1, so wait with submitting the test until
the fix is merged. See also the ipv6 flowlabel test that I just sent
for one example.

> >
> > This would probably also be more useful as regression test if it is in
> > the form of a pass/fail test: if timestamps are requested and total
> > count is zero, then the feature is broken and the process should exit
> > with an error.
>
> I have it set up as a pass/fail test.

Great, thanks.

> Below is a sample output of the
> test, including a failure on IPv6 TCP Zerocopy audit (a failure that may
> lead to a memory leak).

Can you elaborate on this suspected memory leak?

> I wanted to review the report with you before
> I push up the v2 patch into net-next.
>
> Are these extra tests what you were expecting? Is it OK that it doesn’t
> flow well?

Do you mean how the output looks? That seems fine.

> Also, there is a failure about every 3rd time I run it,
> indicating that some TX or Zerocopy messages are lost. Is that OK?

No that is not. These tests are run in a continuous test
infrastructure. We should try hard to avoid flakiness.

If this intermittent failure is due to a real kernel bug, please move
that part to a flag (or just comment out) to temporarily exclude it
from continuous testing.

More commonly it is an issue with the test itself. My SO_TXTIME test
from last week depends on timing, which has me somewhat worried when
run across a wide variety of (likely virtualized) platforms. I
purposely chose large timescales to minimize the risk.

On a related note, tests run as part of continuous testing should run
as briefly as possible. Perhaps we need to reduce the time per run to
accommodate for the new variants you are adding.

>
> ipv4
> tcp
> tcp rx:  11665 MB/s   189014 calls/s
> tcp tx:  11665 MB/s   197859 calls/s 197859 msg/s
> tcp rx:  11706 MB/s   190749 calls/s
> tcp tx:  11706 MB/s   198545 calls/s 198545 msg/s
> tcp tx:  11653 MB/s   197645 calls/s 197645 msg/s
> tcp rx:  11653 MB/s   189292 calls/s
> tcp zerocopy
> tcp rx:   6658 MB/s   111849 calls/s
> tcp tx:   6658 MB/s   112940 calls/s 112940 msg/s
> tcp tx:   6824 MB/s   115742 calls/s 115742 msg/s
> tcp rx:   6824 MB/s   115711 calls/s
> tcp rx:   6865 MB/s   116402 calls/s
> tcp tx:   6865 MB/s   116440 calls/s 116440 msg/s
> tcp zerocopy audit
> tcp tx:   6646 MB/s   112729 calls/s 112729 msg/s
> tcp rx:   6646 MB/s   111445 calls/s
> tcp rx:   6672 MB/s   112833 calls/s
> tcp tx:   6672 MB/s   113164 calls/s 113164 msg/s
> tcp tx:   6624 MB/s   112355 calls/s 112355 msg/s
> tcp rx:   6624 MB/s   110877 calls/s
> Summary over 4.000 seconds...
> sum tcp tx:   6813 MB/s     451402 calls (112850/s)     451402 msgs (112850/s)
> Zerocopy acks:              451402 received                 0 errors
> udp
> udp rx:    914 MB/s   651407 calls/s
> udp tx:    925 MB/s   659274 calls/s  15697 msg/s
> udp rx:    919 MB/s   654859 calls/s
> udp tx:    919 MB/s   654864 calls/s  15592 msg/s
> udp rx:    914 MB/s   651668 calls/s
> udp tx:    914 MB/s   651630 calls/s  15515 msg/s
> udp rx:    918 MB/s   654642 calls/s
> udp gso
> udp rx:   2271 MB/s  1618319 calls/s
> udp tx:   2515 MB/s    42658 calls/s  42658 msg/s
> udp rx:   2337 MB/s  1665280 calls/s
> udp tx:   2551 MB/s    43276 calls/s  43276 msg/s
> udp rx:   2338 MB/s  1665792 calls/s
> udp tx:   2557 MB/s    43384 calls/s  43384 msg/s
> udp rx:   2339 MB/s  1666560 calls/s
> udp gso zerocopy
> udp rx:   1725 MB/s  1229087 calls/s
> udp tx:   1840 MB/s    31219 calls/s  31219 msg/s
> udp rx:   1850 MB/s  1318460 calls/s
> udp tx:   1850 MB/s    31388 calls/s  31388 msg/s
> udp rx:   1845 MB/s  1314428 calls/s
> udp tx:   1845 MB/s    31299 calls/s  31299 msg/s
> udp gso timestamp
> udp rx:   2286 MB/s  1628928 calls/s
> udp tx:   2446 MB/s    41499 calls/s  41499 msg/s
> udp rx:   2354 MB/s  1677312 calls/s
> udp tx:   2473 MB/s    41952 calls/s  41952 msg/s
> udp rx:   2342 MB/s  1668864 calls/s
> udp tx:   2471 MB/s    41925 calls/s  41925 msg/s
> udp rx:   2333 MB/s  1662464 calls/s
> udp gso zerocopy audit
> udp rx:   1787 MB/s  1273481 calls/s
> udp tx:   1832 MB/s    31082 calls/s  31082 msg/s
> udp rx:   1881 MB/s  1340476 calls/s
> udp tx:   1881 MB/s    31916 calls/s  31916 msg/s
> udp rx:   1880 MB/s  1339880 calls/s
> udp tx:   1881 MB/s    31904 calls/s  31904 msg/s
> udp rx:   1881 MB/s  1340010 calls/s
> Summary over 4.000 seconds...
> sum udp tx:   1912 MB/s     126694 calls (31673/s)     126694 msgs (31673/s)
> Zerocopy acks:              126694 received                 0 errors
> udp gso timestamp audit
> udp rx:   2259 MB/s  1609327 calls/s
> udp tx:   2410 MB/s    40879 calls/s  40879 msg/s
> udp rx:   2346 MB/s  1671168 calls/s
> udp tx:   2446 MB/s    41491 calls/s  41491 msg/s
> udp rx:   2358 MB/s  1680128 calls/s
> udp tx:   2448 MB/s    41522 calls/s  41522 msg/s
> udp rx:   2356 MB/s  1678848 calls/s
> Summary over 4.000 seconds...
> sum udp tx:   2494 MB/s     165276 calls (41319/s)     165276 msgs (41319/s)
> Tx Timestamps:              165276 received                 0 errors
> udp gso zerocopy timestamp audit
> udp rx:   1658 MB/s  1181647 calls/s
> udp tx:   1678 MB/s    28466 calls/s  28466 msg/s
> udp rx:   1718 MB/s  1224010 calls/s
> udp tx:   1718 MB/s    29146 calls/s  29146 msg/s
> udp rx:   1718 MB/s  1224086 calls/s
> udp tx:   1718 MB/s    29144 calls/s  29144 msg/s
> udp rx:   1717 MB/s  1223464 calls/s
> Summary over 4.000 seconds...
> sum udp tx:   1747 MB/s     115771 calls (28942/s)     115771 msgs (28942/s)
> Tx Timestamps:              115771 received                 0 errors
> Zerocopy acks:              115771 received                 0 errors
> ipv6
> tcp
> tcp tx:  11470 MB/s   194547 calls/s 194547 msg/s
> tcp rx:  11470 MB/s   188442 calls/s
> tcp tx:  11803 MB/s   200193 calls/s 200193 msg/s
> tcp rx:  11803 MB/s   194526 calls/s
> tcp tx:  11832 MB/s   200681 calls/s 200681 msg/s
> tcp rx:  11832 MB/s   194459 calls/s
> tcp zerocopy
> tcp rx:   7482 MB/s    80176 calls/s
> tcp tx:   7482 MB/s   126908 calls/s 126908 msg/s
> tcp rx:   6641 MB/s   112609 calls/s
> tcp tx:   6641 MB/s   112649 calls/s 112649 msg/s
> tcp tx:   6584 MB/s   111683 calls/s 111683 msg/s
> tcp rx:   6584 MB/s   111617 calls/s
> tcp zerocopy audit
> tcp tx:   6719 MB/s   113968 calls/s 113968 msg/s
> tcp rx:   6719 MB/s   113893 calls/s
> tcp rx:   6772 MB/s   114831 calls/s
> tcp tx:   6772 MB/s   114872 calls/s 114872 msg/s
> tcp tx:   6793 MB/s   115226 calls/s 115226 msg/s
> tcp rx:   7075 MB/s   116595 calls/s
> Summary over 4.000 seconds...
> sum tcp tx:   6921 MB/s     458580 calls (114645/s)     458580 msgs (114645/s)
> ./udpgso_bench_tx: Unexpected number of Zerocopy completions:    458580 expected    458578 received

Is this the issue you're referring to? Good catch. Clearly this is a
good test to have :) That is likely due to some timing issue in the
test, e.g., not waiting long enough to harvest all completions. That is
something I can look into after the code is merged.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-27 21:46       ` Willem de Bruijn
@ 2019-05-27 22:56         ` Fred Klassen
  2019-05-28  1:15           ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-27 22:56 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn



> On May 27, 2019, at 2:46 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> >> Also, my v2 fix in net is still up for debate. In its current state, it
>> meets my application’s requirements, but may not meet all of yours.

> I gave more specific feedback on issues with it (referencing zerocopy
> and IP_TOS, say).
> 

Unfortunately I don’t have a very good email setup, and I found a
bunch of your comments in my junk folder. That was on Saturday,
and on Sunday I spent some time implementing your suggestions.
I have not pushed the changes up yet. 

I wanted to discuss whether or not to attach a buffer to the 
recvmsg(fd, &msg, MSG_ERRQUEUE). Without it, I have
MSG_TRUNC errors in my msg_flags. Either I have to add
a buffer, or ignore that error flag. 

> Also, it is safer to update only the relevant timestamp bits in
> tx_flags, rather than a blanket overwrite, given that some bits are
> already set in skb_segment. I have not checked whether this is
> absolutely necessary.
> 
 I agree. See tcp_fragment_tstamp().

I think this should work.

skb_shinfo(seg)->tx_flags |=
			(skb_shinfo(gso_skb)->tx_flags & SKBTX_ANY_TSTAMP);

>> I am still open to suggestions, but so far I don’t have an alternate
>> solution that doesn’t break what I need working.
> 
> Did you see my response yesterday? I can live with the first segment.
> Even if I don't think that it buys much in practice given xmit_more
> (and it does cost something, e.g., during requeueing).
> 

I’m sorry, I didn’t receive a response. Once again, I am struggling
with crappy email setup. Hopefully as of today my junk mail filters are
set up properly.

I’d like to see that comment. I have been wondering about xmit_more
myself. I don’t think it changes anything for software timestamps,
but it may with hardware timestamps.

I have service contracts with Intel and Mellanox. I can open up a ticket
with them to see exactly when the timestamp is taken. I believe you
mentioned before that this is vendor specific.

> It is not strictly necessary, but indeed often a nice to have. We
> generally reference by SHA1, so wait with submitting the test until
> the fix is merged. See also the ipv6 flowlabel test that I just sent
> for one example.

Thanks. I will hold off with the test until I get a final fix in net, and I’ll use
your example.

>> Below is a sample output of the
>> test, including a failure on IPv6 TCP Zerocopy audit (a failure that may
>> lead to a memory leak).
> 
> Can you elaborate on this suspected memory leak?

A user program cannot free a zerocopy buffer until it is reported as free.
If zerocopy events are not reported, that could be a memory leak.

I may have a fix. I have added a -P option when I am running an audit.
It doesn’t appear to affect performance, and since implementing it I have
received all error messages expected for both timestamp and zerocopy.

I am still testing. 

>> I wanted to review the report with you before
>> I push up the v2 patch into net-next.
>> 
>> Are these extra tests what you were expecting? Is it OK that it doesn’t
>> flow well?
> 
> Do you mean how the output looks? That seems fine.
> 

Good. Thanks.

>> Also, there is a failure about every 3rd time I run it,
>> indicating that some TX or Zerocopy messages are lost. Is that OK?
> 
> No that is not. These tests are run in a continuous test
> infrastructure. We should try hard to avoid flakiness.
> 

As per above comment, I think I removed the flakiness. I will run
overnight to confirm.

> If this intermittent failure is due to a real kernel bug, please move
> that part to a flag (or just comment out) to temporarily exclude it
> from continuous testing.
> 
> More commonly it is an issue with the test itself. My SO_TXTIME test
> from last week depends on timing, which has me somewhat worried when
> run across a wide variety of (likely virtualized) platforms. I
> purposely chose large timescales to minimize the risk.
> 
> On a related note, tests run as part of continuous testing should run
> as briefly as possible. Perhaps we need to reduce the time per run to
> accommodate for the new variants you are adding.
> 

I could reduce testing from 4 to 2 seconds. Anything below that and I
miss some reports. When results were flaky, I found I could reproduce
them in as little as 1 second.
>> Summary over 4.000 seconds...
>> sum tcp tx:   6921 MB/s     458580 calls (114645/s)     458580 msgs (114645/s)
>> ./udpgso_bench_tx: Unexpected number of Zerocopy completions:    458580 expected    458578 received
> 
> Is this the issue you're referring to? Good catch. Clearly this is a
> good test to have :) That is likely due to some timing issue in the
> test, e.g., no waiting long enough to harvest all completions. That is
> something I can look into after the code is merged.

Thanks.

Should the test have failed at this point? I did return an error(), but
the script kept running.

As stated, I don’t want to push up until I have tested more fully, and
the fix is accepted (which requires a v3). If you want to review what
I have, I can push it up now with the understanding that I may still
fine tune things. 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-27 22:56         ` Fred Klassen
@ 2019-05-28  1:15           ` Willem de Bruijn
  2019-05-28  5:19             ` Fred Klassen
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-28  1:15 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Mon, May 27, 2019 at 6:56 PM Fred Klassen <fklassen@appneta.com> wrote:
>
>
>
> > On May 27, 2019, at 2:46 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> >> Also, my v2 fix in net is still up for debate. In its current state, it
> >> meets my application’s requirements, but may not meet all of yours.
>
> > I gave more specific feedback on issues with it (referencing zerocopy
> > and IP_TOS, say).
> >
>
> Unfortunately I don’t have a very good email setup, and I found a
> bunch of your comments in my junk folder. That was on Saturday,
> and on Sunday I spent some time implementing your suggestions.
> I have not pushed the changes up yet.
>
> I wanted to discuss whether or not to attach a buffer to the
> recvmsg(fd, &msg, MSG_ERRQUEUE). Without it, I have
> MSG_TRUNC errors in my msg_flags. Either I have to add
> a buffer, or ignore that error flag.

Either sounds reasonable. It is an expected and well understood
message if underprovisioning the receive data buffer.

> > Also, it is safer to update only the relevant timestamp bits in
> > tx_flags, rather than a blanket overwrite, given that some bits are
> > already set in skb_segment. I have not checked whether this is
> > absolutely necessary.
> >
>  I agree. See tcp_fragment_tstamp().
>
> I think this should work.
>
> skb_shinfo(seg)->tx_flags |=
>                         (skb_shinfo(gso_skb)->tx_flags & SKBTX_ANY_TSTAMP);

Agreed. It is more obviously correct. Only drawback is that the RMW is
more expensive than a straight assignment.

> >> I am still open to suggestions, but so far I don’t have an alternate
> >> solution that doesn’t break what I need working.
> >
> > Did you see my response yesterday? I can live with the first segment.
> > Even if I don't think that it buys much in practice given xmit_more
> > (and it does cost something, e.g., during requeueing).
> >
>
> I’m sorry, I didn’t receive a response. Once again, I am struggling
> with crappy email setup. Hopefully as of today my junk mail filters are
> set up properly.
>
> I’d like to see that comment.

The netdev list is archived and available through various websites,
like lore.kernel.org/netdev . As well as the patches with comments at
patchwork.ozlabs.org/project/netdev/list

> I have been wondering about xmit_more
> myself. I don’t think it changes anything for software timestamps,
> but it may with hardware timestamps.

It arguably makes the software timestamp too early if taken on the
first segment, as the NIC is only informed of all the new descriptors
when the last segment is written and the doorbell is rung.

> > Can you elaborate on this suspected memory leak?
>
> A user program cannot free a zerocopy buffer until it is reported as free.
> If zerocopy events are not reported, that could be a memory leak.
>
> I may have a fix. I have added a -P option when I am running an audit.
> It doesn’t appear to affect performance, and since implementing it I have
> received all error messages expected for both timestamp and zerocopy.
>
> I am still testing.

I see, a userspace leak from lack of completion notification.

If the issue is a few missing notifications at the end of the run,
then perhaps cfg_waittime_ms is too short.

> > On a related note, tests run as part of continuous testing should run
> > as briefly as possible. Perhaps we need to reduce the time per run to
> > accommodate for the new variants you are adding.
> >
>
> I could reduce testing from 4 to 2 seconds. Anything below that and I
> miss some reports. When I found flakey results, I found I could reproduce
> them in as little as 1 second.
> >> Summary over 4.000 seconds...
> >> sum tcp tx:   6921 MB/s     458580 calls (114645/s)     458580 msgs (114645/s)
> >> ./udpgso_bench_tx: Unexpected number of Zerocopy completions:    458580 expected    458578 received
> >
> > Is this the issue you're referring to? Good catch. Clearly this is a
> > good test to have :) That is likely due to some timing issue in the
> > test, e.g., not waiting long enough to harvest all completions. That is
> > something I can look into after the code is merged.
>
> Thanks.
>
> Should the test have failed at this point? I did return an error(), but
> the script kept running.

This should normally be cause for test failure, I think yes. Though
it's fine to send the code for review and possibly even merge, so that
I can take a look.

> As stated, I don’t want to push up until I have tested more fully, and
> the fix is accepted (which requires a v3). If you want to review what
> I have, I can push it up now with the understanding that I may still
> fine tune things.

Sounds good, thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-28  1:15           ` Willem de Bruijn
@ 2019-05-28  5:19             ` Fred Klassen
  2019-05-28 15:08               ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-28  5:19 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn



> On May 27, 2019, at 6:15 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>> I wanted to discuss whether or not to attach a buffer to the
>> recvmsg(fd, &msg, MSG_ERRQUEUE). Without it, I have
>> MSG_TRUNC errors in my msg_flags. Either I have to add
>> a buffer, or ignore that error flag.
> 
> Either sounds reasonable. It is an expected and well understood
> message if underprovisioning the receive data buffer.
> 

I’ll stick with setting up buffers. It will fail if there are any bugs 
introduced in buffer copy routines.

> 
> The netdev list is archived and available through various websites,
> like lore.kernel.org/netdev . As well as the patches with comments at
> patchwork.ozlabs.org/project/netdev/list
> 

Much better. Sure beats hunting down lost emails.


>> I have been wondering about xmit_more
>> myself. I don’t think it changes anything for software timestamps,
>> but it may with hardware timestamps.
> 
> It arguably makes the software timestamp too early if taken on the
> first segment, as the NIC is only informed of all the new descriptors
> when the last segment is written and the doorbell is rung.
> 

Totally makes sense. Possibly software TX timestamps can be improved
by delaying them until just before the ring buffer is advanced.
It would have to be updated in each driver. I may have a look at
this once I am complete this patch. Hopefully that one will be a bit
smoother. 

>>> Can you elaborate on this suspected memory leak?
>> 
>> A user program cannot free a zerocopy buffer until it is reported as free.
>> If zerocopy events are not reported, that could be a memory leak.
>> 
>> I may have a fix. I have added a -P option when I am running an audit.
>> It doesn’t appear to affect performance, and since implementing it I have
>> received all error messages expected for both timestamp and zerocopy.
>> 
>> I am still testing.
> 
> I see, a userspace leak from lack of completion notification.
> 
> If the issue is a few missing notifications at the end of the run,
> then perhaps cfg_waittime_ms is too short.
> 

I’ll get back to you when I have tested this more thoroughly. Early results
suggest that adding the -P poll() option has fixed it without any appreciable
performance hit. I’ll share raw results with you, and we can make a final
decision together.

>> Should the test have failed at this point? I did return an error(), but
>> the script kept running.
> 
> This should normally be cause for test failure, I think yes. Though
> it's fine to send the code for review and possibly even merge, so that
> I can take a look.
> 

Sounds like udpgso_bench.sh needs a ’set -e’ to ensure it stops on
first error.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-28  5:19             ` Fred Klassen
@ 2019-05-28 15:08               ` Willem de Bruijn
  2019-05-28 16:57                 ` Fred Klassen
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-28 15:08 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

> >> I have been wondering about xmit_more
> >> myself. I don’t think it changes anything for software timestamps,
> >> but it may with hardware timestamps.
> >
> > It arguably makes the software timestamp too early if taken on the
> > first segment, as the NIC is only informed of all the new descriptors
> > when the last segment is written and the doorbell is rung.
> >
>
> Totally makes sense. Possibly software TX timestamps can be improved
> by delaying them until just before the ring buffer is advanced.
> It would have to be updated in each driver. I may have a look at
> this once I am complete this patch. Hopefully that one will be a bit
> smoother.

How do you see that? The skb_tstamp_tx call currently is already the
last action before ringing the doorbell, after setting up the
descriptor. It cannot be set later.

The only issue specific to GSO is that xmit_more can forego this
doorbell until the last segment. We want to complicate this logic with
a special case based on tx_flags. A process that cares should either
not use GSO, or the timestamp should be associated with the last
segment as I've been arguing so far.

> >>> Can you elaborate on this suspected memory leak?
> >>
> >> A user program cannot free a zerocopy buffer until it is reported as free.
> >> If zerocopy events are not reported, that could be a memory leak.
> >>
> >> I may have a fix. I have added a -P option when I am running an audit.
> >> It doesn’t appear to affect performance, and since implementing it I have
> >> received all error messages expected for both timestamp and zerocopy.
> >>
> >> I am still testing.
> >
> > I see, a userspace leak from lack of completion notification.
> >
> > If the issue is a few missing notifications at the end of the run,
> > then perhaps cfg_waittime_ms is too short.
> >
>
> I’ll get back to you when I have tested this more thoroughly. Early results
> suggest that adding the -P poll() option has fixed it without any appreciable
> performance hit. I’ll share raw results with you, and we can make a final
> decision together.

In the main loop? It still is peculiar that notifications appear to go
missing unless the process blocks waiting for them. Nothing in
sock_zerocopy_callback or the queueing onto the error queue should
cause drops, as far as I know.

>
> >> Should the test have failed at this point? I did return an error(), but
> >> the script kept running.
> >
> > This should normally be cause for test failure, I think yes. Though
> > it's fine to send the code for review and possibly even merge, so that
> > I can take a look.
> >
>
> Sounds like udpgso_bench.sh needs a ’set -e’ to ensure it stops on
> first error.

Indeed. Ideally even run all tests, but return error if any failed,
like this recent patch

  selftests/bpf: fail test_tunnel.sh if subtests fail
  https://patchwork.ozlabs.org/patch/1105221/

but that may be a lot of code churn and better left to a separate patch.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-28 15:08               ` Willem de Bruijn
@ 2019-05-28 16:57                 ` Fred Klassen
  2019-05-28 17:07                   ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Fred Klassen @ 2019-05-28 16:57 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn



> On May 28, 2019, at 8:08 AM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 

I will push up latest patches soon.

I did some testing and discovered that only TCP audit tests failed. They
failed much less often when enabling poll. About one run in 20
still failed. Therefore I commented out the TCP audit tests.

As for the other tests, this is what I got with poll() disabled…

udp gso zerocopy timestamp audit
udp rx:   1611 MB/s  1148129 calls/s
udp tx:   1659 MB/s    28146 calls/s  28146 msg/s
udp rx:   1686 MB/s  1201494 calls/s
udp tx:   1685 MB/s    28579 calls/s  28579 msg/s
udp rx:   1685 MB/s  1200402 calls/s
udp tx:   1683 MB/s    28552 calls/s  28552 msg/s
Summary over 3.000 seconds...
sum udp tx:   1716 MB/s      85277 calls (28425/s)      85277 msgs (28425/s)
Tx Timestamps:               85277 received                 0 errors
Zerocopy acks:               85277 received                 0 errors

Here you see that with poll() enabled, it is a bit slower, so I don’t have it
enabled in udpgso_bench.sh …

udp gso zerocopy timestamp audit
udp rx:   1591 MB/s  1133945 calls/s
udp tx:   1613 MB/s    27358 calls/s  27358 msg/s
udp rx:   1644 MB/s  1171674 calls/s
udp tx:   1643 MB/s    27869 calls/s  27869 msg/s
udp rx:   1643 MB/s  1170666 calls/s
udp tx:   1641 MB/s    27845 calls/s  27845 msg/s
Summary over 3.000 seconds...
sum udp tx:   1671 MB/s      83072 calls (27690/s)      83072 msgs (27690/s)
Tx Timestamps:               83072 received                 0 errors
Zerocopy acks:               83072 received                 0 errors


You may be interested that I reduced test lengths from 4 to 3 seconds,
but I am still getting 3 reports per test. I picked up the extra report by
changing 'if (tnow > treport)’ to 'if (tnow >= treport)’

> The only issue specific to GSO is that xmit_more can forego this
> doorbell until the last segment. We want to complicate this logic with
> a special case based on tx_flags. A process that cares should either
> not use GSO, or the timestamp should be associated with the last
> segment as I've been arguing so far.

This is the area I was thinking of looking into. I’m not sure whether it will
work or whether it will be too messy. It may be worth a little bit of digging to
see if there is anything there. That will be down the road a bit.

>> 
>> I’ll get back to you when I have tested this more thoroughly. Early results
>> suggest that adding the -P poll() option has fixed it without any appreciable
>> performance hit. I’ll share raw results with you, and we can make a final
>> decision together.
> 
> In the main loop? It still is peculiar that notifications appear to go
> missing unless the process blocks waiting for them. Nothing in
> sock_zerocopy_callback or the queueing onto the error queue should
> cause drops, as far as I know.
> 

Now that I know the issue is only in TCP, I can speculate that all bytes are
being reported, but done with fewer messages. It may warrant some
investigation in case there is some kind of bug.

> Indeed. Ideally even run all tests, but return error if any failed,
> like this recent patch
> 
>  selftests/bpf: fail test_tunnel.sh if subtests fail
>  https://patchwork.ozlabs.org/patch/1105221/
> 
> but that may be a lot of code churn and better left to a separate patch.

I like it. I have it coded up, and it seems to work well. I’ll make a
separate commit in the patch set so we can yank it out if you feel
it is too much.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-28 16:57                 ` Fred Klassen
@ 2019-05-28 17:07                   ` Willem de Bruijn
  2019-05-28 17:11                     ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-28 17:07 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

On Tue, May 28, 2019 at 12:57 PM Fred Klassen <fklassen@appneta.com> wrote:
>
>
>
> > On May 28, 2019, at 8:08 AM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> >
>
> I will push up latest patches soon.
>
> I did some testing and discovered that only TCP audit tests failed. They
> failed much less often when enabling poll. About one run in 20
> still failed. Therefore I commented out the TCP audit tests.

Sounds good, thanks.

> You may be interested that I reduced test lengths from 4 to 3 seconds,
> but I am still getting 3 reports per test. I picked up the extra report by
> changing 'if (tnow > treport)’ to 'if (tnow >= treport)’

Nice!

> > The only issue specific to GSO is that xmit_more can forego this
> > doorbell until the last segment. We want to complicate this logic with
> > a special case based on tx_flags. A process that cares should either
> > not use GSO, or the timestamp should be associated with the last
> > segment as I've been arguing so far.
>
> This is the area I was thinking of looking into. I’m not sure whether it will
> work or whether it will be too messy. It may be worth a little bit of digging to
> see if there is anything there. That will be down the road a bit.

Sorry, I meant we [do not (!)] want to complicate this logic. And
definitely don't want to read skb_shinfo where that cacheline isn't
accessed already.

Given xmit_more, do you still find the first segment the right one to
move the timestamp tx_flags to in __udp_gso_segment?

>
> >>
> >> I’ll get back to you when I have tested this more thoroughly. Early results
> >> suggest that adding the -P poll() option has fixed it without any appreciable
> >> performance hit. I’ll share raw results with you, and we can make a final
> >> decision together.
> >
> > In the main loop? It still is peculiar that notifications appear to go
> > missing unless the process blocks waiting for them. Nothing in
> > sock_zerocopy_callback or the queueing onto the error queue should
> > cause drops, as far as I know.
> >
>
> Now that I know the issue is only in TCP, I can speculate that all bytes are
> being reported, but done with fewer messages. It may warrant some
> investigation in case there is some kind of bug.

This would definitely still be a bug and should not happen. We have
quite a bit of experience with TCP zerocopy and I have not run into
this in practice, so I do think that it is somehow a test artifact.

> > Indeed. Ideally even run all tests, but return error if any failed,
> > like this recent patch
> >
> >  selftests/bpf: fail test_tunnel.sh if subtests fail
> >  https://patchwork.ozlabs.org/patch/1105221/
> >
> > but that may be a lot of code churn and better left to a separate patch.
>
> I like it. I have it coded up, and it seems to work well. I’ll make a
> separate commit in the patch set so we can yank it out if you feel
> it is too much.

Great. Yes, it sounds like an independent improvement, in which case
it is easier to review as a separate patch. If you already have it, by
all means send it as part of the larger patchset.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH net 4/4] net/udpgso_bench_tx: audit error queue
  2019-05-28 17:07                   ` Willem de Bruijn
@ 2019-05-28 17:11                     ` Willem de Bruijn
  0 siblings, 0 replies; 34+ messages in thread
From: Willem de Bruijn @ 2019-05-28 17:11 UTC (permalink / raw)
  To: Fred Klassen
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Network Development, LKML, linux-kselftest, Willem de Bruijn

> > Now that I know the issue is only in TCP, I can speculate that all bytes are
> > being reported, but done with fewer messages. It may warrant some
> > investigation in case there is some kind of bug.
>
> This would definitely still be a bug and should not happen. We have
> quite a bit of experience with TCP zerocopy and I have not run into
> this in practice, so I do think that it is somehow a test artifact.

To be clear, I'm not saying that it is an artifact of your extensions.
It's quite likely that the zerocopy benchmark was a bit flaky in that
regard all along. No need to spend further time on that for this
patchset.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-05-28 17:12 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-23 21:06 [PATCH net 0/4] Allow TX timestamp with UDP GSO Fred Klassen
2019-05-23 21:06 ` [PATCH net 1/4] net/udp_gso: " Fred Klassen
2019-05-23 21:39   ` Willem de Bruijn
2019-05-24  1:38     ` Fred Klassen
2019-05-24  4:53       ` Willem de Bruijn
2019-05-24 16:34         ` Fred Klassen
2019-05-24 19:29           ` Willem de Bruijn
2019-05-24 22:01             ` Fred Klassen
2019-05-25 15:20               ` Willem de Bruijn
2019-05-25 18:47                 ` Fred Klassen
2019-05-27  1:30                   ` Willem de Bruijn
2019-05-27  2:09                     ` Willem de Bruijn
2019-05-25 20:46     ` Fred Klassen
2019-05-23 21:59   ` Willem de Bruijn
2019-05-25 20:09     ` Fred Klassen
2019-05-25 20:47     ` Fred Klassen
2019-05-23 21:06 ` [PATCH net 2/4] net/udpgso_bench_tx: options to exercise TX CMSG Fred Klassen
2019-05-23 21:45   ` Willem de Bruijn
2019-05-23 21:52   ` Willem de Bruijn
2019-05-24  2:10     ` Fred Klassen
2019-05-23 21:06 ` [PATCH net 3/4] net/udpgso_bench_tx: fix sendmmsg on unconnected socket Fred Klassen
2019-05-23 21:06 ` [PATCH net 4/4] net/udpgso_bench_tx: audit error queue Fred Klassen
2019-05-23 21:56   ` Willem de Bruijn
2019-05-24  1:27     ` Fred Klassen
2019-05-24  5:02       ` Willem de Bruijn
2019-05-27 21:30     ` Fred Klassen
2019-05-27 21:46       ` Willem de Bruijn
2019-05-27 22:56         ` Fred Klassen
2019-05-28  1:15           ` Willem de Bruijn
2019-05-28  5:19             ` Fred Klassen
2019-05-28 15:08               ` Willem de Bruijn
2019-05-28 16:57                 ` Fred Klassen
2019-05-28 17:07                   ` Willem de Bruijn
2019-05-28 17:11                     ` Willem de Bruijn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).