* [RFC v3 net-next 00/18] Time based packet transmission
@ 2018-03-07  1:12 ` Jesus Sanchez-Palencia
  0 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

This series is the v3 of the Time based packet transmission RFC, which was
originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
and further developed by us with the addition of the tbs qdisc
(v2: https://lwn.net/Articles/744797/ ).

It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
implements support for hw offloading on the igb driver for the Intel
i210 NIC. The tbs qdisc also supports SW best effort that can be used
as a fallback.

The main changes since v2 can be found below.

Fixes since v2:
 - skb->tstamp is only cleared on the forwarding path;
 - ktime_t is no longer the type used for timestamps (s64 is);
 - get_unaligned() is now used for copying data from the cmsg header;
 - added getsockopt() support for SO_TXTIME;
 - restricted SO_TXTIME input range to [0,1];
 - removed ns_capable() check from __sock_cmsg_send();
 - the qdisc control struct now uses a 32-bit bitmap for config flags;
 - fixed qdisc backlog decrement bug;
 - 'overlimits' is now incremented on dequeue() drops in addition to the
   'dropped' counter;

Interface changes since v2:
 * CMSG interface:
   - added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
   - added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
 * tc-tbs:
   - clockid now receives a string;
     e.g.: CLOCK_REALTIME or /dev/ptp0
   - offload is now a standalone argument (i.e. no more offload 1);
   - sorting is now an argument that enables txtime-based sorting provided
     by the qdisc;
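
As an illustration of the string-based clockid argument, the sketch below
maps a name such as CLOCK_REALTIME or a device path such as /dev/ptp0 to a
clockid_t. It is not the tc parsing code from this series: parse_clockid()
and fd_to_clockid() are hypothetical helpers, the table of accepted names is
an assumption, and the fd-to-clockid encoding follows the convention used by
the kernel's testptp.c sample.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11
#endif

/*
 * Dynamic POSIX clock id for an open file descriptor, following the
 * encoding used by the kernel's PTP sample code (testptp.c).
 */
static clockid_t fd_to_clockid(int fd)
{
	return (clockid_t)((~(unsigned int)fd << 3) | 3u);
}

/*
 * Hypothetical helper: translate a clockid string into a clockid_t.
 * Returns 0 on success and stores the id in *out. For a device path,
 * *fd_out receives the descriptor that must stay open while the clock
 * is in use; for named clocks it is set to -1. Returns -1 on failure.
 */
static int parse_clockid(const char *name, clockid_t *out, int *fd_out)
{
	static const struct {
		const char *name;
		clockid_t id;
	} named[] = {
		{ "CLOCK_REALTIME", CLOCK_REALTIME },
		{ "CLOCK_MONOTONIC", CLOCK_MONOTONIC },
		{ "CLOCK_TAI", CLOCK_TAI },
	};
	size_t i;
	int fd;

	*fd_out = -1;
	for (i = 0; i < sizeof(named) / sizeof(named[0]); i++) {
		if (!strcmp(name, named[i].name)) {
			*out = named[i].id;
			return 0;
		}
	}

	/* Anything else is only accepted if it looks like a device node. */
	if (strncmp(name, "/dev/", 5))
		return -1;

	fd = open(name, O_RDWR);
	if (fd < 0)
		return -1;

	*out = fd_to_clockid(fd);
	*fd_out = fd;
	return 0;
}
```

A caller would pass the string taken from the command line and, for the
device case, close the returned descriptor once the clock is no longer
needed.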

Design changes since v2:
 - Now on the dequeue() path, tbs only drops an expired packet if it has the
   skb->tc_drop_if_late flag set. In practical terms, this will define if
   the semantics of txtime on a system is "not earlier than" or "not later
   than" a given timestamp;
 - Now on the enqueue() path, the qdisc will drop a packet if its clockid
   doesn't match the qdisc's one;
 - Sorting the packets based on their txtime is now an option for the qdisc.
   Effectively, this means it can be configured in 4 modes: HW offload or
   SW best-effort, sorting enabled or disabled;


The tbs qdisc is designed so it buffers packets until a configurable time before
their deadline (tx times). If sorting is enabled, regardless of HW offload or SW
fallback modes, the qdisc uses a rbtree internally so the buffered packets are
always 'ordered' by the earliest deadline.
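
To make the "earliest deadline first" ordering concrete, here is a small
user-space sketch. It uses qsort() on an array purely as a stand-in for the
rbtree mentioned above; struct pkt, cmp_txtime() and sort_by_deadline() are
illustrative names and are not taken from sch_tbs.c.

```c
#include <stdint.h>
#include <stdlib.h>

struct pkt {
	int64_t txtime;	/* requested transmission time, in nanoseconds */
	int id;		/* illustrative packet handle */
};

/* qsort() comparator: the packet with the earliest txtime sorts first. */
static int cmp_txtime(const void *a, const void *b)
{
	const struct pkt *pa = a, *pb = b;

	if (pa->txtime < pb->txtime)
		return -1;
	return pa->txtime > pb->txtime;
}

/* Order a batch of buffered packets by their deadline. */
static void sort_by_deadline(struct pkt *pkts, size_t n)
{
	qsort(pkts, n, sizeof(*pkts), cmp_txtime);
}
```

With this ordering the head of the array is always the next packet due,
which is the property the qdisc relies on when sorting is enabled.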

If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW best-effort,
it will use a 'scheduled' FIFO.

The other configurable parameter from the tbs qdisc is the clockid to be used.
In order to provide that, this series adds a new API to pkt_sched.h (i.e.
qdisc_watchdog_init_clockid()).

The tbs qdisc will drop any packets with a transmission time in the past or
when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
advance plus configuring the delta parameter for the system correctly makes
all the difference in reducing the number of drops. Moreover, note that the
delta parameter ends up defining the Tx time when SW best-effort is used
given that the timestamps won't be used by the NIC in this case.
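
The dequeue-side drop rule can be summarized with a short sketch. This is
one reading of the description above, not code from the series: the helper
name drop_on_dequeue() is hypothetical, and treating "expired" as
"txtime strictly before now" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a packet reaching the head of the queue should be
 * dropped. A late packet is only dropped when it carries the
 * drop_if_late flag; without the flag it is still transmitted, which
 * gives txtime a "not earlier than" meaning.
 */
static bool drop_on_dequeue(int64_t txtime_ns, int64_t now_ns,
			    bool drop_if_late)
{
	bool expired = txtime_ns < now_ns;	/* assumption: strict compare */

	return expired && drop_if_late;
}
```

Setting the flag therefore turns txtime into a "not later than" deadline,
while leaving it clear keeps the more permissive behaviour.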

Examples:

# SW best-effort with sorting #

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
               map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 tbs delta 100000 \
               clockid CLOCK_REALTIME sorting

    In this example the mqprio qdisc is set up first, then the tbs qdisc is
    configured onto the first hw Tx queue using SW best-effort with sorting
    enabled. Also, it is configured so the timestamps on each packet are in
    reference to the clockid CLOCK_REALTIME and so packets are dequeued from
    the qdisc 100000 nanoseconds before their transmission time.
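
    The effect of delta can be written down as simple arithmetic. The
    sketch below is only an illustration under that reading; the helper
    names dequeue_time_ns() and wait_ns() are hypothetical.

```c
#include <stdint.h>

/* Time at which a packet becomes eligible to leave the qdisc. */
static int64_t dequeue_time_ns(int64_t txtime_ns, int64_t delta_ns)
{
	return txtime_ns - delta_ns;
}

/*
 * How long a timer would still have to wait before the packet may be
 * dequeued, clamped at zero when that point has already passed.
 */
static int64_t wait_ns(int64_t now_ns, int64_t txtime_ns, int64_t delta_ns)
{
	int64_t wake = dequeue_time_ns(txtime_ns, delta_ns);

	return wake > now_ns ? wake - now_ns : 0;
}
```

    For a packet with txtime 1000000000 ns and delta 100000 ns, the packet
    becomes eligible at 999900000 ns on the configured clock.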


# HW offload without sorting #

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
               map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 tbs offload

    In this example, the Qdisc will use HW offload for the control of the
    transmission time through the network adapter. It's implicitly assumed
    that the timestamps in skbuffs are in reference to the interface's PHC and
    setting any other valid clockid would be treated as an error. Because
    there is no scheduling being performed in the qdisc, setting a delta != 0
    would also be considered an error.
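
    The two restrictions described for this mode could be expressed as a
    validation step like the one sketched here. struct tbs_cfg and
    tbs_cfg_check() are hypothetical and only mirror the prose above; the
    checks actually performed by the qdisc are not shown in this letter.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative view of the options passed on the tc command line. */
struct tbs_cfg {
	bool offload;		/* "offload" was given */
	bool sorting;		/* "sorting" was given */
	bool other_clockid;	/* a clockid other than the PHC was requested */
	int32_t delta;		/* nanoseconds */
};

/* Returns 0 if the combination is acceptable, -EINVAL otherwise. */
static int tbs_cfg_check(const struct tbs_cfg *c)
{
	if (c->offload && !c->sorting) {
		/* Timestamps are implicitly relative to the interface's PHC. */
		if (c->other_clockid)
			return -EINVAL;
		/* No scheduling happens in the qdisc, so delta has no meaning. */
		if (c->delta != 0)
			return -EINVAL;
	}
	return 0;
}
```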


# HW offload with sorting #

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
               map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
               clockid CLOCK_REALTIME sorting

    Here, the Qdisc will use HW offload for the txtime control again,
    but now sorting will be enabled, and thus there will be scheduling being
    performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
    and packets leave the Qdisc "delta" (100000) nanoseconds before
    their transmission time. Because this will be using HW offload and
    since dynamic clocks are not supported by the hrtimer, the system clock
    and the PHC clock must be synchronized for this mode to behave as expected.
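
    One rough way to check how far two clocks are apart from user space is
    to sample both and compare, as in the sketch below. This is only a
    sanity check under stated assumptions, not a synchronization mechanism
    (a tool such as phc2sys from linuxptp is normally used for that). The
    helper names are hypothetical, and the clockid for a PHC is assumed to
    have been obtained separately from its /dev/ptpN device.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

static int64_t ts_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/*
 * Estimate (b - a) in nanoseconds. Clock 'a' is read before and after
 * clock 'b', and the midpoint of the two readings is used as the
 * reference so that the read latency partially cancels out.
 * Returns 0 on success, -1 if any clock could not be read.
 */
static int clock_offset_ns(clockid_t a, clockid_t b, int64_t *offset)
{
	struct timespec a1, bt, a2;
	int64_t t1, t2;

	if (clock_gettime(a, &a1) || clock_gettime(b, &bt) ||
	    clock_gettime(a, &a2))
		return -1;

	t1 = ts_to_ns(&a1);
	t2 = ts_to_ns(&a2);
	*offset = ts_to_ns(&bt) - (t1 + (t2 - t1) / 2);
	return 0;
}
```

    A large or drifting offset between CLOCK_REALTIME and the PHC would
    suggest that packets are handed to the NIC at the wrong time relative
    to their launch time in this mode.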


For testing, we've followed an approach similar to the one used for v1 and v2,
and no significant changes in the results were observed. An updated version of
udp_tai.c is attached to this cover letter.

Lastly, most of the to-dos we still have before a final patchset are related
to further testing the igb support:
 - testing with L2 only talkers + AF_PACKET sockets;
 - testing tbs in conjunction with cbs;

Thanks for all the feedback so far,
Jesus


Jesus Sanchez-Palencia (12):
  sock: Fix SO_ZEROCOPY switch case
  net: Clear skb->tstamp only on the forwarding path
  posix-timers: Add CLOCKID_INVALID mask
  net: SO_TXTIME: Add clockid and drop_if_late params
  net: ipv4: raw: Handle remaining txtime parameters
  net: ipv4: udp: Handle remaining txtime parameters
  net: packet: Handle remaining txtime parameters
  net/sched: Add HW offloading capability to TBS
  igb: Refactor igb_configure_cbs()
  igb: Only change Tx arbitration when CBS is on
  igb: Refactor igb_offload_cbs()
  igb: Add support for TBS offload

Richard Cochran (4):
  net: Add a new socket option for a future transmit time.
  net: ipv4: raw: Hook into time based transmission.
  net: ipv4: udp: Hook into time based transmission.
  net: packet: Hook into time based transmission.

Vinicius Costa Gomes (2):
  net/sched: Allow creating a Qdisc watchdog with other clocks
  net/sched: Introduce the TBS Qdisc

 arch/alpha/include/uapi/asm/socket.h           |   5 +
 arch/frv/include/uapi/asm/socket.h             |   5 +
 arch/ia64/include/uapi/asm/socket.h            |   5 +
 arch/m32r/include/uapi/asm/socket.h            |   5 +
 arch/mips/include/uapi/asm/socket.h            |   5 +
 arch/mn10300/include/uapi/asm/socket.h         |   5 +
 arch/parisc/include/uapi/asm/socket.h          |   5 +
 arch/s390/include/uapi/asm/socket.h            |   5 +
 arch/sparc/include/uapi/asm/socket.h           |   5 +
 arch/xtensa/include/uapi/asm/socket.h          |   5 +
 drivers/net/ethernet/intel/igb/e1000_defines.h |  16 +
 drivers/net/ethernet/intel/igb/igb.h           |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c      | 239 +++++++---
 include/linux/netdevice.h                      |   2 +
 include/linux/posix-timers.h                   |   1 +
 include/linux/skbuff.h                         |   3 +
 include/net/pkt_sched.h                        |   7 +
 include/net/sock.h                             |   4 +
 include/uapi/asm-generic/socket.h              |   5 +
 include/uapi/linux/pkt_sched.h                 |  18 +
 net/core/skbuff.c                              |   1 -
 net/core/sock.c                                |  44 +-
 net/ipv4/raw.c                                 |   7 +
 net/ipv4/udp.c                                 |  10 +-
 net/packet/af_packet.c                         |  19 +
 net/sched/Kconfig                              |  11 +
 net/sched/Makefile                             |   1 +
 net/sched/sch_api.c                            |  11 +-
 net/sched/sch_tbs.c                            | 591 +++++++++++++++++++++++++
 29 files changed, 978 insertions(+), 63 deletions(-)
 create mode 100644 net/sched/sch_tbs.c

-- 
2.16.2

---8<---
/*
 * This program demonstrates transmission of UDP packets using the
 * system TAI timer.
 *
 * Copyright (C) 2017 linutronix GmbH
 *
 * Large portions taken from the linuxptp stack.
 * Copyright (C) 2011, 2012 Richard Cochran <richardcochran@gmail.com>
 *
 * Some portions taken from the sgd test program.
 * Copyright (C) 2015 linutronix GmbH
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, write to the Free Software Foundation, Inc.,
 * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 */
#define _GNU_SOURCE /*for CPU_SET*/
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <ifaddrs.h>
#include <linux/ethtool.h>
#include <linux/net_tstamp.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <netinet/in.h>
#include <poll.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define DEFAULT_PERIOD	1000000
#define DEFAULT_DELAY	500000
#define MCAST_IPADDR	"239.1.1.1"
#define UDP_PORT	7788

#ifndef SO_TXTIME
#define SO_TXTIME	61
#define SCM_TXTIME	SO_TXTIME
#define SCM_DROP_IF_LATE	62
#define SCM_CLOCKID	63
#endif

#define pr_err(s)	fprintf(stderr, s "\n")
#define pr_info(s)	fprintf(stdout, s "\n")

static int running = 1, use_so_txtime = 1;
static int period_nsec = DEFAULT_PERIOD;
static int waketx_delay = DEFAULT_DELAY;
static struct in_addr mcast_addr;

static int mcast_bind(int fd, int index)
{
	int err;
	struct ip_mreqn req;
	memset(&req, 0, sizeof(req));
	req.imr_ifindex = index;
	err = setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF, &req, sizeof(req));
	if (err) {
		pr_err("setsockopt IP_MULTICAST_IF failed: %m");
		return -1;
	}
	return 0;
}

static int mcast_join(int fd, int index, const struct sockaddr *grp,
		      socklen_t grplen)
{
	int err, off = 0;
	struct ip_mreqn req;
	struct sockaddr_in *sa = (struct sockaddr_in *) grp;

	memset(&req, 0, sizeof(req));
	memcpy(&req.imr_multiaddr, &sa->sin_addr, sizeof(struct in_addr));
	req.imr_ifindex = index;
	err = setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &req, sizeof(req));
	if (err) {
		pr_err("setsockopt IP_ADD_MEMBERSHIP failed: %m");
		return -1;
	}
	err = setsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, &off, sizeof(off));
	if (err) {
		pr_err("setsockopt IP_MULTICAST_LOOP failed: %m");
		return -1;
	}
	return 0;
}

static void normalize(struct timespec *ts)
{
	while (ts->tv_nsec > 999999999) {
		ts->tv_sec += 1;
		ts->tv_nsec -= 1000000000;
	}
}

static int sk_interface_index(int fd, const char *name)
{
	struct ifreq ifreq;
	int err;

	memset(&ifreq, 0, sizeof(ifreq));
	strncpy(ifreq.ifr_name, name, sizeof(ifreq.ifr_name) - 1);
	err = ioctl(fd, SIOCGIFINDEX, &ifreq);
	if (err < 0) {
		pr_err("ioctl SIOCGIFINDEX failed: %m");
		return err;
	}
	return ifreq.ifr_ifindex;
}

static int open_socket(const char *name, struct in_addr mc_addr, short port)
{
	struct sockaddr_in addr;
	int fd, index, on = 1;
	int priority = 3;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	fd = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (fd < 0) {
		pr_err("socket failed: %m");
		goto no_socket;
	}
	index = sk_interface_index(fd, name);
	if (index < 0)
		goto no_option;

	if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &priority, sizeof(priority))) {
		pr_err("Couldn't set priority");
		goto no_option;
	}
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on))) {
		pr_err("setsockopt SO_REUSEADDR failed: %m");
		goto no_option;
	}
	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr))) {
		pr_err("bind failed: %m");
		goto no_option;
	}
	if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, name, strlen(name))) {
		pr_err("setsockopt SO_BINDTODEVICE failed: %m");
		goto no_option;
	}
	addr.sin_addr = mc_addr;
	if (mcast_join(fd, index, (struct sockaddr *) &addr, sizeof(addr))) {
		pr_err("mcast_join failed");
		goto no_option;
	}
	if (mcast_bind(fd, index)) {
		goto no_option;
	}
	if (use_so_txtime && setsockopt(fd, SOL_SOCKET, SO_TXTIME, &on, sizeof(on))) {
		pr_err("setsockopt SO_TXTIME failed: %m");
		goto no_option;
	}

	return fd;
no_option:
	close(fd);
no_socket:
	return -1;
}

static int udp_open(const char *name)
{
	int fd;

	if (!inet_aton(MCAST_IPADDR, &mcast_addr))
		return -1;

	fd = open_socket(name, mcast_addr, UDP_PORT);

	return fd;
}

static int udp_send(int fd, void *buf, int len, __u64 txtime, clockid_t clkid)
{
	char control[CMSG_SPACE(sizeof(txtime)) + CMSG_SPACE(sizeof(clkid)) + CMSG_SPACE(sizeof(uint8_t))] = {};
	struct sockaddr_in sin;
	struct cmsghdr *cmsg;
	struct msghdr msg;
	struct iovec iov;
	ssize_t cnt;
	uint8_t drop_if_late = 1;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr = mcast_addr;
	sin.sin_port = htons(UDP_PORT);

	iov.iov_base = buf;
	iov.iov_len = len;

	memset(&msg, 0, sizeof(msg));
	msg.msg_name = &sin;
	msg.msg_namelen = sizeof(sin);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	/*
	 * We specify the transmission time in the CMSG.
	 */
	if (use_so_txtime) {
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		cmsg = CMSG_FIRSTHDR(&msg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_TXTIME;
		cmsg->cmsg_len = CMSG_LEN(sizeof(__u64));
		*((__u64 *) CMSG_DATA(cmsg)) = txtime;

		cmsg = CMSG_NXTHDR(&msg, cmsg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_CLOCKID;
		cmsg->cmsg_len = CMSG_LEN(sizeof(clockid_t));
		*((clockid_t *) CMSG_DATA(cmsg)) = clkid;

		cmsg = CMSG_NXTHDR(&msg, cmsg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_DROP_IF_LATE;
		cmsg->cmsg_len = CMSG_LEN(sizeof(uint8_t));
		*((uint8_t *) CMSG_DATA(cmsg)) = drop_if_late;
	}
	cnt = sendmsg(fd, &msg, 0);
	if (cnt < 1) {
		pr_err("sendmsg failed: %m");
		return cnt;
	}
	return cnt;
}

static unsigned char tx_buffer[256];
static int marker;

static int run_nanosleep(clockid_t clkid, int fd)
{
	struct timespec ts;
	int cnt, err;
	__u64 txtime;

	clock_gettime(clkid, &ts);

	/* Start one to two seconds in the future. */
	ts.tv_sec += 1;
	ts.tv_nsec = 1000000000 - waketx_delay;
	normalize(&ts);

	txtime = ts.tv_sec * 1000000000ULL + ts.tv_nsec;
	txtime += waketx_delay;

	while (running) {
		err = clock_nanosleep(clkid, TIMER_ABSTIME, &ts, NULL);
		switch (err) {
		case 0:
			cnt = udp_send(fd, tx_buffer, sizeof(tx_buffer), txtime, clkid);
			if (cnt != sizeof(tx_buffer)) {
				pr_err("udp_send failed");
			}
			memset(tx_buffer, marker++, sizeof(tx_buffer));
			ts.tv_nsec += period_nsec;
			normalize(&ts);
			txtime += period_nsec;
			break;
		case EINTR:
			continue;
		default:
			fprintf(stderr, "clock_nanosleep returned %d: %s\n",
				err, strerror(err));
			return err;
		}
	}

	return 0;
}

static int set_realtime(pthread_t thread, int priority, int cpu)
{
	cpu_set_t cpuset;
	struct sched_param sp;
	int err, policy;

	int min = sched_get_priority_min(SCHED_FIFO);
	int max = sched_get_priority_max(SCHED_FIFO);

	fprintf(stderr, "min %d max %d\n", min, max);

	if (priority < 0) {
		return 0;
	}

	err = pthread_getschedparam(thread, &policy, &sp);
	if (err) {
		fprintf(stderr, "pthread_getschedparam: %s\n", strerror(err));
		return -1;
	}

	sp.sched_priority = priority;

	err = pthread_setschedparam(thread, SCHED_FIFO, &sp);
	if (err) {
		fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
		return -1;
	}

	if (cpu < 0) {
		return 0;
	}
	CPU_ZERO(&cpuset);
	CPU_SET(cpu, &cpuset);
	err = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
	if (err) {
		fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
		return -1;
	}

	return 0;
}

static void usage(char *progname)
{
	fprintf(stderr,
		"\n"
		"usage: %s [options]\n"
		"\n"
		" -c [num]   run on CPU 'num'\n"
		" -d [num]   delay from wake up to transmission in nanoseconds (default %d)\n"
		" -h         prints this message and exits\n"
		" -i [name]  use network interface 'name'\n"
		" -p [num]   run with RT priority 'num'\n"
		" -P [num]   period in nanoseconds (default %d)\n"
		" -u         do not use SO_TXTIME\n"
		"\n",
		progname, DEFAULT_DELAY, DEFAULT_PERIOD);
}

int main(int argc, char *argv[])
{
	int c, cpu = -1, err, fd, priority = -1;
	clockid_t clkid = CLOCK_REALTIME;
	char *iface = NULL, *progname;

	/* Process the command line arguments. */
	progname = strrchr(argv[0], '/');
	progname = progname ? 1 + progname : argv[0];
	while (EOF != (c = getopt(argc, argv, "c:d:hi:p:P:u"))) {
		switch (c) {
		case 'c':
			cpu = atoi(optarg);
			break;
		case 'd':
			waketx_delay = atoi(optarg);
			break;
		case 'h':
			usage(progname);
			return 0;
		case 'i':
			iface = optarg;
			break;
		case 'p':
			priority = atoi(optarg);
			break;
		case 'P':
			period_nsec = atoi(optarg);
			break;
		case 'u':
			use_so_txtime = 0;
			break;
		case '?':
			usage(progname);
			return -1;
		}
	}

	if (waketx_delay > 999999999 || waketx_delay < 0) {
		pr_err("Bad wake up to transmission delay.");
		usage(progname);
		return -1;
	}

	if (period_nsec < 1000) {
		pr_err("Bad period.");
		usage(progname);
		return -1;
	}

	if (!iface) {
		pr_err("Need a network interface.");
		usage(progname);
		return -1;
	}

	if (set_realtime(pthread_self(), priority, cpu)) {
		return -1;
	}

	fd = udp_open(iface);
	if (fd < 0) {
		return -1;
	}

	err = run_nanosleep(clkid, fd);

	close(fd);
	return err;
}

static int udp_send(int fd, void *buf, int len, __u64 txtime, clockid_t clkid)
{
	char control[CMSG_SPACE(sizeof(txtime)) + CMSG_SPACE(sizeof(clkid)) + CMSG_SPACE(sizeof(uint8_t))] = {};
	struct sockaddr_in sin;
	struct cmsghdr *cmsg;
	struct msghdr msg;
	struct iovec iov;
	ssize_t cnt;
	uint8_t drop_if_late = 1;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr = mcast_addr;
	sin.sin_port = htons(UDP_PORT);

	iov.iov_base = buf;
	iov.iov_len = len;

	memset(&msg, 0, sizeof(msg));
	msg.msg_name = &sin;
	msg.msg_namelen = sizeof(sin);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;

	/*
	 * We specify the transmission time in the CMSG.
	 */
	if (use_so_txtime) {
		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);

		cmsg = CMSG_FIRSTHDR(&msg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_TXTIME;
		cmsg->cmsg_len = CMSG_LEN(sizeof(__u64));
		*((__u64 *) CMSG_DATA(cmsg)) = txtime;

		cmsg = CMSG_NXTHDR(&msg, cmsg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_CLOCKID;
		cmsg->cmsg_len = CMSG_LEN(sizeof(clockid_t));
		*((clockid_t *) CMSG_DATA(cmsg)) = clkid;

		cmsg = CMSG_NXTHDR(&msg, cmsg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_DROP_IF_LATE;
		cmsg->cmsg_len = CMSG_LEN(sizeof(uint8_t));
		*((uint8_t *) CMSG_DATA(cmsg)) = drop_if_late;
	}
	cnt = sendmsg(fd, &msg, 0);
	if (cnt < 1) {
		pr_err("sendmsg failed: %m");
		return cnt;
	}
	return cnt;
}

static unsigned char tx_buffer[256];
static int marker;

static int run_nanosleep(clockid_t clkid, int fd)
{
	struct timespec ts;
	int cnt, err;
	__u64 txtime;

	clock_gettime(clkid, &ts);

	/* Start one to two seconds in the future. */
	ts.tv_sec += 1;
	ts.tv_nsec = 1000000000 - waketx_delay;
	normalize(&ts);

	txtime = ts.tv_sec * 1000000000ULL + ts.tv_nsec;
	txtime += waketx_delay;

	while (running) {
		err = clock_nanosleep(clkid, TIMER_ABSTIME, &ts, NULL);
		switch (err) {
		case 0:
			cnt = udp_send(fd, tx_buffer, sizeof(tx_buffer), txtime, clkid);
			if (cnt != sizeof(tx_buffer)) {
				pr_err("udp_send failed");
			}
			memset(tx_buffer, marker++, sizeof(tx_buffer));
			ts.tv_nsec += period_nsec;
			normalize(&ts);
			txtime += period_nsec;
			break;
		case EINTR:
			continue;
		default:
			fprintf(stderr, "clock_nanosleep returned %d: %s\n",
				err, strerror(err));
			return err;
		}
	}

	return 0;
}

static int set_realtime(pthread_t thread, int priority, int cpu)
{
	cpu_set_t cpuset;
	struct sched_param sp;
	int err, policy;

	int min = sched_get_priority_min(SCHED_FIFO);
	int max = sched_get_priority_max(SCHED_FIFO);

	fprintf(stderr, "min %d max %d\n", min, max);

	if (priority < 0) {
		return 0;
	}

	err = pthread_getschedparam(thread, &policy, &sp);
	if (err) {
		fprintf(stderr, "pthread_getschedparam: %s\n", strerror(err));
		return -1;
	}

	sp.sched_priority = priority;

	err = pthread_setschedparam(thread, SCHED_FIFO, &sp);
	if (err) {
		fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
		return -1;
	}

	if (cpu < 0) {
		return 0;
	}
	CPU_ZERO(&cpuset);
	CPU_SET(cpu, &cpuset);
	err = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
	if (err) {
		fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
		return -1;
	}

	return 0;
}

static void usage(char *progname)
{
	fprintf(stderr,
		"\n"
		"usage: %s [options]\n"
		"\n"
		" -c [num]   run on CPU 'num'\n"
		" -d [num]   delay from wake up to transmission in nanoseconds (default %d)\n"
		" -h         prints this message and exits\n"
		" -i [name]  use network interface 'name'\n"
		" -p [num]   run with RT priority 'num'\n"
		" -P [num]   period in nanoseconds (default %d)\n"
		" -u         do not use SO_TXTIME\n"
		"\n",
		progname, DEFAULT_DELAY, DEFAULT_PERIOD);
}

int main(int argc, char *argv[])
{
	int c, cpu = -1, err, fd, priority = -1;
	clockid_t clkid = CLOCK_REALTIME;
	char *iface = NULL, *progname;

	/* Process the command line arguments. */
	progname = strrchr(argv[0], '/');
	progname = progname ? 1 + progname : argv[0];
	while (EOF != (c = getopt(argc, argv, "c:d:hi:p:P:u"))) {
		switch (c) {
		case 'c':
			cpu = atoi(optarg);
			break;
		case 'd':
			waketx_delay = atoi(optarg);
			break;
		case 'h':
			usage(progname);
			return 0;
		case 'i':
			iface = optarg;
			break;
		case 'p':
			priority = atoi(optarg);
			break;
		case 'P':
			period_nsec = atoi(optarg);
			break;
		case 'u':
			use_so_txtime = 0;
			break;
		case '?':
			usage(progname);
			return -1;
		}
	}

	if (waketx_delay > 999999999 || waketx_delay < 0) {
		pr_err("Bad wake up to transmission delay.");
		usage(progname);
		return -1;
	}

	if (period_nsec < 1000) {
		pr_err("Bad period.");
		usage(progname);
		return -1;
	}

	if (!iface) {
		pr_err("Need a network interface.");
		usage(progname);
		return -1;
	}

	if (set_realtime(pthread_self(), priority, cpu)) {
		return -1;
	}

	fd = udp_open(iface);
	if (fd < 0) {
		return -1;
	}

	err = run_nanosleep(clkid, fd);

	close(fd);
	return err;
}
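
As a worked check of the start-time arithmetic in run_nanosleep()
above, the sketch below isolates it in a hypothetical helper,
first_txtime(), which is not part of the test program. It assumes the
same bounds on the delay that main() enforces (0..999999999 ns) and
shows that the wake-up time is placed waketx_delay before the txtime.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/*
 * Hypothetical helper mirroring the setup in run_nanosleep(): compute
 * the first wake-up time (*wake) and return the matching txtime in ns.
 */
static uint64_t first_txtime(struct timespec now, long waketx_delay,
			     struct timespec *wake)
{
	/* Same range check that main() applies to the -d option. */
	assert(waketx_delay >= 0 && waketx_delay <= 999999999);

	wake->tv_sec = now.tv_sec + 1;
	wake->tv_nsec = NSEC_PER_SEC - waketx_delay;

	/* Same carry handling as normalize(). */
	while (wake->tv_nsec > 999999999) {
		wake->tv_sec += 1;
		wake->tv_nsec -= NSEC_PER_SEC;
	}

	/* The txtime is the wake-up time plus the configured delay. */
	return (uint64_t)wake->tv_sec * NSEC_PER_SEC + wake->tv_nsec +
	       waketx_delay;
}
```

For example, with now = 100.000000123 s and a delay of 500000 ns the
wake-up time is 101.999500000 s and the txtime is exactly 102 s. For
any delay in range, the first txtime falls on the whole-second
boundary at now.tv_sec + 2.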


* [RFC v3 net-next 01/18] sock: Fix SO_ZEROCOPY switch case
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Fix the SO_ZEROCOPY switch case in sock_setsockopt() so that the ret
value it sets is no longer overwritten by the one set in the default
case.
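
The effect of moving the break can be sketched in user space. The
model below is an assumption-laden illustration, not the kernel code:
the option names, the 'supported' parameter and the use of
-EOPNOTSUPP are hypothetical, and it assumes the block that used to
contain the break is only entered when an earlier check left ret at 0.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical option names; the values are arbitrary. */
enum { OPT_ZEROCOPY = 1, OPT_OTHER = 2 };

/* Old shape: break sits inside the block, so an early error falls through. */
static int setopt_before(int optname, int supported, int val)
{
	int ret = 0;

	switch (optname) {
	case OPT_ZEROCOPY:
		if (!supported)
			ret = -EOPNOTSUPP;
		if (!ret) {
			if (val < 0 || val > 1)
				ret = -EINVAL;
			break;
		}
		/* fall through */
	default:
		ret = -ENOPROTOOPT;
		break;
	}
	return ret;
}

/* New shape: break sits after the block, so the early error is kept. */
static int setopt_after(int optname, int supported, int val)
{
	int ret = 0;

	switch (optname) {
	case OPT_ZEROCOPY:
		if (!supported)
			ret = -EOPNOTSUPP;
		if (!ret) {
			if (val < 0 || val > 1)
				ret = -EINVAL;
		}
		break;

	default:
		ret = -ENOPROTOOPT;
		break;
	}
	return ret;
}

/* Usage: the two shapes differ only when the early check failed. */
static void compare_shapes(void)
{
	assert(setopt_before(OPT_ZEROCOPY, 0, 1) == -ENOPROTOOPT);
	assert(setopt_after(OPT_ZEROCOPY, 0, 1) == -EOPNOTSUPP);
	assert(setopt_before(OPT_ZEROCOPY, 1, 1) ==
	       setopt_after(OPT_ZEROCOPY, 1, 1));
}
```

In this model only the failing-early path changes: the caller sees the
specific error rather than -ENOPROTOOPT.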

Fixes: 28190752c7092 ("sock: permit SO_ZEROCOPY on PF_RDS socket")
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 net/core/sock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 507d8c6c4319..27f218bba43f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1062,8 +1062,9 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 				ret = -EINVAL;
 			else
 				sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
-			break;
 		}
+		break;
+
 	default:
 		ret = -ENOPROTOOPT;
 		break;
-- 
2.16.2

* [RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

This is done in preparation for the upcoming time based transmission
patchset. Because skb->tstamp will now hold a packet's txtime, it must
still be cleared when a packet traverses namespaces. Clearing it in
skb_scrub_packet() would break the feature when tunnels are used, so
clear it in ____dev_forward_skb() instead.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 include/linux/netdevice.h | 1 +
 net/core/skbuff.c         | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dbe6344b727a..7104de2bc957 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3379,6 +3379,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
 
 	skb_scrub_packet(skb, true);
 	skb->priority = 0;
+	skb->tstamp = 0;
 	return 0;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c13495ba6..678fc5416ae1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4865,7 +4865,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
  */
 void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-	skb->tstamp = 0;
 	skb->pkt_type = PACKET_HOST;
 	skb->skb_iif = 0;
 	skb->ignore_df = 0;
-- 
2.16.2

* [RFC v3 net-next 03/18] posix-timers: Add CLOCKID_INVALID mask
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

posix-timers.h states that a clockid_t value is invalid if bits 0, 1 and
2 are all set. Add a mask that can be safely used elsewhere even if this
implicit rule's implementation is changed.

This is done in preparation for the upcoming time based transmission
patchset.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 include/linux/posix-timers.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index c85704fcdbd2..0ba677cc8da6 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -28,6 +28,7 @@ struct cpu_timer_list {
  *
  * A clockid is invalid if bits 2, 1, and 0 are all set.
  */
+#define CLOCKID_INVALID			GENMASK(2, 0)
 #define CPUCLOCK_PID(clock)		((pid_t) ~((clock) >> 3))
 #define CPUCLOCK_PERTHREAD(clock) \
 	(((clock) & (clockid_t) CPUCLOCK_PERTHREAD_MASK) != 0)
-- 
2.16.2

* [RFC v3 net-next 04/18] net: Add a new socket option for a future transmit time.
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar, Richard Cochran,
	Jesus Sanchez-Palencia

From: Richard Cochran <rcochran@linutronix.de>

This patch introduces SO_TXTIME.  User space enables this option in
order to pass a desired future transmit time in a CMSG when calling
sendmsg(2).
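
From user space, the per-packet half of this interface could look
roughly like the sketch below. It follows the interface as proposed in
this RFC only, and msg_set_txtime() is a hypothetical helper. The
fallback value 61 for SCM_TXTIME is the asm-generic number from this
patch and applies only if the installed headers do not define it. The
socket is assumed to have SO_TXTIME enabled already via setsockopt().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SCM_TXTIME
#define SCM_TXTIME 61	/* asm-generic value proposed by this patch */
#endif

/*
 * Hypothetical helper: attach a single SCM_TXTIME control message
 * carrying txtime_ns to msg, using the caller-provided buffer.
 */
static void msg_set_txtime(struct msghdr *msg, void *buf, size_t buflen,
			   uint64_t txtime_ns)
{
	struct cmsghdr *cmsg;

	assert(buflen >= CMSG_SPACE(sizeof(txtime_ns)));
	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(txtime_ns));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_TXTIME;
	/* The patch rejects any length other than CMSG_LEN(sizeof(u64)). */
	cmsg->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	/* memcpy() sidesteps any alignment assumption on the payload. */
	memcpy(CMSG_DATA(cmsg), &txtime_ns, sizeof(txtime_ns));
}
```

The buffer should be suitably aligned for struct cmsghdr, for example
by placing it in a union with one. With this patch, sendmsg() returns
-EINVAL when the socket flag is not set or when the cmsg length is
wrong.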

A new field, transmit_time, is added to struct sockcm_cookie to carry
the value taken from the CMSG. Later patches copy it into skb->tstamp.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 arch/alpha/include/uapi/asm/socket.h   |  3 +++
 arch/frv/include/uapi/asm/socket.h     |  3 +++
 arch/ia64/include/uapi/asm/socket.h    |  3 +++
 arch/m32r/include/uapi/asm/socket.h    |  3 +++
 arch/mips/include/uapi/asm/socket.h    |  3 +++
 arch/mn10300/include/uapi/asm/socket.h |  3 +++
 arch/parisc/include/uapi/asm/socket.h  |  3 +++
 arch/s390/include/uapi/asm/socket.h    |  3 +++
 arch/sparc/include/uapi/asm/socket.h   |  3 +++
 arch/xtensa/include/uapi/asm/socket.h  |  3 +++
 include/net/sock.h                     |  2 ++
 include/uapi/asm-generic/socket.h      |  3 +++
 net/core/sock.c                        | 21 +++++++++++++++++++++
 13 files changed, 56 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index be14f16149d5..065fb372e355 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -112,4 +112,7 @@
 
 #define SO_ZEROCOPY		60
 
+#define SO_TXTIME		61
+#define SCM_TXTIME		SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 9168e78fa32a..0e95f45cd058 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -105,5 +105,8 @@
 
 #define SO_ZEROCOPY		60
 
+#define SO_TXTIME		61
+#define SCM_TXTIME		SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 3efba40adc54..c872c4e6bafb 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -114,4 +114,7 @@
 
 #define SO_ZEROCOPY		60
 
+#define SO_TXTIME		61
+#define SCM_TXTIME		SO_TXTIME
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index cf5018e82c3d..65276c95b8df 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -105,4 +105,7 @@
 
 #define SO_ZEROCOPY		60
 
+#define SO_TXTIME		61
+#define SCM_TXTIME		SO_TXTIME
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 49c3d4795963..71370fb3ceef 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -123,4 +123,7 @@
 
 #define SO_ZEROCOPY		60
 
+#define SO_TXTIME		61
+#define SCM_TXTIME		SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index b35eee132142..d029a40b1b55 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -105,4 +105,7 @@
 
 #define SO_ZEROCOPY		60
 
+#define SO_TXTIME		61
+#define SCM_TXTIME		SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 1d0fdc3b5d22..061b9cf2a779 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -104,4 +104,7 @@
 
 #define SO_ZEROCOPY		0x4035
 
+#define SO_TXTIME		0x4036
+#define SCM_TXTIME		SO_TXTIME
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 3510c0fd06f4..39d901476ee5 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -111,4 +111,7 @@
 
 #define SO_ZEROCOPY		60
 
+#define SO_TXTIME		61
+#define SCM_TXTIME		SO_TXTIME
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index d58520c2e6ff..7ea35e5601b6 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -101,6 +101,9 @@
 
 #define SO_ZEROCOPY		0x003e
 
+#define SO_TXTIME		0x003f
+#define SCM_TXTIME		SO_TXTIME
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 75a07b8119a9..1de07a7f7680 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -116,4 +116,7 @@
 
 #define SO_ZEROCOPY		60
 
+#define SO_TXTIME		61
+#define SCM_TXTIME		SO_TXTIME
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index b9624581d639..16a90a69c9b3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -778,6 +778,7 @@ enum sock_flags {
 	SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
 	SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
 	SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
+	SOCK_TXTIME,
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1568,6 +1569,7 @@ void sock_kzfree_s(struct sock *sk, void *mem, int size);
 void sk_send_sigurg(struct sock *sk);
 
 struct sockcm_cookie {
+	u64 transmit_time;
 	u32 mark;
 	u16 tsflags;
 };
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 0ae758c90e54..a12692e5f7a8 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -107,4 +107,7 @@
 
 #define SO_ZEROCOPY		60
 
+#define SO_TXTIME		61
+#define SCM_TXTIME		SO_TXTIME
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 27f218bba43f..2ba09f311e71 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -91,6 +91,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <asm/unaligned.h>
 #include <linux/capability.h>
 #include <linux/errno.h>
 #include <linux/errqueue.h>
@@ -1065,6 +1066,15 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		}
 		break;
 
+	case SO_TXTIME:
+		if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+			ret = -EPERM;
+		else if (val < 0 || val > 1)
+			ret = -EINVAL;
+		else
+			sock_valbool_flag(sk, SOCK_TXTIME, valbool);
+		break;
+
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -1398,6 +1408,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val = sock_flag(sk, SOCK_ZEROCOPY);
 		break;
 
+	case SO_TXTIME:
+		v.val = sock_flag(sk, SOCK_TXTIME);
+		break;
+
 	default:
 		/* We implement the SO_SNDLOWAT etc to not be settable
 		 * (1003.1g 7).
@@ -2132,6 +2146,13 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 		sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
 		sockc->tsflags |= tsflags;
 		break;
+	case SO_TXTIME:
+		if (!sock_flag(sk, SOCK_TXTIME))
+			return -EINVAL;
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u64)))
+			return -EINVAL;
+		sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg));
+		break;
 	/* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */
 	case SCM_RIGHTS:
 	case SCM_CREDENTIALS:
-- 
2.16.2

* [RFC v3 net-next 05/18] net: ipv4: raw: Hook into time based transmission.
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar, Richard Cochran,
	Jesus Sanchez-Palencia

From: Richard Cochran <rcochran@linutronix.de>

For raw packets, copy the desired future transmit time from the CMSG
cookie into the skb.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 net/ipv4/raw.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 54648d20bf0f..8e05970ba7c4 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
+	skb->tstamp = sockc->transmit_time;
 	skb_dst_set(skb, &rt->dst);
 	*rtp = NULL;
 
@@ -562,6 +563,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	}
 
 	ipc.sockc.tsflags = sk->sk_tsflags;
+	ipc.sockc.transmit_time = 0;
 	ipc.addr = inet->inet_saddr;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
-- 
2.16.2

* [RFC v3 net-next 06/18] net: ipv4: udp: Hook into time based transmission.
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar, Richard Cochran,
	Jesus Sanchez-Palencia

From: Richard Cochran <rcochran@linutronix.de>

For UDP packets, copy the desired future transmit time from the CMSG
cookie into the skb.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 net/ipv4/udp.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3013404d0935..d683bbde526b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -926,6 +926,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	}
 
 	ipc.sockc.tsflags = sk->sk_tsflags;
+	ipc.sockc.transmit_time = 0;
 	ipc.addr = inet->inet_saddr;
 	ipc.oif = sk->sk_bound_dev_if;
 
@@ -1040,8 +1041,10 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 				  sizeof(struct udphdr), &ipc, &rt,
 				  msg->msg_flags);
 		err = PTR_ERR(skb);
-		if (!IS_ERR_OR_NULL(skb))
+		if (!IS_ERR_OR_NULL(skb)) {
+			skb->tstamp = ipc.sockc.transmit_time;
 			err = udp_send_skb(skb, fl4);
+		}
 		goto out;
 	}
 
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [RFC v3 net-next 07/18] net: packet: Hook into time based transmission.
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar, Richard Cochran,
	Jesus Sanchez-Palencia

From: Richard Cochran <rcochran@linutronix.de>

For raw layer-2 packets, copy the desired future transmit time from
the CMSG cookie into the skb.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 net/packet/af_packet.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2c5a6fe5d749..b2115fac2a8d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1976,6 +1976,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
 		goto out_unlock;
 	}
 
+	sockc.transmit_time = 0;
 	sockc.tsflags = sk->sk_tsflags;
 	if (msg->msg_controllen) {
 		err = sock_cmsg_send(sk, msg, &sockc);
@@ -1987,6 +1988,7 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
+	skb->tstamp = sockc.transmit_time;
 
 	sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2484,6 +2486,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	skb->dev = dev;
 	skb->priority = po->sk.sk_priority;
 	skb->mark = po->sk.sk_mark;
+	skb->tstamp = sockc->transmit_time;
 	sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
 	skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2660,6 +2663,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	if (unlikely(!(dev->flags & IFF_UP)))
 		goto out_put;
 
+	sockc.transmit_time = 0;
 	sockc.tsflags = po->sk.sk_tsflags;
 	if (msg->msg_controllen) {
 		err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2856,6 +2860,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	if (unlikely(!(dev->flags & IFF_UP)))
 		goto out_unlock;
 
+	sockc.transmit_time = 0;
 	sockc.tsflags = sk->sk_tsflags;
 	sockc.mark = sk->sk_mark;
 	if (msg->msg_controllen) {
@@ -2928,6 +2933,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
 	skb->mark = sockc.mark;
+	skb->tstamp = sockc.transmit_time;
 
 	if (has_vnet_hdr) {
 		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Extend the SO_TXTIME API with two new per-packet parameters: a clockid_t
and a drop_if_late flag. With this commit the API becomes:

- use SO_TXTIME to enable the feature on a socket;
- pass the per-packet arguments through the cmsg header using:
  * SCM_CLOCKID for the clockid to be used as the txtime clock source;
  * SCM_TXTIME for the txtime timestamp;
  * SCM_DROP_IF_LATE for the drop flag. Traffic control will use this
    flag to decide whether a late packet should be dropped.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 arch/alpha/include/uapi/asm/socket.h   |  2 ++
 arch/frv/include/uapi/asm/socket.h     |  2 ++
 arch/ia64/include/uapi/asm/socket.h    |  2 ++
 arch/m32r/include/uapi/asm/socket.h    |  2 ++
 arch/mips/include/uapi/asm/socket.h    |  2 ++
 arch/mn10300/include/uapi/asm/socket.h |  2 ++
 arch/parisc/include/uapi/asm/socket.h  |  2 ++
 arch/s390/include/uapi/asm/socket.h    |  2 ++
 arch/sparc/include/uapi/asm/socket.h   |  2 ++
 arch/xtensa/include/uapi/asm/socket.h  |  2 ++
 include/linux/skbuff.h                 |  3 +++
 include/net/sock.h                     |  2 ++
 include/uapi/asm-generic/socket.h      |  2 ++
 net/core/sock.c                        | 22 +++++++++++++++++++++-
 14 files changed, 48 insertions(+), 1 deletion(-)
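
A note for readers, not part of the patch: the checks that the
net/core/sock.c hunk adds to __sock_cmsg_send() can be modelled in plain
userspace C, which makes it easy to see which inputs are rejected.
Everything here is a stand-in: struct cookie_model, the MODEL_* types
and model_cmsg_send() are hypothetical names, `len` plays the role of
the payload size that the kernel compares via CMSG_LEN(), and int32_t
stands in for clockid_t.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the sockcm_cookie fields this patch touches. */
struct cookie_model {
	uint64_t transmit_time;
	int32_t  clockid;
	uint8_t  drop_if_late;
};

enum { MODEL_TXTIME, MODEL_DROP_IF_LATE, MODEL_CLOCKID };

static int model_cmsg_send(int txtime_enabled, int type,
			   const void *data, size_t len,
			   struct cookie_model *c)
{
	uint8_t drop;

	assert(data != NULL && c != NULL);

	if (!txtime_enabled)	/* models !sock_flag(sk, SOCK_TXTIME) */
		return -EINVAL;

	switch (type) {
	case MODEL_TXTIME:
		if (len != sizeof(c->transmit_time))
			return -EINVAL;
		memcpy(&c->transmit_time, data, len);
		return 0;
	case MODEL_DROP_IF_LATE:
		if (len != sizeof(drop))
			return -EINVAL;
		memcpy(&drop, data, len);
		if (drop > 1)	/* unsigned: only the upper bound can fail */
			return -EINVAL;
		c->drop_if_late = drop;
		return 0;
	case MODEL_CLOCKID:
		if (len != sizeof(c->clockid))
			return -EINVAL;
		memcpy(&c->clockid, data, len);
		return 0;
	default:
		return -EINVAL;
	}
}
```

The model makes two properties visible: every txtime-related message is
refused unless the socket opted in, and a payload of the wrong size is
refused before any byte of it is interpreted.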

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 065fb372e355..3399dfefa579 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -114,5 +114,7 @@
 
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	62
+#define SCM_CLOCKID		63
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 0e95f45cd058..43b636836722 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -107,6 +107,8 @@
 
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	62
+#define SCM_CLOCKID		63
 
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index c872c4e6bafb..1f06d07aadbe 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -116,5 +116,7 @@
 
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	62
+#define SCM_CLOCKID		63
 
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index 65276c95b8df..69ab380d8d48 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -107,5 +107,7 @@
 
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	62
+#define SCM_CLOCKID		63
 
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 71370fb3ceef..97da79f58538 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -125,5 +125,7 @@
 
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	62
+#define SCM_CLOCKID		63
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index d029a40b1b55..7c7a174fdfae 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -107,5 +107,7 @@
 
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	62
+#define SCM_CLOCKID		63
 
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 061b9cf2a779..7fe86b5cd593 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -106,5 +106,7 @@
 
 #define SO_TXTIME		0x4036
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	0x4037
+#define SCM_CLOCKID		0x4038
 
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 39d901476ee5..97f90c4a9b8c 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -113,5 +113,7 @@
 
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	62
+#define SCM_CLOCKID		63
 
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 7ea35e5601b6..6397c366dd2d 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -103,6 +103,8 @@
 
 #define SO_TXTIME		0x003f
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	0x0040
+#define SCM_CLOCKID		0x0041
 
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 1de07a7f7680..bc81b02a1f5f 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -118,5 +118,7 @@
 
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	62
+#define SCM_CLOCKID		63
 
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d8340e6e8814..951969ceaf65 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -788,6 +788,9 @@ struct sk_buff {
 	__u8			tc_redirected:1;
 	__u8			tc_from_ingress:1;
 #endif
+	__u8			tc_drop_if_late:1;
+
+	clockid_t		txtime_clockid;
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
diff --git a/include/net/sock.h b/include/net/sock.h
index 16a90a69c9b3..50e36e0f62f6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1571,7 +1571,9 @@ void sk_send_sigurg(struct sock *sk);
 struct sockcm_cookie {
 	u64 transmit_time;
 	u32 mark;
+	clockid_t clockid;
 	u16 tsflags;
+	u8 drop_if_late;
 };
 
 int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index a12692e5f7a8..c9e1ea0097e1 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -109,5 +109,7 @@
 
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
+#define SCM_DROP_IF_LATE	62
+#define SCM_CLOCKID		63
 
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 2ba09f311e71..51cfade342ec 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2126,6 +2126,7 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 		     struct sockcm_cookie *sockc)
 {
 	u32 tsflags;
+	u8 drop;
 
 	switch (cmsg->cmsg_type) {
 	case SO_MARK:
@@ -2146,13 +2147,32 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg,
 		sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK;
 		sockc->tsflags |= tsflags;
 		break;
-	case SO_TXTIME:
+	case SCM_TXTIME:
 		if (!sock_flag(sk, SOCK_TXTIME))
 			return -EINVAL;
 		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u64)))
 			return -EINVAL;
 		sockc->transmit_time = get_unaligned((u64 *)CMSG_DATA(cmsg));
 		break;
+	case SCM_DROP_IF_LATE:
+		if (!sock_flag(sk, SOCK_TXTIME))
+			return -EINVAL;
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(u8)))
+			return -EINVAL;
+
+		drop = get_unaligned((u8 *)CMSG_DATA(cmsg));
+		if (drop > 1)
+			return -EINVAL;
+
+		sockc->drop_if_late = drop;
+		break;
+	case SCM_CLOCKID:
+		if (!sock_flag(sk, SOCK_TXTIME))
+			return -EINVAL;
+		if (cmsg->cmsg_len != CMSG_LEN(sizeof(clockid_t)))
+			return -EINVAL;
+		sockc->clockid = get_unaligned((clockid_t *)CMSG_DATA(cmsg));
+		break;
 	/* SCM_RIGHTS and SCM_CREDENTIALS are semantically in SOL_UNIX. */
 	case SCM_RIGHTS:
 	case SCM_CREDENTIALS:
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [RFC v3 net-next 09/18] net: ipv4: raw: Handle remaining txtime parameters
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Initialize clockid to CLOCKID_INVALID rather than 0, because 0 is a
valid clockid (CLOCK_REALTIME), and copy both drop_if_late and clockid
from the CMSG cookie into the skb.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 net/ipv4/raw.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8e05970ba7c4..61b6acccc72b 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -79,6 +79,7 @@
 #include <linux/netfilter_ipv4.h>
 #include <linux/compat.h>
 #include <linux/uio.h>
+#include <linux/posix-timers.h>
 
 struct raw_frag_vec {
 	struct msghdr *msg;
@@ -382,6 +383,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
 	skb->tstamp = sockc->transmit_time;
+	skb->txtime_clockid = sockc->clockid;
+	skb->tc_drop_if_late = sockc->drop_if_late;
 	skb_dst_set(skb, &rt->dst);
 	*rtp = NULL;
 
@@ -564,6 +567,8 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	ipc.sockc.tsflags = sk->sk_tsflags;
 	ipc.sockc.transmit_time = 0;
+	ipc.sockc.drop_if_late = 0;
+	ipc.sockc.clockid = CLOCKID_INVALID;
 	ipc.addr = inet->inet_saddr;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [RFC v3 net-next 10/18] net: ipv4: udp: Handle remaining txtime parameters
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Initialize clockid to CLOCKID_INVALID rather than 0, because 0 is a
valid clockid (CLOCK_REALTIME), and copy both drop_if_late and clockid
from the CMSG cookie into the skb.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 net/ipv4/udp.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d683bbde526b..4bea8d5ab968 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -115,6 +115,7 @@
 #include "udp_impl.h"
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
+#include <linux/posix-timers.h>
 
 struct udp_table udp_table __read_mostly;
 EXPORT_SYMBOL(udp_table);
@@ -927,6 +928,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	ipc.sockc.tsflags = sk->sk_tsflags;
 	ipc.sockc.transmit_time = 0;
+	ipc.sockc.drop_if_late = 0;
+	ipc.sockc.clockid = CLOCKID_INVALID;
 	ipc.addr = inet->inet_saddr;
 	ipc.oif = sk->sk_bound_dev_if;
 
@@ -1043,6 +1046,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		err = PTR_ERR(skb);
 		if (!IS_ERR_OR_NULL(skb)) {
 			skb->tstamp = ipc.sockc.transmit_time;
+			skb->txtime_clockid = ipc.sockc.clockid;
+			skb->tc_drop_if_late = ipc.sockc.drop_if_late;
 			err = udp_send_skb(skb, fl4);
 		}
 		goto out;
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [Intel-wired-lan] [RFC v3 net-next 10/18] net: ipv4: udp: Handle remaining txtime parameters
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  0 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: intel-wired-lan

Initialize clockid to CLOCKID_INVALID instead of 0 (i.e.
CLOCK_REALTIME), and copy both drop_if_late and clockid from CMSG cookie
into skb.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 net/ipv4/udp.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d683bbde526b..4bea8d5ab968 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -115,6 +115,7 @@
 #include "udp_impl.h"
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
+#include <linux/posix-timers.h>
 
 struct udp_table udp_table __read_mostly;
 EXPORT_SYMBOL(udp_table);
@@ -927,6 +928,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 	ipc.sockc.tsflags = sk->sk_tsflags;
 	ipc.sockc.transmit_time = 0;
+	ipc.sockc.drop_if_late = 0;
+	ipc.sockc.clockid = CLOCKID_INVALID;
 	ipc.addr = inet->inet_saddr;
 	ipc.oif = sk->sk_bound_dev_if;
 
@@ -1043,6 +1046,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		err = PTR_ERR(skb);
 		if (!IS_ERR_OR_NULL(skb)) {
 			skb->tstamp = ipc.sockc.transmit_time;
+			skb->txtime_clockid = ipc.sockc.clockid;
+			skb->tc_drop_if_late = ipc.sockc.drop_if_late;
 			err = udp_send_skb(skb, fl4);
 		}
 		goto out;
-- 
2.16.2



* [RFC v3 net-next 11/18] net: packet: Handle remaining txtime parameters
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Initialize clockid to CLOCKID_INVALID instead of 0, which would be read
as CLOCK_REALTIME, and copy both drop_if_late and clockid from the CMSG
cookie into the skb.
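
As a rough illustration of why the cookie is no longer left at 0: on
Linux, CLOCK_REALTIME has the value 0, so a zeroed clockid cannot be told
apart from an explicit request for CLOCK_REALTIME. The sketch below is a
userspace model only. struct example_cookie, EXAMPLE_CLOCKID_INVALID and
its value of -1 are hypothetical stand-ins, not the kernel's cookie
structure or its CLOCKID_INVALID definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the Linux value of CLOCK_REALTIME (0). */
#define EXAMPLE_CLOCK_REALTIME	0
/* Hypothetical sentinel; the kernel's CLOCKID_INVALID may differ. */
#define EXAMPLE_CLOCKID_INVALID	(-1)

/* Hypothetical stand-in for the per-sendmsg() cookie. */
struct example_cookie {
	int64_t transmit_time;
	int clockid;
	bool drop_if_late;
};

/* Defaults applied before any cmsg is parsed. */
static void example_cookie_init(struct example_cookie *c)
{
	c->transmit_time = 0;
	c->drop_if_late = false;
	c->clockid = EXAMPLE_CLOCKID_INVALID;
}

/* True only if a clockid was explicitly supplied after init. */
static bool example_cookie_has_clockid(const struct example_cookie *c)
{
	return c->clockid != EXAMPLE_CLOCKID_INVALID;
}
```

With a sentinel, a later consumer can reject or ignore packets that never
carried a clockid instead of silently treating them as CLOCK_REALTIME.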

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 net/packet/af_packet.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b2115fac2a8d..e455fbf5a356 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -94,6 +94,7 @@
 #endif
 #include <linux/bpf.h>
 #include <net/compat.h>
+#include <linux/posix-timers.h>
 
 #include "internal.h"
 
@@ -1977,6 +1978,8 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
 	}
 
 	sockc.transmit_time = 0;
+	sockc.drop_if_late = 0;
+	sockc.clockid = CLOCKID_INVALID;
 	sockc.tsflags = sk->sk_tsflags;
 	if (msg->msg_controllen) {
 		err = sock_cmsg_send(sk, msg, &sockc);
@@ -1989,6 +1992,8 @@ static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
 	skb->tstamp = sockc.transmit_time;
+	skb->tc_drop_if_late = sockc.drop_if_late;
+	skb->txtime_clockid = sockc.clockid;
 
 	sock_tx_timestamp(sk, sockc.tsflags, &skb_shinfo(skb)->tx_flags);
 
@@ -2487,6 +2492,8 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	skb->priority = po->sk.sk_priority;
 	skb->mark = po->sk.sk_mark;
 	skb->tstamp = sockc->transmit_time;
+	skb->tc_drop_if_late = sockc->drop_if_late;
+	skb->txtime_clockid = sockc->clockid;
 	sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
 	skb_shinfo(skb)->destructor_arg = ph.raw;
 
@@ -2664,6 +2671,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		goto out_put;
 
 	sockc.transmit_time = 0;
+	sockc.drop_if_late = 0;
+	sockc.clockid = CLOCKID_INVALID;
 	sockc.tsflags = po->sk.sk_tsflags;
 	if (msg->msg_controllen) {
 		err = sock_cmsg_send(&po->sk, msg, &sockc);
@@ -2861,6 +2870,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 		goto out_unlock;
 
 	sockc.transmit_time = 0;
+	sockc.drop_if_late = 0;
+	sockc.clockid = CLOCKID_INVALID;
 	sockc.tsflags = sk->sk_tsflags;
 	sockc.mark = sk->sk_mark;
 	if (msg->msg_controllen) {
@@ -2934,6 +2945,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	skb->priority = sk->sk_priority;
 	skb->mark = sockc.mark;
 	skb->tstamp = sockc.transmit_time;
+	skb->tc_drop_if_late = sockc.drop_if_late;
+	skb->txtime_clockid = sockc.clockid;
 
 	if (has_vnet_hdr) {
 		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
-- 
2.16.2


* [RFC v3 net-next 12/18] net/sched: Allow creating a Qdisc watchdog with other clocks
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

This adds 'qdisc_watchdog_init_clockid()', which takes a clockid so that
time references other than CLOCK_MONOTONIC can be used when scheduling
the Qdisc to run. 'qdisc_watchdog_init()' keeps using CLOCK_MONOTONIC.
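
The pattern used here, keeping the old entry point as a thin wrapper that
passes the previous default, can be sketched in plain C. This is a
userspace model: struct example_watchdog and the example_* functions are
hypothetical, and only the clock numbering (CLOCK_MONOTONIC is 1 and
CLOCK_TAI is 11 on Linux) comes from the real ABI.

```c
/* Linux clock ids used below (values from the Linux UAPI). */
#define EXAMPLE_CLOCK_MONOTONIC	1
#define EXAMPLE_CLOCK_TAI	11

/* Hypothetical stand-in for struct qdisc_watchdog. */
struct example_watchdog {
	int clockid;	/* time base the timer is armed against */
	void *owner;	/* stand-in for the owning Qdisc */
};

/* New entry point: the caller picks the time base. */
static void example_watchdog_init_clockid(struct example_watchdog *wd,
					  void *owner, int clockid)
{
	wd->clockid = clockid;
	wd->owner = owner;
}

/* Old entry point: unchanged behavior for existing callers. */
static void example_watchdog_init(struct example_watchdog *wd, void *owner)
{
	example_watchdog_init_clockid(wd, owner, EXAMPLE_CLOCK_MONOTONIC);
}
```

Existing callers need no change, while a time-aware Qdisc can pick the
clock its packets are stamped against.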

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 include/net/pkt_sched.h |  2 ++
 net/sched/sch_api.c     | 11 +++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 815b92a23936..2466ea143d01 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -72,6 +72,8 @@ struct qdisc_watchdog {
 	struct Qdisc	*qdisc;
 };
 
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+				 clockid_t clockid);
 void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires);
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 68f9d942bed4..beb1dc296bfb 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -596,12 +596,19 @@ static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc *qdisc,
+				 clockid_t clockid)
 {
-	hrtimer_init(&wd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(&wd->timer, clockid, HRTIMER_MODE_ABS_PINNED);
 	wd->timer.function = qdisc_watchdog;
 	wd->qdisc = qdisc;
 }
+EXPORT_SYMBOL(qdisc_watchdog_init_clockid);
+
+void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc)
+{
+	qdisc_watchdog_init_clockid(wd, qdisc, CLOCK_MONOTONIC);
+}
 EXPORT_SYMBOL(qdisc_watchdog_init);
 
 void qdisc_watchdog_schedule_ns(struct qdisc_watchdog *wd, u64 expires)
-- 
2.16.2


* [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

TBS (Time Based Scheduler) uses the information added earlier in this
series (the socket option SO_TXTIME and the new role of
sk_buff->tstamp) to schedule traffic transmission based on absolute
time.

For some workloads, bandwidth enforcement alone is not enough, and
precise control of when packets are transmitted is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
           map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs delta 100000 \
           clockid CLOCK_REALTIME sorting

In this example, the Qdisc provides SW best-effort control of the
transmission time to the network adapter, the timestamps set through the
socket are in reference to the clockid CLOCK_REALTIME, and packets leave
the Qdisc "delta" (100000) nanoseconds before their transmission time. It
also enables sorting of the buffered packets based on their txtime.
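
To make the roles of "delta" and "sorting" concrete, here is a small
userspace sketch. It orders buffered packets by txtime and computes the
instant at which a packet would be released (txtime - delta). The struct
and function names are hypothetical, and a sorted array is used only for
brevity; the Qdisc in this patch keeps packets in an rbtree.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffered packet. */
struct example_pkt {
	int id;
	int64_t txtime_ns;
};

/* qsort() comparator: earliest txtime first. */
static int example_cmp_txtime(const void *a, const void *b)
{
	const struct example_pkt *pa = a, *pb = b;

	return (pa->txtime_ns > pb->txtime_ns) -
	       (pa->txtime_ns < pb->txtime_ns);
}

/* Sort the buffer so that the head is the next packet to transmit. */
static void example_sort_by_txtime(struct example_pkt *pkts, size_t n)
{
	qsort(pkts, n, sizeof(*pkts), example_cmp_txtime);
}

/* Instant at which a packet should leave the queue: txtime - delta. */
static int64_t example_release_time(const struct example_pkt *p,
				    int32_t delta_ns)
{
	return p->txtime_ns - delta_ns;
}
```

With delta set to 100000 as in the tc command above, a packet stamped for
1 ms would be handed to the driver at 0.9 ms.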

The qdisc drops packets on enqueue() if their skbuff clockid does not
match the clock reference of the Qdisc. Moreover, the tc_drop_if_late
flag from skbuffs is used on dequeue() to determine whether a packet
that has expired while queued should be dropped.
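
The enqueue() and dequeue() rules can be modelled in a few lines of
userspace C. This is a sketch of the comparisons made in sch_tbs.c below,
not the kernel code itself: all names are hypothetical, times are plain
nanosecond integers, and the SOCK_TXTIME socket flag check is left out.

```c
#include <stdbool.h>
#include <stdint.h>

enum example_action { EXAMPLE_HOLD, EXAMPLE_SEND, EXAMPLE_DROP };

/* enqueue(): accept only packets on the qdisc's clock whose txtime is
 * neither in the past nor earlier than the last transmitted txtime.
 */
static bool example_enqueue_ok(int pkt_clockid, int qdisc_clockid,
			       int64_t txtime_ns, int64_t now_ns,
			       int64_t last_ns)
{
	if (pkt_clockid != qdisc_clockid)
		return false;
	if (txtime_ns < now_ns || txtime_ns < last_ns)
		return false;
	return true;
}

/* dequeue(): decide what to do with the packet at the head of the queue. */
static enum example_action example_dequeue(int64_t now_ns, int64_t txtime_ns,
					   int32_t delta_ns, bool drop_if_late)
{
	/* Expired while queued and the sender asked for a drop. */
	if (drop_if_late && txtime_ns < now_ns)
		return EXAMPLE_DROP;

	/* Released once now is strictly after (txtime - delta). */
	if (now_ns > txtime_ns - delta_ns)
		return EXAMPLE_SEND;

	return EXAMPLE_HOLD;
}
```

Note that in this model a late packet without drop_if_late is still sent,
which matches the dequeue() comparisons in the patch.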

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 include/linux/netdevice.h      |   1 +
 include/uapi/linux/pkt_sched.h |  17 ++
 net/sched/Kconfig              |  11 +
 net/sched/Makefile             |   1 +
 net/sched/sch_tbs.c            | 474 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 504 insertions(+)
 create mode 100644 net/sched/sch_tbs.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7104de2bc957..09b5b2e08f04 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -781,6 +781,7 @@ enum tc_setup_type {
 	TC_SETUP_QDISC_CBS,
 	TC_SETUP_QDISC_RED,
 	TC_SETUP_QDISC_PRIO,
+	TC_SETUP_QDISC_TBS,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..a33b5b9da81a 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,21 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* TBS */
+struct tc_tbs_qopt {
+	__s32 delta;
+	__s32 clockid;
+	__u32 flags;
+#define TC_TBS_SORTING_ON BIT(0)
+};
+
+enum {
+	TCA_TBS_UNSPEC,
+	TCA_TBS_PARMS,
+	__TCA_TBS_MAX,
+};
+
+#define TCA_TBS_MAX (__TCA_TBS_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169fb5325..9e68fef78d50 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -183,6 +183,17 @@ config NET_SCH_CBS
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_cbs.
 
+config NET_SCH_TBS
+	tristate "Time Based Scheduler (TBS)"
+	---help---
+	  Say Y here if you want to use the Time Based Scheduler (TBS) packet
+	  scheduling algorithm.
+
+	  See the top of <file:net/sched/sch_tbs.c> for more details.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called sch_tbs.
+
 config NET_SCH_GRED
 	tristate "Generic Random Early Detection (GRED)"
 	---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d3804878..f02378a0a8f2 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_NET_SCH_FQ)	+= sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)	+= sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
+obj-$(CONFIG_NET_SCH_TBS)	+= sch_tbs.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c
new file mode 100644
index 000000000000..c19eedda9bc5
--- /dev/null
+++ b/net/sched/sch_tbs.c
@@ -0,0 +1,474 @@
+/*
+ * net/sched/sch_tbs.c	Time Based Scheduler
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
+ *		Vinicius Costa Gomes <vinicius.gomes@intel.com>
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/rbtree.h>
+#include <linux/skbuff.h>
+#include <linux/posix-timers.h>
+#include <net/netlink.h>
+#include <net/sch_generic.h>
+#include <net/pkt_sched.h>
+#include <net/sock.h>
+
+#define SORTING_IS_ON(x) (x->flags & TC_TBS_SORTING_ON)
+
+struct tbs_sched_data {
+	bool sorting;
+	int clockid;
+	int queue;
+	s32 delta; /* in ns */
+	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
+	struct rb_root head;
+	struct qdisc_watchdog watchdog;
+	struct Qdisc *qdisc;
+	int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free);
+	struct sk_buff *(*dequeue)(struct Qdisc *sch);
+	struct sk_buff *(*peek)(struct Qdisc *sch);
+};
+
+static const struct nla_policy tbs_policy[TCA_TBS_MAX + 1] = {
+	[TCA_TBS_PARMS]	= { .len = sizeof(struct tc_tbs_qopt) },
+};
+
+typedef ktime_t (*get_time_func_t)(void);
+
+static const get_time_func_t clockid_to_get_time[MAX_CLOCKS] = {
+	[CLOCK_MONOTONIC] = ktime_get,
+	[CLOCK_REALTIME] = ktime_get_real,
+	[CLOCK_BOOTTIME] = ktime_get_boottime,
+	[CLOCK_TAI] = ktime_get_clocktai,
+};
+
+static ktime_t get_time_by_clockid(clockid_t clockid)
+{
+	get_time_func_t func = clockid_to_get_time[clockid];
+
+	if (!func)
+		return 0;
+
+	return func();
+}
+
+static inline int validate_input_params(struct tc_tbs_qopt *qopt,
+					struct netlink_ext_ack *extack)
+{
+	/* Check if params comply to the following rules:
+	 *	* If SW best-effort, then clockid and delta must be valid
+	 *	  regardless of sorting enabled or not.
+	 *
+	 *	* Dynamic clockids are not supported.
+	 *	* Delta must be a positive integer.
+	 */
+	if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
+	    qopt->clockid >= MAX_CLOCKS) {
+		NL_SET_ERR_MSG(extack, "Invalid clockid");
+		return -EINVAL;
+	} else if (qopt->clockid < 0 ||
+		   !clockid_to_get_time[qopt->clockid]) {
+		NL_SET_ERR_MSG(extack, "Clockid is not supported");
+		return -ENOTSUPP;
+	}
+
+	if (qopt->delta < 0) {
+		NL_SET_ERR_MSG(extack, "Delta must be positive");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	ktime_t txtime = nskb->tstamp;
+	struct sock *sk = nskb->sk;
+	ktime_t now;
+
+	if (sk && !sock_flag(sk, SOCK_TXTIME))
+		return false;
+
+	/* We don't perform crosstimestamping.
+	 * Drop if packet's clockid differs from qdisc's.
+	 */
+	if (nskb->txtime_clockid != q->clockid)
+		return false;
+
+	now = get_time_by_clockid(q->clockid);
+	if (ktime_before(txtime, now) || ktime_before(txtime, q->last))
+		return false;
+
+	return true;
+}
+
+static struct sk_buff *tbs_peek(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	return q->peek(sch);
+}
+
+static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct rb_node *p;
+
+	p = rb_first(&q->head);
+	if (!p)
+		return NULL;
+
+	return rb_to_skb(p);
+}
+
+static void reset_watchdog(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb = tbs_peek(sch);
+	ktime_t next;
+
+	if (!skb)
+		return;
+
+	next = ktime_sub_ns(skb->tstamp, q->delta);
+	qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
+}
+
+static int tbs_enqueue(struct sk_buff *nskb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	return q->enqueue(nskb, sch, to_free);
+}
+
+static int tbs_enqueue_scheduledfifo(struct sk_buff *nskb, struct Qdisc *sch,
+				     struct sk_buff **to_free)
+{
+	int err;
+
+	if (!is_packet_valid(sch, nskb))
+		return qdisc_drop(nskb, sch, to_free);
+
+	err = qdisc_enqueue_tail(nskb, sch);
+
+	/* If there is only 1 packet, then we must reset the watchdog. */
+	if (err >= 0 && sch->q.qlen == 1)
+		reset_watchdog(sch);
+
+	return err;
+}
+
+static int tbs_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc *sch,
+				      struct sk_buff **to_free)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct rb_node **p = &q->head.rb_node, *parent = NULL;
+	ktime_t txtime = nskb->tstamp;
+
+	if (!is_packet_valid(sch, nskb))
+		return qdisc_drop(nskb, sch, to_free);
+
+	while (*p) {
+		struct sk_buff *skb;
+
+		parent = *p;
+		skb = rb_to_skb(parent);
+		if (ktime_after(txtime, skb->tstamp))
+			p = &parent->rb_right;
+		else
+			p = &parent->rb_left;
+	}
+	rb_link_node(&nskb->rbnode, parent, p);
+	rb_insert_color(&nskb->rbnode, &q->head);
+
+	qdisc_qstats_backlog_inc(sch, nskb);
+	sch->q.qlen++;
+
+	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
+	reset_watchdog(sch);
+
+	return NET_XMIT_SUCCESS;
+}
+
+static void timesortedlist_erase(struct Qdisc *sch, struct sk_buff *skb,
+				 bool drop)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	rb_erase(&skb->rbnode, &q->head);
+
+	qdisc_qstats_backlog_dec(sch, skb);
+
+	if (drop) {
+		struct sk_buff *to_free = NULL;
+
+		qdisc_drop(skb, sch, &to_free);
+		kfree_skb_list(to_free);
+		qdisc_qstats_overlimit(sch);
+	} else {
+		qdisc_bstats_update(sch, skb);
+
+		q->last = skb->tstamp;
+	}
+
+	sch->q.qlen--;
+
+	/* The rbnode field in the skb re-uses these fields, now that
+	 * we are done with the rbnode, reset them.
+	 */
+	skb->next = NULL;
+	skb->prev = NULL;
+	skb->dev = qdisc_dev(sch);
+}
+
+static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	return q->dequeue(sch);
+}
+
+static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb = tbs_peek(sch);
+	ktime_t now, next;
+
+	if (!skb)
+		return NULL;
+
+	now = get_time_by_clockid(q->clockid);
+
+	/* Drop if packet has expired while in queue and the drop_if_late
+	 * flag is set.
+	 */
+	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
+		struct sk_buff *to_free = NULL;
+
+		qdisc_queue_drop_head(sch, &to_free);
+		kfree_skb_list(to_free);
+		qdisc_qstats_overlimit(sch);
+
+		skb = NULL;
+		goto out;
+	}
+
+	next = ktime_sub_ns(skb->tstamp, q->delta);
+
+	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
+	if (ktime_after(now, next))
+		skb = qdisc_dequeue_head(sch);
+	else
+		skb = NULL;
+
+out:
+	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
+	reset_watchdog(sch);
+
+	return skb;
+}
+
+static struct sk_buff *tbs_dequeue_timesortedlist(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	ktime_t now, next;
+
+	skb = tbs_peek(sch);
+	if (!skb)
+		return NULL;
+
+	now = get_time_by_clockid(q->clockid);
+
+	/* Drop if packet has expired while in queue and the drop_if_late
+	 * flag is set.
+	 */
+	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
+		timesortedlist_erase(sch, skb, true);
+		skb = NULL;
+		goto out;
+	}
+
+	next = ktime_sub_ns(skb->tstamp, q->delta);
+
+	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
+	if (ktime_after(now, next))
+		timesortedlist_erase(sch, skb, false);
+	else
+		skb = NULL;
+
+out:
+	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
+	reset_watchdog(sch);
+
+	return skb;
+}
+
+static inline void setup_queueing_mode(struct tbs_sched_data *q)
+{
+	if (q->sorting) {
+		q->enqueue = tbs_enqueue_timesortedlist;
+		q->dequeue = tbs_dequeue_timesortedlist;
+		q->peek = tbs_peek_timesortedlist;
+	} else {
+		q->enqueue = tbs_enqueue_scheduledfifo;
+		q->dequeue = tbs_dequeue_scheduledfifo;
+		q->peek = qdisc_peek_head;
+	}
+}
+
+static int tbs_init(struct Qdisc *sch, struct nlattr *opt,
+		    struct netlink_ext_ack *extack)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct nlattr *tb[TCA_TBS_MAX + 1];
+	struct tc_tbs_qopt *qopt;
+	int err;
+
+	if (!opt) {
+		NL_SET_ERR_MSG(extack, "Missing TBS qdisc options which are mandatory");
+		return -EINVAL;
+	}
+
+	err = nla_parse_nested(tb, TCA_TBS_MAX, opt, tbs_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_TBS_PARMS]) {
+		NL_SET_ERR_MSG(extack, "Missing mandatory TBS parameters");
+		return -EINVAL;
+	}
+
+	qopt = nla_data(tb[TCA_TBS_PARMS]);
+
+	pr_debug("delta %d clockid %d sorting %s\n",
+		 qopt->delta, qopt->clockid,
+		 SORTING_IS_ON(qopt) ? "on" : "off");
+
+	err = validate_input_params(qopt, extack);
+	if (err < 0)
+		return err;
+
+	q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0);
+
+	/* Everything went OK, save the parameters used. */
+	q->delta = qopt->delta;
+	q->clockid = qopt->clockid;
+	q->sorting = SORTING_IS_ON(qopt);
+
+	/* Select queueing mode based on parameters. */
+	setup_queueing_mode(q);
+
+	qdisc_watchdog_init_clockid(&q->watchdog, sch, q->clockid);
+
+	return 0;
+}
+
+static void timesortedlist_clear(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct rb_node *p = rb_first(&q->head);
+
+	while (p) {
+		struct sk_buff *skb = rb_to_skb(p);
+
+		p = rb_next(p);
+
+		rb_erase(&skb->rbnode, &q->head);
+		rtnl_kfree_skbs(skb, skb);
+		sch->q.qlen--;
+	}
+}
+
+static void tbs_reset(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	/* Only cancel watchdog if it's been initialized. */
+	if (q->watchdog.qdisc == sch)
+		qdisc_watchdog_cancel(&q->watchdog);
+
+	/* No matter which mode we are on, it's safe to clear both lists. */
+	timesortedlist_clear(sch);
+	__qdisc_reset_queue(&sch->q);
+
+	sch->qstats.backlog = 0;
+	sch->q.qlen = 0;
+
+	q->last = 0;
+}
+
+static void tbs_destroy(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	/* Only cancel watchdog if it's been initialized. */
+	if (q->watchdog.qdisc == sch)
+		qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static int tbs_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct tc_tbs_qopt opt = { };
+	struct nlattr *nest;
+
+	nest = nla_nest_start(skb, TCA_OPTIONS);
+	if (!nest)
+		goto nla_put_failure;
+
+	opt.delta = q->delta;
+	opt.clockid = q->clockid;
+	if (q->sorting)
+		opt.flags |= TC_TBS_SORTING_ON;
+
+	if (nla_put(skb, TCA_TBS_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, nest);
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -1;
+}
+
+static struct Qdisc_ops tbs_qdisc_ops __read_mostly = {
+	.id		=	"tbs",
+	.priv_size	=	sizeof(struct tbs_sched_data),
+	.enqueue	=	tbs_enqueue,
+	.dequeue	=	tbs_dequeue,
+	.peek		=	tbs_peek,
+	.init		=	tbs_init,
+	.reset		=	tbs_reset,
+	.destroy	=	tbs_destroy,
+	.dump		=	tbs_dump,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init tbs_module_init(void)
+{
+	return register_qdisc(&tbs_qdisc_ops);
+}
+
+static void __exit tbs_module_exit(void)
+{
+	unregister_qdisc(&tbs_qdisc_ops);
+}
+module_init(tbs_module_init)
+module_exit(tbs_module_exit)
+MODULE_LICENSE("GPL");
-- 
2.16.2


* [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  0 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: intel-wired-lan

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

TBS (Time Based Scheduler) uses the information added earlier in this
series (the socket option SO_TXTIME and the new role of
sk_buff->tstamp) to schedule traffic transmission based on absolute
time.

For some workloads, bandwidth enforcement alone is not enough, and
precise control of when packets are transmitted is necessary.

Example:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
           map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs delta 100000 \
           clockid CLOCK_REALTIME sorting

In this example, the Qdisc provides SW best-effort control of the
transmission time to the network adapter, the timestamps set through the
socket are in reference to the clockid CLOCK_REALTIME, and packets leave
the Qdisc "delta" (100000) nanoseconds before their transmission time. It
also enables sorting of the buffered packets based on their txtime.

The qdisc drops packets on enqueue() if their skbuff clockid does not
match the clock reference of the Qdisc. Moreover, the tc_drop_if_late
flag from skbuffs is used on dequeue() to determine whether a packet
that has expired while queued should be dropped.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 include/linux/netdevice.h      |   1 +
 include/uapi/linux/pkt_sched.h |  17 ++
 net/sched/Kconfig              |  11 +
 net/sched/Makefile             |   1 +
 net/sched/sch_tbs.c            | 474 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 504 insertions(+)
 create mode 100644 net/sched/sch_tbs.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7104de2bc957..09b5b2e08f04 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -781,6 +781,7 @@ enum tc_setup_type {
 	TC_SETUP_QDISC_CBS,
 	TC_SETUP_QDISC_RED,
 	TC_SETUP_QDISC_PRIO,
+	TC_SETUP_QDISC_TBS,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..a33b5b9da81a 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,21 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+
+/* TBS */
+struct tc_tbs_qopt {
+	__s32 delta;
+	__s32 clockid;
+	__u32 flags;
+#define TC_TBS_SORTING_ON BIT(0)
+};
+
+enum {
+	TCA_TBS_UNSPEC,
+	TCA_TBS_PARMS,
+	__TCA_TBS_MAX,
+};
+
+#define TCA_TBS_MAX (__TCA_TBS_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169fb5325..9e68fef78d50 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -183,6 +183,17 @@ config NET_SCH_CBS
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_cbs.
 
+config NET_SCH_TBS
+	tristate "Time Based Scheduler (TBS)"
+	---help---
+	  Say Y here if you want to use the Time Based Scheduler (TBS) packet
+	  scheduling algorithm.
+
+	  See the top of <file:net/sched/sch_tbs.c> for more details.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called sch_tbs.
+
 config NET_SCH_GRED
 	tristate "Generic Random Early Detection (GRED)"
 	---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d3804878..f02378a0a8f2 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_NET_SCH_FQ)	+= sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)	+= sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
+obj-$(CONFIG_NET_SCH_TBS)	+= sch_tbs.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c
new file mode 100644
index 000000000000..c19eedda9bc5
--- /dev/null
+++ b/net/sched/sch_tbs.c
@@ -0,0 +1,474 @@
+/*
+ * net/sched/sch_tbs.c	Time Based Scheduler
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
+ *		Vinicius Costa Gomes <vinicius.gomes@intel.com>
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/rbtree.h>
+#include <linux/skbuff.h>
+#include <linux/posix-timers.h>
+#include <net/netlink.h>
+#include <net/sch_generic.h>
+#include <net/pkt_sched.h>
+#include <net/sock.h>
+
+#define SORTING_IS_ON(x) (x->flags & TC_TBS_SORTING_ON)
+
+struct tbs_sched_data {
+	bool sorting;
+	int clockid;
+	int queue;
+	s32 delta; /* in ns */
+	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
+	struct rb_root head;
+	struct qdisc_watchdog watchdog;
+	struct Qdisc *qdisc;
+	int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free);
+	struct sk_buff *(*dequeue)(struct Qdisc *sch);
+	struct sk_buff *(*peek)(struct Qdisc *sch);
+};
+
+static const struct nla_policy tbs_policy[TCA_TBS_MAX + 1] = {
+	[TCA_TBS_PARMS]	= { .len = sizeof(struct tc_tbs_qopt) },
+};
+
+typedef ktime_t (*get_time_func_t)(void);
+
+static const get_time_func_t clockid_to_get_time[MAX_CLOCKS] = {
+	[CLOCK_MONOTONIC] = ktime_get,
+	[CLOCK_REALTIME] = ktime_get_real,
+	[CLOCK_BOOTTIME] = ktime_get_boottime,
+	[CLOCK_TAI] = ktime_get_clocktai,
+};
+
+static ktime_t get_time_by_clockid(clockid_t clockid)
+{
+	get_time_func_t func = clockid_to_get_time[clockid];
+
+	if (!func)
+		return 0;
+
+	return func();
+}
+
+static inline int validate_input_params(struct tc_tbs_qopt *qopt,
+					struct netlink_ext_ack *extack)
+{
+	/* Check if params comply to the following rules:
+	 *	* If SW best-effort, then clockid and delta must be valid
+	 *	  regardless of sorting enabled or not.
+	 *
+	 *	* Dynamic clockids are not supported.
+	 *	* Delta must be a positive integer.
+	 */
+	if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
+	    qopt->clockid >= MAX_CLOCKS) {
+		NL_SET_ERR_MSG(extack, "Invalid clockid");
+		return -EINVAL;
+	} else if (qopt->clockid < 0 ||
+		   !clockid_to_get_time[qopt->clockid]) {
+		NL_SET_ERR_MSG(extack, "Clockid is not supported");
+		return -ENOTSUPP;
+	}
+
+	if (qopt->delta < 0) {
+		NL_SET_ERR_MSG(extack, "Delta must be positive");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	ktime_t txtime = nskb->tstamp;
+	struct sock *sk = nskb->sk;
+	ktime_t now;
+
+	if (sk && !sock_flag(sk, SOCK_TXTIME))
+		return false;
+
+	/* We don't perform crosstimestamping.
+	 * Drop if packet's clockid differs from qdisc's.
+	 */
+	if (nskb->txtime_clockid != q->clockid)
+		return false;
+
+	now = get_time_by_clockid(q->clockid);
+	if (ktime_before(txtime, now) || ktime_before(txtime, q->last))
+		return false;
+
+	return true;
+}
+
+static struct sk_buff *tbs_peek(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	return q->peek(sch);
+}
+
+static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct rb_node *p;
+
+	p = rb_first(&q->head);
+	if (!p)
+		return NULL;
+
+	return rb_to_skb(p);
+}
+
+static void reset_watchdog(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb = tbs_peek(sch);
+	ktime_t next;
+
+	if (!skb)
+		return;
+
+	next = ktime_sub_ns(skb->tstamp, q->delta);
+	qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next));
+}
+
+static int tbs_enqueue(struct sk_buff *nskb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	return q->enqueue(nskb, sch, to_free);
+}
+
+static int tbs_enqueue_scheduledfifo(struct sk_buff *nskb, struct Qdisc *sch,
+				     struct sk_buff **to_free)
+{
+	int err;
+
+	if (!is_packet_valid(sch, nskb))
+		return qdisc_drop(nskb, sch, to_free);
+
+	err = qdisc_enqueue_tail(nskb, sch);
+
+	/* If there is only 1 packet, then we must reset the watchdog. */
+	if (err >= 0 && sch->q.qlen == 1)
+		reset_watchdog(sch);
+
+	return err;
+}
+
+static int tbs_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc *sch,
+				      struct sk_buff **to_free)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct rb_node **p = &q->head.rb_node, *parent = NULL;
+	ktime_t txtime = nskb->tstamp;
+
+	if (!is_packet_valid(sch, nskb))
+		return qdisc_drop(nskb, sch, to_free);
+
+	while (*p) {
+		struct sk_buff *skb;
+
+		parent = *p;
+		skb = rb_to_skb(parent);
+		if (ktime_after(txtime, skb->tstamp))
+			p = &parent->rb_right;
+		else
+			p = &parent->rb_left;
+	}
+	rb_link_node(&nskb->rbnode, parent, p);
+	rb_insert_color(&nskb->rbnode, &q->head);
+
+	qdisc_qstats_backlog_inc(sch, nskb);
+	sch->q.qlen++;
+
+	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
+	reset_watchdog(sch);
+
+	return NET_XMIT_SUCCESS;
+}
+
+static void timesortedlist_erase(struct Qdisc *sch, struct sk_buff *skb,
+				 bool drop)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	rb_erase(&skb->rbnode, &q->head);
+
+	qdisc_qstats_backlog_dec(sch, skb);
+
+	if (drop) {
+		struct sk_buff *to_free = NULL;
+
+		qdisc_drop(skb, sch, &to_free);
+		kfree_skb_list(to_free);
+		qdisc_qstats_overlimit(sch);
+	} else {
+		qdisc_bstats_update(sch, skb);
+
+		q->last = skb->tstamp;
+	}
+
+	sch->q.qlen--;
+
+	/* The rbnode field in the skb re-uses these fields, now that
+	 * we are done with the rbnode, reset them.
+	 */
+	skb->next = NULL;
+	skb->prev = NULL;
+	skb->dev = qdisc_dev(sch);
+}
+
+static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	return q->dequeue(sch);
+}
+
+static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb = tbs_peek(sch);
+	ktime_t now, next;
+
+	if (!skb)
+		return NULL;
+
+	now = get_time_by_clockid(q->clockid);
+
+	/* Drop if packet has expired while in queue and the drop_if_late
+	 * flag is set.
+	 */
+	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
+		struct sk_buff *to_free = NULL;
+
+		qdisc_queue_drop_head(sch, &to_free);
+		kfree_skb_list(to_free);
+		qdisc_qstats_overlimit(sch);
+
+		skb = NULL;
+		goto out;
+	}
+
+	next = ktime_sub_ns(skb->tstamp, q->delta);
+
+	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
+	if (ktime_after(now, next))
+		skb = qdisc_dequeue_head(sch);
+	else
+		skb = NULL;
+
+out:
+	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
+	reset_watchdog(sch);
+
+	return skb;
+}
+
+static struct sk_buff *tbs_dequeue_timesortedlist(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	ktime_t now, next;
+
+	skb = tbs_peek(sch);
+	if (!skb)
+		return NULL;
+
+	now = get_time_by_clockid(q->clockid);
+
+	/* Drop if packet has expired while in queue and the drop_if_late
+	 * flag is set.
+	 */
+	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
+		timesortedlist_erase(sch, skb, true);
+		skb = NULL;
+		goto out;
+	}
+
+	next = ktime_sub_ns(skb->tstamp, q->delta);
+
+	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
+	if (ktime_after(now, next))
+		timesortedlist_erase(sch, skb, false);
+	else
+		skb = NULL;
+
+out:
+	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
+	reset_watchdog(sch);
+
+	return skb;
+}
+
+static inline void setup_queueing_mode(struct tbs_sched_data *q)
+{
+	if (q->sorting) {
+		q->enqueue = tbs_enqueue_timesortedlist;
+		q->dequeue = tbs_dequeue_timesortedlist;
+		q->peek = tbs_peek_timesortedlist;
+	} else {
+		q->enqueue = tbs_enqueue_scheduledfifo;
+		q->dequeue = tbs_dequeue_scheduledfifo;
+		q->peek = qdisc_peek_head;
+	}
+}
+
+static int tbs_init(struct Qdisc *sch, struct nlattr *opt,
+		    struct netlink_ext_ack *extack)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct nlattr *tb[TCA_TBS_MAX + 1];
+	struct tc_tbs_qopt *qopt;
+	int err;
+
+	if (!opt) {
+		NL_SET_ERR_MSG(extack, "Missing TBS qdisc options which are mandatory");
+		return -EINVAL;
+	}
+
+	err = nla_parse_nested(tb, TCA_TBS_MAX, opt, tbs_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_TBS_PARMS]) {
+		NL_SET_ERR_MSG(extack, "Missing mandatory TBS parameters");
+		return -EINVAL;
+	}
+
+	qopt = nla_data(tb[TCA_TBS_PARMS]);
+
+	pr_debug("delta %d clockid %d sorting %s\n",
+		 qopt->delta, qopt->clockid,
+		 SORTING_IS_ON(qopt) ? "on" : "off");
+
+	err = validate_input_params(qopt, extack);
+	if (err < 0)
+		return err;
+
+	q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0);
+
+	/* Everything went OK, save the parameters used. */
+	q->delta = qopt->delta;
+	q->clockid = qopt->clockid;
+	q->sorting = SORTING_IS_ON(qopt);
+
+	/* Select queueing mode based on parameters. */
+	setup_queueing_mode(q);
+
+	qdisc_watchdog_init_clockid(&q->watchdog, sch, q->clockid);
+
+	return 0;
+}
+
+static void timesortedlist_clear(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct rb_node *p = rb_first(&q->head);
+
+	while (p) {
+		struct sk_buff *skb = rb_to_skb(p);
+
+		p = rb_next(p);
+
+		rb_erase(&skb->rbnode, &q->head);
+		rtnl_kfree_skbs(skb, skb);
+		sch->q.qlen--;
+	}
+}
+
+static void tbs_reset(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	/* Only cancel watchdog if it's been initialized. */
+	if (q->watchdog.qdisc == sch)
+		qdisc_watchdog_cancel(&q->watchdog);
+
+	/* No matter which mode we are on, it's safe to clear both lists. */
+	timesortedlist_clear(sch);
+	__qdisc_reset_queue(&sch->q);
+
+	sch->qstats.backlog = 0;
+	sch->q.qlen = 0;
+
+	q->last = 0;
+}
+
+static void tbs_destroy(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+
+	/* Only cancel watchdog if it's been initialized. */
+	if (q->watchdog.qdisc == sch)
+		qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static int tbs_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct tc_tbs_qopt opt = { };
+	struct nlattr *nest;
+
+	nest = nla_nest_start(skb, TCA_OPTIONS);
+	if (!nest)
+		goto nla_put_failure;
+
+	opt.delta = q->delta;
+	opt.clockid = q->clockid;
+	if (q->sorting)
+		opt.flags |= TC_TBS_SORTING_ON;
+
+	if (nla_put(skb, TCA_TBS_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, nest);
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -1;
+}
+
+static struct Qdisc_ops tbs_qdisc_ops __read_mostly = {
+	.id		=	"tbs",
+	.priv_size	=	sizeof(struct tbs_sched_data),
+	.enqueue	=	tbs_enqueue,
+	.dequeue	=	tbs_dequeue,
+	.peek		=	tbs_peek,
+	.init		=	tbs_init,
+	.reset		=	tbs_reset,
+	.destroy	=	tbs_destroy,
+	.dump		=	tbs_dump,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init tbs_module_init(void)
+{
+	return register_qdisc(&tbs_qdisc_ops);
+}
+
+static void __exit tbs_module_exit(void)
+{
+	unregister_qdisc(&tbs_qdisc_ops);
+}
+module_init(tbs_module_init)
+module_exit(tbs_module_exit)
+MODULE_LICENSE("GPL");
-- 
2.16.2


* [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Add new queueing modes to tbs qdisc so HW offload is supported.

With HW offload, the time sorted list is still used when sorting is on.
When sorting is off, the enqueue / dequeue flow is a 'raw' FIFO built
on qdisc_enqueue_tail() and qdisc_dequeue_head(). In this 'raw hw
offload' mode the Qdisc ignores the drop_if_late flag of skbuffs: the
mode implicitly assumes that applications use the PHC clock, and
checking the flag would require reading the PHC time directly.
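
As a rough illustration only (not part of this patch), the resulting
mode selection can be sketched in plain C. The enum and function names
below are hypothetical; the logic is meant to mirror
setup_queueing_mode() and the watchdog condition in tbs_init():

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only; not part of the patch. */
enum tbs_mode {
	TBS_MODE_TIMESORTEDLIST,	/* sorting on, with or without offload */
	TBS_MODE_RAW_FIFO,		/* offload on, sorting off */
	TBS_MODE_SCHEDULED_FIFO,	/* SW best effort, sorting off */
};

/* Sorting always wins; offload only matters when sorting is off. */
static enum tbs_mode tbs_select_mode(bool offload, bool sorting)
{
	if (sorting)
		return TBS_MODE_TIMESORTEDLIST;

	return offload ? TBS_MODE_RAW_FIFO : TBS_MODE_SCHEDULED_FIFO;
}

/* The watchdog is only needed when the qdisc itself holds packets back. */
static bool tbs_mode_needs_watchdog(bool offload, bool sorting)
{
	return !offload || sorting;
}
```

Under these assumptions, 'offload' alone selects the raw FIFO and
needs no watchdog, while 'offload sorting' keeps the time sorted list
and the watchdog.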

Example 1:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
           map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload

In this example, the Qdisc uses HW offload, so the network adapter
controls the transmission time. The timestamps in skbuffs are assumed
to be relative to the interface's PHC, and setting any clockid is
treated as an error. Because the qdisc performs no scheduling in this
mode, setting a delta != 0 is also an error.
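
A simplified sketch of these parameter rules, for illustration only
(not part of this patch): struct tbs_params and tbs_check_params() are
hypothetical names, and the clockid_set flag is an assumption standing
in for the comparison against CLOCKID_INVALID done by
validate_input_params(). The checks for unsupported or dynamic clockids
are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct tc_tbs_qopt. */
struct tbs_params {
	bool offload;
	bool sorting;
	bool clockid_set;	/* assumption: true if the user passed a clockid */
	int32_t delta;		/* in ns */
};

static int tbs_check_params(const struct tbs_params *p)
{
	if (!p->offload || p->sorting) {
		/* SW best effort or sorted offload: both must be valid. */
		if (!p->clockid_set)
			return -EINVAL;
		if (p->delta < 0)
			return -EINVAL;
		return 0;
	}

	/* Raw HW offload: the PHC is implied, so neither may be set. */
	if (p->delta != 0 || p->clockid_set)
		return -EINVAL;

	return 0;
}
```

With this sketch, 'tbs offload' passes, while 'tbs offload delta
100000' without sorting is rejected with -EINVAL.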

Example 2:

$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
           map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
	   clockid CLOCK_REALTIME sorting

Here, the Qdisc again uses HW offload for txtime control, but sorting
is now enabled, so the qdisc also performs scheduling. Scheduling uses
CLOCK_REALTIME as the reference, and packets leave the Qdisc "delta"
(100000) nanoseconds before their transmission time. Because HW offload
is in use and hrtimers do not support dynamic clocks, the system clock
and the PHC clock must be synchronized for this mode to behave as
expected.
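
The timing rule can be sketched as follows, for illustration only (not
part of this patch). The helper names are hypothetical and plain
int64_t nanoseconds stand in for ktime_t; the comparisons are meant to
mirror the ktime_after() / ktime_before() checks in the dequeue path:

```c
#include <stdbool.h>
#include <stdint.h>

/* Earliest instant at which a packet may leave the qdisc. */
static int64_t tbs_release_time_ns(int64_t txtime_ns, int32_t delta_ns)
{
	return txtime_ns - delta_ns;
}

/* A packet is dequeued only once 'now' is strictly past txtime - delta. */
static bool tbs_may_dequeue(int64_t now_ns, int64_t txtime_ns,
			    int32_t delta_ns)
{
	return now_ns > tbs_release_time_ns(txtime_ns, delta_ns);
}

/* With drop_if_late set, a packet whose txtime has passed is dropped. */
static bool tbs_is_late(int64_t now_ns, int64_t txtime_ns)
{
	return txtime_ns < now_ns;
}
```

With delta 100000 and a txtime of 1000000000 ns, the watchdog would be
armed for 999900000 ns and the packet handed to the driver right after
that point.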

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 include/net/pkt_sched.h        |   5 ++
 include/uapi/linux/pkt_sched.h |   1 +
 net/sched/sch_tbs.c            | 159 +++++++++++++++++++++++++++++++++++------
 3 files changed, 144 insertions(+), 21 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 2466ea143d01..d042ffda7f21 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -155,4 +155,9 @@ struct tc_cbs_qopt_offload {
 	s32 sendslope;
 };
 
+struct tc_tbs_qopt_offload {
+	u8 enable;
+	s32 queue;
+};
+
 #endif
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index a33b5b9da81a..92af9fa4dee4 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -941,6 +941,7 @@ struct tc_tbs_qopt {
 	__s32 clockid;
 	__u32 flags;
 #define TC_TBS_SORTING_ON BIT(0)
+#define TC_TBS_OFFLOAD_ON BIT(1)
 };
 
 enum {
diff --git a/net/sched/sch_tbs.c b/net/sched/sch_tbs.c
index c19eedda9bc5..2aafa55de42c 100644
--- a/net/sched/sch_tbs.c
+++ b/net/sched/sch_tbs.c
@@ -25,8 +25,10 @@
 #include <net/sock.h>
 
 #define SORTING_IS_ON(x) (x->flags & TC_TBS_SORTING_ON)
+#define OFFLOAD_IS_ON(x) (x->flags & TC_TBS_OFFLOAD_ON)
 
 struct tbs_sched_data {
+	bool offload;
 	bool sorting;
 	int clockid;
 	int queue;
@@ -68,25 +70,42 @@ static inline int validate_input_params(struct tc_tbs_qopt *qopt,
 					struct netlink_ext_ack *extack)
 {
 	/* Check if params comply to the following rules:
-	 *	* If SW best-effort, then clockid and delta must be valid
-	 *	  regardless of sorting enabled or not.
+	 *	* If SW best-effort, then clockid and delta must be valid.
+	 *
+	 *	* If HW offload is ON and sorting is ON, then clockid and delta
+	 *	  must be valid.
+	 *
+	 *	* If HW offload is ON and sorting is OFF, then clockid and
+	 *	  delta must not have been set. The netdevice PHC will be used
+	 *	  implicitly.
 	 *
 	 *	* Dynamic clockids are not supported.
 	 *	* Delta must be a positive integer.
 	 */
-	if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
-	    qopt->clockid >= MAX_CLOCKS) {
-		NL_SET_ERR_MSG(extack, "Invalid clockid");
-		return -EINVAL;
-	} else if (qopt->clockid < 0 ||
-		   !clockid_to_get_time[qopt->clockid]) {
-		NL_SET_ERR_MSG(extack, "Clockid is not supported");
-		return -ENOTSUPP;
-	}
-
-	if (qopt->delta < 0) {
-		NL_SET_ERR_MSG(extack, "Delta must be positive");
-		return -EINVAL;
+	if (!OFFLOAD_IS_ON(qopt) || SORTING_IS_ON(qopt)) {
+		if ((qopt->clockid & CLOCKID_INVALID) == CLOCKID_INVALID ||
+		    qopt->clockid >= MAX_CLOCKS) {
+			NL_SET_ERR_MSG(extack, "Invalid clockid");
+			return -EINVAL;
+		} else if (qopt->clockid < 0 ||
+			   !clockid_to_get_time[qopt->clockid]) {
+			NL_SET_ERR_MSG(extack, "Clockid is not supported");
+			return -ENOTSUPP;
+		}
+
+		if (qopt->delta < 0) {
+			NL_SET_ERR_MSG(extack, "Delta must be positive");
+			return -EINVAL;
+		}
+	} else {
+		if (qopt->delta != 0) {
+			NL_SET_ERR_MSG(extack, "Cannot set delta for this mode");
+			return -EINVAL;
+		}
+		if ((qopt->clockid & CLOCKID_INVALID) != CLOCKID_INVALID) {
+			NL_SET_ERR_MSG(extack, "Cannot set clockid for this mode");
+			return -EINVAL;
+		}
 	}
 
 	return 0;
@@ -155,6 +174,15 @@ static int tbs_enqueue(struct sk_buff *nskb, struct Qdisc *sch,
 	return q->enqueue(nskb, sch, to_free);
 }
 
+static int tbs_enqueue_fifo(struct sk_buff *nskb, struct Qdisc *sch,
+			    struct sk_buff **to_free)
+{
+	if (!is_packet_valid(sch, nskb))
+		return qdisc_drop(nskb, sch, to_free);
+
+	return qdisc_enqueue_tail(nskb, sch);
+}
+
 static int tbs_enqueue_scheduledfifo(struct sk_buff *nskb, struct Qdisc *sch,
 				     struct sk_buff **to_free)
 {
@@ -242,6 +270,21 @@ static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
 	return q->dequeue(sch);
 }
 
+static struct sk_buff *tbs_dequeue_fifo(struct Qdisc *sch)
+{
+	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb = qdisc_dequeue_head(sch);
+
+	/* XXX: The drop_if_late bit is not checked here because that would
+	 *      require the PHC time to be read directly.
+	 */
+
+	if (skb)
+		q->last = skb->tstamp;
+
+	return skb;
+}
+
 static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
 {
 	struct tbs_sched_data *q = qdisc_priv(sch);
@@ -318,6 +361,56 @@ static struct sk_buff *tbs_dequeue_timesortedlist(struct Qdisc *sch)
 	return skb;
 }
 
+static void tbs_disable_offload(struct net_device *dev,
+				struct tbs_sched_data *q)
+{
+	struct tc_tbs_qopt_offload tbs = { };
+	const struct net_device_ops *ops;
+	int err;
+
+	if (!q->offload)
+		return;
+
+	ops = dev->netdev_ops;
+	if (!ops->ndo_setup_tc)
+		return;
+
+	tbs.queue = q->queue;
+	tbs.enable = 0;
+
+	err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_TBS, &tbs);
+	if (err < 0)
+		pr_warn("Couldn't disable TBS offload for queue %d\n",
+			tbs.queue);
+}
+
+static int tbs_enable_offload(struct net_device *dev, struct tbs_sched_data *q,
+			      struct netlink_ext_ack *extack)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	struct tc_tbs_qopt_offload tbs = { };
+	int err;
+
+	if (q->offload)
+		return 0;
+
+	if (!ops->ndo_setup_tc) {
+		NL_SET_ERR_MSG(extack, "Specified device does not support TBS offload");
+		return -EOPNOTSUPP;
+	}
+
+	tbs.queue = q->queue;
+	tbs.enable = 1;
+
+	err = ops->ndo_setup_tc(dev, TC_SETUP_QDISC_TBS, &tbs);
+	if (err < 0) {
+		NL_SET_ERR_MSG(extack, "Specified device failed to setup TBS hardware offload");
+		return err;
+	}
+
+	return 0;
+}
+
 static inline void setup_queueing_mode(struct tbs_sched_data *q)
 {
 	if (q->sorting) {
@@ -325,9 +418,15 @@ static inline void setup_queueing_mode(struct tbs_sched_data *q)
 		q->dequeue = tbs_dequeue_timesortedlist;
 		q->peek = tbs_peek_timesortedlist;
 	} else {
-		q->enqueue = tbs_enqueue_scheduledfifo;
-		q->dequeue = tbs_dequeue_scheduledfifo;
-		q->peek = qdisc_peek_head;
+		if (q->offload) {
+			q->enqueue = tbs_enqueue_fifo;
+			q->dequeue = tbs_dequeue_fifo;
+			q->peek = qdisc_peek_head;
+		} else {
+			q->enqueue = tbs_enqueue_scheduledfifo;
+			q->dequeue = tbs_dequeue_scheduledfifo;
+			q->peek = qdisc_peek_head;
+		}
 	}
 }
 
@@ -356,8 +455,9 @@ static int tbs_init(struct Qdisc *sch, struct nlattr *opt,
 
 	qopt = nla_data(tb[TCA_TBS_PARMS]);
 
-	pr_debug("delta %d clockid %d sorting %s\n",
+	pr_debug("delta %d clockid %d offload %s sorting %s\n",
 		 qopt->delta, qopt->clockid,
+		 OFFLOAD_IS_ON(qopt) ? "on" : "off",
 		 SORTING_IS_ON(qopt) ? "on" : "off");
 
 	err = validate_input_params(qopt, extack);
@@ -366,15 +466,26 @@ static int tbs_init(struct Qdisc *sch, struct nlattr *opt,
 
 	q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0);
 
+	if (OFFLOAD_IS_ON(qopt)) {
+		err = tbs_enable_offload(dev, q, extack);
+		if (err < 0)
+			return err;
+	}
+
 	/* Everything went OK, save the parameters used. */
 	q->delta = qopt->delta;
 	q->clockid = qopt->clockid;
+	q->offload = OFFLOAD_IS_ON(qopt);
 	q->sorting = SORTING_IS_ON(qopt);
 
-	/* Select queueing mode based on parameters. */
+	/* Select queueing mode based on offload and sorting parameters. */
 	setup_queueing_mode(q);
 
-	qdisc_watchdog_init_clockid(&q->watchdog, sch, q->clockid);
+	/* The watchdog will be needed for SW best-effort or if TxTime
+	 * based sorting is on.
+	 */
+	if (!q->offload || q->sorting)
+		qdisc_watchdog_init_clockid(&q->watchdog, sch, q->clockid);
 
 	return 0;
 }
@@ -416,10 +527,13 @@ static void tbs_reset(struct Qdisc *sch)
 static void tbs_destroy(struct Qdisc *sch)
 {
 	struct tbs_sched_data *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
 
 	/* Only cancel watchdog if it's been initialized. */
 	if (q->watchdog.qdisc == sch)
 		qdisc_watchdog_cancel(&q->watchdog);
+
+	tbs_disable_offload(dev, q);
 }
 
 static int tbs_dump(struct Qdisc *sch, struct sk_buff *skb)
@@ -434,6 +548,9 @@ static int tbs_dump(struct Qdisc *sch, struct sk_buff *skb)
 
 	opt.delta = q->delta;
 	opt.clockid = q->clockid;
+	if (q->offload)
+		opt.flags |= TC_TBS_OFFLOAD_ON;
+
 	if (q->sorting)
 		opt.flags |= TC_TBS_SORTING_ON;
 
-- 
2.16.2

* [RFC v3 net-next 15/18] igb: Refactor igb_configure_cbs()
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Make igb_configure_cbs() retrieve the parameters it needs from the Tx
ring being addressed, since it already relies on what has been saved
there. Since the function will also be used by the upcoming Launchtime
patches, rename it to igb_config_tx_modes() to better reflect its
purpose. Note that Launchtime is not part of what 802.1Qav specifies,
but the i210 datasheet refers to this set of functionality as "Qav
Transmission Mode".

Also do a small refactor of is_any_cbs_enabled() and add further
documentation to igb_setup_tx_mode().

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
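Reviewer note, not part of the commit: the two-step pattern described
above (save the parameters on the ring first, then configure from the
ring alone) can be sketched in plain user-space C. This is only an
illustration under stated assumptions: the demo_* names and structs are
hypothetical stand-ins, no hardware is touched, and only the idleSlope
and hiCredit arithmetic is copied from the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the CBS fields kept per Tx ring. */
struct demo_ring {
	bool cbs_enable;
	int32_t idleslope;	/* kbps */
	int32_t hicredit;	/* bytes */
};

/* Values the sketch would write to TQAVCC/TQAVHC; no real register I/O. */
struct demo_regs {
	uint16_t tqavcc_idleslope;
	uint32_t tqavhc;
};

/* Step 1: store the parameters on the ring (role of igb_save_cbs_params()). */
static void demo_save_cbs_params(struct demo_ring *ring, bool enable,
				 int32_t idleslope, int32_t hicredit)
{
	assert(idleslope >= 0 && hicredit >= 0);
	ring->cbs_enable = enable;
	ring->idleslope = idleslope;
	ring->hicredit = hicredit;
}

/* Step 2: configure using only what the ring holds (role of
 * igb_config_tx_modes()); the caller passes no CBS arguments.
 */
static struct demo_regs demo_config_tx_modes(const struct demo_ring *ring)
{
	struct demo_regs regs = { 0, 0 };

	if (ring->cbs_enable) {
		/* Same arithmetic as the patch: idleslope * 61034 / 10^6,
		 * rounded up.
		 */
		uint64_t scaled = (uint64_t)ring->idleslope * 61034ULL;

		regs.tqavcc_idleslope =
			(uint16_t)((scaled + 999999ULL) / 1000000ULL);
		regs.tqavhc = 0x80000000u + (uint32_t)ring->hicredit * 0x7735u;
	}

	/* When CBS is disabled both values stay zero, as in the else branch. */
	return regs;
}
```

For example, saving idleslope 20000 kbps on a ring and then calling
demo_config_tx_modes() yields an idleSlope field of 1221, since
20000 * 61034 / 1000000 = 1220.68 is rounded up.
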
 drivers/net/ethernet/intel/igb/igb_main.c | 54 ++++++++++++++-----------------
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae785369..49cfbe4fd2b1 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1673,23 +1673,17 @@ static void set_queue_mode(struct e1000_hw *hw, int queue, enum queue_mode mode)
 }
 
 /**
- *  igb_configure_cbs - Configure Credit-Based Shaper (CBS)
+ *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
- *  @enable: true = enable CBS, false = disable CBS
- *  @idleslope: idleSlope in kbps
- *  @sendslope: sendSlope in kbps
- *  @hicredit: hiCredit in bytes
- *  @locredit: loCredit in bytes
  *
- *  Configure CBS for a given hardware queue. When disabling, idleslope,
- *  sendslope, hicredit, locredit arguments are ignored. Returns 0 if
- *  success. Negative otherwise.
+ *  Configure CBS for a given hardware queue. Parameters are retrieved
+ *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  for setting those correctly prior to this function being called.
  **/
-static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
-			      bool enable, int idleslope, int sendslope,
-			      int hicredit, int locredit)
+static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 {
+	struct igb_ring *ring = adapter->tx_ring[queue];
 	struct net_device *netdev = adapter->netdev;
 	struct e1000_hw *hw = &adapter->hw;
 	u32 tqavcc;
@@ -1698,7 +1692,7 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
 	WARN_ON(hw->mac.type != e1000_i210);
 	WARN_ON(queue < 0 || queue > 1);
 
-	if (enable) {
+	if (ring->cbs_enable) {
 		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
 		set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
@@ -1759,14 +1753,15 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
 		 *       calculated value, so the resulting bandwidth might
 		 *       be slightly higher for some configurations.
 		 */
-		value = DIV_ROUND_UP_ULL(idleslope * 61034ULL, 1000000);
+		value = DIV_ROUND_UP_ULL(ring->idleslope * 61034ULL, 1000000);
 
 		tqavcc = rd32(E1000_I210_TQAVCC(queue));
 		tqavcc &= ~E1000_TQAVCC_IDLESLOPE_MASK;
 		tqavcc |= value;
 		wr32(E1000_I210_TQAVCC(queue), tqavcc);
 
-		wr32(E1000_I210_TQAVHC(queue), 0x80000000 + hicredit * 0x7735);
+		wr32(E1000_I210_TQAVHC(queue),
+		     0x80000000 + ring->hicredit * 0x7735);
 	} else {
 		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
 		set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
@@ -1786,8 +1781,9 @@ static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
 	 */
 
 	netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d hiCredit %d locredit %d\n",
-		   (enable) ? "enabled" : "disabled", queue,
-		   idleslope, sendslope, hicredit, locredit);
+		   (ring->cbs_enable) ? "enabled" : "disabled", queue,
+		   ring->idleslope, ring->sendslope, ring->hicredit,
+		   ring->locredit);
 }
 
 static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
@@ -1812,19 +1808,25 @@ static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
 
 static bool is_any_cbs_enabled(struct igb_adapter *adapter)
 {
-	struct igb_ring *ring;
 	int i;
 
 	for (i = 0; i < adapter->num_tx_queues; i++) {
-		ring = adapter->tx_ring[i];
-
-		if (ring->cbs_enable)
+		if (adapter->tx_ring[i]->cbs_enable)
 			return true;
 	}
 
 	return false;
 }
 
+/**
+ *  igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
+ *  @adapter: pointer to adapter struct
+ *
+ *  Configure the TQAVCTRL register, switching the controller's Tx mode
+ *  according to whether FQTSS mode is enabled or disabled. Additionally,
+ *  call igb_config_tx_modes() for each queue so that any previously
+ *  saved Tx parameters are applied.
+ **/
 static void igb_setup_tx_mode(struct igb_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
@@ -1884,11 +1886,7 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
 			    adapter->num_tx_queues : I210_SR_QUEUES_NUM;
 
 		for (i = 0; i < max_queue; i++) {
-			struct igb_ring *ring = adapter->tx_ring[i];
-
-			igb_configure_cbs(adapter, i, ring->cbs_enable,
-					  ring->idleslope, ring->sendslope,
-					  ring->hicredit, ring->locredit);
+			igb_config_tx_modes(adapter, i);
 		}
 	} else {
 		wr32(E1000_RXPBS, I210_RXPBSIZE_DEFAULT);
@@ -2482,9 +2480,7 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
 		return err;
 
 	if (is_fqtss_enabled(adapter)) {
-		igb_configure_cbs(adapter, qopt->queue, qopt->enable,
-				  qopt->idleslope, qopt->sendslope,
-				  qopt->hicredit, qopt->locredit);
+		igb_config_tx_modes(adapter, qopt->queue);
 
 		if (!is_any_cbs_enabled(adapter))
 			enable_fqtss(adapter, false);
-- 
2.16.2



* [RFC v3 net-next 16/18] igb: Only change Tx arbitration when CBS is on
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Currently the data transmission arbitration algorithm - DataTranARB
field on TQAVCTRL reg - is always set to CBS when the Tx mode is
changed from legacy to 'Qav' mode.

Make that configuration a bit more granular in preparation for the
upcoming Launchtime enabling patches, since CBS and Launchtime can be
enabled separately. That is achieved by moving the DataTranARB setup
to igb_config_tx_modes() instead.

Similarly, when disabling CBS for a queue we must check whether it is
now disabled for all queues, and clear DataTranARB only in that case.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
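Reviewer note, not part of the commit: the DataTranARB handling
described above can be modelled with a simulated register. This is a
minimal sketch, assuming a two-queue adapter; the demo_* names are
hypothetical and the bit positions are copied from the
E1000_TQAVCTRL_* defines used in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NUM_QUEUES		2
/* Bit positions as in E1000_TQAVCTRL_XMIT_MODE / _DATATRANARB. */
#define DEMO_XMIT_MODE		(1u << 0)
#define DEMO_DATATRANARB	(1u << 8)

/* Hypothetical adapter state; tqavctrl simulates the register. */
struct demo_adapter {
	bool cbs_enable[DEMO_NUM_QUEUES];
	uint32_t tqavctrl;
};

static bool demo_any_cbs_enabled(const struct demo_adapter *a)
{
	for (int i = 0; i < DEMO_NUM_QUEUES; i++) {
		if (a->cbs_enable[i])
			return true;
	}

	return false;
}

/* Per-queue configuration: set DataTranARB whenever this queue has CBS
 * on, and clear it only once no queue has CBS on anymore.
 */
static void demo_config_queue(struct demo_adapter *a, int queue, bool enable)
{
	assert(queue >= 0 && queue < DEMO_NUM_QUEUES);

	a->cbs_enable[queue] = enable;

	if (enable)
		a->tqavctrl |= DEMO_DATATRANARB;
	else if (!demo_any_cbs_enabled(a))
		a->tqavctrl &= ~DEMO_DATATRANARB;
}
```

Disabling CBS on one queue while another still has it enabled leaves
DataTranARB set; other TQAVCTRL bits such as the transmit mode are
never touched by this path.
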
 drivers/net/ethernet/intel/igb/igb_main.c | 49 +++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 49cfbe4fd2b1..9c33f2d18d8c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1672,6 +1672,18 @@ static void set_queue_mode(struct e1000_hw *hw, int queue, enum queue_mode mode)
 	wr32(E1000_I210_TQAVCC(queue), val);
 }
 
+static bool is_any_cbs_enabled(struct igb_adapter *adapter)
+{
+	int i;
+
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		if (adapter->tx_ring[i]->cbs_enable)
+			return true;
+	}
+
+	return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
@@ -1686,7 +1698,7 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 	struct igb_ring *ring = adapter->tx_ring[queue];
 	struct net_device *netdev = adapter->netdev;
 	struct e1000_hw *hw = &adapter->hw;
-	u32 tqavcc;
+	u32 tqavcc, tqavctrl;
 	u16 value;
 
 	WARN_ON(hw->mac.type != e1000_i210);
@@ -1696,6 +1708,14 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
 		set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
+		/* Always set data transfer arbitration to credit-based
+		 * shaper algorithm on TQAVCTRL if CBS is enabled for any of
+		 * the queues.
+		 */
+		tqavctrl = rd32(E1000_I210_TQAVCTRL);
+		tqavctrl |= E1000_TQAVCTRL_DATATRANARB;
+		wr32(E1000_I210_TQAVCTRL, tqavctrl);
+
 		/* According to i210 datasheet section 7.2.7.7, we should set
 		 * the 'idleSlope' field from TQAVCC register following the
 		 * equation:
@@ -1773,6 +1793,16 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 
 		/* Set hiCredit to zero. */
 		wr32(E1000_I210_TQAVHC(queue), 0);
+
+		/* If CBS is not enabled for any queues anymore, then return to
+		 * the default state of Data Transmission Arbitration on
+		 * TQAVCTRL.
+		 */
+		if (!is_any_cbs_enabled(adapter)) {
+			tqavctrl = rd32(E1000_I210_TQAVCTRL);
+			tqavctrl &= ~E1000_TQAVCTRL_DATATRANARB;
+			wr32(E1000_I210_TQAVCTRL, tqavctrl);
+		}
 	}
 
 	/* XXX: In i210 controller the sendSlope and loCredit parameters from
@@ -1806,18 +1836,6 @@ static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
 	return 0;
 }
 
-static bool is_any_cbs_enabled(struct igb_adapter *adapter)
-{
-	int i;
-
-	for (i = 0; i < adapter->num_tx_queues; i++) {
-		if (adapter->tx_ring[i]->cbs_enable)
-			return true;
-	}
-
-	return false;
-}
-
 /**
  *  igb_setup_tx_mode - Switch to/from Qav Tx mode when applicable
  *  @adapter: pointer to adapter struct
@@ -1841,11 +1859,10 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
 		int i, max_queue;
 
 		/* Configure TQAVCTRL register: set transmit mode to 'Qav',
-		 * set data fetch arbitration to 'round robin' and set data
-		 * transfer arbitration to 'credit shaper algorithm.
+		 * set data fetch arbitration to 'round robin'.
 		 */
 		val = rd32(E1000_I210_TQAVCTRL);
-		val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_DATATRANARB;
+		val |= E1000_TQAVCTRL_XMIT_MODE;
 		val &= ~E1000_TQAVCTRL_DATAFETCHARB;
 		wr32(E1000_I210_TQAVCTRL, val);
 
-- 
2.16.2



* [RFC v3 net-next 17/18] igb: Refactor igb_offload_cbs()
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Split the FQTSS enable/disable logic out of igb_offload_cbs() into a
separate function, igb_offload_apply(), so that the TBS offload
implementation can reuse it.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
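Reviewer note, not part of the commit: the control flow that ends up in
igb_offload_apply() can be sketched as a small state machine. The
demo_* names are hypothetical, and enable_fqtss() is modelled as merely
flipping a flag, which is an assumption; whatever else the real helper
does is outside this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NUM_QUEUES	2

/* Hypothetical adapter state for the sketch. */
struct demo_adapter {
	bool fqtss;			/* stands in for is_fqtss_enabled() */
	bool cbs_enable[DEMO_NUM_QUEUES];
	int config_calls;		/* counts per-queue configurations */
	int last_queue;			/* last queue that was configured */
};

static bool demo_any_cbs_enabled(const struct demo_adapter *a)
{
	for (int i = 0; i < DEMO_NUM_QUEUES; i++) {
		if (a->cbs_enable[i])
			return true;
	}

	return false;
}

/* Same shape as igb_offload_apply(): early return when the Tx mode is
 * switched on, otherwise configure the queue and switch the mode off
 * again once no queue uses CBS.
 */
static void demo_offload_apply(struct demo_adapter *a, int queue)
{
	assert(queue >= 0 && queue < DEMO_NUM_QUEUES);

	if (!a->fqtss) {
		a->fqtss = true;
		return;
	}

	/* Stands in for igb_config_tx_modes(adapter, queue). */
	a->config_calls++;
	a->last_queue = queue;

	if (!demo_any_cbs_enabled(a))
		a->fqtss = false;
}
```

The early return keeps the function flat, so a second offload type only
has to extend the final condition.
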
 drivers/net/ethernet/intel/igb/igb_main.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 9c33f2d18d8c..10d7809a85d7 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2476,6 +2476,19 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	return features;
 }
 
+static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
+{
+	if (!is_fqtss_enabled(adapter)) {
+		enable_fqtss(adapter, true);
+		return;
+	}
+
+	igb_config_tx_modes(adapter, queue);
+
+	if (!is_any_cbs_enabled(adapter))
+		enable_fqtss(adapter, false);
+}
+
 static int igb_offload_cbs(struct igb_adapter *adapter,
 			   struct tc_cbs_qopt_offload *qopt)
 {
@@ -2496,15 +2509,7 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
 	if (err)
 		return err;
 
-	if (is_fqtss_enabled(adapter)) {
-		igb_config_tx_modes(adapter, qopt->queue);
-
-		if (!is_any_cbs_enabled(adapter))
-			enable_fqtss(adapter, false);
-
-	} else {
-		enable_fqtss(adapter, true);
-	}
+	igb_offload_apply(adapter, qopt->queue);
 
 	return 0;
 }
-- 
2.16.2



* [RFC v3 net-next 18/18] igb: Add support for TBS offload
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar,
	Jesus Sanchez-Palencia

Implement HW offload support for SO_TXTIME through igb's Launchtime
feature. This is done by extending igb_setup_tc() so it supports
TC_SETUP_QDISC_TBS and configuring i210 so time based transmit
arbitration is enabled.

The FQTSS transmission mode added before is extended so strict
priority (SP) queues wait for stream reservation (SR) ones.
igb_config_tx_modes() is extended so it can support enabling/disabling
Launchtime following the previous approach used for the credit-based
shaper (CBS).

As in the previous flow, FQTSS transmission mode is enabled
automatically by the driver once Launchtime (or CBS, as before) is
enabled. Similarly, it is automatically disabled when the feature is
disabled for the last queue that had it enabled.

The driver just consumes the transmit times from the skbuffs directly,
so no special handling is done in case an 'invalid' time is provided.
We assume this has been handled by the TBS qdisc already.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
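Reviewer note, not part of the commit: the FetchTimeDelta arithmetic
and the way Launchtime is switched on and off in TQAVCTRL can be
checked with a few lines of user-space C. This is a sketch only; the
demo_* names are hypothetical, the register is a plain integer, and the
bit layout is taken from the defines added in this patch.

```c
#include <assert.h>	/* for the checks mentioned in the note below */
#include <stdbool.h>
#include <stdint.h>

/* Layout as in E1000_TQAVCTRL_DATATRANTIM and the FetchTimeDelta field. */
#define DEMO_DATATRANTIM		(1u << 9)
#define DEMO_FETCHTIME_DELTA_SHIFT	16
#define DEMO_FETCHTIME_DELTA_MASK	(0xFFFFu << DEMO_FETCHTIME_DELTA_SHIFT)
#define DEMO_FETCHTIME_UNIT_NS		32u	/* 32 ns granularity */

/* Decode the fetch time delta, in nanoseconds, from a TQAVCTRL value. */
static uint32_t demo_fetchtime_delta_ns(uint32_t tqavctrl)
{
	uint32_t field = (tqavctrl & DEMO_FETCHTIME_DELTA_MASK) >>
			 DEMO_FETCHTIME_DELTA_SHIFT;

	return field * DEMO_FETCHTIME_UNIT_NS;
}

/* Enabling sets DataTranTIM and the maximum delta; disabling clears both
 * and leaves every other bit untouched.
 */
static uint32_t demo_set_launchtime(uint32_t tqavctrl, bool enable)
{
	if (enable)
		return tqavctrl | DEMO_DATATRANTIM | DEMO_FETCHTIME_DELTA_MASK;

	return tqavctrl & ~(DEMO_DATATRANTIM | DEMO_FETCHTIME_DELTA_MASK);
}
```

With Launchtime enabled the decoded delta is 65535 * 32 = 2097120 ns,
roughly 2.1 ms, so
assert(demo_fetchtime_delta_ns(demo_set_launchtime(0, true)) == 2097120)
holds.
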
 drivers/net/ethernet/intel/igb/e1000_defines.h |  16 +++
 drivers/net/ethernet/intel/igb/igb.h           |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c      | 135 ++++++++++++++++++++++---
 3 files changed, 137 insertions(+), 15 deletions(-)
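Reviewer note, not part of the commit: with two independent features,
FQTSS mode should only be left once neither CBS nor Launchtime is
enabled on any queue. A minimal sketch of that rule follows. The demo_*
names and the queue count are hypothetical, and the exclusive upper
bound in the index check is an assumption about the intended valid
range (0 to num_tx_queues - 1).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_NUM_TX_QUEUES	4

/* Hypothetical adapter state for the sketch. */
struct demo_adapter {
	bool fqtss;
	bool cbs_enable[DEMO_NUM_TX_QUEUES];
	bool launchtime_enable[DEMO_NUM_TX_QUEUES];
};

static bool demo_any_enabled(const bool *flags)
{
	for (int i = 0; i < DEMO_NUM_TX_QUEUES; i++) {
		if (flags[i])
			return true;
	}

	return false;
}

/* Save the per-queue Launchtime flag; valid indices are
 * 0 .. DEMO_NUM_TX_QUEUES - 1.
 */
static int demo_save_txtime_params(struct demo_adapter *a, int queue,
				   bool enable)
{
	if (queue < 0 || queue >= DEMO_NUM_TX_QUEUES)
		return -EINVAL;

	a->launchtime_enable[queue] = enable;

	return 0;
}

/* Leave FQTSS mode only when no queue uses CBS or Launchtime anymore. */
static void demo_offload_apply(struct demo_adapter *a)
{
	assert(a);

	if (!a->fqtss) {
		a->fqtss = true;
		return;
	}

	/* Per-queue register configuration would happen here. */

	if (!demo_any_enabled(a->cbs_enable) &&
	    !demo_any_enabled(a->launchtime_enable))
		a->fqtss = false;
}
```

Disabling Launchtime on its last queue therefore keeps FQTSS mode on as
long as some queue still has CBS enabled, and the other way around.
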

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 83cabff1e0ab..9e357848c550 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -1066,6 +1066,22 @@
 #define E1000_TQAVCTRL_XMIT_MODE	BIT(0)
 #define E1000_TQAVCTRL_DATAFETCHARB	BIT(4)
 #define E1000_TQAVCTRL_DATATRANARB	BIT(8)
+#define E1000_TQAVCTRL_DATATRANTIM	BIT(9)
+#define E1000_TQAVCTRL_SP_WAIT_SR	BIT(10)
+/* Fetch Time Delta - bits 31:16
+ *
+ * This field holds the value to be subtracted from the launch time to
+ * decide when a packet is fetched. The FetchTimeDelta value is defined
+ * in 32 ns granularity.
+ *
+ * This field is 16 bits wide, and so the maximum value is:
+ *
+ * 65535 * 32 ns = 2097120 ns ~= 2.1 msec
+ *
+ * XXX: We are configuring the max value here since we couldn't come up
+ * with a reason for not doing so.
+ */
+#define E1000_TQAVCTRL_FETCHTIME_DELTA	(0xFFFFU << 16)
 
 /* TX Qav Credit Control fields */
 #define E1000_TQAVCC_IDLESLOPE_MASK	0xFFFF
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 1c6b8d9176a8..4e1146efa399 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -281,6 +281,7 @@ struct igb_ring {
 	u16 count;			/* number of desc. in the ring */
 	u8 queue_index;			/* logical index of the ring*/
 	u8 reg_idx;			/* physical index of the ring */
+	bool launchtime_enable;		/* true if LaunchTime is enabled */
 	bool cbs_enable;		/* indicates if CBS is enabled */
 	s32 idleslope;			/* idleSlope in kbps */
 	s32 sendslope;			/* sendSlope in kbps */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 10d7809a85d7..fa931f66a1f8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1684,13 +1684,26 @@ static bool is_any_cbs_enabled(struct igb_adapter *adapter)
 	return false;
 }
 
+static bool is_any_txtime_enabled(struct igb_adapter *adapter)
+{
+	int i;
+
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		if (adapter->tx_ring[i]->launchtime_enable)
+			return true;
+	}
+
+	return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
  *
- *  Configure CBS for a given hardware queue. Parameters are retrieved
- *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  Configure CBS and Launchtime for a given hardware queue.
+ *  Parameters are retrieved from the correct Tx ring, so
+ *  igb_save_cbs_params() and igb_save_txtime_params() should be used
  *  for setting those correctly prior to this function being called.
  **/
 static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
@@ -1704,10 +1717,20 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 	WARN_ON(hw->mac.type != e1000_i210);
 	WARN_ON(queue < 0 || queue > 1);
 
-	if (ring->cbs_enable) {
+	/* If any of the Qav features is enabled, configure queues as SR and
+	 * with HIGH PRIO. If none is, then configure them with LOW PRIO and
+	 * as SP.
+	 */
+	if (ring->cbs_enable || ring->launchtime_enable) {
 		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
 		set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
+	} else {
+		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
+		set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
+	}
 
+	/* If CBS is enabled, set DataTranARB and config its parameters. */
+	if (ring->cbs_enable) {
 		/* Always set data transfer arbitration to credit-based
 		 * shaper algorithm on TQAVCTRL if CBS is enabled for any of
 		 * the queues.
@@ -1783,8 +1806,6 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 		wr32(E1000_I210_TQAVHC(queue),
 		     0x80000000 + ring->hicredit * 0x7735);
 	} else {
-		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
-		set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
 
 		/* Set idleSlope to zero. */
 		tqavcc = rd32(E1000_I210_TQAVCC(queue));
@@ -1805,17 +1826,61 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 		}
 	}
 
+	/* If LaunchTime is enabled, set DataTranTIM. */
+	if (ring->launchtime_enable) {
+		/* Always set DataTranTIM on TQAVCTRL if LaunchTime is enabled
+		 * for any of the SR queues, and configure fetchtime delta.
+		 * XXX NOTE:
+		 *     - LaunchTime will be enabled for all SR queues.
+		 *     - A fixed offset can be added relative to the launch
+		 *       time of all packets if configured at reg LAUNCH_OS0.
+		 *       We are keeping it as 0 for now (default value).
+		 */
+		tqavctrl = rd32(E1000_I210_TQAVCTRL);
+		tqavctrl |= E1000_TQAVCTRL_DATATRANTIM |
+		       E1000_TQAVCTRL_FETCHTIME_DELTA;
+		wr32(E1000_I210_TQAVCTRL, tqavctrl);
+	} else {
+		/* If Launchtime is not enabled for any SR queues anymore,
+		 * then clear DataTranTIM on TQAVCTRL and clear fetchtime delta,
+		 * effectively disabling Launchtime.
+		 */
+		if (!is_any_txtime_enabled(adapter)) {
+			tqavctrl = rd32(E1000_I210_TQAVCTRL);
+			tqavctrl &= ~E1000_TQAVCTRL_DATATRANTIM;
+			tqavctrl &= ~E1000_TQAVCTRL_FETCHTIME_DELTA;
+			wr32(E1000_I210_TQAVCTRL, tqavctrl);
+		}
+	}
+
 	/* XXX: In i210 controller the sendSlope and loCredit parameters from
 	 * CBS are not configurable by software so we don't do any 'controller
 	 * configuration' in respect to these parameters.
 	 */
 
-	netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d hiCredit %d locredit %d\n",
-		   (ring->cbs_enable) ? "enabled" : "disabled", queue,
+	netdev_dbg(netdev, "Qav Tx mode: cbs %s, launchtime %s, queue %d "
+		   "idleslope %d sendslope %d hiCredit %d "
+		   "locredit %d\n",
+		   (ring->cbs_enable) ? "enabled" : "disabled",
+		   (ring->launchtime_enable) ? "enabled" : "disabled", queue,
 		   ring->idleslope, ring->sendslope, ring->hicredit,
 		   ring->locredit);
 }
 
+static int igb_save_txtime_params(struct igb_adapter *adapter, int queue,
+				  bool enable)
+{
+	struct igb_ring *ring;
+
+	if (queue < 0 || queue >= adapter->num_tx_queues)
+		return -EINVAL;
+
+	ring = adapter->tx_ring[queue];
+	ring->launchtime_enable = enable;
+
+	return 0;
+}
+
 static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
 			       bool enable, int idleslope, int sendslope,
 			       int hicredit, int locredit)
@@ -1859,10 +1924,11 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
 		int i, max_queue;
 
 		/* Configure TQAVCTRL register: set transmit mode to 'Qav',
-		 * set data fetch arbitration to 'round robin'.
+		 * set data fetch arbitration to 'round robin', set SP_WAIT_SR
+		 * so SP queues wait for SR ones.
 		 */
 		val = rd32(E1000_I210_TQAVCTRL);
-		val |= E1000_TQAVCTRL_XMIT_MODE;
+		val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_SP_WAIT_SR;
 		val &= ~E1000_TQAVCTRL_DATAFETCHARB;
 		wr32(E1000_I210_TQAVCTRL, val);
 
@@ -2485,7 +2551,7 @@ static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
 
 	igb_config_tx_modes(adapter, queue);
 
-	if (!is_any_cbs_enabled(adapter))
+	if (!is_any_cbs_enabled(adapter) && !is_any_txtime_enabled(adapter))
 		enable_fqtss(adapter, false);
 }
 
@@ -2514,6 +2580,30 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
 	return 0;
 }
 
+static int igb_offload_txtime(struct igb_adapter *adapter,
+			      struct tc_tbs_qopt_offload *qopt)
+{
+	struct e1000_hw *hw = &adapter->hw;
+	int err;
+
+	/* Launchtime offloading is only supported by i210 controller. */
+	if (hw->mac.type != e1000_i210)
+		return -EOPNOTSUPP;
+
+	/* Launchtime offloading is only supported by queues 0 and 1. */
+	if (qopt->queue < 0 || qopt->queue > 1)
+		return -EINVAL;
+
+	err = igb_save_txtime_params(adapter, qopt->queue, qopt->enable);
+
+	if (err)
+		return err;
+
+	igb_offload_apply(adapter, qopt->queue);
+
+	return 0;
+}
+
 static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			void *type_data)
 {
@@ -2522,6 +2612,8 @@ static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
 	switch (type) {
 	case TC_SETUP_QDISC_CBS:
 		return igb_offload_cbs(adapter, type_data);
+	case TC_SETUP_QDISC_TBS:
+		return igb_offload_txtime(adapter, type_data);
 
 	default:
 		return -EOPNOTSUPP;
@@ -5333,11 +5425,14 @@ static void igb_set_itr(struct igb_q_vector *q_vector)
 	}
 }
 
-static void igb_tx_ctxtdesc(struct igb_ring *tx_ring, u32 vlan_macip_lens,
-			    u32 type_tucmd, u32 mss_l4len_idx)
+static void igb_tx_ctxtdesc(struct igb_ring *tx_ring,
+			    struct igb_tx_buffer *first,
+			    u32 vlan_macip_lens, u32 type_tucmd,
+			    u32 mss_l4len_idx)
 {
 	struct e1000_adv_tx_context_desc *context_desc;
 	u16 i = tx_ring->next_to_use;
+	struct timespec64 ts;
 
 	context_desc = IGB_TX_CTXTDESC(tx_ring, i);
 
@@ -5352,9 +5447,18 @@ static void igb_tx_ctxtdesc(struct igb_ring *tx_ring, u32 vlan_macip_lens,
 		mss_l4len_idx |= tx_ring->reg_idx << 4;
 
 	context_desc->vlan_macip_lens	= cpu_to_le32(vlan_macip_lens);
-	context_desc->seqnum_seed	= 0;
 	context_desc->type_tucmd_mlhl	= cpu_to_le32(type_tucmd);
 	context_desc->mss_l4len_idx	= cpu_to_le32(mss_l4len_idx);
+
+	/* We assume there is always a valid tx time available. Invalid times
+	 * should have been handled by the upper layers.
+	 */
+	if (tx_ring->launchtime_enable) {
+		ts = ns_to_timespec64(first->skb->tstamp);
+		context_desc->seqnum_seed = cpu_to_le32(ts.tv_nsec / 32);
+	} else {
+		context_desc->seqnum_seed = 0;
+	}
 }
 
 static int igb_tso(struct igb_ring *tx_ring,
@@ -5437,7 +5541,8 @@ static int igb_tso(struct igb_ring *tx_ring,
 	vlan_macip_lens |= (ip.hdr - skb->data) << E1000_ADVTXD_MACLEN_SHIFT;
 	vlan_macip_lens |= first->tx_flags & IGB_TX_FLAGS_VLAN_MASK;
 
-	igb_tx_ctxtdesc(tx_ring, vlan_macip_lens, type_tucmd, mss_l4len_idx);
+	igb_tx_ctxtdesc(tx_ring, first, vlan_macip_lens,
+			type_tucmd, mss_l4len_idx);
 
 	return 1;
 }
@@ -5492,7 +5597,7 @@ static void igb_tx_csum(struct igb_ring *tx_ring, struct igb_tx_buffer *first)
 	vlan_macip_lens |= skb_network_offset(skb) << E1000_ADVTXD_MACLEN_SHIFT;
 	vlan_macip_lens |= first->tx_flags & IGB_TX_FLAGS_VLAN_MASK;
 
-	igb_tx_ctxtdesc(tx_ring, vlan_macip_lens, type_tucmd, 0);
+	igb_tx_ctxtdesc(tx_ring, first, vlan_macip_lens, type_tucmd, 0);
 }
 
 #define IGB_SET_FLAG(_input, _flag, _result) \
-- 
2.16.2

^ permalink raw reply related	[flat|nested] 129+ messages in thread

* [Intel-wired-lan] [RFC v3 net-next 18/18] igb: Add support for TBS offload
@ 2018-03-07  1:12   ` Jesus Sanchez-Palencia
  0 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07  1:12 UTC (permalink / raw)
  To: intel-wired-lan

Implement HW offload support for SO_TXTIME through igb's Launchtime
feature. This is done by extending igb_setup_tc() so it supports
TC_SETUP_QDISC_TBS and configuring i210 so time based transmit
arbitration is enabled.

The FQTSS transmission mode added before is extended so strict
priority (SP) queues wait for stream reservation (SR) ones.
igb_config_tx_modes() is extended so it can support enabling/disabling
Launchtime following the previous approach used for the credit-based
shaper (CBS).

As in the previous flow, FQTSS transmission mode is enabled automatically
by the driver once Launchtime (or CBS, as before) is enabled.
Similarly, it's automatically disabled when the feature is disabled
for the last queue that had it set up.

The driver just consumes the transmit times from the skbuffs directly,
so no special handling is done in case an 'invalid' time is provided.
We assume this has been handled by the TBS qdisc already.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  16 +++
 drivers/net/ethernet/intel/igb/igb.h           |   1 +
 drivers/net/ethernet/intel/igb/igb_main.c      | 135 ++++++++++++++++++++++---
 3 files changed, 137 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 83cabff1e0ab..9e357848c550 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -1066,6 +1066,22 @@
 #define E1000_TQAVCTRL_XMIT_MODE	BIT(0)
 #define E1000_TQAVCTRL_DATAFETCHARB	BIT(4)
 #define E1000_TQAVCTRL_DATATRANARB	BIT(8)
+#define E1000_TQAVCTRL_DATATRANTIM	BIT(9)
+#define E1000_TQAVCTRL_SP_WAIT_SR	BIT(10)
+/* Fetch Time Delta - bits 31:16
+ *
+ * This field holds the value to be subtracted from the launch time to
+ * decide the fetch time. The FetchTimeDelta value is defined in 32 ns
+ * granularity.
+ *
+ * This field is 16 bits wide, and so the maximum value is:
+ *
+ * 65535 * 32 ns = 2097120 ns ~= 2.1 msec
+ *
+ * XXX: We are configuring the max value here since we couldn't come up
+ * with a reason for not doing so.
+ */
+#define E1000_TQAVCTRL_FETCHTIME_DELTA	(0xFFFFU << 16)
 
 /* TX Qav Credit Control fields */
 #define E1000_TQAVCC_IDLESLOPE_MASK	0xFFFF
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 1c6b8d9176a8..4e1146efa399 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -281,6 +281,7 @@ struct igb_ring {
 	u16 count;			/* number of desc. in the ring */
 	u8 queue_index;			/* logical index of the ring*/
 	u8 reg_idx;			/* physical index of the ring */
+	bool launchtime_enable;		/* true if LaunchTime is enabled */
 	bool cbs_enable;		/* indicates if CBS is enabled */
 	s32 idleslope;			/* idleSlope in kbps */
 	s32 sendslope;			/* sendSlope in kbps */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 10d7809a85d7..fa931f66a1f8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1684,13 +1684,26 @@ static bool is_any_cbs_enabled(struct igb_adapter *adapter)
 	return false;
 }
 
+static bool is_any_txtime_enabled(struct igb_adapter *adapter)
+{
+	int i;
+
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		if (adapter->tx_ring[i]->launchtime_enable)
+			return true;
+	}
+
+	return false;
+}
+
 /**
  *  igb_config_tx_modes - Configure "Qav Tx mode" features on igb
  *  @adapter: pointer to adapter struct
  *  @queue: queue number
  *
- *  Configure CBS for a given hardware queue. Parameters are retrieved
- *  from the correct Tx ring, so igb_save_cbs_params() should be used
+ *  Configure CBS and Launchtime for a given hardware queue.
+ *  Parameters are retrieved from the correct Tx ring, so
+ *  igb_save_cbs_params() and igb_save_txtime_params() should be used
  *  for setting those correctly prior to this function being called.
  **/
 static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
@@ -1704,10 +1717,20 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 	WARN_ON(hw->mac.type != e1000_i210);
 	WARN_ON(queue < 0 || queue > 1);
 
-	if (ring->cbs_enable) {
+	/* If any of the Qav features is enabled, configure queues as SR and
+	 * with HIGH PRIO. If none is, then configure them with LOW PRIO and
+	 * as SP.
+	 */
+	if (ring->cbs_enable || ring->launchtime_enable) {
 		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
 		set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
+	} else {
+		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
+		set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
+	}
 
+	/* If CBS is enabled, set DataTranARB and config its parameters. */
+	if (ring->cbs_enable) {
 		/* Always set data transfer arbitration to credit-based
 		 * shaper algorithm on TQAVCTRL if CBS is enabled for any of
 		 * the queues.
@@ -1783,8 +1806,6 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 		wr32(E1000_I210_TQAVHC(queue),
 		     0x80000000 + ring->hicredit * 0x7735);
 	} else {
-		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
-		set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
 
 		/* Set idleSlope to zero. */
 		tqavcc = rd32(E1000_I210_TQAVCC(queue));
@@ -1805,17 +1826,61 @@ static void igb_config_tx_modes(struct igb_adapter *adapter, int queue)
 		}
 	}
 
+	/* If LaunchTime is enabled, set DataTranTIM. */
+	if (ring->launchtime_enable) {
+		/* Always set DataTranTIM on TQAVCTRL if LaunchTime is enabled
+		 * for any of the SR queues, and configure fetchtime delta.
+		 * XXX NOTE:
+		 *     - LaunchTime will be enabled for all SR queues.
+		 *     - A fixed offset can be added relative to the launch
+		 *       time of all packets if configured at reg LAUNCH_OS0.
+		 *       We are keeping it as 0 for now (default value).
+		 */
+		tqavctrl = rd32(E1000_I210_TQAVCTRL);
+		tqavctrl |= E1000_TQAVCTRL_DATATRANTIM |
+		       E1000_TQAVCTRL_FETCHTIME_DELTA;
+		wr32(E1000_I210_TQAVCTRL, tqavctrl);
+	} else {
+		/* If Launchtime is not enabled for any SR queues anymore,
+		 * then clear DataTranTIM on TQAVCTRL and clear fetchtime delta,
+		 * effectively disabling Launchtime.
+		 */
+		if (!is_any_txtime_enabled(adapter)) {
+			tqavctrl = rd32(E1000_I210_TQAVCTRL);
+			tqavctrl &= ~E1000_TQAVCTRL_DATATRANTIM;
+			tqavctrl &= ~E1000_TQAVCTRL_FETCHTIME_DELTA;
+			wr32(E1000_I210_TQAVCTRL, tqavctrl);
+		}
+	}
+
 	/* XXX: In i210 controller the sendSlope and loCredit parameters from
 	 * CBS are not configurable by software so we don't do any 'controller
 	 * configuration' in respect to these parameters.
 	 */
 
-	netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d hiCredit %d locredit %d\n",
-		   (ring->cbs_enable) ? "enabled" : "disabled", queue,
+	netdev_dbg(netdev, "Qav Tx mode: cbs %s, launchtime %s, queue %d "
+			   "idleslope %d sendslope %d hiCredit %d "
+			   "locredit %d\n",
+		   (ring->cbs_enable) ? "enabled" : "disabled",
+		   (ring->launchtime_enable) ? "enabled" : "disabled", queue,
 		   ring->idleslope, ring->sendslope, ring->hicredit,
 		   ring->locredit);
 }
 
+static int igb_save_txtime_params(struct igb_adapter *adapter, int queue,
+				  bool enable)
+{
+	struct igb_ring *ring;
+
+	if (queue < 0 || queue >= adapter->num_tx_queues)
+		return -EINVAL;
+
+	ring = adapter->tx_ring[queue];
+	ring->launchtime_enable = enable;
+
+	return 0;
+}
+
 static int igb_save_cbs_params(struct igb_adapter *adapter, int queue,
 			       bool enable, int idleslope, int sendslope,
 			       int hicredit, int locredit)
@@ -1859,10 +1924,11 @@ static void igb_setup_tx_mode(struct igb_adapter *adapter)
 		int i, max_queue;
 
 		/* Configure TQAVCTRL register: set transmit mode to 'Qav',
-		 * set data fetch arbitration to 'round robin'.
+		 * set data fetch arbitration to 'round robin', set SP_WAIT_SR
+		 * so SP queues wait for SR ones.
 		 */
 		val = rd32(E1000_I210_TQAVCTRL);
-		val |= E1000_TQAVCTRL_XMIT_MODE;
+		val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_SP_WAIT_SR;
 		val &= ~E1000_TQAVCTRL_DATAFETCHARB;
 		wr32(E1000_I210_TQAVCTRL, val);
 
@@ -2485,7 +2551,7 @@ static void igb_offload_apply(struct igb_adapter *adapter, s32 queue)
 
 	igb_config_tx_modes(adapter, queue);
 
-	if (!is_any_cbs_enabled(adapter))
+	if (!is_any_cbs_enabled(adapter) && !is_any_txtime_enabled(adapter))
 		enable_fqtss(adapter, false);
 }
 
@@ -2514,6 +2580,30 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
 	return 0;
 }
 
+static int igb_offload_txtime(struct igb_adapter *adapter,
+			      struct tc_tbs_qopt_offload *qopt)
+{
+	struct e1000_hw *hw = &adapter->hw;
+	int err;
+
+	/* Launchtime offloading is only supported by i210 controller. */
+	if (hw->mac.type != e1000_i210)
+		return -EOPNOTSUPP;
+
+	/* Launchtime offloading is only supported by queues 0 and 1. */
+	if (qopt->queue < 0 || qopt->queue > 1)
+		return -EINVAL;
+
+	err = igb_save_txtime_params(adapter, qopt->queue, qopt->enable);
+
+	if (err)
+		return err;
+
+	igb_offload_apply(adapter, qopt->queue);
+
+	return 0;
+}
+
 static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			void *type_data)
 {
@@ -2522,6 +2612,8 @@ static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
 	switch (type) {
 	case TC_SETUP_QDISC_CBS:
 		return igb_offload_cbs(adapter, type_data);
+	case TC_SETUP_QDISC_TBS:
+		return igb_offload_txtime(adapter, type_data);
 
 	default:
 		return -EOPNOTSUPP;
@@ -5333,11 +5425,14 @@ static void igb_set_itr(struct igb_q_vector *q_vector)
 	}
 }
 
-static void igb_tx_ctxtdesc(struct igb_ring *tx_ring, u32 vlan_macip_lens,
-			    u32 type_tucmd, u32 mss_l4len_idx)
+static void igb_tx_ctxtdesc(struct igb_ring *tx_ring,
+			    struct igb_tx_buffer *first,
+			    u32 vlan_macip_lens, u32 type_tucmd,
+			    u32 mss_l4len_idx)
 {
 	struct e1000_adv_tx_context_desc *context_desc;
 	u16 i = tx_ring->next_to_use;
+	struct timespec64 ts;
 
 	context_desc = IGB_TX_CTXTDESC(tx_ring, i);
 
@@ -5352,9 +5447,18 @@ static void igb_tx_ctxtdesc(struct igb_ring *tx_ring, u32 vlan_macip_lens,
 		mss_l4len_idx |= tx_ring->reg_idx << 4;
 
 	context_desc->vlan_macip_lens	= cpu_to_le32(vlan_macip_lens);
-	context_desc->seqnum_seed	= 0;
 	context_desc->type_tucmd_mlhl	= cpu_to_le32(type_tucmd);
 	context_desc->mss_l4len_idx	= cpu_to_le32(mss_l4len_idx);
+
+	/* We assume there is always a valid tx time available. Invalid times
+	 * should have been handled by the upper layers.
+	 */
+	if (tx_ring->launchtime_enable) {
+		ts = ns_to_timespec64(first->skb->tstamp);
+		context_desc->seqnum_seed = cpu_to_le32(ts.tv_nsec / 32);
+	} else {
+		context_desc->seqnum_seed = 0;
+	}
 }
 
 static int igb_tso(struct igb_ring *tx_ring,
@@ -5437,7 +5541,8 @@ static int igb_tso(struct igb_ring *tx_ring,
 	vlan_macip_lens |= (ip.hdr - skb->data) << E1000_ADVTXD_MACLEN_SHIFT;
 	vlan_macip_lens |= first->tx_flags & IGB_TX_FLAGS_VLAN_MASK;
 
-	igb_tx_ctxtdesc(tx_ring, vlan_macip_lens, type_tucmd, mss_l4len_idx);
+	igb_tx_ctxtdesc(tx_ring, first, vlan_macip_lens,
+			type_tucmd, mss_l4len_idx);
 
 	return 1;
 }
@@ -5492,7 +5597,7 @@ static void igb_tx_csum(struct igb_ring *tx_ring, struct igb_tx_buffer *first)
 	vlan_macip_lens |= skb_network_offset(skb) << E1000_ADVTXD_MACLEN_SHIFT;
 	vlan_macip_lens |= first->tx_flags & IGB_TX_FLAGS_VLAN_MASK;
 
-	igb_tx_ctxtdesc(tx_ring, vlan_macip_lens, type_tucmd, 0);
+	igb_tx_ctxtdesc(tx_ring, first, vlan_macip_lens, type_tucmd, 0);
 }
 
 #define IGB_SET_FLAG(_input, _flag, _result) \
-- 
2.16.2
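
For reference, the launch time arithmetic used in igb_tx_ctxtdesc() above can be modelled in user space. This is only a sketch: launchtime_from_ns() and fetchtime_delta_ns() are hypothetical helper names, not driver functions, and the 32 ns granularity and the 16-bit FetchTimeDelta width are taken from the comments in this patch. It assumes a non-negative txtime.

```c
#include <stdint.h>

#define NSEC_PER_SEC		1000000000LL
#define LAUNCHTIME_UNIT_NS	32	/* granularity stated in the patch */

/* Model of the value written to seqnum_seed: the sub-second part of the
 * txtime in 32 ns units. The seconds part is discarded, mirroring
 * ns_to_timespec64() followed by tv_nsec / 32.
 * Assumes txtime_ns >= 0.
 */
static inline uint32_t launchtime_from_ns(int64_t txtime_ns)
{
	return (uint32_t)((txtime_ns % NSEC_PER_SEC) / LAUNCHTIME_UNIT_NS);
}

/* Fetch window, in nanoseconds, implied by a FetchTimeDelta field value
 * (bits 31:16 of TQAVCTRL).
 */
static inline uint32_t fetchtime_delta_ns(uint16_t field)
{
	return (uint32_t)field * LAUNCHTIME_UNIT_NS;
}
```

With these helpers, a txtime of 1.5 s maps to a launch time of 15625000 units, and the maximum FetchTimeDelta of 0xFFFF corresponds to 2097120 ns.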


^ permalink raw reply related	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07  1:12   ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  2:53     ` Eric Dumazet
  -1 siblings, 0 replies; 129+ messages in thread
From: Eric Dumazet @ 2018-03-07  2:53 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia, netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

On Tue, 2018-03-06 at 17:12 -0800, Jesus Sanchez-Palencia wrote:
> Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and
> a drop_if_late flag. With this commit the API becomes:
> 
> 

> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d8340e6e8814..951969ceaf65 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -788,6 +788,9 @@ struct sk_buff {
>  	__u8			tc_redirected:1;
>  	__u8			tc_from_ingress:1;
>  #endif
> +	__u8			tc_drop_if_late:1;
> +
> +	clockid_t		txtime_clockid;
>  
>  #ifdef CONFIG_NET_SCHED
>  	__u16			tc_index;	/* traffic control index */


This is adding 32+1 bits to sk_buff, and possibly holes in this very
very hot (and already too fat) structure.

Do we really need 32 bits for a clockid_t ?
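
To illustrate the concern about holes, here is a minimal user-space sketch. The two structs are hypothetical and far simpler than sk_buff; the real layout depends on the configuration and the ABI, so no particular sizes are claimed for the kernel structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real sk_buff layout. */
struct flags_then_u16 {
	uint8_t  a:1;
	uint8_t  b:1;
	uint16_t tc_index;
};

struct flags_then_clockid {
	uint8_t  a:1;
	uint8_t  b:1;
	uint8_t  drop_if_late:1;	/* fits in the existing flag byte */
	int32_t  txtime_clockid;	/* 4-byte alignment may force padding */
	uint16_t tc_index;
};

/* How many bytes the struct grows by once the two new members are added. */
static inline size_t growth_in_bytes(void)
{
	return sizeof(struct flags_then_clockid) -
	       sizeof(struct flags_then_u16);
}
```

On common ABIs the growth is larger than the 4 bytes of the new member itself, because padding is needed both before it and after the trailing 16-bit field. A tool such as pahole shows the actual layout of sk_buff for a given build.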

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07  2:53     ` [Intel-wired-lan] " Eric Dumazet
@ 2018-03-07  5:24       ` Richard Cochran
  -1 siblings, 0 replies; 129+ messages in thread
From: Richard Cochran @ 2018-03-07  5:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesus Sanchez-Palencia, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, intel-wired-lan, anna-maria, henrik, tglx,
	john.stultz, levi.pearson, edumazet, willemb, mlichvar

On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
> This is adding 32+1 bits to sk_buff, and possibly holes in this very
> very hot (and already too fat) structure.
> 
> Do we really need 32 bits for a clockid_t ?

Probably we can live with fewer bits.

For clock IDs with a positive sign, the max possible clock value is 16.

For clock IDs with a negative sign, IIRC, three bits are for the type
code (we also have posix timers packed like this) and the rest are for
the file descriptor.  So maybe we could use 16 bits, allowing 12 bits
or so for encoding the FD.

The downside would be that this forces the application to make sure
to open the dynamic posix clock early enough, before the FD count gets
too high.
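
As a rough check of the numbers above, the following sketch assumes the FD_TO_CLOCKID()-style encoding used for dynamic POSIX clocks: the three low bits hold the type code (CLOCKFD, 3) and the remaining bits hold the inverted file descriptor. The function names are made up for illustration.

```c
#include <stdint.h>

#define CLOCKFD		3	/* type code for fd-based dynamic clocks */

/* Encode an open file descriptor as a dynamic clock ID (always negative). */
static inline int32_t fd_to_clockid(int fd)
{
	return (int32_t)((~(uint32_t)fd << 3) | CLOCKFD);
}

/* Recover the file descriptor from a dynamic clock ID. */
static inline int clockid_to_fd(int32_t clk)
{
	return (int)(~(uint32_t)clk >> 3);
}

/* Would this clock ID survive being stored in a signed 16-bit field? */
static inline int clockid_fits_s16(int32_t clk)
{
	return clk >= INT16_MIN && clk <= INT16_MAX;
}
```

Under that assumption a signed 16-bit field can represent file descriptors 0 through 4095, which matches the "12 bits or so" estimate; a descriptor of 4096 or higher would no longer fit.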

Thanks,
Richard

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 00/18] Time based packet transmission
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07  5:28   ` Richard Cochran
  -1 siblings, 0 replies; 129+ messages in thread
From: Richard Cochran @ 2018-03-07  5:28 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

On Tue, Mar 06, 2018 at 05:12:12PM -0800, Jesus Sanchez-Palencia wrote:
> Design changes since v2:
>  - Now on the dequeue() path, tbs only drops an expired packet if it has the
>    skb->tc_drop_if_late flag set. In practical terms, this will define if
>    the semantics of txtime on a system is "not earlier than" or "not later
>    than" a given timestamp;
>  - Now on the enqueue() path, the qdisc will drop a packet if its clockid
>    doesn't match the qdisc's one;
>  - Sorting the packets based on their txtime is now an option for the disc.
>    Effectively, this means it can be configured in 4 modes: HW offload or
>    SW best-effort, sorting enabled or disabled;

While all of this makes the series and the configuration more complex,
still I like the fact that the interface offers these different modes.

Looking forward to testing this...

Thanks,
Richard

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 01/18] sock: Fix SO_ZEROCOPY switch case
  2018-03-07  1:12   ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07 16:58     ` Willem de Bruijn
  -1 siblings, 0 replies; 129+ messages in thread
From: Willem de Bruijn @ 2018-03-07 16:58 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: Network Development, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko, Vinicius Gomes, Richard Cochran,
	intel-wired-lan, anna-maria, Henrik Austad, Thomas Gleixner,
	John Stultz, Levi Pearson, Eric Dumazet, Willem de Bruijn,
	Miroslav Lichvar

On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
<jesus.sanchez-palencia@intel.com> wrote:
> Fix the SO_ZEROCOPY switch case on sock_setsockopt() avoiding the
> ret values to be overwritten by the one set on the default case.
>
> Fixes: 28190752c7092 ("sock: permit SO_ZEROCOPY on PF_RDS socket")
> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>

Acked-by: Willem de Bruijn <willemb@google.com>

Please send this fix to net-next independent from the rest of the patchset.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path
  2018-03-07  1:12   ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07 16:59     ` Willem de Bruijn
  -1 siblings, 0 replies; 129+ messages in thread
From: Willem de Bruijn @ 2018-03-07 16:59 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: Network Development, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko, Vinicius Gomes, Richard Cochran,
	intel-wired-lan, anna-maria, Henrik Austad, Thomas Gleixner,
	John Stultz, Levi Pearson, Eric Dumazet, Willem de Bruijn,
	Miroslav Lichvar

On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
<jesus.sanchez-palencia@intel.com> wrote:
> This is done in preparation for the upcoming time based transmission
> patchset. Now that skb->tstamp will be used to hold packet's txtime,
> we must ensure that it is being cleared when traversing namespaces.
> Also, doing that from skb_scrub_packet() would break our feature when
> tunnels are used.

Then the right location to move to is skb_scrub_packet below the test for xnet.

> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
> ---
>  include/linux/netdevice.h | 1 +
>  net/core/skbuff.c         | 1 -
>  2 files changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index dbe6344b727a..7104de2bc957 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3379,6 +3379,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
>
>         skb_scrub_packet(skb, true);
>         skb->priority = 0;
> +       skb->tstamp = 0;
>         return 0;
>  }
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 715c13495ba6..678fc5416ae1 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4865,7 +4865,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
>   */
>  void skb_scrub_packet(struct sk_buff *skb, bool xnet)
>  {
> -       skb->tstamp = 0;
>         skb->pkt_type = PACKET_HOST;
>         skb->skb_iif = 0;
>         skb->ignore_df = 0;
> --
> 2.16.2
>
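
The placement suggested above (clearing the timestamp inside skb_scrub_packet(), but only after the xnet test) can be modelled as follows. This is a user-space sketch with a hypothetical struct pkt standing in for the few sk_buff fields involved; it is not the kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few sk_buff fields that matter here. */
struct pkt {
	int64_t  tstamp;	/* would carry the txtime */
	uint32_t mark;
	int      skb_iif;
};

/* Model of the suggested placement: unconditional scrubbing first, then
 * the namespace-crossing test, and only after it the tstamp reset.
 */
static void scrub_packet_model(struct pkt *p, bool xnet)
{
	p->skb_iif = 0;

	if (!xnet)
		return;		/* same namespace: keep the txtime */

	p->mark = 0;
	p->tstamp = 0;		/* crossing namespaces: drop the txtime */
}
```

With this ordering, a caller that stays within one namespace (xnet false) keeps the txtime, while any caller that crosses namespaces clears it without needing a separate assignment at each call site.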

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 05/18] net: ipv4: raw: Hook into time based transmission.
  2018-03-07  1:12   ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07 17:00     ` Willem de Bruijn
  -1 siblings, 0 replies; 129+ messages in thread
From: Willem de Bruijn @ 2018-03-07 17:00 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: Network Development, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko, Vinicius Gomes, Richard Cochran,
	intel-wired-lan, anna-maria, Henrik Austad, Thomas Gleixner,
	John Stultz, Levi Pearson, Eric Dumazet, Willem de Bruijn,
	Miroslav Lichvar, Richard Cochran

On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
<jesus.sanchez-palencia@intel.com> wrote:
> From: Richard Cochran <rcochran@linutronix.de>
>
> For raw packets, copy the desired future transmit time from the CMSG
> cookie into the skb.
>
> Signed-off-by: Richard Cochran <rcochran@linutronix.de>
> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
> ---
>  net/ipv4/raw.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 54648d20bf0f..8e05970ba7c4 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
> @@ -381,6 +381,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
>
>         skb->priority = sk->sk_priority;
>         skb->mark = sk->sk_mark;
> +       skb->tstamp = sockc->transmit_time;

This implements the feature only for the hdrincl case and silently
drops the txtime request on other raw sockets (incl. corked).

At the least, should probably fail if sockc.transmit_time is non-zero
and the hdrincl path is not taken. Or implement by passing through
inet_cork and set in __ip_make_skb. Then be careful to ignore the
field for other protocols, where it may be uninitialized.
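
The first option above (fail rather than silently ignore the request) could look roughly like the check below. This is a user-space sketch only: raw_txtime_check() is a hypothetical name, and the choice of -EINVAL as the error code is an assumption, not something decided in this thread.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: returns 0 if the send may proceed, or -EINVAL if a
 * txtime was requested on a path that would silently ignore it.
 */
static int raw_txtime_check(bool hdrincl, int64_t transmit_time)
{
	if (transmit_time != 0 && !hdrincl)
		return -EINVAL;

	return 0;
}
```

The alternative mentioned above, carrying the value through inet_cork and setting it in __ip_make_skb, would make such a check unnecessary for raw sockets.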

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 06/18] net: ipv4: udp: Hook into time based transmission.
  2018-03-07  1:12   ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07 17:00     ` Willem de Bruijn
  -1 siblings, 0 replies; 129+ messages in thread
From: Willem de Bruijn @ 2018-03-07 17:00 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: Network Development, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko, Vinicius Gomes, Richard Cochran,
	intel-wired-lan, anna-maria, Henrik Austad, Thomas Gleixner,
	John Stultz, Levi Pearson, Eric Dumazet, Willem de Bruijn,
	Miroslav Lichvar, Richard Cochran

On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
<jesus.sanchez-palencia@intel.com> wrote:
> From: Richard Cochran <rcochran@linutronix.de>
>
> For udp packets, copy the desired future transmit time from the CMSG
> cookie into the skb.
>
> Signed-off-by: Richard Cochran <rcochran@linutronix.de>
> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
> ---
>  net/ipv4/udp.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 3013404d0935..d683bbde526b 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -926,6 +926,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>         }
>
>         ipc.sockc.tsflags = sk->sk_tsflags;
> +       ipc.sockc.transmit_time = 0;
>         ipc.addr = inet->inet_saddr;
>         ipc.oif = sk->sk_bound_dev_if;
>
> @@ -1040,8 +1041,10 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>                                   sizeof(struct udphdr), &ipc, &rt,
>                                   msg->msg_flags);
>                 err = PTR_ERR(skb);
> -               if (!IS_ERR_OR_NULL(skb))
> +               if (!IS_ERR_OR_NULL(skb)) {
> +                       skb->tstamp = ipc.sockc.transmit_time;
>                         err = udp_send_skb(skb, fl4);
> +               }

Similar comment to raw: this implements the feature only for a subset of
UDP requests, namely those that can take the fast path.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07  5:24       ` [Intel-wired-lan] " Richard Cochran
@ 2018-03-07 17:01         ` Willem de Bruijn
  -1 siblings, 0 replies; 129+ messages in thread
From: Willem de Bruijn @ 2018-03-07 17:01 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Eric Dumazet, Jesus Sanchez-Palencia, Network Development,
	Jamal Hadi Salim, Cong Wang, Jiří Pírko,
	Vinicius Gomes, intel-wired-lan, anna-maria, Henrik Austad,
	Thomas Gleixner, John Stultz, Levi Pearson, Eric Dumazet,
	Willem de Bruijn, Miroslav Lichvar

On Wed, Mar 7, 2018 at 12:24 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
>> This is adding 32+1 bits to sk_buff, and possibly holes in this very
>> very hot (and already too fat) structure.
>>
>> Do we really need 32 bits for a clockid_t ?
>
> Probably we can live with fewer bits.
>
> For clock IDs with a positive sign, the max possible clock value is 16.
>
> For clock IDs with a negative sign, IIRC, three bits are for the type
> code (we have also posix timers packed like this) and the rest are for
> the file descriptor.  So maybe we could use 16 bits, allowing 12 bits or
> so for encoding the FD.
>
> The downside would be that this forces the application to make sure
> and open the dynamic posix clock early enough before the FD count gets
> too high.

The same choices are probably made for all packets on a given
socket. Unless skb->sk gets scrubbed in some transmit paths,
these could be set as sockopt instead of cmsg.
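
The bit budget in the quoted text can be checked with a small user-space
sketch. It assumes the dynamic clock encoding used by the kernel's
FD_TO_CLOCKID() macro, (~fd << 3) | CLOCKFD with CLOCKFD == 3; the helper
names below are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CLOCKFD 3	/* type code kept in the low three bits */

/* Same value as (~fd << 3) | CLOCKFD, without shifting a negative int. */
static int32_t fd_to_clockid(int fd)
{
	assert(fd >= 0 && fd < (1 << 28));
	/* ~fd == -(fd + 1), and the low three bits of the product are 0. */
	return (int32_t)(~fd) * 8 + CLOCKFD;
}

/* Inverse of fd_to_clockid() for dynamic (negative) clock ids. */
static int clockid_to_fd(int32_t clk)
{
	assert(clk < 0);
	return ~((clk - CLOCKFD) / 8);
}

/* Would this clock id survive being stored in a signed 16-bit field? */
static bool clockid_fits_s16(int32_t clk)
{
	return clk >= INT16_MIN && clk <= INT16_MAX;
}
```

Under that encoding a signed 16-bit field covers file descriptors 0 to
4095, which matches the "12 bits or so" estimate above.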

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07 17:01         ` [Intel-wired-lan] " Willem de Bruijn
@ 2018-03-07 17:35           ` Richard Cochran
  -1 siblings, 0 replies; 129+ messages in thread
From: Richard Cochran @ 2018-03-07 17:35 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Eric Dumazet, Jesus Sanchez-Palencia, Network Development,
	Jamal Hadi Salim, Cong Wang, Jiří Pírko,
	Vinicius Gomes, intel-wired-lan, anna-maria, Henrik Austad,
	Thomas Gleixner, John Stultz, Levi Pearson, Eric Dumazet,
	Willem de Bruijn, Miroslav Lichvar

On Wed, Mar 07, 2018 at 12:01:19PM -0500, Willem de Bruijn wrote:
> The same choices are probably made for all packets on a given
> socket. Unless skb->sk gets scrubbed in some transmit paths,
> these could be set as sockopt instead of cmsg.

The discussion on v2 ended with this per-message idea, in preference
to the per-socket idea, IIRC.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07 17:35           ` [Intel-wired-lan] " Richard Cochran
@ 2018-03-07 17:37             ` Richard Cochran
  -1 siblings, 0 replies; 129+ messages in thread
From: Richard Cochran @ 2018-03-07 17:37 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Eric Dumazet, Jesus Sanchez-Palencia, Network Development,
	Jamal Hadi Salim, Cong Wang, Jiří Pírko,
	Vinicius Gomes, intel-wired-lan, anna-maria, Henrik Austad,
	Thomas Gleixner, John Stultz, Levi Pearson, Eric Dumazet,
	Willem de Bruijn, Miroslav Lichvar

On Wed, Mar 07, 2018 at 09:35:24AM -0800, Richard Cochran wrote:
> The discussion on v2 ended with this per-message idea, in preference
> to the per-socket idea, IIRC.

(But my own opinion is that per-socket is good enough...)
 
Thanks,
Richard

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07 17:37             ` [Intel-wired-lan] " Richard Cochran
@ 2018-03-07 17:47               ` Eric Dumazet
  -1 siblings, 0 replies; 129+ messages in thread
From: Eric Dumazet @ 2018-03-07 17:47 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Willem de Bruijn, Eric Dumazet, Jesus Sanchez-Palencia,
	Network Development, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko, Vinicius Gomes, intel-wired-lan,
	anna-maria, Henrik Austad, Thomas Gleixner, John Stultz,
	Levi Pearson, Willem de Bruijn, Miroslav Lichvar

On Wed, Mar 7, 2018 at 9:37 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Wed, Mar 07, 2018 at 09:35:24AM -0800, Richard Cochran wrote:
>> The discussion on v2 ended with this per-message idea, in preference
>> to the per-socket idea, IIRC.
>
> (But my own opinion is that per-socket is good enough...)
>
> Thanks,
> Richard

I would love it if skb->tstamp could be either 0 or expressed in
ktime_get() base all the time.

(Even if we had to convert this to other bases when/if needed.)

Having to deal with many clockids in the core networking stack seems
over-engineered.
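
As a rough user-space illustration of converting between bases:
ktime_get() is on the CLOCK_MONOTONIC base, so a timestamp taken on
another clock can be rebased by sampling both clocks. The helper names
are hypothetical, and the result is only as good as the gap between the
two samples.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current time of the given clock, in nanoseconds. */
static int64_t clock_now_ns(clockid_t clk)
{
	struct timespec ts;
	int ret = clock_gettime(clk, &ts);

	assert(ret == 0);
	(void)ret;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Rebase a timestamp expressed on clock 'clk' onto CLOCK_MONOTONIC. */
static int64_t to_monotonic_ns(clockid_t clk, int64_t t_ns)
{
	int64_t src_now = clock_now_ns(clk);
	int64_t mono_now = clock_now_ns(CLOCK_MONOTONIC);

	return t_ns + (mono_now - src_now);
}
```

A conversion like this would have to be redone if the source clock is
stepped between the conversion and the transmit time.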

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07  2:53     ` [Intel-wired-lan] " Eric Dumazet
@ 2018-03-07 21:52       ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 21:52 UTC (permalink / raw)
  To: Eric Dumazet, netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

Hi,


On 03/06/2018 06:53 PM, Eric Dumazet wrote:
> On Tue, 2018-03-06 at 17:12 -0800, Jesus Sanchez-Palencia wrote:
>> Extend SO_TXTIME APIs with new per-packet parameters: a clockid_t and
>> a drop_if_late flag. With this commit the API becomes:
>>
>>
> 
>  * diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>  * index d8340e6e8814..951969ceaf65 100644
>  * --- a/include/linux/skbuff.h
>  * +++ b/include/linux/skbuff.h
>  * @@ -788,6 +788,9 @@ struct sk_buff {
>  *  	__u8			tc_redirected:1;
>  *  	__u8			tc_from_ingress:1;
>  *  #endif
>  * +	__u8			tc_drop_if_late:1;
>  * +
>  * +	clockid_t		txtime_clockid;
>  *  
>  *  #ifdef CONFIG_NET_SCHED
>  *  	__u16			tc_index;	/* traffic
>    control index */
> 
> 
> This is adding 32+1 bits to sk_buff, and possibly holes in this very
> very hot (and already too fat) structure.

I should have mentioned this in the commit message, but tc_drop_if_late is
actually filling a 1-bit hole that was already there.


> 
> Do we really need 32 bits for a clockid_t ?

There is a 2-byte hole just after tc_index, so a u16 clockid would fit
perfectly without increasing the skbuff's size / cachelines any further.

From Richard's reply, it seems safe to just change the definition here if we
make the caveat about the maximum possible fd count for dynamic clocks
explicit in the SCM_CLOCKID documentation.

How does that sound?

Thanks,
Jesus
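
The hole argument can be illustrated with toy layouts. These structs are
not the real struct sk_buff; they only assume the usual 4-byte alignment
of 32-bit integers, and all names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 16-bit field followed by a 4-byte aligned field leaves a hole. */
struct tail_before {
	uint16_t tc_index;
	/* 2-byte hole here on common ABIs */
	uint32_t next_field;
};

/* A 16-bit clock id fills the hole, so the size does not change. */
struct tail_u16_clockid {
	uint16_t tc_index;
	uint16_t txtime_clockid;
	uint32_t next_field;
};

/* A 32-bit clock id cannot use the hole, so the struct grows. */
struct tail_s32_clockid {
	uint16_t tc_index;
	int32_t  txtime_clockid;
	uint32_t next_field;
};

static_assert(sizeof(struct tail_u16_clockid) == sizeof(struct tail_before),
	      "u16 clockid fits in the existing hole");
static_assert(sizeof(struct tail_s32_clockid) > sizeof(struct tail_before),
	      "32-bit clockid grows the struct");
```

On a real tree, pahole on the built struct is the usual way to confirm
where the holes actually are.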

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 02/18] net: Clear skb->tstamp only on the forwarding path
  2018-03-07 16:59     ` [Intel-wired-lan] " Willem de Bruijn
@ 2018-03-07 22:03       ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-07 22:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko, Vinicius Gomes, Richard Cochran,
	intel-wired-lan, anna-maria, Henrik Austad, Thomas Gleixner,
	John Stultz, Levi Pearson, Eric Dumazet, Willem de Bruijn,
	Miroslav Lichvar



On 03/07/2018 08:59 AM, Willem de Bruijn wrote:
> On Tue, Mar 6, 2018 at 8:12 PM, Jesus Sanchez-Palencia
> <jesus.sanchez-palencia@intel.com> wrote:
>> This is done in preparation for the upcoming time based transmission
>> patchset. Now that skb->tstamp will be used to hold packet's txtime,
>> we must ensure that it is being cleared when traversing namespaces.
>> Also, doing that from skb_scrub_packet() would break our feature when
>> tunnels are used.
> 
> Then the right location to move to is skb_scrub_packet below the test for xnet.

Fixed, thanks.



> 
>> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>> ---
>>  include/linux/netdevice.h | 1 +
>>  net/core/skbuff.c         | 1 -
>>  2 files changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index dbe6344b727a..7104de2bc957 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3379,6 +3379,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
>>
>>         skb_scrub_packet(skb, true);
>>         skb->priority = 0;
>> +       skb->tstamp = 0;
>>         return 0;
>>  }
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 715c13495ba6..678fc5416ae1 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -4865,7 +4865,6 @@ EXPORT_SYMBOL(skb_try_coalesce);
>>   */
>>  void skb_scrub_packet(struct sk_buff *skb, bool xnet)
>>  {
>> -       skb->tstamp = 0;
>>         skb->pkt_type = PACKET_HOST;
>>         skb->skb_iif = 0;
>>         skb->ignore_df = 0;
>> --
>> 2.16.2
>>

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07 21:52       ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-07 22:45         ` Eric Dumazet
  -1 siblings, 0 replies; 129+ messages in thread
From: Eric Dumazet @ 2018-03-07 22:45 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia, netdev
  Cc: jhs, xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

On Wed, 2018-03-07 at 13:52 -0800, Jesus Sanchez-Palencia wrote:
> Hi,
...
> I should have mentioned on the commit msg, but the tc_drop_if_late is
> actually
> filling a 1 bit hole that was already there.
> 
> 
> > 
> > Do we really need 32 bits for a clockid_t ?
> 
> There is a 2 bytes hole just after tc_index, so a u16 clockid would
> fit
> perfectly without increasing the skbuffs size / cachelines any
> further.
> 
> From Richard's reply, it seems safe to just change the definition
> here if we
> make it explicit on the SCM_CLOCKID documentation the caveat about
> the max
> possible fd count for dynamic clocks.
> 
> How does that sound?

Not convincing really :/

The next big feature needing one bit in sk_buff will add it, and add a
63-bit hole.

Then the next feature(s) will happily consume it 'because there are holes
anyway'.

Then at some point we will cross a cache line boundary and performance
will take a 10% hit.

It is a never-ending trend.

If you really need 33 bits, then maybe we'll ask you to guard the new
bits with some #if IS_ENABLED(CONFIG_...) so that we can opt-out.

Why do we _really_ need dynamic clocks supported in the core
networking stack, other than 'that is needed to send 2 packets per
second with precise departure time and arbitrary user-defined clocks,
so let's do that, and not care about the other 10,000,000 packets we
receive/send per second'?

I have one patch (TXCS, something that I called XPS in the past)
implementing the remote freeing of skbs, which helps workloads where skbs
are produced on CPU A and consumed on CPU B, using an additional 16-bit
field. I have not upstreamed it yet (even if Mellanox folks want it),
simply because of this additional field...

Maybe I should eat this hole before you take it ?

No, we need to be extra careful.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07 22:45         ` [Intel-wired-lan] " Eric Dumazet
@ 2018-03-07 23:03           ` David Miller
  -1 siblings, 0 replies; 129+ messages in thread
From: David Miller @ 2018-03-07 23:03 UTC (permalink / raw)
  To: eric.dumazet
  Cc: jesus.sanchez-palencia, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, richardcochran, intel-wired-lan, anna-maria,
	henrik, tglx, john.stultz, levi.pearson, edumazet, willemb,
	mlichvar

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 07 Mar 2018 14:45:45 -0800

> No, we need to be extra careful.

+1

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07 22:45         ` [Intel-wired-lan] " Eric Dumazet
@ 2018-03-08 11:37           ` Miroslav Lichvar
  -1 siblings, 0 replies; 129+ messages in thread
From: Miroslav Lichvar @ 2018-03-08 11:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesus Sanchez-Palencia, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, richardcochran, intel-wired-lan, anna-maria,
	henrik, tglx, john.stultz, levi.pearson, edumazet, willemb

On Wed, Mar 07, 2018 at 02:45:45PM -0800, Eric Dumazet wrote:
> On Wed, 2018-03-07 at 13:52 -0800, Jesus Sanchez-Palencia wrote:
> > > Do we really need 32 bits for a clockid_t ?
> > 
> > There is a 2 bytes hole just after tc_index, so a u16 clockid would
> > fit
> > perfectly without increasing the skbuffs size / cachelines any
> > further.

> Not convincing really :/
> 
> Next big feature needing one bit in sk_buff will add it, and add a
> 63bit hole.

Would it be possible to put the clockid in skb_shared_info? If that's
technically difficult or does not make sense, I'm ok with the clockid
being a socket option.

If a packet is sent immediately after changing the clockid via
setsockopt(), will it still be guaranteed that the packet is
restricted by the new id?

> Why do we _really_ need dynamic clocks being supported in core
> networking stack, other than 'that is needed to send 2 packets per
> second with precise departure time and arbitrary user defined clocks,
> so lets do that, and do not care of the other 10,000,000 packets we
> receive/send per second'

Well, I'd not expect it to be a common use case, but a public NTP
server could be sending millions of packets per second during traffic
peaks (typically at *:00:00) over multiple interfaces.

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 00/18] Time based packet transmission
  2018-03-07  1:12 ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-08 14:09   ` Henrik Austad
  -1 siblings, 0 replies; 129+ messages in thread
From: Henrik Austad @ 2018-03-08 14:09 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

[-- Attachment #1: Type: text/plain, Size: 7758 bytes --]

On Tue, Mar 06, 2018 at 05:12:12PM -0800, Jesus Sanchez-Palencia wrote:
> This series is the v3 of the Time based packet transmission RFC, which was
> originally proposed by Richard Cochran (v1: https://lwn.net/Articles/733962/ )
> and further developed by us with the addition of the tbs qdisc
> (v2: https://lwn.net/Articles/744797/ ).

Nice!

> It introduces a new socket option (SO_TXTIME), a new qdisc (tbs) and
> implements support for hw offloading on the igb driver for the Intel
> i210 NIC. The tbs qdisc also supports SW best effort that can be used
> as a fallback.
> 
> The main changes since v2 can be found below.
> 
> Fixes since v2:
>  - skb->tstamp is only cleared on the forwarding path;
>  - ktime_t is no longer the type used for timestamps (s64 is);
>  - get_unaligned() is now used for copying data from the cmsg header;
>  - added getsockopt() support for SO_TXTIME;
>  - restricted SO_TXTIME input range to [0,1];
>  - removed ns_capable() check from __sock_cmsg_send();
>  - the qdisc control struct now uses a 32-bit bitmap for config flags;
>  - fixed qdisc backlog decrement bug;
>  - 'overlimits' is now incremented on dequeue() drops in addition to the
>    'dropped' counter;
> 
> Interface changes since v2:
>  * CMSG interface:
>    - added a per-packet clockid parameter to the cmsg (SCM_CLOCKID);
>    - added a per-packet drop_if_late flag to the cmsg (SCM_DROP_IF_LATE);
>  * tc-tbs:
>    - clockid now receives a string;
>      e.g.: CLOCK_REALTIME or /dev/ptp0
>    - offload is now a standalone argument (i.e. no more offload 1);
>    - sorting is now an argument that enables txtime-based sorting provided
>      by the qdisc;
> 
> Design changes since v2:
>  - Now on the dequeue() path, tbs only drops an expired packet if it has the
>    skb->tc_drop_if_late flag set. In practical terms, this will define if
>    the semantics of txtime on a system is "not earlier than" or "not later
>    than" a given timestamp;
>  - Now on the enqueue() path, the qdisc will drop a packet if its clockid
>    doesn't match the qdisc's one;
>  - Sorting the packets based on their txtime is now an option for the qdisc.
>    Effectively, this means it can be configured in 4 modes: HW offload or
>    SW best-effort, sorting enabled or disabled;
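
As a rough illustration of the enqueue() and dequeue() rules quoted above, here
is a minimal Python sketch. It is not the kernel code: the Packet class and
both function names are hypothetical, and sending a late packet that does not
carry the flag is only my reading of the "not earlier than" semantics.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    txtime_ns: int       # requested transmission time
    clockid: str         # clock the txtime refers to, e.g. "CLOCK_REALTIME"
    drop_if_late: bool   # per-packet flag (SCM_DROP_IF_LATE in the cover letter)


def accept_on_enqueue(pkt, qdisc_clockid):
    """enqueue(): reject a packet whose clockid differs from the qdisc's."""
    return pkt.clockid == qdisc_clockid


def action_on_dequeue(pkt, now_ns):
    """dequeue(): decide what to do with the packet at the head of the queue.

    An expired packet is dropped only if it asked for that.
    """
    expired = pkt.txtime_ns < now_ns
    if expired and pkt.drop_if_late:
        return "drop"    # "not later than" semantics
    return "send"        # on time, or late but still wanted ("not earlier than")
```

With this model the same late packet is dropped or sent purely depending on
the flag it carries, which is what moves the policy from the qdisc to the
application.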

A lot of new knobs. I see the need; I would've liked to have fewer, but
you've documented them pretty well. Perhaps we should add something to
Documentation/ at some stage?

Anyways, the patches applied cleanly so I gave them a (very) quick spin,
using udp_tai on the sender and tcpdump at the other end to grab the frames.

Setting up with hw offload and sorting in qdisc.

Sender (every 10ms) (4.16-rc4 on a core2duo 1.8GHz w/i210 and max_rss
bypass, as dual-core and i210 are not friends):

udp_tai -c1 -i eth2 -p 20 -P 10000000

Receiver (imx7, kernel 4.9.11):
chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 256" > tai_imx7.log

Note: this involves 2 switches and a somewhat hackish kernel running on the
receiver, so these numbers can only improve.

count    2340.000000
mean        0.043770
std         0.047784
min         0.009025
25%         0.010003
50%         0.010010
75%         0.109998
max         0.120060

I have to dig more into why this is happening: a lot of frames are delayed much
more than I'd expect, but at this stage I'm pretty sure this is PEBKAC. One
obvious fix is to move some hw around and do a direct link, but I didn't have
time for that right now.

I'm very interested in doing what Richard's original test was when he used 
ptp-synched clocks and also used hw receive-time and compared with expected 
tx-time. So, while I'm getting that up and running, I thought I should 
share the early results.

-Henrik

> The tbs qdisc is designed so it buffers packets until a configurable time before
> their deadline (tx times). If sorting is enabled, regardless of HW offload or SW
> fallback modes, the qdisc uses a rbtree internally so the buffered packets are
> always 'ordered' by the earliest deadline.
> 
> If sorting is disabled, then for HW offload the qdisc will use a 'raw' FIFO
> through qdisc_enqueue_tail() / qdisc_dequeue_head(), whereas for SW best-effort,
> it will use a 'scheduled' FIFO.
> 
> The other configurable parameter from the tbs qdisc is the clockid to be used.
> In order to provide that, this series adds a new API to pkt_sched.h (i.e.
> qdisc_watchdog_init_clockid()).
> 
> The tbs qdisc will drop any packets with a transmission time in the past or
> when a deadline is missed if SCM_DROP_IF_LATE is set. Queueing packets in
> advance plus configuring the delta parameter for the system correctly makes
> all the difference in reducing the number of drops. Moreover, note that the
> delta parameter ends up defining the Tx time when SW best-effort is used
> given that the timestamps won't be used by the NIC in this case.
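
To make the role of delta concrete, here is a small Python model of the sorted
mode described above. It is only a sketch under assumptions: heapq stands in
for the kernel rbtree, and the TxtimeQueue class and its method names are
invented for illustration.

```python
import heapq
import itertools


class TxtimeQueue:
    """Toy model: packets sorted by txtime and released delta_ns early."""

    def __init__(self, delta_ns):
        self.delta_ns = delta_ns
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO order for equal txtimes

    def enqueue(self, txtime_ns, payload):
        heapq.heappush(self._heap, (txtime_ns, next(self._seq), payload))

    def dequeue(self, now_ns):
        """Return the earliest payload if it is due, else None."""
        if self._heap and now_ns >= self._heap[0][0] - self.delta_ns:
            return heapq.heappop(self._heap)[2]
        return None

    def next_wakeup_ns(self):
        """Time at which a watchdog timer would fire next, or None if empty."""
        return self._heap[0][0] - self.delta_ns if self._heap else None


q = TxtimeQueue(delta_ns=100_000)
q.enqueue(2_000_000, "second")
q.enqueue(1_000_000, "first")       # enqueued later, but has the earlier deadline
print(q.dequeue(now_ns=899_999))    # too early, nothing is released
print(q.dequeue(now_ns=900_000))    # exactly delta before the first txtime
print(q.next_wakeup_ns())
```

In SW best-effort mode nothing downstream looks at the timestamp again, so the
moment returned by next_wakeup_ns() is effectively the transmit time; that is
why the choice of delta matters so much there.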
> 
> Examples:
> 
> # SW best-effort with sorting #
> 
>     $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>                map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
>     $ tc qdisc add dev enp2s0 parent 100:1 tbs delta 100000 \
>                clockid CLOCK_REALTIME sorting
> 
>     In this example first the mqprio qdisc is setup, then the tbs qdisc is
>     configured onto the first hw Tx queue using SW best-effort with sorting
>     enabled. Also, it is configured so the timestamps on each packet are in
>     reference to the clockid CLOCK_REALTIME and so packets are dequeued from
>     the qdisc 100000 nanoseconds before their transmission time.
> 
> 
> # HW offload without sorting #
> 
>     $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>                map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
>     $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
> 
>     In this example, the Qdisc will use HW offload for the control of the
>     transmission time through the network adapter. It's assumed implicitly
>     the timestamp in skbuffs are in reference to the interface's PHC and
>     setting any other valid clockid would be treated as an error. Because
>     there is no scheduling being performed in the qdisc, setting a delta != 0
>     would also be considered an error.
> 
> 
> # HW offload with sorting #
>     $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>                map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
>     $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
>                clockid CLOCK_REALTIME sorting
> 
>     Here, the Qdisc will use HW offload for the txtime control again,
>     but now sorting will be enabled, and thus there will be scheduling being
>     performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
>     and packets leave the Qdisc "delta" (100000) nanoseconds before
>     their transmission time. Because this will be using HW offload and
>     since dynamic clocks are not supported by the hrtimer, the system clock
>     and the PHC clock must be synchronized for this mode to behave as expected.
> 
> 
> For testing, we've followed a similar approach from the v1 and v2 testing and
> no significant changes on the results were observed. An updated version of
> udp_tai.c is attached to this cover letter.
> 
> Lastly, most of the To Dos we still have before a final patchset are related
> to further testing the igb support:
>  - testing with L2 only talkers + AF_PACKET sockets;
>  - testing tbs in conjunction with cbs;
> 
> Thanks for all the feedback so far,
> Jesus

-Henrik

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-08 11:37           ` [Intel-wired-lan] " Miroslav Lichvar
@ 2018-03-08 16:25             ` David Miller
  -1 siblings, 0 replies; 129+ messages in thread
From: David Miller @ 2018-03-08 16:25 UTC (permalink / raw)
  To: mlichvar
  Cc: eric.dumazet, jesus.sanchez-palencia, netdev, jhs,
	xiyou.wangcong, jiri, vinicius.gomes, richardcochran,
	intel-wired-lan, anna-maria, henrik, tglx, john.stultz,
	levi.pearson, edumazet, willemb

From: Miroslav Lichvar <mlichvar@redhat.com>
Date: Thu, 8 Mar 2018 12:37:22 +0100

> Well, I'd not expect it to be a common use case, but a public NTP
> server could be sending millions of packets per second in traffic
> peaks (typically at *:00:00) over multiple interfaces.

That's the problem.

Bloating up sk_buff for an uncommon use case, penalizing all others,
is a non-starter.

Sorry.

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07 17:47               ` [Intel-wired-lan] " Eric Dumazet
@ 2018-03-08 16:44                 ` Richard Cochran
  -1 siblings, 0 replies; 129+ messages in thread
From: Richard Cochran @ 2018-03-08 16:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Eric Dumazet, Jesus Sanchez-Palencia,
	Network Development, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko, Vinicius Gomes, intel-wired-lan,
	anna-maria, Henrik Austad, Thomas Gleixner, John Stultz,
	Levi Pearson, Willem de Bruijn, Miroslav Lichvar

On Wed, Mar 07, 2018 at 09:47:40AM -0800, Eric Dumazet wrote:
> I would love if skb->tstamp could be either 0 or expressed in
> ktime_get() base all the time.
> 
> ( Even if we would have to convert this to other bases when/if needed)

We really do need variable clock IDs.  Otherwise the HW offloading
case won't work.  The desired transmit time must be expressed in terms
of the clock inside the MAC.  This clock is not necessarily related to
the system time at all.

But in addition to the performance concerns, I think putting this into
a socket option is the more natural solution.
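
For readers who have not used a MAC/PHC clock from user space: the clock is
exposed as a dynamic POSIX clock whose id is derived from an open file
descriptor on the PTP character device. The Python sketch below mirrors the
FD_TO_CLOCKID macro used by the kernel's PTP test tools; the function names are
mine, /dev/ptp0 is only an example path, and passing a dynamic clock id to
time.clock_gettime_ns() is assumed to work on Linux.

```python
import os
import time

CLOCKFD = 3  # low bits that mark a clock id as "derived from a file descriptor"


def fd_to_clockid(fd):
    """Map an open PHC file descriptor to a dynamic POSIX clock id."""
    return (~fd << 3) | CLOCKFD


def read_phc_ns(path="/dev/ptp0"):
    """Read the PHC behind `path` in nanoseconds (needs permission to open it)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return time.clock_gettime_ns(fd_to_clockid(fd))
    finally:
        os.close(fd)
```

The value returned by read_phc_ns() is in the NIC's own timebase, which need
not be anywhere near CLOCK_REALTIME unless something disciplines one clock to
the other.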

Thanks,
Richard

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-08 16:44                 ` [Intel-wired-lan] " Richard Cochran
@ 2018-03-08 17:56                   ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-08 17:56 UTC (permalink / raw)
  To: Richard Cochran, Eric Dumazet
  Cc: Willem de Bruijn, Eric Dumazet, Network Development,
	Jamal Hadi Salim, Cong Wang, Jiří Pírko,
	Vinicius Gomes, intel-wired-lan, anna-maria, Henrik Austad,
	Thomas Gleixner, John Stultz, Levi Pearson, Willem de Bruijn,
	Miroslav Lichvar

Hi,


On 03/08/2018 08:44 AM, Richard Cochran wrote:
> On Wed, Mar 07, 2018 at 09:47:40AM -0800, Eric Dumazet wrote:
>> I would love if skb->tstamp could be either 0 or expressed in
>> ktime_get() base all the time.
>>
>> ( Even if we would have to convert this to other bases when/if needed)
> 
> We really do need variable clock IDs.  Otherwise the HW offloading
> case won't work.  The desired transmit time must be expressed in terms
> of the clock inside the MAC.  This clock is not necessarily related to
> the system time at all.
> 
> But in addition to the performance concerns, I think putting this into
> a socket option is the more natural solution.


Ok, so we have it settled for clockid now. Providing it per-socket was what we'd
proposed previously, so this was just an attempt to accommodate all the feedback
we got on the v2 RFC.

What about the tc_drop_if_late bit, though? Would it be acceptable to keep it
per-packet, using up the 1-bit hole in sk_buff, if we guarded it with an #if
(e.g. on CONFIG_NET_SCH_TBS)?
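
The point about filling a hole can be checked with a toy layout. The ctypes
sketch below is not the sk_buff layout; the two structures and the some_flags
field are made up, and they only show that a 1-bit flag placed in an unused bit
of an existing bitfield byte leaves the structure size unchanged.

```python
import ctypes


class FlagsWithoutBit(ctypes.Structure):
    # 7 of the 8 bits in this byte are used; the last bit is a hole.
    _fields_ = [
        ("some_flags", ctypes.c_uint8, 7),
    ]


class FlagsWithBit(ctypes.Structure):
    _fields_ = [
        ("some_flags", ctypes.c_uint8, 7),
        # In C this member would sit behind the #if guard.
        ("tc_drop_if_late", ctypes.c_uint8, 1),
    ]


# Both sizes should come out equal: the flag occupies the former hole.
print(ctypes.sizeof(FlagsWithoutBit), ctypes.sizeof(FlagsWithBit))
```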


Thanks,
Jesus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 00/18] Time based packet transmission
  2018-03-08 14:09   ` [Intel-wired-lan] " Henrik Austad
@ 2018-03-08 18:06     ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-08 18:06 UTC (permalink / raw)
  To: Henrik Austad
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

Hi,


On 03/08/2018 06:09 AM, Henrik Austad wrote:

(...)

> 
> A lot of new knobs. I see the need; I would've liked to have fewer, but
> you've documented them pretty well. Perhaps we should add something to
> Documentation/ at some stage?

Sure. The idea is to work on that once the interfaces have been accepted.


> 
> Anyways, the patches applied cleanly so I gave them a (very) quick spin,
> using udp_tai on the sender and tcpdump at the other end to grab the frames.
> 
> Setting up with hw offload and sorting in qdisc.
> 
> Sender (every 10ms) (4.16-rc4 on a core2duo 1.8GHz w/i210 and max_rss
> bypass, as dual-core and i210 are not friends):
> 
> udp_tai -c1 -i eth2 -p 20 -P 10000000
> 
> Receiver (imx7, kernel 4.9.11):
> chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 256" > tai_imx7.log
> 
> Note: this involves 2 switches and a somewhat hackish kernel running on the
> receiver, so these numbers can only improve.
> 
> count    2340.000000
> mean        0.043770
> std         0.047784
> min         0.009025
> 25%         0.010003
> 50%         0.010010
> 75%         0.109998
> max         0.120060
> 

Thanks for giving it a shot.

But I'm not sure I follow the numbers above, sorry :/
Are you computing the packet's Rx timestamp offset from the (expected) Tx time?


> I have to dig more into why this is happening: a lot of frames are delayed much
> more than I'd expect, but at this stage I'm pretty sure this is PEBKAC. One
> obvious fix is to move some hw around and do a direct link, but I didn't have
> time for that right now.
> 
> I'm very interested in doing what Richard's original test was when he used 
> ptp-synched clocks and also used hw receive-time and compared with expected 
> tx-time. So, while I'm getting that up and running, I thought I should 
> share the early results.


Sure, thanks. Which delta and clockid are you using, please?
Also, was this clock synchronized to the PHC? You need that for hw offload with
sorting enabled.

Thanks,
Jesus

(...)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 00/18] Time based packet transmission
  2018-03-08 18:06     ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-08 22:54       ` Henrik Austad
  -1 siblings, 0 replies; 129+ messages in thread
From: Henrik Austad @ 2018-03-08 22:54 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

On Thu, Mar 08, 2018 at 10:06:46AM -0800, Jesus Sanchez-Palencia wrote:
> Hi,
> 
> 
> On 03/08/2018 06:09 AM, Henrik Austad wrote:
> 
> (...)
> 
> > 
> > A lot of new knobs. I see the need; I would've liked to have fewer, but
> > you've documented them pretty well. Perhaps we should add something to
> > Documentation/ at some stage?
> 
> Sure. The idea is to work on that once the interfaces have been accepted.

Yeah, probably a good idea.

> > Anyways, the patches applied cleanly so I gave them a (very) quick spin,
> > using udp_tai on the sender and tcpdump at the other end to grab the frames.
> > 
> > Setting up with hw offload and sorting in qdisc.
> > 
> > Sender (every 10ms) (4.16-rc4 on a core2duo 1.8GHz w/i210 and max_rss
> > bypass, as dual-core and i210 are not friends):
> > 
> > udp_tai -c1 -i eth2 -p 20 -P 10000000
> > 
> > Receiver (imx7, kernel 4.9.11):
> > chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 256" > tai_imx7.log
> > 
> > Note: this involves 2 switches and a somewhat hackish kernel running on the
> > receiver, so these numbers can only improve.
> > 
> > count    2340.000000
> > mean        0.043770
> > std         0.047784
> > min         0.009025
> > 25%         0.010003
> > 50%         0.010010
> > 75%         0.109998
> > max         0.120060
> > 
> 
> Thanks for giving it a shot.
> 
> But I'm not sure I follow the numbers above, sorry :/
> Are you computing the packet's Rx timestamp offset from the (expected) Tx time?

Just looking at the timestamp when the frames were received. They should be
sent at regular intervals if I read udp_tai.c correctly, so the assumption
was that the timestamp from tcpdump should give an inkling of how well it
worked.

I set it up to send a frame every 10ms and computed the diff between each 
UDP packet received. Nothing fancy, just tcpdump and grep for the 
timestamp and look at the distribution.
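
For anyone wanting to reproduce that post-processing, a minimal Python sketch
follows. It assumes tcpdump's default HH:MM:SS.ffffff timestamp as the first
field of each line and a capture that does not cross midnight; the sample lines
are made up for illustration and are not taken from tai_imx7.log.

```python
import statistics
from datetime import datetime


def interarrival_seconds(lines):
    """Return the gaps, in seconds, between consecutive tcpdump lines."""
    stamps = [
        datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        for line in lines
        if line.strip()
    ]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Made-up sample input in tcpdump's default text format.
sample = [
    "14:09:00.000000 IP 10.0.0.1.7788 > 10.0.0.2.7788: UDP, length 256",
    "14:09:00.010003 IP 10.0.0.1.7788 > 10.0.0.2.7788: UDP, length 256",
    "14:09:00.020013 IP 10.0.0.1.7788 > 10.0.0.2.7788: UDP, length 256",
    "14:09:00.130011 IP 10.0.0.1.7788 > 10.0.0.2.7788: UDP, length 256",
]

gaps = interarrival_seconds(sample)
print("count", len(gaps))
print("mean ", statistics.mean(gaps))
print("std  ", statistics.stdev(gaps))
print("min  ", min(gaps))
print("quartiles", statistics.quantiles(gaps, n=4))
print("max  ", max(gaps))
```

Note that this measures spacing as seen by the receiver's software timestamps,
so it mixes sender jitter with switch and receive-path latency variation.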

> > I have to dig more into why this is happening, a lot frames delayed much 
> > more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
> > obvious fix is move some hw around and do a direct link, but I didn't have 
> > time for that right now.
> > 
> > I'm very interested in doing what Richard's original test was when he used 
> > ptp-synched clocks and also used hw receive-time and compared with expected 
> > tx-time. So, while I'm getting that up and running, I thought I should 
> > share the early results.
> 
> Sure, thanks. Which delta and clockid are you using, please?

I used the example provided in the cover letter (00/18):

tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

tc qdisc add dev eth2 parent 100:1 tbs offload delta 100000 clockid \
 CLOCK_REALTIME sorting

> Also, was this clock synchronized to the PHC? You need that for hw offload with
> sorting enabled.

Hmm, good point. No, the NIC clock was not synchronized; I'll do that in the
next round for both sender and receiver!
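
A quick way to see how far two clocks are apart before such a run is to
sandwich a read of one clock between two reads of the other. The Python helper
below is a hypothetical sketch (clock_offset_ns is not an existing tool); it is
shown with two system clocks, and for a PHC the second argument would have to
be the dynamic clock id derived from an open /dev/ptpX descriptor, i.e.
(~fd << 3) | 3.

```python
import time


def clock_offset_ns(clock_a, clock_b):
    """Estimate (clock_b - clock_a) in nanoseconds.

    clock_b is read once between two reads of clock_a and compared against
    the midpoint, which cancels most of the read latency.
    """
    t1 = time.clock_gettime_ns(clock_a)
    tb = time.clock_gettime_ns(clock_b)
    t2 = time.clock_gettime_ns(clock_a)
    return tb - (t1 + t2) // 2


# Example with two system clocks; their offset is large and uninteresting,
# it only demonstrates the call.
print(clock_offset_ns(time.CLOCK_REALTIME, time.CLOCK_MONOTONIC))
```

If the offset between CLOCK_REALTIME and the PHC is not close to zero, the
qdisc timer and the NIC launch time refer to different instants, which would
be consistent with the irregular spacing reported earlier in this thread.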

-henrik

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission
@ 2018-03-08 22:54       ` Henrik Austad
  0 siblings, 0 replies; 129+ messages in thread
From: Henrik Austad @ 2018-03-08 22:54 UTC (permalink / raw)
  To: intel-wired-lan

On Thu, Mar 08, 2018 at 10:06:46AM -0800, Jesus Sanchez-Palencia wrote:
> Hi,
> 
> 
> On 03/08/2018 06:09 AM, Henrik Austad wrote:
> 
> (...)
> 
> > 
> > A lot of new knobs, I see the need, I would've like to have fewer, but 
> > you've documented them pretty well. Perhaps we should add something to 
> > Documentation/ at one stage?
> 
> Sure. The idea is working on that once the interfaces have been accepted.

Yeah, probably a good idea.

> > Anyways, the patches applied cleanly so I gave them a (very) quick spin. 
> > Using udp_tai and tcpdump in the other end to grab the frames
> > 
> > Setting up with hw offload and sorting in qdisc.
> > 
> > Sender (every 10ms) (4.16-rc4 on a core2duo 1.8Ghz w/i210 and max_rss 
> > bypass as dual-core and i210 is not friends):
> > 
> > udp_tai -c1 -i eth2 -p 20 -P 10000000
> > 
> > Receiver (imx7, kernel 4.9.11):
> > chrt -r 20 tcpdump -i eth0 ether host a0:36:9f:3f:c0:b8 | grep "UDP, length 256" > tai_imx7.log
> > 
> > Note: this involves 2 swtiches and a somewhat hackish kernel running on the 
> > receiver, so these numbers can only improve.
> > 
> > count    2340.000000
> > mean        0.043770
> > std         0.047784
> > min         0.009025
> > 25%         0.010003
> > 50%         0.010010
> > 75%         0.109998
> > max         0.120060
> > 
> 
> Thanks for giving it a shot.
> 
> But I'm not sure I follow the numbers above, sorry :/
> Are you computing the packet's Rx timestamp offset from the (expected) Tx time?

Just looking at the timestamp when the frames were received. They should be 
sent at regular intervals if I read udp_tai.c correctly, so the assumption 
was that the timestamp from tcpdump should give an inkling to how well it 
worked.

I set it up to send a frame every 10ms and computed the diff between each 
UDP packet received. Nothing fancy, just tcpdump and grep for the 
timestamp and look at the distribution.
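
As a rough illustration of the measurement described above (not the script that was actually used), the gaps between consecutive receive timestamps can be summarised as follows. The struct and function names are hypothetical, and the timestamps are assumed to be already parsed from the tcpdump output into nanoseconds:

```c
#include <stddef.h>
#include <stdint.h>

/* Summary of the gaps between consecutive receive timestamps. */
struct gap_stats {
	size_t count;     /* number of gaps, i.e. timestamps - 1 */
	double mean_ns;
	int64_t min_ns;
	int64_t max_ns;
};

/* ts_ns[] holds receive timestamps in nanoseconds, in arrival order.
 * Returns 0 on success, -1 if there are fewer than two timestamps.
 */
static int gap_stats_compute(const int64_t *ts_ns, size_t n,
			     struct gap_stats *out)
{
	int64_t sum = 0;
	size_t i;

	if (n < 2)
		return -1;

	out->count = n - 1;
	out->min_ns = out->max_ns = ts_ns[1] - ts_ns[0];

	for (i = 1; i < n; i++) {
		int64_t gap = ts_ns[i] - ts_ns[i - 1];

		sum += gap;
		if (gap < out->min_ns)
			out->min_ns = gap;
		if (gap > out->max_ns)
			out->max_ns = gap;
	}

	out->mean_ns = (double)sum / (double)out->count;
	return 0;
}
```

With a 10ms send period, a healthy run should show min, mean and max all close to 10000000 ns; a mean far above the period points at frames being delayed or bunched somewhere on the path.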

> > I have to dig more into why this is happening, a lot frames delayed much 
> > more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
> > obvious fix is move some hw around and do a direct link, but I didn't have 
> > time for that right now.
> > 
> > I'm very interested in doing what Richard's original test was when he used 
> > ptp-synched clocks and also used hw receive-time and compared with expected 
> > tx-time. So, while I'm getting that up and running, I thought I should 
> > share the early results.
> 
> Sure, thanks. Which delta and clockid are you using, please?

I used the example provided in -00,

tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

tc qdisc add dev eth2 parent 100:1 tbs offload delta 100000 clockid \
 CLOCK_REALTIME sorting

> Also, was this clock synchronized to the PHC? You need that for hw offload with
> sorting enabled.

Hmm, good point, no, NIC clock was not synchronized, I'll do that in the 
next round for both sender and receiver!

-henrik

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 00/18] Time based packet transmission
  2018-03-08 22:54       ` [Intel-wired-lan] " Henrik Austad
@ 2018-03-08 23:58         ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-08 23:58 UTC (permalink / raw)
  To: Henrik Austad
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, tglx, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

Hi,


On 03/08/2018 02:54 PM, Henrik Austad wrote:
> Just looking at the timestamp when the frames were received. They should be 
> sent at regular intervals if I read udp_tai.c correctly, so the assumption 
> was that the timestamp from tcpdump should give an inkling to how well it 
> worked.
> 
> I set it up to send a frame every 10ms and computed the diff between each 
> UDP packet received. Nothing fancy, just tcpdump and grep for the 
> timestamp and look at the distribution.

Ok, I see it now. Just as a reference, this is how I've been running tcpdump on
my tests:

$ tcpdump -i enp3s0 -w foo.pcap -j adapter_unsynced \
	-tt --time-stamp-precision=nano udp port 7788 -c 10000


> 
>>> I have to dig more into why this is happening, a lot frames delayed much 
>>> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
>>> obvious fix is move some hw around and do a direct link, but I didn't have 
>>> time for that right now.
>>>
>>> I'm very interested in doing what Richard's original test was when he used 
>>> ptp-synched clocks and also used hw receive-time and compared with expected 
>>> tx-time. So, while I'm getting that up and running, I thought I should 
>>> share the early results.
>>
>> Sure, thanks. Which delta and clockid are you using, please?
> 
> I used the example provided in -00,
> 
> tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
>  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> tc qdisc add dev eth2 parent 100:1 tbs offload delta 100000 clockid \
>  CLOCK_REALTIME sorting


The delta value is highly dependent on the system. I recommend playing around
with it a bit before running long tests. On my KabyLake desktop I noticed that
150us is quite a reliable value, for example. (same kernel as yours, and no
preempt-rt applied) But that is not the issue here it seems.



> 
>> Also, was this clock synchronized to the PHC? You need that for hw offload with
>> sorting enabled.
> 
> Hmm, good point, no, NIC clock was not synchronized, I'll do that in the 
> next round for both sender and receiver!

Oh, then you need to get that setup first. Here I synchronize both PHCs over the
network first with ptp4l:

Rx) $ ptp4l --summary_interval=3 -i enp3s0 -m -2
Tx) $ ptp4l --summary_interval=3 -i enp3s0 -s -m -2 &

My Rx is the PTP master and the Tx is the PTP slave.
Then I synchronize the PHC to the system clock on the Tx side only:

Tx) $ phc2sys -a -r -r -u 8 &


And udp_tai is using CLOCK_REALTIME. The UTC vs TAI 37s offset makes no
difference for this test specifically because I compensate for it when
calculating the offsets on the Rx side.
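
A minimal sketch of that compensation, assuming the Rx hardware timestamp comes from a PHC running on the PTP (TAI) timescale while the txtime was taken from CLOCK_REALTIME (UTC). The function name and the constant are illustrative only; the 37s value is the offset mentioned above and would need updating if another leap second is introduced:

```c
#include <stdint.h>

#define NSEC_PER_SEC       1000000000LL
/* TAI - UTC offset quoted in this thread (37 seconds). */
#define TAI_UTC_OFFSET_NS  (37 * NSEC_PER_SEC)

/* Offset between when a frame was received and when it was supposed to
 * be transmitted, in nanoseconds.
 *
 * rx_tai_ns:     Rx hardware timestamp, TAI timescale (assumption)
 * txtime_utc_ns: txtime requested by the sender, UTC timescale (assumption)
 */
static int64_t rx_offset_ns(int64_t rx_tai_ns, int64_t txtime_utc_ns)
{
	return (rx_tai_ns - TAI_UTC_OFFSET_NS) - txtime_utc_ns;
}
```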

For the next patchset version I will be providing a more complete set of testing
instructions. I hope that helps for now.


Thanks,
Jesus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [Intel-wired-lan] [RFC v3 net-next 00/18] Time based packet transmission
@ 2018-03-08 23:58         ` Jesus Sanchez-Palencia
  0 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-08 23:58 UTC (permalink / raw)
  To: intel-wired-lan

Hi,


On 03/08/2018 02:54 PM, Henrik Austad wrote:
> Just looking at the timestamp when the frames were received. They should be 
> sent at regular intervals if I read udp_tai.c correctly, so the assumption 
> was that the timestamp from tcpdump should give an inkling to how well it 
> worked.
> 
> I set it up to send a frame every 10ms and computed the diff between each 
> UDP packet received. Nothing fancy, just tcpdump and grep for the 
> timestamp and look at the distribution.

Ok, I see it now. Just as a reference, this is how I've been running tcpdump on
my tests:

$ tcpdump -i enp3s0 -w foo.pcap -j adapter_unsynced \
	-tt --time-stamp-precision=nano udp port 7788 -c 10000


> 
>>> I have to dig more into why this is happening, a lot frames delayed much 
>>> more than I'd expect, but at this stage I'm pretty sure this is pebkac. One 
>>> obvious fix is move some hw around and do a direct link, but I didn't have 
>>> time for that right now.
>>>
>>> I'm very interested in doing what Richard's original test was when he used 
>>> ptp-synched clocks and also used hw receive-time and compared with expected 
>>> tx-time. So, while I'm getting that up and running, I thought I should 
>>> share the early results.
>>
>> Sure, thanks. Which delta and clockid are you using, please?
> 
> I used the example provided in -00,
> 
> tc qdisc replace dev eth2 parent root handle 100 mqprio num_tc 3 \
>  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> tc qdisc add dev eth2 parent 100:1 tbs offload delta 100000 clockid \
>  CLOCK_REALTIME sorting


The delta value is highly dependent on the system. I recommend playing around
with it a bit before running long tests. On my KabyLake desktop I noticed that
150us is quite a reliable value, for example. (same kernel as yours, and no
preempt-rt applied) But that is not the issue here it seems.



> 
>> Also, was this clock synchronized to the PHC? You need that for hw offload with
>> sorting enabled.
> 
> Hmm, good point, no, NIC clock was not synchronized, I'll do that in the 
> next round for both sender and receiver!

Oh, then you need to get that setup first. Here I synchronize both PHCs over the
network first with ptp4l:

Rx) $ ptp4l --summary_interval=3 -i enp3s0 -m -2
Tx) $ ptp4l --summary_interval=3 -i enp3s0 -s -m -2 &

My Rx is the PTP master and the Tx is the PTP slave.
Then I synchronize the PHC to the system clock on the Tx side only:

Tx) $ phc2sys -a -r -r -u 8 &


And udp_tai is using CLOCK_REALTIME. The UTC vs TAI 37s offset makes no
difference for this test specifically because I compensate for it when
calculating the offsets on the Rx side.

For the next patchset version I will be providing a more complete set of testing
instructions. I hope that helps for now.


Thanks,
Jesus





^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-07  5:24       ` [Intel-wired-lan] " Richard Cochran
@ 2018-03-21 12:58         ` Thomas Gleixner
  -1 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-21 12:58 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Eric Dumazet, Jesus Sanchez-Palencia, netdev, jhs,
	xiyou.wangcong, jiri, vinicius.gomes, intel-wired-lan,
	anna-maria, henrik, john.stultz, levi.pearson, edumazet, willemb,
	mlichvar

On Tue, 6 Mar 2018, Richard Cochran wrote:

> On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
> > This is adding 32+1 bits to sk_buff, and possibly holes in this very
> > very hot (and already too fat) structure.
> > 
> > Do we really need 32 bits for a clockid_t ?
> 
> Probably we can live with fewer bits.
> 
> For clock IDs with a positive sign, the max possible clock value is 16.
> 
> For clock IDs with a negative sign, IIRC, three bits are for the type
> code (we have also posix timers packed like this) and the are for the
> file descriptor.  So maybe we could use 16 bits, allowing 12 bits or
> so for encoding the FD.
> 
> The downside would be that this forces the application to make sure
> and open the dynamic posix clock early enough before the FD count gets
> too high.

Errm. No. There is no way to support fd based clocks or one of the CPU
time/process time based clocks for this.

CLOCK_REALTIME and CLOCK_MONOTONIC are probably the only interesting
ones. BOOTTIME is hopefully soon irrelevant as we make MONOTONIC and
BOOTTIME the same unless this unexpectedly causes major issues. I don't
think that CLOCK_TAI makes sense in that context, but I might be wrong.

The rest of the CLOCK_* space cannot be used at all.

So you need at max 2 bits for this, but I think 1 is good enough.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [Intel-wired-lan] [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
@ 2018-03-21 12:58         ` Thomas Gleixner
  0 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-21 12:58 UTC (permalink / raw)
  To: intel-wired-lan

On Tue, 6 Mar 2018, Richard Cochran wrote:

> On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
> > This is adding 32+1 bits to sk_buff, and possibly holes in this very
> > very hot (and already too fat) structure.
> > 
> > Do we really need 32 bits for a clockid_t ?
> 
> Probably we can live with fewer bits.
> 
> For clock IDs with a positive sign, the max possible clock value is 16.
> 
> For clock IDs with a negative sign, IIRC, three bits are for the type
> code (we have also posix timers packed like this) and the are for the
> file descriptor.  So maybe we could use 16 bits, allowing 12 bits or
> so for encoding the FD.
> 
> The downside would be that this forces the application to make sure
> and open the dynamic posix clock early enough before the FD count gets
> too high.

Errm. No. There is no way to support fd based clocks or one of the CPU
time/process time based clocks for this.

CLOCK_REALTIME and CLOCK_MONOTONIC are probably the only interesting
ones. BOOTTIME is hopefully soon irrelevant as we make MONOTONIC and
BOOTTIME the same unless this unexpectedly causes major issues. I don't
think that CLOCK_TAI makes sense in that context, but I might be wrong.

The rest of the CLOCK_* space cannot be used at all.

So you need at max 2 bits for this, but I think 1 is good enough.

Thanks,

	tglx








^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-07  1:12   ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-21 13:46     ` Thomas Gleixner
  -1 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-21 13:46 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, henrik, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> +struct tbs_sched_data {
> +	bool sorting;
> +	int clockid;
> +	int queue;
> +	s32 delta; /* in ns */
> +	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
> +	struct rb_root head;

Hmm. You are reimplementing timerqueue open coded. Have you checked whether
you could reuse the timerqueue implementation?

That requires adding a timerqueue node to struct sk_buff:

@@ -671,7 +671,8 @@ struct sk_buff {
 				unsigned long		dev_scratch;
 			};
 		};
-		struct rb_node	rbnode; /* used in netem & tcp stack */
+		struct rb_node		rbnode; /* used in netem & tcp stack */
+		struct timerqueue_node	tqnode;
 	};
 	struct sock		*sk;

Then you can use timerqueue_head in your scheduler data and all the open
coded rbtree handling goes away.

> +static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	ktime_t txtime = nskb->tstamp;
> +	struct sock *sk = nskb->sk;
> +	ktime_t now;
> +
> +	if (sk && !sock_flag(sk, SOCK_TXTIME))
> +		return false;
> +
> +	/* We don't perform crosstimestamping.
> +	 * Drop if packet's clockid differs from qdisc's.
> +	 */
> +	if (nskb->txtime_clockid != q->clockid)
> +		return false;
> +
> +	now = get_time_by_clockid(q->clockid);

If you store the time getter function pointer in tbs_sched_data then you
avoid the lookup and just can do

       now = q->get_time();

That applies to lots of other places.
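
A user-space sketch of the idea, assuming only CLOCK_REALTIME and CLOCK_MONOTONIC need to be supported. The struct and function names below are hypothetical and clock_gettime() stands in for the kernel's ktime getters; the point is only that the clock is resolved once at setup time instead of on every packet:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

typedef int64_t (*get_time_fn)(void);

static int64_t read_clock_ns(clockid_t id)
{
	struct timespec ts;

	clock_gettime(id, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static int64_t get_realtime_ns(void)
{
	return read_clock_ns(CLOCK_REALTIME);
}

static int64_t get_monotonic_ns(void)
{
	return read_clock_ns(CLOCK_MONOTONIC);
}

/* Hypothetical stand-in for the qdisc private data. */
struct sched_data {
	clockid_t clockid;
	get_time_fn get_time;   /* chosen once, at init time */
};

/* Returns 0 on success, -1 for clocks this sketch does not support. */
static int sched_data_init(struct sched_data *q, clockid_t clockid)
{
	switch (clockid) {
	case CLOCK_REALTIME:
		q->get_time = get_realtime_ns;
		break;
	case CLOCK_MONOTONIC:
		q->get_time = get_monotonic_ns;
		break;
	default:
		return -1;
	}

	q->clockid = clockid;
	return 0;
}
```

The hot path then reduces to `now = q->get_time();` with no per-call switch on the clockid.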

> +	if (ktime_before(txtime, now) || ktime_before(txtime, q->last))
> +		return false;
> +
> +	return true;
> +}
> +
> +static struct sk_buff *tbs_peek(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +
> +	return q->peek(sch);
> +}
> +
> +static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct rb_node *p;
> +
> +	p = rb_first(&q->head);

timerqueue gives you direct access to the first expiring entry w/o walking
the rbtree. So that would become:

	p = timerqueue_getnext(&q->tqhead);
	return p ? rb_to_skb(p) : NULL;

> +	if (!p)
> +		return NULL;
> +
> +	return rb_to_skb(p);
> +}

> +static int tbs_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc *sch,
> +				      struct sk_buff **to_free)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct rb_node **p = &q->head.rb_node, *parent = NULL;
> +	ktime_t txtime = nskb->tstamp;
> +
> +	if (!is_packet_valid(sch, nskb))
> +		return qdisc_drop(nskb, sch, to_free);
> +
> +	while (*p) {
> +		struct sk_buff *skb;
> +
> +		parent = *p;
> +		skb = rb_to_skb(parent);
> +		if (ktime_after(txtime, skb->tstamp))
> +			p = &parent->rb_right;
> +		else
> +			p = &parent->rb_left;
> +	}
> +	rb_link_node(&nskb->rbnode, parent, p);
> +	rb_insert_color(&nskb->rbnode, &q->head);

That'd become:

       nskb->tqnode.expires = txtime;
       timerqueue_add(&q->tqhead, &nskb->tqnode);

> +	qdisc_qstats_backlog_inc(sch, nskb);
> +	sch->q.qlen++;
> +
> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
> +	reset_watchdog(sch);
> +
> +	return NET_XMIT_SUCCESS;
> +}
> +
> +static void timesortedlist_erase(struct Qdisc *sch, struct sk_buff *skb,
> +				 bool drop)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +
> +	rb_erase(&skb->rbnode, &q->head);
> +
> +	qdisc_qstats_backlog_dec(sch, skb);
> +
> +	if (drop) {
> +		struct sk_buff *to_free = NULL;
> +
> +		qdisc_drop(skb, sch, &to_free);
> +		kfree_skb_list(to_free);
> +		qdisc_qstats_overlimit(sch);
> +	} else {
> +		qdisc_bstats_update(sch, skb);
> +
> +		q->last = skb->tstamp;
> +	}
> +
> +	sch->q.qlen--;
> +
> +	/* The rbnode field in the skb re-uses these fields, now that
> +	 * we are done with the rbnode, reset them.
> +	 */
> +	skb->next = NULL;
> +	skb->prev = NULL;
> +	skb->dev = qdisc_dev(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +
> +	return q->dequeue(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct sk_buff *skb = tbs_peek(sch);
> +	ktime_t now, next;
> +
> +	if (!skb)
> +		return NULL;
> +
> +	now = get_time_by_clockid(q->clockid);
> +
> +	/* Drop if packet has expired while in queue and the drop_if_late
> +	 * flag is set.
> +	 */
> +	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
> +		struct sk_buff *to_free = NULL;
> +
> +		qdisc_queue_drop_head(sch, &to_free);
> +		kfree_skb_list(to_free);
> +		qdisc_qstats_overlimit(sch);
> +
> +		skb = NULL;
> +		goto out;

Instead of bailing out immediately, you should check whether the next skb
is already due for sending.
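
A simplified user-space sketch of such a loop, assuming a plain array-backed FIFO ordered by txtime. The types and names are hypothetical and this is not the qdisc code; it only illustrates dropping an expired head and then looking at its successor rather than returning right away:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pkt {
	int64_t txtime;       /* requested transmission time, in ns */
	bool drop_if_late;
};

struct pkt_fifo {
	struct pkt *pkts;     /* entries ordered by txtime */
	size_t head;          /* index of the next packet to consider */
	size_t len;           /* total number of entries in pkts[] */
	size_t dropped;
};

/* Returns the next packet that is due for transmission, or NULL if the
 * queue is empty or its head is not due yet.
 */
static struct pkt *fifo_dequeue(struct pkt_fifo *q, int64_t now, int64_t delta)
{
	while (q->head < q->len) {
		struct pkt *p = &q->pkts[q->head];

		if (p->drop_if_late && p->txtime < now) {
			/* Expired: drop it and check the next packet
			 * instead of giving up.
			 */
			q->head++;
			q->dropped++;
			continue;
		}

		/* Same condition as the patch: dequeue once now is past
		 * txtime - delta.
		 */
		if (now > p->txtime - delta) {
			q->head++;
			return p;
		}

		return NULL;
	}

	return NULL;
}
```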

> +	}
> +
> +	next = ktime_sub_ns(skb->tstamp, q->delta);
> +
> +	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
> +	if (ktime_after(now, next))
> +		skb = qdisc_dequeue_head(sch);
> +	else
> +		skb = NULL;
> +
> +out:
> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
> +	reset_watchdog(sch);
> +
> +	return skb;
> +}
> +
> +static struct sk_buff *tbs_dequeue_timesortedlist(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct sk_buff *skb;
> +	ktime_t now, next;
> +
> +	skb = tbs_peek(sch);
> +	if (!skb)
> +		return NULL;
> +
> +	now = get_time_by_clockid(q->clockid);
> +
> +	/* Drop if packet has expired while in queue and the drop_if_late
> +	 * flag is set.
> +	 */
> +	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
> +		timesortedlist_erase(sch, skb, true);
> +		skb = NULL;
> +		goto out;

Same as above.

> +	}
> +
> +	next = ktime_sub_ns(skb->tstamp, q->delta);
> +
> +	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
> +	if (ktime_after(now, next))
> +		timesortedlist_erase(sch, skb, false);
> +	else
> +		skb = NULL;
> +
> +out:
> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
> +	reset_watchdog(sch);
> +
> +	return skb;
> +}
> +
> +static inline void setup_queueing_mode(struct tbs_sched_data *q)
> +{
> +	if (q->sorting) {
> +		q->enqueue = tbs_enqueue_timesortedlist;
> +		q->dequeue = tbs_dequeue_timesortedlist;
> +		q->peek = tbs_peek_timesortedlist;
> +	} else {
> +		q->enqueue = tbs_enqueue_scheduledfifo;
> +		q->dequeue = tbs_dequeue_scheduledfifo;
> +		q->peek = qdisc_peek_head;

I don't see the point of these two modes and all the duplicated code it
involves.

FIFO mode limits usage to a single thread which has to guarantee that the
packets are queued in time order.

If you look at the use cases of TDM in various fields then FIFO mode is
pretty much useless. In industrial/automotive fieldbus applications the
various time slices are filled by different threads or even processes.

Sure, the rbtree queue/dequeue has overhead compared to a simple linked
list, but you pay for that with more indirections and lots of mostly
duplicated code. And in the worst case one of these code paths is going to
be rarely used and prone to bitrot.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [Intel-wired-lan] [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
@ 2018-03-21 13:46     ` Thomas Gleixner
  0 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-21 13:46 UTC (permalink / raw)
  To: intel-wired-lan

On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> +struct tbs_sched_data {
> +	bool sorting;
> +	int clockid;
> +	int queue;
> +	s32 delta; /* in ns */
> +	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
> +	struct rb_root head;

Hmm. You are reimplementing timerqueue open coded. Have you checked whether
you could reuse the timerqueue implementation?

That requires adding a timerqueue node to struct sk_buff:

@@ -671,7 +671,8 @@ struct sk_buff {
 				unsigned long		dev_scratch;
 			};
 		};
-		struct rb_node	rbnode; /* used in netem & tcp stack */
+		struct rb_node		rbnode; /* used in netem & tcp stack */
+		struct timerqueue_node	tqnode;
 	};
 	struct sock		*sk;

Then you can use timerqueue_head in your scheduler data and all the open
coded rbtree handling goes away.

> +static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	ktime_t txtime = nskb->tstamp;
> +	struct sock *sk = nskb->sk;
> +	ktime_t now;
> +
> +	if (sk && !sock_flag(sk, SOCK_TXTIME))
> +		return false;
> +
> +	/* We don't perform crosstimestamping.
> +	 * Drop if packet's clockid differs from qdisc's.
> +	 */
> +	if (nskb->txtime_clockid != q->clockid)
> +		return false;
> +
> +	now = get_time_by_clockid(q->clockid);

If you store the time getter function pointer in tbs_sched_data then you
avoid the lookup and just can do

       now = q->get_time();

That applies to lots of other places.

> +	if (ktime_before(txtime, now) || ktime_before(txtime, q->last))
> +		return false;
> +
> +	return true;
> +}
> +
> +static struct sk_buff *tbs_peek(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +
> +	return q->peek(sch);
> +}
> +
> +static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct rb_node *p;
> +
> +	p = rb_first(&q->head);

timerqueue gives you direct access to the first expiring entry w/o walking
the rbtree. So that would become:

	p = timerqueue_getnext(&q->tqhead);
	return p ? rb_to_skb(p) : NULL;

> +	if (!p)
> +		return NULL;
> +
> +	return rb_to_skb(p);
> +}

> +static int tbs_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc *sch,
> +				      struct sk_buff **to_free)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct rb_node **p = &q->head.rb_node, *parent = NULL;
> +	ktime_t txtime = nskb->tstamp;
> +
> +	if (!is_packet_valid(sch, nskb))
> +		return qdisc_drop(nskb, sch, to_free);
> +
> +	while (*p) {
> +		struct sk_buff *skb;
> +
> +		parent = *p;
> +		skb = rb_to_skb(parent);
> +		if (ktime_after(txtime, skb->tstamp))
> +			p = &parent->rb_right;
> +		else
> +			p = &parent->rb_left;
> +	}
> +	rb_link_node(&nskb->rbnode, parent, p);
> +	rb_insert_color(&nskb->rbnode, &q->head);

That'd become:

       nskb->tqnode.expires = txtime;
       timerqueue_add(&q->tqhead, &nskb->tqnode);

> +	qdisc_qstats_backlog_inc(sch, nskb);
> +	sch->q.qlen++;
> +
> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
> +	reset_watchdog(sch);
> +
> +	return NET_XMIT_SUCCESS;
> +}
> +
> +static void timesortedlist_erase(struct Qdisc *sch, struct sk_buff *skb,
> +				 bool drop)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +
> +	rb_erase(&skb->rbnode, &q->head);
> +
> +	qdisc_qstats_backlog_dec(sch, skb);
> +
> +	if (drop) {
> +		struct sk_buff *to_free = NULL;
> +
> +		qdisc_drop(skb, sch, &to_free);
> +		kfree_skb_list(to_free);
> +		qdisc_qstats_overlimit(sch);
> +	} else {
> +		qdisc_bstats_update(sch, skb);
> +
> +		q->last = skb->tstamp;
> +	}
> +
> +	sch->q.qlen--;
> +
> +	/* The rbnode field in the skb re-uses these fields, now that
> +	 * we are done with the rbnode, reset them.
> +	 */
> +	skb->next = NULL;
> +	skb->prev = NULL;
> +	skb->dev = qdisc_dev(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +
> +	return q->dequeue(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct sk_buff *skb = tbs_peek(sch);
> +	ktime_t now, next;
> +
> +	if (!skb)
> +		return NULL;
> +
> +	now = get_time_by_clockid(q->clockid);
> +
> +	/* Drop if packet has expired while in queue and the drop_if_late
> +	 * flag is set.
> +	 */
> +	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
> +		struct sk_buff *to_free = NULL;
> +
> +		qdisc_queue_drop_head(sch, &to_free);
> +		kfree_skb_list(to_free);
> +		qdisc_qstats_overlimit(sch);
> +
> +		skb = NULL;
> +		goto out;

Instead of bailing out immediately, you should check whether the next skb
is already due for sending.

> +	}
> +
> +	next = ktime_sub_ns(skb->tstamp, q->delta);
> +
> +	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
> +	if (ktime_after(now, next))
> +		skb = qdisc_dequeue_head(sch);
> +	else
> +		skb = NULL;
> +
> +out:
> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
> +	reset_watchdog(sch);
> +
> +	return skb;
> +}
> +
> +static struct sk_buff *tbs_dequeue_timesortedlist(struct Qdisc *sch)
> +{
> +	struct tbs_sched_data *q = qdisc_priv(sch);
> +	struct sk_buff *skb;
> +	ktime_t now, next;
> +
> +	skb = tbs_peek(sch);
> +	if (!skb)
> +		return NULL;
> +
> +	now = get_time_by_clockid(q->clockid);
> +
> +	/* Drop if packet has expired while in queue and the drop_if_late
> +	 * flag is set.
> +	 */
> +	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
> +		timesortedlist_erase(sch, skb, true);
> +		skb = NULL;
> +		goto out;

Same as above.

> +	}
> +
> +	next = ktime_sub_ns(skb->tstamp, q->delta);
> +
> +	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
> +	if (ktime_after(now, next))
> +		timesortedlist_erase(sch, skb, false);
> +	else
> +		skb = NULL;
> +
> +out:
> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
> +	reset_watchdog(sch);
> +
> +	return skb;
> +}
> +
> +static inline void setup_queueing_mode(struct tbs_sched_data *q)
> +{
> +	if (q->sorting) {
> +		q->enqueue = tbs_enqueue_timesortedlist;
> +		q->dequeue = tbs_dequeue_timesortedlist;
> +		q->peek = tbs_peek_timesortedlist;
> +	} else {
> +		q->enqueue = tbs_enqueue_scheduledfifo;
> +		q->dequeue = tbs_dequeue_scheduledfifo;
> +		q->peek = qdisc_peek_head;

I don't see the point of these two modes and all the duplicated code it
involves.

FIFO mode limits usage to a single thread which has to guarantee that the
packets are queued in time order.

If you look at the use cases of TDM in various fields then FIFO mode is
pretty much useless. In industrial/automotive fieldbus applications the
various time slices are filled by different threads or even processes.

Sure, the rbtree queue/dequeue has overhead compared to a simple linked
list, but you pay for that with more indirections and lots of mostly
duplicated code. And in the worst case one of these code paths is going to
be rarely used and prone to bitrot.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
  2018-03-07  1:12   ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-21 14:22     ` Thomas Gleixner
  -1 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-21 14:22 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, henrik, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
> 
> In this example, the Qdisc will use HW offload for the control of the
> transmission time through the network adapter. It's assumed the timestamp
> in skbuffs are in reference to the interface's PHC and setting any other
> valid clockid would be treated as an error. Because there is no
> scheduling being performed in the qdisc, setting a delta != 0 would also
> be considered an error.

Which clockid will be handed in from the application? The network adapter
time has no fixed clockid. The only way you can get to it is via a fd based
posix clock and that does not work at all because the qdisc setup might
have a different FD than the application which queues packets.

I think this should look like this:

    clock_adapter:	1 = clock of the network adapter
    			0 = system clock selected by clock_system

    clock_system:	0 = CLOCK_REALTIME
    			1 = CLOCK_MONOTONIC

or something like that.

> Example 2:
> 
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
> 	   clockid CLOCK_REALTIME sorting
> 
> Here, the Qdisc will use HW offload for the txtime control again,
> but now sorting will be enabled, and thus there will be scheduling being
> performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
> reference and packets leave the Qdisc "delta" (100000) nanoseconds before
> their transmission time. Because this will be using HW offload and
> since dynamic clocks are not supported by the hrtimer, the system clock
> and the PHC clock must be synchronized for this mode to behave as expected.

So what you do here is queue the packets in the qdisc and then schedule
them at some point ahead of actual transmission time for delivery to the
hardware. That delivery uses the same txtime as used for qdisc scheduling
to tell the hardware when the packet should go on the wire. That's needed
when the network adapter does not support queueing of multiple packets.

Bah, and probably there you need CLOCK_TAI because that's what PTP is based
on, so clock_system needs to accommodate that as well. Dammit, there goes
the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
bits plus the adapter bit.

Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
don't see us adding new fixed clocks, so we really can reserve #15 for
selecting the adapter clock if sparing that extra bit is truly required.
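
To make the 4-bit variant concrete, here is a small sketch of how the packing could look. All names are hypothetical, and restricting the fixed clocks to REALTIME, MONOTONIC and TAI is an assumption drawn from this discussion, not a settled interface. The numeric values 0, 1 and 11 are the Linux clock IDs; 15 is the slot proposed above for the adapter clock:

```c
#include <stdbool.h>
#include <stdint.h>

#define TXTIME_CLOCK_BITS       4
#define TXTIME_CLOCK_MASK       ((1u << TXTIME_CLOCK_BITS) - 1)

/* Linux fixed clock IDs that are candidates for txtime. */
#define TXTIME_CLOCK_REALTIME   0u
#define TXTIME_CLOCK_MONOTONIC  1u
#define TXTIME_CLOCK_TAI        11u
/* Hypothetical: last slot of the fixed CLOCK_* space, reserved to mean
 * "clock of the network adapter".
 */
#define TXTIME_CLOCK_ADAPTER    15u

/* Pack the clock selection into 4 bits.
 * Returns 0 on success, -1 if the clock cannot be used for txtime.
 */
static int txtime_pack_clock(unsigned int clockid, bool adapter, uint8_t *out)
{
	if (adapter) {
		*out = TXTIME_CLOCK_ADAPTER;
		return 0;
	}

	switch (clockid) {
	case TXTIME_CLOCK_REALTIME:
	case TXTIME_CLOCK_MONOTONIC:
	case TXTIME_CLOCK_TAI:
		*out = (uint8_t)clockid;
		return 0;
	default:
		/* CPU-time, fd based and all other clocks are rejected. */
		return -1;
	}
}

static bool txtime_is_adapter(uint8_t packed)
{
	return (packed & TXTIME_CLOCK_MASK) == TXTIME_CLOCK_ADAPTER;
}
```

This keeps the per-skb cost at 4 bits and avoids a separate adapter flag, at the price of permanently reserving clock ID 15.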

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* [Intel-wired-lan] [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
@ 2018-03-21 14:22     ` Thomas Gleixner
  0 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-21 14:22 UTC (permalink / raw)
  To: intel-wired-lan

On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
> 
> In this example, the Qdisc will use HW offload for the control of the
> transmission time through the network adapter. It's assumed the timestamp
> in skbuffs are in reference to the interface's PHC and setting any other
> valid clockid would be treated as an error. Because there is no
> scheduling being performed in the qdisc, setting a delta != 0 would also
> be considered an error.

Which clockid will be handed in from the application? The network adapter
time has no fixed clockid. The only way you can get to it is via a fd based
posix clock and that does not work at all because the qdisc setup might
have a different FD than the application which queues packets.

I think this should look like this:

    clock_adapter:	1 = clock of the network adapter
    			0 = system clock selected by clock_system

    clock_system:	0 = CLOCK_REALTIME
    			1 = CLOCK_MONOTONIC

or something like that.

> Example 2:
> 
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1 at 0 1 at 1 2 at 2 hw 0
> 
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
> 	   clockid CLOCK_REALTIME sorting
> 
> Here, the Qdisc will use HW offload for the txtime control again,
> but now sorting will be enabled, and thus there will be scheduling being
> performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
> reference and packets leave the Qdisc "delta" (100000) nanoseconds before
> their transmission time. Because this will be using HW offload and
> since dynamic clocks are not supported by the hrtimer, the system clock
> and the PHC clock must be synchronized for this mode to behave as expected.

So what you do here is queue the packets in the qdisc and then schedule
them at some point ahead of actual transmission time for delivery to the
hardware. That delivery uses the same txtime as used for qdisc scheduling
to tell the hardware when the packet should go on the wire. That's needed
when the network adapter does not support queueing of multiple packets.

Bah, and probably there you need CLOCK_TAI because that's what PTP is based
on, so clock_system needs to accommodate that as well. Dammit, there goes
the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
bits plus the adapter bit.

Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
don't see us adding new fixed clocks, so we really can reserve #15 for
selecting the adapter clock if sparing that extra bit is truly required.
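
A minimal sketch of the encoding described above, assuming a 4-bit selector
in the low bits of a flags word where 0..14 are the fixed CLOCK_* ids and 15
is reserved for the adapter clock. All TXTIME_* names and the helpers are
hypothetical and not part of the series; only the numeric clock ids
(REALTIME 0, MONOTONIC 1, TAI 11) come from the Linux UAPI.

```c
#include <stdbool.h>

/* Fixed POSIX clock ids as defined by the Linux UAPI. */
#define SKETCH_CLOCK_REALTIME   0
#define SKETCH_CLOCK_MONOTONIC  1
#define SKETCH_CLOCK_TAI        11

/* Hypothetical: the last value of the 0..15 space selects the adapter clock. */
#define TXTIME_CLOCK_ADAPTER    15
#define TXTIME_CLOCK_BITS       4
#define TXTIME_CLOCK_MASK       ((1u << TXTIME_CLOCK_BITS) - 1)

/* Store a clock selector in the low 4 bits, leaving the other flags alone. */
static inline unsigned int txtime_pack_clock(unsigned int flags, unsigned int sel)
{
	return (flags & ~TXTIME_CLOCK_MASK) | (sel & TXTIME_CLOCK_MASK);
}

static inline unsigned int txtime_unpack_clock(unsigned int flags)
{
	return flags & TXTIME_CLOCK_MASK;
}

static inline bool txtime_clock_is_adapter(unsigned int flags)
{
	return txtime_unpack_clock(flags) == TXTIME_CLOCK_ADAPTER;
}

/* Only REAL/MONO/TAI and the reserved adapter value are accepted. */
static inline bool txtime_clock_sel_valid(unsigned int sel)
{
	switch (sel) {
	case SKETCH_CLOCK_REALTIME:
	case SKETCH_CLOCK_MONOTONIC:
	case SKETCH_CLOCK_TAI:
	case TXTIME_CLOCK_ADAPTER:
		return true;
	default:
		return false;
	}
}
```

With this layout no separate adapter bit is needed; the cost is that value 15
can never be handed out to a future fixed clock.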

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-21 12:58         ` [Intel-wired-lan] " Thomas Gleixner
@ 2018-03-21 14:59           ` Richard Cochran
  -1 siblings, 0 replies; 129+ messages in thread
From: Richard Cochran @ 2018-03-21 14:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Eric Dumazet, Jesus Sanchez-Palencia, netdev, jhs,
	xiyou.wangcong, jiri, vinicius.gomes, intel-wired-lan,
	anna-maria, henrik, john.stultz, levi.pearson, edumazet, willemb,
	mlichvar

On Wed, Mar 21, 2018 at 01:58:51PM +0100, Thomas Gleixner wrote:
> Errm. No. There is no way to support fd based clocks or one of the CPU
> time/process time based clocks for this.

Why not?
 
If we have HW offloading, then the transmit time had better be
expressed in terms of the MAC's internal clock.  Otherwise we would
need to translate between a kernel clock and the MAC clock, but that
is expensive (eg over PCIe) and silly (because in a typical use case
the MAC will already be synchronized to the network time).

Thanks,
Richard

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
  2018-03-21 14:22     ` [Intel-wired-lan] " Thomas Gleixner
@ 2018-03-21 15:03       ` Richard Cochran
  -1 siblings, 0 replies; 129+ messages in thread
From: Richard Cochran @ 2018-03-21 15:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jesus Sanchez-Palencia, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, intel-wired-lan, anna-maria, henrik, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

On Wed, Mar 21, 2018 at 03:22:11PM +0100, Thomas Gleixner wrote:
> Which clockid will be handed in from the application? The network adapter
> time has no fixed clockid. The only way you can get to it is via a fd based
> posix clock and that does not work at all because the qdisc setup might
> have a different FD than the application which queues packets.

Duh.  That explains it.  Please ignore my "why not?" Q in the other thread...

Thanks,
Richard

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params
  2018-03-21 14:59           ` [Intel-wired-lan] " Richard Cochran
  (?)
@ 2018-03-21 15:11           ` Thomas Gleixner
  -1 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-21 15:11 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Eric Dumazet, Jesus Sanchez-Palencia, netdev, jhs,
	xiyou.wangcong, jiri, vinicius.gomes, anna-maria, henrik,
	John Stultz, levi.pearson, edumazet, willemb, mlichvar

On Wed, 21 Mar 2018, Richard Cochran wrote:

@Intel: I removed intel-wired-lan@ as I have absolutely zero interest in
	the moderation spam from that list. Can you please either get rid
	of this moderation nonsense or stop CC'ing that list when posting
	to lkml/netdev?

> On Wed, Mar 21, 2018 at 01:58:51PM +0100, Thomas Gleixner wrote:
> > Errm. No. There is no way to support fd based clocks or one of the CPU
> > time/process time based clocks for this.
> 
> Why not?
>  
> If we have HW offloading, then the transmit time had better be
> expressed in terms of the MAC's internal clock.  Otherwise we would
> need to translate between a kernel clock and the MAC clock, but that
> is expensive (eg over PCIe) and silly (because in a typical use case
> the MAC will already be synchronized to the network time).

Sure, but you CANNOT use a clockid for that because there is NONE.

The mac clock is exposed via a dynamic posix clock and can only be
referenced via a file descriptor.

The qdisc setup does fd = open(...) and hands that in as clockid. Later the
application does fd = open(...) and uses that as clockid for tagging the
messages.

What the heck guarantees that both the setup and the application will get
the same fd number?

Exactly nothing. So any attempt to use the file descriptor number as clockid
is broken by definition.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
  2018-03-21 15:03       ` [Intel-wired-lan] " Richard Cochran
  (?)
@ 2018-03-21 16:18       ` Thomas Gleixner
  2018-03-22 22:01         ` Jesus Sanchez-Palencia
  -1 siblings, 1 reply; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-21 16:18 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Jesus Sanchez-Palencia, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

On Wed, 21 Mar 2018, Richard Cochran wrote:

> On Wed, Mar 21, 2018 at 03:22:11PM +0100, Thomas Gleixner wrote:
> > Which clockid will be handed in from the application? The network adapter
> > time has no fixed clockid. The only way you can get to it is via a fd based
> > posix clock and that does not work at all because the qdisc setup might
> > have a different FD than the application which queues packets.
> 
> Duh.  That explains it.  Please ignore my "why not?" Q in the other thread...

:)

So in that case you are either bound to rely on the application to use the
proper dynamic clock or if we need a sanity check, then you need a cookie
of some form which can be retrieved from the posix clock file descriptor
and handed in as 'clockid' together with clock_adapter = true.

That's doable, but that needs a bit more trickery. A simple unique ID per
dynamic posix-clock would be trivial to add, but that would not give you
any form of verification whether this ID actually belongs to the network
adapter or not.

So either you ignore the clockid and rely on the application not being
stupid when it says "clock_adapter = true" or you need some extra
complexity to build an association of a "clockid" to a network adapter.

There is a connection already, via

     adapter->ptp_clock->devid

which is MKDEV(major, index) which is accessible at least at the network
driver level, but probably not from networking core. So you'd need to drill
a few more holes by adding yet another callback to net_device_ops.
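
A rough user-space sketch of what such a cookie check could look like,
assuming the cookie is simply the PHC's device number. The SK_* macros mirror
the layout of the kernel's MKDEV/MAJOR/MINOR helpers (20 minor bits); struct
adapter_sketch and txtime_cookie_matches() are hypothetical, as is any major
number used with them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same layout as the kernel's kdev_t helpers: major above 20 bits of minor. */
#define SK_MINORBITS 20
#define SK_MINORMASK ((1u << SK_MINORBITS) - 1)
#define SK_MKDEV(ma, mi) (((uint32_t)(ma) << SK_MINORBITS) | (uint32_t)(mi))
#define SK_MAJOR(dev) ((uint32_t)(dev) >> SK_MINORBITS)
#define SK_MINOR(dev) ((uint32_t)(dev) & SK_MINORMASK)

/* Hypothetical: what a driver callback could report for its PHC. */
struct adapter_sketch {
	uint32_t phc_devid;
};

/* The cookie handed in by the application must name this adapter's PHC. */
static inline bool txtime_cookie_matches(const struct adapter_sketch *a,
					 uint32_t cookie)
{
	return a->phc_devid == cookie;
}
```

Unlike an fd number, the device number is the same for every process that
opens the clock, which is what makes it usable as a cookie.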

I'm not sure if it's worth the trouble. If the application hands in bogus
timestamps, packets go out at the wrong time or are dropped. That's true
whether it uses the proper clock or not. So nothing the kernel should
really worry about.

For clock_system - REAL/MONO/TAI(sigh) - you surely need a sanity check,
but that is independent of the underlying network adapter even in the
qdisc assisted HW offload case.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-21 13:46     ` [Intel-wired-lan] " Thomas Gleixner
  (?)
@ 2018-03-21 22:29     ` Thomas Gleixner
  2018-03-22 20:25       ` Jesus Sanchez-Palencia
  -1 siblings, 1 reply; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-21 22:29 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

On Wed, 21 Mar 2018, Thomas Gleixner wrote:
> If you look at the use cases of TDM in various fields then FIFO mode is
> pretty much useless. In industrial/automotive fieldbus applications the
> various time slices are filled by different threads or even processes.

That brings me to a related question. The TDM cases I'm familiar with which
aim to use this utilize multiple periodic time slices, aka 802.1Qbv
time-aware scheduling.

Simple example:

[1a][1b][1c][1d]		[1a][1b][1c][1d]		[.....
		[2a][2b]			[2c][2d]
			[3a]				[3b]
			    [4a]			    [4b]
---------------------------------------------------------------------->	t		    

where 1-4 is the slice level and a-d are network nodes.

In most cases the slice levels on a node are handled by different
applications or threads. Some of the protocols utilize dedicated time slice
levels - let's assume '4' in the above example - to run general network
traffic which might even be allowed to have collisions, i.e. [4a-d] would
become [4] and any node can send; the involved components like switches are
supposed to handle that.

I'm not seeing how TBS is going to assist with any of that. It requires
everything to be handled at the application level. Not really useful
especially not for general traffic which does not know about the scheduling
bands at all.

If you look at an industrial control node. It basically does:

	queue_first_packet(tx, slice1);
   	while (!stop) {
		if (wait_for_packet(rx) == ERROR)
			goto errorhandling;
		tx = do_computation(rx);
		queue_next_tx(tx, slice1);
	}

that's a pretty common pattern for this kind of application. For audio
sources queue_next_tx() might be triggered by the input sampler which needs to
be synchronized to the network slices anyway in order to work properly.

TBS per current implementation is nice as a proof of concept, but it solves
just a small portion of the complete problem space. I have the suspicion
that this was 'designed' to replace the user space hack in the AVNU stack
with something close to it. Not really a good plan to be honest.

I think what we really want is a strict periodic scheduler which supports
multiple slices as shown above because that's what all relevant TDM use
cases need: A/V, industrial fieldbuses .....

  |---------------------------------------------------------|
  |                                                         |
  |                           TAS                           |<- Config
  |    1               2               3               4    |
  |---------------------------------------------------------|
       |               |               |               |
       |               |               |               |
       |               |               |               |
       |               |               |               |
  [DirectSocket]   [Qdisc FIFO]   [Qdisc Prio]     [Qdisc FIFO]
                       |               |               |
		       |               |               |
		    [Socket]   	    [Socket]     [General traffic]


The interesting thing here is that it does not require any time stamp
information brought in from the application. That's especially good for
general network traffic which is routed through a dedicated time slot. If
we don't have that then we need a user space scheduler which does exactly
the same thing and we have to route the general traffic out to user space
and back into the kernel, which is obviously a pointless exercise.
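
As an illustration of why no per-packet time stamp is needed, here is a
minimal sketch of a strictly periodic schedule: given a cycle made of
consecutive windows, the scheduler only needs the current time to know which
slice level may transmit. The structures and function names are hypothetical
and the window durations in any usage are made up; this is not the layout of
any existing qdisc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one window in a repeating cycle. */
struct tas_window {
	int      level;        /* slice level that may transmit */
	uint64_t duration_ns;  /* length of the window */
};

struct tas_cycle {
	uint64_t base_ns;               /* start of the first cycle */
	const struct tas_window *win;   /* windows in transmission order */
	size_t   nwin;
};

static inline uint64_t tas_cycle_len(const struct tas_cycle *c)
{
	uint64_t len = 0;

	for (size_t i = 0; i < c->nwin; i++)
		len += c->win[i].duration_ns;
	return len;
}

/* Return the level whose window contains 'now', or -1 if none applies. */
static inline int tas_level_at(const struct tas_cycle *c, uint64_t now)
{
	uint64_t len = tas_cycle_len(c);
	uint64_t off;

	if (!len || now < c->base_ns)
		return -1;

	/* Offset into the current cycle, then walk the windows. */
	off = (now - c->base_ns) % len;
	for (size_t i = 0; i < c->nwin; i++) {
		if (off < c->win[i].duration_ns)
			return c->win[i].level;
		off -= c->win[i].duration_ns;
	}
	return -1;
}
```

Each level would then pull from whatever child queue is attached to it, so
general traffic in its own slot never has to know about the schedule.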

There are all kinds of TDM schemes out there which are not directly driven
by applications, but rather route categorized traffic like VLANs through
dedicated time slices. That works pretty well with the above scheme because
in that case the applications might be completely oblivious about the tx
time schedule.

Surely there are protocols which do not utilize every time slice they could
use, so we need a way to tell the number of empty slices between two
consecutive packets. There are also different policies vs. the unused time
slices, like sending dummy frames or just nothing which wants to be
addressed, but I don't think that changes the general approach.

There might be some special cases for setup or node hotplug, but the
protocols I'm familiar with handle these in dedicated time slices or
through general traffic so it should just fit in.

I'm surely missing some details, but from my knowledge about the protocols
which want to utilize this, the general direction should be fine.

Feel free to tell me that I'm missing the point completely though :)

Thoughts?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-21 22:29     ` Thomas Gleixner
@ 2018-03-22 20:25       ` Jesus Sanchez-Palencia
  2018-03-22 22:52         ` Thomas Gleixner
  0 siblings, 1 reply; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-22 20:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

Hi Thomas,


On 03/21/2018 03:29 PM, Thomas Gleixner wrote:
> On Wed, 21 Mar 2018, Thomas Gleixner wrote:
>> If you look at the use cases of TDM in various fields then FIFO mode is
>> pretty much useless. In industrial/automotive fieldbus applications the
>> various time slices are filled by different threads or even processes.
> 
> That brings me to a related question. The TDM cases I'm familiar with which
> aim to use this utilize multiple periodic time slices, aka 802.1Qbv
> time-aware scheduling.
> 
> Simple example:
> 
> [1a][1b][1c][1d]		[1a][1b][1c][1d]		[.....
> 		[2a][2b]			[2c][2d]
> 			[3a]				[3b]
> 			    [4a]			    [4b]
> ---------------------------------------------------------------------->	t		    
> 
> where 1-4 is the slice level and a-d are network nodes.
> 
> In most cases the slice levels on a node are handled by different
> applications or threads. Some of the protocols utilize dedicated time slice
> levels - let's assume '4' in the above example - to run general network
> traffic which might even be allowed to have collisions, i.e. [4a-d] would
> become [4] and any node can send; the involved components like switches are
> supposed to handle that.
> 
> I'm not seeing how TBS is going to assist with any of that. It requires
> everything to be handled at the application level. Not really useful
> especially not for general traffic which does not know about the scheduling
> bands at all.
> 
> If you look at an industrial control node. It basically does:
> 
> 	queue_first_packet(tx, slice1);
>    	while (!stop) {
> 		if (wait_for_packet(rx) == ERROR)
> 			goto errorhandling;
> 		tx = do_computation(rx);
> 		queue_next_tx(tx, slice1);
> 	}
> 
> that's a pretty common pattern for this kind of application. For audio
> sources queue_next_tx() might be triggered by the input sampler which needs to
> be synchronized to the network slices anyway in order to work properly.
> 
> TBS per current implementation is nice as a proof of concept, but it solves
> just a small portion of the complete problem space. I have the suspicion
> that this was 'designed' to replace the user space hack in the AVNU stack
> with something close to it. Not really a good plan to be honest.
> 
> I think what we really want is a strict periodic scheduler which supports
> multiple slices as shown above because that's what all relevant TDM use
> cases need: A/V, industrial fieldbuses .....
> 
>   |---------------------------------------------------------|
>   |                                                         |
>   |                           TAS                           |<- Config
>   |    1               2               3               4    |
>   |---------------------------------------------------------|
>        |               |               |               |
>        |               |               |               |
>        |               |               |               |
>        |               |               |               |
>   [DirectSocket]   [Qdisc FIFO]   [Qdisc Prio]     [Qdisc FIFO]
>                        |               |               |
> 		       |               |               |
> 		    [Socket]   	    [Socket]     [General traffic]
> 
> 
> The interesting thing here is that it does not require any time stamp
> information brought in from the application. That's especially good for
> general network traffic which is routed through a dedicated time slot. If
> we don't have that then we need a user space scheduler which does exactly
> the same thing and we have to route the general traffic out to user space
> and back into the kernel, which is obviously a pointless exercise.
> 
> There are all kinds of TDM schemes out there which are not directly driven
> by applications, but rather route categorized traffic like VLANs through
> dedicated time slices. That works pretty well with the above scheme because
> in that case the applications might be completely oblivious about the tx
> time schedule.
> 
> Surely there are protocols which do not utilize every time slice they could
> use, so we need a way to tell the number of empty slices between two
> consecutive packets. There are also different policies vs. the unused time
> slices, like sending dummy frames or just nothing which wants to be
> addressed, but I don't think that changes the general approach.
> 
> There might be some special cases for setup or node hotplug, but the
> protocols I'm familiar with handle these in dedicated time slices or
> through general traffic so it should just fit in.
> 
> I'm surely missing some details, but from my knowledge about the protocols
> which want to utilize this, the general direction should be fine.
> 
> Feel free to tell me that I'm missing the point completely though :)
> 
> Thoughts?


We agree with most of the above. :)
Actually, last year Vinicius shared our ideas for a "time-aware priority" root
qdisc as part of the cbs RFC cover letter, dubbed 'taprio':

https://patchwork.ozlabs.org/cover/808504/

Our plan was to work directly with the Qbv-like scheduling (per-port) just after
the cbs qdisc (Qav), but the feedback here and offline was that there were use
cases for a more simplistic launchtime approach (per-queue) as well. We've
decided to invest in it first (and postpone the 'taprio' qdisc until there was
a NIC available with HW support for it, basically).

You are right, and we agree, that using tbs for a per-port schedule of any sort
will require a SW scheduler to be developed on top of it, but we've never said
the contrary either. Our vision has always been that these are separate
mechanisms with different use-cases, so we do see the value for the kernel to
provide both.

In other words, tbs is not the final solution for Qbv, and we agree that a 'TAS'
qdisc is still necessary. And due to the wide range of applications and hw being
used for those out there, we need both, especially given that one does not block
the other.


What do you think?

Thanks,
Jesus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-21 13:46     ` [Intel-wired-lan] " Thomas Gleixner
  (?)
  (?)
@ 2018-03-22 20:29     ` Jesus Sanchez-Palencia
  2018-03-22 22:11       ` Thomas Gleixner
  -1 siblings, 1 reply; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-22 20:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, john.stultz, levi.pearson,
	edumazet, willemb, mlichvar

Hi Thomas,


On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> +struct tbs_sched_data {
>> +	bool sorting;
>> +	int clockid;
>> +	int queue;
>> +	s32 delta; /* in ns */
>> +	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
>> +	struct rb_root head;
> 
> Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> you could reuse the timerqueue implementation?
> 
> That requires to add a timerqueue node to struct skbuff
> 
> @@ -671,7 +671,8 @@ struct sk_buff {
>  				unsigned long		dev_scratch;
>  			};
>  		};
> -		struct rb_node	rbnode; /* used in netem & tcp stack */
> +		struct rb_node		rbnode; /* used in netem & tcp stack */
> +		struct timerqueue_node	tqnode;
>  	};
>  	struct sock		*sk;
> 
> Then you can use timerqueue_head in your scheduler data and all the open
> coded rbtree handling goes away.


Yes, you are right. We actually looked into that for the first prototype of this
qdisc but we weren't so sure about adding the timerqueue node to the sk_buff's
union and whether it would impact the other usages here, but looking again now,
it looks fine.

We'll fix for the next version, thanks.


> 
>> +static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
>> +{
>> +	struct tbs_sched_data *q = qdisc_priv(sch);
>> +	ktime_t txtime = nskb->tstamp;
>> +	struct sock *sk = nskb->sk;
>> +	ktime_t now;
>> +
>> +	if (sk && !sock_flag(sk, SOCK_TXTIME))
>> +		return false;
>> +
>> +	/* We don't perform crosstimestamping.
>> +	 * Drop if packet's clockid differs from qdisc's.
>> +	 */
>> +	if (nskb->txtime_clockid != q->clockid)
>> +		return false;
>> +
>> +	now = get_time_by_clockid(q->clockid);
> 
> If you store the time getter function pointer in tbs_sched_data then you
> avoid the lookup and just can do
> 
>        now = q->get_time();
> 
> That applies to lots of other places.


Good idea, thanks. Will fix.
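
For reference, a user-space sketch of that suggestion: resolve the clockid to
a getter once at init time and call through the pointer afterwards. It uses
clock_gettime() in place of the kernel's ktime getters, and
struct sched_data_sketch, sched_data_init() and the get_time_*() helpers are
hypothetical names chosen for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11
#endif

typedef int64_t (*get_time_fn)(void);

/* Read a clock in nanoseconds, or return -1 on failure. */
static int64_t read_clock_ns(clockid_t id)
{
	struct timespec ts;

	if (clock_gettime(id, &ts))
		return -1;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static int64_t get_time_realtime(void)  { return read_clock_ns(CLOCK_REALTIME); }
static int64_t get_time_monotonic(void) { return read_clock_ns(CLOCK_MONOTONIC); }
static int64_t get_time_tai(void)       { return read_clock_ns(CLOCK_TAI); }

struct sched_data_sketch {
	clockid_t   clockid;
	get_time_fn get_time;   /* resolved once, used on every enqueue/dequeue */
};

/* Pick the getter for a supported clockid; reject everything else. */
static inline int sched_data_init(struct sched_data_sketch *q, clockid_t clockid)
{
	switch (clockid) {
	case CLOCK_REALTIME:
		q->get_time = get_time_realtime;
		break;
	case CLOCK_MONOTONIC:
		q->get_time = get_time_monotonic;
		break;
	case CLOCK_TAI:
		q->get_time = get_time_tai;
		break;
	default:
		q->get_time = NULL;
		return -1;
	}
	q->clockid = clockid;
	return 0;
}
```

After init the hot path is just now = q->get_time(), and the clockid sanity
check comes for free because unsupported clocks never get a getter.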



>> +
>> +static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
>> +{
>> +	struct tbs_sched_data *q = qdisc_priv(sch);
>> +	struct rb_node *p;
>> +
>> +	p = rb_first(&q->head);
> 
> timerqueue gives you direct access to the first expiring entry w/o walking
> the rbtree. So that would become:
> 
> 	p = timerqueue_getnext(&q->tqhead);
> 	return p ? rb_to_skb(p) : NULL;

OK.

(...)

>> +static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
>> +{
>> +	struct tbs_sched_data *q = qdisc_priv(sch);
>> +	struct sk_buff *skb = tbs_peek(sch);
>> +	ktime_t now, next;
>> +
>> +	if (!skb)
>> +		return NULL;
>> +
>> +	now = get_time_by_clockid(q->clockid);
>> +
>> +	/* Drop if packet has expired while in queue and the drop_if_late
>> +	 * flag is set.
>> +	 */
>> +	if (skb->tc_drop_if_late && ktime_before(skb->tstamp, now)) {
>> +		struct sk_buff *to_free = NULL;
>> +
>> +		qdisc_queue_drop_head(sch, &to_free);
>> +		kfree_skb_list(to_free);
>> +		qdisc_qstats_overlimit(sch);
>> +
>> +		skb = NULL;
>> +		goto out;
> 
> Instead of going out immediately you should check the next skb whether it's
> due for sending already.

We wanted to have a baseline before starting with the optimizations, so we left
this for a later patchset. It was one of the opens we had listed on the v2 cover
letter IIRC, but we'll look into it.
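
A possible shape for that change, sketched in user space with an array-backed
queue instead of skbs: expired drop_if_late packets are dropped in a loop and
the first remaining packet is checked against the [txtime - delta, txtime]
window, using the same comparisons as the quoted code. struct pkt_sketch,
struct queue_sketch and dequeue_sketch() are hypothetical names; this is not
the patch code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pkt_sketch {
	int64_t txtime_ns;
	bool    drop_if_late;
};

/* Time-ordered queue backed by an array; 'head' indexes the next packet. */
struct queue_sketch {
	struct pkt_sketch *pkts;
	size_t head, len;
	unsigned int dropped;
};

/*
 * Return the next packet that may be sent at 'now', dropping expired
 * drop_if_late packets along the way, or NULL if nothing is due yet.
 */
static inline struct pkt_sketch *
dequeue_sketch(struct queue_sketch *q, int64_t now, int64_t delta_ns)
{
	while (q->head < q->len) {
		struct pkt_sketch *p = &q->pkts[q->head];

		if (p->drop_if_late && p->txtime_ns < now) {
			q->head++;
			q->dropped++;
			continue;	/* keep looking instead of bailing out */
		}

		/* Dequeue only once 'now' has passed txtime - delta. */
		if (now > p->txtime_ns - delta_ns) {
			q->head++;
			return p;
		}
		return NULL;		/* head is not due yet */
	}
	return NULL;			/* queue is empty */
}
```

The caller would still re-arm the watchdog for the new head after this
returns, exactly as the current code does at the out: label.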


(...)


>> +	}
>> +
>> +	next = ktime_sub_ns(skb->tstamp, q->delta);
>> +
>> +	/* Dequeue only if now is within the [txtime - delta, txtime] range. */
>> +	if (ktime_after(now, next))
>> +		timesortedlist_erase(sch, skb, false);
>> +	else
>> +		skb = NULL;
>> +
>> +out:
>> +	/* Now we may need to re-arm the qdisc watchdog for the next packet. */
>> +	reset_watchdog(sch);
>> +
>> +	return skb;
>> +}
>> +
>> +static inline void setup_queueing_mode(struct tbs_sched_data *q)
>> +{
>> +	if (q->sorting) {
>> +		q->enqueue = tbs_enqueue_timesortedlist;
>> +		q->dequeue = tbs_dequeue_timesortedlist;
>> +		q->peek = tbs_peek_timesortedlist;
>> +	} else {
>> +		q->enqueue = tbs_enqueue_scheduledfifo;
>> +		q->dequeue = tbs_dequeue_scheduledfifo;
>> +		q->peek = qdisc_peek_head;
> 
> I don't see the point of these two modes and all the duplicated code it
> involves.
> 
> FIFO mode limits usage to a single thread which has to guarantee that the
> packets are queued in time order.
> 
> If you look at the use cases of TDM in various fields then FIFO mode is
> pretty much useless. In industrial/automotive fieldbus applications the
> various time slices are filled by different threads or even processes.
> 
> Sure, the rbtree queue/dequeue has overhead compared to a simple linked
> list, but you pay for that with more indirections and lots of mostly
> duplicated code. And in the worst case one of these code paths is going to
> be rarely used and prone to bitrot.


Our initial version (on RFC v2) was performing the sorting for all modes. After
all the feedback we got we decided to make it optional and provide FIFO modes as
well. For the SW fallback we need the scheduled FIFO, and for "pure" hw offload
we need the "raw" FIFO.

This was a way to accommodate all the use cases without imposing too much of a
burden onto anyone, regardless of their application's segment (i.e. industrial,
pro a/v, automotive, etc).

Having the sorting always enabled requires that a valid static clockid is passed
to the qdisc. For the hw offload mode, that means that the PHC and one of the
system clocks must be synchronized since hrtimers do not support dynamic clocks.
Not all systems do that or want to, and given that we do not want to perform
crosstimestamping between the packets' clock reference and the qdisc's one, the
only solution for these systems would be using the raw hw offload mode.


Thanks,
Jesus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
  2018-03-21 16:18       ` Thomas Gleixner
@ 2018-03-22 22:01         ` Jesus Sanchez-Palencia
  0 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-22 22:01 UTC (permalink / raw)
  To: Thomas Gleixner, Richard Cochran
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes, anna-maria,
	henrik, John Stultz, levi.pearson, edumazet, willemb, mlichvar

Hi,


On 03/21/2018 09:18 AM, Thomas Gleixner wrote:
> On Wed, 21 Mar 2018, Richard Cochran wrote:
> 
>> On Wed, Mar 21, 2018 at 03:22:11PM +0100, Thomas Gleixner wrote:
>>> Which clockid will be handed in from the application? The network adapter
>>> time has no fixed clockid. The only way you can get to it is via a fd based
>>> posix clock and that does not work at all because the qdisc setup might
>>> have a different FD than the application which queues packets.
>>
>> Duh.  That explains it.  Please ignore my "why not?" Q in the other thread...
> 
> :)
> 
> So in that case you are either bound to rely on the application to use the
> proper dynamic clock or if we need a sanity check, then you need a cookie
> of some form which can be retrieved from the posix clock file descriptor
> and handed in as 'clockid' together with clock_adapter = true.
> 
> That's doable, but that needs a bit more trickery. A simple unique ID per
> dynamic posix-clock would be trivial to add, but that would not give you
> any form of verification whether this ID actually belongs to the network
> adapter or not.
> 
> So either you ignore the clockid and rely on the application not being
> stupid when it says "clock_adapter = true" or you need some extra
> complexity to build an association of a "clockid" to a network adapter.
> 
> There is a connection already, via
> 
>      adapter->ptp_clock->devid
> 
> which is MKDEV(major, index) which is accessible at least at the network
> driver level, but probably not from networking core. So you'd need to drill
> a few more holes by adding yet another callback to net_device_ops.
> 
> I'm not sure if it's worth the trouble. If the application hands in bogus
> timestamps, packets go out at the wrong time or are dropped. That's true
> whether it uses the proper clock or not. So nothing the kernel should
> really worry about.


+1 and that is the approach we've taken so far with the qdisc setting
"CLOCKID_INVALID" to its internal clockid for the "raw" (non-assisted) hw
offload case.

thanks,
Jesus



> 
> For clock_system - REAL/MONO/TAI(sigh) - you surely need a sanity check,
> but that is independent of the underlying network adapter even in the
> qdisc assisted HW offload case.
> 
> Thanks,
> 
> 	tglx
> 
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-22 20:29     ` Jesus Sanchez-Palencia
@ 2018-03-22 22:11       ` Thomas Gleixner
  2018-03-22 23:26         ` Jesus Sanchez-Palencia
  0 siblings, 1 reply; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-22 22:11 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, john.stultz, levi.pearson,
	edumazet, willemb, mlichvar

On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> > If you look at the use cases of TDM in various fields then FIFO mode is
> > pretty much useless. In industrial/automotive fieldbus applications the
> > various time slices are filled by different threads or even processes.
> > 
> > Sure, the rbtree queue/dequeue has overhead compared to a simple linked
> > list, but you pay for that with more indirections and lots of mostly
> > duplicated code. And in the worst case one of these code paths is going to
> > be rarely used and prone to bitrot.
> 
> 
> Our initial version (on RFC v2) was performing the sorting for all modes. After
> all the feedback we got we decided to make it optional and provide FIFO modes as
> well. For the SW fallback we need the scheduled FIFO, and for "pure" hw offload
> we need the "raw" FIFO.

I don't see how FIFO ever works without the issue that a newly queued
packet which has an earlier time stamp than the head of the FIFO list will
lose. Why would you even want to have that mode? Just because some weird
existing application misdesign thinks it's required? That doesn't make it a
good idea.
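
To make the ordering problem concrete, here is a minimal user space sketch
(not code from the patch series; the struct and function names are made up
for illustration, and a plain insertion sort stands in for the rbtree). It
queues two packets where the second one has the earlier txtime and shows
which one ends up at the head in each variant.

```c
#include <stddef.h>
#include <stdint.h>

#define QLEN 8

/* Toy queue of txtimes in ns; illustrative only, not the tbs code. */
struct txq {
    int64_t txtime[QLEN];
    size_t len;
};

/* FIFO enqueue: append at the tail regardless of txtime. */
int fifo_enqueue(struct txq *q, int64_t t)
{
    if (q->len == QLEN)
        return -1;
    q->txtime[q->len++] = t;
    return 0;
}

/* Sorted enqueue: insertion keeps the earliest txtime at index 0. */
int sorted_enqueue(struct txq *q, int64_t t)
{
    size_t i;

    if (q->len == QLEN)
        return -1;
    for (i = q->len; i > 0 && q->txtime[i - 1] > t; i--)
        q->txtime[i] = q->txtime[i - 1];
    q->txtime[i] = t;
    q->len++;
    return 0;
}

/* Both variants dequeue from the head; returns -1 on an empty queue. */
int64_t peek_head(const struct txq *q)
{
    return q->len ? q->txtime[0] : -1;
}
```

Enqueueing txtime 2000 and then 1000 leaves 2000 at the head of the FIFO
variant, so the packet due at 1000 waits behind it, while the sorted
variant has 1000 at the head.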

With pure hardware offload the packets are immediately handed off to the
network card and that one is responsible for sending it on time. So there
is no FIFO at all. It's actually a bypass mode.

> This was a way to accommodate all the use cases without imposing too much of a
> burden onto anyone, regardless of their application's segment (i.e. industrial,
> pro a/v, automotive, etc).

I'm not buying that argument at all. That's all handwaving.

The whole approach is a burden on every application segment because it
pushes the whole schedule and time slice management out to user space,
which also requires that you route general traffic down to that user space
scheduling entity and then queue it back into the proper time slice. And
FIFO makes that even worse.

> Having the sorting always enabled requires that a valid static clockid is passed
> to the qdisc. For the hw offload mode, that means that the PHC and one of the
> system clocks must be synchronized since hrtimers do not support dynamic clocks.
> Not all systems do that or want to, and given that we do not want to perform
> crosstimestamping between the packets' clock reference and the qdisc's one, the
> only solution for these systems would be using the raw hw offload mode.

There are two variants of hardware offload:

1) Full hardware offload

   That bypasses the queue completely. You just stick the thing into the
   scatter gather buffers. Except when there is no room anymore, then you
   have to queue, but it does not make any difference if you queue in FIFO
   or in time order. The packets go out in time order anyway.

2) Single packet hardware offload

   What you do here is to schedule a hrtimer a bit earlier than the first
   packet tx time and when it fires stick the packet into the hardware and
   rearm the timer for the next one.

   The whole point of TSN with hardware support is that you have:

       - Global network time

       and

       - Frequency adjustment of the system time base

    PTP is TAI based and the kernel exposes clock TAI directly through
    hrtimers. You don't need dynamic clocks for that.

    You can even use clock MONOTONIC as it basically is just

       TAI - offset

If the network card uses anything else than TAI or a time stamp with a
strict correlation to TAI for actual TX scheduling then the whole thing is
broken to begin with.
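
For variant 2) the timer arithmetic boils down to something like the
following sketch. The helper name and the "delta" lead time parameter are
assumptions for illustration; this is not the qdisc watchdog code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: when should the timer fire so that a packet with
 * the given txtime is handed to the hardware "delta" ns ahead of time?
 * If that moment has already passed, fire right away (at "now").
 */
int64_t timer_expiry_ns(int64_t txtime_ns, int64_t delta_ns, int64_t now_ns)
{
    int64_t expiry = txtime_ns - delta_ns;

    return expiry > now_ns ? expiry : now_ns;
}
```

After the timer fires and the head packet has been handed to the hardware,
the same calculation is repeated with the txtime of the next packet in the
queue to rearm the timer.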

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-22 20:25       ` Jesus Sanchez-Palencia
@ 2018-03-22 22:52         ` Thomas Gleixner
  2018-03-24  0:34           ` Jesus Sanchez-Palencia
  0 siblings, 1 reply; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-22 22:52 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
> Our plan was to work directly with the Qbv-like scheduling (per-port) just after
> the cbs qdisc (Qav), but the feedback here and offline was that there were use
> cases for a more simplistic launchtime approach (per-queue) as well. We've
> decided to invest on it first (and postpone the 'taprio' qdisc until there was
> NIC available with HW support for it, basically).

I missed that discussion due to other urgent stuff on my plate. Just
skimmed through it. More below.

> You are right, and we agree, that using tbs for a per-port schedule of any sort
> will require a SW scheduler to be developed on top of it, but we've never said
> the contrary either. Our vision has always been that these are separate
> mechanisms with different use-cases, so we do see the value for the kernel to
> provide both.
> 
> In other words, tbs is not the final solution for Qbv, and we agree that a 'TAS'
> qdisc is still necessary. And due to the wide range of applications and hw being
> used for those out there, we need both, especially given that one does not block
> the other.

So what's the plan for this? Having TAS as a separate entity or TAS feeding
into the proposed 'basic' time transmission thing?

The general objection I have with the current approach is that it creates
the playground for all flavours of misdesigned user space implementations
and just replaces the home-brewed and ugly user mode network adapter
drivers.

But that's not helping the cause at all. There is enough crappy stuff out
there already and I'd rather see a properly designed slice management which can
be utilized and improved by all involved parties.

All variants which utilize the basic time driven packet transmission are
based on periodic explicit plan scheduling with (local) network wide time
slice assignment.

It does not matter whether you feed VLAN traffic into a time slice, where
the VLAN itself does not even have to know about it, or if you have aware
applications feeding packets to a designated timeslot. The basic principle
of this is always the same.

So coming back to last year's discussion. It totally went into the wrong
direction because it turned from an approach (the patches) which came from
the big picture to a single use case and application centric view. That's
just wrong and I regret that I didn't have the time to pay attention back
then.

You always need to look at the big picture first and design from there, not
the other way round. There will always be the argument:

    But my application is special and needs X

It's easy to fall for that. From a long experience I know that none of
these claims ever held. These arguments are made because the people making
them have either never looked at the big picture or are simply refusing to
do so because it would cause them work.

If you start from the use case and application centric view and ignore the
big picture then you end up in a gazillion of extra magic features over
time which could have been completely avoided if you had put your foot down
and made everyone to agree on a proper and versatile design in the first
place.

The more low level access you hand out in the beginning the less commonly
used, improved and maintained infrastructure you will get in the end. That
has happened before in other areas and it will happen here as well. You
create a user space ABI which you can't get rid of and before you come out
with the proper interface after that a large number of involved parties
have gone off and implemented on top of the low level ABI and they will
never look back.

In the (not so) long run this will create a lot more issues than it
solves. A simple example is that you cannot run two applications which
easily could share the network in parallel without major surgery because
both require to be the management authority.

I've not yet seen a convincing argument why this low level stuff with all
of its weird flavours is superior over something which reflects the basic
operating principle of TSN.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
  2018-03-21 14:22     ` [Intel-wired-lan] " Thomas Gleixner
@ 2018-03-22 23:15       ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-22 23:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, henrik, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

Hi,


On 03/21/2018 07:22 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>>            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
>>
>> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
>>
>> In this example, the Qdisc will use HW offload for the control of the
>> transmission time through the network adapter. It's assumed the timestamp
>> in skbuffs are in reference to the interface's PHC and setting any other
>> valid clockid would be treated as an error. Because there is no
>> scheduling being performed in the qdisc, setting a delta != 0 would also
>> be considered an error.
> 
> Which clockid will be handed in from the application? The network adapter
> time has no fixed clockid. The only way you can get to it is via a fd based
> posix clock and that does not work at all because the qdisc setup might
> have a different FD than the application which queues packets.


Yes. As a result, we came up with a rather simplistic solution that would still
allow for dynamic clocks to be used in the future without any API changes. As of
the v3 RFC, the qdisc returns -EINVAL if a netlink application (i.e. tc) tries
to initialize it in 'raw' hw offload passing any clockid != CLOCKID_INVALID. The
skbuffs' clockid was initialized with the same value, so if the application sets
its value to any other valid clockids through the cmsg interface, the qdisc
would just drop the packets on enqueue() due to the mismatch.

In other words, dynamic clocks are currently not used at all.
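
Roughly, the v3 behaviour described above amounts to the checks below. This
is a simplified user space sketch, not the code from the patches: the
struct, the function names and the placeholder value of CLOCKID_INVALID are
assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define CLOCKID_INVALID (-1)    /* placeholder value, assumed for this sketch */

/* Hypothetical view of the qdisc setup arguments. */
struct tbs_cfg {
    bool offload;
    bool sorting;
    int clockid;
    int64_t delta_ns;
};

/* Setup time check: 'raw' hw offload takes neither a clockid nor a delta. */
int tbs_check_init(const struct tbs_cfg *cfg)
{
    bool raw = cfg->offload && !cfg->sorting;

    if (raw && (cfg->clockid != CLOCKID_INVALID || cfg->delta_ns != 0))
        return -EINVAL;
    /* Sorting needs a valid static clockid as its time reference. */
    if (cfg->sorting && cfg->clockid == CLOCKID_INVALID)
        return -EINVAL;
    return 0;
}

/* Enqueue time check: a packet with a mismatching clockid is dropped. */
bool tbs_accepts_pkt(const struct tbs_cfg *cfg, int pkt_clockid)
{
    return pkt_clockid == cfg->clockid;
}
```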

(I noticed later that this was broken anyway because the definition of invalid
clockids from posix-timers.h is actually only valid for negative numbers.)

Given all the feedback against adding the clockid into struct sk_buff, for the
next version, we'll have to re-think this anyway now that clockid will be set
per socket (i.e. as an argument to the SO_TXTIME) and not per packet anymore.




> 
> I think this should look like this:
> 
>     clock_adapter:	1 = clock of the network adapter
>     			0 = system clock selected by clock_system
> 
>     clock_system:	0 = CLOCK_REALTIME
>     			1 = CLOCK_MONOTONIC
> 
> or something like that.
> 
>> Example 2:
>>
>> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>>            map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
>>
>> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 100000 \
>> 	   clockid CLOCK_REALTIME sorting
>>
>> Here, the Qdisc will use HW offload for the txtime control again,
>> but now sorting will be enabled, and thus there will be scheduling being
>> performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
>> reference and packets leave the Qdisc "delta" (100000) nanoseconds before
>> their transmission time. Because this will be using HW offload and
>> since dynamic clocks are not supported by the hrtimer, the system clock
>> and the PHC clock must be synchronized for this mode to behave as expected.
> 
> So what you do here is queueing the packets in the qdisc and then schedule
> them at some point ahead of actual transmission time for delivery to the
> hardware. That delivery uses the same txtime as used for qdisc scheduling
> to tell the hardware when the packet should go on the wire. That's needed
> when the network adapter does not support queueing of multiple packets.
> 
> Bah, and probably there you need CLOCK_TAI because that's what PTP is based
> on, so clock_system needs to accommodate that as well. Dammit, there goes
> the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
> bits plus the adapter bit.
> 
> Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
> don't see us adding new fixed clocks, so we really can reserve #15 for
> selecting the adapter clock if sparing that extra bit is truly required.


So what about just using the previous single 'clockid' argument, but then just
adding to uapi time.h something like:

#define DYNAMIC_CLOCKID 15

And using it for that, instead. This way applications that will use the raw hw
offload mode must use this value for their per-socket clockid, and the qdisc's
clockid would be implicitly initialized to the same value.
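
To illustrate how that would fit into the 4 clock bits mentioned above,
here is a small sketch. Only DYNAMIC_CLOCKID and its value come from the
proposal; the mask and the helper names are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define DYNAMIC_CLOCKID 15      /* proposed above: selects the adapter clock */
#define CLOCKID_BITS    4
#define CLOCKID_MASK    ((1u << CLOCKID_BITS) - 1)

/* Store a clockid (0..15) in the low bits of a flags word. */
int pack_clockid(uint32_t *flags, int clockid)
{
    if (clockid < 0 || clockid > DYNAMIC_CLOCKID)
        return -1;
    *flags = (*flags & ~CLOCKID_MASK) | (uint32_t)clockid;
    return 0;
}

/* Read the clockid back from the flags word. */
int unpack_clockid(uint32_t flags)
{
    return (int)(flags & CLOCKID_MASK);
}

/* True when the reserved id selects the network adapter clock. */
bool uses_adapter_clock(uint32_t flags)
{
    return unpack_clockid(flags) == DYNAMIC_CLOCKID;
}
```

With this layout CLOCK_TAI (11) and the reserved value 15 share the same 4
bits, so no separate adapter bit is needed.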

What do you think?

Thanks,
Jesus



> 
> Thanks,
> 
> 	tglx
> 
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-22 22:11       ` Thomas Gleixner
@ 2018-03-22 23:26         ` Jesus Sanchez-Palencia
  2018-03-23  8:49           ` Thomas Gleixner
  0 siblings, 1 reply; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-22 23:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, john.stultz, levi.pearson,
	edumazet, willemb, mlichvar

Hi Thomas,


On 03/22/2018 03:11 PM, Thomas Gleixner wrote:

(...)

>> Having the sorting always enabled requires that a valid static clockid is passed
>> to the qdisc. For the hw offload mode, that means that the PHC and one of the
>> system clocks must be synchronized since hrtimers do not support dynamic clocks.
>> Not all systems do that or want to, and given that we do not want to perform
>> crosstimestamping between the packets' clock reference and the qdisc's one, the
>> only solution for these systems would be using the raw hw offload mode.
> 
> There are two variants of hardware offload:
> 
> 1) Full hardware offload
> 
>    That bypasses the queue completely. You just stick the thing into the
>    scatter gather buffers. Except when there is no room anymore, then you
>    have to queue, but it does not make any difference if you queue in FIFO
>    or in time order. The packets go out in time order anyway.


Illustrating your variants with the current qdisc's setup arguments.

The above is:
- sorting off
- offload on

(I call it a 'raw' fifo as a reference to the usage of qdisc_enqueue_tail() and
qdisc_dequeue_head(), basically.)


> 
> 2) Single packet hardware offload
> 
>    What you do here is to schedule a hrtimer a bit earlier than the first
>    packet tx time and when it fires stick the packet into the hardware and
>    rearm the timer for the next one.


The above is:
- sorting on
- offload on

right?


So, are you just opposed to the case where sorting off + offload off is used?
(i.e. the scheduled FIFO case)
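
Spelled out as code, the four combinations of the two setup arguments map
to the modes discussed in this thread as in the sketch below. The enum and
function names are made up for illustration and do not come from the
patches.

```c
#include <stdbool.h>

/* The four variants that result from the two boolean arguments. */
enum tbs_mode {
    TBS_SCHED_FIFO,     /* sorting off, offload off: scheduled FIFO */
    TBS_SW_SORTED,      /* sorting on,  offload off: SW best effort */
    TBS_RAW_FIFO,       /* sorting off, offload on:  'raw' hw offload */
    TBS_HW_SORTED,      /* sorting on,  offload on:  qdisc assisted offload */
};

enum tbs_mode tbs_mode_of(bool sorting, bool offload)
{
    if (offload)
        return sorting ? TBS_HW_SORTED : TBS_RAW_FIFO;
    return sorting ? TBS_SW_SORTED : TBS_SCHED_FIFO;
}
```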



> 
>    The whole point of TSN with hardware support is that you have:
> 
>        - Global network time
> 
>        and
> 
>        - Frequency adjustment of the system time base
> 
>     PTP is TAI based and the kernel exposes clock TAI directly through
>     hrtimers. You don't need dynamic clocks for that.
> 
>     You can even use clock MONOTONIC as it basically is just
> 
>        TAI - offset
> 
> If the network card uses anything else than TAI or a time stamp with a
> strict correlation to TAI for actual TX scheduling then the whole thing is
> broken to begin with.


Sure, I agree.

Thanks,
Jesus

> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-22 23:26         ` Jesus Sanchez-Palencia
@ 2018-03-23  8:49           ` Thomas Gleixner
  2018-03-23 23:34             ` Jesus Sanchez-Palencia
  0 siblings, 1 reply; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-23  8:49 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, john.stultz, levi.pearson,
	edumazet, willemb, mlichvar

On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/22/2018 03:11 PM, Thomas Gleixner wrote:
> So, are you just opposed to the case where sorting off + offload off is used?
> (i.e. the scheduled FIFO case)

FIFO does not make any sense if your packets have a fixed transmission
time. I yet have to see a reasonable explanation why FIFO in the context of
time ordered would be a good thing.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS
  2018-03-22 23:15       ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-03-23  8:51         ` Thomas Gleixner
  -1 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-23  8:51 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, henrik, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/21/2018 07:22 AM, Thomas Gleixner wrote:
> > Bah, and probably there you need CLOCK_TAI because that's what PTP is based
> > on, so clock_system needs to accommodate that as well. Dammit, there goes
> > the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
> > bits plus the adapter bit.
> > 
> > Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
> > don't see us adding new fixed clocks, so we really can reserve #15 for
> > selecting the adapter clock if sparing that extra bit is truly required.
> 
> 
> So what about just using the previous single 'clockid' argument, but then just
> adding to uapi time.h something like:
> 
> #define DYNAMIC_CLOCKID 15
> 
> And using it for that, instead. This way applications that will use the raw hw
> offload mode must use this value for their per-socket clockid, and the qdisc's
> clockid would be implicitly initialized to the same value.

That's what I suggested above.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-23  8:49           ` Thomas Gleixner
@ 2018-03-23 23:34             ` Jesus Sanchez-Palencia
  0 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-23 23:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, john.stultz, levi.pearson,
	edumazet, willemb, mlichvar

Hi Thomas,


On 03/23/2018 01:49 AM, Thomas Gleixner wrote:
> On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
>> On 03/22/2018 03:11 PM, Thomas Gleixner wrote:
>> So, are you just opposed to the case where sorting off + offload off is used?
>> (i.e. the scheduled FIFO case)
> 
> FIFO does not make any sense if your packets have a fixed transmission
> time. I yet have to see a reasonable explanation why FIFO in the context of
> time ordered would be a good thing.


In the context of tbs, the scheduled FIFO was developed just so consistency was kept
between all 4 variants, basically (sw best-effort or hw offload vs sorting
enabled or sorting disabled).

I don't have any strong argument in favor of this mode at the moment, so I will
just remove it on a next version - unless someone else brings up a valid use
case for it, of course.

Thanks for the feedback,
Jesus

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-22 22:52         ` Thomas Gleixner
@ 2018-03-24  0:34           ` Jesus Sanchez-Palencia
  2018-03-25 11:46             ` Thomas Gleixner
  0 siblings, 1 reply; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-24  0:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

Hi,


On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
> On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
>> Our plan was to work directly with the Qbv-like scheduling (per-port) just after
>> the cbs qdisc (Qav), but the feedback here and offline was that there were use
>> cases for a more simplistic launchtime approach (per-queue) as well. We've
>> decided to invest on it first (and postpone the 'taprio' qdisc until there was
>> NIC available with HW support for it, basically).
> 
> I missed that discussion due to other urgent stuff on my plate. Just
> skimmed through it. More below.
> 
>> You are right, and we agree, that using tbs for a per-port schedule of any sort
>> will require a SW scheduler to be developed on top of it, but we've never said
>> the contrary either. Our vision has always been that these are separate
>> mechanisms with different use-cases, so we do see the value for the kernel to
>> provide both.
>>
>> In other words, tbs is not the final solution for Qbv, and we agree that a 'TAS'
>> qdisc is still necessary. And due to the wide range of applications and hw being
>> used for those out there, we need both, especially given that one does not block
>> the other.
> 
> So what's the plan for this? Having TAS as a separate entity or TAS feeding
> into the proposed 'basic' time transmission thing?


The second one, I guess. Elaborating, the plan is at some point having TAS as a
separate entity, but which can use tbs for one of its classes (and cbs for
another, and strict priority for everything else, etc).

Basically, the design would be something along the lines of 'taprio'. A root qdisc
that is both time and priority aware, and capable of running a schedule for the
port. That schedule can run inside the kernel with hrtimers, or just be
offloaded into the controller if Qbv is supported on HW.

Because it would expose the inner traffic classes in a mq / mqprio / prio style,
then it would allow for other per-queue qdiscs to be attached to it. On a system
using the i210, for instance, we could then have tbs installed on traffic class
0 just dialing hw offload. The Qbv schedule would be running in SW on the TAS
entity (i.e. 'taprio') which would be setting the packets' txtime before
dequeueing packets on a fast path -> tbs -> NIC.

Similarly, other qdiscs, like cbs, could be installed if all that traffic class
requires is traffic shaping once its 'gate' is allowed to execute the selected
tx algorithm attached to it.
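
To give an idea of how such a root qdisc could derive a txtime from the
port schedule, here is a rough user space sketch. Nothing like this exists
in the series yet; the structure layout and the function name are
assumptions made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a cyclic gate schedule (hypothetical layout). */
struct gate_entry {
    uint8_t gate_mask;      /* bit N set: traffic class N may transmit */
    int64_t duration_ns;    /* how long this entry stays active */
};

/*
 * Earliest time >= now_ns at which the gate of traffic class tc is open,
 * for a schedule that repeats from base_ns on. Returns -1 if the gate
 * never opens or the schedule is invalid.
 */
int64_t next_gate_open_ns(const struct gate_entry *e, size_t n,
                          int64_t base_ns, int64_t now_ns, unsigned int tc)
{
    int64_t cycle_ns = 0, off, cycle_start, start = 0;
    size_t i;

    if (tc >= 8)
        return -1;
    for (i = 0; i < n; i++) {
        if (e[i].duration_ns <= 0)
            return -1;
        cycle_ns += e[i].duration_ns;
    }
    if (cycle_ns == 0)
        return -1;
    if (now_ns < base_ns)
        now_ns = base_ns;

    off = (now_ns - base_ns) % cycle_ns;    /* position in current cycle */
    cycle_start = now_ns - off;

    /* Walk two cycles so a gate that already closed is found again. */
    for (i = 0; i < 2 * n; i++) {
        const struct gate_entry *cur = &e[i % n];
        int64_t end = start + cur->duration_ns;

        if (((cur->gate_mask >> tc) & 1) && end > off)
            return cycle_start + (start > off ? start : off);
        start = end;
    }
    return -1;
}
```

The returned value is what would be written into the packet's txtime before
it is passed down to tbs in hw offload mode.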



> 
> The general objection I have with the current approach is that it creates
> the playground for all flavours of misdesigned user space implementations
> and just replaces the home-brewed and ugly user mode network adapter
> drivers.
> 
> But that's not helping the cause at all. There is enough crappy stuff out
> there already and I'd rather see a properly designed slice management which can
> be utilized and improved by all involved parties.
> 
> All variants which utilize the basic time driven packet transmission are
> based on periodic explicit plan scheduling with (local) network wide time
> slice assignment.
> 
> It does not matter whether you feed VLAN traffic into a time slice, where
> the VLAN itself does not even have to know about it, or if you have aware
> applications feeding packets to a designated timeslot. The basic principle
> of this is always the same.
> 
> So coming back to last year's discussion. It totally went into the wrong
> direction because it turned from an approach (the patches) which came from
> the big picture to a single use case and application centric view. That's
> just wrong and I regret that I didn't have the time to pay attention back
> then.
> 
> You always need to look at the big picture first and design from there, not
> the other way round. There will always be the argument:
> 
>     But my application is special and needs X
> 
> It's easy to fall for that. From a long experience I know that none of
> these claims ever held. These arguments are made because the people making
> them have either never looked at the big picture or are simply refusing to
> do so because it would cause them work.
> 
> If you start from the use case and application centric view and ignore the
> big picture then you end up in a gazillion of extra magic features over
> time which could have been completely avoided if you had put your foot down
> and made everyone to agree on a proper and versatile design in the first
> place.
> 
> The more low level access you hand out in the beginning the less commonly
> used, improved and maintained infrastructure you will get in the end. That
> has happened before in other areas and it will happen here as well. You
> create a user space ABI which you can't get rid of and before you come out
> with the proper interface after that a large number of involved parties
> have gone off and implemented on top of the low level ABI and they will
> never look back.
> 
> In the (not so) long run this will create a lot more issues than it
> solves. A simple example is that you cannot run two applications which
> easily could share the network in parallel without major surgery because
> both require to be the management authority.
> 
> I've not yet seen a convincing argument why this low level stuff with all
> of its weird flavours is superior over something which reflects the basic
> operating principle of TSN.


As you know, not all TSN systems are designed the same. Take AVB systems, for
example. These are not always running on networks that are aware of any time
schedule, or at least not quite like what is described by Qbv.

On those systems there is usually a certain number of streams with different
priorities that care mostly about having their bandwidth reserved along the
network. The applications running on such systems are usually based on AVTP,
thus they already have to calculate and set the "avtp presentation time"
per-packet themselves. A Qbv scheduler would probably provide very little
benefit to this domain, IMHO. For "talkers" of these AVB systems, shaping
traffic using txtime (i.e. tbs) can provide a low-jitter alternative to cbs, for
instance.
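
As a simple illustration, a talker could derive the per-packet txtime of a
periodic stream as sketched below. The helper name is made up, and the
125 us interval (8000 packets per second, as commonly used for AVB class A
streams) is an assumption for the example, not something taken from the
patches.

```c
#include <stdint.h>

#define CLASS_A_INTERVAL_NS 125000LL    /* assumption: 8000 packets/s */

/* txtime of the seq-th packet of a stream that starts at base_ns. */
int64_t stream_txtime_ns(int64_t base_ns, uint64_t seq, int64_t interval_ns)
{
    return base_ns + (int64_t)seq * interval_ns;
}
```

Each packet is then handed to the socket with its own txtime, and the
spacing on the wire follows from the timestamps rather than from a credit
based shaper.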


Thanks,
Jesus

> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-24  0:34           ` Jesus Sanchez-Palencia
@ 2018-03-25 11:46             ` Thomas Gleixner
  2018-03-27 23:26               ` Jesus Sanchez-Palencia
  0 siblings, 1 reply; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-25 11:46 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

On Fri, 23 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
> > So what's the plan for this? Having TAS as a separate entity or TAS feeding
> > into the proposed 'basic' time transmission thing?
> 
> The second one, I guess.

That's just wrong. It won't work. See below.

> Elaborating, the plan is at some point having TAS as a separate entity,
> but which can use tbs for one of its classes (and cbs for another, and
> strict priority for everything else, etc).
>
> Basically, the design would something along the lines of 'taprio'. A root qdisc
> that is both time and priority aware, and capable of running a schedule for the
> port. That schedule can run inside the kernel with hrtimers, or just be
> offloaded into the controller if Qbv is supported on HW.
> 
> Because it would expose the inner traffic classes in a mq / mqprio / prio style,
> then it would allow for other per-queue qdiscs to be attached to it. On a system
> using the i210, for instance, we could then have tbs installed on traffic class
> 0 just dialing hw offload. The Qbv schedule would be running in SW on the TAS
> entity (i.e. 'taprio') which would be setting the packets' txtime before
> dequeueing packets on a fast path -> tbs -> NIC.
> 
> Similarly, other qdisc, like cbs, could be installed if all that traffic class
> requires is traffic shaping once its 'gate' is allowed to execute the selected
> tx algorithm attached to it.
> 
> > I've not yet seen a convincing argument why this low level stuff with all
> > of its weird flavours is superiour over something which reflects the basic
> > operating principle of TSN.
> 
> 
> As you know, not all TSN systems are designed the same. Take AVB systems, for
> example. These not always are running on networks that are aware of any time
> schedule, or at least not quite like what is described by Qbv.
> 
> On those systems there is usually a certain number of streams with different
> priorities that care mostly about having their bandwidth reserved along the
> network. The applications running on such systems are usually based on AVTP,
> thus they already have to calculate and set the "avtp presentation time"
> per-packet themselves. A Qbv scheduler would probably provide very little
> benefits to this domain, IMHO. For "talkers" of these AVB systems, shaping
> traffic using txtime (i.e. tbs) can provide a low-jitter alternative to cbs, for
> instance.

You're looking at it from particular use cases and trying to accommodate
them in the simplest possible way. I don't think that cuts it.

Let's take a step back and look at it from a more general POV without
trying to make it fit to any of the standards first. I'm deliberately NOT
using any of the standard defined terms.

At the (local) network level you always have an explicit plan. This plan
might range from no plan at all to a very elaborate plan which is strict
about when each node is allowed to TX a particular class of packets.

So let's assume we have the following picture:

   	       	  [NIC]
		    |
	 [ Time slice manager ]

Now in the simplest case, the time slice manager has no constraints and
exposes a single input which allows the application to say: "Send my packet
at time X". There is no restriction on 'time X' except if there is a time
collision with an already queued packet or the requested TX time has
already passed. That's close to what you implemented.

  Is the TX timestamp which you defined in the user space ABI a fixed
  scheduling point or is it a deadline?

  That's an important distinction and for this all to work across various
  use cases you need a way to express that in the ABI. It might be an
  implicit property of the socket/channel to which the application connects,
  but you still want to express it from the application side to do
  proper sanity checking.

  Just think about stuff like audio/video streaming. The point of
  transmission does not have to be fixed if you have some intelligent
  controller at the receiving end which can buffer stuff. The only relevant
  information is the deadline, i.e. the latest point in time where the
  packet needs to go out on the wire in order to keep the stream steady at
  the consumer side. Having the notion of a deadline, which is the only
  thing the provider knows about, allows for proper utilization by using an
  appropriate scheduling algorithm like EDF.

  Contrary to that you want very explicit TX points for applications like
  automation control. For this kind of use case there is no wiggle room, it
  has to go out at a fixed time because that's the way control systems
  work.

  This is missing right now and you want to get that right from the very
  beginning. Duct taping it on the interface later on is a bad idea.
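
  To illustrate the deadline case, here is a minimal sketch in Python of
  an earliest-deadline-first (EDF) queue. It is not taken from the posted
  patches; Packet, EdfQueue and the packet names are hypothetical, and
  dropping packets whose deadline has already passed is an assumption made
  for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Packet:
    deadline_ns: int                    # latest time the packet may go out on the wire
    name: str = field(compare=False)    # hypothetical label, for illustration only


class EdfQueue:
    """Earliest-deadline-first queue: always dequeue the most urgent packet."""

    def __init__(self):
        self._heap = []

    def enqueue(self, pkt):
        heapq.heappush(self._heap, pkt)

    def dequeue(self, now_ns):
        """Return the packet with the earliest deadline that can still be met.

        Packets whose deadline already passed are discarded (an assumption).
        """
        while self._heap:
            pkt = heapq.heappop(self._heap)
            if pkt.deadline_ns >= now_ns:
                return pkt
        return None


q = EdfQueue()
q.enqueue(Packet(3000, "video"))
q.enqueue(Packet(1000, "audio"))
q.enqueue(Packet(500, "stale"))
order = [q.dequeue(now_ns=800).name, q.dequeue(now_ns=900).name]
```

  The queue is free to pick any transmission point up to the deadline,
  which is exactly the freedom an explicit TX point does not have.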

Now let's go one step further and create two time slices for whatever
purpose, still on the single node (not network wide). You want to do that
because you want temporal separation of services. The reason might be
bandwidth guarantee, collision avoidance or whatever.

  How does the application which was written for the simple manager which
  had no restrictions learn about this?

  Does it learn it the hard way because now the packets which fall into the
  reserved timeslice are rejected? The way you created your interface, the
  answer is yes. That's patently bad as it requires changing the
  application once it runs on a partitioned node.

  So you really want a way for the application to query the timing
  constraints and perhaps other properties of the channel it connects
  to. And you want that now before the first application starts to use the
  new ABI. If the application developer does not use it, you still have to
  fix the application, but you have to fix it because the developer was a
  lazy bastard and not because the design was bad. That's a major
  difference.

Now that we have two time slices, I'm coming back to your idea of having
your proposed qdisc as the entity which sits right at the network
interface. Let's assume the following:

   [Slice 1: Timed traffic ] [Slice 2: Other Traffic]

  Let's assume further that 'Other traffic' has no idea about time slices at
  all. It's just stuff like ssh, http, etc. So if you keep that design

       	         [ NIC ]
  	            |
           [ Time slice manager ]
	       |          |
     [ Timed traffic ]  [ Other traffic ]

  feeding into your proposed TBS thingy, then in case of underutilization
  of the 'Timed traffic' slot you cannot use the remaining time by
  pulling 'Other traffic' into the empty slots, because 'Other traffic' is
  restricted to Slice 2 and 'Timed traffic' does not know about 'Other
  traffic' at all. And no, you cannot make TBS magically pull packets from
  'Other traffic', simply because it's not designed for it. So your design
  becomes strictly partitioned and forces underutilization.

  That becomes even worse when you switch to the proposed full hardware
  offloading scheme. In that case the only way to do admission control is
  the TX time of the farthest out packet which is already queued. That
  might work for a single application which controls all of the network
  traffic, but it won't ever work for something more flexible. The more I
  think about it the less interesting full hardware offload becomes. It's
  nice if you have a fully strict scheduling plan for everything, but then
  your admission control is bogus once you have more than one channel as
  input. So yes, it can be used when the card supports it and you have
  other ways to enforce admission control w/o hurting utilization or if you
  don't care about utilization at all. It's also useful for channels which
  are strictly isolated and have a defined TX time. Such traffic can be
  directly fed into the hardware.

Coming back to the overall scheme. If you start upfront with a time slice
manager which is designed to:

  - Handle multiple channels

  - Expose the time constraints, properties per channel

then you can fit all kinds of use cases, whether designed by committee or
not. You can configure that thing per node or network wide. It does not
make a difference. The only difference is the resulting constraints.

We really want to accommodate everything between the 'no restrictions' and
the 'full network wide explicit plan' case. And it's not rocket science
once you realize that the 'no restrictions' case is just a subset of the
'full network wide explicit plan' simply because it exposes a single
channel where:

	slice period = slice length.

It's that easy, but at the same time you teach the application from the
very beginning to ask for the time constraints so if it runs on a more
sophisticated system/network, then it will see a different slice period and
a different slice length and can accommodate or react in a useful way
instead of just dying on the 17th packet it tries to send because it is
rejected.
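
As a rough sketch of what "asking for the time constraints" could look
like from the application side, the Python below models one channel by its
start point, slice period and slice length. ChannelConstraints, accepts()
and next_valid() are hypothetical names, not an existing or proposed kernel
interface, and the nanosecond values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelConstraints:
    start_ns: int     # start point of the first slice
    period_ns: int    # slice period
    length_ns: int    # slice length (<= period)

    def accepts(self, txtime_ns):
        """True if txtime_ns falls inside one of this channel's slices."""
        if txtime_ns < self.start_ns:
            return False
        return (txtime_ns - self.start_ns) % self.period_ns < self.length_ns

    def next_valid(self, txtime_ns):
        """Earliest time >= txtime_ns that lies inside a slice."""
        if txtime_ns <= self.start_ns:
            return self.start_ns
        if self.accepts(txtime_ns):
            return txtime_ns
        elapsed = txtime_ns - self.start_ns
        return self.start_ns + (elapsed // self.period_ns + 1) * self.period_ns


# 'No restrictions' is just the case where slice period == slice length.
unrestricted = ChannelConstraints(start_ns=0, period_ns=1_000_000, length_ns=1_000_000)
# A partitioned node gives the same application a shorter slice.
partitioned = ChannelConstraints(start_ns=0, period_ns=1_000_000, length_ns=250_000)
```

An application written against such a query works unchanged on both
nodes: on the unrestricted one every TX time is accepted, on the
partitioned one it can move a rejected time to next_valid() instead of
failing.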

We really want to design for this as we want to be able to run the video
stream on the same node and network which does robot control without
changing the video application. That's not a theoretical problem. These use
cases exist today, but they are forced to use different networks for the
two. But if you look at the utilization of both then they very well fit
into one and industry certainly wants to go for that.

That implies that you need constraint aware applications from the very
beginning and that requires a proper ABI in the first place. The proposed
ad hoc mode does not qualify. Please be aware, that you are creating a user
space ABI and not a random in kernel interface which can be changed at any
given time.

So let's look once more at the picture in an abstract way:

     	       [ NIC ]
	          |
	 [ Time slice manager ]
	    |           |
         [ Ch 0 ] ... [ Ch N ]

So you have a bunch of properties here:

1) Number of Channels ranging from 1 to N

2) Start point, slice period and slice length per channel

3) Queueing modes assigned per channel. Again that might be anything from
   'feed through' over FIFO, PRIO to more complex things like EDF.

   The queueing mode can also influence properties like the meaning of the
   TX time, i.e. strict or deadline.

Please sit back and map your use cases, standards or whatever you care
about into the above and I would be very surprised if they don't fit.

Thanks,

	tglx


* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-25 11:46             ` Thomas Gleixner
@ 2018-03-27 23:26               ` Jesus Sanchez-Palencia
  2018-03-28  7:48                 ` Thomas Gleixner
  0 siblings, 1 reply; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-03-27 23:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

Hi Thomas,


On 03/25/2018 04:46 AM, Thomas Gleixner wrote:
> On Fri, 23 Mar 2018, Jesus Sanchez-Palencia wrote:
>> On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
>>> So what's the plan for this? Having TAS as a separate entity or TAS feeding
>>> into the proposed 'basic' time transmission thing?
>>
>> The second one, I guess.
>
> That's just wrong. It won't work. See below.

Yes, our proposal does not handle the scenarios you are bringing into the
discussion.

I think we have more points of convergence than divergence already. I will just
go through some pieces of the discussion first, and then let's see if we can
agree on where we are trying to get.



>
>> Elaborating, the plan is at some point having TAS as a separate entity,
>> but which can use tbs for one of its classes (and cbs for another, and
>> strict priority for everything else, etc).
>>
>> Basically, the design would something along the lines of 'taprio'. A root qdisc
>> that is both time and priority aware, and capable of running a schedule for the
>> port. That schedule can run inside the kernel with hrtimers, or just be
>> offloaded into the controller if Qbv is supported on HW.
>>
>> Because it would expose the inner traffic classes in a mq / mqprio / prio style,
>> then it would allow for other per-queue qdiscs to be attached to it. On a system
>> using the i210, for instance, we could then have tbs installed on traffic class
>> 0 just dialing hw offload. The Qbv schedule would be running in SW on the TAS
>> entity (i.e. 'taprio') which would be setting the packets' txtime before
>> dequeueing packets on a fast path -> tbs -> NIC.
>>
>> Similarly, other qdisc, like cbs, could be installed if all that traffic class
>> requires is traffic shaping once its 'gate' is allowed to execute the selected
>> tx algorithm attached to it.
>>
>>> I've not yet seen a convincing argument why this low level stuff with all
>>> of its weird flavours is superiour over something which reflects the basic
>>> operating principle of TSN.
>>
>>
>> As you know, not all TSN systems are designed the same. Take AVB systems, for
>> example. These not always are running on networks that are aware of any time
>> schedule, or at least not quite like what is described by Qbv.
>>
>> On those systems there is usually a certain number of streams with different
>> priorities that care mostly about having their bandwidth reserved along the
>> network. The applications running on such systems are usually based on AVTP,
>> thus they already have to calculate and set the "avtp presentation time"
>> per-packet themselves. A Qbv scheduler would probably provide very little
>> benefits to this domain, IMHO. For "talkers" of these AVB systems, shaping
>> traffic using txtime (i.e. tbs) can provide a low-jitter alternative to cbs, for
>> instance.
>
> You're looking at it from particular use cases and try to accomodate for
> them in the simplest possible way. I don't think that cuts it.
>
> Let's take a step back and look at it from a more general POV without
> trying to make it fit to any of the standards first. I'm deliberately NOT
> using any of the standard defined terms.
>
> At the (local) network level you have always an explicit plan. This plan
> might range from no plan at all to an very elaborate plan which is strict
> about when each node is allowed to TX a particular class of packets.


Ok, we are aligned here.


>
> So lets assume we have the following picture:
>
>    	       	  [NIC]
> 		    |
> 	 [ Time slice manager ]
>
> Now in the simplest case, the time slice manager has no constraints and
> exposes a single input which allows the application to say: "Send my packet
> at time X". There is no restriction on 'time X' except if there is a time
> collision with an already queued packet or the requested TX time has
> already passed. That's close to what you implemented.
>
>   Is the TX timestamp which you defined in the user space ABI a fixed
>   scheduling point or is it a deadline?
>
>   That's an important distinction and for this all to work accross various
>   use cases you need a way to express that in the ABI. It might be an
>   implicit property of the socket/channel to which the application connects
>   to but still you want to express it from the application side to do
>   proper sanity checking.
>
>   Just think about stuff like audio/video streaming. The point of
>   transmission does not have to be fixed if you have some intelligent
>   controller at the receiving end which can buffer stuff. The only relevant
>   information is the deadline, i.e. the latest point in time where the
>   packet needs to go out on the wire in order to keep the stream steady at
>   the consumer side. Having the notion of a deadline and that's the only
>   thing the provider knows about allows you proper utilization by using an
>   approriate scheduling algorithm like EDF.
>
>   Contrary to that you want very explicit TX points for applications like
>   automation control. For this kind of use case there is no wiggle room, it
>   has to go out at a fixed time because that's the way control systems
>   work.
>
>   This is missing right now and you want to get that right from the very
>   beginning. Duct taping it on the interface later on is a bad idea.


Agreed that this is needed. On the SO_TXTIME + tbs proposal, I believe it's been
covered by the (per-packet) SCM_DROP_IF_LATE. Do you think we need a different
mechanism for expressing that?


>
> Now lets go one step further and create two time slices for whatever
> purpose still on the single node (not network wide). You want to do that
> because you want temporal separation of services. The reason might be
> bandwidth guarantee, collission avoidance or whatever.
>
>   How does the application which was written for the simple manager which
>   had no restrictions learn about this?
>
>   Does it learn it the hard way because now the packets which fall into the
>   reserved timeslice are rejected? The way you created your interface, the
>   answer is yes. That's patently bad as it requires to change the
>   application once it runs on a partitioned node.
>
>   So you really want a way for the application to query the timing
>   constraints and perhaps other properties of the channel it connects
>   to. And you want that now before the first application starts to use the
>   new ABI. If the application developer does not use it, you still have to
>   fix the application, but you have to fix it because the developer was a
>   lazy bastard and not because the design was bad. That's a major
>   difference.


Ok, this is something that we have considered in the past, but then the feedback
here drove us in a different direction. The overall input we got here was that
applications would have to be adjusted or that userspace would have to handle
the coordination between applications somehow (e.g.: a daemon could be developed
separately to accommodate the fully dynamic use-cases, etc).


>
> Now that we have two time slices, I'm coming back to your idea of having
> your proposed qdisc as the entity which sits right at the network
> interface. Lets assume the following:
>
>    [Slice 1: Timed traffic ] [Slice 2: Other Traffic]
>
>   Lets assume further that 'Other traffic' has no idea about time slices at
>   all. It's just stuff like ssh, http, etc. So if you keep that design
>
>        	         [ NIC ]
>   	            |
>            [ Time slice manager ]
> 	       |          |
>      [ Timed traffic ]  [ Other traffic ]
>
>   feeding into your proposed TBS thingy, then in case of underutilization
>   of the 'Timed traffic' slot you prevent utilization of remaining time by
>   pulling 'Other traffic' into the empty slots because 'Other traffic' is
>   restricted to Slice 2 and 'Timed traffic' does not know about 'Other
>   traffic' at all. And no, you cannot make TBS magically pull packets from
>   'Other traffic' just because its not designed for it. So your design
>   becomes strictly partitioned and forces underutilization.
>
>   That's becoming even worse, when you switch to the proposed full hardware
>   offloading scheme. In that case the only way to do admission control is
>   the TX time of the farthest out packet which is already queued. That
>   might work for a single application which controls all of the network
>   traffic, but it wont ever work for something more flexible. The more I
>   think about it the less interesting full hardware offload becomes. It's
>   nice if you have a fully strict scheduling plan for everything, but then
>   your admission control is bogus once you have more than one channel as
>   input. So yes, it can be used when the card supports it and you have
>   other ways to enforce admission control w/o hurting utilization or if you
>   don't care about utilization at all. It's also useful for channels which
>   are strictly isolated and have a defined TX time. Such traffic can be
>   directly fed into the hardware.


This is a new requirement for the entire discussion.

If I'm not missing anything, however, underutilization of the time slots is only
a problem:

1) for the fully dynamic use-cases and;
2) because now you are designing applications in terms of time slices, right?

We have not thought of making any of the proposed qdiscs capable of (optionally)
adjusting the "time slices", but mainly because this is not a problem we had
here before. Our assumption was that per-port Tx schedules would only be used
for static systems. In other words, no, we didn't think that re-balancing the
slots was a requirement, not even for 'taprio'.


>
> Coming back to the overall scheme. If you start upfront with a time slice
> manager which is designed to:
>
>   - Handle multiple channels
>
>   - Expose the time constraints, properties per channel
>
> then you can fit all kind of use cases, whether designed by committee or
> not. You can configure that thing per node or network wide. It does not
> make a difference. The only difference are the resulting constraints.


Ok, and I believe the above was covered by what we had proposed before, unless
what you meant by time constraints is beyond the configured port schedule.

Are you suggesting that we'll need to have a kernel entity that is not only
aware of the current traffic classes 'schedule', but also of the resources that
are still available for new streams to be accommodated into the classes? Putting
it differently, is the TAS you envision just an entity that runs a schedule, or
is it a time-aware 'orchestrator'?


>
> We really want to accomodate everything between the 'no restrictions' and
> the 'full network wide explicit plan' case. And it's not rocket science
> once you realize that the 'no restrictions' case is just a subset of the
> 'full network wide explicit plan' simply because it exposes a single
> channel where:
>
> 	slice period = slice length.
>
> It's that easy, but at the same time you teach the application from the
> very beginning to ask for the time constraints so if it runs on a more
> sophisticated system/network, then it will see a different slice period and
> a different slice length and can accomodate or react in a useful way
> instead of just dying on the 17th packet it tries to send because it is
> rejected.


Ok.


>
> We really want to design for this as we want to be able to run the video
> stream on the same node and network which does robot control without
> changing the video application. That's not a theoretical problem. These use
> cases exist today, but they are forced to use different networks for the
> two. But if you look at the utilization of both then they very well fit
> into one and industry certainly wants to go for that.
>
> That implies that you need constraint aware applications from the very
> beginning and that requires a proper ABI in the first place. The proposed
> ad hoc mode does not qualify. Please be aware, that you are creating a user
> space ABI and not a random in kernel interface which can be changed at any
> given time.
>
> So lets look once more at the picture in an abstract way:
>
>      	       [ NIC ]
> 	          |
> 	 [ Time slice manager ]
> 	    |           |
>          [ Ch 0 ] ... [ Ch N ]
>
> So you have a bunch of properties here:
>
> 1) Number of Channels ranging from 1 to N
>
> 2) Start point, slice period and slice length per channel

Ok, so we agree that a TAS entity is needed. Assuming that channels are traffic
classes, do you have something else in mind other than a new root qdisc?


>
> 3) Queueing modes assigned per channel. Again that might be anything from
>    'feed through' over FIFO, PRIO to more complex things like EDF.
>
>    The queueing mode can also influence properties like the meaning of the
>    TX time, i.e. strict or deadline.


Ok, but how are the queueing modes assigned / configured per channel?

Just to make sure we re-visit some ideas from the past:

* TAS:

   The idea we are currently exploring is to add a "time-aware", priority based
   qdisc, that also exposes the Tx queues available and provides a mechanism for
   mapping priority <-> traffic class <-> Tx queues in a similar fashion as
   mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:

   $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4    \
           map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                         \
           queues 0 1 2 3                                              \
           sched-file gates.sched [base-time <interval>]               \
           [cycle-time <interval>] [extension-time <interval>]

   <file> is multi-line, with each line being of the following format:
   <cmd> <gate mask> <interval in nanoseconds>

   Qbv only defines one <cmd>: "S" for 'SetGates'

   For example:

   S 0x01 300
   S 0x03 500

   This means that there are two intervals: the first will have the gate
   for traffic class 0 open for 300 nanoseconds, the second will have
   the gates for traffic classes 0 and 1 open for 500 nanoseconds.
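
The file format above can be sketched in a few lines of Python. This is
only an illustration of the semantics, not the tc parser: parse_schedule()
and open_gates() are hypothetical helpers, and the sketch assumes the
schedule simply repeats once the sum of all intervals has elapsed (i.e. no
explicit cycle-time or extension-time is given).

```python
def parse_schedule(text):
    """Parse '<cmd> <gate mask> <interval in ns>' lines into (mask, interval) pairs."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        cmd, mask, interval = fields
        if cmd != "S":                    # 'S' (SetGates) is the only command defined
            raise ValueError(f"unknown command: {cmd}")
        entries.append((int(mask, 16), int(interval)))
    return entries


def open_gates(entries, offset_ns):
    """Return the gate mask in effect offset_ns after the schedule's base time."""
    cycle_ns = sum(interval for _, interval in entries)
    offset_ns %= cycle_ns                 # assumption: the schedule repeats
    for mask, interval in entries:
        if offset_ns < interval:
            return mask
        offset_ns -= interval
    raise AssertionError("unreachable")


schedule = parse_schedule("S 0x01 300\nS 0x03 500\n")
```

With the two-line example, only traffic class 0 is open for the first
300 ns of each 800 ns cycle, and classes 0 and 1 are open for the rest.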


It would handle multiple channels and expose their constraints / properties.
Each channel also becomes a traffic class, so other qdiscs can be attached to
them separately.


So, in summary, because our entire design is based on qdisc interfaces, what we
had proposed was a root qdisc (the time slice manager, as you put it) that allows
for other qdiscs to be attached to each channel. The inner qdiscs define the
queueing modes for each channel, and tbs is just one of those modes. I
understand now that you want to allow for fully dynamic use-cases to be
supported as well, which we hadn't covered with our TAS proposal before because
we hadn't envisioned it being used for these systems' design.

Have I missed anything?

Thanks,
Jesus



>
> Please sit back and map your use cases, standards or whatever you care
> about into the above and I would be very surprised if they don't fit.
>
> Thanks,
>
> 	tglx
>
>
>
>


* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-27 23:26               ` Jesus Sanchez-Palencia
@ 2018-03-28  7:48                 ` Thomas Gleixner
  2018-03-28 13:07                   ` Henrik Austad
  2018-04-09 16:36                   ` Jesus Sanchez-Palencia
  0 siblings, 2 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-03-28  7:48 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

Jesus,

On Tue, 27 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/25/2018 04:46 AM, Thomas Gleixner wrote:
> >   This is missing right now and you want to get that right from the very
> >   beginning. Duct taping it on the interface later on is a bad idea.
> 
> Agreed that this is needed. On the SO_TXTIME + tbs proposal, I believe it's been
> covered by the (per-packet) SCM_DROP_IF_LATE. Do you think we need a different
> mechanism for expressing that?

Uuurgh. No. DROP_IF_LATE is just crap to be honest.

There are two modes:

      1) Send at the given TX time (Explicit mode)

      2) Send before given TX time (Deadline mode)

There is no need to specify 'drop if late' simply because if the message is
handed in past the given TX time, it's too late by definition. What you are
trying to implement is a hybrid of TSN and general purpose (not time aware)
networking in one go. And you do that because your overall design is not
looking at the big picture. You designed from a given use case assumption
and tried to fit other things into it with duct tape.
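
A minimal Python sketch of the two modes, to show why a separate 'drop if
late' flag adds nothing: lateness follows from the TX time alone. TxMode
and send_window() are hypothetical names, and treating "before" as "at or
before" is an assumption.

```python
from enum import Enum


class TxMode(Enum):
    EXPLICIT = "explicit"    # send at the given TX time
    DEADLINE = "deadline"    # send at or before the given TX time


def send_window(mode, now_ns, txtime_ns):
    """Return the (earliest, latest) times at which the packet may be sent.

    Returns None if the packet was handed in past its TX time, which makes
    it too late by definition in either mode.
    """
    if now_ns > txtime_ns:
        return None
    if mode is TxMode.EXPLICIT:
        return (txtime_ns, txtime_ns)    # no wiggle room
    return (now_ns, txtime_ns)           # any point up to the deadline
```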

> >   So you really want a way for the application to query the timing
> >   constraints and perhaps other properties of the channel it connects
> >   to. And you want that now before the first application starts to use the
> >   new ABI. If the application developer does not use it, you still have to
> >   fix the application, but you have to fix it because the developer was a
> >   lazy bastard and not because the design was bad. That's a major
> >   difference.
> 
> Ok, this is something that we have considered in the past, but then the feedback
> here drove us onto a different direction. The overall input we got here was that
> applications would have to be adjusted or that userspace would have to handle
> the coordination between applications somehow (e.g.: a daemon could be developed
> separately to accommodate the fully dynamic use-cases, etc).

The only thing which will happen is that you get applications which require
full control of the interface themselves because they are so important and
the only ones which get it right. Good luck with fixing them up.

That extra daemon, if it ever surfaces, will be just a PITA. Think about
20 kHz control loops. Do you really want queueing, locking, several context
switches and priority configuration nightmares in such a scenario?
Definitely not! You want a fast channel directly to the root qdisc which
takes care of getting it out at the right point, which might be immediate
handover if the adapter supports hw scheduling.

> This is a new requirement for the entire discussion.
> 
> If I'm not missing anything, however, underutilization of the time slots is only
> a problem:
> 
> 1) for the fully dynamic use-cases and;
> 2) because now you are designing applications in terms of time slices, right?

No. It's a general problem. I'm not designing applications in terms of time
slices. Time slices are a fundamental property of TSN. Whether you use them
for explicit scheduling or bandwidth reservation or make them flat does not
matter.

The application does not necessarily need to know about the time
constraints at all. But if it wants to use timed scheduling then it had
better know about them.

> We have not thought of making any of the proposed qdiscs capable of (optionally)
> adjusting the "time slices", but mainly because this is not a problem we had
> here before. Our assumption was that per-port Tx schedules would only be used
> for static systems. In other words, no, we didn't think that re-balancing the
> slots was a requirement, not even for 'taprio'.

Sigh. Utilization is not something entirely new in the network space. I'm
not saying that this needs to be implemented right away, but designing it
in a way which forces underutilization is just wrong.

> > Coming back to the overall scheme. If you start upfront with a time slice
> > manager which is designed to:
> >
> >   - Handle multiple channels
> >
> >   - Expose the time constraints, properties per channel
> >
> > then you can fit all kind of use cases, whether designed by committee or
> > not. You can configure that thing per node or network wide. It does not
> > make a difference. The only difference are the resulting constraints.
> 
>
> Ok, and I believe the above was covered by what we had proposed before, unless
> what you meant by time constraints is beyond the configured port schedule.
>
> Are you suggesting that we'll need to have a kernel entity that is not only
> aware of the current traffic classes 'schedule', but also of the resources that
> are still available for new streams to be accommodated into the classes? Putting
> it differently, is the TAS you envision just an entity that runs a schedule, or
> is it a time-aware 'orchestrator'?

In the first place it's something which runs a defined schedule.

The accommodation for new streams is required, but not necessarily at the
root qdisc level. That might be a qdisc feeding into it.

Assume you have a bandwidth reservation, aka time slot, for audio. If your
audio related qdisc does deadline scheduling then you can add new streams
to it up to the point where it's no longer able to fit.
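
As a rough illustration of that admission step, here is a minimal sketch. It
assumes the only resource accounted for is wire time per period; the struct
and function names are made up for the example and are not part of any
proposed interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one reserved time slot. */
struct slot {
	int64_t length_ns;	/* reserved slice length per period */
	int64_t used_ns;	/* wire time already promised to streams */
};

/*
 * Admit a stream which needs 'need_ns' of wire time per period, but only
 * if it still fits into what is left of the slot.
 */
static bool slot_admit(struct slot *s, int64_t need_ns)
{
	assert(s->used_ns >= 0 && s->used_ns <= s->length_ns);

	if (need_ns < 0 || need_ns > s->length_ns - s->used_ns)
		return false;

	s->used_ns += need_ns;
	return true;
}
```

A qdisc doing deadline scheduling would run a check of this kind when a new
stream asks for a share of the slot, and reject the stream otherwise.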

The only thing which might be needed at the root qdisc is the ability to
utilize unused time slots for other purposes, but that's not required to be
there in the first place as long as it's designed in a way that it can be
added later on.

> > So lets look once more at the picture in an abstract way:
> >
> >      	       [ NIC ]
> > 	          |
> > 	 [ Time slice manager ]
> > 	    |           |
> >          [ Ch 0 ] ... [ Ch N ]
> >
> > So you have a bunch of properties here:
> >
> > 1) Number of Channels ranging from 1 to N
> >
> > 2) Start point, slice period and slice length per channel
> 
> Ok, so we agree that a TAS entity is needed. Assuming that channels are traffic
> classes, do you have something else in mind other than a new root qdisc?

Whatever you call it, the important point is that it is the gate keeper to
the network adapter and there is no way around it. It fully controls the
timed schedule, however simple or complex it may be.

> > 3) Queueing modes assigned per channel. Again that might be anything from
> >    'feed through' over FIFO, PRIO to more complex things like EDF.
> >
> >    The queueing mode can also influence properties like the meaning of the
> >    TX time, i.e. strict or deadline.
> 
> 
> Ok, but how are the queueing modes assigned / configured per channel?
> 
> Just to make sure we re-visit some ideas from the past:
> 
> * TAS:
> 
>    The idea we are currently exploring is to add a "time-aware", priority based
>    qdisc, that also exposes the Tx queues available and provides a mechanism for
>    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
> 
>    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4      \
>      	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                         \
> 	   queues 0 1 2 3                                              \
>      	   sched-file gates.sched [base-time <interval>]               \
>            [cycle-time <interval>] [extension-time <interval>]
> 
>    <file> is multi-line, with each line being of the following format:
>    <cmd> <gate mask> <interval in nanoseconds>
> 
>    Qbv only defines one <cmd>: "S" for 'SetGates'
> 
>    For example:
> 
>    S 0x01 300
>    S 0x03 500
> 
>    This means that there are two intervals, the first will have the gate
>    for traffic class 0 open for 300 nanoseconds, the second will have
>    both traffic classes open for 500 nanoseconds.

To accommodate stuff like control systems you also need a base line, which
is not expressed as interval. Otherwise you can't schedule network wide
explicit plans. That's either an absolute network-time (TAI) time stamp or
an offset to a well defined network-time (TAI) time stamp, e.g. start of
epoch or something else which is agreed on. The actual schedule then fast
forwards past now (TAI) and sets up the slots from there. That makes node
hotplug possible as well.
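
A minimal sketch of the fast-forward step, assuming a fixed cycle length and
TAI timestamps held as signed 64-bit nanosecond values. The helper name is
made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the start of the first cycle which begins
 * strictly after 'now'. All values are TAI nanoseconds.
 */
static int64_t next_cycle_start(int64_t base_time, int64_t cycle_time,
				int64_t now)
{
	int64_t elapsed_cycles;

	assert(cycle_time > 0);

	/* Base time is still in the future: the schedule starts there. */
	if (base_time > now)
		return base_time;

	/* Skip every cycle which started at or before 'now'. */
	elapsed_cycles = (now - base_time) / cycle_time;
	return base_time + (elapsed_cycles + 1) * cycle_time;
}
```

Every node which knows the same base time and cycle length ends up on the
same cycle boundaries, no matter when it joins.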

Btw, it's not only control systems. Think about complex multi source A/V
streams. They are reality in recording and live mixing and looking at the
timing constraints of such scenarios, collision avoidance is key there. So
you want to be able to do network wide traffic orchestration.

> It would handle multiple channels and expose their constraints / properties.
> Each channel also becomes a traffic class, so other qdiscs can be attached to
> them separately.

Right.

> So, in summary, because our entire design is based on qdisc interfaces, what we
> had proposed was a root qdisc (the time slice manager, as you put) that allows
> for other qdiscs to be attached to each channel. The inner qdiscs define the
> queueing modes for each channel, and tbs is just one of those modes. I
> understand now that you want to allow for fully dynamic use-cases to be
> supported as well, which we hadn't covered with our TAS proposal before because
> we hadn't envisioned it being used for these systems' design.

Yes, you have the root qdisc, which is in charge of the overall scheduling
plan, how complex or not it is defined does not matter. It exposes traffic
classes which have properties defined by the configuration.

The qdiscs which are attached to those traffic classes can be anything
including:

 - Simple feed through (Applications are time constraints aware and set the
   exact schedule). qdisc has admission control.

 - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
   of time constraints and provide the packet deadline. qdisc has admission
   control. This can be a simple first comes, first served scheduler or
   something like EDF which allows optimized utilization. The qdisc sets
   the TX time depending on the deadline and feeds into the root.

 - FIFO/PRIO/XXX for general traffic. Applications do not know anything
   about timing constraints. These qdiscs obviously have neither admission
   control nor do they set a TX time.  The root qdisc just pulls from there
   when the assigned time slot is due or if it (optionally) decides to use
   underutilized time slots from other classes.

 - .... Add your favourite scheduling mode(s).

Thanks,

	tglx


* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-28  7:48                 ` Thomas Gleixner
@ 2018-03-28 13:07                   ` Henrik Austad
  2018-04-09 16:36                   ` Jesus Sanchez-Palencia
  1 sibling, 0 replies; 129+ messages in thread
From: Henrik Austad @ 2018-03-28 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jesus Sanchez-Palencia, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, richardcochran, anna-maria, John Stultz,
	levi.pearson, edumazet, willemb, mlichvar


On Wed, Mar 28, 2018 at 09:48:05AM +0200, Thomas Gleixner wrote:
> Jesus,

Thomas, Jesus,

> On Tue, 27 Mar 2018, Jesus Sanchez-Palencia wrote:
> > On 03/25/2018 04:46 AM, Thomas Gleixner wrote:
> > >   This is missing right now and you want to get that right from the very
> > >   beginning. Duct taping it on the interface later on is a bad idea.
> > 
> > Agreed that this is needed. On the SO_TXTIME + tbs proposal, I believe it's been
> > covered by the (per-packet) SCM_DROP_IF_LATE. Do you think we need a different
> > mechanism for expressing that?
> 
> Uuurgh. No. DROP_IF_LATE is just crap to be honest.
> 
> There are two modes:
> 
>       1) Send at the given TX time (Explicit mode)
> 
>       2) Send before given TX time (Deadline mode)
> 
> There is no need to specify 'drop if late' simply because if the message is
> handed in past the given TX time, it's too late by definition. What you are
> trying to implement is a hybrid of TSN and general purpose (not time aware)
> networking in one go. And you do that because your overall design is not
> looking at the big picture. You designed from a given use case assumption
> and tried to fit other things into it with duct tape.

Yes, +1 to this. The whole point of bandwidth reservation is to not drop 
frames, you should never, ever miss a deadline, if you do, then your 
admission tests are inadequate.

> > >   So you really want a way for the application to query the timing
> > >   constraints and perhaps other properties of the channel it connects
> > >   to. And you want that now before the first application starts to use the
> > >   new ABI. If the application developer does not use it, you still have to
> > >   fix the application, but you have to fix it because the developer was a
> > >   lazy bastard and not because the design was bad. That's a major
> > >   difference.
> > 
> > Ok, this is something that we have considered in the past, but then the feedback
> > here drove us in a different direction. The overall input we got here was that
> > applications would have to be adjusted or that userspace would have to handle
> > the coordination between applications somehow (e.g.: a daemon could be developed
> > separately to accommodate the fully dynamic use-cases, etc).
> 
> The only thing which will happen is that you get applications which require
> to control the full interface themselves because they are so important and
> the only ones which get it right. Good luck with fixing them up.
> 
> That extra daemon if it ever surfaces will be just a PITA. Think about
> 20kHz control loops. Do you really want queueing, locking, several context
> switches and priority configuration nightmares in such a scenario?
> Definitely not! You want a fast channel directly to the root qdisc which
> takes care of getting it out at the right point, which might be immediate
> handover if the adapter supports hw scheduling.
> 
> > This is a new requirement for the entire discussion.
> > If I'm not missing anything, however, underutilization of the time slots is only
> > a problem:
> > 
> > 1) for the fully dynamic use-cases and;
> > 2) because now you are designing applications in terms of time slices, right?
> 
> No. It's a general problem. I'm not designing applications in terms of time
> slices. Time slices are a fundamental property of TSN. Whether you use them
> for explicit scheduling or bandwidth reservation or make them flat does not
> matter.
> 
> The application does not necessarily need to know about the time
> constraints at all. But if it wants to use timed scheduling then it had
> better know about them.

yep, +1 in a lot of A/V cases here, the application will have to know about 
presentation_time, and the delay through the network stack should be "low 
and deterministic", but apart from that, the application shouldn't have to 
care about SO_TXTIME and what other applications may or may not do.

> > We have not thought of making any of the proposed qdiscs capable of (optionally)
> > adjusting the "time slices", but mainly because this is not a problem we had
> > here before. Our assumption was that per-port Tx schedules would only be used
> > for static systems. In other words, no, we didn't think that re-balancing the
> > slots was a requirement, not even for 'taprio'.
> 
> Sigh. Utilization is not something entirely new in the network space. I'm
> not saying that this needs to be implemented right away, but designing it
> in a way which forces underutilization is just wrong.
> 
> > > Coming back to the overall scheme. If you start upfront with a time slice
> > > manager which is designed to:
> > >
> > >   - Handle multiple channels
> > >
> > >   - Expose the time constraints, properties per channel
> > >
> > > then you can fit all kind of use cases, whether designed by committee or
> > > not. You can configure that thing per node or network wide. It does not
> > > make a difference. The only difference are the resulting constraints.
> > 
> >
> > Ok, and I believe the above was covered by what we had proposed before, unless
> > what you meant by time constraints is beyond the configured port schedule.
> >
> > Are you suggesting that we'll need to have a kernel entity that is not only
> > aware of the current traffic classes 'schedule', but also of the resources that
> > are still available for new streams to be accommodated into the classes? Putting
> > it differently, is the TAS you envision just an entity that runs a schedule, or
> > is it a time-aware 'orchestrator'?
> 
> In the first place it's something which runs a defined schedule.
> 
> The accommodation for new streams is required, but not necessarily at the
> root qdisc level. That might be a qdisc feeding into it.
> 
> Assume you have a bandwidth reservation, aka time slot, for audio. If your
> audio related qdisc does deadline scheduling then you can add new streams
> to it up to the point where it's no longer able to fit.
> 
> The only thing which might be needed at the root qdisc is the ability to
> utilize unused time slots for other purposes, but that's not required to be
> there in the first place as long as it's designed in a way that it can be
> added later on.
> 
> > > So lets look once more at the picture in an abstract way:
> > >
> > >      	       [ NIC ]
> > > 	          |
> > > 	 [ Time slice manager ]
> > > 	    |           |
> > >          [ Ch 0 ] ... [ Ch N ]
> > >
> > > So you have a bunch of properties here:
> > >
> > > 1) Number of Channels ranging from 1 to N
> > >
> > > 2) Start point, slice period and slice length per channel
> > 
> > Ok, so we agree that a TAS entity is needed. Assuming that channels are traffic
> > classes, do you have something else in mind other than a new root qdisc?
> 
> Whatever you call it, the important point is that it is the gate keeper to
> the network adapter and there is no way around it. It fully controls the
> timed schedule, however simple or complex it may be.
> 
> > > 3) Queueing modes assigned per channel. Again that might be anything from
> > >    'feed through' over FIFO, PRIO to more complex things like EDF.
> > >
> > >    The queueing mode can also influence properties like the meaning of the
> > >    TX time, i.e. strict or deadline.
> > 
> > 
> > Ok, but how are the queueing modes assigned / configured per channel?
> > 
> > Just to make sure we re-visit some ideas from the past:
> > 
> > * TAS:
> > 
> >    The idea we are currently exploring is to add a "time-aware", priority based
> >    qdisc, that also exposes the Tx queues available and provides a mechanism for
> >    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
> >    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
> > 
> >    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4      \
> >      	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                         \
> > 	   queues 0 1 2 3                                              \
> >      	   sched-file gates.sched [base-time <interval>]               \
> >            [cycle-time <interval>] [extension-time <interval>]
> > 
> >    <file> is multi-line, with each line being of the following format:
> >    <cmd> <gate mask> <interval in nanoseconds>
> > 
> >    Qbv only defines one <cmd>: "S" for 'SetGates'
> > 
> >    For example:
> > 
> >    S 0x01 300
> >    S 0x03 500
> > 
> >    This means that there are two intervals, the first will have the gate
> >    for traffic class 0 open for 300 nanoseconds, the second will have
> >    both traffic classes open for 500 nanoseconds.
> 
> To accommodate stuff like control systems you also need a base line, which
> is not expressed as interval. Otherwise you can't schedule network wide
> explicit plans. That's either an absolute network-time (TAI) time stamp or
> an offset to a well defined network-time (TAI) time stamp, e.g. start of
> epoch or something else which is agreed on. The actual schedule then fast
> forwards past now (TAI) and sets up the slots from there. That makes node
> hotplug possible as well.

Ok, so this is perhaps a bit of a sidetrack, but based on other discussions 
in this patch-series, does it really make sense to discuss anything *but* 
TAI?

If you have a TSN-stream (or any other time-sensitive way of prioritizing 
frames based on time), then the network is going to be PTP synched anyway, 
and all the rest of the network is going to operate on PTP-time. Why even 
bother adding CLOCK_REALTIME and CLOCK_MONOTONIC to the discussion? Sure, 
use CLOCK_REALTIME locally and sync that to TAI, but the kernel should 
worry about ptp-time _for_that_adapter_, and we should make it pretty 
obvious to userspace that if you want to specify tx-time, then there's this 
thing called 'PTP' and it rules this domain. My $0.02 etc

> Btw, it's not only control systems. Think about complex multi source A/V
> streams. They are reality in recording and live mixing and looking at the
> timing constraints of such scenarios, collision avoidance is key there. So
> you want to be able to do network wide traffic orchestration.

Yep, and if you are too bursty, the network is free to drop your frames, which 
is not desired.

> > It would handle multiple channels and expose their constraints / properties.
> > Each channel also becomes a traffic class, so other qdiscs can be attached to
> > them separately.
> 
> Right.

I don't think you need a separate qdisc for each channel, if you describe a 
channel with

- period (what AVB calls observation interval)
- max data
- deadline

you should be able to keep a sorted rb-tree and handle that pretty 
efficiently. Or perhaps I'm completely missing the mark here. If so, my 
apologies.

> > So, in summary, because our entire design is based on qdisc interfaces, what we
> > had proposed was a root qdisc (the time slice manager, as you put) that allows
> > for other qdiscs to be attached to each channel. The inner qdiscs define the
> > queueing modes for each channel, and tbs is just one of those modes. I
> > understand now that you want to allow for fully dynamic use-cases to be
> > supported as well, which we hadn't covered with our TAS proposal before because
> > we hadn't envisioned it being used for these systems' design.
> 
> Yes, you have the root qdisc, which is in charge of the overall scheduling
> plan, how complex or not it is defined does not matter. It exposes traffic
> classes which have properties defined by the configuration.
> 
> The qdiscs which are attached to those traffic classes can be anything
> including:
> 
>  - Simple feed through (Applications are time constraints aware and set the
>    exact schedule). qdisc has admission control.
> 
>  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
>    of time constraints and provide the packet deadline. qdisc has admission
>    control. This can be a simple first comes, first served scheduler or
>    something like EDF which allows optimized utilization. The qdisc sets
>    the TX time depending on the deadline and feeds into the root.

As a small nitpick, it would make more sense to do a laxity-approach here, 
both for explicit mode and deadline-mode. We know the size of the frame to 
send, we know the outgoing rate, so keep a ready-queue sorted based on 
laxity

     laxity = absolute_deadline - (size / outgoing_rate)

Also, given that we use a *single* tx-queue for time-triggered 
transmission, this boils down to a uniprocessor equivalent and we have a 
lot of fun real-time scheduling academia to draw from.

This could then probably handle both of the above (Direct + deadline), but 
that's implementation specific I guess.
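
To make the laxity idea a bit more concrete, here is a minimal sketch. It
applies the formula above literally and keeps the ready-queue as a plain
array sorted with qsort(); a real qdisc would more likely use an rb-tree.
All names are made up for the example, and frame sizes are assumed to be a
few KiB at most so that the 64-bit arithmetic cannot overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct frame {
	int64_t deadline_ns;	/* absolute deadline */
	uint32_t size_bytes;	/* frame size, assumed to be a few KiB at most */
	int64_t laxity_ns;	/* filled in by sort_by_laxity() */
};

/* Time on the wire for a frame at 'rate_bps' bits per second. */
static int64_t wire_time_ns(uint32_t size_bytes, int64_t rate_bps)
{
	assert(rate_bps > 0);
	return (int64_t)size_bytes * 8 * 1000000000LL / rate_bps;
}

static int cmp_laxity(const void *a, const void *b)
{
	const struct frame *fa = a, *fb = b;

	return (fa->laxity_ns > fb->laxity_ns) -
	       (fa->laxity_ns < fb->laxity_ns);
}

/* laxity = absolute_deadline - (size / outgoing_rate), smallest first. */
static void sort_by_laxity(struct frame *q, size_t n, int64_t rate_bps)
{
	size_t i;

	for (i = 0; i < n; i++)
		q[i].laxity_ns = q[i].deadline_ns -
				 wire_time_ns(q[i].size_bytes, rate_bps);

	qsort(q, n, sizeof(*q), cmp_laxity);
}
```

On a 1 Gbit/s link a 1500 byte frame takes 12000 ns and a 64 byte frame
takes 512 ns, so the frame size shifts the sort key noticeably when
deadlines are close together.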

>  - FIFO/PRIO/XXX for general traffic. Applications do not know anything
>    about timing constraints. These qdiscs obviously have neither admission
>    control nor do they set a TX time.  The root qdisc just pulls from there
>    when the assigned time slot is due or if it (optionally) decides to use
>    underutilized time slots from other classes.
> 
>  - .... Add your favourite scheduling mode(s).

Just give it sub-qdiscs and offload enqueue/dequeue to those I suppose.

-- 
Henrik Austad


* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-28  7:48                 ` Thomas Gleixner
  2018-03-28 13:07                   ` Henrik Austad
@ 2018-04-09 16:36                   ` Jesus Sanchez-Palencia
  2018-04-10 12:37                     ` Thomas Gleixner
  1 sibling, 1 reply; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-04-09 16:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

Hi Thomas,


On 03/28/2018 12:48 AM, Thomas Gleixner wrote:

(...)

>
> There are two modes:
>
>       1) Send at the given TX time (Explicit mode)
>
>       2) Send before given TX time (Deadline mode)
>
> There is no need to specify 'drop if late' simply because if the message is
> handed in past the given TX time, it's too late by definition. What you are
> trying to implement is a hybrid of TSN and general purpose (not time aware)
> networking in one go. And you do that because your overall design is not
> looking at the big picture. You designed from a given use case assumption
> and tried to fit other things into it with duct tape.


Ok, I see the difference now, thanks. I have just two more questions about the
deadline mode, please see below.

(...)


>
>>> Coming back to the overall scheme. If you start upfront with a time slice
>>> manager which is designed to:
>>>
>>>   - Handle multiple channels
>>>
>>>   - Expose the time constraints, properties per channel
>>>
>>> then you can fit all kind of use cases, whether designed by committee or
>>> not. You can configure that thing per node or network wide. It does not
>>> make a difference. The only difference are the resulting constraints.
>>
>>
>> Ok, and I believe the above was covered by what we had proposed before, unless
>> what you meant by time constraints is beyond the configured port schedule.
>>
>> Are you suggesting that we'll need to have a kernel entity that is not only
>> aware of the current traffic classes 'schedule', but also of the resources that
>> are still available for new streams to be accommodated into the classes? Putting
>> it differently, is the TAS you envision just an entity that runs a schedule, or
>> is it a time-aware 'orchestrator'?
>
> In the first place it's something which runs a defined schedule.
>
> The accommodation for new streams is required, but not necessarily at the
> root qdisc level. That might be a qdisc feeding into it.
>
> Assume you have a bandwidth reservation, aka time slot, for audio. If your
> audio related qdisc does deadline scheduling then you can add new streams
> to it up to the point where it's no longer able to fit.
>
> The only thing which might be needed at the root qdisc is the ability to
> utilize unused time slots for other purposes, but that's not required to be
> there in the first place as long as it's designed in a way that it can be
> added later on.


Ok, agreed.


>
>>> So lets look once more at the picture in an abstract way:
>>>
>>>      	       [ NIC ]
>>> 	          |
>>> 	 [ Time slice manager ]
>>> 	    |           |
>>>          [ Ch 0 ] ... [ Ch N ]
>>>
>>> So you have a bunch of properties here:
>>>
>>> 1) Number of Channels ranging from 1 to N
>>>
>>> 2) Start point, slice period and slice length per channel
>>
>> Ok, so we agree that a TAS entity is needed. Assuming that channels are traffic
>> classes, do you have something else in mind other than a new root qdisc?
>
> Whatever you call it, the important point is that it is the gate keeper to
> the network adapter and there is no way around it. It fully controls the
> timed schedule, however simple or complex it may be.


Ok, and I've finally understood the nuance between the above and what we had
planned initially.


(...)


>>
>> * TAS:
>>
>>    The idea we are currently exploring is to add a "time-aware", priority based
>>    qdisc, that also exposes the Tx queues available and provides a mechanism for
>>    mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>>    mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:
>>
>>    $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4      \
>>      	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                         \
>> 	   queues 0 1 2 3                                              \
>>      	   sched-file gates.sched [base-time <interval>]               \
>>            [cycle-time <interval>] [extension-time <interval>]
>>
>>    <file> is multi-line, with each line being of the following format:
>>    <cmd> <gate mask> <interval in nanoseconds>
>>
>>    Qbv only defines one <cmd>: "S" for 'SetGates'
>>
>>    For example:
>>
>>    S 0x01 300
>>    S 0x03 500
>>
>>    This means that there are two intervals, the first will have the gate
>>    for traffic class 0 open for 300 nanoseconds, the second will have
>>    both traffic classes open for 500 nanoseconds.
>
> To accommodate stuff like control systems you also need a base line, which
> is not expressed as interval. Otherwise you can't schedule network wide
> explicit plans. That's either an absolute network-time (TAI) time stamp or
> an offset to a well defined network-time (TAI) time stamp, e.g. start of
> epoch or something else which is agreed on. The actual schedule then fast
> forwards past now (TAI) and sets up the slots from there. That makes node
> hotplug possible as well.


Sure, and the [base-time <interval>] on the command line above was actually
wrong. It should have been expressed as [base-time <timestamp>].



>> It would handle multiple channels and expose their constraints / properties.
>> Each channel also becomes a traffic class, so other qdiscs can be attached to
>> them separately.
>
> Right.
>
>> So, in summary, because our entire design is based on qdisc interfaces, what we
>> had proposed was a root qdisc (the time slice manager, as you put) that allows
>> for other qdiscs to be attached to each channel. The inner qdiscs define the
>> queueing modes for each channel, and tbs is just one of those modes. I
>> understand now that you want to allow for fully dynamic use-cases to be
>> supported as well, which we hadn't covered with our TAS proposal before because
>> we hadn't envisioned it being used for these systems' design.
>
> Yes, you have the root qdisc, which is in charge of the overall scheduling
> plan, how complex or not it is defined does not matter. It exposes traffic
> classes which have properties defined by the configuration.


Perfect. Let's see if we can agree on an overall plan, then. Hopefully I'm not
missing anything.

For the above we'll develop a new qdisc, designed along the 'taprio' ideas, thus
a Qbv style scheduler, to be used as root qdisc. It can run the schedule inside
the kernel or just offload it to the NIC if supported. Similarly to the other
multiqueue qdiscs, it will expose the HW Tx queues.

What is new here from the ideas we shared last year is that this new root qdisc
will be responsible for calling the attached qdiscs' dequeue functions during
their timeslices, making it the only entity capable of enqueueing packets into
the NIC.

This is the "global scheduler", but we still need the txtime aware qdisc. For
that, we'll modify tbs to accommodate the feedback from this thread. More below.


>
> The qdiscs which are attached to those traffic classes can be anything
> including:
>
>  - Simple feed through (Applications are time constraints aware and set the
>    exact schedule). qdisc has admission control.


This will be provided by the tbs qdisc. It will still provide a txtime sorted
list and hw offload, but now there will be a per-socket option that tells the
qdisc if the per-packet timestamp is the txtime (i.e. explicit mode, as you've
called it) or a deadline. The drop_if_late flag will be removed.

When in explicit mode, packets from that socket are dequeued from the qdisc
during its time slice if their [(txtime - delta) < now].


>
>  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
>    of time constraints and provide the packet deadline. qdisc has admission
>    control. This can be a simple first comes, first served scheduler or
>    something like EDF which allows optimized utilization. The qdisc sets
>    the TX time depending on the deadline and feeds into the root.


This will be provided by tbs if the socket which is transmitting packets is
configured for deadline mode.

For the deadline -> txtime conversion, what I have in mind is: when dequeue is
called tbs will just change the skbuff's timestamp from the deadline to 'now'
(i.e. as soon as possible) and dequeue the packet. Would that be enough or
should we use the delta parameter of the qdisc in this case and make [txtime =
now + delta]? The only benefit of doing so would be to provide a configurable
'fudge' factor.

Another question for this mode (but perhaps that applies to both modes) is, what
if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
the packet during dequeue.
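
To make the two dequeue paths concrete, here is a rough sketch of one
possible reading. It assumes that a packet whose timestamp has already
passed is dropped in both modes (which is exactly the open question above)
and it takes the 'fudge' value as a parameter so that both txtime = now and
txtime = now + delta can be expressed. The enum, struct and function names
are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum tbs_verdict { TBS_HOLD, TBS_SEND, TBS_DROP };

struct pkt {
	int64_t tstamp;		/* txtime or deadline, in nanoseconds */
};

/* Explicit mode: release once (txtime - delta) < now. */
static enum tbs_verdict explicit_dequeue(const struct pkt *p, int64_t delta,
					 int64_t now)
{
	if (p->tstamp < now)
		return TBS_DROP;	/* assumption: too late, drop it */
	if (p->tstamp - delta < now)
		return TBS_SEND;
	return TBS_HOLD;
}

/* Deadline mode: send as soon as possible, txtime = now + fudge. */
static enum tbs_verdict deadline_dequeue(struct pkt *p, int64_t fudge,
					 int64_t now)
{
	if (now + fudge > p->tstamp)
		return TBS_DROP;	/* assumption: deadline cannot be met */

	p->tstamp = now + fudge;	/* deadline becomes the txtime */
	return TBS_SEND;
}
```

With fudge set to 0 this is the plain "change the timestamp to now" variant,
with fudge set to the qdisc's delta it is the [txtime = now + delta] one.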


Putting it all together, we end up with:

1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
$ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting

2) a new cmsg-interface for setting a per-packet timestamp that will be used
either as a txtime or as deadline by tbs (and further the NIC driver for the
offload case): SCM_TXTIME.

3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
socket, and will have as parameters a clockid and a txtime mode (deadline or
explicit), that defines the semantics of the timestamp set on packets using
SCM_TXTIME.

4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .

5) a new schedule-aware qdisc, 'tas' or 'taprio', to be used per port. Its cli
will look like what was proposed for taprio (base time being an absolute timestamp).
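
As an illustration of item 5, here is a minimal sketch of how a gate list in
the proposed format could be evaluated against an absolute base time. It
assumes the cycle is simply the sum of all intervals (a separate cycle-time
or extension-time is not modelled) and the names are made up for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 'S <gate mask> <interval in nanoseconds>' line of the schedule. */
struct gate_entry {
	uint32_t gate_mask;	/* bit N set: traffic class N is open */
	int64_t interval_ns;
};

/*
 * Hypothetical lookup: which gates are open at time 'now'?
 * Returns 0 (all gates closed) before base_time.
 */
static uint32_t open_gates(const struct gate_entry *sched, size_t n,
			   int64_t base_time, int64_t now)
{
	int64_t cycle = 0, offset;
	size_t i;

	for (i = 0; i < n; i++)
		cycle += sched[i].interval_ns;

	if (now < base_time || cycle <= 0)
		return 0;

	/* Position inside the current cycle. */
	offset = (now - base_time) % cycle;

	for (i = 0; i < n; i++) {
		if (offset < sched[i].interval_ns)
			return sched[i].gate_mask;
		offset -= sched[i].interval_ns;
	}

	return 0;	/* not reached when all intervals are positive */
}
```

For the two-entry example quoted earlier (S 0x01 300, S 0x03 500) the cycle
is 800 nanoseconds: traffic class 0 alone is open for the first 300, both
classes for the remaining 500.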



If we all agree with the above, we will start by closing on 1-4 asap and will
focus on 5 next.

How does that sound?

Thanks,
Jesus



>
>  - FIFO/PRIO/XXX for general traffic. Applications do not know anything
>    about timing constraints. These qdiscs obviously have neither admission
>    control nor do they set a TX time.  The root qdisc just pulls from there
>    when the assigned time slot is due or if it (optionally) decides to use
>    underutilized time slots from other classes.
>
>  - .... Add your favourite scheduling mode(s).
>
> Thanks,
>
> 	tglx
>


* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-09 16:36                   ` Jesus Sanchez-Palencia
@ 2018-04-10 12:37                     ` Thomas Gleixner
  2018-04-10 21:24                       ` Jesus Sanchez-Palencia
  0 siblings, 1 reply; 129+ messages in thread
From: Thomas Gleixner @ 2018-04-10 12:37 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

Jesus,

On Mon, 9 Apr 2018, Jesus Sanchez-Palencia wrote:
> On 03/28/2018 12:48 AM, Thomas Gleixner wrote:
> > Yes, you have the root qdisc, which is in charge of the overall scheduling
> > plan, how complex or not it is defined does not matter. It exposes traffic
> > classes which have properties defined by the configuration.
> 
> Perfect. Let's see if we can agree on an overall plan, then. Hopefully I'm not
> missing anything.
> 
> For the above we'll develop a new qdisc, designed along the 'taprio' ideas, thus
> a Qbv style scheduler, to be used as root qdisc. It can run the schedule inside
> the kernel or just offload it to the NIC if supported. Similarly to the other
> multiqueue qdiscs, it will expose the HW Tx queues.
> 
> What is new here from the ideas we shared last year is that this new root qdisc
> will be responsible for calling the attached qdiscs' dequeue functions during
> their timeslices, making it the only entity capable of enqueueing packets into
> the NIC.

Correct. Aside from that it's the entity which is in charge of the overall
scheduling.

> This is the "global scheduler", but we still need the txtime aware
> qdisc. For that, we'll modify tbs to accommodate the feedback from this
> thread. More below.

> > The qdiscs which are attached to those traffic classes can be anything
> > including:
> >
> >  - Simple feed through (Applications are time constraints aware and set the
> >    exact schedule). qdisc has admission control.
> 
> This will be provided by the tbs qdisc. It will still provide a txtime sorted
> list and hw offload, but now there will be a per-socket option that tells the
> qdisc if the per-packet timestamp is the txtime (i.e. explicit mode, as you've
> called it) or a deadline. The drop_if_late flag will be removed.
> 
> When in explicit mode, packets from that socket are dequeued from the qdisc
> during its time slice if their [(txtime - delta) < now].
> 
> >
> >  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
> >    of time constraints and provide the packet deadline. qdisc has admission
> >    control. This can be a simple first comes, first served scheduler or
> >    something like EDF which allows optimized utilization. The qdisc sets
> >    the TX time depending on the deadline and feeds into the root.
> 
> This will be provided by tbs if the socket which is transmitting packets is
> configured for deadline mode.

You don't want the socket to decide that. The qdisc into which a socket
feeds defines the mode and the qdisc rejects requests with the wrong mode.

Making a qdisc do both and letting the user decide what he wants it to be is
not really going to fly. Especially if you have different users which want
a different mode. It's clearly distinct functionality.

Please stop trying to develop Swiss Army knives with integrated coffee
machines.

> For the deadline -> txtime conversion, what I have in mind is: when dequeue is
> called tbs will just change the skbuff's timestamp from the deadline to 'now'
> (i.e. as soon as possible) and dequeue the packet. Would that be enough or
> should we use the delta parameter of the qdisc in this case and make [txtime =
> now + delta]? The only benefit of doing so would be to provide a configurable
> 'fudge' factor.

Well, that really depends on how your deadline scheduler works.
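
For reference, the two conversions being weighed above can be written down in a
few lines. This is only a sketch under assumptions: deadline_to_txtime() and its
parameters are made-up names that appear nowhere in the posted patches, all
times are nanoseconds on one clock, and rejecting a packet whose computed
txtime lies past its deadline is just one possible policy.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the RFC: choose the txtime for a
 * packet that carries a deadline.
 *
 *   fudge_ns == 0  ->  txtime = now           (send as soon as possible)
 *   fudge_ns  > 0  ->  txtime = now + delta   (configurable fudge factor)
 *
 * Returns 0 and stores the txtime, or -1 if the resulting txtime would
 * already be past the deadline.
 */
int deadline_to_txtime(int64_t deadline_ns, int64_t now_ns,
		       int32_t fudge_ns, int64_t *txtime_ns)
{
	int64_t t = now_ns + (int64_t)fudge_ns;

	if (t > deadline_ns)
		return -1;	/* deadline cannot be met any more */

	*txtime_ns = t;
	return 0;
}
```

With fudge_ns set to 0 the skb timestamp simply becomes 'now'; a scheduler
such as EDF would probably want something smarter than either variant.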

> Another question for this mode (but perhaps that applies to both modes) is, what
> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
> the packet during dequeue.

There the question is how user space is notified about that issue. The
application which queued the packet on time does rightfully assume that
it's going to be on the wire on time.

This is a violation of the overall scheduling plan, so you need to have
a sane design to handle that.
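
To make the failure case concrete for explicit mode: at dequeue time a packet
is either too early, inside the [(txtime - delta) < now] window, or already
late. A minimal sketch of that three-way decision follows; the enum and
function names are hypothetical, and treating "txtime < now" as the late case
is an assumption, not something the thread has settled.

```c
#include <stdint.h>

/* Hypothetical verdicts for a packet at dequeue time. */
enum tbs_verdict {
	TBS_HOLD,	/* too early: keep the packet queued */
	TBS_SEND,	/* inside the window: hand it to the NIC */
	TBS_LATE,	/* txtime already passed: drop and report */
};

/* All times are nanoseconds on the qdisc's clock. */
enum tbs_verdict tbs_explicit_verdict(int64_t txtime_ns, int32_t delta_ns,
				      int64_t now_ns)
{
	if (txtime_ns < now_ns)
		return TBS_LATE;

	/* The dequeue condition quoted above: (txtime - delta) < now. */
	if (txtime_ns - (int64_t)delta_ns < now_ns)
		return TBS_SEND;

	return TBS_HOLD;
}
```

Only the TBS_LATE branch violates the scheduling plan, so that is the one
place where user space would have to be told.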

> Putting it all together, we end up with:
> 
> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting

Why CLOCK_REALTIME? The only interesting time in a TSN network is
CLOCK_TAI, really.

> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
> either as a txtime or as deadline by tbs (and further the NIC driver for the
> offload case): SCM_TXTIME.
> 
> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
> socket, and will have as parameters a clockid and a txtime mode (deadline or
> explicit), that defines the semantics of the timestamp set on packets using
> SCM_TXTIME.
> 
> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .

Can you remind me why we would need that?

> 5) a new schedule-aware qdisc, 'tas' or 'taprio', to be used per port. Its cli
> will look like what was proposed for taprio (base time being an absolute timestamp).
> 
> If we all agree with the above, we will start by closing on 1-4 asap and will
> focus on 5 next.
> 
> How does that sound?

Backwards to be honest.

You should start with the NIC facing qdisc because that's the key part of
all this and the design might have implications on how the qdiscs which
feed into it need to be designed.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-10 12:37                     ` Thomas Gleixner
@ 2018-04-10 21:24                       ` Jesus Sanchez-Palencia
  2018-04-11 20:16                         ` Thomas Gleixner
  0 siblings, 1 reply; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-04-10 21:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

Hi Thomas,


On 04/10/2018 05:37 AM, Thomas Gleixner wrote:

(...)


>>>
>>>  - Simple feed through (Applications are time constraints aware and set the
>>>    exact schedule). qdisc has admission control.
>>
>> This will be provided by the tbs qdisc. It will still provide a txtime sorted
>> list and hw offload, but now there will be a per-socket option that tells the
>> qdisc if the per-packet timestamp is the txtime (i.e. explicit mode, as you've
>> called it) or a deadline. The drop_if_late flag will be removed.
>>
>> When in explicit mode, packets from that socket are dequeued from the qdisc
>> during its time slice if their [(txtime - delta) < now].
>>
>>>
>>>  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
>>>    of time constraints and provide the packet deadline. qdisc has admission
>>>    control. This can be a simple first comes, first served scheduler or
>>>    something like EDF which allows optimized utilization. The qdisc sets
>>>    the TX time depending on the deadline and feeds into the root.
>>
>> This will be provided by tbs if the socket which is transmitting packets is
>> configured for deadline mode.
> 
> You don't want the socket to decide that. The qdisc into which a socket
> feeds defines the mode and the qdisc rejects requests with the wrong mode.
> 
> Making a qdisc doing both and let the user decide what he wants it to be is
> not really going to fly. Especially if you have different users which want
> a different mode. It's clearly distinct functionality.


Ok, so just to make sure I got this right, are you suggesting that both the
'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
parameter for specifying the txtime mode? This way if there is a mismatch,
packets from that socket are rejected by the qdisc.



(...)


> 
>> Another question for this mode (but perhaps that applies to both modes) is, what
>> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
>> the packet during dequeue.
> 
> There the question is how user space is notified about that issue. The
> application which queued the packet on time does rightfully assume that
> it's going to be on the wire on time.
> 
> This is a violation of the overall scheduling plan, so you need to have
> a sane design to handle that.


In addition to the qdisc stats, we could look into using the socket's error
queue to notify the application about that.
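
On the application side, reading such a notification would look like any other
use of the error queue. The sketch below drains the error queue of an IPv4
socket with recvmsg(MSG_ERRQUEUE) and returns the last struct
sock_extended_err it found. drain_errqueue() is a made-up name, and how tbs
would fill in ee_origin/ee_code for a missed txtime is not defined anywhere in
this thread, so the caller is left to interpret those fields.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/*
 * Drain the error queue of an IPv4 socket without blocking. Copies the
 * last extended error seen into *last and returns the number of errors
 * read, or -1 on an unexpected recvmsg() failure.
 */
int drain_errqueue(int fd, struct sock_extended_err *last)
{
	union {
		char buf[512];
		struct cmsghdr align;	/* keeps the buffer suitably aligned */
	} control;
	char payload[256];
	int count = 0;

	for (;;) {
		struct iovec iov = {
			.iov_base = payload, .iov_len = sizeof(payload),
		};
		struct msghdr msg = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = control.buf,
			.msg_controllen = sizeof(control.buf),
		};
		struct cmsghdr *cm;

		if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				return count;	/* queue is empty */
			return -1;
		}

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			if (cm->cmsg_level == IPPROTO_IP &&
			    cm->cmsg_type == IP_RECVERR) {
				memcpy(last, CMSG_DATA(cm), sizeof(*last));
				count++;
			}
		}
	}
}
```

An application would typically call this after poll() reports POLLERR on the
socket.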


> 
>> Putting it all together, we end up with:
>>
>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
> 
> Why CLOCK_REALTIME? The only interesting time in a TSN network is
> CLOCK_TAI, really.


REALTIME was just an example here to show that the qdisc has to be configured
with a clockid parameter. Are you suggesting that instead both of the new qdiscs
(i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?


> 
>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>> offload case): SCM_TXTIME.
>>
>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>> explicit), that defines the semantics of the timestamp set on packets using
>> SCM_TXTIME.
>>
>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
> 
> Can you remind me why we would need that?


So there is a "clockid" that can be used for the full hw offload modes. In this
case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
can't just use a clockid that was computed from the fd pointing to /dev/ptpX .
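
As a reminder of what such a computed clockid looks like, here is the encoding
used by the kernel's testptp.c (FD_TO_CLOCKID), rewritten with unsigned
arithmetic so that it is well defined in plain C. The function names are made
up for this sketch. The point it illustrates, as far as the earlier discussion
goes, is that the id is nothing but a file descriptor number in disguise, and a
file descriptor only means something inside the process that opened it.

```c
#define _GNU_SOURCE
#include <time.h>

#define CLOCKFD 3	/* low bits marking a file-descriptor based clock */

/* Fold an open /dev/ptpX file descriptor into a dynamic clockid. */
clockid_t fd_to_clockid(int fd)
{
	return (clockid_t)((~(unsigned int)fd << 3) | CLOCKFD);
}

/* Recover the file descriptor from a dynamic clockid. */
int clockid_to_fd(clockid_t clk)
{
	return (int)(~((unsigned int)clk >> 3) & 0x1FFFFFFFu);
}
```

Two processes that both open /dev/ptp0 can end up with different clockids for
the same PHC, and the same clockid can name different devices in different
processes.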


> 
>> 5) a new schedule-aware qdisc, 'tas' or 'taprio', to be used per port. Its cli
>> will look like what was proposed for taprio (base time being an absolute timestamp).
>>
>> If we all agree with the above, we will start by closing on 1-4 asap and will
>> focus on 5 next.
>>
>> How does that sound?
> 
> Backwards to be honest.
> 
> You should start with the NIC facing qdisc because that's the key part of
> all this and the design might have implications on how the qdiscs which
> feed into it need to be designed.


Ok, let's just try to close on the above first.


Thanks,
Jesus


> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-10 21:24                       ` Jesus Sanchez-Palencia
@ 2018-04-11 20:16                         ` Thomas Gleixner
  2018-04-11 20:31                           ` Ivan Briano
  2018-04-11 23:38                           ` Jesus Sanchez-Palencia
  0 siblings, 2 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-04-11 20:16 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
> >> This will be provided by tbs if the socket which is transmitting packets is
> >> configured for deadline mode.
> > 
> > You don't want the socket to decide that. The qdisc into which a socket
> > feeds defines the mode and the qdisc rejects requests with the wrong mode.
> > 
> > Making a qdisc doing both and let the user decide what he wants it to be is
> > not really going to fly. Especially if you have different users which want
> > a different mode. It's clearly distinct functionality.
> 
> 
> Ok, so just to make sure I got this right, are you suggesting that both the
> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
> parameter for specifying the txtime mode? This way if there is a mismatch,
> packets from that socket are rejected by the qdisc.

Correct. The same is true if you try to set SO_TXTIME for something which
is just routing regular traffic.
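
In other words the admission check boils down to comparing two modes. A
minimal sketch, with hypothetical names throughout; using TXTIME_MODE_NONE
both for a qdisc that carries regular traffic and for a socket without
SO_TXTIME, and returning -EINVAL on mismatch, are assumptions of this sketch.

```c
#include <errno.h>

/* Hypothetical mode values; NONE means "not txtime aware". */
enum txtime_mode {
	TXTIME_MODE_NONE,
	TXTIME_MODE_EXPLICIT,
	TXTIME_MODE_DEADLINE,
};

/*
 * The qdisc defines the mode; a socket feeding into it has to ask for
 * the same one. Returns 0 if the packet may be enqueued, -EINVAL if the
 * modes do not match.
 */
int txtime_admit(enum txtime_mode qdisc_mode, enum txtime_mode sock_mode)
{
	return qdisc_mode == sock_mode ? 0 : -EINVAL;
}
```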

> (...)
> > 
> >> Another question for this mode (but perhaps that applies to both modes) is, what
> >> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
> >> the packet during dequeue.
> > 
> > There the question is how user space is notified about that issue. The
> > application which queued the packet on time does rightfully assume that
> > it's going to be on the wire on time.
> > 
> > This is a violation of the overall scheduling plan, so you need to have
> > a sane design to handle that.
> 
> In addition to the qdisc stats, we could look into using the socket's error
> queue to notify the application about that.

Makes sense.
 
> >> Putting it all together, we end up with:
> >>
> >> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
> >> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
> > 
> > Why CLOCK_REALTIME? The only interesting time in a TSN network is
> > CLOCK_TAI, really.
> 
> REALTIME was just an example here to show that the qdisc has to be configured
> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?

I think so. It's _the_ network time on which everything is based.

> >> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
> >> either as a txtime or as deadline by tbs (and further the NIC driver for the
> >> offload case): SCM_TXTIME.
> >>
> >> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
> >> socket, and will have as parameters a clockid and a txtime mode (deadline or
> >> explicit), that defines the semantics of the timestamp set on packets using
> >> SCM_TXTIME.
> >>
> >> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
> > 
> > Can you remind me why we would need that?
> 
> So there is a "clockid" that can be used for the full hw offload modes. On this
> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .

And the NIC's PTP clock is CLOCK_TAI, so there should be no reason to have
yet another clock, right?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-11 20:16                         ` Thomas Gleixner
@ 2018-04-11 20:31                           ` Ivan Briano
  2018-04-11 23:38                           ` Jesus Sanchez-Palencia
  1 sibling, 0 replies; 129+ messages in thread
From: Ivan Briano @ 2018-04-11 20:31 UTC (permalink / raw)
  To: Thomas Gleixner, Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar



On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
> On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
>>>> This will be provided by tbs if the socket which is transmitting packets is
>>>> configured for deadline mode.
>>>
>>> You don't want the socket to decide that. The qdisc into which a socket
>>> feeds defines the mode and the qdisc rejects requests with the wrong mode.
>>>
>>> Making a qdisc doing both and let the user decide what he wants it to be is
>>> not really going to fly. Especially if you have different users which want
>>> a different mode. It's clearly distinct functionality.
>>
>>
>> Ok, so just to make sure I got this right, are you suggesting that both the
>> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
>> parameter for specifying the txtime mode? This way if there is a mismatch,
>> packets from that socket are rejected by the qdisc.
> 
> Correct. The same is true if you try to set SO_TXTIME for something which
> is just routing regular traffic.
> 
>> (...)
>>>
>>>> Another question for this mode (but perhaps that applies to both modes) is, what
>>>> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
>>>> the packet during dequeue.
>>>
>>> There the question is how user space is notified about that issue. The
>>> application which queued the packet on time does rightfully assume that
>>> it's going to be on the wire on time.
>>>
>>> This is a violation of the overall scheduling plan, so you need to have
>>> a sane design to handle that.
>>
>> In addition to the qdisc stats, we could look into using the socket's error
>> queue to notify the application about that.
> 
> Makes sense.
>  
>>>> Putting it all together, we end up with:
>>>>
>>>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
>>>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
>>>
>>> Why CLOCK_REALTIME? The only interesting time in a TSN network is
>>> CLOCK_TAI, really.
>>
>> REALTIME was just an example here to show that the qdisc has to be configured
>> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
>> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?
> 
> I think so. It's _the_ network time on which everything is based on.
> 
>>>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>>>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>>>> offload case): SCM_TXTIME.
>>>>
>>>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
>>>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>>>> explicit), that defines the semantics of the timestamp set on packets using
>>>> SCM_TXTIME.
>>>>
>>>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
>>>
>>> Can you remind me why we would need that?
>>
>> So there is a "clockid" that can be used for the full hw offload modes. On this
>> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
>> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .
> 
> And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> yet another clock, right?
> 

Most likely, though you can technically have a different time domain
that is not based on TAI.

> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-11 20:16                         ` Thomas Gleixner
  2018-04-11 20:31                           ` Ivan Briano
@ 2018-04-11 23:38                           ` Jesus Sanchez-Palencia
  2018-04-12 15:03                             ` Richard Cochran
  2018-04-19 10:03                             ` Thomas Gleixner
  1 sibling, 2 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-04-11 23:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

Hi,

On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
>>>> Putting it all together, we end up with:
>>>>
>>>> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look like:
>>>> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 150000 offload sorting
>>>
>>> Why CLOCK_REALTIME? The only interesting time in a TSN network is
>>> CLOCK_TAI, really.
>>
>> REALTIME was just an example here to show that the qdisc has to be configured
>> with a clockid parameter. Are you suggesting that instead both of the new qdiscs
>> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?
> 
> I think so. It's _the_ network time on which everything is based on.

Yes, but more on this below.


> 
>>>> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
>>>> either as a txtime or as deadline by tbs (and further the NIC driver for the
>>>> offload case): SCM_TXTIME.
>>>>
>>>> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
>>>> socket, and will have as parameters a clockid and a txtime mode (deadline or
>>>> explicit), that defines the semantics of the timestamp set on packets using
>>>> SCM_TXTIME.
>>>>
>>>> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
>>>
>>> Can you remind me why we would need that?
>>
>> So there is a "clockid" that can be used for the full hw offload modes. On this
>> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
>> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .
> 
> And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> yet another clock, right?

Just breaking this down a bit: yes, TAI is the network time base, and the NIC's
PTP clock uses that because PTP is (commonly) based on TAI. After the PHCs have
been synchronized over the network (e.g. with ptp4l), my understanding is that
if applications want to use the clockid_t CLOCK_TAI as a network clock reference,
something (e.g. phc2sys) has to be synchronizing the PHCs and the system clock,
and something also has to call adjtimex() to apply the TAI vs UTC offset to
CLOCK_TAI.

If we are fine with those 'dependencies', then I agree there is no need for
another clock.
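
Whether the second dependency is met on a given system can be checked from
user space. The sketch below reads the kernel's TAI-UTC offset with a
read-only adjtimex() call (the 'tai' field of struct timex) and compares it
with the difference between CLOCK_TAI and CLOCK_REALTIME. The function names
are made up; an offset of 0 normally means that nothing has set it since boot.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/timex.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* value from include/uapi/linux/time.h */
#endif

/* TAI-UTC offset as stored in the kernel, in seconds. Returns -1 on error. */
int kernel_tai_offset(int *offset_s)
{
	struct timex tx = { .modes = 0 };	/* modes == 0: read only */

	if (adjtimex(&tx) < 0)
		return -1;

	*offset_s = tx.tai;
	return 0;
}

/* The same quantity measured as CLOCK_TAI - CLOCK_REALTIME, in seconds. */
int measured_tai_offset(int *offset_s)
{
	struct timespec utc, tai;
	int64_t diff_ns;

	if (clock_gettime(CLOCK_REALTIME, &utc) ||
	    clock_gettime(CLOCK_TAI, &tai))
		return -1;

	diff_ns = (int64_t)(tai.tv_sec - utc.tv_sec) * 1000000000LL +
		  (tai.tv_nsec - utc.tv_nsec);

	/* Round to the nearest second; the two reads are microseconds apart. */
	*offset_s = (int)((diff_ns + 500000000LL) / 1000000000LL);
	return 0;
}
```

If both calls report 0 on a PTP-synchronized machine, CLOCK_TAI is effectively
UTC there and txtimes taken from it would be off by the current TAI-UTC
offset.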

I was thinking about the full offload use-cases, i.e. when no scheduling is
happening inside the qdiscs. Applications could just read the time from the PHC
clocks directly without having to rely on any of the above. In this case,
userspace would use DYNAMIC_CLOCKID just to flag that this is the case, but I must
admit it's not clear to me how common a use-case that is, or even if it makes
sense.


Thanks,
Jesus


> 
> Thanks,
> 
> 	tglx
> 

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-11 23:38                           ` Jesus Sanchez-Palencia
@ 2018-04-12 15:03                             ` Richard Cochran
  2018-04-12 15:19                               ` Miroslav Lichvar
  2018-04-19 10:03                             ` Thomas Gleixner
  1 sibling, 1 reply; 129+ messages in thread
From: Richard Cochran @ 2018-04-12 15:03 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: Thomas Gleixner, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

On Wed, Apr 11, 2018 at 04:38:44PM -0700, Jesus Sanchez-Palencia wrote:
> Just breaking this down a bit, yes, TAI is the network time base, and the NICs
> PTP clock use that because PTP is (commonly) based on TAI. After the PHCs have
> been synchronized over the network (e.g. with ptp4l), my understanding is that
> if applications want to use the clockid_t CLOCK_TAI as a network clock reference
> it's required that something (i.e. phc2sys) is synchronizing the PHCs and the
> system clock, and also that something calls adjtime to apply the TAI vs UTC
> offset to CLOCK_TAI.

Yes.  I haven't seen any distro that sets the TAI-UTC offset after
boot, nor are there any user space tools for this.  The kernel is
ready, though.

> I was thinking about the full offload use-cases, thus when no scheduling is
> happening inside the qdiscs. Applications could just read the time from the PHC
> clocks directly without having to rely on any of the above. On this case,
> userspace would use DYNAMIC_CLOCK just to flag that this is the case, but I must
> admit it's not clear to me how common of a use-case that is, or even if it makes
> sense.

1588 allows only two timescales, TAI and ARB-itrary.  Although it
doesn't make too much sense to use ARB, still people will do strange
things.  Probably some people use UTC.  I am not advocating supporting
alternate timescales, just pointing out the possibility.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-12 15:03                             ` Richard Cochran
@ 2018-04-12 15:19                               ` Miroslav Lichvar
  0 siblings, 0 replies; 129+ messages in thread
From: Miroslav Lichvar @ 2018-04-12 15:19 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Jesus Sanchez-Palencia, Thomas Gleixner, netdev, jhs,
	xiyou.wangcong, jiri, vinicius.gomes, anna-maria, henrik,
	John Stultz, levi.pearson, edumazet, willemb

On Thu, Apr 12, 2018 at 08:03:49AM -0700, Richard Cochran wrote:
> On Wed, Apr 11, 2018 at 04:38:44PM -0700, Jesus Sanchez-Palencia wrote:
> > Just breaking this down a bit, yes, TAI is the network time base, and the NICs
> > PTP clock use that because PTP is (commonly) based on TAI. After the PHCs have
> > been synchronized over the network (e.g. with ptp4l), my understanding is that
> > if applications want to use the clockid_t CLOCK_TAI as a network clock reference
> > it's required that something (i.e. phc2sys) is synchronizing the PHCs and the
> > system clock, and also that something calls adjtime to apply the TAI vs UTC
> > offset to CLOCK_TAI.
> 
> Yes.  I haven't seen any distro that sets the TAI-UTC offset after
> boot, nor are there any user space tools for this.  The kernel is
> ready, though.

FWIW, the default NTP configuration in Fedora sets the kernel TAI-UTC
offset.

> > I was thinking about the full offload use-cases, thus when no scheduling is
> > happening inside the qdiscs. Applications could just read the time from the PHC
> > clocks directly without having to rely on any of the above. On this case,
> > userspace would use DYNAMIC_CLOCK just to flag that this is the case, but I must
> > admit it's not clear to me how common of a use-case that is, or even if it makes
> > sense.
> 
> 1588 allows only two timescales, TAI and ARB-itrary.  Although it
> doesn't make too much sense to use ARB, still people will do strange
> things.  Probably some people use UTC.  I am not advocating supporting
> alternate timescales, just pointing out the possibility.

There is also the possibility that the NIC clock is not synchronized
to anything. For synchronization of the system clock it's easier to
leave it free running and only track its phase/frequency offset to
allow conversion between the PHC and system time.
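
A rough sketch of what such a conversion could look like, assuming a simple
linear model: one reference pair of (PHC, system) readings plus a frequency
offset in parts per billion. The struct, its field names and the sign
convention of freq_ppb are invented for this illustration and are not taken
from any existing tool.

```c
#include <stdint.h>

/* Hypothetical tracking state for a free-running PHC. */
struct phc_tracking {
	int64_t phc_ref_ns;	/* PHC reading at the last measurement */
	int64_t sys_ref_ns;	/* system time at that same instant */
	int64_t freq_ppb;	/* system clock rate relative to the PHC;
				 * positive means the system clock runs faster */
};

/*
 * Convert a PHC timestamp to system time. The product dt * freq_ppb is
 * computed in 64 bits, so this is only meant for intervals that are
 * short compared to the tracking update period.
 */
int64_t phc_to_sys(const struct phc_tracking *t, int64_t phc_ns)
{
	int64_t dt = phc_ns - t->phc_ref_ns;

	return t->sys_ref_ns + dt + dt * t->freq_ppb / 1000000000LL;
}
```

For example, with a 1000 ppb offset one second of PHC time maps to one second
plus 1000 ns of system time.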

-- 
Miroslav Lichvar

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-11 23:38                           ` Jesus Sanchez-Palencia
  2018-04-12 15:03                             ` Richard Cochran
@ 2018-04-19 10:03                             ` Thomas Gleixner
  1 sibling, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-04-19 10:03 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, anna-maria, henrik, John Stultz, levi.pearson,
	edumazet, willemb, mlichvar

On Wed, 11 Apr 2018, Jesus Sanchez-Palencia wrote:
> On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
> >> So there is a "clockid" that can be used for the full hw offload modes. On this
> >> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, we
> >> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .
> > 
> > And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> > yet another clock, right?
> 
> Just breaking this down a bit, yes, TAI is the network time base, and the NICs
> PTP clock use that because PTP is (commonly) based on TAI. After the PHCs have
> been synchronized over the network (e.g. with ptp4l), my understanding is that
> if applications want to use the clockid_t CLOCK_TAI as a network clock reference
> it's required that something (i.e. phc2sys) is synchronizing the PHCs and the
> system clock, and also that something calls adjtime to apply the TAI vs UTC
> offset to CLOCK_TAI.
> 
> If we are fine with those 'dependencies', then I agree there is no need for
> another clock.
> 
> I was thinking about the full offload use-cases, thus when no scheduling is
> happening inside the qdiscs. Applications could just read the time from the PHC
> clocks directly without having to rely on any of the above. On this case,
> userspace would use DYNAMIC_CLOCK just to flag that this is the case, but I must
> admit it's not clear to me how common of a use-case that is, or even if it makes
> sense.

I don't think it makes a lot of sense because the only use case for that is
a full user space scheduler which routes _ALL_ traffic. I don't think
that's something which we want to proliferate.

So I'd rather start off with the CLOCK_TAI assumption and if the need
really arises we can discuss that separately. So you can take a clockid
into account when designing the ABI, but have it CLOCK_TAI only for the
start.
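
Something along these lines, i.e. the clockid is part of the configuration but
anything other than CLOCK_TAI is refused for now. The struct and function
names are placeholders for this sketch, and -EINVAL is just one plausible
error code.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* value from include/uapi/linux/time.h */
#endif

/* Hypothetical configuration that already carries a clockid. */
struct txtime_config {
	clockid_t clockid;
};

/* Accept CLOCK_TAI only; other clocks can be allowed later without
 * changing the layout of the configuration. */
int txtime_config_validate(const struct txtime_config *cfg)
{
	if (cfg->clockid != CLOCK_TAI)
		return -EINVAL;

	return 0;
}
```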

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-03-21 13:46     ` [Intel-wired-lan] " Thomas Gleixner
@ 2018-04-23 18:21       ` Jesus Sanchez-Palencia
  -1 siblings, 0 replies; 129+ messages in thread
From: Jesus Sanchez-Palencia @ 2018-04-23 18:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, henrik, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

Hi Thomas,


On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
>> +struct tbs_sched_data {
>> +	bool sorting;
>> +	int clockid;
>> +	int queue;
>> +	s32 delta; /* in ns */
>> +	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
>> +	struct rb_root head;
> 
> Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> you could reuse the timerqueue implementation?
> 
> That requires to add a timerqueue node to struct skbuff
> 
> @@ -671,7 +671,8 @@ struct sk_buff {
>  				unsigned long		dev_scratch;
>  			};
>  		};
> -		struct rb_node	rbnode; /* used in netem & tcp stack */
> +		struct rb_node		rbnode; /* used in netem & tcp stack */
> +		struct timerqueue_node	tqnode;
>  	};
>  	struct sock		*sk;
> 
> Then you can use timerqueue_head in your scheduler data and all the open
> coded rbtree handling goes away.


I just noticed that doing the above increases the size of struct sk_buff by 8
bytes: struct timerqueue_node is 32 bytes long while struct rb_node is only
24 bytes long.
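
The numbers can be reproduced outside the kernel with user space mirrors of
the two layouts. The *_mirror types below are stand-ins written for this
sketch, not the kernel definitions themselves, and the sizes hold for a
64-bit (LP64) build.

```c
#include <stdint.h>

/* Mirror of struct rb_node: parent/color word plus two child pointers. */
struct rb_node_mirror {
	unsigned long parent_color;
	struct rb_node_mirror *right;
	struct rb_node_mirror *left;
} __attribute__((aligned(sizeof(long))));

/* Mirror of struct timerqueue_node: an rb_node plus the expiry time. */
struct timerqueue_node_mirror {
	struct rb_node_mirror node;
	int64_t expires;	/* ktime_t is a 64-bit signed value */
};

/* The union at the top of struct sk_buff, before and after the change. */
union skb_head_before {
	struct { void *next, *prev; unsigned long dev_scratch; } list;
	struct rb_node_mirror rbnode;
};

union skb_head_after {
	struct { void *next, *prev; unsigned long dev_scratch; } list;
	struct rb_node_mirror rbnode;
	struct timerqueue_node_mirror tqnode;
};

/* Growth of the union, and therefore of struct sk_buff, in bytes. */
int skb_head_growth(void)
{
	return (int)(sizeof(union skb_head_after) -
		     sizeof(union skb_head_before));
}
```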

Given the feedback we got here before against touching struct sk_buff at all for
non-generic use cases, I will keep the implementation of sch_tbs.c as is, thus
keeping the open-coded version for now, ok?

Thanks,
Jesus


(...)

^ permalink raw reply	[flat|nested] 129+ messages in thread

* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-23 18:21       ` [Intel-wired-lan] " Jesus Sanchez-Palencia
@ 2018-04-24  8:50         ` Thomas Gleixner
  -1 siblings, 0 replies; 129+ messages in thread
From: Thomas Gleixner @ 2018-04-24  8:50 UTC (permalink / raw)
  To: Jesus Sanchez-Palencia
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	richardcochran, intel-wired-lan, anna-maria, henrik, john.stultz,
	levi.pearson, edumazet, willemb, mlichvar

On Mon, 23 Apr 2018, Jesus Sanchez-Palencia wrote:
> On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> > On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> >> +struct tbs_sched_data {
> >> +	bool sorting;
> >> +	int clockid;
> >> +	int queue;
> >> +	s32 delta; /* in ns */
> >> +	ktime_t last; /* The txtime of the last skb sent to the netdevice. */
> >> +	struct rb_root head;
> > 
> > Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> > you could reuse the timerqueue implementation?
> > 
> > That requires to add a timerqueue node to struct skbuff
> > 
> > @@ -671,7 +671,8 @@ struct sk_buff {
> >  				unsigned long		dev_scratch;
> >  			};
> >  		};
> > -		struct rb_node	rbnode; /* used in netem & tcp stack */
> > +		struct rb_node		rbnode; /* used in netem & tcp stack */
> > +		struct timerqueue_node	tqnode;
> >  	};
> >  	struct sock		*sk;
> > 
> > Then you can use timerqueue_head in your scheduler data and all the open
> > coded rbtree handling goes away.
> 
> 
> I just noticed that doing the above increases the size of struct sk_buff by 8
> bytes - struct timerqueue_node is 32 bytes long while struct rb_node is only
> 24 bytes long.
> 
> Given the feedback we got here before against touching struct sk_buff at all for
> non-generic use cases, I will keep the implementation of sch_tbs.c as is, thus
> keeping the open-coded version for now, ok?

The size of sk_buff is 216 and the size of sk_buff_fclones is 440
bytes. The sk_buff and sk_buff_fclones kmem_caches use objects sized 256
and 512 bytes because the kmem_caches are created with SLAB_HWCACHE_ALIGN.

So adding 8 bytes to spare duplicated code will not change the kmem_cache
object size and I really doubt that anyone will notice.

Thanks,

	tglx
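
The slab argument in this message comes down to simple arithmetic, sketched
below. The structure and object sizes are the ones quoted in the message; the
helper names are made up for illustration, and the factor of two for
sk_buff_fclones is an assumption based on that structure embedding two
sk_buffs. This only checks the object-size claim and says nothing about
field placement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Figures quoted in the message above. */
enum {
	SKB_SIZE		= 216,	/* sizeof(struct sk_buff) */
	SKB_FCLONES_SIZE	= 440,	/* sizeof(struct sk_buff_fclones) */
	SKB_OBJ_SIZE		= 256,	/* kmem_cache object size */
	SKB_FCLONES_OBJ_SIZE	= 512,	/* kmem_cache object size */
	TQNODE_GROWTH		= 8,	/* timerqueue_node minus rb_node */
};

/* True if a structure that grows by 'growth' bytes still fits in the
 * slab object it is currently allocated from. */
bool fits_in_object(size_t struct_size, size_t growth, size_t object_size)
{
	return struct_size + growth <= object_size;
}

/* Hypothetical check covering both caches named in the message. */
bool skb_growth_keeps_object_sizes(void)
{
	/* Assumption: sk_buff_fclones embeds two sk_buffs, so it grows
	 * by twice the per-skb amount. */
	return fits_in_object(SKB_SIZE, TQNODE_GROWTH, SKB_OBJ_SIZE) &&
	       fits_in_object(SKB_FCLONES_SIZE, 2 * TQNODE_GROWTH,
			      SKB_FCLONES_OBJ_SIZE);
}
```

With these inputs the grown structures would be 224 and 456 bytes, both
within the 256 and 512 byte objects.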



* Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
  2018-04-24  8:50         ` [Intel-wired-lan] " Thomas Gleixner
@ 2018-04-24 13:50           ` David Miller
  -1 siblings, 0 replies; 129+ messages in thread
From: David Miller @ 2018-04-24 13:50 UTC (permalink / raw)
  To: tglx
  Cc: jesus.sanchez-palencia, netdev, jhs, xiyou.wangcong, jiri,
	vinicius.gomes, richardcochran, intel-wired-lan, anna-maria,
	henrik, john.stultz, levi.pearson, edumazet, willemb, mlichvar

From: Thomas Gleixner <tglx@linutronix.de>
Date: Tue, 24 Apr 2018 10:50:04 +0200 (CEST)

> So adding 8 bytes to spare duplicated code will not change the kmem_cache
> object size and I really doubt that anyone will notice.

It's about where the cache lines end up when each and every byte is added
to the structure, not just the slab object size.
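
The cache line concern can be illustrated with a toy layout. The structures
below are hypothetical and are not the real struct sk_buff: the field names,
the 32 bytes of earlier fields and the 64-byte cache line size are all
assumptions chosen so that the effect is visible. The point is only that
growing a union by 8 bytes shifts every later field, which can push one
across a cache line boundary even when the slab object size is unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64	/* assumption: 64-byte cache lines */

/* Hypothetical layout before the change: the union holds a 24-byte node. */
struct pkt_before {
	uint8_t earlier_fields[32];	/* placeholder for preceding members */
	union {
		uint8_t rbnode[24];
	} u;
	void *hot_field;		/* a frequently accessed member */
};

/* Same layout after adding a 32-byte member to the union. */
struct pkt_after {
	uint8_t earlier_fields[32];
	union {
		uint8_t rbnode[24];
		uint8_t tqnode[32];
	} u;
	void *hot_field;
};

/* Index of the cache line on which a member starts, assuming the
 * structure itself begins on a cache line boundary. */
#define CACHE_LINE_OF(type, member) \
	(offsetof(type, member) / CACHE_LINE_SIZE)
```

In this toy layout hot_field starts at offset 56 before the change and at
offset 64 after it, so it moves from the first cache line to the second.
Whether anything comparable happens in the real structure depends on its
actual layout, which a tool such as pahole can show.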



