* [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio
@ 2019-05-28 17:46 Vedang Patel
  2019-05-28 17:46 ` [PATCH net-next v1 1/7] igb: clear out tstamp after sending the packet Vedang Patel
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Vedang Patel @ 2019-05-28 17:46 UTC (permalink / raw)
  To: netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, Vedang Patel

Currently, we are seeing packets being transmitted outside their timeslices. We
can confirm that the packets are being dequeued at the right time, so the delay
is induced after the packet is dequeued: taprio, without any offloading, has no
control over when a packet is actually transmitted.

In order to solve this, we are making use of the txtime feature provided by ETF
qdisc. Hardware offloading needs to be supported by the ETF qdisc in order to
take advantage of this feature. The taprio qdisc will assign txtime (in
skb->tstamp) for all the packets which do not have the txtime allocated via the
SO_TXTIME socket option. For the packets which already have SO_TXTIME set,
taprio will validate whether the packet will be transmitted in the correct
interval.
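
As a reference for how an application hands taprio a pre-assigned txtime, here
is a minimal userspace sketch using the standard SO_TXTIME/SCM_TXTIME API (the
destination port and the 1 ms offset are placeholders, and error handling is
trimmed):

    #include <string.h>
    #include <time.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <linux/net_tstamp.h>  /* struct sock_txtime, SCM_TXTIME */

    static int send_at(int fd, const void *buf, size_t len,
                       struct sockaddr_in *addr, __u64 txtime)
    {
        char control[CMSG_SPACE(sizeof(txtime))] = {};
        struct iovec iov = { (void *)buf, len };
        struct msghdr msg = {};
        struct cmsghdr *cm;

        msg.msg_name = addr;
        msg.msg_namelen = sizeof(*addr);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        /* Attach the desired transmit time to this packet. */
        cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_TXTIME;
        cm->cmsg_len = CMSG_LEN(sizeof(txtime));
        memcpy(CMSG_DATA(cm), &txtime, sizeof(txtime));

        return sendmsg(fd, &msg, 0);
    }

    int main(void)
    {
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port = htons(7788),  /* placeholder destination */
        };
        struct sock_txtime st = { .clockid = CLOCK_TAI, .flags = 0 };
        struct timespec now;
        __u64 txtime;
        int fd;

        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        fd = socket(AF_INET, SOCK_DGRAM, 0);

        /* Every packet on this socket now carries a txtime in skb->tstamp. */
        if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_TXTIME, &st, sizeof(st)))
            return 1;

        /* Request transmission 1 ms from now, in CLOCK_TAI nanoseconds. */
        clock_gettime(CLOCK_TAI, &now);
        txtime = now.tv_sec * 1000000000ULL + now.tv_nsec + 1000000;

        return send_at(fd, "x", 1, &addr, txtime) < 0;
    }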

In order to support this, the following parameters have been added:
- offload (taprio): This is added in order to support the different offloading
  modes which will be added in the future.
- txtime-delay (taprio): This is the time the packet takes to traverse through
  the kernel to the network adapter.
- skip_sock_check (etf): The etf qdisc currently drops any packet which does
  not have the SO_TXTIME socket option set. This check can be skipped by
  specifying this option.

Following is an example configuration:

$ tc qdisc replace dev $IFACE parent root handle 100 taprio \
      num_tc 3 \
      map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
      queues 1@0 1@0 1@0 \
      base-time $BASE_TIME \
      sched-entry S 01 300000 \
      sched-entry S 02 300000 \
      sched-entry S 04 400000 \
      offload 2 \
      txtime-delay 40000 \
      clockid CLOCK_TAI

$ tc qdisc replace dev $IFACE parent 100:1 etf \
      offload delta 200000 clockid CLOCK_TAI skip_sock_check

Here, the "offload" parameter indicates that the TXTIME_OFFLOAD mode is
enabled. Note that all the traffic classes have been assigned the same queue.
This is to prevent the traffic classes in the lower priority queues from
getting starved. Note also that this configuration is specific to the i210
ethernet card. Other network cards, where the hardware queues are given the
same priority, might be able to utilize more than one queue.

Following are some of the other highlights of the series:
- Fix a bug where hardware timestamping and SO_TXTIME options cannot be used
  together. (Patch 1)
- Introduce hardware offloading. This patch introduces the offload parameter
  which can be used to enable the txtime offload mode. It will also support
  enabling full offload when the support is available in drivers. (Patch 3)
- Make TxTime assist mode work with TCP packets. (Patch 6 & 7)


The following changes are recommended to be done in order to get the best
performance from taprio in this mode:
# ip link set dev $IFACE mtu 1514
# ethtool -K $IFACE gso off
# ethtool -K $IFACE tso off
# ethtool --set-eee $IFACE eee off

Thanks,
Vedang Patel

Vedang Patel (6):
  igb: clear out tstamp after sending the packet.
  etf: Add skip_sock_check
  taprio: calculate cycle_time when schedule is installed
  taprio: Add support for txtime offload mode.
  taprio: make clock reference conversions easier
  taprio: Adjust timestamps for TCP packets.

Vinicius Costa Gomes (1):
  taprio: Add the skeleton to enable hardware offloading

 drivers/net/ethernet/intel/igb/igb_main.c |   1 +
 include/uapi/linux/pkt_sched.h            |   6 +
 net/sched/sch_etf.c                       |  10 +
 net/sched/sch_taprio.c                    | 548 ++++++++++++++++++++--
 4 files changed, 532 insertions(+), 33 deletions(-)

-- 
2.17.0



* [PATCH net-next v1 1/7] igb: clear out tstamp after sending the packet.
  2019-05-28 17:46 [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Vedang Patel
@ 2019-05-28 17:46 ` Vedang Patel
  2019-05-28 17:46 ` [PATCH net-next v1 2/7] etf: Add skip_sock_check Vedang Patel
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Vedang Patel @ 2019-05-28 17:46 UTC (permalink / raw)
  To: netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, Vedang Patel

skb->tstamp is being used in multiple places. On the transmit side, it
is used to determine the launchtime of the packet. It is also used to
report the software timestamp after the packet has been transmitted.

So, clear out the tstamp value after it has been read, so that the
leftover launchtime is not later reported back to user space as a false
software timestamp.
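
For context, the software transmit timestamp in question is what an
application sees when it enables SO_TIMESTAMPING and reads the socket error
queue. A rough userspace sketch of that read path (standard API, buffer sizes
illustrative):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/net_tstamp.h>  /* SOF_TIMESTAMPING_* */
    #include <linux/errqueue.h>    /* struct scm_timestamping */

    static void enable_sw_tx_tstamp(int fd)
    {
        int flags = SOF_TIMESTAMPING_TX_SOFTWARE |
                    SOF_TIMESTAMPING_SOFTWARE;

        setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
    }

    static void read_sw_tx_tstamp(int fd)
    {
        char data[256], control[256];
        struct iovec iov = { data, sizeof(data) };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = control, .msg_controllen = sizeof(control),
        };
        struct cmsghdr *cm;

        /* Completed transmit timestamps come back on the error queue. */
        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
            return;

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
            if (cm->cmsg_level == SOL_SOCKET &&
                cm->cmsg_type == SCM_TIMESTAMPING) {
                struct scm_timestamping *ts = (void *)CMSG_DATA(cm);

                /* ts->ts[0] is the software timestamp. With a stale
                 * launchtime left in skb->tstamp, this is where the
                 * bogus value would show up.
                 */
                printf("sw tx tstamp: %lld.%09ld\n",
                       (long long)ts->ts[0].tv_sec, ts->ts[0].tv_nsec);
            }
        }
    }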

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 39f33afc479c..005c1693efc8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5687,6 +5687,7 @@ static void igb_tx_ctxtdesc(struct igb_ring *tx_ring,
 	 */
 	if (tx_ring->launchtime_enable) {
 		ts = ns_to_timespec64(first->skb->tstamp);
+		first->skb->tstamp = 0;
 		context_desc->seqnum_seed = cpu_to_le32(ts.tv_nsec / 32);
 	} else {
 		context_desc->seqnum_seed = 0;
-- 
2.17.0



* [PATCH net-next v1 2/7] etf: Add skip_sock_check
  2019-05-28 17:46 [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Vedang Patel
  2019-05-28 17:46 ` [PATCH net-next v1 1/7] igb: clear out tstamp after sending the packet Vedang Patel
@ 2019-05-28 17:46 ` Vedang Patel
  2019-05-28 17:46 ` [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading Vedang Patel
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Vedang Patel @ 2019-05-28 17:46 UTC (permalink / raw)
  To: netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, Vedang Patel

Currently, etf expects a socket with the SO_TXTIME option set for each packet
it encounters, and it will drop all other packets. But, in future commits we
are planning to add functionality where the tstamp value will be set by
another qdisc. Also, some packets which are generated from within the kernel
(e.g. ICMP packets) do not have any socket associated with them.

So, this commit adds support for skip_sock_check. When this option is set, etf
will skip checking for a socket and other associated options for all skbs.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 include/uapi/linux/pkt_sched.h |  1 +
 net/sched/sch_etf.c            | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 8b2f993cbb77..69fc52e4d6bd 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -990,6 +990,7 @@ struct tc_etf_qopt {
 	__u32 flags;
 #define TC_ETF_DEADLINE_MODE_ON	BIT(0)
 #define TC_ETF_OFFLOAD_ON	BIT(1)
+#define TC_ETF_SKIP_SOCK_CHECK  BIT(2)
 };
 
 enum {
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
index db0c2ba1d156..cebfb65d8556 100644
--- a/net/sched/sch_etf.c
+++ b/net/sched/sch_etf.c
@@ -22,10 +22,12 @@
 
 #define DEADLINE_MODE_IS_ON(x) ((x)->flags & TC_ETF_DEADLINE_MODE_ON)
 #define OFFLOAD_IS_ON(x) ((x)->flags & TC_ETF_OFFLOAD_ON)
+#define SKIP_SOCK_CHECK_IS_SET(x) ((x)->flags & TC_ETF_SKIP_SOCK_CHECK)
 
 struct etf_sched_data {
 	bool offload;
 	bool deadline_mode;
+	bool skip_sock_check;
 	int clockid;
 	int queue;
 	s32 delta; /* in ns */
@@ -77,6 +79,9 @@ static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
 	struct sock *sk = nskb->sk;
 	ktime_t now;
 
+	if (q->skip_sock_check)
+		goto skip;
+
 	if (!sk)
 		return false;
 
@@ -92,6 +97,7 @@ static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
 	if (sk->sk_txtime_deadline_mode != q->deadline_mode)
 		return false;
 
+skip:
 	now = q->get_time();
 	if (ktime_before(txtime, now) || ktime_before(txtime, q->last))
 		return false;
@@ -385,6 +391,7 @@ static int etf_init(struct Qdisc *sch, struct nlattr *opt,
 	q->clockid = qopt->clockid;
 	q->offload = OFFLOAD_IS_ON(qopt);
 	q->deadline_mode = DEADLINE_MODE_IS_ON(qopt);
+	q->skip_sock_check = SKIP_SOCK_CHECK_IS_SET(qopt);
 
 	switch (q->clockid) {
 	case CLOCK_REALTIME:
@@ -473,6 +480,9 @@ static int etf_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (q->deadline_mode)
 		opt.flags |= TC_ETF_DEADLINE_MODE_ON;
 
+	if (q->skip_sock_check)
+		opt.flags |= TC_ETF_SKIP_SOCK_CHECK;
+
 	if (nla_put(skb, TCA_ETF_PARMS, sizeof(opt), &opt))
 		goto nla_put_failure;
 
-- 
2.17.0



* [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading
  2019-05-28 17:46 [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Vedang Patel
  2019-05-28 17:46 ` [PATCH net-next v1 1/7] igb: clear out tstamp after sending the packet Vedang Patel
  2019-05-28 17:46 ` [PATCH net-next v1 2/7] etf: Add skip_sock_check Vedang Patel
@ 2019-05-28 17:46 ` Vedang Patel
  2019-05-28 22:45   ` Jakub Kicinski
  2019-05-28 17:46 ` [PATCH net-next v1 4/7] taprio: calculate cycle_time when schedule is installed Vedang Patel
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Vedang Patel @ 2019-05-28 17:46 UTC (permalink / raw)
  To: netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, Vedang Patel

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

This adds the UAPI and the core bits necessary for userspace to
request hardware offloading to be enabled.

The future commits will enable hybrid or full offloading for taprio. This
commit sets up the infrastructure to enable it via the netlink interface.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 include/uapi/linux/pkt_sched.h |   4 +
 net/sched/sch_taprio.c         | 156 ++++++++++++++++++++++++++++++++-
 2 files changed, 158 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 69fc52e4d6bd..3319255ffa25 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1159,6 +1159,9 @@ enum {
  *       [TCA_TAPRIO_ATTR_SCHED_ENTRY_INTERVAL]
  */
 
+#define TCA_TAPRIO_ATTR_OFFLOAD_FLAG_FULL_OFFLOAD 0x1
+#define TCA_TAPRIO_ATTR_OFFLOAD_FLAG_TXTIME_OFFLOAD 0x2
+
 enum {
 	TCA_TAPRIO_ATTR_UNSPEC,
 	TCA_TAPRIO_ATTR_PRIOMAP, /* struct tc_mqprio_qopt */
@@ -1170,6 +1173,7 @@ enum {
 	TCA_TAPRIO_ATTR_ADMIN_SCHED, /* The admin sched, only used in dump */
 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
+	TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, /* u32 */
 	__TCA_TAPRIO_ATTR_MAX,
 };
 
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 9ecfb8f5902a..cee6b7a3dd85 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -26,6 +26,11 @@ static LIST_HEAD(taprio_list);
 static DEFINE_SPINLOCK(taprio_list_lock);
 
 #define TAPRIO_ALL_GATES_OPEN -1
+#define VALID_OFFLOAD(flags) ((flags) != U32_MAX)
+#define FULL_OFFLOAD_IS_ON(flags) (VALID_OFFLOAD((flags)) && \
+				   ((flags) & TCA_TAPRIO_ATTR_OFFLOAD_FLAG_FULL_OFFLOAD))
+#define TXTIME_OFFLOAD_IS_ON(flags) (VALID_OFFLOAD((flags)) && \
+				     ((flags) & TCA_TAPRIO_ATTR_OFFLOAD_FLAG_TXTIME_OFFLOAD))
 
 struct sched_entry {
 	struct list_head list;
@@ -55,6 +60,7 @@ struct sched_gate_list {
 struct taprio_sched {
 	struct Qdisc **qdiscs;
 	struct Qdisc *root;
+	u32 offload_flags;
 	int clockid;
 	atomic64_t picos_per_byte; /* Using picoseconds because for 10Gbps+
 				    * speeds it's sub-nanoseconds per byte
@@ -66,6 +72,8 @@ struct taprio_sched {
 	struct sched_gate_list __rcu *oper_sched;
 	struct sched_gate_list __rcu *admin_sched;
 	ktime_t (*get_time)(void);
+	struct sk_buff *(*dequeue)(struct Qdisc *sch);
+	struct sk_buff *(*peek)(struct Qdisc *sch);
 	struct hrtimer advance_timer;
 	struct list_head taprio_list;
 };
@@ -143,7 +151,30 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	return qdisc_enqueue(skb, child, to_free);
 }
 
-static struct sk_buff *taprio_peek(struct Qdisc *sch)
+static struct sk_buff *taprio_peek_offload(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct sk_buff *skb;
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		struct Qdisc *child = q->qdiscs[i];
+
+		if (unlikely(!child))
+			continue;
+
+		skb = child->ops->peek(child);
+		if (!skb)
+			continue;
+
+		return skb;
+	}
+
+	return NULL;
+}
+
+static struct sk_buff *taprio_peek_soft(struct Qdisc *sch)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
@@ -184,6 +215,13 @@ static struct sk_buff *taprio_peek(struct Qdisc *sch)
 	return NULL;
 }
 
+static struct sk_buff *taprio_peek(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+
+	return q->peek(sch);
+}
+
 static inline int length_to_duration(struct taprio_sched *q, int len)
 {
 	return div_u64(len * atomic64_read(&q->picos_per_byte), 1000);
@@ -196,7 +234,7 @@ static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
 			     atomic64_read(&q->picos_per_byte)));
 }
 
-static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
+static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
@@ -275,6 +313,40 @@ static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
 	return skb;
 }
 
+static struct sk_buff *taprio_dequeue_offload(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct sk_buff *skb;
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		struct Qdisc *child = q->qdiscs[i];
+
+		if (unlikely(!child))
+			continue;
+
+		skb = child->ops->dequeue(child);
+		if (unlikely(!skb))
+			continue;
+
+		qdisc_bstats_update(sch, skb);
+		qdisc_qstats_backlog_dec(sch, skb);
+		sch->q.qlen--;
+
+		return skb;
+	}
+
+	return NULL;
+}
+
+static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+
+	return q->dequeue(sch);
+}
+
 static bool should_restart_cycle(const struct sched_gate_list *oper,
 				 const struct sched_entry *entry)
 {
@@ -707,6 +779,59 @@ static int taprio_dev_notifier(struct notifier_block *nb, unsigned long event,
 	return NOTIFY_DONE;
 }
 
+static void taprio_disable_offload(struct net_device *dev,
+				   struct taprio_sched *q)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (!q->offload_flags)
+		return;
+
+	if (!ops->ndo_setup_tc)
+		return;
+
+	/* FIXME: disable offloading */
+
+	/* Just to be sure to keep the function pointers in a
+	 * consistent state always.
+	 */
+	q->dequeue = taprio_dequeue_soft;
+	q->peek = taprio_peek_soft;
+
+	q->offload_flags = 0;
+}
+
+static int taprio_enable_offload(struct net_device *dev,
+				 struct tc_mqprio_qopt *mqprio,
+				 struct taprio_sched *q,
+				 struct sched_gate_list *sched,
+				 struct netlink_ext_ack *extack,
+				 u32 offload_flags)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	int err = 0;
+
+	if (!FULL_OFFLOAD_IS_ON(offload_flags)) {
+		NL_SET_ERR_MSG(extack, "Offload mode is not supported");
+		return -EOPNOTSUPP;
+	}
+
+	if (!ops->ndo_setup_tc) {
+		NL_SET_ERR_MSG(extack, "Specified device does not support taprio offload");
+		return -EOPNOTSUPP;
+	}
+
+	/* FIXME: enable offloading */
+
+	q->dequeue = taprio_dequeue_offload;
+	q->peek = taprio_peek_offload;
+
+	if (err == 0)
+		q->offload_flags = offload_flags;
+
+	return err;
+}
+
 static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 			 struct netlink_ext_ack *extack)
 {
@@ -715,6 +840,7 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 	struct taprio_sched *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
 	struct tc_mqprio_qopt *mqprio = NULL;
+	u32 offload_flags = U32_MAX;
 	int i, err, clockid;
 	unsigned long flags;
 	ktime_t start;
@@ -731,6 +857,9 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 	if (err < 0)
 		return err;
 
+	if (tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS])
+		offload_flags = nla_get_u32(tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]);
+
 	new_admin = kzalloc(sizeof(*new_admin), GFP_KERNEL);
 	if (!new_admin) {
 		NL_SET_ERR_MSG(extack, "Not enough memory for a new schedule");
@@ -749,6 +878,12 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		goto free_sched;
 	}
 
+	if (offload_flags != U32_MAX && (oper || admin)) {
+		NL_SET_ERR_MSG(extack, "Changing 'offload' of a running schedule is not supported");
+		err = -ENOTSUPP;
+		goto free_sched;
+	}
+
 	err = parse_taprio_schedule(tb, new_admin, extack);
 	if (err < 0)
 		goto free_sched;
@@ -823,6 +958,15 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		goto unlock;
 	}
 
+	if (!offload_flags) {
+		taprio_disable_offload(dev, q);
+	} else if (VALID_OFFLOAD(offload_flags) || q->offload_flags) {
+		err = taprio_enable_offload(dev, mqprio, q,
+					    new_admin, extack, offload_flags);
+		if (err)
+			goto unlock;
+	}
+
 	err = taprio_get_start_time(sch, new_admin, &start);
 	if (err < 0) {
 		NL_SET_ERR_MSG(extack, "Internal error: failed get start time");
@@ -866,6 +1010,8 @@ static void taprio_destroy(struct Qdisc *sch)
 
 	hrtimer_cancel(&q->advance_timer);
 
+	taprio_disable_offload(dev, q);
+
 	if (q->qdiscs) {
 		for (i = 0; i < dev->num_tx_queues && q->qdiscs[i]; i++)
 			qdisc_put(q->qdiscs[i]);
@@ -895,6 +1041,9 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
 	hrtimer_init(&q->advance_timer, CLOCK_TAI, HRTIMER_MODE_ABS);
 	q->advance_timer.function = advance_sched;
 
+	q->dequeue = taprio_dequeue_soft;
+	q->peek = taprio_peek_soft;
+
 	q->root = sch;
 
 	/* We only support static clockids. Use an invalid value as default
@@ -1080,6 +1229,9 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put_s32(skb, TCA_TAPRIO_ATTR_SCHED_CLOCKID, q->clockid))
 		goto options_error;
 
+	if (nla_put_u32(skb, TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, q->offload_flags))
+		goto options_error;
+
 	if (oper && dump_schedule(skb, oper))
 		goto options_error;
 
-- 
2.17.0



* [PATCH net-next v1 4/7] taprio: calculate cycle_time when schedule is installed
  2019-05-28 17:46 [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Vedang Patel
                   ` (2 preceding siblings ...)
  2019-05-28 17:46 ` [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading Vedang Patel
@ 2019-05-28 17:46 ` Vedang Patel
  2019-05-28 17:46 ` [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode Vedang Patel
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Vedang Patel @ 2019-05-28 17:46 UTC (permalink / raw)
  To: netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, Vedang Patel

The cycle time for a particular schedule only needs to be calculated once,
when the schedule is first installed. So, calculate it right after the
'cycle_time' parameter has been parsed and store the result in cycle_time.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 net/sched/sch_taprio.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index cee6b7a3dd85..8d87ba099130 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -116,22 +116,6 @@ static void switch_schedules(struct taprio_sched *q,
 	*admin = NULL;
 }
 
-static ktime_t get_cycle_time(struct sched_gate_list *sched)
-{
-	struct sched_entry *entry;
-	ktime_t cycle = 0;
-
-	if (sched->cycle_time != 0)
-		return sched->cycle_time;
-
-	list_for_each_entry(entry, &sched->entries, list)
-		cycle = ktime_add_ns(cycle, entry->interval);
-
-	sched->cycle_time = cycle;
-
-	return cycle;
-}
-
 static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			  struct sk_buff **to_free)
 {
@@ -596,6 +580,15 @@ static int parse_taprio_schedule(struct nlattr **tb,
 	if (err < 0)
 		return err;
 
+	if (!new->cycle_time) {
+		struct sched_entry *entry;
+		ktime_t cycle = 0;
+
+		list_for_each_entry(entry, &new->entries, list)
+			cycle = ktime_add_ns(cycle, entry->interval);
+		new->cycle_time = cycle;
+	}
+
 	return 0;
 }
 
@@ -677,7 +670,7 @@ static int taprio_get_start_time(struct Qdisc *sch,
 		return 0;
 	}
 
-	cycle = get_cycle_time(sched);
+	cycle = sched->cycle_time;
 
 	/* The qdisc is expected to have at least one sched_entry.  Moreover,
 	 * any entry must have 'interval' > 0. Thus if the cycle time is zero,
@@ -704,7 +697,7 @@ static void setup_first_close_time(struct taprio_sched *q,
 	first = list_first_entry(&sched->entries,
 				 struct sched_entry, list);
 
-	cycle = get_cycle_time(sched);
+	cycle = sched->cycle_time;
 
 	/* FIXME: find a better place to do this */
 	sched->cycle_close_time = ktime_add_ns(base, cycle);
-- 
2.17.0



* [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode.
  2019-05-28 17:46 [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Vedang Patel
                   ` (3 preceding siblings ...)
  2019-05-28 17:46 ` [PATCH net-next v1 4/7] taprio: calculate cycle_time when schedule is installed Vedang Patel
@ 2019-05-28 17:46 ` Vedang Patel
  2019-06-03 14:15   ` Murali Karicheri
  2019-05-28 17:46 ` [PATCH net-next v1 6/7] taprio: make clock reference conversions easier Vedang Patel
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Vedang Patel @ 2019-05-28 17:46 UTC (permalink / raw)
  To: netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, Vedang Patel

Currently, we are seeing non-critical packets being transmitted outside
of their timeslice. We can confirm that the packets are being dequeued
at the right time. So, the delay is induced on the hardware side. The
most likely reason is that the hardware queues are starving the lower
priority queues.

In order to improve the performance of taprio, we will be making use of the
txtime feature provided by the ETF qdisc. In this mode, taprio will set the
transmit timestamp (in skb->tstamp) for all the packets which do not have the
SO_TXTIME option set. The taprio qdisc will ensure that the transmit time for
the packet is set to a time when the gate is open. If SO_TXTIME is set, the
taprio qdisc will validate whether the timestamp (in skb->tstamp) occurs when
the gate corresponding to the skb's traffic class is open.

Following is the example configuration for enabling txtime offload:

tc qdisc replace dev eth0 parent root handle 100 taprio \
      num_tc 3 \
      map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
      queues 1@0 1@0 1@0 \
      base-time 1558653424279842568 \
      sched-entry S 01 300000 \
      sched-entry S 02 300000 \
      sched-entry S 04 400000 \
      offload 2 \
      txtime-delay 40000 \
      clockid CLOCK_TAI

tc qdisc replace dev eth0 parent 100:1 etf skip_sock_check \
      offload delta 200000 clockid CLOCK_TAI

Here, the "offload" parameter indicates that the TXTIME_OFFLOAD mode is
enabled. Also, all the traffic classes are mapped to the same queue. This is
only possible in taprio when txtime offload is enabled. Note also that the ETF
qdisc is enabled with its offload mode set.

In this mode, if the gate for the packet's traffic class is open and the
complete packet can be transmitted within the interval, taprio will try to
transmit the packet immediately. This is done by setting skb->tstamp to
current_time + the time delta indicated in the txtime_delay parameter. This
parameter indicates the time taken (in software) for the packet to reach the
network adapter.

If the packet cannot be transmitted in the current interval, or if the gate
for the packet's traffic class is not currently open, skb->tstamp is set to
the next available timestamp value. This is tracked in the next_txtime field
in the struct sched_entry.

The behaviour w.r.t admin and oper schedules is not changed from what is
present in software mode.

Since the transmit time is already known in advance, we do not need the HR
timers to advance the schedule and wake up the dequeue side of taprio. So,
the HR timer won't be run when this mode is enabled.
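
As a rough standalone model of the txtime selection described above (the
names and the flat interval representation are illustrative, not the kernel
code, and the kernel loops over cycles and a possible admin schedule switch
instead of deferring just once):

    #include <stdint.h>

    typedef int64_t ns_t;

    struct tc_interval {
        ns_t start;  /* next time the gate for this traffic class opens */
        ns_t end;    /* ... and closes again */
        ns_t cycle;  /* schedule cycle time */
    };

    /* Pick a launchtime for a packet that needs 'duration' ns on the wire.
     * 'next_txtime' is when the previous packet using this interval is done.
     */
    static ns_t pick_txtime(const struct tc_interval *intv, ns_t now,
                            ns_t txtime_delay, ns_t next_txtime, ns_t duration)
    {
        ns_t txtime = now + txtime_delay;  /* earliest feasible time */

        if (next_txtime > txtime)
            txtime = next_txtime;          /* queue behind earlier packets */
        if (intv->start > txtime)
            txtime = intv->start;          /* wait for the gate to open */

        /* If the packet cannot finish before the gate closes, defer it
         * to the same interval in the next cycle.
         */
        if (txtime + duration > intv->end)
            txtime = intv->start + intv->cycle;

        return txtime;
    }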

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 include/uapi/linux/pkt_sched.h |   1 +
 net/sched/sch_taprio.c         | 326 ++++++++++++++++++++++++++++++---
 2 files changed, 306 insertions(+), 21 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 3319255ffa25..afffda24e055 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1174,6 +1174,7 @@ enum {
 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
 	TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, /* u32 */
+	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
 	__TCA_TAPRIO_ATTR_MAX,
 };
 
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 8d87ba099130..1cd19eabc53b 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -21,6 +21,7 @@
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
 #include <net/sch_generic.h>
+#include <net/sock.h>
 
 static LIST_HEAD(taprio_list);
 static DEFINE_SPINLOCK(taprio_list_lock);
@@ -40,6 +41,7 @@ struct sched_entry {
 	 * packet leaves after this time.
 	 */
 	ktime_t close_time;
+	ktime_t next_txtime;
 	atomic_t budget;
 	int index;
 	u32 gate_mask;
@@ -76,6 +78,7 @@ struct taprio_sched {
 	struct sk_buff *(*peek)(struct Qdisc *sch);
 	struct hrtimer advance_timer;
 	struct list_head taprio_list;
+	int txtime_delay;
 };
 
 static ktime_t sched_base_time(const struct sched_gate_list *sched)
@@ -116,6 +119,235 @@ static void switch_schedules(struct taprio_sched *q,
 	*admin = NULL;
 }
 
+/* Get how much time has been already elapsed in the current cycle. */
+static inline s32 get_cycle_time_elapsed(struct sched_gate_list *sched, ktime_t time)
+{
+	ktime_t time_since_sched_start;
+	s32 time_elapsed;
+
+	time_since_sched_start = ktime_sub(time, sched->base_time);
+	div_s64_rem(time_since_sched_start, sched->cycle_time, &time_elapsed);
+
+	return time_elapsed;
+}
+
+static ktime_t get_interval_end_time(struct sched_gate_list *sched,
+				     struct sched_gate_list *admin,
+				     struct sched_entry *entry,
+				     ktime_t intv_start)
+{
+	s32 cycle_elapsed = get_cycle_time_elapsed(sched, intv_start);
+	ktime_t intv_end, cycle_ext_end, cycle_end;
+
+	cycle_end = ktime_add_ns(intv_start, sched->cycle_time - cycle_elapsed);
+	intv_end = ktime_add_ns(intv_start, entry->interval);
+	cycle_ext_end = ktime_add(cycle_end, sched->cycle_time_extension);
+
+	if (ktime_before(intv_end, cycle_end))
+		return intv_end;
+	else if (admin && admin != sched &&
+		 ktime_after(admin->base_time, cycle_end) &&
+		 ktime_before(admin->base_time, cycle_ext_end))
+		return admin->base_time;
+	else
+		return cycle_end;
+}
+
+static inline int length_to_duration(struct taprio_sched *q, int len)
+{
+	return (len * atomic64_read(&q->picos_per_byte)) / 1000;
+}
+
+/* Returns the entry corresponding to next available interval. If
+ * validate_interval is set, it only validates whether the timestamp occurs
+ * when the gate corresponding to the skb's traffic class is open.
+ */
+static struct sched_entry *find_entry_to_transmit(struct sk_buff *skb,
+						  struct Qdisc *sch,
+						  struct sched_gate_list *sched,
+						  struct sched_gate_list *admin,
+						  ktime_t time,
+						  ktime_t *interval_start,
+						  ktime_t *interval_end,
+						  bool validate_interval)
+{
+	ktime_t curr_intv_start, curr_intv_end, cycle_end, packet_transmit_time;
+	ktime_t earliest_txtime = KTIME_MAX, txtime, cycle, transmit_end_time;
+	struct sched_entry *entry = NULL, *entry_found = NULL;
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	int tc, entry_available = 0, n;
+	s32 cycle_elapsed;
+
+	tc = netdev_get_prio_tc_map(dev, skb->priority);
+	packet_transmit_time = length_to_duration(q, qdisc_pkt_len(skb));
+
+	*interval_start = 0;
+	*interval_end = 0;
+
+	if (!sched)
+		return NULL;
+
+	cycle = sched->cycle_time;
+	cycle_elapsed = get_cycle_time_elapsed(sched, time);
+	curr_intv_end = ktime_sub_ns(time, cycle_elapsed);
+	cycle_end = ktime_add_ns(curr_intv_end, cycle);
+
+	list_for_each_entry(entry, &sched->entries, list) {
+		curr_intv_start = curr_intv_end;
+		curr_intv_end = get_interval_end_time(sched, admin, entry,
+						      curr_intv_start);
+
+		if (ktime_after(curr_intv_start, cycle_end))
+			break;
+
+		if (!(entry->gate_mask & BIT(tc)) ||
+		    packet_transmit_time > entry->interval)
+			continue;
+
+		txtime = entry->next_txtime;
+
+		if (ktime_before(txtime, time) || validate_interval) {
+			transmit_end_time = ktime_add_ns(time, packet_transmit_time);
+			if ((ktime_before(curr_intv_start, time) &&
+			     ktime_before(transmit_end_time, curr_intv_end)) ||
+			    (ktime_after(curr_intv_start, time) && !validate_interval)) {
+				entry_found = entry;
+				*interval_start = curr_intv_start;
+				*interval_end = curr_intv_end;
+				break;
+			} else if (!entry_available && !validate_interval) {
+				/* Here, we are just trying to find out the
+				 * first available interval in the next cycle.
+				 */
+				entry_available = 1;
+				entry_found = entry;
+				*interval_start = ktime_add_ns(curr_intv_start, cycle);
+				*interval_end = ktime_add_ns(curr_intv_end, cycle);
+			}
+		} else if (ktime_before(txtime, earliest_txtime) &&
+			   !entry_available) {
+			earliest_txtime = txtime;
+			entry_found = entry;
+			n = div_s64(ktime_sub(txtime, curr_intv_start), cycle);
+			*interval_start = ktime_add(curr_intv_start, n * cycle);
+			*interval_end = ktime_add(curr_intv_end, n * cycle);
+		}
+	}
+
+	return entry_found;
+}
+
+static bool is_valid_interval(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct sched_gate_list *sched, *admin;
+	ktime_t interval_start, interval_end;
+	struct sched_entry *entry;
+
+	rcu_read_lock();
+	sched = rcu_dereference(q->oper_sched);
+	admin = rcu_dereference(q->admin_sched);
+
+	entry = find_entry_to_transmit(skb, sch, sched, admin, skb->tstamp,
+				       &interval_start, &interval_end, true);
+	rcu_read_unlock();
+
+	return entry;
+}
+
+static inline ktime_t get_cycle_start(struct sched_gate_list *sched,
+				      ktime_t time)
+{
+	ktime_t cycle_elapsed;
+
+	cycle_elapsed = get_cycle_time_elapsed(sched, time);
+
+	return ktime_sub(time, cycle_elapsed);
+}
+
+/* There are a few scenarios where we will have to modify the txtime from
+ * what is read from next_txtime in sched_entry. They are:
+ * 1. If txtime is in the past,
+ *    a. The gate for the traffic class is currently open and packet can be
+ *       transmitted before it closes, schedule the packet right away.
+ *    b. If the gate corresponding to the traffic class is going to open later
+ *       in the cycle, set the txtime of packet to the interval start.
+ * 2. If txtime is in the future, there are packets corresponding to the
+ *    current traffic class waiting to be transmitted. So, the following
+ *    possibilities exist:
+ *    a. We can transmit the packet before the window containing the txtime
+ *       closes.
+ *    b. The window might close before the transmission can be completed
+ *       successfully. So, schedule the packet in the next open window.
+ */
+static long get_packet_txtime(struct sk_buff *skb, struct Qdisc *sch)
+{
+	ktime_t transmit_end_time, interval_end, interval_start;
+	int len, packet_transmit_time, sched_changed;
+	struct taprio_sched *q = qdisc_priv(sch);
+	ktime_t minimum_time, now, txtime;
+	struct sched_gate_list *sched, *admin;
+	struct sched_entry *entry;
+
+	now = q->get_time();
+	minimum_time = ktime_add_ns(now, q->txtime_delay);
+
+	rcu_read_lock();
+	admin = rcu_dereference(q->admin_sched);
+	sched = rcu_dereference(q->oper_sched);
+	if (admin && ktime_after(minimum_time, admin->base_time))
+		switch_schedules(q, &admin, &sched);
+
+	/* Until the schedule starts, all the queues are open */
+	if (!sched || ktime_before(minimum_time, sched->base_time)) {
+		txtime = minimum_time;
+		goto done;
+	}
+
+	len = qdisc_pkt_len(skb);
+	packet_transmit_time = length_to_duration(q, len);
+
+	do {
+		sched_changed = 0;
+
+		entry = find_entry_to_transmit(skb, sch, sched, admin,
+					       minimum_time,
+					       &interval_start, &interval_end,
+					       false);
+		if (!entry) {
+			txtime = 0;
+			goto done;
+		}
+
+		txtime = entry->next_txtime;
+		txtime = max_t(ktime_t, txtime, minimum_time);
+		txtime = max_t(ktime_t, txtime, interval_start);
+
+		if (admin && admin != sched &&
+		    ktime_after(txtime, admin->base_time)) {
+			sched = admin;
+			sched_changed = 1;
+			continue;
+		}
+
+		transmit_end_time = ktime_add(txtime, packet_transmit_time);
+		minimum_time = transmit_end_time;
+
+		/* Update the txtime of current entry to the next time it's
+		 * interval starts.
+		 */
+		if (ktime_after(transmit_end_time, interval_end))
+			entry->next_txtime = ktime_add(interval_start, sched->cycle_time);
+	} while (sched_changed || ktime_after(transmit_end_time, interval_end));
+
+	entry->next_txtime = transmit_end_time;
+
+done:
+	rcu_read_unlock();
+	return txtime;
+}
+
 static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			  struct sk_buff **to_free)
 {
@@ -129,6 +361,15 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	if (unlikely(!child))
 		return qdisc_drop(skb, sch, to_free);
 
+	if (skb->sk && sock_flag(skb->sk, SOCK_TXTIME)) {
+		if (!is_valid_interval(skb, sch))
+			return qdisc_drop(skb, sch, to_free);
+	} else if (TXTIME_OFFLOAD_IS_ON(q->offload_flags)) {
+		skb->tstamp = get_packet_txtime(skb, sch);
+		if (!skb->tstamp)
+			return qdisc_drop(skb, sch, to_free);
+	}
+
 	qdisc_qstats_backlog_inc(sch, skb);
 	sch->q.qlen++;
 
@@ -206,11 +447,6 @@ static struct sk_buff *taprio_peek(struct Qdisc *sch)
 	return q->peek(sch);
 }
 
-static inline int length_to_duration(struct taprio_sched *q, int len)
-{
-	return div_u64(len * atomic64_read(&q->picos_per_byte), 1000);
-}
-
 static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
 {
 	atomic_set(&entry->budget,
@@ -594,7 +830,8 @@ static int parse_taprio_schedule(struct nlattr **tb,
 
 static int taprio_parse_mqprio_opt(struct net_device *dev,
 				   struct tc_mqprio_qopt *qopt,
-				   struct netlink_ext_ack *extack)
+				   struct netlink_ext_ack *extack,
+				   u32 offload_flags)
 {
 	int i, j;
 
@@ -642,6 +879,9 @@ static int taprio_parse_mqprio_opt(struct net_device *dev,
 			return -EINVAL;
 		}
 
+		if (TXTIME_OFFLOAD_IS_ON(offload_flags))
+			continue;
+
 		/* Verify that the offset and counts do not overlap */
 		for (j = i + 1; j < qopt->num_tc; j++) {
 			if (last > qopt->offset[j]) {
@@ -804,6 +1044,9 @@ static int taprio_enable_offload(struct net_device *dev,
 	const struct net_device_ops *ops = dev->netdev_ops;
 	int err = 0;
 
+	if (TXTIME_OFFLOAD_IS_ON(offload_flags))
+		goto done;
+
 	if (!FULL_OFFLOAD_IS_ON(offload_flags)) {
 		NL_SET_ERR_MSG(extack, "Offload mode is not supported");
 		return -EOPNOTSUPP;
@@ -816,15 +1059,28 @@ static int taprio_enable_offload(struct net_device *dev,
 
 	/* FIXME: enable offloading */
 
-	q->dequeue = taprio_dequeue_offload;
-	q->peek = taprio_peek_offload;
-
-	if (err == 0)
+done:
+	if (err == 0) {
+		q->dequeue = taprio_dequeue_offload;
+		q->peek = taprio_peek_offload;
 		q->offload_flags = offload_flags;
+	}
 
 	return err;
 }
 
+static void setup_txtime(struct taprio_sched *q,
+			 struct sched_gate_list *sched, ktime_t base)
+{
+	struct sched_entry *entry;
+	u32 interval = 0;
+
+	list_for_each_entry(entry, &sched->entries, list) {
+		entry->next_txtime = ktime_add_ns(base, interval);
+		interval += entry->interval;
+	}
+}
+
 static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 			 struct netlink_ext_ack *extack)
 {
@@ -846,7 +1102,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 	if (tb[TCA_TAPRIO_ATTR_PRIOMAP])
 		mqprio = nla_data(tb[TCA_TAPRIO_ATTR_PRIOMAP]);
 
-	err = taprio_parse_mqprio_opt(dev, mqprio, extack);
+	if (tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS])
+		offload_flags = nla_get_u32(tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]);
+
+	err = taprio_parse_mqprio_opt(dev, mqprio, extack, offload_flags);
 	if (err < 0)
 		return err;
 
@@ -887,6 +1146,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		goto free_sched;
 	}
 
+	/* preserve offload flags when changing the schedule. */
+	if (q->offload_flags)
+		offload_flags = q->offload_flags;
+
 	if (tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]) {
 		clockid = nla_get_s32(tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]);
 
@@ -914,7 +1177,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 	/* Protects against enqueue()/dequeue() */
 	spin_lock_bh(qdisc_lock(sch));
 
-	if (!hrtimer_active(&q->advance_timer)) {
+	if (tb[TCA_TAPRIO_ATTR_TXTIME_DELAY])
+		q->txtime_delay = nla_get_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
+
+	if (!TXTIME_OFFLOAD_IS_ON(offload_flags) && !hrtimer_active(&q->advance_timer)) {
 		hrtimer_init(&q->advance_timer, q->clockid, HRTIMER_MODE_ABS);
 		q->advance_timer.function = advance_sched;
 	}
@@ -966,20 +1232,35 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		goto unlock;
 	}
 
-	setup_first_close_time(q, new_admin, start);
+	if (TXTIME_OFFLOAD_IS_ON(offload_flags)) {
+		setup_txtime(q, new_admin, start);
+
+		if (!oper) {
+			rcu_assign_pointer(q->oper_sched, new_admin);
+			err = 0;
+			new_admin = NULL;
+			goto unlock;
+		}
+
+		rcu_assign_pointer(q->admin_sched, new_admin);
+		if (admin)
+			call_rcu(&admin->rcu, taprio_free_sched_cb);
+	} else {
+		setup_first_close_time(q, new_admin, start);
 
-	/* Protects against advance_sched() */
-	spin_lock_irqsave(&q->current_entry_lock, flags);
+		/* Protects against advance_sched() */
+		spin_lock_irqsave(&q->current_entry_lock, flags);
 
-	taprio_start_sched(sch, start, new_admin);
+		taprio_start_sched(sch, start, new_admin);
 
-	rcu_assign_pointer(q->admin_sched, new_admin);
-	if (admin)
-		call_rcu(&admin->rcu, taprio_free_sched_cb);
-	new_admin = NULL;
+		rcu_assign_pointer(q->admin_sched, new_admin);
+		if (admin)
+			call_rcu(&admin->rcu, taprio_free_sched_cb);
 
-	spin_unlock_irqrestore(&q->current_entry_lock, flags);
+		spin_unlock_irqrestore(&q->current_entry_lock, flags);
+	}
 
+	new_admin = NULL;
 	err = 0;
 
 unlock:
@@ -1225,6 +1506,9 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put_u32(skb, TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, q->offload_flags))
 		goto options_error;
 
+	if (nla_put_s32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
+		goto options_error;
+
 	if (oper && dump_schedule(skb, oper))
 		goto options_error;
 
-- 
2.17.0



* [PATCH net-next v1 6/7] taprio: make clock reference conversions easier
  2019-05-28 17:46 [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Vedang Patel
                   ` (4 preceding siblings ...)
  2019-05-28 17:46 ` [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode Vedang Patel
@ 2019-05-28 17:46 ` Vedang Patel
  2019-05-28 17:46 ` [PATCH net-next v1 7/7] taprio: Adjust timestamps for TCP packets Vedang Patel
  2019-06-03 13:50 ` [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Murali Karicheri
  7 siblings, 0 replies; 25+ messages in thread
From: Vedang Patel @ 2019-05-28 17:46 UTC (permalink / raw)
  To: netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, Vedang Patel

Later in this series we will need to transform from
CLOCK_MONOTONIC (used in TCP) to the clock reference used in TAPRIO.
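
In-kernel this becomes a ktime_mono_to_any() call keyed by the stored
tk_offset. A userspace sketch of the same idea, approximating the offset by
sampling both clocks (the kernel conversion uses the timekeeper offsets
directly and is exact):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static int64_t ts_to_ns(const struct timespec *ts)
    {
        return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
    }

    /* Express a CLOCK_MONOTONIC timestamp in CLOCK_TAI terms. */
    static int64_t mono_ns_to_tai_ns(int64_t mono_ns)
    {
        struct timespec mono, tai;

        clock_gettime(CLOCK_MONOTONIC, &mono);
        clock_gettime(CLOCK_TAI, &tai);

        return mono_ns + (ts_to_ns(&tai) - ts_to_ns(&mono));
    }

    int main(void)
    {
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        printf("mono %lld ns -> tai %lld ns\n",
               (long long)ts_to_ns(&now),
               (long long)mono_ns_to_tai_ns(ts_to_ns(&now)));
        return 0;
    }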

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 net/sched/sch_taprio.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 1cd19eabc53b..b892fa32ea2b 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -63,6 +63,7 @@ struct taprio_sched {
 	struct Qdisc **qdiscs;
 	struct Qdisc *root;
 	u32 offload_flags;
+	enum tk_offsets tk_offset;
 	int clockid;
 	atomic64_t picos_per_byte; /* Using picoseconds because for 10Gbps+
 				    * speeds it's sub-nanoseconds per byte
@@ -73,7 +74,6 @@ struct taprio_sched {
 	struct sched_entry __rcu *current_entry;
 	struct sched_gate_list __rcu *oper_sched;
 	struct sched_gate_list __rcu *admin_sched;
-	ktime_t (*get_time)(void);
 	struct sk_buff *(*dequeue)(struct Qdisc *sch);
 	struct sk_buff *(*peek)(struct Qdisc *sch);
 	struct hrtimer advance_timer;
@@ -89,6 +89,20 @@ static ktime_t sched_base_time(const struct sched_gate_list *sched)
 	return ns_to_ktime(sched->base_time);
 }
 
+static inline ktime_t taprio_get_time(struct taprio_sched *q)
+{
+	ktime_t mono = ktime_get();
+
+	switch (q->tk_offset) {
+	case TK_OFFS_MAX:
+		return mono;
+	default:
+		return ktime_mono_to_any(mono, q->tk_offset);
+	}
+
+	return KTIME_MAX;
+}
+
 static void taprio_free_sched_cb(struct rcu_head *head)
 {
 	struct sched_gate_list *sched = container_of(head, struct sched_gate_list, rcu);
@@ -290,7 +304,7 @@ static long get_packet_txtime(struct sk_buff *skb, struct Qdisc *sch)
 	struct sched_gate_list *sched, *admin;
 	struct sched_entry *entry;
 
-	now = q->get_time();
+	now = taprio_get_time(q);
 	minimum_time = ktime_add_ns(now, q->txtime_delay);
 
 	rcu_read_lock();
@@ -501,7 +515,7 @@ static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch)
 			continue;
 
 		len = qdisc_pkt_len(skb);
-		guard = ktime_add_ns(q->get_time(),
+		guard = ktime_add_ns(taprio_get_time(q),
 				     length_to_duration(q, len));
 
 		/* In the case that there's no gate entry, there's no
@@ -903,7 +917,7 @@ static int taprio_get_start_time(struct Qdisc *sch,
 	s64 n;
 
 	base = sched_base_time(sched);
-	now = q->get_time();
+	now = taprio_get_time(q);
 
 	if (ktime_after(base, now)) {
 		*start = base;
@@ -1200,16 +1214,16 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 
 	switch (q->clockid) {
 	case CLOCK_REALTIME:
-		q->get_time = ktime_get_real;
+		q->tk_offset = TK_OFFS_REAL;
 		break;
 	case CLOCK_MONOTONIC:
-		q->get_time = ktime_get;
+		q->tk_offset = TK_OFFS_MAX;
 		break;
 	case CLOCK_BOOTTIME:
-		q->get_time = ktime_get_boottime;
+		q->tk_offset = TK_OFFS_BOOT;
 		break;
 	case CLOCK_TAI:
-		q->get_time = ktime_get_clocktai;
+		q->tk_offset = TK_OFFS_TAI;
 		break;
 	default:
 		NL_SET_ERR_MSG(extack, "Invalid 'clockid'");
-- 
2.17.0



* [PATCH net-next v1 7/7] taprio: Adjust timestamps for TCP packets.
  2019-05-28 17:46 [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Vedang Patel
                   ` (5 preceding siblings ...)
  2019-05-28 17:46 ` [PATCH net-next v1 6/7] taprio: make clock reference conversions easier Vedang Patel
@ 2019-05-28 17:46 ` Vedang Patel
  2019-06-03 13:50 ` [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Murali Karicheri
  7 siblings, 0 replies; 25+ messages in thread
From: Vedang Patel @ 2019-05-28 17:46 UTC (permalink / raw)
  To: netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, Vedang Patel

When the taprio qdisc is running in "txtime offload" mode, it will
set the launchtime value (in skb->tstamp) for all the packets which do
not have the SO_TXTIME socket option set. But, the TCP packets already
have this value set, and it indicates the earliest departure time
expressed in the CLOCK_MONOTONIC clock.

We need to respect the timestamp set by the TCP subsystem. So, convert
this time to the clock which taprio is using and ensure that the packet
is not transmitted before the deadline set by TCP.
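
Schematically, the converted TCP departure time simply becomes another lower
bound on the launchtime chosen by taprio (illustrative helper, not the kernel
code):

    /* The launchtime may not be earlier than either "now + txtime_delay"
     * or TCP's earliest departure time converted to taprio's clock.
     */
    static long long min_launchtime(long long now_ns, long long txtime_delay_ns,
                                    long long tcp_edt_ns)
    {
        long long minimum = now_ns + txtime_delay_ns;

        return minimum > tcp_edt_ns ? minimum : tcp_edt_ns;
    }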

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 net/sched/sch_taprio.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index b892fa32ea2b..cadb2f5d16f0 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -22,6 +22,7 @@
 #include <net/pkt_cls.h>
 #include <net/sch_generic.h>
 #include <net/sock.h>
+#include <net/tcp.h>
 
 static LIST_HEAD(taprio_list);
 static DEFINE_SPINLOCK(taprio_list_lock);
@@ -280,6 +281,41 @@ static inline ktime_t get_cycle_start(struct sched_gate_list *sched,
 	return ktime_sub(time, cycle_elapsed);
 }
 
+/* This returns the tstamp value set by TCP in terms of the set clock. */
+static ktime_t get_tcp_tstamp(struct taprio_sched *q, struct sk_buff *skb)
+{
+	unsigned int offset = skb_network_offset(skb);
+	const struct ipv6hdr *ipv6h;
+	const struct iphdr *iph;
+	struct ipv6hdr _ipv6h;
+
+	ipv6h = skb_header_pointer(skb, offset, sizeof(_ipv6h), &_ipv6h);
+	if (!ipv6h)
+		return 0;
+
+	if (ipv6h->version == 4) {
+		iph = (struct iphdr *)ipv6h;
+		offset += iph->ihl * 4;
+
+		/* special-case 6in4 tunnelling, as that is a common way to get
+		 * v6 connectivity in the home
+		 */
+		if (iph->protocol == IPPROTO_IPV6) {
+			ipv6h = skb_header_pointer(skb, offset,
+						   sizeof(_ipv6h), &_ipv6h);
+
+			if (!ipv6h || ipv6h->nexthdr != IPPROTO_TCP)
+				return 0;
+		} else if (iph->protocol != IPPROTO_TCP) {
+			return 0;
+		}
+	} else if (ipv6h->version == 6 && ipv6h->nexthdr != IPPROTO_TCP) {
+		return 0;
+	}
+
+	return ktime_mono_to_any(skb->skb_mstamp_ns, q->tk_offset);
+}
+
 /* There are a few scenarios where we will have to modify the txtime from
  * what is read from next_txtime in sched_entry. They are:
  * 1. If txtime is in the past,
@@ -297,7 +333,7 @@ static inline ktime_t get_cycle_start(struct sched_gate_list *sched,
  */
 static long get_packet_txtime(struct sk_buff *skb, struct Qdisc *sch)
 {
-	ktime_t transmit_end_time, interval_end, interval_start;
+	ktime_t transmit_end_time, interval_end, interval_start, tcp_tstamp;
 	int len, packet_transmit_time, sched_changed;
 	struct taprio_sched *q = qdisc_priv(sch);
 	ktime_t minimum_time, now, txtime;
@@ -307,6 +343,9 @@ static long get_packet_txtime(struct sk_buff *skb, struct Qdisc *sch)
 	now = taprio_get_time(q);
 	minimum_time = ktime_add_ns(now, q->txtime_delay);
 
+	tcp_tstamp = get_tcp_tstamp(q, skb);
+	minimum_time = max_t(ktime_t, minimum_time, tcp_tstamp);
+
 	rcu_read_lock();
 	admin = rcu_dereference(q->admin_sched);
 	sched = rcu_dereference(q->oper_sched);
-- 
2.17.0



* Re: [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading
  2019-05-28 17:46 ` [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading Vedang Patel
@ 2019-05-28 22:45   ` Jakub Kicinski
  2019-05-29 17:06     ` Patel, Vedang
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2019-05-28 22:45 UTC (permalink / raw)
  To: Vedang Patel
  Cc: netdev, jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l

On Tue, 28 May 2019 10:46:44 -0700, Vedang Patel wrote:
> From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> 
> This adds the UAPI and the core bits necessary for userspace to
> request hardware offloading to be enabled.
> 
> The future commits will enable hybrid or full offloading for taprio. This
> commit sets up the infrastructure to enable it via the netlink interface.
> 
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> Signed-off-by: Vedang Patel <vedang.patel@intel.com>

Other qdiscs offload by default, this offload-level selection here is a
little bit inconsistent with that :(

> @@ -731,6 +857,9 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>  	if (err < 0)
>  		return err;
>  
> +	if (tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS])
> +		offload_flags = nla_get_u32(tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]);

You should make sure user doesn't set unknown bits.  Otherwise using
other bits will not be possible in the future.
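
A minimal sketch of such a check, covering only the two flag bits this
series defines, could look like:

	if (tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]) {
		offload_flags = nla_get_u32(tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]);
		if (offload_flags & ~(TCA_TAPRIO_ATTR_OFFLOAD_FLAG_FULL_OFFLOAD |
				      TCA_TAPRIO_ATTR_OFFLOAD_FLAG_TXTIME_OFFLOAD)) {
			NL_SET_ERR_MSG(extack, "Unknown offload flags specified");
			return -EINVAL;
		}
	}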

>  	new_admin = kzalloc(sizeof(*new_admin), GFP_KERNEL);
>  	if (!new_admin) {
>  		NL_SET_ERR_MSG(extack, "Not enough memory for a new schedule");


* Re: [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading
  2019-05-28 22:45   ` Jakub Kicinski
@ 2019-05-29 17:06     ` Patel, Vedang
  2019-05-29 19:14       ` Jakub Kicinski
  0 siblings, 1 reply; 25+ messages in thread
From: Patel, Vedang @ 2019-05-29 17:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	Cong Wang, jiri, intel-wired-lan, Gomes, Vinicius, l

Thanks Jakub for the inputs.

> On May 28, 2019, at 3:45 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> 
> On Tue, 28 May 2019 10:46:44 -0700, Vedang Patel wrote:
>> From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>> 
>> This adds the UAPI and the core bits necessary for userspace to
>> request hardware offloading to be enabled.
>> 
>> The future commits will enable hybrid or full offloading for taprio. This
>> commit sets up the infrastructure to enable it via the netlink interface.
>> 
>> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>> Signed-off-by: Vedang Patel <vedang.patel@intel.com>
> 
> Other qdiscs offload by default, this offload-level selection here is a
> little bit inconsistent with that :(
> 
There are 2 different offload modes introduced in this patch.

1. Txtime offload mode: This mode depends on skip_sock_check flag being set in the etf qdisc. Also, it requires some manual configuration which might be specific to the network adapter card. For example, for the i210 card, the user will have to route all the traffic to the highest priority queue and install etf qdisc with offload enabled on that queue. So, I don’t think this mode should be enabled by default.
2. Full offload mode: This mode is currently not supported by any network driver. The support for this will be coming soon. But, we can enable this mode by default. 

Also, from what Vinicius tells me, offload modes for cbs, etf and mqprio are also disabled by default. So, it will make more sense to keep it disabled to be consistent with other qdiscs similar to this one.
>> @@ -731,6 +857,9 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>> 	if (err < 0)
>> 		return err;
>> 
>> +	if (tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS])
>> +		offload_flags = nla_get_u32(tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]);
> 
> You should make sure user doesn't set unknown bits.  Otherwise using
> other bits will not be possible in the future.
> 
Yes, I agree here, will include this in the next patchset.

Thanks,
Vedang
>> 	new_admin = kzalloc(sizeof(*new_admin), GFP_KERNEL);
>> 	if (!new_admin) {
>> 		NL_SET_ERR_MSG(extack, "Not enough memory for a new schedule");



* Re: [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading
  2019-05-29 17:06     ` Patel, Vedang
@ 2019-05-29 19:14       ` Jakub Kicinski
  2019-05-29 20:05         ` Patel, Vedang
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2019-05-29 19:14 UTC (permalink / raw)
  To: Patel, Vedang
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	Cong Wang, jiri, intel-wired-lan, Gomes, Vinicius, l

On Wed, 29 May 2019 17:06:49 +0000, Patel, Vedang wrote:
> > On May 28, 2019, at 3:45 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> > On Tue, 28 May 2019 10:46:44 -0700, Vedang Patel wrote:  
> >> From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> >> 
> >> This adds the UAPI and the core bits necessary for userspace to
> >> request hardware offloading to be enabled.
> >> 
> >> The future commits will enable hybrid or full offloading for taprio. This
> >> commit sets up the infrastructure to enable it via the netlink interface.
> >> 
> >> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> >> Signed-off-by: Vedang Patel <vedang.patel@intel.com>  
> > 
> > Other qdiscs offload by default, this offload-level selection here is a
> > little bit inconsistent with that :(
> >   
> There are 2 different offload modes introduced in this patch.
> 
> 1. Txtime offload mode: This mode depends on skip_sock_check flag being set in the etf qdisc. Also, it requires some manual configuration which might be specific to the network adapter card. For example, for the i210 card, the user will have to route all the traffic to the highest priority queue and install etf qdisc with offload enabled on that queue. So, I don’t think this mode should be enabled by default.

Excellent, it looks like there will be driver patches necessary for
this offload to function, also it seems your offload enable function
still contains this after the series:

	/* FIXME: enable offloading */

Please only post offload infrastructure when fully fleshed out and with
driver patches making use of it.

> 2. Full offload mode: This mode is currently not supported by any network driver. The support for this will be coming soon. But, we can enable this mode by default. 
> 
> Also, from what Vinicius tells me, offload modes for cbs, etf and mqprio are also disabled by default. So, it will make more sense to keep it disabled to be consistent with other qdiscs similar to this one.


* Re: [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading
  2019-05-29 19:14       ` Jakub Kicinski
@ 2019-05-29 20:05         ` Patel, Vedang
  2019-05-29 21:06           ` Jakub Kicinski
  0 siblings, 1 reply; 25+ messages in thread
From: Patel, Vedang @ 2019-05-29 20:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	Cong Wang, jiri, intel-wired-lan, Gomes, Vinicius, l

[Sending the email again since the last one was rejected by netdev because it was html.]

> On May 29, 2019, at 12:14 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> 
> On Wed, 29 May 2019 17:06:49 +0000, Patel, Vedang wrote:
>>> On May 28, 2019, at 3:45 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
>>> On Tue, 28 May 2019 10:46:44 -0700, Vedang Patel wrote:  
>>>> From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>>>> 
>>>> This adds the UAPI and the core bits necessary for userspace to
>>>> request hardware offloading to be enabled.
>>>> 
>>>> The future commits will enable hybrid or full offloading for taprio. This
>>>> commit sets up the infrastructure to enable it via the netlink interface.
>>>> 
>>>> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>>>> Signed-off-by: Vedang Patel <vedang.patel@intel.com>  
>>> 
>>> Other qdiscs offload by default, this offload-level selection here is a
>>> little bit inconsistent with that :(
>>> 
>> There are 2 different offload modes introduced in this patch.
>> 
>> 1. Txtime offload mode: This mode depends on skip_sock_check flag being set in the etf qdisc. Also, it requires some manual configuration which might be specific to the network adapter card. For example, for the i210 card, the user will have to route all the traffic to the highest priority queue and install etf qdisc with offload enabled on that queue. So, I don’t think this mode should be enabled by default.
> 
> Excellent, it looks like there will be driver patches necessary for
> this offload to function, also it seems your offload enable function
> still contains this after the series:
> 
> 	/* FIXME: enable offloading */
> 
> Please only post offload infrastructure when fully fleshed out and with
> driver patches making use of it.
> 
The above comment refers to the full offload mode described below. It is not needed for txtime offload mode. Txtime offload mode essentially sets the transmit time for each packet, based on the interval in which it is going to be transmitted, instead of relying on the hrtimers to open gates and send packets. More details about why exactly this is done are given in patch #5[1] of this series.
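
For completeness, here is a rough userspace sketch of the SO_TXTIME
opt-in that taprio then validates. This is not part of the series; the
values are illustrative and error handling is minimal:

	#include <linux/net_tstamp.h>
	#include <netinet/in.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <time.h>

	/* Enable per-packet txtime on the socket. The clockid must match
	 * the one configured on the taprio/etf qdiscs (CLOCK_TAI here).
	 */
	static int enable_txtime(int fd)
	{
		struct sock_txtime cfg = {
			.clockid = CLOCK_TAI,
			.flags = 0,
		};

		return setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));
	}

	/* Attach the desired transmit time (nanoseconds in CLOCK_TAI) to
	 * a single packet via the SCM_TXTIME control message.
	 */
	static ssize_t send_at(int fd, const struct sockaddr_in *dst,
			       const void *buf, size_t len, uint64_t txtime_ns)
	{
		char control[CMSG_SPACE(sizeof(uint64_t))] = {0};
		struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
		struct msghdr msg = {
			.msg_name = (void *)dst,
			.msg_namelen = sizeof(*dst),
			.msg_iov = &iov,
			.msg_iovlen = 1,
			.msg_control = control,
			.msg_controllen = sizeof(control),
		};
		struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_TXTIME;
		cmsg->cmsg_len = CMSG_LEN(sizeof(uint64_t));
		memcpy(CMSG_DATA(cmsg), &txtime_ns, sizeof(txtime_ns));

		return sendmsg(fd, &msg, 0);
	}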

What we can do is just add the txtime offload related flag and add the full offload flag whenever the driver bits are ready. Does that address your concern?

Thanks,
Vedang
>> 2. Full offload mode: This mode is currently not supported by any network driver. The support for this will be coming soon. But, we can enable this mode by default. 
>> 
>> Also, from what Vinicius tells me, offload modes for cbs, etf and mqprio are also disabled by default. So, it will make more sense to keep it disabled to be consistent with other qdiscs similar to this one.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading
  2019-05-29 20:05         ` Patel, Vedang
@ 2019-05-29 21:06           ` Jakub Kicinski
  2019-05-30  0:21             ` Patel, Vedang
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2019-05-29 21:06 UTC (permalink / raw)
  To: Patel, Vedang
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	Cong Wang, jiri, intel-wired-lan, Gomes, Vinicius, l

On Wed, 29 May 2019 20:05:16 +0000, Patel, Vedang wrote:
> [Sending the email again since the last one was rejected by netdev because it was html.]
> 
> > On May 29, 2019, at 12:14 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> > 
> > On Wed, 29 May 2019 17:06:49 +0000, Patel, Vedang wrote:  
> >>> On May 28, 2019, at 3:45 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> >>> On Tue, 28 May 2019 10:46:44 -0700, Vedang Patel wrote:    
> >>>> From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> >>>> 
> >>>> This adds the UAPI and the core bits necessary for userspace to
> >>>> request hardware offloading to be enabled.
> >>>> 
> >>>> The future commits will enable hybrid or full offloading for taprio. This
> >>>> commit sets up the infrastructure to enable it via the netlink interface.
> >>>> 
> >>>> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> >>>> Signed-off-by: Vedang Patel <vedang.patel@intel.com>    
> >>> 
> >>> Other qdiscs offload by default, this offload-level selection here is a
> >>> little bit inconsistent with that :(
> >>>   
> >> There are 2 different offload modes introduced in this patch.
> >> 
> >> 1. Txtime offload mode: This mode depends on skip_sock_check flag being set in the etf qdisc. Also, it requires some manual configuration which might be specific to the network adapter card. For example, for the i210 card, the user will have to route all the traffic to the highest priority queue and install etf qdisc with offload enabled on that queue. So, I don’t think this mode should be enabled by default.  
> > 
> > Excellent, it looks like there will be driver patches necessary for
> > this offload to function, also it seems your offload enable function
> > still contains this after the series:
> > 
> > 	/* FIXME: enable offloading */
> > 
> > Please only post offload infrastructure when fully fleshed out and with
> > driver patches making use of it.
>
> The above comment refers to the full offload mode described below. It
> is not needed for txtime offload mode. Txtime offload mode essentially
> sets the transmit time for each packet, based on the interval in which
> it is going to be transmitted, instead of relying on the hrtimers to
> open gates and send packets. More details about why exactly this is
> done are given in patch #5[1] of this series.

From looking at this set it looks like I can add that qdisc to any
netdev now *with* offload, and as long as the driver has _any_
ndo_setup_tc implementation taprio will not return an error. 

Perhaps this mode of operation should not be called offload?  Can't 
the ETF qdisc underneath run in software mode?

> What we can do is just add the txtime offload related flag and add
> the full offload flag whenever the driver bits are ready. Does that
> address your concern?

That would be a step in the right direction.  And please remove all the
other unused code from the series.  AFAICT this is what the enable
offload function looks like after your set - what's the point of the
'err' variable?

+static int taprio_enable_offload(struct net_device *dev,
+				 struct tc_mqprio_qopt *mqprio,
+				 struct taprio_sched *q,
+				 struct sched_gate_list *sched,
+				 struct netlink_ext_ack *extack,
+				 u32 offload_flags)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	int err = 0;
+
+	if (TXTIME_OFFLOAD_IS_ON(offload_flags))
+		goto done;
+
+	if (!FULL_OFFLOAD_IS_ON(offload_flags)) {
+		NL_SET_ERR_MSG(extack, "Offload mode is not supported");
+		return -EOPNOTSUPP;
+	}
+
+	if (!ops->ndo_setup_tc) {
+		NL_SET_ERR_MSG(extack, "Specified device does not support taprio offload");
+		return -EOPNOTSUPP;
+	}
+
+	/* FIXME: enable offloading */
+
+done:
+	if (err == 0) {
+		q->dequeue = taprio_dequeue_offload;
+		q->peek = taprio_peek_offload;
+ 		q->offload_flags = offload_flags;
+	}
+
+	return err;
+}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading
  2019-05-29 21:06           ` Jakub Kicinski
@ 2019-05-30  0:21             ` Patel, Vedang
  2019-05-30 20:41               ` Jakub Kicinski
  0 siblings, 1 reply; 25+ messages in thread
From: Patel, Vedang @ 2019-05-30  0:21 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	Cong Wang, jiri, intel-wired-lan, Gomes, Vinicius, l



> On May 29, 2019, at 2:06 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> 
> On Wed, 29 May 2019 20:05:16 +0000, Patel, Vedang wrote:
>> [Sending the email again since the last one was rejected by netdev because it was html.]
>> 
>>> On May 29, 2019, at 12:14 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
>>> 
>>> On Wed, 29 May 2019 17:06:49 +0000, Patel, Vedang wrote:  
>>>>> On May 28, 2019, at 3:45 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
>>>>> On Tue, 28 May 2019 10:46:44 -0700, Vedang Patel wrote:    
>>>>>> From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>>>>>> 
>>>>>> This adds the UAPI and the core bits necessary for userspace to
>>>>>> request hardware offloading to be enabled.
>>>>>> 
>>>>>> The future commits will enable hybrid or full offloading for taprio. This
>>>>>> commit sets up the infrastructure to enable it via the netlink interface.
>>>>>> 
>>>>>> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>>>>>> Signed-off-by: Vedang Patel <vedang.patel@intel.com>    
>>>>> 
>>>>> Other qdiscs offload by default, this offload-level selection here is a
>>>>> little bit inconsistent with that :(
>>>>> 
>>>> There are 2 different offload modes introduced in this patch.
>>>> 
>>>> 1. Txtime offload mode: This mode depends on skip_sock_check flag being set in the etf qdisc. Also, it requires some manual configuration which might be specific to the network adapter card. For example, for the i210 card, the user will have to route all the traffic to the highest priority queue and install etf qdisc with offload enabled on that queue. So, I don’t think this mode should be enabled by default.  
>>> 
>>> Excellent, it looks like there will be driver patches necessary for
>>> this offload to function, also it seems your offload enable function
>>> still contains this after the series:
>>> 
>>> 	/* FIXME: enable offloading */
>>> 
>>> Please only post offload infrastructure when fully fleshed out and with
>>> driver patches making use of it.
>> 
>> The above comment refers to the full offload mode described below. It
>> is not needed for txtime offload mode. Txtime offload mode is
>> essentially setting the transmit time for each packet  depending on
>> what interval it is going to be transmitted instead of relying on the
>> hrtimers to open gates and send packets. More details about why
>> exactly this is done is mentioned in patch #5[1] of this series.
> 
> From looking at this set it looks like I can add that qdisc to any
> netdev now *with* offload, and as long as the driver has _any_
> ndo_setup_tc implementation taprio will not return an error. 
> 
> Perhaps this mode of operation should not be called offload?  Can't 
> the ETF qdisc underneath run in software mode?
> 
Yeah, it doesn’t make much sense to run ETF in software mode but it can be done. What about naming it txtime-assist instead?
>> What we can do is just add the txtime offload related flag and add
>> the full offload flag whenever the driver bits are ready. Does that
>> address your concern?
> 
> That would be a step in the right direction.  And please remove all the
> other unused code from the series.  AFAICT this is what the enable
> offload function looks like after your set - what's the point of the
> 'err' variable?
> 
Yes. This patch seems to have a few really silly mistakes. I apologize for that. I will clean it up and send another version of the series. There is no unused code anywhere else in the series.
> +static int taprio_enable_offload(struct net_device *dev,
> +				 struct tc_mqprio_qopt *mqprio,
> +				 struct taprio_sched *q,
> +				 struct sched_gate_list *sched,
> +				 struct netlink_ext_ack *extack,
> +				 u32 offload_flags)
> +{
> +	const struct net_device_ops *ops = dev->netdev_ops;
> +	int err = 0;
> +
> +	if (TXTIME_OFFLOAD_IS_ON(offload_flags))
> +		goto done;
> +
> +	if (!FULL_OFFLOAD_IS_ON(offload_flags)) {
> +		NL_SET_ERR_MSG(extack, "Offload mode is not supported");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (!ops->ndo_setup_tc) {
> +		NL_SET_ERR_MSG(extack, "Specified device does not support taprio offload");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/* FIXME: enable offloading */
> +
> +done:
> +	if (err == 0) {
> +		q->dequeue = taprio_dequeue_offload;
> +		q->peek = taprio_peek_offload;
> + 		q->offload_flags = offload_flags;
> +	}
> +
> +	return err;
> +}


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading
  2019-05-30  0:21             ` Patel, Vedang
@ 2019-05-30 20:41               ` Jakub Kicinski
  0 siblings, 0 replies; 25+ messages in thread
From: Jakub Kicinski @ 2019-05-30 20:41 UTC (permalink / raw)
  To: Patel, Vedang
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	Cong Wang, jiri, intel-wired-lan, Gomes, Vinicius, l

On Thu, 30 May 2019 00:21:39 +0000, Patel, Vedang wrote:
> > On May 29, 2019, at 2:06 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> > On Wed, 29 May 2019 20:05:16 +0000, Patel, Vedang wrote:  
> >> [Sending the email again since the last one was rejected by netdev because it was html.]
> >>   
> >>> On May 29, 2019, at 12:14 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> >>> 
> >>> On Wed, 29 May 2019 17:06:49 +0000, Patel, Vedang wrote:    
> >>>>> On May 28, 2019, at 3:45 PM, Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> >>>>> On Tue, 28 May 2019 10:46:44 -0700, Vedang Patel wrote:      
> >>>>>> From: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> >>>>>> 
> >>>>>> This adds the UAPI and the core bits necessary for userspace to
> >>>>>> request hardware offloading to be enabled.
> >>>>>> 
> >>>>>> The future commits will enable hybrid or full offloading for taprio. This
> >>>>>> commit sets up the infrastructure to enable it via the netlink interface.
> >>>>>> 
> >>>>>> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> >>>>>> Signed-off-by: Vedang Patel <vedang.patel@intel.com>      
> >>>>> 
> >>>>> Other qdiscs offload by default, this offload-level selection here is a
> >>>>> little bit inconsistent with that :(
> >>>>>   
> >>>> There are 2 different offload modes introduced in this patch.
> >>>> 
> >>>> 1. Txtime offload mode: This mode depends on skip_sock_check flag being set in the etf qdisc. Also, it requires some manual configuration which might be specific to the network adapter card. For example, for the i210 card, the user will have to route all the traffic to the highest priority queue and install etf qdisc with offload enabled on that queue. So, I don’t think this mode should be enabled by default.    
> >>> 
> >>> Excellent, it looks like there will be driver patches necessary for
> >>> this offload to function, also it seems your offload enable function
> >>> still contains this after the series:
> >>> 
> >>> 	/* FIXME: enable offloading */
> >>> 
> >>> Please only post offload infrastructure when fully fleshed out and with
> >>> driver patches making use of it.  
> >> 
> >> The above comment refers to the full offload mode described below. It
> >> is not needed for txtime offload mode. Txtime offload mode essentially
> >> sets the transmit time for each packet, based on the interval in which
> >> it is going to be transmitted, instead of relying on the hrtimers to
> >> open gates and send packets. More details about why exactly this is
> >> done are given in patch #5[1] of this series.
> > 
> > From looking at this set it looks like I can add that qdisc to any
> > netdev now *with* offload, and as long as the driver has _any_
> > ndo_setup_tc implementation taprio will not return an error. 
> > 
> > Perhaps this mode of operation should not be called offload?  Can't 
> > the ETF qdisc underneath run in software mode?
> >   
> Yeah, it doesn’t make much sense to run ETF in software mode but it
> can be done. What about naming it txtime-assist instead?

Sounds good!  txtime.* works for me.

> >> What we can do is just add the txtime offload related flag and add
> >> the full offload flag whenever the driver bits are ready. Does that
> >> address your concern?  
> > 
> > That would be a step in the right direction.  And please remove all the
> > other unused code from the series.  AFAICT this is what the enable
> > offload function looks like after your set - what's the point of the
> > 'err' variable?
> >   
> Yes. This patch seems to have a few really silly mistakes. I
> apologize for that. I will clean it up and send another version of
> the series. There is no unused code anywhere else in the series.

Thanks!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio
  2019-05-28 17:46 [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Vedang Patel
                   ` (6 preceding siblings ...)
  2019-05-28 17:46 ` [PATCH net-next v1 7/7] taprio: Adjust timestamps for TCP packets Vedang Patel
@ 2019-06-03 13:50 ` Murali Karicheri
  2019-06-03 13:54   ` Murali Karicheri
  2019-06-04 19:47   ` Patel, Vedang
  7 siblings, 2 replies; 25+ messages in thread
From: Murali Karicheri @ 2019-06-03 13:50 UTC (permalink / raw)
  To: Vedang Patel, netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l

Hi Vedang,

On 05/28/2019 01:46 PM, Vedang Patel wrote:
> Currently, we are seeing packets being transmitted outside their timeslices. We
> can confirm that the packets are being dequeued at the right time. So, the
> delay is induced after the packet is dequeued, because taprio, without any
> offloading, has no control of when a packet is actually transmitted.
> 
> In order to solve this, we are making use of the txtime feature provided by ETF
> qdisc. Hardware offloading needs to be supported by the ETF qdisc in order to
> take advantage of this feature. The taprio qdisc will assign txtime (in
> skb->tstamp) for all the packets which do not have the txtime allocated via the
> SO_TXTIME socket option. For the packets which already have SO_TXTIME set,
> taprio will validate whether the packet will be transmitted in the correct
> interval.
> 
> In order to support this, the following parameters have been added:
> - offload (taprio): This is added in order to support different offloading
>    modes which will be added in the future.
> - txtime-delay (taprio): this is the time the packet takes to traverse through
>    the kernel to adapter card.

I am very new to this TC and QoS handling of the Linux kernel and TSN,
so sorry, some of the questions below are very basic in nature. I will
soon be working to add this support on our platform based on this, so
please answer.

Is txtime-delay measured from the instant the qdisc dequeues the packet
to the time the packet gets onto the wire? I am wondering if this time
is deterministic, or whether we have some way to ensure it can be tuned
to meet a max value? Also, how would one calculate this value, or does
it have to be measured?

> - skip_sock_check (etf): etf qdisc currently drops any packet which does not
>    have the SO_TXTIME socket option set. This check can be skipped by specifying
>    this option.
> 
> Following is an example configuration:
> 
> $ tc qdisc replace dev $IFACE parent root handle 100 taprio \\
>        num_tc 3 \\
>        map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
>        queues 1@0 1@0 1@0 \\
>        base-time $BASE_TIME \\
>        sched-entry S 01 300000 \\
>        sched-entry S 02 300000 \\
>        sched-entry S 04 400000 \\
>        offload 2 \\
>        txtime-delay 40000 \\
>        clockid CLOCK_TAI
> 

Could you tell me what CLOCK_TAI is? I see the definition below in the
source code, in include/uapi/linux/time.h:

/*
  * The driver implementing this got removed. The clock ID is kept as a
  * place holder. Do not reuse!
  */
#define CLOCK_SGI_CYCLE			10
#define CLOCK_TAI			11

So why do I see such defines being used in the example when the comment
says the implementation was removed?

AFAIK, from the usage point of view, TSN requires the network to be
synchronized through Linux PTP. For synchronization, phc2sys is used to
synchronize the system time to the PHC. So why doesn't one use system
time for this?

So the application will use clock_gettime() to know the current time
and schedule the packet for transmission, and the user would use scripts
or such to check the current system time and issue the tc command above.
So the command should use CLOCK_REALTIME in that case, so that all h/w
and software are aligned to the same time source. Or have I understood
it wrong? I am assuming that for the offloaded case, the h/w scheduling
of gate open, packet send etc. is synchronized to the PHC or uses a
translated time w.r.t. the common time (network time, assuming the PHC
tracks this).

Thanks in advance for clarifying this.

> $ tc qdisc replace dev $IFACE parent 100:1 etf \\
>        offload delta 200000 clockid CLOCK_TAI skip_sock_check
> 
> Here, the "offload" parameter is indicating that the TXTIME_OFFLOAD mode is
> enabled. Also, that all the traffic classes have been assigned the same queue.
> This is to prevent the traffic classes in the lower priority queues from
> getting starved. Note that this configuration is specific to the i210 ethernet
> card. Other network cards where the hardware queues are given the same
> priority, might be able to utilize more than one queue.
> 
> Following are some of the other highlights of the series:
> - Fix a bug where hardware timestamping and SO_TXTIME options cannot be used
>    together. (Patch 1)
> - Introduce hardware offloading. This patch introduces offload parameter which
>    can be used to enable the txtime offload mode. It will also support enabling
>    full offload when the support is available in drivers. (Patch 3)
> - Make TxTime assist mode work with TCP packets. (Patch 6 & 7)
> 
> 
> The following changes are recommended to be done in order to get the best
> performance from taprio in this mode:
> # ip link set dev enp1s0 mtu 1514

May I know why MTU is changed to 1514 to include the Ethernet header?

> # ethtool -K eth0 gso off
> # ethtool -K eth0 tso off
> # ethtool --set-eee eth0 eee off

Could you please explain why these are needed?

Thanks

Murali
> 
> Thanks,
> Vedang Patel
> 
> Vedang Patel (6):
>    igb: clear out tstamp after sending the packet.
>    etf: Add skip_sock_check
>    taprio: calculate cycle_time when schedule is installed
>    taprio: Add support for txtime offload mode.
>    taprio: make clock reference conversions easier
>    taprio: Adjust timestamps for TCP packets.
> 
> Vinicius Costa Gomes (1):
>    taprio: Add the skeleton to enable hardware offloading
> 
>   drivers/net/ethernet/intel/igb/igb_main.c |   1 +
>   include/uapi/linux/pkt_sched.h            |   6 +
>   net/sched/sch_etf.c                       |  10 +
>   net/sched/sch_taprio.c                    | 548 ++++++++++++++++++++--
>   4 files changed, 532 insertions(+), 33 deletions(-)
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio
  2019-06-03 13:50 ` [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Murali Karicheri
@ 2019-06-03 13:54   ` Murali Karicheri
  2019-06-04 19:53     ` Patel, Vedang
  2019-06-04 19:47   ` Patel, Vedang
  1 sibling, 1 reply; 25+ messages in thread
From: Murali Karicheri @ 2019-06-03 13:54 UTC (permalink / raw)
  To: Vedang Patel, netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l

On 06/03/2019 09:50 AM, Murali Karicheri wrote:
> Hi Vedang,
> 
> On 05/28/2019 01:46 PM, Vedang Patel wrote:
>> Currently, we are seeing packets being transmitted outside their 
>> timeslices. We
>> can confirm that the packets are being dequeued at the right time. So, 
>> the
>> delay is induced after the packet is dequeued, because taprio, without 
>> any
>> offloading, has no control of when a packet is actually transmitted.
>>
>> In order to solve this, we are making use of the txtime feature 
>> provided by ETF
>> qdisc. Hardware offloading needs to be supported by the ETF qdisc in 
>> order to
>> take advantage of this feature. The taprio qdisc will assign txtime (in
>> skb->tstamp) for all the packets which do not have the txtime 
>> allocated via the
>> SO_TXTIME socket option. For the packets which already have SO_TXTIME 
>> set,
>> taprio will validate whether the packet will be transmitted in the 
>> correct
>> interval.
>>
>> In order to support this, the following parameters have been added:
>> - offload (taprio): This is added in order to support different 
>> offloading
>>    modes which will be added in the future.
>> - txtime-delay (taprio): this is the time the packet takes to traverse 
>> through
>>    the kernel to adapter card.
> 
> I am very new to this TC and QoS handling of the Linux kernel and TSN,
> so sorry, some of the questions below are very basic in nature. I will
> soon be working to add this support on our platform based on this, so
> please answer.
> 
> Is txtime-delay measured from the instant the qdisc dequeues the packet
> to the time the packet gets onto the wire? I am wondering if this time
> is deterministic, or whether we have some way to ensure it can be tuned
> to meet a max value? Also, how would one calculate this value, or does
> it have to be measured?
> 
>> - skip_sock_check (etf): etf qdisc currently drops any packet which 
>> does not
>>    have the SO_TXTIME socket option set. This check can be skipped by 
>> specifying
>>    this option.
>>
>> Following is an example configuration:
>>
>> $ tc qdisc replace dev $IFACE parent root handle 100 taprio \\
>>        num_tc 3 \\
>>        map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
>>        queues 1@0 1@0 1@0 \\
>>        base-time $BASE_TIME \\
>>        sched-entry S 01 300000 \\
>>        sched-entry S 02 300000 \\
>>        sched-entry S 04 400000 \\
>>        offload 2 \\
>>        txtime-delay 40000 \\
>>        clockid CLOCK_TAI
>>
> 
> Could you tell me what is CLOCK_TAI?? I see below in the source code for
> the definition in include/uapi/linux/time.h
> 
> /*
>   * The driver implementing this got removed. The clock ID is kept as a
>   * place holder. Do not reuse!
>   */
> #define CLOCK_SGI_CYCLE            10
> #define CLOCK_TAI            11
> 
> So why do I see such defines being used in the example that is being
> removed?
> 
> AFAIK, From the usage point of view, TSN requires the network being
> synchronized through linux PTP. For synchronization phc2sys is used to
> synchronize system time to the PHC. So why don't one use system time for
> this?
> 
> So application will use clock_gettime() to know current time and
> schedule the packet for transmission as well as user would use scripts
> or such to check current system time and issue tc command above. So the
> command should use CLOCK_REALTIME in that case. So all h/w and software
> are aligned to the same time source. Or Have I understood it wrong? I am
> assuming that for the offloaded case, h/w schedule Gate open, send
> packet etc are synchronized to the PHC or use a translated time w.r.t 
> the common time (network time. Assuming PHC tracks this).
> 
> Thanks in advance for clarifying this.
> 
>> $ tc qdisc replace dev $IFACE parent 100:1 etf \\
>>        offload delta 200000 clockid CLOCK_TAI skip_sock_check
>>
>> Here, the "offload" parameter is indicating that the TXTIME_OFFLOAD 
>> mode is
>> enabled. Also, that all the traffic classes have been assigned the 
>> same queue.

Actually the tc class is mapped to the same queue in the previous
command, not this one, right?

Murali
>> This is to prevent the traffic classes in the lower priority queues from
>> getting starved. Note that this configuration is specific to the i210 
>> ethernet
>> card. Other network cards where the hardware queues are given the same
>> priority, might be able to utilize more than one queue.
>>
>> Following are some of the other highlights of the series:
>> - Fix a bug where hardware timestamping and SO_TXTIME options cannot 
>> be used
>>    together. (Patch 1)
>> - Introduce hardware offloading. This patch introduces offload 
>> parameter which
>>    can be used to enable the txtime offload mode. It will also support 
>> enabling
>>    full offload when the support is available in drivers. (Patch 3)
>> - Make TxTime assist mode work with TCP packets. (Patch 6 & 7)
>>
>>
>> The following changes are recommended to be done in order to get the best
>> performance from taprio in this mode:
>> # ip link set dev enp1s0 mtu 1514
> 
> May I know why MTU is changed to 1514 to include the Ethernet header?
> 
>> # ethtool -K eth0 gso off
>> # ethtool -K eth0 tso off
>> # ethtool --set-eee eth0 eee off
> 
> Could you please explain why these are needed?
> 
> Thanks
> 
> Murali
>>
>> Thanks,
>> Vedang Patel
>>
>> Vedang Patel (6):
>>    igb: clear out tstamp after sending the packet.
>>    etf: Add skip_sock_check
>>    taprio: calculate cycle_time when schedule is installed
>>    taprio: Add support for txtime offload mode.
>>    taprio: make clock reference conversions easier
>>    taprio: Adjust timestamps for TCP packets.
>>
>> Vinicius Costa Gomes (1):
>>    taprio: Add the skeleton to enable hardware offloading
>>
>>   drivers/net/ethernet/intel/igb/igb_main.c |   1 +
>>   include/uapi/linux/pkt_sched.h            |   6 +
>>   net/sched/sch_etf.c                       |  10 +
>>   net/sched/sch_taprio.c                    | 548 ++++++++++++++++++++--
>>   4 files changed, 532 insertions(+), 33 deletions(-)
>>
> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode.
  2019-05-28 17:46 ` [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode Vedang Patel
@ 2019-06-03 14:15   ` Murali Karicheri
  2019-06-04 20:06     ` Patel, Vedang
  0 siblings, 1 reply; 25+ messages in thread
From: Murali Karicheri @ 2019-06-03 14:15 UTC (permalink / raw)
  To: Vedang Patel, netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l

Hi Vedang,

On 05/28/2019 01:46 PM, Vedang Patel wrote:
> Currently, we are seeing non-critical packets being transmitted outside
> of their timeslice. We can confirm that the packets are being dequeued
> at the right time. So, the delay is induced on the hardware side. The
> most likely reason is that the higher priority hardware queues are
> starving the lower priority queues.
> 
> In order to improve the performance of taprio, we will be making use of the
> txtime feature provided by the ETF qdisc. For all the packets which do not have
> the SO_TXTIME option set, taprio will set the transmit timestamp (set in
> skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit time for
> the packet is set to when the gate is open. If SO_TXTIME is set, the TAPrio
> qdisc will validate whether the timestamp (in skb->tstamp) occurs when the gate
> corresponding to skb's traffic class is open.
> 
> Following is the example configuration for enabling txtime offload:
> 
> tc qdisc replace dev eth0 parent root handle 100 taprio \\
>        num_tc 3 \\
>        map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
>        queues 1@0 1@0 1@0 \\
>        base-time 1558653424279842568 \\
>        sched-entry S 01 300000 \\
>        sched-entry S 02 300000 \\
>        sched-entry S 04 400000 \\
>        offload 2 \\
>        txtime-delay 40000 \\
>        clockid CLOCK_TAI
> 
> tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \\
>        offload delta 200000 clockid CLOCK_TAI
> 
> Here, the "offload" parameter is indicating that the TXTIME_OFFLOAD mode is
> enabled. Also, all the traffic classes are mapped to the same queue.  This is
> only possible in taprio when txtime offload is enabled. Also note that the ETF
> Qdisc is enabled with offload mode set.
> 
> In this mode, if the gate for the packet's traffic class is open and the
> complete packet can be transmitted, taprio will try to transmit the packet
> immediately. This will be done by setting skb->tstamp to current_time + the
> time delta indicated in the txtime_delay parameter. This parameter indicates
> the time taken (in software) for a packet to reach the network adapter.

In the TSN time-aware shaper, packets are sent when the gate for a
specific traffic class is open. So packets that are available in the
queues are sent by the scheduler, and ETF is not strictly required for
this function. I understand that an application which needs to send
packets with some latency expectation should use ETF to schedule the
packet in sync with the next gate open time. So txtime_delay is used to
account for the delay for packets to travel from user space to the NIC,
and it is ETF that needs to inspect skb->tstamp and allow the packet if
the time matches, or discard it if it is late. Is this the case?

For the offload case, does the h/w that offloads ETF need to send a
trigger to software to fetch the packet for transmission ahead of the
gate open event?

Thanks

Murali
> 
> If the packet cannot be transmitted in the current interval or if the gate
> for the packet's traffic class is not currently open, the skb->tstamp is set
> to the next available timestamp value. This is tracked in the next_txtime
> field in the struct sched_entry.
> 
> The behaviour w.r.t admin and oper schedules is not changed from what is
> present in software mode.
> 
> The transmit time is already known in advance, so we do not need the HR timers
> to advance the schedule and wake up the dequeue side of taprio. The HR timer
> won't be run when this mode is enabled.
> 
> Signed-off-by: Vedang Patel <vedang.patel@intel.com>
> ---
>   include/uapi/linux/pkt_sched.h |   1 +
>   net/sched/sch_taprio.c         | 326 ++++++++++++++++++++++++++++++---
>   2 files changed, 306 insertions(+), 21 deletions(-)
> 
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 3319255ffa25..afffda24e055 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -1174,6 +1174,7 @@ enum {
>   	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
>   	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
>   	TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, /* u32 */
> +	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
>   	__TCA_TAPRIO_ATTR_MAX,
>   };
>   
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index 8d87ba099130..1cd19eabc53b 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -21,6 +21,7 @@
>   #include <net/pkt_sched.h>
>   #include <net/pkt_cls.h>
>   #include <net/sch_generic.h>
> +#include <net/sock.h>
>   
>   static LIST_HEAD(taprio_list);
>   static DEFINE_SPINLOCK(taprio_list_lock);
> @@ -40,6 +41,7 @@ struct sched_entry {
>   	 * packet leaves after this time.
>   	 */
>   	ktime_t close_time;
> +	ktime_t next_txtime;
>   	atomic_t budget;
>   	int index;
>   	u32 gate_mask;
> @@ -76,6 +78,7 @@ struct taprio_sched {
>   	struct sk_buff *(*peek)(struct Qdisc *sch);
>   	struct hrtimer advance_timer;
>   	struct list_head taprio_list;
> +	int txtime_delay;
>   };
>   
>   static ktime_t sched_base_time(const struct sched_gate_list *sched)
> @@ -116,6 +119,235 @@ static void switch_schedules(struct taprio_sched *q,
>   	*admin = NULL;
>   }
>   
> +/* Get how much time has been already elapsed in the current cycle. */
> +static inline s32 get_cycle_time_elapsed(struct sched_gate_list *sched, ktime_t time)
> +{
> +	ktime_t time_since_sched_start;
> +	s32 time_elapsed;
> +
> +	time_since_sched_start = ktime_sub(time, sched->base_time);
> +	div_s64_rem(time_since_sched_start, sched->cycle_time, &time_elapsed);
> +
> +	return time_elapsed;
> +}
> +
> +static ktime_t get_interval_end_time(struct sched_gate_list *sched,
> +				     struct sched_gate_list *admin,
> +				     struct sched_entry *entry,
> +				     ktime_t intv_start)
> +{
> +	s32 cycle_elapsed = get_cycle_time_elapsed(sched, intv_start);
> +	ktime_t intv_end, cycle_ext_end, cycle_end;
> +
> +	cycle_end = ktime_add_ns(intv_start, sched->cycle_time - cycle_elapsed);
> +	intv_end = ktime_add_ns(intv_start, entry->interval);
> +	cycle_ext_end = ktime_add(cycle_end, sched->cycle_time_extension);
> +
> +	if (ktime_before(intv_end, cycle_end))
> +		return intv_end;
> +	else if (admin && admin != sched &&
> +		 ktime_after(admin->base_time, cycle_end) &&
> +		 ktime_before(admin->base_time, cycle_ext_end))
> +		return admin->base_time;
> +	else
> +		return cycle_end;
> +}
> +
> +static inline int length_to_duration(struct taprio_sched *q, int len)
> +{
> +	return (len * atomic64_read(&q->picos_per_byte)) / 1000;
> +}
> +
> +/* Returns the entry corresponding to next available interval. If
> + * validate_interval is set, it only validates whether the timestamp occurs
> + * when the gate corresponding to the skb's traffic class is open.
> + */
> +static struct sched_entry *find_entry_to_transmit(struct sk_buff *skb,
> +						  struct Qdisc *sch,
> +						  struct sched_gate_list *sched,
> +						  struct sched_gate_list *admin,
> +						  ktime_t time,
> +						  ktime_t *interval_start,
> +						  ktime_t *interval_end,
> +						  bool validate_interval)
> +{
> +	ktime_t curr_intv_start, curr_intv_end, cycle_end, packet_transmit_time;
> +	ktime_t earliest_txtime = KTIME_MAX, txtime, cycle, transmit_end_time;
> +	struct sched_entry *entry = NULL, *entry_found = NULL;
> +	struct taprio_sched *q = qdisc_priv(sch);
> +	struct net_device *dev = qdisc_dev(sch);
> +	int tc, entry_available = 0, n;
> +	s32 cycle_elapsed;
> +
> +	tc = netdev_get_prio_tc_map(dev, skb->priority);
> +	packet_transmit_time = length_to_duration(q, qdisc_pkt_len(skb));
> +
> +	*interval_start = 0;
> +	*interval_end = 0;
> +
> +	if (!sched)
> +		return NULL;
> +
> +	cycle = sched->cycle_time;
> +	cycle_elapsed = get_cycle_time_elapsed(sched, time);
> +	curr_intv_end = ktime_sub_ns(time, cycle_elapsed);
> +	cycle_end = ktime_add_ns(curr_intv_end, cycle);
> +
> +	list_for_each_entry(entry, &sched->entries, list) {
> +		curr_intv_start = curr_intv_end;
> +		curr_intv_end = get_interval_end_time(sched, admin, entry,
> +						      curr_intv_start);
> +
> +		if (ktime_after(curr_intv_start, cycle_end))
> +			break;
> +
> +		if (!(entry->gate_mask & BIT(tc)) ||
> +		    packet_transmit_time > entry->interval)
> +			continue;
> +
> +		txtime = entry->next_txtime;
> +
> +		if (ktime_before(txtime, time) || validate_interval) {
> +			transmit_end_time = ktime_add_ns(time, packet_transmit_time);
> +			if ((ktime_before(curr_intv_start, time) &&
> +			     ktime_before(transmit_end_time, curr_intv_end)) ||
> +			    (ktime_after(curr_intv_start, time) && !validate_interval)) {
> +				entry_found = entry;
> +				*interval_start = curr_intv_start;
> +				*interval_end = curr_intv_end;
> +				break;
> +			} else if (!entry_available && !validate_interval) {
> +				/* Here, we are just trying to find out the
> +				 * first available interval in the next cycle.
> +				 */
> +				entry_available = 1;
> +				entry_found = entry;
> +				*interval_start = ktime_add_ns(curr_intv_start, cycle);
> +				*interval_end = ktime_add_ns(curr_intv_end, cycle);
> +			}
> +		} else if (ktime_before(txtime, earliest_txtime) &&
> +			   !entry_available) {
> +			earliest_txtime = txtime;
> +			entry_found = entry;
> +			n = div_s64(ktime_sub(txtime, curr_intv_start), cycle);
> +			*interval_start = ktime_add(curr_intv_start, n * cycle);
> +			*interval_end = ktime_add(curr_intv_end, n * cycle);
> +		}
> +	}
> +
> +	return entry_found;
> +}
> +
> +static bool is_valid_interval(struct sk_buff *skb, struct Qdisc *sch)
> +{
> +	struct taprio_sched *q = qdisc_priv(sch);
> +	struct sched_gate_list *sched, *admin;
> +	ktime_t interval_start, interval_end;
> +	struct sched_entry *entry;
> +
> +	rcu_read_lock();
> +	sched = rcu_dereference(q->oper_sched);
> +	admin = rcu_dereference(q->admin_sched);
> +
> +	entry = find_entry_to_transmit(skb, sch, sched, admin, skb->tstamp,
> +				       &interval_start, &interval_end, true);
> +	rcu_read_unlock();
> +
> +	return entry;
> +}
> +
> +static inline ktime_t get_cycle_start(struct sched_gate_list *sched,
> +				      ktime_t time)
> +{
> +	ktime_t cycle_elapsed;
> +
> +	cycle_elapsed = get_cycle_time_elapsed(sched, time);
> +
> +	return ktime_sub(time, cycle_elapsed);
> +}
> +
> +/* There are a few scenarios where we will have to modify the txtime from
> + * what is read from next_txtime in sched_entry. They are:
> + * 1. If txtime is in the past,
> + *    a. The gate for the traffic class is currently open and packet can be
> + *       transmitted before it closes, schedule the packet right away.
> + *    b. If the gate corresponding to the traffic class is going to open later
> + *       in the cycle, set the txtime of packet to the interval start.
> + * 2. If txtime is in the future, there are packets corresponding to the
> + *    current traffic class waiting to be transmitted. So, the following
> + *    possibilities exist:
> + *    a. We can transmit the packet before the window containing the txtime
> + *       closes.
> + *    b. The window might close before the transmission can be completed
> + *       successfully. So, schedule the packet in the next open window.
> + */
> +static long get_packet_txtime(struct sk_buff *skb, struct Qdisc *sch)
> +{
> +	ktime_t transmit_end_time, interval_end, interval_start;
> +	int len, packet_transmit_time, sched_changed;
> +	struct taprio_sched *q = qdisc_priv(sch);
> +	ktime_t minimum_time, now, txtime;
> +	struct sched_gate_list *sched, *admin;
> +	struct sched_entry *entry;
> +
> +	now = q->get_time();
> +	minimum_time = ktime_add_ns(now, q->txtime_delay);
> +
> +	rcu_read_lock();
> +	admin = rcu_dereference(q->admin_sched);
> +	sched = rcu_dereference(q->oper_sched);
> +	if (admin && ktime_after(minimum_time, admin->base_time))
> +		switch_schedules(q, &admin, &sched);
> +
> +	/* Until the schedule starts, all the queues are open */
> +	if (!sched || ktime_before(minimum_time, sched->base_time)) {
> +		txtime = minimum_time;
> +		goto done;
> +	}
> +
> +	len = qdisc_pkt_len(skb);
> +	packet_transmit_time = length_to_duration(q, len);
> +
> +	do {
> +		sched_changed = 0;
> +
> +		entry = find_entry_to_transmit(skb, sch, sched, admin,
> +					       minimum_time,
> +					       &interval_start, &interval_end,
> +					       false);
> +		if (!entry) {
> +			txtime = 0;
> +			goto done;
> +		}
> +
> +		txtime = entry->next_txtime;
> +		txtime = max_t(ktime_t, txtime, minimum_time);
> +		txtime = max_t(ktime_t, txtime, interval_start);
> +
> +		if (admin && admin != sched &&
> +		    ktime_after(txtime, admin->base_time)) {
> +			sched = admin;
> +			sched_changed = 1;
> +			continue;
> +		}
> +
> +		transmit_end_time = ktime_add(txtime, packet_transmit_time);
> +		minimum_time = transmit_end_time;
> +
> +		/* Update the txtime of current entry to the next time it's
> +		 * interval starts.
> +		 */
> +		if (ktime_after(transmit_end_time, interval_end))
> +			entry->next_txtime = ktime_add(interval_start, sched->cycle_time);
> +	} while (sched_changed || ktime_after(transmit_end_time, interval_end));
> +
> +	entry->next_txtime = transmit_end_time;
> +
> +done:
> +	rcu_read_unlock();
> +	return txtime;
> +}
> +
>   static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>   			  struct sk_buff **to_free)
>   {
> @@ -129,6 +361,15 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>   	if (unlikely(!child))
>   		return qdisc_drop(skb, sch, to_free);
>   
> +	if (skb->sk && sock_flag(skb->sk, SOCK_TXTIME)) {
> +		if (!is_valid_interval(skb, sch))
> +			return qdisc_drop(skb, sch, to_free);
> +	} else if (TXTIME_OFFLOAD_IS_ON(q->offload_flags)) {
> +		skb->tstamp = get_packet_txtime(skb, sch);
> +		if (!skb->tstamp)
> +			return qdisc_drop(skb, sch, to_free);
> +	}
> +
>   	qdisc_qstats_backlog_inc(sch, skb);
>   	sch->q.qlen++;
>   
> @@ -206,11 +447,6 @@ static struct sk_buff *taprio_peek(struct Qdisc *sch)
>   	return q->peek(sch);
>   }
>   
> -static inline int length_to_duration(struct taprio_sched *q, int len)
> -{
> -	return div_u64(len * atomic64_read(&q->picos_per_byte), 1000);
> -}
> -
>   static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
>   {
>   	atomic_set(&entry->budget,
> @@ -594,7 +830,8 @@ static int parse_taprio_schedule(struct nlattr **tb,
>   
>   static int taprio_parse_mqprio_opt(struct net_device *dev,
>   				   struct tc_mqprio_qopt *qopt,
> -				   struct netlink_ext_ack *extack)
> +				   struct netlink_ext_ack *extack,
> +				   u32 offload_flags)
>   {
>   	int i, j;
>   
> @@ -642,6 +879,9 @@ static int taprio_parse_mqprio_opt(struct net_device *dev,
>   			return -EINVAL;
>   		}
>   
> +		if (TXTIME_OFFLOAD_IS_ON(offload_flags))
> +			continue;
> +
>   		/* Verify that the offset and counts do not overlap */
>   		for (j = i + 1; j < qopt->num_tc; j++) {
>   			if (last > qopt->offset[j]) {
> @@ -804,6 +1044,9 @@ static int taprio_enable_offload(struct net_device *dev,
>   	const struct net_device_ops *ops = dev->netdev_ops;
>   	int err = 0;
>   
> +	if (TXTIME_OFFLOAD_IS_ON(offload_flags))
> +		goto done;
> +
>   	if (!FULL_OFFLOAD_IS_ON(offload_flags)) {
>   		NL_SET_ERR_MSG(extack, "Offload mode is not supported");
>   		return -EOPNOTSUPP;
> @@ -816,15 +1059,28 @@ static int taprio_enable_offload(struct net_device *dev,
>   
>   	/* FIXME: enable offloading */
>   
> -	q->dequeue = taprio_dequeue_offload;
> -	q->peek = taprio_peek_offload;
> -
> -	if (err == 0)
> +done:
> +	if (err == 0) {
> +		q->dequeue = taprio_dequeue_offload;
> +		q->peek = taprio_peek_offload;
>   		q->offload_flags = offload_flags;
> +	}
>   
>   	return err;
>   }
>   
> +static void setup_txtime(struct taprio_sched *q,
> +			 struct sched_gate_list *sched, ktime_t base)
> +{
> +	struct sched_entry *entry;
> +	u32 interval = 0;
> +
> +	list_for_each_entry(entry, &sched->entries, list) {
> +		entry->next_txtime = ktime_add_ns(base, interval);
> +		interval += entry->interval;
> +	}
> +}
> +
>   static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>   			 struct netlink_ext_ack *extack)
>   {
> @@ -846,7 +1102,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>   	if (tb[TCA_TAPRIO_ATTR_PRIOMAP])
>   		mqprio = nla_data(tb[TCA_TAPRIO_ATTR_PRIOMAP]);
>   
> -	err = taprio_parse_mqprio_opt(dev, mqprio, extack);
> +	if (tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS])
> +		offload_flags = nla_get_u32(tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]);
> +
> +	err = taprio_parse_mqprio_opt(dev, mqprio, extack, offload_flags);
>   	if (err < 0)
>   		return err;
>   
> @@ -887,6 +1146,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>   		goto free_sched;
>   	}
>   
> +	/* preserve offload flags when changing the schedule. */
> +	if (q->offload_flags)
> +		offload_flags = q->offload_flags;
> +
>   	if (tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]) {
>   		clockid = nla_get_s32(tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]);
>   
> @@ -914,7 +1177,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>   	/* Protects against enqueue()/dequeue() */
>   	spin_lock_bh(qdisc_lock(sch));
>   
> -	if (!hrtimer_active(&q->advance_timer)) {
> +	if (tb[TCA_TAPRIO_ATTR_TXTIME_DELAY])
> +		q->txtime_delay = nla_get_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
> +
> +	if (!TXTIME_OFFLOAD_IS_ON(offload_flags) && !hrtimer_active(&q->advance_timer)) {
>   		hrtimer_init(&q->advance_timer, q->clockid, HRTIMER_MODE_ABS);
>   		q->advance_timer.function = advance_sched;
>   	}
> @@ -966,20 +1232,35 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>   		goto unlock;
>   	}
>   
> -	setup_first_close_time(q, new_admin, start);
> +	if (TXTIME_OFFLOAD_IS_ON(offload_flags)) {
> +		setup_txtime(q, new_admin, start);
> +
> +		if (!oper) {
> +			rcu_assign_pointer(q->oper_sched, new_admin);
> +			err = 0;
> +			new_admin = NULL;
> +			goto unlock;
> +		}
> +
> +		rcu_assign_pointer(q->admin_sched, new_admin);
> +		if (admin)
> +			call_rcu(&admin->rcu, taprio_free_sched_cb);
> +	} else {
> +		setup_first_close_time(q, new_admin, start);
>   
> -	/* Protects against advance_sched() */
> -	spin_lock_irqsave(&q->current_entry_lock, flags);
> +		/* Protects against advance_sched() */
> +		spin_lock_irqsave(&q->current_entry_lock, flags);
>   
> -	taprio_start_sched(sch, start, new_admin);
> +		taprio_start_sched(sch, start, new_admin);
>   
> -	rcu_assign_pointer(q->admin_sched, new_admin);
> -	if (admin)
> -		call_rcu(&admin->rcu, taprio_free_sched_cb);
> -	new_admin = NULL;
> +		rcu_assign_pointer(q->admin_sched, new_admin);
> +		if (admin)
> +			call_rcu(&admin->rcu, taprio_free_sched_cb);
>   
> -	spin_unlock_irqrestore(&q->current_entry_lock, flags);
> +		spin_unlock_irqrestore(&q->current_entry_lock, flags);
> +	}
>   
> +	new_admin = NULL;
>   	err = 0;
>   
>   unlock:
> @@ -1225,6 +1506,9 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
>   	if (nla_put_u32(skb, TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, q->offload_flags))
>   		goto options_error;
>   
> +	if (nla_put_s32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
> +		goto options_error;
> +
>   	if (oper && dump_schedule(skb, oper))
>   		goto options_error;
>   
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio
  2019-06-03 13:50 ` [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Murali Karicheri
  2019-06-03 13:54   ` Murali Karicheri
@ 2019-06-04 19:47   ` Patel, Vedang
  1 sibling, 0 replies; 25+ messages in thread
From: Patel, Vedang @ 2019-06-04 19:47 UTC (permalink / raw)
  To: Murali Karicheri
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, intel-wired-lan, Gomes, Vinicius, l

Hi Murali,

> On Jun 3, 2019, at 6:50 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
> 
> Hi Vedang,
> 
> On 05/28/2019 01:46 PM, Vedang Patel wrote:
>> Currently, we are seeing packets being transmitted outside their timeslices. We
>> can confirm that the packets are being dequeued at the right time. So, the
>> delay is induced after the packet is dequeued, because taprio, without any
>> offloading, has no control of when a packet is actually transmitted.
>> In order to solve this, we are making use of the txtime feature provided by ETF
>> qdisc. Hardware offloading needs to be supported by the ETF qdisc in order to
>> take advantage of this feature. The taprio qdisc will assign txtime (in
>> skb->tstamp) for all the packets which do not have the txtime allocated via the
>> SO_TXTIME socket option. For the packets which already have SO_TXTIME set,
>> taprio will validate whether the packet will be transmitted in the correct
>> interval.
>> In order to support this, the following parameters have been added:
>> - offload (taprio): This is added in order to support different offloading
>>   modes which will be added in the future.
>> - txtime-delay (taprio): this is the time the packet takes to traverse through
>>   the kernel to adapter card.
> 
> I am very new to this TC and QoS handling of the Linux kernel and TSN, so sorry, some of the questions below are very basic in nature. I will soon be working to add this support on our platform based on this.
> So please answer.
No problem. This is my first contribution to the networking subsystem as well. :)
> 
> Is txtime-delay measured from the instant the qdisc dequeues the packet
> to the time the packet gets onto the wire? I am wondering if this time
> is deterministic, or whether we have some way to ensure it can be tuned
> to meet a max value? Also, how would one calculate this value, or does
> it have to be measured?
> 
So, consider a scenario where we just received a packet to send and the gate corresponding to that packet is open, but we are not sure whether we can transmit the packet fully before the gate closes. Txtime-delay is a useful parameter in such situations. It is basically the time between the start of taprio_enqueue() and the packet hitting the wire.

It’s probably going to be a little tricky to measure because the packet changes hands. But, you could get reasonable accuracy if you record CLOCK_REALTIME at the start of the packet's path and subtract it from the hardware transmit timestamp of the packet. You will have to run phc2sys at least, to ensure that the PTP clock in the network card and CLOCK_REALTIME are synchronized.
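
A rough sketch of that measurement, assuming hardware TX timestamping
has already been enabled on the NIC (e.g. with hwstamp_ctl) and phc2sys
is running; polling of the error queue and error handling are omitted:

	#include <linux/errqueue.h>
	#include <linux/net_tstamp.h>
	#include <stdio.h>
	#include <sys/socket.h>
	#include <time.h>

	static void measure_txtime_delay(int fd, const void *buf, size_t len,
					 const struct sockaddr *dst,
					 socklen_t dlen)
	{
		int tsflags = SOF_TIMESTAMPING_TX_HARDWARE |
			      SOF_TIMESTAMPING_RAW_HARDWARE;
		char data[256], control[256];
		struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
		struct msghdr msg = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = control, .msg_controllen = sizeof(control),
		};
		struct timespec sw_start;
		struct cmsghdr *cmsg;

		setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &tsflags,
			   sizeof(tsflags));

		clock_gettime(CLOCK_REALTIME, &sw_start);
		sendto(fd, buf, len, 0, dst, dlen);

		/* The hardware TX timestamp comes back on the error queue;
		 * a real program would poll() until it is available.
		 */
		if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
			return;

		for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
		     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
			if (cmsg->cmsg_level == SOL_SOCKET &&
			    cmsg->cmsg_type == SCM_TIMESTAMPING) {
				/* ts[2] holds the raw hardware timestamp */
				struct scm_timestamping *ts =
					(struct scm_timestamping *)CMSG_DATA(cmsg);
				long long d =
					(ts->ts[2].tv_sec - sw_start.tv_sec) *
					1000000000LL +
					(ts->ts[2].tv_nsec - sw_start.tv_nsec);
				printf("approx txtime-delay: %lld ns\n", d);
			}
		}
	}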
>> - skip_sock_check (etf): etf qdisc currently drops any packet which does not
>>   have the SO_TXTIME socket option set. This check can be skipped by specifying
>>   this option.
>> Following is an example configuration:
>> $ tc qdisc replace dev $IFACE parent root handle 100 taprio \\
>>       num_tc 3 \\
>>       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
>>       queues 1@0 1@0 1@0 \\
>>       base-time $BASE_TIME \\
>>       sched-entry S 01 300000 \\
>>       sched-entry S 02 300000 \\
>>       sched-entry S 04 400000 \\
>>       offload 2 \\
>>       txtime-delay 40000 \\
>>       clockid CLOCK_TAI
> 
> Could you tell me what is CLOCK_TAI?? I see below in the source code for
> the definition in include/uapi/linux/time.h
It is International Atomic Time. It is currently exactly 37 seconds ahead of UTC (or CLOCK_REALTIME).

More info: https://en.wikipedia.org/wiki/International_Atomic_Time
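
A quick way to see the offset on a running system (it reads as 0 until
the kernel's TAI offset has been set, e.g. by ptp4l):

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		struct timespec tai, real;

		clock_gettime(CLOCK_TAI, &tai);
		clock_gettime(CLOCK_REALTIME, &real);
		printf("TAI - REALTIME = %ld s\n",
		       (long)(tai.tv_sec - real.tv_sec));
		return 0;
	}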
> 
> /*
> * The driver implementing this got removed. The clock ID is kept as a
> * place holder. Do not reuse!
> */
> #define CLOCK_SGI_CYCLE			10
> #define CLOCK_TAI			11
> 
> So why do I see such defines being used in the example that is being
> removed?
> 
I believe the comment is just referring to CLOCK_SGI_CYCLE. CLOCK_TAI is still supported.
> AFAIK, From the usage point of view, TSN requires the network being
> synchronized through linux PTP. For synchronization phc2sys is used to
> synchronize system time to the PHC. So why don't one use system time for
> this?
> 
> So application will use clock_gettime() to know current time and
> schedule the packet for transmission as well as user would use scripts
> or such to check current system time and issue tc command above. So the
> command should use CLOCK_REALTIME in that case. So all h/w and software
> are aligned to the same time source. Or Have I understood it wrong? I am
> assuming that for the offloaded case, h/w schedule Gate open, send
> packet etc are synchronized to the PHC or use a translated time w.r.t the common time (network time. Assuming PHC tracks this).
> 
CLOCK_REALTIME is also supported by the driver. So, you can definitely use that. CLOCK_TAI is basically CLOCK_REALTIME + UTC_Offset. So, you can use either.
> Thanks in advance for clarifying this.
> 
>> $ tc qdisc replace dev $IFACE parent 100:1 etf \\
>>       offload delta 200000 clockid CLOCK_TAI skip_sock_check
>> Here, the "offload" parameter is indicating that the TXTIME_OFFLOAD mode is
>> enabled. Also, that all the traffic classes have been assigned the same queue.
>> This is to prevent the traffic classes in the lower priority queues from
>> getting starved. Note that this configuration is specific to the i210 ethernet
>> card. Other network cards where the hardware queues are given the same
>> priority, might be able to utilize more than one queue.
>> Following are some of the other highlights of the series:
>> - Fix a bug where hardware timestamping and SO_TXTIME options cannot be used
>>   together. (Patch 1)
>> - Introduce hardware offloading. This patch introduces offload parameter which
>>   can be used to enable the txtime offload mode. It will also support enabling
>>   full offload when the support is available in drivers. (Patch 3)
>> - Make TxTime assist mode work with TCP packets. (Patch 6 & 7)
>> The following changes are recommended to be done in order to get the best
>> performance from taprio in this mode:
>> # ip link set dev enp1s0 mtu 1514
> 
> May I know why MTU is changed to 1514 to include the Ethernet header?
> 
For TSN, in general, jumbo frames are not allowed, which is why this setting is used.
>> # ethtool -K eth0 gso off
>> # ethtool -K eth0 tso off
Let’s consider a scenario where an application is trying to send a lot of small packets. By default, these packets will be combined into a single large packet, which will then be segmented in the hardware. But, we are not sure whether all the resulting packets can be transmitted before the gate for this application closes. So, these options are used to prevent this from happening.
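
(If it helps, the resulting state can be confirmed with something like
"ethtool -k eth0 | grep segmentation-offload", which should show both
tcp-segmentation-offload and generic-segmentation-offload as off.)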
>> # ethtool --set-eee eth0 eee off
> 
This option just disables Energy Efficient Ethernet so that we do not suffer from latency spikes due to the NIC trying to wake up when the packet is supposed to be sent.

Thanks,
Vedang
> Could you please explain why these are needed?
> 
> Thanks
> 
> Murali
>> Thanks,
>> Vedang Patel
>> Vedang Patel (6):
>>   igb: clear out tstamp after sending the packet.
>>   etf: Add skip_sock_check
>>   taprio: calculate cycle_time when schedule is installed
>>   taprio: Add support for txtime offload mode.
>>   taprio: make clock reference conversions easier
>>   taprio: Adjust timestamps for TCP packets.
>> Vinicius Costa Gomes (1):
>>   taprio: Add the skeleton to enable hardware offloading
>>  drivers/net/ethernet/intel/igb/igb_main.c |   1 +
>>  include/uapi/linux/pkt_sched.h            |   6 +
>>  net/sched/sch_etf.c                       |  10 +
>>  net/sched/sch_taprio.c                    | 548 ++++++++++++++++++++--
>>  4 files changed, 532 insertions(+), 33 deletions(-)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio
  2019-06-03 13:54   ` Murali Karicheri
@ 2019-06-04 19:53     ` Patel, Vedang
  0 siblings, 0 replies; 25+ messages in thread
From: Patel, Vedang @ 2019-06-04 19:53 UTC (permalink / raw)
  To: Murali Karicheri
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, intel-wired-lan, Gomes, Vinicius, l


> On Jun 3, 2019, at 6:54 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
> 
> On 06/03/2019 09:50 AM, Murali Karicheri wrote:
>> Hi Vedang,
>> On 05/28/2019 01:46 PM, Vedang Patel wrote:
>>> Currently, we are seeing packets being transmitted outside their timeslices. We
>>> can confirm that the packets are being dequeued at the right time. So, the
>>> delay is induced after the packet is dequeued, because taprio, without any
>>> offloading, has no control of when a packet is actually transmitted.
>>> 
>>> In order to solve this, we are making use of the txtime feature provided by ETF
>>> qdisc. Hardware offloading needs to be supported by the ETF qdisc in order to
>>> take advantage of this feature. The taprio qdisc will assign txtime (in
>>> skb->tstamp) for all the packets which do not have the txtime allocated via the
>>> SO_TXTIME socket option. For the packets which already have SO_TXTIME set,
>>> taprio will validate whether the packet will be transmitted in the correct
>>> interval.
>>> 
>>> In order to support this, the following parameters have been added:
>>> - offload (taprio): This is added in order to support different offloading
>>>    modes which will be added in the future.
>>> - txtime-delay (taprio): this is the time the packet takes to traverse through
>>>    the kernel to adapter card.
>> I am very new to this TC and QoS handling of Linux kernel and TSN. So sorry some of the  questions below are very basic in nature. I would soon would be working to add this support in our platform based on this.
>> So please answer.
>> Is txtime-delay from the instance qdisc de-queue the packet to the time
>> packets get onto the wire? I am wondering if this time is deterministic
>> or we have some way to ensure this can be tuned to meet a max value?
>> Also how would one calculate this value or have to measure it?
>>> - skip_sock_check (etf): etf qdisc currently drops any packet which does not
>>>    have the SO_TXTIME socket option set. This check can be skipped by specifying
>>>    this option.
>>> 
>>> Following is an example configuration:
>>> 
>>> $ tc qdisc replace dev $IFACE parent root handle 100 taprio \\
>>>        num_tc 3 \\
>>>        map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
>>>        queues 1@0 1@0 1@0 \\
>>>        base-time $BASE_TIME \\
>>>        sched-entry S 01 300000 \\
>>>        sched-entry S 02 300000 \\
>>>        sched-entry S 04 400000 \\
>>>        offload 2 \\
>>>        txtime-delay 40000 \\
>>>        clockid CLOCK_TAI
>>> 
>> Could you tell me what is CLOCK_TAI?? I see below in the source code for
>> the definition in include/uapi/linux/time.h
>> /*
>>  * The driver implementing this got removed. The clock ID is kept as a
>>  * place holder. Do not reuse!
>>  */
>> #define CLOCK_SGI_CYCLE            10
>> #define CLOCK_TAI            11
>> So why do I see such defines being used in the example that is being
>> removed?
>> AFAIK, From the usage point of view, TSN requires the network being
>> synchronized through linux PTP. For synchronization phc2sys is used to
>> synchronize system time to the PHC. So why don't one use system time for
>> this?
>> So application will use clock_gettime() to know current time and
>> schedule the packet for transmission as well as user would use scripts
>> or such to check current system time and issue tc command above. So the
>> command should use CLOCK_REALTIME in that case. So all h/w and software
>> are aligned to the same time source. Or Have I understood it wrong? I am
>> assuming that for the offloaded case, h/w schedule Gate open, send
>> packet etc are synchronized to the PHC or use a translated time w.r.t the common time (network time. Assuming PHC tracks this).
>> Thanks in advance for clarifying this.
>>> $ tc qdisc replace dev $IFACE parent 100:1 etf \\
>>>        offload delta 200000 clockid CLOCK_TAI skip_sock_check
>>> 
>>> Here, the "offload" parameter is indicating that the TXTIME_OFFLOAD mode is
>>> enabled. Also, that all the traffic classes have been assigned the same queue.
> 
> Actually the tc class is mapped to the same queue in the previous
> command, not this one, right?
> 
Yes, you are right. It is done while setting up taprio. The way it's written is confusing; I will clarify this in the next version of the series.

Thanks,
Vedang
> Murali
>>> This is to prevent the traffic classes in the lower priority queues from
>>> getting starved. Note that this configuration is specific to the i210 ethernet
>>> card. Other network cards where the hardware queues are given the same
>>> priority, might be able to utilize more than one queue.
>>> 
>>> Following are some of the other highlights of the series:
>>> - Fix a bug where hardware timestamping and SO_TXTIME options cannot be used
>>>    together. (Patch 1)
>>> - Introduce hardware offloading. This patch introduces offload parameter which
>>>    can be used to enable the txtime offload mode. It will also support enabling
>>>    full offload when the support is available in drivers. (Patch 3)
>>> - Make TxTime assist mode work with TCP packets. (Patch 6 & 7)
>>> 
>>> 
>>> The following changes are recommended to be done in order to get the best
>>> performance from taprio in this mode:
>>> # ip link set dev enp1s0 mtu 1514
>> May I know why MTU is changed to 1514 to include the Ethernet header?
>>> # ethtool -K eth0 gso off
>>> # ethtool -K eth0 tso off
>>> # ethtool --set-eee eth0 eee off
>> Could you please explain why these are needed?
>> Thanks
>> Murali
>>> 
>>> Thanks,
>>> Vedang Patel
>>> 
>>> Vedang Patel (6):
>>>    igb: clear out tstamp after sending the packet.
>>>    etf: Add skip_sock_check
>>>    taprio: calculate cycle_time when schedule is installed
>>>    taprio: Add support for txtime offload mode.
>>>    taprio: make clock reference conversions easier
>>>    taprio: Adjust timestamps for TCP packets.
>>> 
>>> Vinicius Costa Gomes (1):
>>>    taprio: Add the skeleton to enable hardware offloading
>>> 
>>>   drivers/net/ethernet/intel/igb/igb_main.c |   1 +
>>>   include/uapi/linux/pkt_sched.h            |   6 +
>>>   net/sched/sch_etf.c                       |  10 +
>>>   net/sched/sch_taprio.c                    | 548 ++++++++++++++++++++--
>>>   4 files changed, 532 insertions(+), 33 deletions(-)
>>> 
> 



* Re: [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode.
  2019-06-03 14:15   ` Murali Karicheri
@ 2019-06-04 20:06     ` Patel, Vedang
  2019-06-07 18:52       ` Murali Karicheri
  0 siblings, 1 reply; 25+ messages in thread
From: Patel, Vedang @ 2019-06-04 20:06 UTC (permalink / raw)
  To: Murali Karicheri
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	xiyou.wangcong, jiri, intel-wired-lan, Gomes, Vinicius, l

Hi Murali,

> On Jun 3, 2019, at 7:15 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
> 
> Hi Vedang,
> 
> On 05/28/2019 01:46 PM, Vedang Patel wrote:
>> Currently, we are seeing non-critical packets being transmitted outside
>> of their timeslice. We can confirm that the packets are being dequeued
>> at the right time. So, the delay is induced in the hardware side.  The
>> most likely reason is the hardware queues are starving the lower
>> priority queues.
>> In order to improve the performance of taprio, we will be making use of the
>> txtime feature provided by the ETF qdisc. For all the packets which do not have
>> the SO_TXTIME option set, taprio will set the transmit timestamp (set in
>> skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit time for
>> the packet is set to when the gate is open. If SO_TXTIME is set, the TAPrio
>> qdisc will validate whether the timestamp (in skb->tstamp) occurs when the gate
>> corresponding to skb's traffic class is open.
>> Following is the example configuration for enabling txtime offload:
>> tc qdisc replace dev eth0 parent root handle 100 taprio \\
>>       num_tc 3 \\
>>       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
>>       queues 1@0 1@0 1@0 \\
>>       base-time 1558653424279842568 \\
>>       sched-entry S 01 300000 \\
>>       sched-entry S 02 300000 \\
>>       sched-entry S 04 400000 \\
>>       offload 2 \\
>>       txtime-delay 40000 \\
>>       clockid CLOCK_TAI
>> tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \\
>>       offload delta 200000 clockid CLOCK_TAI
>> Here, the "offload" parameter is indicating that the TXTIME_OFFLOAD mode is
>> enabled. Also, all the traffic classes are mapped to the same queue.  This is
>> only possible in taprio when txtime offload is enabled. Also note that the ETF
>> Qdisc is enabled with offload mode set.
>> In this mode, if the packet's traffic class is open and the complete packet can
>> be transmitted, taprio will try to transmit the packet immediately. This will
>> be done by setting skb->tstamp to current_time + the time delta indicated in
>> the txtime_delay parameter. This parameter indicates the time taken (in
>> software) for packet to reach the network adapter.
> 
> In TSN Time aware shaper, packets are sent when gate for a specific
> traffic class is open. So packets that are available in the queues are
> sent by the scheduler. So the ETF is not strictly required for this
> function.
> I understand if the application needs to send packets with
> some latency expectation should use ETF to schedule the packet in sync
> with the next gate open time. So txtime_delay is used to account for
> the delay for packets to travel from user space to nic.
This is not true. As explained in the other email, txtime-delay is the maximum time a packet might take to reach the NIC after entering taprio_enqueue(), assuming the gate corresponding to that packet was already open.
> So it is ETF
> that need to inspect the skb->tstamp and allow or if time match or
> discard if late. Is this the case?
> 
The role of ETF is just to sort packets according to their transmit timestamps and send them to the hardware queues to be transmitted when their time comes. This is needed because the i210 hardware does not sort within its queues: if two packets arrive and the first one has a transmit timestamp later than the second, the second packet won't be transmitted on time.

Taprio in the txtime offload mode (btw, this will soon be renamed to txtime-assist) will set the transmit timestamp of each packet (in skb->tstamp) and then send it to ETF so that it can be sorted with the other packets and sent to hardware when its time comes. ETF will discard a packet if its transmit time is in the past, or earlier than that of a packet which has already been sent to the hardware.
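
For reference, this is also the path an application takes when it wants to pick its own launch times (the case taprio only validates): it sets the SO_TXTIME socket option once, then passes a per-packet launch time in an SCM_TXTIME control message. A minimal, illustrative sketch, assuming reasonably recent kernel uapi headers and an already-set-up UDP socket fd (error handling omitted):

#include <linux/net_tstamp.h>	/* struct sock_txtime */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11		/* older libcs may not define it */
#endif
#ifndef SCM_TXTIME
#define SCM_TXTIME SO_TXTIME
#endif

static void enable_txtime(int fd)
{
	struct sock_txtime st = { .clockid = CLOCK_TAI, .flags = 0 };

	/* One-time opt-in; without skip_sock_check, ETF drops packets
	 * from sockets that did not set this option.
	 */
	setsockopt(fd, SOL_SOCKET, SO_TXTIME, &st, sizeof(st));
}

static ssize_t send_at(int fd, const void *buf, size_t len,
		       uint64_t txtime_ns)
{
	char control[CMSG_SPACE(sizeof(txtime_ns))] = {0};
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	/* The launch time (CLOCK_TAI nanoseconds) rides along in a
	 * control message and ends up in skb->tstamp.
	 */
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;
	cm->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

	return sendmsg(fd, &msg, 0);
}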

Let me know if you have more questions about this. 

Thanks,
Vedang
> For offload case, the h/w that offload ETF needs to send a trigger to
> software to get packet for transmission ahead of the Gate open event?
> 
> Thanks
> 
> Murali
>> If the packet cannot be transmitted in the current interval or if the packet's
>> traffic is not currently transmitting, the skb->tstamp is set to the next
>> available timestamp value. This is tracked in the next_launchtime parameter in
>> the struct sched_entry.
>> The behaviour w.r.t admin and oper schedules is not changed from what is
>> present in software mode.
>> The transmit time is already known in advance. So, we do not need the HR timers
>> to advance the schedule and wakeup the dequeue side of taprio.  So, HR timer
>> won't be run when this mode is enabled.
>> Signed-off-by: Vedang Patel <vedang.patel@intel.com>
>> ---
>>  include/uapi/linux/pkt_sched.h |   1 +
>>  net/sched/sch_taprio.c         | 326 ++++++++++++++++++++++++++++++---
>>  2 files changed, 306 insertions(+), 21 deletions(-)
>> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
>> index 3319255ffa25..afffda24e055 100644
>> --- a/include/uapi/linux/pkt_sched.h
>> +++ b/include/uapi/linux/pkt_sched.h
>> @@ -1174,6 +1174,7 @@ enum {
>>  	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
>>  	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
>>  	TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, /* u32 */
>> +	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
>>  	__TCA_TAPRIO_ATTR_MAX,
>>  };
>>  diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
>> index 8d87ba099130..1cd19eabc53b 100644
>> --- a/net/sched/sch_taprio.c
>> +++ b/net/sched/sch_taprio.c
>> @@ -21,6 +21,7 @@
>>  #include <net/pkt_sched.h>
>>  #include <net/pkt_cls.h>
>>  #include <net/sch_generic.h>
>> +#include <net/sock.h>
>>    static LIST_HEAD(taprio_list);
>>  static DEFINE_SPINLOCK(taprio_list_lock);
>> @@ -40,6 +41,7 @@ struct sched_entry {
>>  	 * packet leaves after this time.
>>  	 */
>>  	ktime_t close_time;
>> +	ktime_t next_txtime;
>>  	atomic_t budget;
>>  	int index;
>>  	u32 gate_mask;
>> @@ -76,6 +78,7 @@ struct taprio_sched {
>>  	struct sk_buff *(*peek)(struct Qdisc *sch);
>>  	struct hrtimer advance_timer;
>>  	struct list_head taprio_list;
>> +	int txtime_delay;
>>  };
>>    static ktime_t sched_base_time(const struct sched_gate_list *sched)
>> @@ -116,6 +119,235 @@ static void switch_schedules(struct taprio_sched *q,
>>  	*admin = NULL;
>>  }
>>  +/* Get how much time has been already elapsed in the current cycle. */
>> +static inline s32 get_cycle_time_elapsed(struct sched_gate_list *sched, ktime_t time)
>> +{
>> +	ktime_t time_since_sched_start;
>> +	s32 time_elapsed;
>> +
>> +	time_since_sched_start = ktime_sub(time, sched->base_time);
>> +	div_s64_rem(time_since_sched_start, sched->cycle_time, &time_elapsed);
>> +
>> +	return time_elapsed;
>> +}
>> +
>> +static ktime_t get_interval_end_time(struct sched_gate_list *sched,
>> +				     struct sched_gate_list *admin,
>> +				     struct sched_entry *entry,
>> +				     ktime_t intv_start)
>> +{
>> +	s32 cycle_elapsed = get_cycle_time_elapsed(sched, intv_start);
>> +	ktime_t intv_end, cycle_ext_end, cycle_end;
>> +
>> +	cycle_end = ktime_add_ns(intv_start, sched->cycle_time - cycle_elapsed);
>> +	intv_end = ktime_add_ns(intv_start, entry->interval);
>> +	cycle_ext_end = ktime_add(cycle_end, sched->cycle_time_extension);
>> +
>> +	if (ktime_before(intv_end, cycle_end))
>> +		return intv_end;
>> +	else if (admin && admin != sched &&
>> +		 ktime_after(admin->base_time, cycle_end) &&
>> +		 ktime_before(admin->base_time, cycle_ext_end))
>> +		return admin->base_time;
>> +	else
>> +		return cycle_end;
>> +}
>> +
>> +static inline int length_to_duration(struct taprio_sched *q, int len)
>> +{
>> +	return (len * atomic64_read(&q->picos_per_byte)) / 1000;
>> +}
>> +
>> +/* Returns the entry corresponding to next available interval. If
>> + * validate_interval is set, it only validates whether the timestamp occurs
>> + * when the gate corresponding to the skb's traffic class is open.
>> + */
>> +static struct sched_entry *find_entry_to_transmit(struct sk_buff *skb,
>> +						  struct Qdisc *sch,
>> +						  struct sched_gate_list *sched,
>> +						  struct sched_gate_list *admin,
>> +						  ktime_t time,
>> +						  ktime_t *interval_start,
>> +						  ktime_t *interval_end,
>> +						  bool validate_interval)
>> +{
>> +	ktime_t curr_intv_start, curr_intv_end, cycle_end, packet_transmit_time;
>> +	ktime_t earliest_txtime = KTIME_MAX, txtime, cycle, transmit_end_time;
>> +	struct sched_entry *entry = NULL, *entry_found = NULL;
>> +	struct taprio_sched *q = qdisc_priv(sch);
>> +	struct net_device *dev = qdisc_dev(sch);
>> +	int tc, entry_available = 0, n;
>> +	s32 cycle_elapsed;
>> +
>> +	tc = netdev_get_prio_tc_map(dev, skb->priority);
>> +	packet_transmit_time = length_to_duration(q, qdisc_pkt_len(skb));
>> +
>> +	*interval_start = 0;
>> +	*interval_end = 0;
>> +
>> +	if (!sched)
>> +		return NULL;
>> +
>> +	cycle = sched->cycle_time;
>> +	cycle_elapsed = get_cycle_time_elapsed(sched, time);
>> +	curr_intv_end = ktime_sub_ns(time, cycle_elapsed);
>> +	cycle_end = ktime_add_ns(curr_intv_end, cycle);
>> +
>> +	list_for_each_entry(entry, &sched->entries, list) {
>> +		curr_intv_start = curr_intv_end;
>> +		curr_intv_end = get_interval_end_time(sched, admin, entry,
>> +						      curr_intv_start);
>> +
>> +		if (ktime_after(curr_intv_start, cycle_end))
>> +			break;
>> +
>> +		if (!(entry->gate_mask & BIT(tc)) ||
>> +		    packet_transmit_time > entry->interval)
>> +			continue;
>> +
>> +		txtime = entry->next_txtime;
>> +
>> +		if (ktime_before(txtime, time) || validate_interval) {
>> +			transmit_end_time = ktime_add_ns(time, packet_transmit_time);
>> +			if ((ktime_before(curr_intv_start, time) &&
>> +			     ktime_before(transmit_end_time, curr_intv_end)) ||
>> +			    (ktime_after(curr_intv_start, time) && !validate_interval)) {
>> +				entry_found = entry;
>> +				*interval_start = curr_intv_start;
>> +				*interval_end = curr_intv_end;
>> +				break;
>> +			} else if (!entry_available && !validate_interval) {
>> +				/* Here, we are just trying to find out the
>> +				 * first available interval in the next cycle.
>> +				 */
>> +				entry_available = 1;
>> +				entry_found = entry;
>> +				*interval_start = ktime_add_ns(curr_intv_start, cycle);
>> +				*interval_end = ktime_add_ns(curr_intv_end, cycle);
>> +			}
>> +		} else if (ktime_before(txtime, earliest_txtime) &&
>> +			   !entry_available) {
>> +			earliest_txtime = txtime;
>> +			entry_found = entry;
>> +			n = div_s64(ktime_sub(txtime, curr_intv_start), cycle);
>> +			*interval_start = ktime_add(curr_intv_start, n * cycle);
>> +			*interval_end = ktime_add(curr_intv_end, n * cycle);
>> +		}
>> +	}
>> +
>> +	return entry_found;
>> +}
>> +
>> +static bool is_valid_interval(struct sk_buff *skb, struct Qdisc *sch)
>> +{
>> +	struct taprio_sched *q = qdisc_priv(sch);
>> +	struct sched_gate_list *sched, *admin;
>> +	ktime_t interval_start, interval_end;
>> +	struct sched_entry *entry;
>> +
>> +	rcu_read_lock();
>> +	sched = rcu_dereference(q->oper_sched);
>> +	admin = rcu_dereference(q->admin_sched);
>> +
>> +	entry = find_entry_to_transmit(skb, sch, sched, admin, skb->tstamp,
>> +				       &interval_start, &interval_end, true);
>> +	rcu_read_unlock();
>> +
>> +	return entry;
>> +}
>> +
>> +static inline ktime_t get_cycle_start(struct sched_gate_list *sched,
>> +				      ktime_t time)
>> +{
>> +	ktime_t cycle_elapsed;
>> +
>> +	cycle_elapsed = get_cycle_time_elapsed(sched, time);
>> +
>> +	return ktime_sub(time, cycle_elapsed);
>> +}
>> +
>> +/* There are a few scenarios where we will have to modify the txtime from
>> + * what is read from next_txtime in sched_entry. They are:
>> + * 1. If txtime is in the past,
>> + *    a. The gate for the traffic class is currently open and packet can be
>> + *       transmitted before it closes, schedule the packet right away.
>> + *    b. If the gate corresponding to the traffic class is going to open later
>> + *       in the cycle, set the txtime of packet to the interval start.
>> + * 2. If txtime is in the future, there are packets corresponding to the
>> + *    current traffic class waiting to be transmitted. So, the following
>> + *    possibilities exist:
>> + *    a. We can transmit the packet before the window containing the txtime
>> + *       closes.
>> + *    b. The window might close before the transmission can be completed
>> + *       successfully. So, schedule the packet in the next open window.
>> + */
>> +static long get_packet_txtime(struct sk_buff *skb, struct Qdisc *sch)
>> +{
>> +	ktime_t transmit_end_time, interval_end, interval_start;
>> +	int len, packet_transmit_time, sched_changed;
>> +	struct taprio_sched *q = qdisc_priv(sch);
>> +	ktime_t minimum_time, now, txtime;
>> +	struct sched_gate_list *sched, *admin;
>> +	struct sched_entry *entry;
>> +
>> +	now = q->get_time();
>> +	minimum_time = ktime_add_ns(now, q->txtime_delay);
>> +
>> +	rcu_read_lock();
>> +	admin = rcu_dereference(q->admin_sched);
>> +	sched = rcu_dereference(q->oper_sched);
>> +	if (admin && ktime_after(minimum_time, admin->base_time))
>> +		switch_schedules(q, &admin, &sched);
>> +
>> +	/* Until the schedule starts, all the queues are open */
>> +	if (!sched || ktime_before(minimum_time, sched->base_time)) {
>> +		txtime = minimum_time;
>> +		goto done;
>> +	}
>> +
>> +	len = qdisc_pkt_len(skb);
>> +	packet_transmit_time = length_to_duration(q, len);
>> +
>> +	do {
>> +		sched_changed = 0;
>> +
>> +		entry = find_entry_to_transmit(skb, sch, sched, admin,
>> +					       minimum_time,
>> +					       &interval_start, &interval_end,
>> +					       false);
>> +		if (!entry) {
>> +			txtime = 0;
>> +			goto done;
>> +		}
>> +
>> +		txtime = entry->next_txtime;
>> +		txtime = max_t(ktime_t, txtime, minimum_time);
>> +		txtime = max_t(ktime_t, txtime, interval_start);
>> +
>> +		if (admin && admin != sched &&
>> +		    ktime_after(txtime, admin->base_time)) {
>> +			sched = admin;
>> +			sched_changed = 1;
>> +			continue;
>> +		}
>> +
>> +		transmit_end_time = ktime_add(txtime, packet_transmit_time);
>> +		minimum_time = transmit_end_time;
>> +
>> +		/* Update the txtime of current entry to the next time it's
>> +		 * interval starts.
>> +		 */
>> +		if (ktime_after(transmit_end_time, interval_end))
>> +			entry->next_txtime = ktime_add(interval_start, sched->cycle_time);
>> +	} while (sched_changed || ktime_after(transmit_end_time, interval_end));
>> +
>> +	entry->next_txtime = transmit_end_time;
>> +
>> +done:
>> +	rcu_read_unlock();
>> +	return txtime;
>> +}
>> +
>>  static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>>  			  struct sk_buff **to_free)
>>  {
>> @@ -129,6 +361,15 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>>  	if (unlikely(!child))
>>  		return qdisc_drop(skb, sch, to_free);
>>  +	if (skb->sk && sock_flag(skb->sk, SOCK_TXTIME)) {
>> +		if (!is_valid_interval(skb, sch))
>> +			return qdisc_drop(skb, sch, to_free);
>> +	} else if (TXTIME_OFFLOAD_IS_ON(q->offload_flags)) {
>> +		skb->tstamp = get_packet_txtime(skb, sch);
>> +		if (!skb->tstamp)
>> +			return qdisc_drop(skb, sch, to_free);
>> +	}
>> +
>>  	qdisc_qstats_backlog_inc(sch, skb);
>>  	sch->q.qlen++;
>>  @@ -206,11 +447,6 @@ static struct sk_buff *taprio_peek(struct Qdisc *sch)
>>  	return q->peek(sch);
>>  }
>>  -static inline int length_to_duration(struct taprio_sched *q, int len)
>> -{
>> -	return div_u64(len * atomic64_read(&q->picos_per_byte), 1000);
>> -}
>> -
>>  static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
>>  {
>>  	atomic_set(&entry->budget,
>> @@ -594,7 +830,8 @@ static int parse_taprio_schedule(struct nlattr **tb,
>>    static int taprio_parse_mqprio_opt(struct net_device *dev,
>>  				   struct tc_mqprio_qopt *qopt,
>> -				   struct netlink_ext_ack *extack)
>> +				   struct netlink_ext_ack *extack,
>> +				   u32 offload_flags)
>>  {
>>  	int i, j;
>>  @@ -642,6 +879,9 @@ static int taprio_parse_mqprio_opt(struct net_device *dev,
>>  			return -EINVAL;
>>  		}
>>  +		if (TXTIME_OFFLOAD_IS_ON(offload_flags))
>> +			continue;
>> +
>>  		/* Verify that the offset and counts do not overlap */
>>  		for (j = i + 1; j < qopt->num_tc; j++) {
>>  			if (last > qopt->offset[j]) {
>> @@ -804,6 +1044,9 @@ static int taprio_enable_offload(struct net_device *dev,
>>  	const struct net_device_ops *ops = dev->netdev_ops;
>>  	int err = 0;
>>  +	if (TXTIME_OFFLOAD_IS_ON(offload_flags))
>> +		goto done;
>> +
>>  	if (!FULL_OFFLOAD_IS_ON(offload_flags)) {
>>  		NL_SET_ERR_MSG(extack, "Offload mode is not supported");
>>  		return -EOPNOTSUPP;
>> @@ -816,15 +1059,28 @@ static int taprio_enable_offload(struct net_device *dev,
>>    	/* FIXME: enable offloading */
>>  -	q->dequeue = taprio_dequeue_offload;
>> -	q->peek = taprio_peek_offload;
>> -
>> -	if (err == 0)
>> +done:
>> +	if (err == 0) {
>> +		q->dequeue = taprio_dequeue_offload;
>> +		q->peek = taprio_peek_offload;
>>  		q->offload_flags = offload_flags;
>> +	}
>>    	return err;
>>  }
>>  +static void setup_txtime(struct taprio_sched *q,
>> +			 struct sched_gate_list *sched, ktime_t base)
>> +{
>> +	struct sched_entry *entry;
>> +	u32 interval = 0;
>> +
>> +	list_for_each_entry(entry, &sched->entries, list) {
>> +		entry->next_txtime = ktime_add_ns(base, interval);
>> +		interval += entry->interval;
>> +	}
>> +}
>> +
>>  static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>  			 struct netlink_ext_ack *extack)
>>  {
>> @@ -846,7 +1102,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>  	if (tb[TCA_TAPRIO_ATTR_PRIOMAP])
>>  		mqprio = nla_data(tb[TCA_TAPRIO_ATTR_PRIOMAP]);
>>  -	err = taprio_parse_mqprio_opt(dev, mqprio, extack);
>> +	if (tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS])
>> +		offload_flags = nla_get_u32(tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]);
>> +
>> +	err = taprio_parse_mqprio_opt(dev, mqprio, extack, offload_flags);
>>  	if (err < 0)
>>  		return err;
>>  @@ -887,6 +1146,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>  		goto free_sched;
>>  	}
>>  +	/* preserve offload flags when changing the schedule. */
>> +	if (q->offload_flags)
>> +		offload_flags = q->offload_flags;
>> +
>>  	if (tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]) {
>>  		clockid = nla_get_s32(tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]);
>>  @@ -914,7 +1177,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>  	/* Protects against enqueue()/dequeue() */
>>  	spin_lock_bh(qdisc_lock(sch));
>>  -	if (!hrtimer_active(&q->advance_timer)) {
>> +	if (tb[TCA_TAPRIO_ATTR_TXTIME_DELAY])
>> +		q->txtime_delay = nla_get_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
>> +
>> +	if (!TXTIME_OFFLOAD_IS_ON(offload_flags) && !hrtimer_active(&q->advance_timer)) {
>>  		hrtimer_init(&q->advance_timer, q->clockid, HRTIMER_MODE_ABS);
>>  		q->advance_timer.function = advance_sched;
>>  	}
>> @@ -966,20 +1232,35 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>  		goto unlock;
>>  	}
>>  -	setup_first_close_time(q, new_admin, start);
>> +	if (TXTIME_OFFLOAD_IS_ON(offload_flags)) {
>> +		setup_txtime(q, new_admin, start);
>> +
>> +		if (!oper) {
>> +			rcu_assign_pointer(q->oper_sched, new_admin);
>> +			err = 0;
>> +			new_admin = NULL;
>> +			goto unlock;
>> +		}
>> +
>> +		rcu_assign_pointer(q->admin_sched, new_admin);
>> +		if (admin)
>> +			call_rcu(&admin->rcu, taprio_free_sched_cb);
>> +	} else {
>> +		setup_first_close_time(q, new_admin, start);
>>  -	/* Protects against advance_sched() */
>> -	spin_lock_irqsave(&q->current_entry_lock, flags);
>> +		/* Protects against advance_sched() */
>> +		spin_lock_irqsave(&q->current_entry_lock, flags);
>>  -	taprio_start_sched(sch, start, new_admin);
>> +		taprio_start_sched(sch, start, new_admin);
>>  -	rcu_assign_pointer(q->admin_sched, new_admin);
>> -	if (admin)
>> -		call_rcu(&admin->rcu, taprio_free_sched_cb);
>> -	new_admin = NULL;
>> +		rcu_assign_pointer(q->admin_sched, new_admin);
>> +		if (admin)
>> +			call_rcu(&admin->rcu, taprio_free_sched_cb);
>>  -	spin_unlock_irqrestore(&q->current_entry_lock, flags);
>> +		spin_unlock_irqrestore(&q->current_entry_lock, flags);
>> +	}
>>  +	new_admin = NULL;
>>  	err = 0;
>>    unlock:
>> @@ -1225,6 +1506,9 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
>>  	if (nla_put_u32(skb, TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, q->offload_flags))
>>  		goto options_error;
>>  +	if (nla_put_s32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
>> +		goto options_error;
>> +
>>  	if (oper && dump_schedule(skb, oper))
>>  		goto options_error;
>>  
> 



* Re: [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode.
  2019-06-04 20:06     ` Patel, Vedang
@ 2019-06-07 18:52       ` Murali Karicheri
  2019-06-07 21:12         ` Patel, Vedang
  0 siblings, 1 reply; 25+ messages in thread
From: Murali Karicheri @ 2019-06-07 18:52 UTC (permalink / raw)
  To: Patel, Vedang
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	xiyou.wangcong, jiri, intel-wired-lan, Gomes, Vinicius, l

On 06/04/2019 04:06 PM, Patel, Vedang wrote:
> Hi Murali,
> 
>> On Jun 3, 2019, at 7:15 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
>>
>> Hi Vedang,
>>
>> On 05/28/2019 01:46 PM, Vedang Patel wrote:
>>> Currently, we are seeing non-critical packets being transmitted outside
>>> of their timeslice. We can confirm that the packets are being dequeued
>>> at the right time. So, the delay is induced in the hardware side.  The
>>> most likely reason is the hardware queues are starving the lower
>>> priority queues.
>>> In order to improve the performance of taprio, we will be making use of the
>>> txtime feature provided by the ETF qdisc. For all the packets which do not have
>>> the SO_TXTIME option set, taprio will set the transmit timestamp (set in
>>> skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit time for
>>> the packet is set to when the gate is open. If SO_TXTIME is set, the TAPrio
>>> qdisc will validate whether the timestamp (in skb->tstamp) occurs when the gate
>>> corresponding to skb's traffic class is open.
>>> Following is the example configuration for enabling txtime offload:
>>> tc qdisc replace dev eth0 parent root handle 100 taprio \\
>>>        num_tc 3 \\
>>>        map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
>>>        queues 1@0 1@0 1@0 \\
>>>        base-time 1558653424279842568 \\
>>>        sched-entry S 01 300000 \\
>>>        sched-entry S 02 300000 \\
>>>        sched-entry S 04 400000 \\
>>>        offload 2 \\
>>>        txtime-delay 40000 \\
>>>        clockid CLOCK_TAI
>>> tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \\
>>>        offload delta 200000 clockid CLOCK_TAI
>>> Here, the "offload" parameter is indicating that the TXTIME_OFFLOAD mode is
>>> enabled. Also, all the traffic classes are mapped to the same queue.  This is
>>> only possible in taprio when txtime offload is enabled. Also note that the ETF
>>> Qdisc is enabled with offload mode set.
>>> In this mode, if the packet's traffic class is open and the complete packet can
>>> be transmitted, taprio will try to transmit the packet immediately. This will
>>> be done by setting skb->tstamp to current_time + the time delta indicated in
>>> the txtime_delay parameter. This parameter indicates the time taken (in
>>> software) for packet to reach the network adapter.
>>
>> In TSN Time aware shaper, packets are sent when gate for a specific
>> traffic class is open. So packets that are available in the queues are
>> sent by the scheduler. So the ETF is not strictly required for this
>> function.
>> I understand if the application needs to send packets with
>> some latency expectation should use ETF to schedule the packet in sync
>> with the next gate open time. So txtime_delay is used to account for
>> the delay for packets to travel from user space to nic.
> This is not true. As explained in the other email, txtime-delay is the maximum time a packet might take to reach the NIC after entering taprio_enqueue(), assuming the gate corresponding to that packet was already open.
>> So it is ETF
>> that need to inspect the skb->tstamp and allow or if time match or
>> discard if late. Is this the case?
>>
> The role of ETF is just to sort packets according to their transmit timestamps and send them to the hardware queues to be transmitted when their time comes. This is needed because the i210 hardware does not sort within its queues: if two packets arrive and the first one has a transmit timestamp later than the second, the second packet won't be transmitted on time.
> 
> Taprio in the txtime offload mode (btw, this will soon be renamed to txtime-assist) will set the transmit timestamp of each packet (in skb->tstamp) and then send it to ETF so that it can be sorted with the other packets and sent to hardware when its time comes. ETF will discard a packet if its transmit time is in the past, or earlier than that of a packet which has already been sent to the hardware.
> 
> Let me know if you have more questions about this.
> 

It is a bit confusing to have this mode in taprio. My assumption is
that taprio implements the TSN standard 802.1Qbv scheduler, which is
responsible for managing the gate open/close events and sending frames
for a specific traffic class while its gate is open. AFAIK, this
scheduler doesn't inspect a packet's metadata, such as its time to
send. So why is a txtime offload mode added to taprio?
Could you please explain?

Murali

> Thanks,
> Vedang
>> For offload case, the h/w that offload ETF needs to send a trigger to
>> software to get packet for transmission ahead of the Gate open event?
>>
>> Thanks
>>
>> Murali
>>> If the packet cannot be transmitted in the current interval or if the packet's
>>> traffic is not currently transmitting, the skb->tstamp is set to the next
>>> available timestamp value. This is tracked in the next_launchtime parameter in
>>> the struct sched_entry.
>>> The behaviour w.r.t admin and oper schedules is not changed from what is
>>> present in software mode.
>>> The transmit time is already known in advance. So, we do not need the HR timers
>>> to advance the schedule and wakeup the dequeue side of taprio.  So, HR timer
>>> won't be run when this mode is enabled.
>>> Signed-off-by: Vedang Patel <vedang.patel@intel.com>
>>> ---
>>>   include/uapi/linux/pkt_sched.h |   1 +
>>>   net/sched/sch_taprio.c         | 326 ++++++++++++++++++++++++++++++---
>>>   2 files changed, 306 insertions(+), 21 deletions(-)
>>> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
>>> index 3319255ffa25..afffda24e055 100644
>>> --- a/include/uapi/linux/pkt_sched.h
>>> +++ b/include/uapi/linux/pkt_sched.h
>>> @@ -1174,6 +1174,7 @@ enum {
>>>   	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
>>>   	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
>>>   	TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, /* u32 */
>>> +	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
>>>   	__TCA_TAPRIO_ATTR_MAX,
>>>   };
>>>   diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
>>> index 8d87ba099130..1cd19eabc53b 100644
>>> --- a/net/sched/sch_taprio.c
>>> +++ b/net/sched/sch_taprio.c
>>> @@ -21,6 +21,7 @@
>>>   #include <net/pkt_sched.h>
>>>   #include <net/pkt_cls.h>
>>>   #include <net/sch_generic.h>
>>> +#include <net/sock.h>
>>>     static LIST_HEAD(taprio_list);
>>>   static DEFINE_SPINLOCK(taprio_list_lock);
>>> @@ -40,6 +41,7 @@ struct sched_entry {
>>>   	 * packet leaves after this time.
>>>   	 */
>>>   	ktime_t close_time;
>>> +	ktime_t next_txtime;
>>>   	atomic_t budget;
>>>   	int index;
>>>   	u32 gate_mask;
>>> @@ -76,6 +78,7 @@ struct taprio_sched {
>>>   	struct sk_buff *(*peek)(struct Qdisc *sch);
>>>   	struct hrtimer advance_timer;
>>>   	struct list_head taprio_list;
>>> +	int txtime_delay;
>>>   };
>>>     static ktime_t sched_base_time(const struct sched_gate_list *sched)
>>> @@ -116,6 +119,235 @@ static void switch_schedules(struct taprio_sched *q,
>>>   	*admin = NULL;
>>>   }
>>>   +/* Get how much time has been already elapsed in the current cycle. */
>>> +static inline s32 get_cycle_time_elapsed(struct sched_gate_list *sched, ktime_t time)
>>> +{
>>> +	ktime_t time_since_sched_start;
>>> +	s32 time_elapsed;
>>> +
>>> +	time_since_sched_start = ktime_sub(time, sched->base_time);
>>> +	div_s64_rem(time_since_sched_start, sched->cycle_time, &time_elapsed);
>>> +
>>> +	return time_elapsed;
>>> +}
>>> +
>>> +static ktime_t get_interval_end_time(struct sched_gate_list *sched,
>>> +				     struct sched_gate_list *admin,
>>> +				     struct sched_entry *entry,
>>> +				     ktime_t intv_start)
>>> +{
>>> +	s32 cycle_elapsed = get_cycle_time_elapsed(sched, intv_start);
>>> +	ktime_t intv_end, cycle_ext_end, cycle_end;
>>> +
>>> +	cycle_end = ktime_add_ns(intv_start, sched->cycle_time - cycle_elapsed);
>>> +	intv_end = ktime_add_ns(intv_start, entry->interval);
>>> +	cycle_ext_end = ktime_add(cycle_end, sched->cycle_time_extension);
>>> +
>>> +	if (ktime_before(intv_end, cycle_end))
>>> +		return intv_end;
>>> +	else if (admin && admin != sched &&
>>> +		 ktime_after(admin->base_time, cycle_end) &&
>>> +		 ktime_before(admin->base_time, cycle_ext_end))
>>> +		return admin->base_time;
>>> +	else
>>> +		return cycle_end;
>>> +}
>>> +
>>> +static inline int length_to_duration(struct taprio_sched *q, int len)
>>> +{
>>> +	return (len * atomic64_read(&q->picos_per_byte)) / 1000;
>>> +}
>>> +
>>> +/* Returns the entry corresponding to next available interval. If
>>> + * validate_interval is set, it only validates whether the timestamp occurs
>>> + * when the gate corresponding to the skb's traffic class is open.
>>> + */
>>> +static struct sched_entry *find_entry_to_transmit(struct sk_buff *skb,
>>> +						  struct Qdisc *sch,
>>> +						  struct sched_gate_list *sched,
>>> +						  struct sched_gate_list *admin,
>>> +						  ktime_t time,
>>> +						  ktime_t *interval_start,
>>> +						  ktime_t *interval_end,
>>> +						  bool validate_interval)
>>> +{
>>> +	ktime_t curr_intv_start, curr_intv_end, cycle_end, packet_transmit_time;
>>> +	ktime_t earliest_txtime = KTIME_MAX, txtime, cycle, transmit_end_time;
>>> +	struct sched_entry *entry = NULL, *entry_found = NULL;
>>> +	struct taprio_sched *q = qdisc_priv(sch);
>>> +	struct net_device *dev = qdisc_dev(sch);
>>> +	int tc, entry_available = 0, n;
>>> +	s32 cycle_elapsed;
>>> +
>>> +	tc = netdev_get_prio_tc_map(dev, skb->priority);
>>> +	packet_transmit_time = length_to_duration(q, qdisc_pkt_len(skb));
>>> +
>>> +	*interval_start = 0;
>>> +	*interval_end = 0;
>>> +
>>> +	if (!sched)
>>> +		return NULL;
>>> +
>>> +	cycle = sched->cycle_time;
>>> +	cycle_elapsed = get_cycle_time_elapsed(sched, time);
>>> +	curr_intv_end = ktime_sub_ns(time, cycle_elapsed);
>>> +	cycle_end = ktime_add_ns(curr_intv_end, cycle);
>>> +
>>> +	list_for_each_entry(entry, &sched->entries, list) {
>>> +		curr_intv_start = curr_intv_end;
>>> +		curr_intv_end = get_interval_end_time(sched, admin, entry,
>>> +						      curr_intv_start);
>>> +
>>> +		if (ktime_after(curr_intv_start, cycle_end))
>>> +			break;
>>> +
>>> +		if (!(entry->gate_mask & BIT(tc)) ||
>>> +		    packet_transmit_time > entry->interval)
>>> +			continue;
>>> +
>>> +		txtime = entry->next_txtime;
>>> +
>>> +		if (ktime_before(txtime, time) || validate_interval) {
>>> +			transmit_end_time = ktime_add_ns(time, packet_transmit_time);
>>> +			if ((ktime_before(curr_intv_start, time) &&
>>> +			     ktime_before(transmit_end_time, curr_intv_end)) ||
>>> +			    (ktime_after(curr_intv_start, time) && !validate_interval)) {
>>> +				entry_found = entry;
>>> +				*interval_start = curr_intv_start;
>>> +				*interval_end = curr_intv_end;
>>> +				break;
>>> +			} else if (!entry_available && !validate_interval) {
>>> +				/* Here, we are just trying to find out the
>>> +				 * first available interval in the next cycle.
>>> +				 */
>>> +				entry_available = 1;
>>> +				entry_found = entry;
>>> +				*interval_start = ktime_add_ns(curr_intv_start, cycle);
>>> +				*interval_end = ktime_add_ns(curr_intv_end, cycle);
>>> +			}
>>> +		} else if (ktime_before(txtime, earliest_txtime) &&
>>> +			   !entry_available) {
>>> +			earliest_txtime = txtime;
>>> +			entry_found = entry;
>>> +			n = div_s64(ktime_sub(txtime, curr_intv_start), cycle);
>>> +			*interval_start = ktime_add(curr_intv_start, n * cycle);
>>> +			*interval_end = ktime_add(curr_intv_end, n * cycle);
>>> +		}
>>> +	}
>>> +
>>> +	return entry_found;
>>> +}
>>> +
>>> +static bool is_valid_interval(struct sk_buff *skb, struct Qdisc *sch)
>>> +{
>>> +	struct taprio_sched *q = qdisc_priv(sch);
>>> +	struct sched_gate_list *sched, *admin;
>>> +	ktime_t interval_start, interval_end;
>>> +	struct sched_entry *entry;
>>> +
>>> +	rcu_read_lock();
>>> +	sched = rcu_dereference(q->oper_sched);
>>> +	admin = rcu_dereference(q->admin_sched);
>>> +
>>> +	entry = find_entry_to_transmit(skb, sch, sched, admin, skb->tstamp,
>>> +				       &interval_start, &interval_end, true);
>>> +	rcu_read_unlock();
>>> +
>>> +	return entry;
>>> +}
>>> +
>>> +static inline ktime_t get_cycle_start(struct sched_gate_list *sched,
>>> +				      ktime_t time)
>>> +{
>>> +	ktime_t cycle_elapsed;
>>> +
>>> +	cycle_elapsed = get_cycle_time_elapsed(sched, time);
>>> +
>>> +	return ktime_sub(time, cycle_elapsed);
>>> +}
>>> +
>>> +/* There are a few scenarios where we will have to modify the txtime from
>>> + * what is read from next_txtime in sched_entry. They are:
>>> + * 1. If txtime is in the past,
>>> + *    a. The gate for the traffic class is currently open and packet can be
>>> + *       transmitted before it closes, schedule the packet right away.
>>> + *    b. If the gate corresponding to the traffic class is going to open later
>>> + *       in the cycle, set the txtime of packet to the interval start.
>>> + * 2. If txtime is in the future, there are packets corresponding to the
>>> + *    current traffic class waiting to be transmitted. So, the following
>>> + *    possibilities exist:
>>> + *    a. We can transmit the packet before the window containing the txtime
>>> + *       closes.
>>> + *    b. The window might close before the transmission can be completed
>>> + *       successfully. So, schedule the packet in the next open window.
>>> + */
>>> +static long get_packet_txtime(struct sk_buff *skb, struct Qdisc *sch)
>>> +{
>>> +	ktime_t transmit_end_time, interval_end, interval_start;
>>> +	int len, packet_transmit_time, sched_changed;
>>> +	struct taprio_sched *q = qdisc_priv(sch);
>>> +	ktime_t minimum_time, now, txtime;
>>> +	struct sched_gate_list *sched, *admin;
>>> +	struct sched_entry *entry;
>>> +
>>> +	now = q->get_time();
>>> +	minimum_time = ktime_add_ns(now, q->txtime_delay);
>>> +
>>> +	rcu_read_lock();
>>> +	admin = rcu_dereference(q->admin_sched);
>>> +	sched = rcu_dereference(q->oper_sched);
>>> +	if (admin && ktime_after(minimum_time, admin->base_time))
>>> +		switch_schedules(q, &admin, &sched);
>>> +
>>> +	/* Until the schedule starts, all the queues are open */
>>> +	if (!sched || ktime_before(minimum_time, sched->base_time)) {
>>> +		txtime = minimum_time;
>>> +		goto done;
>>> +	}
>>> +
>>> +	len = qdisc_pkt_len(skb);
>>> +	packet_transmit_time = length_to_duration(q, len);
>>> +
>>> +	do {
>>> +		sched_changed = 0;
>>> +
>>> +		entry = find_entry_to_transmit(skb, sch, sched, admin,
>>> +					       minimum_time,
>>> +					       &interval_start, &interval_end,
>>> +					       false);
>>> +		if (!entry) {
>>> +			txtime = 0;
>>> +			goto done;
>>> +		}
>>> +
>>> +		txtime = entry->next_txtime;
>>> +		txtime = max_t(ktime_t, txtime, minimum_time);
>>> +		txtime = max_t(ktime_t, txtime, interval_start);
>>> +
>>> +		if (admin && admin != sched &&
>>> +		    ktime_after(txtime, admin->base_time)) {
>>> +			sched = admin;
>>> +			sched_changed = 1;
>>> +			continue;
>>> +		}
>>> +
>>> +		transmit_end_time = ktime_add(txtime, packet_transmit_time);
>>> +		minimum_time = transmit_end_time;
>>> +
>>> +		/* Update the txtime of current entry to the next time it's
>>> +		 * interval starts.
>>> +		 */
>>> +		if (ktime_after(transmit_end_time, interval_end))
>>> +			entry->next_txtime = ktime_add(interval_start, sched->cycle_time);
>>> +	} while (sched_changed || ktime_after(transmit_end_time, interval_end));
>>> +
>>> +	entry->next_txtime = transmit_end_time;
>>> +
>>> +done:
>>> +	rcu_read_unlock();
>>> +	return txtime;
>>> +}
>>> +
>>>   static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>>>   			  struct sk_buff **to_free)
>>>   {
>>> @@ -129,6 +361,15 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>>>   	if (unlikely(!child))
>>>   		return qdisc_drop(skb, sch, to_free);
>>>   +	if (skb->sk && sock_flag(skb->sk, SOCK_TXTIME)) {
>>> +		if (!is_valid_interval(skb, sch))
>>> +			return qdisc_drop(skb, sch, to_free);
>>> +	} else if (TXTIME_OFFLOAD_IS_ON(q->offload_flags)) {
>>> +		skb->tstamp = get_packet_txtime(skb, sch);
>>> +		if (!skb->tstamp)
>>> +			return qdisc_drop(skb, sch, to_free);
>>> +	}
>>> +
>>>   	qdisc_qstats_backlog_inc(sch, skb);
>>>   	sch->q.qlen++;
>>>   @@ -206,11 +447,6 @@ static struct sk_buff *taprio_peek(struct Qdisc *sch)
>>>   	return q->peek(sch);
>>>   }
>>>   -static inline int length_to_duration(struct taprio_sched *q, int len)
>>> -{
>>> -	return div_u64(len * atomic64_read(&q->picos_per_byte), 1000);
>>> -}
>>> -
>>>   static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
>>>   {
>>>   	atomic_set(&entry->budget,
>>> @@ -594,7 +830,8 @@ static int parse_taprio_schedule(struct nlattr **tb,
>>>     static int taprio_parse_mqprio_opt(struct net_device *dev,
>>>   				   struct tc_mqprio_qopt *qopt,
>>> -				   struct netlink_ext_ack *extack)
>>> +				   struct netlink_ext_ack *extack,
>>> +				   u32 offload_flags)
>>>   {
>>>   	int i, j;
>>>   @@ -642,6 +879,9 @@ static int taprio_parse_mqprio_opt(struct net_device *dev,
>>>   			return -EINVAL;
>>>   		}
>>>   +		if (TXTIME_OFFLOAD_IS_ON(offload_flags))
>>> +			continue;
>>> +
>>>   		/* Verify that the offset and counts do not overlap */
>>>   		for (j = i + 1; j < qopt->num_tc; j++) {
>>>   			if (last > qopt->offset[j]) {
>>> @@ -804,6 +1044,9 @@ static int taprio_enable_offload(struct net_device *dev,
>>>   	const struct net_device_ops *ops = dev->netdev_ops;
>>>   	int err = 0;
>>>   +	if (TXTIME_OFFLOAD_IS_ON(offload_flags))
>>> +		goto done;
>>> +
>>>   	if (!FULL_OFFLOAD_IS_ON(offload_flags)) {
>>>   		NL_SET_ERR_MSG(extack, "Offload mode is not supported");
>>>   		return -EOPNOTSUPP;
>>> @@ -816,15 +1059,28 @@ static int taprio_enable_offload(struct net_device *dev,
>>>     	/* FIXME: enable offloading */
>>>   -	q->dequeue = taprio_dequeue_offload;
>>> -	q->peek = taprio_peek_offload;
>>> -
>>> -	if (err == 0)
>>> +done:
>>> +	if (err == 0) {
>>> +		q->dequeue = taprio_dequeue_offload;
>>> +		q->peek = taprio_peek_offload;
>>>   		q->offload_flags = offload_flags;
>>> +	}
>>>     	return err;
>>>   }
>>>   +static void setup_txtime(struct taprio_sched *q,
>>> +			 struct sched_gate_list *sched, ktime_t base)
>>> +{
>>> +	struct sched_entry *entry;
>>> +	u32 interval = 0;
>>> +
>>> +	list_for_each_entry(entry, &sched->entries, list) {
>>> +		entry->next_txtime = ktime_add_ns(base, interval);
>>> +		interval += entry->interval;
>>> +	}
>>> +}
>>> +
>>>   static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>   			 struct netlink_ext_ack *extack)
>>>   {
>>> @@ -846,7 +1102,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>   	if (tb[TCA_TAPRIO_ATTR_PRIOMAP])
>>>   		mqprio = nla_data(tb[TCA_TAPRIO_ATTR_PRIOMAP]);
>>>   -	err = taprio_parse_mqprio_opt(dev, mqprio, extack);
>>> +	if (tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS])
>>> +		offload_flags = nla_get_u32(tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]);
>>> +
>>> +	err = taprio_parse_mqprio_opt(dev, mqprio, extack, offload_flags);
>>>   	if (err < 0)
>>>   		return err;
>>>   @@ -887,6 +1146,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>   		goto free_sched;
>>>   	}
>>>   +	/* preserve offload flags when changing the schedule. */
>>> +	if (q->offload_flags)
>>> +		offload_flags = q->offload_flags;
>>> +
>>>   	if (tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]) {
>>>   		clockid = nla_get_s32(tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]);
>>>   @@ -914,7 +1177,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>   	/* Protects against enqueue()/dequeue() */
>>>   	spin_lock_bh(qdisc_lock(sch));
>>>   -	if (!hrtimer_active(&q->advance_timer)) {
>>> +	if (tb[TCA_TAPRIO_ATTR_TXTIME_DELAY])
>>> +		q->txtime_delay = nla_get_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
>>> +
>>> +	if (!TXTIME_OFFLOAD_IS_ON(offload_flags) && !hrtimer_active(&q->advance_timer)) {
>>>   		hrtimer_init(&q->advance_timer, q->clockid, HRTIMER_MODE_ABS);
>>>   		q->advance_timer.function = advance_sched;
>>>   	}
>>> @@ -966,20 +1232,35 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>   		goto unlock;
>>>   	}
>>>   -	setup_first_close_time(q, new_admin, start);
>>> +	if (TXTIME_OFFLOAD_IS_ON(offload_flags)) {
>>> +		setup_txtime(q, new_admin, start);
>>> +
>>> +		if (!oper) {
>>> +			rcu_assign_pointer(q->oper_sched, new_admin);
>>> +			err = 0;
>>> +			new_admin = NULL;
>>> +			goto unlock;
>>> +		}
>>> +
>>> +		rcu_assign_pointer(q->admin_sched, new_admin);
>>> +		if (admin)
>>> +			call_rcu(&admin->rcu, taprio_free_sched_cb);
>>> +	} else {
>>> +		setup_first_close_time(q, new_admin, start);
>>>   -	/* Protects against advance_sched() */
>>> -	spin_lock_irqsave(&q->current_entry_lock, flags);
>>> +		/* Protects against advance_sched() */
>>> +		spin_lock_irqsave(&q->current_entry_lock, flags);
>>>   -	taprio_start_sched(sch, start, new_admin);
>>> +		taprio_start_sched(sch, start, new_admin);
>>>   -	rcu_assign_pointer(q->admin_sched, new_admin);
>>> -	if (admin)
>>> -		call_rcu(&admin->rcu, taprio_free_sched_cb);
>>> -	new_admin = NULL;
>>> +		rcu_assign_pointer(q->admin_sched, new_admin);
>>> +		if (admin)
>>> +			call_rcu(&admin->rcu, taprio_free_sched_cb);
>>>   -	spin_unlock_irqrestore(&q->current_entry_lock, flags);
>>> +		spin_unlock_irqrestore(&q->current_entry_lock, flags);
>>> +	}
>>>   +	new_admin = NULL;
>>>   	err = 0;
>>>     unlock:
>>> @@ -1225,6 +1506,9 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
>>>   	if (nla_put_u32(skb, TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, q->offload_flags))
>>>   		goto options_error;
>>>   +	if (nla_put_s32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
>>> +		goto options_error;
>>> +
>>>   	if (oper && dump_schedule(skb, oper))
>>>   		goto options_error;
>>>   
>>
> 



* Re: [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode.
  2019-06-07 18:52       ` Murali Karicheri
@ 2019-06-07 21:12         ` Patel, Vedang
  2019-06-10 14:27           ` Murali Karicheri
  0 siblings, 1 reply; 25+ messages in thread
From: Patel, Vedang @ 2019-06-07 21:12 UTC (permalink / raw)
  To: Murali Karicheri
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	xiyou.wangcong, jiri, intel-wired-lan, Gomes, Vinicius, l

Hi Murali,

> On Jun 7, 2019, at 11:52 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
> 
> On 06/04/2019 04:06 PM, Patel, Vedang wrote:
>> Hi Murali,
>>> On Jun 3, 2019, at 7:15 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
>>> 
>>> Hi Vedang,
>>> 
>>> On 05/28/2019 01:46 PM, Vedang Patel wrote:
>>>> Currently, we are seeing non-critical packets being transmitted outside
>>>> of their timeslice. We can confirm that the packets are being dequeued
>>>> at the right time. So, the delay is induced in the hardware side.  The
>>>> most likely reason is the hardware queues are starving the lower
>>>> priority queues.
>>>> In order to improve the performance of taprio, we will be making use of the
>>>> txtime feature provided by the ETF qdisc. For all the packets which do not have
>>>> the SO_TXTIME option set, taprio will set the transmit timestamp (set in
>>>> skb->tstamp) in this mode. TAPrio Qdisc will ensure that the transmit time for
>>>> the packet is set to when the gate is open. If SO_TXTIME is set, the TAPrio
>>>> qdisc will validate whether the timestamp (in skb->tstamp) occurs when the gate
>>>> corresponding to skb's traffic class is open.
>>>> Following is the example configuration for enabling txtime offload:
>>>> tc qdisc replace dev eth0 parent root handle 100 taprio \\
>>>>       num_tc 3 \\
>>>>       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
>>>>       queues 1@0 1@0 1@0 \\
>>>>       base-time 1558653424279842568 \\
>>>>       sched-entry S 01 300000 \\
>>>>       sched-entry S 02 300000 \\
>>>>       sched-entry S 04 400000 \\
>>>>       offload 2 \\
>>>>       txtime-delay 40000 \\
>>>>       clockid CLOCK_TAI
>>>> tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \\
>>>>       offload delta 200000 clockid CLOCK_TAI
>>>> Here, the "offload" parameter is indicating that the TXTIME_OFFLOAD mode is
>>>> enabled. Also, all the traffic classes are mapped to the same queue.  This is
>>>> only possible in taprio when txtime offload is enabled. Also note that the ETF
>>>> Qdisc is enabled with offload mode set.
>>>> In this mode, if the packet's traffic class is open and the complete packet can
>>>> be transmitted, taprio will try to transmit the packet immediately. This will
>>>> be done by setting skb->tstamp to current_time + the time delta indicated in
>>>> the txtime_delay parameter. This parameter indicates the time taken (in
>>>> software) for packet to reach the network adapter.
>>> 
>>> In TSN Time aware shaper, packets are sent when gate for a specific
>>> traffic class is open. So packets that are available in the queues are
>>> sent by the scheduler. So the ETF is not strictly required for this
>>> function.
>>> I understand if the application needs to send packets with
>>> some latency expectation should use ETF to schedule the packet in sync
>>> with the next gate open time. So txtime_delay is used to account for
>>> the delay for packets to travel from user space to nic.
>> This is not true. As explained in the other email, txtime-delay is the maximum time a packet might take to reach the NIC after entering taprio_enqueue(), assuming the gate corresponding to that packet was already open.
>>> So it is ETF
>>> that need to inspect the skb->tstamp and allow or if time match or
>>> discard if late. Is this the case?
>>> 
>> The role of ETF is just to sort packets according to their transmit timestamps and send them to the hardware queues to be transmitted when their time comes. This is needed because the i210 hardware does not sort within its queues: if two packets arrive and the first one has a transmit timestamp later than the second, the second packet won't be transmitted on time.
>> Taprio in the txtime offload mode (btw, this will soon be renamed to txtime-assist) will set the transmit timestamp of each packet (in skb->tstamp) and then send it to ETF so that it can be sorted with the other packets and sent to hardware when its time comes. ETF will discard a packet if its transmit time is in the past, or earlier than that of a packet which has already been sent to the hardware.
>> Let me know if you have more questions about this.
> 
> It is bit confusing to have this mode in taprio. My assumption is
> that taprio implements TSN standard 802.1Qbv scheduler that is
> responsible for managing the Gate open/close and sending frames
> for specific traffic class during the Gate open. AFAIK, this
> scheduler doesn't inspect the packet's metadata such as time to
> send or such. So why is txtime offload mode is added to taprio?
> Could you please explain?
> 
> Murali
> 
Short answer: taprio still implements an 802.1Qbv-like schedule, but it leverages functionality from the ETF qdisc to do so.

Long answer:
The software-only implementation of 802.1Qbv has quite a few low-priority packets being transmitted outside their timeslice, because the higher-priority queues in the i210 starve the lower-priority ones. So, what the txtime-assist mode does is assign an explicit tx timestamp and use the launchtime feature of the i210 adapter card to transmit the packets on time. It still sends frames for a particular traffic class only when their gate is open, and it does not modify any data in the transmitted packet; it just tells the NIC when to transmit it. If a packet already has a tx timestamp assigned, taprio does not change it. Since we are assigning the tx timestamps, we have to route the packets through the ETF qdisc, which sorts them according to their timestamps and sends them to the NIC. A rough sketch of this launch-time decision follows below.
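
This is a simplified, illustrative model only (the real logic is get_packet_txtime() in the patch above; this glosses over the admin/oper schedule switch and the per-entry next_txtime bookkeeping). All times are in nanoseconds:

typedef long long ns_t;

static ns_t assign_launchtime(ns_t now, ns_t txtime_delay,
			      ns_t intv_start, ns_t intv_end,
			      ns_t pkt_duration, ns_t cycle_time)
{
	ns_t earliest = now + txtime_delay;

	if (earliest < intv_start)	/* gate not open yet: wait for it */
		earliest = intv_start;

	if (earliest + pkt_duration <= intv_end)
		return earliest;	/* fits in the current window */

	/* Cannot finish before the gate closes, so target the same
	 * interval in the next cycle.
	 */
	return intv_start + cycle_time;
}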

We have to implement the above mechanism because, currently, we do not have the capability to offload the taprio schedule to the hardware.

Hope this makes things a bit clearer. Let me know if you have more questions.

Thanks,
Vedang 
>> Thanks,
>> Vedang
>>> For offload case, the h/w that offload ETF needs to send a trigger to
>>> software to get packet for transmission ahead of the Gate open event?
>>> 
>>> Thanks
>>> 
>>> Murali
>>>> If the packet cannot be transmitted in the current interval or if the packet's
>>>> traffic is not currently transmitting, the skb->tstamp is set to the next
>>>> available timestamp value. This is tracked in the next_launchtime parameter in
>>>> the struct sched_entry.
>>>> The behaviour w.r.t admin and oper schedules is not changed from what is
>>>> present in software mode.
>>>> The transmit time is already known in advance. So, we do not need the HR timers
>>>> to advance the schedule and wakeup the dequeue side of taprio.  So, HR timer
>>>> won't be run when this mode is enabled.
>>>> Signed-off-by: Vedang Patel <vedang.patel@intel.com>
>>>> ---
>>>>  include/uapi/linux/pkt_sched.h |   1 +
>>>>  net/sched/sch_taprio.c         | 326 ++++++++++++++++++++++++++++++---
>>>>  2 files changed, 306 insertions(+), 21 deletions(-)
>>>> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
>>>> index 3319255ffa25..afffda24e055 100644
>>>> --- a/include/uapi/linux/pkt_sched.h
>>>> +++ b/include/uapi/linux/pkt_sched.h
>>>> @@ -1174,6 +1174,7 @@ enum {
>>>>  	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
>>>>  	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
>>>>  	TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, /* u32 */
>>>> +	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
>>>>  	__TCA_TAPRIO_ATTR_MAX,
>>>>  };
>>>>  diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
>>>> index 8d87ba099130..1cd19eabc53b 100644
>>>> --- a/net/sched/sch_taprio.c
>>>> +++ b/net/sched/sch_taprio.c
>>>> @@ -21,6 +21,7 @@
>>>>  #include <net/pkt_sched.h>
>>>>  #include <net/pkt_cls.h>
>>>>  #include <net/sch_generic.h>
>>>> +#include <net/sock.h>
>>>>    static LIST_HEAD(taprio_list);
>>>>  static DEFINE_SPINLOCK(taprio_list_lock);
>>>> @@ -40,6 +41,7 @@ struct sched_entry {
>>>>  	 * packet leaves after this time.
>>>>  	 */
>>>>  	ktime_t close_time;
>>>> +	ktime_t next_txtime;
>>>>  	atomic_t budget;
>>>>  	int index;
>>>>  	u32 gate_mask;
>>>> @@ -76,6 +78,7 @@ struct taprio_sched {
>>>>  	struct sk_buff *(*peek)(struct Qdisc *sch);
>>>>  	struct hrtimer advance_timer;
>>>>  	struct list_head taprio_list;
>>>> +	int txtime_delay;
>>>>  };
>>>>    static ktime_t sched_base_time(const struct sched_gate_list *sched)
>>>> @@ -116,6 +119,235 @@ static void switch_schedules(struct taprio_sched *q,
>>>>  	*admin = NULL;
>>>>  }
>>>>  +/* Get how much time has been already elapsed in the current cycle. */
>>>> +static inline s32 get_cycle_time_elapsed(struct sched_gate_list *sched, ktime_t time)
>>>> +{
>>>> +	ktime_t time_since_sched_start;
>>>> +	s32 time_elapsed;
>>>> +
>>>> +	time_since_sched_start = ktime_sub(time, sched->base_time);
>>>> +	div_s64_rem(time_since_sched_start, sched->cycle_time, &time_elapsed);
>>>> +
>>>> +	return time_elapsed;
>>>> +}
>>>> +
>>>> +static ktime_t get_interval_end_time(struct sched_gate_list *sched,
>>>> +				     struct sched_gate_list *admin,
>>>> +				     struct sched_entry *entry,
>>>> +				     ktime_t intv_start)
>>>> +{
>>>> +	s32 cycle_elapsed = get_cycle_time_elapsed(sched, intv_start);
>>>> +	ktime_t intv_end, cycle_ext_end, cycle_end;
>>>> +
>>>> +	cycle_end = ktime_add_ns(intv_start, sched->cycle_time - cycle_elapsed);
>>>> +	intv_end = ktime_add_ns(intv_start, entry->interval);
>>>> +	cycle_ext_end = ktime_add(cycle_end, sched->cycle_time_extension);
>>>> +
>>>> +	if (ktime_before(intv_end, cycle_end))
>>>> +		return intv_end;
>>>> +	else if (admin && admin != sched &&
>>>> +		 ktime_after(admin->base_time, cycle_end) &&
>>>> +		 ktime_before(admin->base_time, cycle_ext_end))
>>>> +		return admin->base_time;
>>>> +	else
>>>> +		return cycle_end;
>>>> +}
>>>> +
>>>> +static inline int length_to_duration(struct taprio_sched *q, int len)
>>>> +{
>>>> +	return (len * atomic64_read(&q->picos_per_byte)) / 1000;
>>>> +}
>>>> +
>>>> +/* Returns the entry corresponding to next available interval. If
>>>> + * validate_interval is set, it only validates whether the timestamp occurs
>>>> + * when the gate corresponding to the skb's traffic class is open.
>>>> + */
>>>> +static struct sched_entry *find_entry_to_transmit(struct sk_buff *skb,
>>>> +						  struct Qdisc *sch,
>>>> +						  struct sched_gate_list *sched,
>>>> +						  struct sched_gate_list *admin,
>>>> +						  ktime_t time,
>>>> +						  ktime_t *interval_start,
>>>> +						  ktime_t *interval_end,
>>>> +						  bool validate_interval)
>>>> +{
>>>> +	ktime_t curr_intv_start, curr_intv_end, cycle_end, packet_transmit_time;
>>>> +	ktime_t earliest_txtime = KTIME_MAX, txtime, cycle, transmit_end_time;
>>>> +	struct sched_entry *entry = NULL, *entry_found = NULL;
>>>> +	struct taprio_sched *q = qdisc_priv(sch);
>>>> +	struct net_device *dev = qdisc_dev(sch);
>>>> +	int tc, entry_available = 0, n;
>>>> +	s32 cycle_elapsed;
>>>> +
>>>> +	tc = netdev_get_prio_tc_map(dev, skb->priority);
>>>> +	packet_transmit_time = length_to_duration(q, qdisc_pkt_len(skb));
>>>> +
>>>> +	*interval_start = 0;
>>>> +	*interval_end = 0;
>>>> +
>>>> +	if (!sched)
>>>> +		return NULL;
>>>> +
>>>> +	cycle = sched->cycle_time;
>>>> +	cycle_elapsed = get_cycle_time_elapsed(sched, time);
>>>> +	curr_intv_end = ktime_sub_ns(time, cycle_elapsed);
>>>> +	cycle_end = ktime_add_ns(curr_intv_end, cycle);
>>>> +
>>>> +	list_for_each_entry(entry, &sched->entries, list) {
>>>> +		curr_intv_start = curr_intv_end;
>>>> +		curr_intv_end = get_interval_end_time(sched, admin, entry,
>>>> +						      curr_intv_start);
>>>> +
>>>> +		if (ktime_after(curr_intv_start, cycle_end))
>>>> +			break;
>>>> +
>>>> +		if (!(entry->gate_mask & BIT(tc)) ||
>>>> +		    packet_transmit_time > entry->interval)
>>>> +			continue;
>>>> +
>>>> +		txtime = entry->next_txtime;
>>>> +
>>>> +		if (ktime_before(txtime, time) || validate_interval) {
>>>> +			transmit_end_time = ktime_add_ns(time, packet_transmit_time);
>>>> +			if ((ktime_before(curr_intv_start, time) &&
>>>> +			     ktime_before(transmit_end_time, curr_intv_end)) ||
>>>> +			    (ktime_after(curr_intv_start, time) && !validate_interval)) {
>>>> +				entry_found = entry;
>>>> +				*interval_start = curr_intv_start;
>>>> +				*interval_end = curr_intv_end;
>>>> +				break;
>>>> +			} else if (!entry_available && !validate_interval) {
>>>> +				/* Here, we are just trying to find out the
>>>> +				 * first available interval in the next cycle.
>>>> +				 */
>>>> +				entry_available = 1;
>>>> +				entry_found = entry;
>>>> +				*interval_start = ktime_add_ns(curr_intv_start, cycle);
>>>> +				*interval_end = ktime_add_ns(curr_intv_end, cycle);
>>>> +			}
>>>> +		} else if (ktime_before(txtime, earliest_txtime) &&
>>>> +			   !entry_available) {
>>>> +			earliest_txtime = txtime;
>>>> +			entry_found = entry;
>>>> +			n = div_s64(ktime_sub(txtime, curr_intv_start), cycle);
>>>> +			*interval_start = ktime_add(curr_intv_start, n * cycle);
>>>> +			*interval_end = ktime_add(curr_intv_end, n * cycle);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	return entry_found;
>>>> +}
>>>> +
>>>> +static bool is_valid_interval(struct sk_buff *skb, struct Qdisc *sch)
>>>> +{
>>>> +	struct taprio_sched *q = qdisc_priv(sch);
>>>> +	struct sched_gate_list *sched, *admin;
>>>> +	ktime_t interval_start, interval_end;
>>>> +	struct sched_entry *entry;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	sched = rcu_dereference(q->oper_sched);
>>>> +	admin = rcu_dereference(q->admin_sched);
>>>> +
>>>> +	entry = find_entry_to_transmit(skb, sch, sched, admin, skb->tstamp,
>>>> +				       &interval_start, &interval_end, true);
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	return entry;
>>>> +}
>>>> +
>>>> +static inline ktime_t get_cycle_start(struct sched_gate_list *sched,
>>>> +				      ktime_t time)
>>>> +{
>>>> +	ktime_t cycle_elapsed;
>>>> +
>>>> +	cycle_elapsed = get_cycle_time_elapsed(sched, time);
>>>> +
>>>> +	return ktime_sub(time, cycle_elapsed);
>>>> +}
>>>> +
>>>> +/* There are a few scenarios where we will have to modify the txtime from
>>>> + * what is read from next_txtime in sched_entry. They are:
>>>> + * 1. If txtime is in the past,
>>>> + *    a. The gate for the traffic class is currently open and packet can be
>>>> + *       transmitted before it closes, schedule the packet right away.
>>>> + *    b. If the gate corresponding to the traffic class is going to open later
>>>> + *       in the cycle, set the txtime of packet to the interval start.
>>>> + * 2. If txtime is in the future, there are packets corresponding to the
>>>> + *    current traffic class waiting to be transmitted. So, the following
>>>> + *    possibilities exist:
>>>> + *    a. We can transmit the packet before the window containing the txtime
>>>> + *       closes.
>>>> + *    b. The window might close before the transmission can be completed
>>>> + *       successfully. So, schedule the packet in the next open window.
>>>> + */
>>>> +static long get_packet_txtime(struct sk_buff *skb, struct Qdisc *sch)
>>>> +{
>>>> +	ktime_t transmit_end_time, interval_end, interval_start;
>>>> +	int len, packet_transmit_time, sched_changed;
>>>> +	struct taprio_sched *q = qdisc_priv(sch);
>>>> +	ktime_t minimum_time, now, txtime;
>>>> +	struct sched_gate_list *sched, *admin;
>>>> +	struct sched_entry *entry;
>>>> +
>>>> +	now = q->get_time();
>>>> +	minimum_time = ktime_add_ns(now, q->txtime_delay);
>>>> +
>>>> +	rcu_read_lock();
>>>> +	admin = rcu_dereference(q->admin_sched);
>>>> +	sched = rcu_dereference(q->oper_sched);
>>>> +	if (admin && ktime_after(minimum_time, admin->base_time))
>>>> +		switch_schedules(q, &admin, &sched);
>>>> +
>>>> +	/* Until the schedule starts, all the queues are open */
>>>> +	if (!sched || ktime_before(minimum_time, sched->base_time)) {
>>>> +		txtime = minimum_time;
>>>> +		goto done;
>>>> +	}
>>>> +
>>>> +	len = qdisc_pkt_len(skb);
>>>> +	packet_transmit_time = length_to_duration(q, len);
>>>> +
>>>> +	do {
>>>> +		sched_changed = 0;
>>>> +
>>>> +		entry = find_entry_to_transmit(skb, sch, sched, admin,
>>>> +					       minimum_time,
>>>> +					       &interval_start, &interval_end,
>>>> +					       false);
>>>> +		if (!entry) {
>>>> +			txtime = 0;
>>>> +			goto done;
>>>> +		}
>>>> +
>>>> +		txtime = entry->next_txtime;
>>>> +		txtime = max_t(ktime_t, txtime, minimum_time);
>>>> +		txtime = max_t(ktime_t, txtime, interval_start);
>>>> +
>>>> +		if (admin && admin != sched &&
>>>> +		    ktime_after(txtime, admin->base_time)) {
>>>> +			sched = admin;
>>>> +			sched_changed = 1;
>>>> +			continue;
>>>> +		}
>>>> +
>>>> +		transmit_end_time = ktime_add(txtime, packet_transmit_time);
>>>> +		minimum_time = transmit_end_time;
>>>> +
>>>> +		/* Update the txtime of current entry to the next time it's
>>>> +		 * interval starts.
>>>> +		 */
>>>> +		if (ktime_after(transmit_end_time, interval_end))
>>>> +			entry->next_txtime = ktime_add(interval_start, sched->cycle_time);
>>>> +	} while (sched_changed || ktime_after(transmit_end_time, interval_end));
>>>> +
>>>> +	entry->next_txtime = transmit_end_time;
>>>> +
>>>> +done:
>>>> +	rcu_read_unlock();
>>>> +	return txtime;
>>>> +}
>>>> +
>>>>  static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>>>>  			  struct sk_buff **to_free)
>>>>  {
>>>> @@ -129,6 +361,15 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>>>>  	if (unlikely(!child))
>>>>  		return qdisc_drop(skb, sch, to_free);
>>>>  +	if (skb->sk && sock_flag(skb->sk, SOCK_TXTIME)) {
>>>> +		if (!is_valid_interval(skb, sch))
>>>> +			return qdisc_drop(skb, sch, to_free);
>>>> +	} else if (TXTIME_OFFLOAD_IS_ON(q->offload_flags)) {
>>>> +		skb->tstamp = get_packet_txtime(skb, sch);
>>>> +		if (!skb->tstamp)
>>>> +			return qdisc_drop(skb, sch, to_free);
>>>> +	}
>>>> +
>>>>  	qdisc_qstats_backlog_inc(sch, skb);
>>>>  	sch->q.qlen++;
>>>>  @@ -206,11 +447,6 @@ static struct sk_buff *taprio_peek(struct Qdisc *sch)
>>>>  	return q->peek(sch);
>>>>  }
>>>>  -static inline int length_to_duration(struct taprio_sched *q, int len)
>>>> -{
>>>> -	return div_u64(len * atomic64_read(&q->picos_per_byte), 1000);
>>>> -}
>>>> -
>>>>  static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
>>>>  {
>>>>  	atomic_set(&entry->budget,
>>>> @@ -594,7 +830,8 @@ static int parse_taprio_schedule(struct nlattr **tb,
>>>>    static int taprio_parse_mqprio_opt(struct net_device *dev,
>>>>  				   struct tc_mqprio_qopt *qopt,
>>>> -				   struct netlink_ext_ack *extack)
>>>> +				   struct netlink_ext_ack *extack,
>>>> +				   u32 offload_flags)
>>>>  {
>>>>  	int i, j;
>>>>  @@ -642,6 +879,9 @@ static int taprio_parse_mqprio_opt(struct net_device *dev,
>>>>  			return -EINVAL;
>>>>  		}
>>>>  +		if (TXTIME_OFFLOAD_IS_ON(offload_flags))
>>>> +			continue;
>>>> +
>>>>  		/* Verify that the offset and counts do not overlap */
>>>>  		for (j = i + 1; j < qopt->num_tc; j++) {
>>>>  			if (last > qopt->offset[j]) {
>>>> @@ -804,6 +1044,9 @@ static int taprio_enable_offload(struct net_device *dev,
>>>>  	const struct net_device_ops *ops = dev->netdev_ops;
>>>>  	int err = 0;
>>>>  +	if (TXTIME_OFFLOAD_IS_ON(offload_flags))
>>>> +		goto done;
>>>> +
>>>>  	if (!FULL_OFFLOAD_IS_ON(offload_flags)) {
>>>>  		NL_SET_ERR_MSG(extack, "Offload mode is not supported");
>>>>  		return -EOPNOTSUPP;
>>>> @@ -816,15 +1059,28 @@ static int taprio_enable_offload(struct net_device *dev,
>>>>    	/* FIXME: enable offloading */
>>>>  -	q->dequeue = taprio_dequeue_offload;
>>>> -	q->peek = taprio_peek_offload;
>>>> -
>>>> -	if (err == 0)
>>>> +done:
>>>> +	if (err == 0) {
>>>> +		q->dequeue = taprio_dequeue_offload;
>>>> +		q->peek = taprio_peek_offload;
>>>>  		q->offload_flags = offload_flags;
>>>> +	}
>>>>    	return err;
>>>>  }
>>>>  +static void setup_txtime(struct taprio_sched *q,
>>>> +			 struct sched_gate_list *sched, ktime_t base)
>>>> +{
>>>> +	struct sched_entry *entry;
>>>> +	u32 interval = 0;
>>>> +
>>>> +	list_for_each_entry(entry, &sched->entries, list) {
>>>> +		entry->next_txtime = ktime_add_ns(base, interval);
>>>> +		interval += entry->interval;
>>>> +	}
>>>> +}
>>>> +
>>>>  static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>>  			 struct netlink_ext_ack *extack)
>>>>  {
>>>> @@ -846,7 +1102,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>>  	if (tb[TCA_TAPRIO_ATTR_PRIOMAP])
>>>>  		mqprio = nla_data(tb[TCA_TAPRIO_ATTR_PRIOMAP]);
>>>>  -	err = taprio_parse_mqprio_opt(dev, mqprio, extack);
>>>> +	if (tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS])
>>>> +		offload_flags = nla_get_u32(tb[TCA_TAPRIO_ATTR_OFFLOAD_FLAGS]);
>>>> +
>>>> +	err = taprio_parse_mqprio_opt(dev, mqprio, extack, offload_flags);
>>>>  	if (err < 0)
>>>>  		return err;
>>>>  @@ -887,6 +1146,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>>  		goto free_sched;
>>>>  	}
>>>>  +	/* preserve offload flags when changing the schedule. */
>>>> +	if (q->offload_flags)
>>>> +		offload_flags = q->offload_flags;
>>>> +
>>>>  	if (tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]) {
>>>>  		clockid = nla_get_s32(tb[TCA_TAPRIO_ATTR_SCHED_CLOCKID]);
>>>>  @@ -914,7 +1177,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>>  	/* Protects against enqueue()/dequeue() */
>>>>  	spin_lock_bh(qdisc_lock(sch));
>>>>  -	if (!hrtimer_active(&q->advance_timer)) {
>>>> +	if (tb[TCA_TAPRIO_ATTR_TXTIME_DELAY])
>>>> +		q->txtime_delay = nla_get_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
>>>> +
>>>> +	if (!TXTIME_OFFLOAD_IS_ON(offload_flags) && !hrtimer_active(&q->advance_timer)) {
>>>>  		hrtimer_init(&q->advance_timer, q->clockid, HRTIMER_MODE_ABS);
>>>>  		q->advance_timer.function = advance_sched;
>>>>  	}
>>>> @@ -966,20 +1232,35 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
>>>>  		goto unlock;
>>>>  	}
>>>>  -	setup_first_close_time(q, new_admin, start);
>>>> +	if (TXTIME_OFFLOAD_IS_ON(offload_flags)) {
>>>> +		setup_txtime(q, new_admin, start);
>>>> +
>>>> +		if (!oper) {
>>>> +			rcu_assign_pointer(q->oper_sched, new_admin);
>>>> +			err = 0;
>>>> +			new_admin = NULL;
>>>> +			goto unlock;
>>>> +		}
>>>> +
>>>> +		rcu_assign_pointer(q->admin_sched, new_admin);
>>>> +		if (admin)
>>>> +			call_rcu(&admin->rcu, taprio_free_sched_cb);
>>>> +	} else {
>>>> +		setup_first_close_time(q, new_admin, start);
>>>>  -	/* Protects against advance_sched() */
>>>> -	spin_lock_irqsave(&q->current_entry_lock, flags);
>>>> +		/* Protects against advance_sched() */
>>>> +		spin_lock_irqsave(&q->current_entry_lock, flags);
>>>>  -	taprio_start_sched(sch, start, new_admin);
>>>> +		taprio_start_sched(sch, start, new_admin);
>>>>  -	rcu_assign_pointer(q->admin_sched, new_admin);
>>>> -	if (admin)
>>>> -		call_rcu(&admin->rcu, taprio_free_sched_cb);
>>>> -	new_admin = NULL;
>>>> +		rcu_assign_pointer(q->admin_sched, new_admin);
>>>> +		if (admin)
>>>> +			call_rcu(&admin->rcu, taprio_free_sched_cb);
>>>>  -	spin_unlock_irqrestore(&q->current_entry_lock, flags);
>>>> +		spin_unlock_irqrestore(&q->current_entry_lock, flags);
>>>> +	}
>>>>  +	new_admin = NULL;
>>>>  	err = 0;
>>>>    unlock:
>>>> @@ -1225,6 +1506,9 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
>>>>  	if (nla_put_u32(skb, TCA_TAPRIO_ATTR_OFFLOAD_FLAGS, q->offload_flags))
>>>>  		goto options_error;
>>>>  +	if (nla_put_s32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
>>>> +		goto options_error;
>>>> +
>>>>  	if (oper && dump_schedule(skb, oper))
>>>>  		goto options_error;



* Re: [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode.
  2019-06-07 21:12         ` Patel, Vedang
@ 2019-06-10 14:27           ` Murali Karicheri
  2019-06-11 21:03             ` Patel, Vedang
  0 siblings, 1 reply; 25+ messages in thread
From: Murali Karicheri @ 2019-06-10 14:27 UTC (permalink / raw)
  To: Patel, Vedang
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	xiyou.wangcong, jiri, intel-wired-lan, Gomes, Vinicius, l

Vedang,

On 06/07/2019 05:12 PM, Patel, Vedang wrote:
> Hi Murali,
> 
>> On Jun 7, 2019, at 11:52 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
>>
>> On 06/04/2019 04:06 PM, Patel, Vedang wrote:
>>> Hi Murali,
>>>> On Jun 3, 2019, at 7:15 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
>>>>
>>>> Hi Vedang,
>>>>
>>>> On 05/28/2019 01:46 PM, Vedang Patel wrote:
>>>>> Currently, we are seeing non-critical packets being transmitted outside
>>>>> of their timeslice. We can confirm that the packets are being dequeued
>>>>> at the right time. So, the delay is induced on the hardware side. The
>>>>> most likely reason is that the higher priority hardware queues are
>>>>> starving the lower priority queues.
>>>>> In order to improve the performance of taprio, we will be making use of
>>>>> the txtime feature provided by the ETF qdisc. For all the packets which
>>>>> do not have the SO_TXTIME option set, taprio will set the transmit
>>>>> timestamp (in skb->tstamp) in this mode. The taprio qdisc will ensure
>>>>> that the transmit time for the packet is set to when the gate is open.
>>>>> If SO_TXTIME is set, the taprio qdisc will validate whether the timestamp
>>>>> (in skb->tstamp) occurs when the gate corresponding to the skb's traffic
>>>>> class is open.
>>>>> Following is the example configuration for enabling txtime offload:
>>>>> tc qdisc replace dev eth0 parent root handle 100 taprio \\
>>>>>        num_tc 3 \\
>>>>>        map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
>>>>>        queues 1@0 1@0 1@0 \\
>>>>>        base-time 1558653424279842568 \\
>>>>>        sched-entry S 01 300000 \\
>>>>>        sched-entry S 02 300000 \\
>>>>>        sched-entry S 04 400000 \\
>>>>>        offload 2 \\
>>>>>        txtime-delay 40000 \\
>>>>>        clockid CLOCK_TAI
>>>>> tc qdisc replace dev $IFACE parent 100:1 etf skip_sock_check \\
>>>>>        offload delta 200000 clockid CLOCK_TAI
>>>>> Here, the "offload" parameter indicates that the TXTIME_OFFLOAD mode is
>>>>> enabled. Also, all the traffic classes are mapped to the same queue. This
>>>>> is only possible in taprio when txtime offload is enabled. Also note that
>>>>> the ETF qdisc is enabled with offload mode set.
>>>>> In this mode, if the gate for the packet's traffic class is open and the
>>>>> complete packet can be transmitted, taprio will try to transmit the
>>>>> packet immediately. This will be done by setting skb->tstamp to
>>>>> current_time + the time delta indicated in the txtime_delay parameter.
>>>>> This parameter indicates the time taken (in software) for a packet to
>>>>> reach the network adapter.
>>>>
>>>> In the TSN time-aware shaper, packets are sent when the gate for a
>>>> specific traffic class is open. So packets that are available in the
>>>> queues are sent by the scheduler, and ETF is not strictly required for
>>>> this function.
>>>> I understand that an application which needs to send packets with some
>>>> latency expectation should use ETF to schedule the packets in sync with
>>>> the next gate open time, so txtime_delay is used to account for the
>>>> delay for packets to travel from user space to the NIC.
>>> This is not true. As explained in the other email, txtime-delay is the maximum time a packet might take to reach the network adapter after entering taprio_enqueue(), assuming the gate corresponding to that packet was already open.
>>>> So it is ETF that needs to inspect skb->tstamp and allow the packet
>>>> if the time matches, or discard it if late. Is this the case?
>>>>
>>> The role of ETF is just to sort packets according to their transmit timestamp and send them to the hardware queues to transmit whenever their time comes. This is needed because the i210 hardware does not have any sort feature within its queue. If 2 packets arrive and the first packet has a transmit timestamp later than the second packet, the second packet won’t be transmitted on time.
>>> Taprio in the txtime offload mode (btw, this will soon be renamed to txtime-assist) will set the transmit timestamp of each packet (in skb->tstamp) and then send it to ETF so that it can be sorted with the other packets and sent to hardware whenever the time comes. ETF will discard the packet if the transmit time is in the past or before the transmit time of a packet which has already been sent to the hardware.
>>> Let me know if you have more questions about this.
>>
>> It is a bit confusing to have this mode in taprio. My assumption is
>> that taprio implements the TSN standard 802.1Qbv scheduler, which is
>> responsible for managing the gate open/close and sending frames for a
>> specific traffic class while its gate is open. AFAIK, this scheduler
>> doesn't inspect the packet's metadata, such as its time to send. So why
>> is the txtime offload mode added to taprio? Could you please explain?
>>
>> Murali
>>
> Short answer: taprio still implements an 802.1Qbv-like schedule, but it leverages the functionality of the ETF qdisc to do so.
> 
> Long answer:
> With the software-only implementation of 802.1Qbv, quite a few low priority packets are transmitted outside their timeslice. This is because the higher priority queues in the i210 starve the lower priority queues. So, what the txtime-assist mode does is assign an explicit tx timestamp to each packet and use the launchtime feature of the i210 adapter to transmit the packets on time. It still sends frames for a particular traffic class only while that class's gate is open. Also, it does not modify any data in the transmitted packet; it just tells the NIC when to transmit it. If a packet already has a tx timestamp assigned (via SO_TXTIME), taprio does not change it. Since we are assigning tx timestamps, we have to route the packets through the ETF qdisc, which sorts them according to their timestamps and hands them to the NIC.
> 
> We have to implement the above mechanism because, currently, we do not have the capability to offload the taprio schedule to the hardware.
> 
Ok, thanks for the patience and explanation. Yes, it helps.

One last question: if your i210 hardware checks the time in a packet and
finds it is late, does it drop the packet? I understand that txtime-delay
is for tuning this so that the hardware sees the packet on time and
transmits it. I also understand that packets get dropped at the ETF qdisc
if they are late.

Murali
> Hope this makes things a bit more clear. Let me know if you have more questions.
> 
> Thanks,
> Vedang
>>> Thanks,
>>> Vedang
>>>> [... earlier question and the full patch, quoted verbatim, trimmed; see the diff earlier in the thread ...]



* Re: [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode.
  2019-06-10 14:27           ` Murali Karicheri
@ 2019-06-11 21:03             ` Patel, Vedang
  0 siblings, 0 replies; 25+ messages in thread
From: Patel, Vedang @ 2019-06-11 21:03 UTC (permalink / raw)
  To: Murali Karicheri
  Cc: netdev, Kirsher, Jeffrey T, David S . Miller, Jamal Hadi Salim,
	xiyou.wangcong, jiri, intel-wired-lan, Gomes, Vinicius, l



> On Jun 10, 2019, at 7:27 AM, Murali Karicheri <m-karicheri2@ti.com> wrote:
> 
> Vedang,
> 
> On 06/07/2019 05:12 PM, Patel, Vedang wrote:
>> [... Vedang's explanation, quoted in full in the previous message, trimmed ...]
> Ok, thanks for the patience and explanation. Yes, it helps.
> 
> One last question: if your i210 hardware checks the time in a packet and
> finds it is late, does it drop the packet? I understand that txtime-delay
> is for tuning this so that the hardware sees the packet on time and
> transmits it. I also understand that packets get dropped at the ETF qdisc
> if they are late.
> 
> Murali

The i210 hardware does not drop packets if they are late. 

In the i210 NIC, the launchtime field is only 25 bits long, in 32 ns units. So, the hardware will compare launchtime*32 to the SYSTIML register (which stores the nanosecond part of the PTP time) and send the packet when the two are exactly equal. It is the job of ETF to ensure that the packet is not handed to the hardware too early, by selecting a delta of less than 0.5 sec.
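
As a rough sketch of what this implies (illustrative names, not the actual igb driver code):

#include <stdint.h>

#define I210_LT_SHIFT	5	/* launchtime is stored in 32 ns units */
#define I210_LT_BITS	25
#define I210_LT_MASK	((1u << I210_LT_BITS) - 1)

/* Only the low 25 bits of (txtime / 32 ns) fit in the descriptor, so the
 * field wraps every 2^25 * 32 ns ~= 1.07 sec. Keeping queued packets
 * within ~0.5 sec of now (the ETF delta) keeps the comparison against
 * SYSTIML unambiguous.
 */
static uint32_t i210_launchtime(uint64_t txtime_ns)
{
	return (uint32_t)((txtime_ns >> I210_LT_SHIFT) & I210_LT_MASK);
}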

For more information, you can look at Section 7.2.2.2.3 and Section 7.2.7.5.3 of the i210 datasheet [1].

Thanks,
Vedang

[1] - https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/i210-ethernet-controller-datasheet.pdf



end of thread

Thread overview: 25+ messages
2019-05-28 17:46 [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Vedang Patel
2019-05-28 17:46 ` [PATCH net-next v1 1/7] igb: clear out tstamp after sending the packet Vedang Patel
2019-05-28 17:46 ` [PATCH net-next v1 2/7] etf: Add skip_sock_check Vedang Patel
2019-05-28 17:46 ` [PATCH net-next v1 3/7] taprio: Add the skeleton to enable hardware offloading Vedang Patel
2019-05-28 22:45   ` Jakub Kicinski
2019-05-29 17:06     ` Patel, Vedang
2019-05-29 19:14       ` Jakub Kicinski
2019-05-29 20:05         ` Patel, Vedang
2019-05-29 21:06           ` Jakub Kicinski
2019-05-30  0:21             ` Patel, Vedang
2019-05-30 20:41               ` Jakub Kicinski
2019-05-28 17:46 ` [PATCH net-next v1 4/7] taprio: calculate cycle_time when schedule is installed Vedang Patel
2019-05-28 17:46 ` [PATCH net-next v1 5/7] taprio: Add support for txtime offload mode Vedang Patel
2019-06-03 14:15   ` Murali Karicheri
2019-06-04 20:06     ` Patel, Vedang
2019-06-07 18:52       ` Murali Karicheri
2019-06-07 21:12         ` Patel, Vedang
2019-06-10 14:27           ` Murali Karicheri
2019-06-11 21:03             ` Patel, Vedang
2019-05-28 17:46 ` [PATCH net-next v1 6/7] taprio: make clock reference conversions easier Vedang Patel
2019-05-28 17:46 ` [PATCH net-next v1 7/7] taprio: Adjust timestamps for TCP packets Vedang Patel
2019-06-03 13:50 ` [PATCH net-next v1 0/7] net/sched: Add txtime assist support for taprio Murali Karicheri
2019-06-03 13:54   ` Murali Karicheri
2019-06-04 19:53     ` Patel, Vedang
2019-06-04 19:47   ` Patel, Vedang
