* [RFC PATCH net-next 00/15] taprio fixprovements
@ 2023-01-28  1:07 Vladimir Oltean
  2023-01-28  1:07 ` [RFC PATCH net-next 01/15] net/sched: taprio: delete peek() implementation Vladimir Oltean
                   ` (15 more replies)
  0 siblings, 16 replies; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

I started to pull on a small thread and the whole thing unraveled :(

While trying to ignite a more serious discussion about how the i225/i226
hardware prioritization model seems to have affected the generic taprio
software implementation (patch 05/15), I noticed 2 things:
- taprio_peek() is dead code (patch 01/15)
- taprio has ridiculously low iperf3 performance when all gates are
  open and it behaves as a work-conserving qdisc. Patches 06/15 -> 09/15
  and 13/15 -> 15/15 collectively work to address some of that.

I had to put a hard stop for today (and at the patch limit of 15), but
now that taprio calculates the durations of contiguously open TC gates,
part 2 would be the communication of this information to offloading
drivers via ndo_setup_tc(), and the deletion of duplicated logic from
vsc9959_tas_guard_bands_update(). But that's for another day - I'm not
quite sure how that's going to work out. The gate durations change at
each link speed change, and this might mean that reoffloading is
necessary.

Another huge issue I'm seeing at small intervals with software
scheduling is simply the number of RCU stalls. I can't get Kurt's
schedule from commit 497cc00224cf ("taprio: Handle short intervals
and large packets") to work reliably on my system even without these
patches. Eventually the system dies unless I increase the entry
intervals from the posted 500 us - my CPUs just don't do much of
anything else. Maybe someone has an idea what to do.

Almost forgot to mention, this patch set only applies on top of another
one:
https://patchwork.kernel.org/project/netdevbpf/cover/20230127001516.592984-1-vladimir.oltean@nxp.com/

Vladimir Oltean (15):
  net/sched: taprio: delete peek() implementation
  net/sched: taprio: continue with other TXQs if one dequeue() failed
  net/sched: taprio: refactor one skb dequeue from TXQ to separate
    function
  net/sched: taprio: avoid calling child->ops->dequeue(child) twice
  net/sched: taprio: give higher priority to higher TCs in software
    dequeue mode
  net/sched: taprio: calculate tc gate durations
  net/sched: taprio: rename close_time to end_time
  net/sched: taprio: calculate budgets per traffic class
  net/sched: taprio: calculate guard band against actual TC gate close
    time
  net/sched: make stab available before ops->init() call
  net/sched: taprio: warn about missing size table
  net/sched: keep the max_frm_len information inside struct
    sched_gate_list
  net/sched: taprio: automatically calculate queueMaxSDU based on TC
    gate durations
  net/sched: taprio: split segmentation logic from qdisc_enqueue()
  net/sched: taprio: don't segment unnecessarily

 net/sched/sch_api.c    |  29 +--
 net/sched/sch_taprio.c | 548 ++++++++++++++++++++++++++++-------------
 2 files changed, 383 insertions(+), 194 deletions(-)

-- 
2.34.1



* [RFC PATCH net-next 01/15] net/sched: taprio: delete peek() implementation
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 11:59   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 02/15] net/sched: taprio: continue with other TXQs if one dequeue() failed Vladimir Oltean
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

There isn't any code in the network stack which calls taprio_peek().
We only see qdisc->ops->peek() being called on child qdiscs of other
classful qdiscs, never from the generic qdisc code. taprio, on the
other hand, is never a child qdisc; it is always root.

This snippet of a comment from qdisc_peek_dequeued() seems to confirm:

	/* we can reuse ->gso_skb because peek isn't called for root qdiscs */

Since I've been known to be wrong many times though, I'm not completely
removing it, but leaving a stub function in place which emits a warning.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 43 +-----------------------------------------
 1 file changed, 1 insertion(+), 42 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index f2c585bb0519..375f445c1cfb 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -497,50 +497,9 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	return taprio_enqueue_one(skb, sch, child, to_free);
 }
 
-/* Will not be called in the full offload case, since the TX queues are
- * attached to the Qdisc created using qdisc_create_dflt()
- */
 static struct sk_buff *taprio_peek(struct Qdisc *sch)
 {
-	struct taprio_sched *q = qdisc_priv(sch);
-	struct net_device *dev = qdisc_dev(sch);
-	struct sched_entry *entry;
-	struct sk_buff *skb;
-	u32 gate_mask;
-	int i;
-
-	rcu_read_lock();
-	entry = rcu_dereference(q->current_entry);
-	gate_mask = entry ? entry->gate_mask : TAPRIO_ALL_GATES_OPEN;
-	rcu_read_unlock();
-
-	if (!gate_mask)
-		return NULL;
-
-	for (i = 0; i < dev->num_tx_queues; i++) {
-		struct Qdisc *child = q->qdiscs[i];
-		int prio;
-		u8 tc;
-
-		if (unlikely(!child))
-			continue;
-
-		skb = child->ops->peek(child);
-		if (!skb)
-			continue;
-
-		if (TXTIME_ASSIST_IS_ENABLED(q->flags))
-			return skb;
-
-		prio = skb->priority;
-		tc = netdev_get_prio_tc_map(dev, prio);
-
-		if (!(gate_mask & BIT(tc)))
-			continue;
-
-		return skb;
-	}
-
+	WARN_ONCE(1, "taprio only supports operating as root qdisc, peek() not implemented");
 	return NULL;
 }
 
-- 
2.34.1



* [RFC PATCH net-next 02/15] net/sched: taprio: continue with other TXQs if one dequeue() failed
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
  2023-01-28  1:07 ` [RFC PATCH net-next 01/15] net/sched: taprio: delete peek() implementation Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 11:59   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 03/15] net/sched: taprio: refactor one skb dequeue from TXQ to separate function Vladimir Oltean
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

This changes the handling of an unlikely condition to not stop dequeuing
if taprio failed to dequeue the peeked skb in taprio_dequeue().

I've no idea when this can happen, but the only side effect seems to be
that the atomic_sub_return() call right above will have consumed some
budget. This isn't a big deal, since either that left us without any
budget (and therefore, we'd exit on the next peeked skb anyway), or we
can still send some packets from other TXQs.

I'm making this change because in a future patch I'll be refactoring the
dequeue procedure to simplify it, and this corner case will have to go
away.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 375f445c1cfb..1504fdae723f 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -585,7 +585,7 @@ static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
 
 		skb = child->ops->dequeue(child);
 		if (unlikely(!skb))
-			goto done;
+			continue;
 
 skb_found:
 		qdisc_bstats_update(sch, skb);
-- 
2.34.1



* [RFC PATCH net-next 03/15] net/sched: taprio: refactor one skb dequeue from TXQ to separate function
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
  2023-01-28  1:07 ` [RFC PATCH net-next 01/15] net/sched: taprio: delete peek() implementation Vladimir Oltean
  2023-01-28  1:07 ` [RFC PATCH net-next 02/15] net/sched: taprio: continue with other TXQs if one dequeue() failed Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 11:59   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 04/15] net/sched: taprio: avoid calling child->ops->dequeue(child) twice Vladimir Oltean
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

Future changes will refactor the TXQ selection procedure, and without
some preparation, a lot of stuff would become messy: the indentation of
the bulk of the dequeue procedure would increase, etc.

Break out the bulk of the function into a new one, which knows the TXQ
(child qdisc) we should perform a dequeue from.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 121 +++++++++++++++++++++--------------------
 1 file changed, 63 insertions(+), 58 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 1504fdae723f..fed8ccc000dc 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -510,6 +510,66 @@ static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
 			     atomic64_read(&q->picos_per_byte)));
 }
 
+static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
+					       struct sched_entry *entry,
+					       u32 gate_mask)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct Qdisc *child = q->qdiscs[txq];
+	struct sk_buff *skb;
+	ktime_t guard;
+	int prio;
+	int len;
+	u8 tc;
+
+	if (unlikely(!child))
+		return NULL;
+
+	if (TXTIME_ASSIST_IS_ENABLED(q->flags)) {
+		skb = child->ops->dequeue(child);
+		if (!skb)
+			return NULL;
+		goto skb_found;
+	}
+
+	skb = child->ops->peek(child);
+	if (!skb)
+		return NULL;
+
+	prio = skb->priority;
+	tc = netdev_get_prio_tc_map(dev, prio);
+
+	if (!(gate_mask & BIT(tc)))
+		return NULL;
+
+	len = qdisc_pkt_len(skb);
+	guard = ktime_add_ns(taprio_get_time(q), length_to_duration(q, len));
+
+	/* In the case that there's no gate entry, there's no
+	 * guard band ...
+	 */
+	if (gate_mask != TAPRIO_ALL_GATES_OPEN &&
+	    ktime_after(guard, entry->close_time))
+		return NULL;
+
+	/* ... and no budget. */
+	if (gate_mask != TAPRIO_ALL_GATES_OPEN &&
+	    atomic_sub_return(len, &entry->budget) < 0)
+		return NULL;
+
+	skb = child->ops->dequeue(child);
+	if (unlikely(!skb))
+		return NULL;
+
+skb_found:
+	qdisc_bstats_update(sch, skb);
+	qdisc_qstats_backlog_dec(sch, skb);
+	sch->q.qlen--;
+
+	return skb;
+}
+
 /* Will not be called in the full offload case, since the TX queues are
  * attached to the Qdisc created using qdisc_create_dflt()
  */
@@ -535,64 +595,9 @@ static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
 		goto done;
 
 	for (i = 0; i < dev->num_tx_queues; i++) {
-		struct Qdisc *child = q->qdiscs[i];
-		ktime_t guard;
-		int prio;
-		int len;
-		u8 tc;
-
-		if (unlikely(!child))
-			continue;
-
-		if (TXTIME_ASSIST_IS_ENABLED(q->flags)) {
-			skb = child->ops->dequeue(child);
-			if (!skb)
-				continue;
-			goto skb_found;
-		}
-
-		skb = child->ops->peek(child);
-		if (!skb)
-			continue;
-
-		prio = skb->priority;
-		tc = netdev_get_prio_tc_map(dev, prio);
-
-		if (!(gate_mask & BIT(tc))) {
-			skb = NULL;
-			continue;
-		}
-
-		len = qdisc_pkt_len(skb);
-		guard = ktime_add_ns(taprio_get_time(q),
-				     length_to_duration(q, len));
-
-		/* In the case that there's no gate entry, there's no
-		 * guard band ...
-		 */
-		if (gate_mask != TAPRIO_ALL_GATES_OPEN &&
-		    ktime_after(guard, entry->close_time)) {
-			skb = NULL;
-			continue;
-		}
-
-		/* ... and no budget. */
-		if (gate_mask != TAPRIO_ALL_GATES_OPEN &&
-		    atomic_sub_return(len, &entry->budget) < 0) {
-			skb = NULL;
-			continue;
-		}
-
-		skb = child->ops->dequeue(child);
-		if (unlikely(!skb))
-			continue;
-
-skb_found:
-		qdisc_bstats_update(sch, skb);
-		qdisc_qstats_backlog_dec(sch, skb);
-		sch->q.qlen--;
-
-		goto done;
+		skb = taprio_dequeue_from_txq(sch, i, entry, gate_mask);
+		if (skb)
+			goto done;
 	}
 
 done:
-- 
2.34.1



* [RFC PATCH net-next 04/15] net/sched: taprio: avoid calling child->ops->dequeue(child) twice
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (2 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 03/15] net/sched: taprio: refactor one skb dequeue from TXQ to separate function Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:00   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 05/15] net/sched: taprio: give higher priority to higher TCs in software dequeue mode Vladimir Oltean
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

Simplify taprio_dequeue_from_txq() by noticing that we can goto one call
earlier than the previous skb_found label. This is possible because
we've unified the treatment of the child->ops->dequeue(child) return
value: we always try other TXQs now, instead of abandoning the root
dequeue completely if we failed in the peek() case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index fed8ccc000dc..30741b950b46 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -526,12 +526,8 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 	if (unlikely(!child))
 		return NULL;
 
-	if (TXTIME_ASSIST_IS_ENABLED(q->flags)) {
-		skb = child->ops->dequeue(child);
-		if (!skb)
-			return NULL;
-		goto skb_found;
-	}
+	if (TXTIME_ASSIST_IS_ENABLED(q->flags))
+		goto skip_peek_checks;
 
 	skb = child->ops->peek(child);
 	if (!skb)
@@ -558,11 +554,11 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 	    atomic_sub_return(len, &entry->budget) < 0)
 		return NULL;
 
+skip_peek_checks:
 	skb = child->ops->dequeue(child);
 	if (unlikely(!skb))
 		return NULL;
 
-skb_found:
 	qdisc_bstats_update(sch, skb);
 	qdisc_qstats_backlog_dec(sch, skb);
 	sch->q.qlen--;
-- 
2.34.1



* [RFC PATCH net-next 05/15] net/sched: taprio: give higher priority to higher TCs in software dequeue mode
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (3 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 04/15] net/sched: taprio: avoid calling child->ops->dequeue(child) twice Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:04   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 06/15] net/sched: taprio: calculate tc gate durations Vladimir Oltean
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

Currently taprio iterates over child qdiscs in increasing order of TXQ
index, therefore giving higher xmit priority to TXQ 0 and lower to TXQ N.

However, to the best of my understanding, we should prioritize based on
the traffic class, so we should really dequeue starting with the highest
traffic class and going down from there. We get to the TXQ using the
tc_to_txq[] netdev property.

TXQs within the same TC have the same (strict) priority, so we should
pick from them as fairly as we can. Implement something very similar to
q->curband from multiq_dequeue().
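
To make the tc_to_txq[] iteration concrete, consider a hypothetical
mapping (queue layout invented purely for illustration):

tc qdisc replace dev eth0 parent root taprio \
	num_tc 2 \
	map 0 1 \
	queues 2@0 2@2 \
	...

Here dev->tc_to_txq[1] has offset 2 and count 2, so dequeuing starts
at TC 1 and alternates between TXQ 2 and TXQ 3 through q->cur_txq[1],
before moving down to TC 0 (TXQs 0 and 1).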

Something tells me Vinicius won't like the way in which this patch
interacts with TXTIME_ASSIST_IS_ENABLED(q->flags) and NICs where TXQ 0
really has higher priority than TXQ 1....

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 49 +++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 15 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 30741b950b46..7dbb09b87bc5 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -78,6 +78,7 @@ struct taprio_sched {
 	struct sched_gate_list __rcu *admin_sched;
 	struct hrtimer advance_timer;
 	struct list_head taprio_list;
+	int cur_txq[TC_MAX_QUEUE];
 	u32 max_frm_len[TC_MAX_QUEUE]; /* for the fast path */
 	u32 max_sdu[TC_MAX_QUEUE]; /* for dump and offloading */
 	u32 txtime_delay;
@@ -515,13 +516,10 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 					       u32 gate_mask)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
-	struct net_device *dev = qdisc_dev(sch);
 	struct Qdisc *child = q->qdiscs[txq];
 	struct sk_buff *skb;
 	ktime_t guard;
-	int prio;
 	int len;
-	u8 tc;
 
 	if (unlikely(!child))
 		return NULL;
@@ -533,12 +531,6 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 	if (!skb)
 		return NULL;
 
-	prio = skb->priority;
-	tc = netdev_get_prio_tc_map(dev, prio);
-
-	if (!(gate_mask & BIT(tc)))
-		return NULL;
-
 	len = qdisc_pkt_len(skb);
 	guard = ktime_add_ns(taprio_get_time(q), length_to_duration(q, len));
 
@@ -566,6 +558,16 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 	return skb;
 }
 
+static void taprio_next_tc_txq(struct net_device *dev, int tc, int *txq)
+{
+	int offset = dev->tc_to_txq[tc].offset;
+	int count = dev->tc_to_txq[tc].count;
+
+	(*txq)++;
+	if (*txq == offset + count)
+		*txq = offset;
+}
+
 /* Will not be called in the full offload case, since the TX queues are
  * attached to the Qdisc created using qdisc_create_dflt()
  */
@@ -573,10 +575,11 @@ static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	int num_tc = netdev_get_num_tc(dev);
 	struct sk_buff *skb = NULL;
 	struct sched_entry *entry;
 	u32 gate_mask;
-	int i;
+	int tc;
 
 	rcu_read_lock();
 	entry = rcu_dereference(q->current_entry);
@@ -590,10 +593,24 @@ static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
 	if (!gate_mask)
 		goto done;
 
-	for (i = 0; i < dev->num_tx_queues; i++) {
-		skb = taprio_dequeue_from_txq(sch, i, entry, gate_mask);
-		if (skb)
-			goto done;
+	for (tc = num_tc - 1; tc >= 0; tc--) {
+		int first_txq = q->cur_txq[tc];
+
+		if (!(gate_mask & BIT(tc)))
+			continue;
+
+		/* Select among TXQs belonging to the same TC
+		 * using round robin
+		 */
+		do {
+			skb = taprio_dequeue_from_txq(sch, q->cur_txq[tc],
+						      entry, gate_mask);
+
+			taprio_next_tc_txq(dev, tc, &q->cur_txq[tc]);
+
+			if (skb)
+				goto done;
+		} while (q->cur_txq[tc] != first_txq);
 	}
 
 done:
@@ -1588,10 +1605,12 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		err = netdev_set_num_tc(dev, mqprio->num_tc);
 		if (err)
 			goto free_sched;
-		for (i = 0; i < mqprio->num_tc; i++)
+		for (i = 0; i < mqprio->num_tc; i++) {
 			netdev_set_tc_queue(dev, i,
 					    mqprio->count[i],
 					    mqprio->offset[i]);
+			q->cur_txq[i] = mqprio->offset[i];
+		}
 
 		/* Always use supplied priority mappings */
 		for (i = 0; i <= TC_BITMASK; i++)
-- 
2.34.1



* [RFC PATCH net-next 06/15] net/sched: taprio: calculate tc gate durations
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (4 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 05/15] net/sched: taprio: give higher priority to higher TCs in software dequeue mode Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:05   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 07/15] net/sched: taprio: rename close_time to end_time Vladimir Oltean
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

Current taprio code operates on a very simplistic (and incorrect)
assumption: that egress scheduling for a traffic class can only take
place for the duration of the current interval; in other words, it assumes
that at the end of each schedule entry, there is a "gate close" event
for all traffic classes.

As an example, traffic sent with the schedule below will be jumpy, even
though all 8 TC gates are open, so there is absolutely no "gate close"
event (effectively a transition from BIT(tc)==1 to BIT(tc)==0 in
consecutive schedule entries):

tc qdisc replace dev veth0 parent root taprio \
	num_tc 2 \
	map 0 1 \
	queues 1@0 1@1 \
	base-time 0 \
	sched-entry S 0xff 4000000000 \
	clockid CLOCK_TAI \
	flags 0x0

This qdisc simply does not have what it takes in terms of logic to
*actually* compute the durations of traffic classes. Also, it does not
recognize the need to use this information on a per-traffic-class basis:
it always looks at entry->interval and entry->close_time.

This change proposes that each schedule entry has an array called
tc_gate_duration[tc]. This holds the information: "for how long will
this traffic class gate remain open, starting from *this* schedule
entry". If the traffic class gate is always open, that value is equal to
the cycle time of the schedule.
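
To illustrate with a made-up 3-entry schedule (intervals arbitrary):

	sched-entry S 0x1 300000
	sched-entry S 0x3 500000
	sched-entry S 0x2 200000

Seen from the first entry, the TC 0 gate stays open through the first
two entries and closes at the third, so its tc_gate_duration[0] is
300000 + 500000 = 800000 ns, while tc_gate_duration[1] is 0 (that gate
starts out closed). Seen from the second entry, tc_gate_duration[1] is
500000 + 200000 = 700000 ns, since the TC 1 gate only closes when the
schedule wraps back around to the first entry.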

We'll also need to keep track, for the purpose of the queueMaxSDU[tc]
calculation, of the maximum time duration for which a traffic class
gate stays open. This directly gives us the maximum sized packet that
this traffic class will have to accept. Everything larger has to be
qdisc_drop()ed in qdisc_enqueue().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 55 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 7dbb09b87bc5..fe92a75701bd 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -35,6 +35,10 @@ static LIST_HEAD(taprio_list);
 #define TAPRIO_FLAGS_INVALID U32_MAX
 
 struct sched_entry {
+	/* Durations between this GCL entry and the GCL entry where the
+	 * respective traffic class gate closes
+	 */
+	u64 tc_gate_duration[TC_MAX_QUEUE];
 	struct list_head list;
 
 	/* The instant that this entry "closes" and the next one
@@ -51,6 +55,10 @@ struct sched_entry {
 };
 
 struct sched_gate_list {
+	/* Longest non-zero contiguous gate durations per traffic class,
+	 * or 0 if a traffic class gate never opens during the schedule.
+	 */
+	u64 max_open_tc_gate_duration[TC_MAX_QUEUE];
 	struct rcu_head rcu;
 	struct list_head entries;
 	size_t num_entries;
@@ -89,6 +97,51 @@ struct __tc_taprio_qopt_offload {
 	struct tc_taprio_qopt_offload offload;
 };
 
+static void taprio_calculate_tc_gate_durations(struct taprio_sched *q,
+					       struct sched_gate_list *sched)
+{
+	struct net_device *dev = qdisc_dev(q->root);
+	int num_tc = netdev_get_num_tc(dev);
+	struct sched_entry *entry, *cur;
+	int tc;
+
+	list_for_each_entry(entry, &sched->entries, list) {
+		u32 gates_still_open = entry->gate_mask;
+
+		/* For each traffic class, calculate each open gate duration,
+		 * starting at this schedule entry and ending at the schedule
+		 * entry containing a gate close event for that TC.
+		 */
+		cur = entry;
+
+		do {
+			if (!gates_still_open)
+				break;
+
+			for (tc = 0; tc < num_tc; tc++) {
+				if (!(gates_still_open & BIT(tc)))
+					continue;
+
+				if (cur->gate_mask & BIT(tc))
+					entry->tc_gate_duration[tc] += cur->interval;
+				else
+					gates_still_open &= ~BIT(tc);
+			}
+
+			cur = list_next_entry_circular(cur, &sched->entries, list);
+		} while (cur != entry);
+
+		/* Keep track of the maximum gate duration for each traffic
+		 * class, taking care to not confuse a traffic class which is
+		 * temporarily closed with one that is always closed.
+		 */
+		for (tc = 0; tc < num_tc; tc++)
+			if (entry->tc_gate_duration[tc] &&
+			    sched->max_open_tc_gate_duration[tc] < entry->tc_gate_duration[tc])
+				sched->max_open_tc_gate_duration[tc] = entry->tc_gate_duration[tc];
+	}
+}
+
 static ktime_t sched_base_time(const struct sched_gate_list *sched)
 {
 	if (!sched)
@@ -907,6 +960,8 @@ static int parse_taprio_schedule(struct taprio_sched *q, struct nlattr **tb,
 		new->cycle_time = cycle;
 	}
 
+	taprio_calculate_tc_gate_durations(q, new);
+
 	return 0;
 }
 
-- 
2.34.1



* [RFC PATCH net-next 07/15] net/sched: taprio: rename close_time to end_time
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (5 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 06/15] net/sched: taprio: calculate tc gate durations Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:05   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 08/15] net/sched: taprio: calculate budgets per traffic class Vladimir Oltean
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

There is a terminology confusion in taprio which makes what is called
"close_time" actually be used for 2 things:

1. determining when an entry "closes" such that transmitted skbs are
   never allowed to overrun that time (?!)
2. an aid for determining when to advance and/or restart the schedule
   using the hrtimer

It makes more sense to call this so-called "close_time" "end_time",
because it's not clear at all to me what "closes". Future patches will
hopefully make better use of the term "to close".

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 52 +++++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index fe92a75701bd..d5d284eaab66 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -41,11 +41,11 @@ struct sched_entry {
 	u64 tc_gate_duration[TC_MAX_QUEUE];
 	struct list_head list;
 
-	/* The instant that this entry "closes" and the next one
+	/* The instant that this entry ends and the next one
 	 * should open, the qdisc will make some effort so that no
 	 * packet leaves after this time.
 	 */
-	ktime_t close_time;
+	ktime_t end_time;
 	ktime_t next_txtime;
 	atomic_t budget;
 	int index;
@@ -62,7 +62,7 @@ struct sched_gate_list {
 	struct rcu_head rcu;
 	struct list_head entries;
 	size_t num_entries;
-	ktime_t cycle_close_time;
+	ktime_t cycle_end_time;
 	s64 cycle_time;
 	s64 cycle_time_extension;
 	s64 base_time;
@@ -591,7 +591,7 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 	 * guard band ...
 	 */
 	if (gate_mask != TAPRIO_ALL_GATES_OPEN &&
-	    ktime_after(guard, entry->close_time))
+	    ktime_after(guard, entry->end_time))
 		return NULL;
 
 	/* ... and no budget. */
@@ -678,7 +678,7 @@ static bool should_restart_cycle(const struct sched_gate_list *oper,
 	if (list_is_last(&entry->list, &oper->entries))
 		return true;
 
-	if (ktime_compare(entry->close_time, oper->cycle_close_time) == 0)
+	if (ktime_compare(entry->end_time, oper->cycle_end_time) == 0)
 		return true;
 
 	return false;
@@ -686,7 +686,7 @@ static bool should_restart_cycle(const struct sched_gate_list *oper,
 
 static bool should_change_schedules(const struct sched_gate_list *admin,
 				    const struct sched_gate_list *oper,
-				    ktime_t close_time)
+				    ktime_t end_time)
 {
 	ktime_t next_base_time, extension_time;
 
@@ -695,18 +695,18 @@ static bool should_change_schedules(const struct sched_gate_list *admin,
 
 	next_base_time = sched_base_time(admin);
 
-	/* This is the simple case, the close_time would fall after
+	/* This is the simple case, the end_time would fall after
 	 * the next schedule base_time.
 	 */
-	if (ktime_compare(next_base_time, close_time) <= 0)
+	if (ktime_compare(next_base_time, end_time) <= 0)
 		return true;
 
-	/* This is the cycle_time_extension case, if the close_time
+	/* This is the cycle_time_extension case, if the end_time
 	 * plus the amount that can be extended would fall after the
 	 * next schedule base_time, we can extend the current schedule
 	 * for that amount.
 	 */
-	extension_time = ktime_add_ns(close_time, oper->cycle_time_extension);
+	extension_time = ktime_add_ns(end_time, oper->cycle_time_extension);
 
 	/* FIXME: the IEEE 802.1Q-2018 Specification isn't clear about
 	 * how precisely the extension should be made. So after
@@ -725,7 +725,7 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
 	struct sched_gate_list *oper, *admin;
 	struct sched_entry *entry, *next;
 	struct Qdisc *sch = q->root;
-	ktime_t close_time;
+	ktime_t end_time;
 
 	spin_lock(&q->current_entry_lock);
 	entry = rcu_dereference_protected(q->current_entry,
@@ -744,41 +744,41 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
 	 * entry of all schedules are pre-calculated during the
 	 * schedule initialization.
 	 */
-	if (unlikely(!entry || entry->close_time == oper->base_time)) {
+	if (unlikely(!entry || entry->end_time == oper->base_time)) {
 		next = list_first_entry(&oper->entries, struct sched_entry,
 					list);
-		close_time = next->close_time;
+		end_time = next->end_time;
 		goto first_run;
 	}
 
 	if (should_restart_cycle(oper, entry)) {
 		next = list_first_entry(&oper->entries, struct sched_entry,
 					list);
-		oper->cycle_close_time = ktime_add_ns(oper->cycle_close_time,
-						      oper->cycle_time);
+		oper->cycle_end_time = ktime_add_ns(oper->cycle_end_time,
+						    oper->cycle_time);
 	} else {
 		next = list_next_entry(entry, list);
 	}
 
-	close_time = ktime_add_ns(entry->close_time, next->interval);
-	close_time = min_t(ktime_t, close_time, oper->cycle_close_time);
+	end_time = ktime_add_ns(entry->end_time, next->interval);
+	end_time = min_t(ktime_t, end_time, oper->cycle_end_time);
 
-	if (should_change_schedules(admin, oper, close_time)) {
+	if (should_change_schedules(admin, oper, end_time)) {
 		/* Set things so the next time this runs, the new
 		 * schedule runs.
 		 */
-		close_time = sched_base_time(admin);
+		end_time = sched_base_time(admin);
 		switch_schedules(q, &admin, &oper);
 	}
 
-	next->close_time = close_time;
+	next->end_time = end_time;
 	taprio_set_budget(q, next);
 
 first_run:
 	rcu_assign_pointer(q->current_entry, next);
 	spin_unlock(&q->current_entry_lock);
 
-	hrtimer_set_expires(&q->advance_timer, close_time);
+	hrtimer_set_expires(&q->advance_timer, end_time);
 
 	rcu_read_lock();
 	__netif_schedule(sch);
@@ -1065,8 +1065,8 @@ static int taprio_get_start_time(struct Qdisc *sch,
 	return 0;
 }
 
-static void setup_first_close_time(struct taprio_sched *q,
-				   struct sched_gate_list *sched, ktime_t base)
+static void setup_first_end_time(struct taprio_sched *q,
+				 struct sched_gate_list *sched, ktime_t base)
 {
 	struct sched_entry *first;
 	ktime_t cycle;
@@ -1077,9 +1077,9 @@ static void setup_first_close_time(struct taprio_sched *q,
 	cycle = sched->cycle_time;
 
 	/* FIXME: find a better place to do this */
-	sched->cycle_close_time = ktime_add_ns(base, cycle);
+	sched->cycle_end_time = ktime_add_ns(base, cycle);
 
-	first->close_time = ktime_add_ns(base, first->interval);
+	first->end_time = ktime_add_ns(base, first->interval);
 	taprio_set_budget(q, first);
 	rcu_assign_pointer(q->current_entry, NULL);
 }
@@ -1736,7 +1736,7 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		if (admin)
 			call_rcu(&admin->rcu, taprio_free_sched_cb);
 	} else {
-		setup_first_close_time(q, new_admin, start);
+		setup_first_end_time(q, new_admin, start);
 
 		/* Protects against advance_sched() */
 		spin_lock_irqsave(&q->current_entry_lock, flags);
-- 
2.34.1



* [RFC PATCH net-next 08/15] net/sched: taprio: calculate budgets per traffic class
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (6 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 07/15] net/sched: taprio: rename close_time to end_time Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:05   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 09/15] net/sched: taprio: calculate guard band against actual TC gate close time Vladimir Oltean
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

Currently taprio assumes that the budget for a traffic class expires at
the end of the current interval as if the next interval contains a "gate
close" event for this traffic class.

This is, however, an unfounded assumption. Allow schedule entry
intervals to be fused together for a particular traffic class by
calculating the budget until the gate *actually* closes.

This means we need to keep budgets per traffic class, and we also need
to update the budget consumption procedure.
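
As a back-of-the-envelope example (link speed assumed): at 1 Gbps,
picos_per_byte is 8000, so a traffic class whose gate stays open for a
contiguous 500 us gets a budget of 500000 * 1000 / 8000 = 62500 bytes,
rather than just the budget of the current interval.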

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 57 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 48 insertions(+), 9 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index d5d284eaab66..b3c25ab6a559 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -39,6 +39,7 @@ struct sched_entry {
 	 * respective traffic class gate closes
 	 */
 	u64 tc_gate_duration[TC_MAX_QUEUE];
+	atomic_t budget[TC_MAX_QUEUE];
 	struct list_head list;
 
 	/* The instant that this entry ends and the next one
@@ -47,7 +48,6 @@ struct sched_entry {
 	 */
 	ktime_t end_time;
 	ktime_t next_txtime;
-	atomic_t budget;
 	int index;
 	u32 gate_mask;
 	u32 interval;
@@ -557,14 +557,52 @@ static struct sk_buff *taprio_peek(struct Qdisc *sch)
 	return NULL;
 }
 
-static void taprio_set_budget(struct taprio_sched *q, struct sched_entry *entry)
+static void taprio_set_budgets(struct taprio_sched *q,
+			       struct sched_gate_list *sched,
+			       struct sched_entry *entry)
 {
-	atomic_set(&entry->budget,
-		   div64_u64((u64)entry->interval * PSEC_PER_NSEC,
-			     atomic64_read(&q->picos_per_byte)));
+	struct net_device *dev = qdisc_dev(q->root);
+	int num_tc = netdev_get_num_tc(dev);
+	int tc, budget;
+
+	for (tc = 0; tc < num_tc; tc++) {
+		/* Traffic classes which never close have infinite budget */
+		if (entry->tc_gate_duration[tc] == sched->cycle_time)
+			budget = INT_MAX;
+		else
+			budget = div64_u64((u64)entry->tc_gate_duration[tc] * PSEC_PER_NSEC,
+					   atomic64_read(&q->picos_per_byte));
+
+		atomic_set(&entry->budget[tc], budget);
+	}
+}
+
+/* When an skb is sent, it consumes from the budget of all traffic classes */
+static int taprio_update_budgets(struct sched_entry *entry, size_t len,
+				 int tc_consumed, int num_tc)
+{
+	int tc, budget, new_budget = 0;
+
+	for (tc = 0; tc < num_tc; tc++) {
+		budget = atomic_read(&entry->budget[tc]);
+		/* Don't consume from infinite budget */
+		if (budget == INT_MAX) {
+			if (tc == tc_consumed)
+				new_budget = budget;
+			continue;
+		}
+
+		if (tc == tc_consumed)
+			new_budget = atomic_sub_return(len, &entry->budget[tc]);
+		else
+			atomic_sub(len, &entry->budget[tc]);
+	}
+
+	return new_budget;
 }
 
 static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
+					       int tc, int num_tc,
 					       struct sched_entry *entry,
 					       u32 gate_mask)
 {
@@ -596,7 +634,7 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 
 	/* ... and no budget. */
 	if (gate_mask != TAPRIO_ALL_GATES_OPEN &&
-	    atomic_sub_return(len, &entry->budget) < 0)
+	    taprio_update_budgets(entry, len, tc, num_tc) < 0)
 		return NULL;
 
 skip_peek_checks:
@@ -657,7 +695,8 @@ static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
 		 */
 		do {
 			skb = taprio_dequeue_from_txq(sch, q->cur_txq[tc],
-						      entry, gate_mask);
+						      tc, num_tc, entry,
+						      gate_mask);
 
 			taprio_next_tc_txq(dev, tc, &q->cur_txq[tc]);
 
@@ -772,7 +811,7 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
 	}
 
 	next->end_time = end_time;
-	taprio_set_budget(q, next);
+	taprio_set_budgets(q, oper, next);
 
 first_run:
 	rcu_assign_pointer(q->current_entry, next);
@@ -1080,7 +1119,7 @@ static void setup_first_end_time(struct taprio_sched *q,
 	sched->cycle_end_time = ktime_add_ns(base, cycle);
 
 	first->end_time = ktime_add_ns(base, first->interval);
-	taprio_set_budget(q, first);
+	taprio_set_budgets(q, sched, first);
 	rcu_assign_pointer(q->current_entry, NULL);
 }
 
-- 
2.34.1



* [RFC PATCH net-next 09/15] net/sched: taprio: calculate guard band against actual TC gate close time
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (7 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 08/15] net/sched: taprio: calculate budgets per traffic class Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:05   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 10/15] net/sched: make stab available before ops->init() call Vladimir Oltean
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

taprio_dequeue_from_txq() looks at the entry->end_time to determine
whether the skb will overrun its traffic class gate, as if at the end of
the schedule entry there surely is a "gate close" event for it. Hint:
maybe there isn't.

For each schedule entry, introduce an array of kernel times which
tracks when in the future there will be an *actual* gate close event
for each traffic class, and use that in the guard band overrun
calculation.
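
As a worked example (link speed assumed): at 1 Gbps, a 1500 byte skb
occupies the wire for 12 us, so it may only be dequeued if now + 12000
ns falls before tc_gate_close_time[tc], regardless of how many schedule
entries away that gate close event actually is.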

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 40 ++++++++++++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index b3c25ab6a559..8ec3c0e1f741 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -40,12 +40,12 @@ struct sched_entry {
 	 */
 	u64 tc_gate_duration[TC_MAX_QUEUE];
 	atomic_t budget[TC_MAX_QUEUE];
-	struct list_head list;
-
-	/* The instant that this entry ends and the next one
-	 * should open, the qdisc will make some effort so that no
-	 * packet leaves after this time.
+	/* The qdisc makes some effort so that no packet leaves
+	 * after this time
 	 */
+	ktime_t tc_gate_close_time[TC_MAX_QUEUE];
+	struct list_head list;
+	/* Used to calculate when to advance the schedule */
 	ktime_t end_time;
 	ktime_t next_txtime;
 	int index;
@@ -142,6 +142,12 @@ static void taprio_calculate_tc_gate_durations(struct taprio_sched *q,
 	}
 }
 
+static bool taprio_entry_allows_tx(ktime_t skb_end_time,
+				   struct sched_entry *entry, int tc)
+{
+	return ktime_before(skb_end_time, entry->tc_gate_close_time[tc]);
+}
+
 static ktime_t sched_base_time(const struct sched_gate_list *sched)
 {
 	if (!sched)
@@ -629,7 +635,7 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 	 * guard band ...
 	 */
 	if (gate_mask != TAPRIO_ALL_GATES_OPEN &&
-	    ktime_after(guard, entry->end_time))
+	    !taprio_entry_allows_tx(guard, entry, tc))
 		return NULL;
 
 	/* ... and no budget. */
@@ -761,10 +767,13 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
 {
 	struct taprio_sched *q = container_of(timer, struct taprio_sched,
 					      advance_timer);
+	struct net_device *dev = qdisc_dev(q->root);
 	struct sched_gate_list *oper, *admin;
+	int num_tc = netdev_get_num_tc(dev);
 	struct sched_entry *entry, *next;
 	struct Qdisc *sch = q->root;
 	ktime_t end_time;
+	int tc;
 
 	spin_lock(&q->current_entry_lock);
 	entry = rcu_dereference_protected(q->current_entry,
@@ -802,6 +811,14 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
 	end_time = ktime_add_ns(entry->end_time, next->interval);
 	end_time = min_t(ktime_t, end_time, oper->cycle_end_time);
 
+	for (tc = 0; tc < num_tc; tc++) {
+		if (next->tc_gate_duration[tc] == oper->cycle_time)
+			next->tc_gate_close_time[tc] = KTIME_MAX;
+		else
+			next->tc_gate_close_time[tc] = ktime_add_ns(entry->end_time,
+								    next->tc_gate_duration[tc]);
+	}
+
 	if (should_change_schedules(admin, oper, end_time)) {
 		/* Set things so the next time this runs, the new
 		 * schedule runs.
@@ -1107,8 +1124,11 @@ static int taprio_get_start_time(struct Qdisc *sch,
 static void setup_first_end_time(struct taprio_sched *q,
 				 struct sched_gate_list *sched, ktime_t base)
 {
+	struct net_device *dev = qdisc_dev(q->root);
+	int num_tc = netdev_get_num_tc(dev);
 	struct sched_entry *first;
 	ktime_t cycle;
+	int tc;
 
 	first = list_first_entry(&sched->entries,
 				 struct sched_entry, list);
@@ -1120,6 +1140,14 @@ static void setup_first_end_time(struct taprio_sched *q,
 
 	first->end_time = ktime_add_ns(base, first->interval);
 	taprio_set_budgets(q, sched, first);
+
+	for (tc = 0; tc < num_tc; tc++) {
+		if (first->tc_gate_duration[tc] == sched->cycle_time)
+			first->tc_gate_close_time[tc] = KTIME_MAX;
+		else
+			first->tc_gate_close_time[tc] = ktime_add_ns(base, first->tc_gate_duration[tc]);
+	}
+
 	rcu_assign_pointer(q->current_entry, NULL);
 }
 
-- 
2.34.1



* [RFC PATCH net-next 10/15] net/sched: make stab available before ops->init() call
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (8 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 09/15] net/sched: taprio: calculate guard band against actual TC gate close time Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:06   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 11/15] net/sched: taprio: warn about missing size table Vladimir Oltean
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

Some qdiscs, like taprio, turn out to be pretty reliant on a well
configured stab in order to not underestimate the skb transmission time
(by properly accounting for L1 overhead).

In a future change, taprio will need the stab, if configured by the
user, to be available at ops->init() time.

However, rcu_assign_pointer(sch->stab, stab) is called right after
ops->init(), making it unavailable, and I don't really see a good reason
for that.

Move it earlier, which nicely seems to simplify the error handling path
as well.
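
With this in place, a qdisc could do something along these lines from
its ops->init() (just a sketch with an invented qdisc name, not part of
this patch):

	static int some_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
				   struct netlink_ext_ack *extack)
	{
		/* Now populated (or NULL) before init is called */
		struct qdisc_size_table *stab = rtnl_dereference(sch->stab);
		...
	}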

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_api.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index c14018a8052c..e9780631b5b5 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1282,12 +1282,6 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 	if (err)
 		goto err_out3;
 
-	if (ops->init) {
-		err = ops->init(sch, tca[TCA_OPTIONS], extack);
-		if (err != 0)
-			goto err_out5;
-	}
-
 	if (tca[TCA_STAB]) {
 		stab = qdisc_get_stab(tca[TCA_STAB], extack);
 		if (IS_ERR(stab)) {
@@ -1296,11 +1290,18 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 		}
 		rcu_assign_pointer(sch->stab, stab);
 	}
+
+	if (ops->init) {
+		err = ops->init(sch, tca[TCA_OPTIONS], extack);
+		if (err != 0)
+			goto err_out5;
+	}
+
 	if (tca[TCA_RATE]) {
 		err = -EOPNOTSUPP;
 		if (sch->flags & TCQ_F_MQROOT) {
 			NL_SET_ERR_MSG(extack, "Cannot attach rate estimator to a multi-queue root qdisc");
-			goto err_out4;
+			goto err_out5;
 		}
 
 		err = gen_new_estimator(&sch->bstats,
@@ -1311,7 +1312,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 					tca[TCA_RATE]);
 		if (err) {
 			NL_SET_ERR_MSG(extack, "Failed to generate new estimator");
-			goto err_out4;
+			goto err_out5;
 		}
 	}
 
@@ -1321,6 +1322,8 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 	return sch;
 
 err_out5:
+	qdisc_put_stab(rtnl_dereference(sch->stab));
+err_out4:
 	/* ops->init() failed, we call ->destroy() like qdisc_create_dflt() */
 	if (ops->destroy)
 		ops->destroy(sch);
@@ -1332,16 +1335,6 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 err_out:
 	*errp = err;
 	return NULL;
-
-err_out4:
-	/*
-	 * Any broken qdiscs that would require a ops->reset() here?
-	 * The qdisc was never in action so it shouldn't be necessary.
-	 */
-	qdisc_put_stab(rtnl_dereference(sch->stab));
-	if (ops->destroy)
-		ops->destroy(sch);
-	goto err_out3;
 }
 
 static int qdisc_change(struct Qdisc *sch, struct nlattr **tca,
-- 
2.34.1



* [RFC PATCH net-next 11/15] net/sched: taprio: warn about missing size table
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (9 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 10/15] net/sched: make stab available before ops->init() call Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:06   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 12/15] net/sched: keep the max_frm_len information inside struct sched_gate_list Vladimir Oltean
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

Vinicius intended taprio to take the L1 overhead into account when
estimating packet transmission time, based on user input, specifically
the qdisc size table (man tc-stab).

Something like this:

tc qdisc replace dev $eth root stab overhead 24 taprio \
	num_tc 8 \
	map 0 1 2 3 4 5 6 7 \
	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
	base-time 0 \
	sched-entry S 0x7e 9000000 \
	sched-entry S 0x82 1000000 \
	max-sdu 0 0 0 0 0 0 0 200 \
	flags 0x0 clockid CLOCK_TAI

Without the overhead being specified, transmission times will be
underestimated and will cause late transmissions.
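
For reference, the 24 bytes in the example above are my accounting of
the Ethernet L1 overhead per frame: 7 bytes preamble + 1 byte SFD +
4 bytes FCS + 12 bytes interpacket gap.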

We can't make it mandatory, but we can warn the user with a netlink
extack.

Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220505160357.298794-1-vladimir.oltean@nxp.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 8ec3c0e1f741..d50b2ffe32f6 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -1671,6 +1671,7 @@ static int taprio_new_flags(const struct nlattr *attr, u32 old,
 static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 			 struct netlink_ext_ack *extack)
 {
+	struct qdisc_size_table *stab = rtnl_dereference(sch->stab);
 	struct nlattr *tb[TCA_TAPRIO_ATTR_MAX + 1] = { };
 	struct sched_gate_list *oper, *admin, *new_admin;
 	struct taprio_sched *q = qdisc_priv(sch);
@@ -1823,6 +1824,10 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 	new_admin = NULL;
 	err = 0;
 
+	if (!stab)
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Size table not specified, frame length estimations may be inaccurate");
+
 unlock:
 	spin_unlock_bh(qdisc_lock(sch));
 
-- 
2.34.1



* [RFC PATCH net-next 12/15] net/sched: keep the max_frm_len information inside struct sched_gate_list
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (10 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 11/15] net/sched: taprio: warn about missing size table Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:06   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 13/15] net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations Vladimir Oltean
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

I have one practical reason for doing this and one concerning correctness.

The practical reason has to do with a follow-up patch, which aims to mix
2 sources of max_sdu (one coming from the user and the other automatically
calculated based on TC gate durations at the current link speed). Among those
2 sources of input, we must always select the smaller max_sdu value, but
this can change at various link speeds. So the max_sdu coming from the
user must be kept separated from the value that is operationally used
(the minimum of the 2), because otherwise we overwrite it and forget
what the user asked us to do.

To solve that, this patch proposes that struct sched_gate_list contains
the operationally active max_frm_len, and q->max_sdu contains just what
was requested by the user.

The reason having to do with correctness lies in the following
observation: the admin sched_gate_list becomes operational at a given
base_time in the future. Until then, it is inactive and applies no
shaping, all gates are open, etc. So the queueMaxSDU dropping shouldn't
apply either (this is a mechanism to ensure that packets too large to
fit within the largest gate duration for that TC don't hang the port;
clearly it makes little sense if the gates are always open).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index d50b2ffe32f6..43a8fd92a5a0 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -59,6 +59,7 @@ struct sched_gate_list {
 	 * or 0 if a traffic class gate never opens during the schedule.
 	 */
 	u64 max_open_tc_gate_duration[TC_MAX_QUEUE];
+	u32 max_frm_len[TC_MAX_QUEUE]; /* for the fast path */
 	struct rcu_head rcu;
 	struct list_head entries;
 	size_t num_entries;
@@ -87,8 +88,7 @@ struct taprio_sched {
 	struct hrtimer advance_timer;
 	struct list_head taprio_list;
 	int cur_txq[TC_MAX_QUEUE];
-	u32 max_frm_len[TC_MAX_QUEUE]; /* for the fast path */
-	u32 max_sdu[TC_MAX_QUEUE]; /* for dump and offloading */
+	u32 max_sdu[TC_MAX_QUEUE]; /* save info from the user */
 	u32 txtime_delay;
 };
 
@@ -240,6 +240,21 @@ static int length_to_duration(struct taprio_sched *q, int len)
 	return div_u64(len * atomic64_read(&q->picos_per_byte), PSEC_PER_NSEC);
 }
 
+static void taprio_update_queue_max_sdu(struct taprio_sched *q,
+					struct sched_gate_list *sched)
+{
+	struct net_device *dev = qdisc_dev(q->root);
+	int num_tc = netdev_get_num_tc(dev);
+	int tc;
+
+	for (tc = 0; tc < num_tc; tc++) {
+		if (q->max_sdu[tc])
+			sched->max_frm_len[tc] = q->max_sdu[tc] + dev->hard_header_len;
+		else
+			sched->max_frm_len[tc] = U32_MAX; /* never oversized */
+	}
+}
+
 /* Returns the entry corresponding to next available interval. If
  * validate_interval is set, it only validates whether the timestamp occurs
  * when the gate corresponding to the skb's traffic class is open.
@@ -478,6 +493,7 @@ static int taprio_enqueue_one(struct sk_buff *skb, struct Qdisc *sch,
 {
 	struct taprio_sched *q = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct sched_gate_list *sched;
 	int prio = skb->priority;
 	u8 tc;
 
@@ -493,8 +509,14 @@ static int taprio_enqueue_one(struct sk_buff *skb, struct Qdisc *sch,
 
 	/* Devices with full offload are expected to honor this in hardware */
 	tc = netdev_get_prio_tc_map(dev, prio);
-	if (skb->len > q->max_frm_len[tc])
+
+	rcu_read_lock();
+	sched = rcu_dereference(q->oper_sched);
+	if (sched && skb->len > sched->max_frm_len[tc]) {
+		rcu_read_unlock();
 		return qdisc_drop(skb, sch, to_free);
+	}
+	rcu_read_unlock();
 
 	qdisc_qstats_backlog_inc(sch, skb);
 	sch->q.qlen++;
@@ -1590,7 +1612,6 @@ static int taprio_parse_tc_entries(struct Qdisc *sch,
 				   struct netlink_ext_ack *extack)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
-	struct net_device *dev = qdisc_dev(sch);
 	u32 max_sdu[TC_QOPT_MAX_QUEUE];
 	unsigned long seen_tcs = 0;
 	struct nlattr *n;
@@ -1609,13 +1630,8 @@ static int taprio_parse_tc_entries(struct Qdisc *sch,
 			goto out;
 	}
 
-	for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++) {
+	for (tc = 0; tc < TC_QOPT_MAX_QUEUE; tc++)
 		q->max_sdu[tc] = max_sdu[tc];
-		if (max_sdu[tc])
-			q->max_frm_len[tc] = max_sdu[tc] + dev->hard_header_len;
-		else
-			q->max_frm_len[tc] = U32_MAX; /* never oversized */
-	}
 
 out:
 	return err;
@@ -1756,6 +1772,7 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		goto free_sched;
 
 	taprio_set_picos_per_byte(dev, q);
+	taprio_update_queue_max_sdu(q, new_admin);
 
 	if (FULL_OFFLOAD_IS_ENABLED(q->flags))
 		err = taprio_enable_offload(dev, q, new_admin, extack);
-- 
2.34.1



* [RFC PATCH net-next 13/15] net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (11 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 12/15] net/sched: keep the max_frm_len information inside struct sched_gate_list Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:06   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 14/15] net/sched: taprio: split segmentation logic from qdisc_enqueue() Vladimir Oltean
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

taprio today has a huge problem with small TC gate durations, because it
might accept packets in taprio_enqueue() which will never be sent by
taprio_dequeue().

Since not much infrastructure was available, a kludge was added in
commit 497cc00224cf ("taprio: Handle short intervals and large
packets"), which segmented large TCP segments, but the fact of the
matter is that the issue isn't specific to large TCP segments (and even
worse, the performance penalty in segmenting those is absolutely huge).

In commit a54fc09e4cba ("net/sched: taprio: allow user input of per-tc
max SDU"), taprio gained support for queueMaxSDU, which is precisely the
mechanism through which packets should be dropped at qdisc_enqueue() if
they cannot be sent.

After that patch, it was necessary for the user to manually limit the
maximum MTU per TC. This change adds the necessary logic for taprio to
further limit the values specified (or not specified) by the user to
some minimum values which never allow oversized packets to be sent.
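
As a rough worked example (link speed and schedule assumed): at 1 Gbps,
a TC whose largest contiguous open gate duration is 20 us can transmit
at most duration_to_length(q, 20000) = 2500 bytes per gate event; with
a stab overhead of 24 and an Ethernet hard_header_len of 14, that caps
max_sdu_dynamic at 2500 - 24 - 14 = 2462 bytes, and anything larger is
now dropped at enqueue time.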

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 68 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 59 insertions(+), 9 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 43a8fd92a5a0..7a4c0b70cdc9 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -60,6 +60,7 @@ struct sched_gate_list {
 	 */
 	u64 max_open_tc_gate_duration[TC_MAX_QUEUE];
 	u32 max_frm_len[TC_MAX_QUEUE]; /* for the fast path */
+	u32 max_sdu[TC_MAX_QUEUE]; /* for dump */
 	struct rcu_head rcu;
 	struct list_head entries;
 	size_t num_entries;
@@ -240,18 +241,52 @@ static int length_to_duration(struct taprio_sched *q, int len)
 	return div_u64(len * atomic64_read(&q->picos_per_byte), PSEC_PER_NSEC);
 }
 
+static int duration_to_length(struct taprio_sched *q, u64 duration)
+{
+	return div_u64(duration * PSEC_PER_NSEC, atomic64_read(&q->picos_per_byte));
+}
+
+/* Sets sched->max_sdu[] and sched->max_frm_len[] to the minimum between the
+ * q->max_sdu[] requested by the user and the max_sdu dynamically determined by
+ * the maximum open gate durations at the given link speed.
+ */
 static void taprio_update_queue_max_sdu(struct taprio_sched *q,
-					struct sched_gate_list *sched)
+					struct sched_gate_list *sched,
+					struct qdisc_size_table *stab)
 {
 	struct net_device *dev = qdisc_dev(q->root);
 	int num_tc = netdev_get_num_tc(dev);
+	u32 max_sdu_from_user;
+	u32 max_sdu_dynamic;
+	u32 max_sdu;
 	int tc;
 
 	for (tc = 0; tc < num_tc; tc++) {
-		if (q->max_sdu[tc])
-			sched->max_frm_len[tc] = q->max_sdu[tc] + dev->hard_header_len;
-		else
+		max_sdu_from_user = q->max_sdu[tc] ?: U32_MAX;
+
+		/* TC gate never closes => keep the queueMaxSDU
+		 * selected by the user
+		 */
+		if (sched->max_open_tc_gate_duration[tc] == sched->cycle_time) {
+			max_sdu_dynamic = U32_MAX;
+		} else {
+			u32 max_frm_len;
+
+			max_frm_len = duration_to_length(q, sched->max_open_tc_gate_duration[tc]);
+			if (stab)
+				max_frm_len -= stab->szopts.overhead;
+			max_sdu_dynamic = max_frm_len - dev->hard_header_len;
+		}
+
+		max_sdu = min(max_sdu_dynamic, max_sdu_from_user);
+
+		if (max_sdu != U32_MAX) {
+			sched->max_frm_len[tc] = max_sdu + dev->hard_header_len;
+			sched->max_sdu[tc] = max_sdu;
+		} else {
 			sched->max_frm_len[tc] = U32_MAX; /* never oversized */
+			sched->max_sdu[tc] = 0;
+		}
 	}
 }
 
@@ -1223,6 +1258,8 @@ static int taprio_dev_notifier(struct notifier_block *nb, unsigned long event,
 			       void *ptr)
 {
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct sched_gate_list *oper, *admin;
+	struct qdisc_size_table *stab;
 	struct taprio_sched *q;
 
 	ASSERT_RTNL();
@@ -1235,6 +1272,17 @@ static int taprio_dev_notifier(struct notifier_block *nb, unsigned long event,
 			continue;
 
 		taprio_set_picos_per_byte(dev, q);
+
+		stab = rtnl_dereference(q->root->stab);
+
+		oper = rtnl_dereference(q->oper_sched);
+		if (oper)
+			taprio_update_queue_max_sdu(q, oper, stab);
+
+		admin = rtnl_dereference(q->admin_sched);
+		if (admin)
+			taprio_update_queue_max_sdu(q, admin, stab);
+
 		break;
 	}
 
@@ -1625,7 +1673,8 @@ static int taprio_parse_tc_entries(struct Qdisc *sch,
 		if (nla_type(n) != TCA_TAPRIO_ATTR_TC_ENTRY)
 			continue;
 
-		err = taprio_parse_tc_entry(sch, n, max_sdu, &seen_tcs, extack);
+		err = taprio_parse_tc_entry(sch, n, max_sdu, &seen_tcs,
+					    extack);
 		if (err)
 			goto out;
 	}
@@ -1772,7 +1821,7 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		goto free_sched;
 
 	taprio_set_picos_per_byte(dev, q);
-	taprio_update_queue_max_sdu(q, new_admin);
+	taprio_update_queue_max_sdu(q, new_admin, stab);
 
 	if (FULL_OFFLOAD_IS_ENABLED(q->flags))
 		err = taprio_enable_offload(dev, q, new_admin, extack);
@@ -2110,7 +2159,8 @@ static int dump_schedule(struct sk_buff *msg,
 	return -1;
 }
 
-static int taprio_dump_tc_entries(struct taprio_sched *q, struct sk_buff *skb)
+static int taprio_dump_tc_entries(struct sk_buff *skb,
+				  struct sched_gate_list *sched)
 {
 	struct nlattr *n;
 	int tc;
@@ -2124,7 +2174,7 @@ static int taprio_dump_tc_entries(struct taprio_sched *q, struct sk_buff *skb)
 			goto nla_put_failure;
 
 		if (nla_put_u32(skb, TCA_TAPRIO_TC_ENTRY_MAX_SDU,
-				q->max_sdu[tc]))
+				sched->max_sdu[tc]))
 			goto nla_put_failure;
 
 		nla_nest_end(skb, n);
@@ -2175,7 +2225,7 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
 		goto options_error;
 
-	if (taprio_dump_tc_entries(q, skb))
+	if (oper && taprio_dump_tc_entries(skb, oper))
 		goto options_error;
 
 	if (oper && dump_schedule(skb, oper))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH net-next 14/15] net/sched: taprio: split segmentation logic from qdisc_enqueue()
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (12 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 13/15] net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:07   ` Kurt Kanzenbach
  2023-01-28  1:07 ` [RFC PATCH net-next 15/15] net/sched: taprio: don't segment unnecessarily Vladimir Oltean
  2023-01-28 12:20 ` [RFC PATCH net-next 00/15] taprio fixprovements Kurt Kanzenbach
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

The majority of the taprio_enqueue() function is spent doing TCP
segmentation, which doesn't look right to me. Compilers shouldn't have a
problem inlining code no matter how we write it, so move the
segmentation logic to a separate function.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 66 +++++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 30 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 7a4c0b70cdc9..cc11787dc62a 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -559,6 +559,40 @@ static int taprio_enqueue_one(struct sk_buff *skb, struct Qdisc *sch,
 	return qdisc_enqueue(skb, child, to_free);
 }
 
+static int taprio_enqueue_segmented(struct sk_buff *skb, struct Qdisc *sch,
+				    struct Qdisc *child,
+				    struct sk_buff **to_free)
+{
+	unsigned int slen = 0, numsegs = 0, len = qdisc_pkt_len(skb);
+	netdev_features_t features = netif_skb_features(skb);
+	struct sk_buff *segs, *nskb;
+	int ret;
+
+	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+	if (IS_ERR_OR_NULL(segs))
+		return qdisc_drop(skb, sch, to_free);
+
+	skb_list_walk_safe(segs, segs, nskb) {
+		skb_mark_not_on_list(segs);
+		qdisc_skb_cb(segs)->pkt_len = segs->len;
+		slen += segs->len;
+
+		ret = taprio_enqueue_one(segs, sch, child, to_free);
+		if (ret != NET_XMIT_SUCCESS) {
+			if (net_xmit_drop_count(ret))
+				qdisc_qstats_drop(sch);
+		} else {
+			numsegs++;
+		}
+	}
+
+	if (numsegs > 1)
+		qdisc_tree_reduce_backlog(sch, 1 - numsegs, len - slen);
+	consume_skb(skb);
+
+	return numsegs > 0 ? NET_XMIT_SUCCESS : NET_XMIT_DROP;
+}
+
 /* Will not be called in the full offload case, since the TX queues are
  * attached to the Qdisc created using qdisc_create_dflt()
  */
@@ -580,36 +614,8 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	 * smaller chunks. Drivers with full offload are expected to handle
 	 * this in hardware.
 	 */
-	if (skb_is_gso(skb)) {
-		unsigned int slen = 0, numsegs = 0, len = qdisc_pkt_len(skb);
-		netdev_features_t features = netif_skb_features(skb);
-		struct sk_buff *segs, *nskb;
-		int ret;
-
-		segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
-		if (IS_ERR_OR_NULL(segs))
-			return qdisc_drop(skb, sch, to_free);
-
-		skb_list_walk_safe(segs, segs, nskb) {
-			skb_mark_not_on_list(segs);
-			qdisc_skb_cb(segs)->pkt_len = segs->len;
-			slen += segs->len;
-
-			ret = taprio_enqueue_one(segs, sch, child, to_free);
-			if (ret != NET_XMIT_SUCCESS) {
-				if (net_xmit_drop_count(ret))
-					qdisc_qstats_drop(sch);
-			} else {
-				numsegs++;
-			}
-		}
-
-		if (numsegs > 1)
-			qdisc_tree_reduce_backlog(sch, 1 - numsegs, len - slen);
-		consume_skb(skb);
-
-		return numsegs > 0 ? NET_XMIT_SUCCESS : NET_XMIT_DROP;
-	}
+	if (skb_is_gso(skb))
+		return taprio_enqueue_segmented(skb, sch, child, to_free);
 
 	return taprio_enqueue_one(skb, sch, child, to_free);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [RFC PATCH net-next 15/15] net/sched: taprio: don't segment unnecessarily
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (13 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 14/15] net/sched: taprio: split segmentation logic from qdisc_enqueue() Vladimir Oltean
@ 2023-01-28  1:07 ` Vladimir Oltean
  2023-01-28 12:07   ` Kurt Kanzenbach
  2023-01-28 12:20 ` [RFC PATCH net-next 00/15] taprio fixprovements Kurt Kanzenbach
  15 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-28  1:07 UTC (permalink / raw)
  To: netdev; +Cc: Vinicius Costa Gomes, Kurt Kanzenbach

Improve commit 497cc00224cf ("taprio: Handle short intervals and large
packets") to only perform segmentation when skb->len exceeds what
taprio_dequeue() expects.

In practice, this will make the biggest difference when a traffic class
gate is always open in the schedule. This is because the max_frm_len
will be U32_MAX, and such large skb->len values as Kurt reported will be
sent just fine unsegmented.

What I don't yet know how to handle is making sure that the segmented
skbs themselves are smaller than the maximum frame size given by the
current queueMaxSDU[tc]. Nonetheless, we still need to drop the
oversized segments, otherwise the Qdisc will hang.
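
One possible direction (an untested sketch of mine, not part of this
patch, and only meaningful for TCP GSO packets) would be to clamp the
GSO segment size before calling skb_gso_segment(), so that each
resulting segment fits within max_frm_len[tc]. Something along these
lines, glossing over the exact protocol header accounting:

	/* Hypothetical helper: shrink gso_size so that the segments
	 * produced by skb_gso_segment() fit within the maximum frame
	 * length of the traffic class.
	 */
	static void taprio_clamp_gso_size(struct sk_buff *skb, u32 max_frm_len)
	{
		unsigned int hdr_len = skb_tcp_all_headers(skb);

		if (max_frm_len != U32_MAX && max_frm_len > hdr_len)
			skb_shinfo(skb)->gso_size = min_t(unsigned int,
							  skb_shinfo(skb)->gso_size,
							  max_frm_len - hdr_len);
	}

Whether something like that interacts sanely with the rest of the GSO
machinery is exactly the open question above.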

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 67 ++++++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index cc11787dc62a..e0bd613a6415 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -527,10 +527,6 @@ static int taprio_enqueue_one(struct sk_buff *skb, struct Qdisc *sch,
 			      struct Qdisc *child, struct sk_buff **to_free)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
-	struct net_device *dev = qdisc_dev(sch);
-	struct sched_gate_list *sched;
-	int prio = skb->priority;
-	u8 tc;
 
 	/* sk_flags are only safe to use on full sockets. */
 	if (skb->sk && sk_fullsock(skb->sk) && sock_flag(skb->sk, SOCK_TXTIME)) {
@@ -542,17 +538,6 @@ static int taprio_enqueue_one(struct sk_buff *skb, struct Qdisc *sch,
 			return qdisc_drop(skb, sch, to_free);
 	}
 
-	/* Devices with full offload are expected to honor this in hardware */
-	tc = netdev_get_prio_tc_map(dev, prio);
-
-	rcu_read_lock();
-	sched = rcu_dereference(q->oper_sched);
-	if (sched && skb->len > sched->max_frm_len[tc]) {
-		rcu_read_unlock();
-		return qdisc_drop(skb, sch, to_free);
-	}
-	rcu_read_unlock();
-
 	qdisc_qstats_backlog_inc(sch, skb);
 	sch->q.qlen++;
 
@@ -565,19 +550,34 @@ static int taprio_enqueue_segmented(struct sk_buff *skb, struct Qdisc *sch,
 {
 	unsigned int slen = 0, numsegs = 0, len = qdisc_pkt_len(skb);
 	netdev_features_t features = netif_skb_features(skb);
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct sched_gate_list *sched;
 	struct sk_buff *segs, *nskb;
-	int ret;
+	int tc, ret;
+
+	tc = netdev_get_prio_tc_map(dev, skb->priority);
 
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
 	if (IS_ERR_OR_NULL(segs))
 		return qdisc_drop(skb, sch, to_free);
 
+	rcu_read_lock();
+	sched = rcu_dereference(q->oper_sched);
+
 	skb_list_walk_safe(segs, segs, nskb) {
 		skb_mark_not_on_list(segs);
 		qdisc_skb_cb(segs)->pkt_len = segs->len;
 		slen += segs->len;
 
-		ret = taprio_enqueue_one(segs, sch, child, to_free);
+		/* FIXME: we should be segmenting to a smaller size
+		 * rather than dropping these
+		 */
+		if (sched && segs->len > sched->max_frm_len[tc])
+			ret = qdisc_drop(segs, sch, to_free);
+		else
+			ret = taprio_enqueue_one(segs, sch, child, to_free);
+
 		if (ret != NET_XMIT_SUCCESS) {
 			if (net_xmit_drop_count(ret))
 				qdisc_qstats_drop(sch);
@@ -586,6 +586,8 @@ static int taprio_enqueue_segmented(struct sk_buff *skb, struct Qdisc *sch,
 		}
 	}
 
+	rcu_read_unlock();
+
 	if (numsegs > 1)
 		qdisc_tree_reduce_backlog(sch, 1 - numsegs, len - slen);
 	consume_skb(skb);
@@ -600,8 +602,11 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			  struct sk_buff **to_free)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	struct sched_gate_list *sched;
+	int prio = skb->priority;
 	struct Qdisc *child;
-	int queue;
+	int tc, queue;
 
 	queue = skb_get_queue_mapping(skb);
 
@@ -609,13 +614,25 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	if (unlikely(!child))
 		return qdisc_drop(skb, sch, to_free);
 
-	/* Large packets might not be transmitted when the transmission duration
-	 * exceeds any configured interval. Therefore, segment the skb into
-	 * smaller chunks. Drivers with full offload are expected to handle
-	 * this in hardware.
-	 */
-	if (skb_is_gso(skb))
-		return taprio_enqueue_segmented(skb, sch, child, to_free);
+	/* Devices with full offload are expected to honor this in hardware */
+	tc = netdev_get_prio_tc_map(dev, prio);
+
+	rcu_read_lock();
+	sched = rcu_dereference(q->oper_sched);
+	if (sched && skb->len > sched->max_frm_len[tc]) {
+		rcu_read_unlock();
+		/* Large packets might not be transmitted when the transmission duration
+		 * exceeds any configured interval. Therefore, segment the skb into
+		 * smaller chunks. Drivers with full offload are expected to handle
+		 * this in hardware.
+		 */
+		if (skb_is_gso(skb))
+			return taprio_enqueue_segmented(skb, sch, child,
+							to_free);
+
+		return qdisc_drop(skb, sch, to_free);
+	}
+	rcu_read_unlock();
 
 	return taprio_enqueue_one(skb, sch, child, to_free);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 01/15] net/sched: taprio: delete peek() implementation
  2023-01-28  1:07 ` [RFC PATCH net-next 01/15] net/sched: taprio: delete peek() implementation Vladimir Oltean
@ 2023-01-28 11:59   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 11:59 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> There isn't any code in the network stack which calls taprio_peek().
> We only see qdisc->ops->peek() being called on child qdiscs of other
> classful qdiscs, never from the generic qdisc code. Whereas taprio is
> never a child qdisc, it is always root.
>
> This snippet of a comment from qdisc_peek_dequeued() seems to confirm:
>
> 	/* we can reuse ->gso_skb because peek isn't called for root qdiscs */
>
> Since I've been known to be wrong many times though, I'm not completely
> removing it, but leaving a stub function in place which emits a warning.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 02/15] net/sched: taprio: continue with other TXQs if one dequeue() failed
  2023-01-28  1:07 ` [RFC PATCH net-next 02/15] net/sched: taprio: continue with other TXQs if one dequeue() failed Vladimir Oltean
@ 2023-01-28 11:59   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 11:59 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> This changes the handling of an unlikely condition to not stop dequeuing
> if taprio failed to dequeue the peeked skb in taprio_dequeue().
>
> I've no idea when this can happen, but the only side effect seems to be
> that the atomic_sub_return() call right above will have consumed some
> budget. This isn't a big deal, since either that made us remain without
> any budget (and therefore, we'd exit on the next peeked skb anyway), or
> we could send some packets from other TXQs.
>
> I'm making this change because in a future patch I'll be refactoring the
> dequeue procedure to simplify it, and this corner case will have to go
> away.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 03/15] net/sched: taprio: refactor one skb dequeue from TXQ to separate function
  2023-01-28  1:07 ` [RFC PATCH net-next 03/15] net/sched: taprio: refactor one skb dequeue from TXQ to separate function Vladimir Oltean
@ 2023-01-28 11:59   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 11:59 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> Future changes will refactor the TXQ selection procedure, and a lot of
> stuff would become messy: the indentation of the bulk of the dequeue
> procedure would increase, etc.
>
> Break out the bulk of the function into a new one, which knows the TXQ
> (child qdisc) we should perform a dequeue from.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 04/15] net/sched: taprio: avoid calling child->ops->dequeue(child) twice
  2023-01-28  1:07 ` [RFC PATCH net-next 04/15] net/sched: taprio: avoid calling child->ops->dequeue(child) twice Vladimir Oltean
@ 2023-01-28 12:00   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:00 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> Simplify taprio_dequeue_from_txq() by noticing that we can goto one call
> earlier than the previous skb_found label. This is possible because
> we've unified the treatment of the child->ops->dequeue(child) return
> call: we always try other TXQs now, instead of abandoning the root
> dequeue completely if we failed in the peek() case.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 05/15] net/sched: taprio: give higher priority to higher TCs in software dequeue mode
  2023-01-28  1:07 ` [RFC PATCH net-next 05/15] net/sched: taprio: give higher priority to higher TCs in software dequeue mode Vladimir Oltean
@ 2023-01-28 12:04   ` Kurt Kanzenbach
  2023-01-29 13:17     ` Vladimir Oltean
  0 siblings, 1 reply; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:04 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> Currently taprio iterates over child qdiscs in increasing order of TXQ
> index, therefore giving higher xmit priority to TXQ 0 and lower to TXQ N.
>
> However, to the best of my understanding, we should prioritize based on
> the traffic class, so we should really dequeue starting with the highest
> traffic class and going down from there. We get to the TXQ using the
> tc_to_txq[] netdev property.
>
> TXQs within the same TC have the same (strict) priority, so we should
> pick from them as fairly as we can. Implement something very similar to
> q->curband from multiq_dequeue().

Totally makes sense to me...

>
> Something tells me Vinicius won't like the way in which this patch
> interacts with TXTIME_ASSIST_IS_ENABLED(q->flags) and NICs where TXQ 0
> really has higher priority than TXQ 1....

However, this change may be problematic for i210/i225/i226 NIC(s).

AFAIK the Tx queue priorities for i225/i226 are configurable. Meaning
the default could be adjusted to have Tx queue 4 with higher priority
than 3 and so on. For i210 I don't know. Also Tx Launch Time only works
for the lower queues. Hm.

Thanks,
Kurt

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 06/15] net/sched: taprio: calculate tc gate durations
  2023-01-28  1:07 ` [RFC PATCH net-next 06/15] net/sched: taprio: calculate tc gate durations Vladimir Oltean
@ 2023-01-28 12:05   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:05 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> Current taprio code operates on a very simplistic (and incorrect)
> assumption: that egress scheduling for a traffic class can only take
> place for the duration of the current interval, or i.o.w., it assumes
> that at the end of each schedule entry, there is a "gate close" event
> for all traffic classes.
>
> As an example, traffic sent with the schedule below will be jumpy, even
> though all 8 TC gates are open, so there is absolutely no "gate close"
> event (effectively a transition from BIT(tc)==1 to BIT(tc)==0 in
> consecutive schedule entries):
>
> tc qdisc replace dev veth0 parent root taprio \
> 	num_tc 2 \
> 	map 0 1 \
> 	queues 1@0 1@1 \
> 	base-time 0 \
> 	sched-entry S 0xff 4000000000 \
> 	clockid CLOCK_TAI \
> 	flags 0x0
>
> This qdisc simply does not have what it takes in terms of logic to
> *actually* compute the durations of traffic classes. Also, it does not
> recognize the need to use this information on a per-traffic-class basis:
> it always looks at entry->interval and entry->close_time.
>
> This change proposes that each schedule entry has an array called
> tc_gate_duration[tc]. This holds the information: "for how long will
> this traffic class gate remain open, starting from *this* schedule
> entry". If the traffic class gate is always open, that value is equal to
> the cycle time of the schedule.
>
> We'll also need to keep track, for the purpose of queueMaxSDU[tc]
> calculation, of the maximum time duration for a traffic class
> having an open gate. This directly gives us the maximum packet size
> that this traffic class will have to accept. Everything larger has to
> be dropped via qdisc_drop() in qdisc_enqueue().
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 07/15] net/sched: taprio: rename close_time to end_time
  2023-01-28  1:07 ` [RFC PATCH net-next 07/15] net/sched: taprio: rename close_time to end_time Vladimir Oltean
@ 2023-01-28 12:05   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:05 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> There is a confusion of terms in taprio which means that what is
> called "close_time" is actually used for 2 things:
>
> 1. determining when an entry "closes" such that transmitted skbs are
>    never allowed to overrun that time (?!)
> 2. an aid for determining when to advance and/or restart the schedule
>    using the hrtimer
>
> It makes more sense to call this so-called "close_time" "end_time",
> because it's not clear at all to me what "closes". Future patches will
> hopefully make better use of the term "to close".
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 08/15] net/sched: taprio: calculate budgets per traffic class
  2023-01-28  1:07 ` [RFC PATCH net-next 08/15] net/sched: taprio: calculate budgets per traffic class Vladimir Oltean
@ 2023-01-28 12:05   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:05 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> Currently taprio assumes that the budget for a traffic class expires at
> the end of the current interval as if the next interval contains a "gate
> close" event for this traffic class.
>
> This is, however, an unfounded assumption. Allow schedule entry
> intervals to be fused together for a particular traffic class by
> calculating the budget until the gate *actually* closes.
>
> This means we need to keep budgets per traffic class, and we also need
> to update the budget consumption procedure.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 09/15] net/sched: taprio: calculate guard band against actual TC gate close time
  2023-01-28  1:07 ` [RFC PATCH net-next 09/15] net/sched: taprio: calculate guard band against actual TC gate close time Vladimir Oltean
@ 2023-01-28 12:05   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:05 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> taprio_dequeue_from_txq() looks at the entry->end_time to determine
> whether the skb will overrun its traffic class gate, as if at the end of
> the schedule entry there surely is a "gate close" event for it. Hint:
> maybe there isn't.
>
> For each schedule entry, introduce an array of kernel times which
> actually tracks when in the future will there be an *actual* gate close
> event for that traffic class, and use that in the guard band overrun
> calculation.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 10/15] net/sched: make stab available before ops->init() call
  2023-01-28  1:07 ` [RFC PATCH net-next 10/15] net/sched: make stab available before ops->init() call Vladimir Oltean
@ 2023-01-28 12:06   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:06 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> Some qdiscs like taprio turn out to be actually pretty reliant on a well
> configured stab, to not underestimate the skb transmission time (by
> properly accounting for L1 overhead).
>
> In a future change, taprio will need the stab, if configured by the
> user, to be available at ops->init() time.
>
> However, rcu_assign_pointer(sch->stab, stab) is called right after
> ops->init(), making it unavailable, and I don't really see a good reason
> for that.
>
> Move it earlier, which nicely seems to simplify the error handling path
> as well.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 11/15] net/sched: taprio: warn about missing size table
  2023-01-28  1:07 ` [RFC PATCH net-next 11/15] net/sched: taprio: warn about missing size table Vladimir Oltean
@ 2023-01-28 12:06   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:06 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> Vinicius intended taprio to take the L1 overhead into account when
> estimating packet transmission time, based on user input, specifically
> the qdisc size table (man tc-stab).
>
> Something like this:
>
> tc qdisc replace dev $eth root stab overhead 24 taprio \
> 	num_tc 8 \
> 	map 0 1 2 3 4 5 6 7 \
> 	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
> 	base-time 0 \
> 	sched-entry S 0x7e 9000000 \
> 	sched-entry S 0x82 1000000 \
> 	max-sdu 0 0 0 0 0 0 0 200 \
> 	flags 0x0 clockid CLOCK_TAI
>
> Without the overhead being specified, transmission times will be
> underestimated and will cause late transmissions.
>
> We can't make it mandatory, but we can warn the user with a netlink
> extack.
>
> Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220505160357.298794-1-vladimir.oltean@nxp.com/
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 12/15] net/sched: keep the max_frm_len information inside struct sched_gate_list
  2023-01-28  1:07 ` [RFC PATCH net-next 12/15] net/sched: keep the max_frm_len information inside struct sched_gate_list Vladimir Oltean
@ 2023-01-28 12:06   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:06 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> I have one practical reason for doing this and one concerning correctness.
>
> The practical reason has to do with a follow-up patch, which aims to mix
> 2 sources of max_sdu (one coming from the user and the other automatically
> calculated based on TC gate durations @current link speed). Among those
> 2 sources of input, we must always select the smaller max_sdu value, but
> this can change at various link speeds. So the max_sdu coming from the
> user must be kept separated from the value that is operationally used
> (the minimum of the 2), because otherwise we overwrite it and forget
> what the user asked us to do.
>
> To solve that, this patch proposes that struct sched_gate_list contains
> the operationally active max_frm_len, and q->max_sdu contains just what
> was requested by the user.
>
> The reason having to do with correctness lies in the following
> observation: the admin sched_gate_list becomes operational at a given
> base_time in the future. Until then, it is inactive and applies no
> shaping, all gates are open, etc. So the queueMaxSDU dropping shouldn't
> apply either (this is a mechanism to ensure that packets too large to
> fit within the largest gate duration for that TC don't hang the port;
> clearly it makes little sense if the gates are always open).
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 13/15] net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations
  2023-01-28  1:07 ` [RFC PATCH net-next 13/15] net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations Vladimir Oltean
@ 2023-01-28 12:06   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:06 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> taprio today has a huge problem with small TC gate durations, because it
> might accept packets in taprio_enqueue() which will never be sent by
> taprio_dequeue().
>
> Since not much infrastructure was available, a kludge was added in
> commit 497cc00224cf ("taprio: Handle short intervals and large
> packets"), which segmented large TCP segments, but the fact of the
> matter is that the issue isn't specific to large TCP segments (and even
> worse, the performance penalty in segmenting those is absolutely huge).
>
> In commit a54fc09e4cba ("net/sched: taprio: allow user input of per-tc
> max SDU"), taprio gained support for queueMaxSDU, which is precisely the
> mechanism through which packets should be dropped at qdisc_enqueue() if
> they cannot be sent.
>
> After that patch, it was necessary for the user to manually limit the
> maximum MTU per TC. This change adds the necessary logic for taprio to
> further limit the values specified (or not specified) by the user to
> some minimum values which never allow oversized packets to be sent.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 14/15] net/sched: taprio: split segmentation logic from qdisc_enqueue()
  2023-01-28  1:07 ` [RFC PATCH net-next 14/15] net/sched: taprio: split segmentation logic from qdisc_enqueue() Vladimir Oltean
@ 2023-01-28 12:07   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:07 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> The majority of the taprio_enqueue() function is spent doing TCP
> segmentation, which doesn't look right to me. Compilers shouldn't have a
> problem inlining code no matter how we write it, so move the
> segmentation logic to a separate function.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 15/15] net/sched: taprio: don't segment unnecessarily
  2023-01-28  1:07 ` [RFC PATCH net-next 15/15] net/sched: taprio: don't segment unnecessarily Vladimir Oltean
@ 2023-01-28 12:07   ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:07 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> Improve commit 497cc00224cf ("taprio: Handle short intervals and large
> packets") to only perform segmentation when skb->len exceeds what
> taprio_dequeue() expects.
>
> In practice, this will make the biggest difference when a traffic class
> gate is always open in the schedule. This is because the max_frm_len
> will be U32_MAX, and such large skb->len values as Kurt reported will be
> sent just fine unsegmented.
>
> What I don't yet know how to handle is making sure that the segmented
> skbs themselves are smaller than the maximum frame size given by the
> current queueMaxSDU[tc]. Nonetheless, we still need to drop the
> oversized segments, otherwise the Qdisc will hang.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 00/15] taprio fixprovements
  2023-01-28  1:07 [RFC PATCH net-next 00/15] taprio fixprovements Vladimir Oltean
                   ` (14 preceding siblings ...)
  2023-01-28  1:07 ` [RFC PATCH net-next 15/15] net/sched: taprio: don't segment unnecessarily Vladimir Oltean
@ 2023-01-28 12:20 ` Kurt Kanzenbach
  15 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-28 12:20 UTC (permalink / raw)
  To: Vladimir Oltean, netdev; +Cc: Vinicius Costa Gomes

On Sat Jan 28 2023, Vladimir Oltean wrote:
> I started to pull on a small thread and the whole thing unraveled :(
>
> While trying to ignite a more serious discussion about how the i225/i226
> hardware prioritization model seems to have affected the generic taprio
> software implementation (patch 05/15), I noticed 2 things:
> - taprio_peek() is dead code (patch 01/15)
> - taprio has a ridiculously low iperf3 performance when all gates are
>   open and it behaves as a work-conserving qdisc. Patches 06/15 -> 09/15
>   and 13/15 -> 15/15 collectively work to address some of that.
>
> I had to put a hard stop for today (and at the patch limit of 15), but
> now that taprio calculates the durations of contiguously open TC gates,
> part 2 would be the communication of this information to offloading
> drivers via ndo_setup_tc(), and the deletion of duplicated logic from
> vsc9959_tas_guard_bands_update(). But that's for another day - I'm not
> quite sure how that's going to work out. The gate durations change at
> each link speed change, and this might mean that reoffloading is
> necessary.
>
> Another huge issue I'm seeing at small intervals with software
> scheduling is simply the amount of RCU stalls. I can't get Kurt's
> schedule from commit 497cc00224cf ("taprio: Handle short intervals
> and large packets") to work reliably on my system even without these
> patches. Eventually the system dies unless I increase the entry
> intervals from the posted 500 us - my CPUs just don't do much of
> anything else. Maybe someone has an idea what to do.

Thanks for investing the time and improving the software
scheduling. Especially the calculations of TC durations and
incorporating the max. frame lengths in a better way. I went over this
series and it looks good to me. Except for one patch.

Thanks,
Kurt

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 05/15] net/sched: taprio: give higher priority to higher TCs in software dequeue mode
  2023-01-28 12:04   ` Kurt Kanzenbach
@ 2023-01-29 13:17     ` Vladimir Oltean
  2023-01-31  9:37       ` Kurt Kanzenbach
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2023-01-29 13:17 UTC (permalink / raw)
  To: Kurt Kanzenbach; +Cc: netdev, Vinicius Costa Gomes

On Sat, Jan 28, 2023 at 01:04:37PM +0100, Kurt Kanzenbach wrote:
> > Something tells me Vinicius won't like the way in which this patch
> > interacts with TXTIME_ASSIST_IS_ENABLED(q->flags) and NICs where TXQ 0
> > really has higher priority than TXQ 1....
> 
> However, this change may be problematic for i210/i225/i226 NIC(s).
> 
> AFAIK the Tx queue priorities for i225/i226 are configurable. Meaning
> the default could be adjusted to have Tx queue 4 with higher priority
> than 3 and so on. For i210 I don't know.

Assuming the TX ring priorities aren't configurable, it's just a matter
of mapping the Linux traffic classes to the TXQs correctly, am I right?
So TC 3 to TXQ 0, TC 2 to TXQ 1, TC 1 to TXQ 2 and TC 0 to TXQ 3?
What prevents Intel from telling users to do just that?
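
Something like this (illustrative only, untested by me) is the kind of
queue configuration I have in mind, inverting the TC-to-TXQ mapping so
that the highest TC lands on TXQ 0:

tc qdisc replace dev $eth root taprio \
	num_tc 4 \
	map 0 1 2 3 \
	queues 1@3 1@2 1@1 1@0 \
	base-time 0 \
	sched-entry S 0xf 1000000 \
	clockid CLOCK_TAI \
	flags 0x0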

I see neither igb nor igc implement mqprio offloading, but you're saying
they have intrinsic strict prioritization. So the default configuration
is already a problem even without taprio/mqprio, isn't it? How many TXQs
is the stack aware of? All of them? Since neither igb nor igc implement
ndo_select_queue(), the network stack performs skb_tx_hash() per packet
and selects a TX queue of arbitrary hardware priority, or am I missing
something?

My knee-jerk reaction is: when we say "this change may be problematic
for X/Y/Z", aren't these plain configuration bugs that we're so
cautiously defending?

Anyway, I also need to be very realistic about what's possible for me to
change and how far I'm willing to go, and Vinicius made it pretty clear
that the existing taprio/mqprio configurations should be kept "working"
(given an arbitrary definition of "working"). So things like adding UAPI
for TXQ feature detection, such that an automated way of constructing
the mqprio queue configuration, would be interesting but practically
useless, since existing setups don't use that.

He proposed to cut our losses and use the capability structure to
conditionally keep that code in place. The resulting taprio code would
be pretty horrible, but it might be doable.

Here is a completely untested patch below, which is intended to replace
this one. Feedback for the approach appreciated.

From 4a83f369ffce71361c90618ce2ea8ec2ee97ad78 Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Fri, 27 Jan 2023 18:16:44 +0200
Subject: [PATCH] net/sched: taprio: give higher priority to higher TCs in
 software dequeue mode

Currently taprio iterates over child qdiscs in increasing order of TXQ
index, therefore giving higher xmit priority to TXQ 0 and lower to TXQ N.
This is because that is the default prioritization scheme used for the
NICs that taprio was first written for (igb, igc), and we have a case of
two bugs canceling out, resulting in a functional setup on igb/igc, but
a less sane one on other NICs.

To the best of my understanding, we should prioritize based on the
traffic class, so we should really dequeue starting with the highest
traffic class and going down from there. We get to the TXQ using the
tc_to_txq[] netdev property.

TXQs within the same TC have the same (strict) priority, so we should
pick from them as fairly as we can. We can achieve that by implementing
something very similar to q->curband from multiq_dequeue().

Since igb/igc really do have TXQ 0 of higher hardware priority than
TXQ 1 etc, we need to preserve the behavior for them as well. Things are
made worse by the fact that in txtime-assist mode, taprio is essentially
a software scheduler towards offloaded child tc-etf qdiscs, so the TXQ
selection really does matter.

To preserve the behavior, we need a capability bit so that taprio can
determine if it's running on igb/igc, or on something else. Because igb
doesn't offload taprio at all, we can't piggyback on the
qdisc_offload_query_caps() call from taprio_enable_offload(), but
instead we need a separate call which is also made for software
scheduling.

Introduce two static keys to minimize the performance penalty on systems
which only have igb/igc NICs, and on systems which only have other NICs.
For mixed systems, taprio will have to dynamically check whether to
dequeue using one prioritization algorithm or using the other.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c |  18 ++++
 drivers/net/ethernet/intel/igc/igc_main.c |   6 +-
 include/net/pkt_sched.h                   |   1 +
 net/sched/sch_taprio.c                    | 122 ++++++++++++++++++++--
 4 files changed, 136 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 3c0c35ecea10..8eed18efc1e7 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2810,6 +2810,22 @@ static int igb_offload_txtime(struct igb_adapter *adapter,
 	return 0;
 }
 
+static int igb_tc_query_caps(struct igb_adapter *adapter,
+			     struct tc_query_caps_base *base)
+{
+	switch (base->type) {
+	case TC_SETUP_QDISC_TAPRIO: {
+		struct tc_taprio_caps *caps = base->caps;
+
+		caps->broken_mqprio = true;
+
+		return 0;
+	}
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 static LIST_HEAD(igb_block_cb_list);
 
 static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
@@ -2818,6 +2834,8 @@ static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
 	struct igb_adapter *adapter = netdev_priv(dev);
 
 	switch (type) {
+	case TC_QUERY_CAPS:
+		return igb_tc_query_caps(adapter, type_data);
 	case TC_SETUP_QDISC_CBS:
 		return igb_offload_cbs(adapter, type_data);
 	case TC_SETUP_BLOCK:
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index cce1dea51f76..166c41926be5 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -6214,10 +6214,10 @@ static int igc_tc_query_caps(struct igc_adapter *adapter,
 	case TC_SETUP_QDISC_TAPRIO: {
 		struct tc_taprio_caps *caps = base->caps;
 
-		if (hw->mac.type != igc_i225)
-			return -EOPNOTSUPP;
+		caps->broken_mqprio = true;
 
-		caps->gate_mask_per_txq = true;
+		if (hw->mac.type == igc_i225)
+			caps->gate_mask_per_txq = true;
 
 		return 0;
 	}
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index fd889fc4912b..9bf120fb2073 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -177,6 +177,7 @@ struct tc_mqprio_qopt_offload {
 struct tc_taprio_caps {
 	bool supports_queue_max_sdu:1;
 	bool gate_mask_per_txq:1;
+	bool broken_mqprio:1;
 };
 
 struct tc_taprio_sched_entry {
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 163e5244a9af..cef9b3e8d29e 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -27,6 +27,8 @@
 #include <net/tcp.h>
 
 static LIST_HEAD(taprio_list);
+static struct static_key_false taprio_have_broken_mqprio;
+static struct static_key_false taprio_have_working_mqprio;
 
 #define TAPRIO_ALL_GATES_OPEN -1
 
@@ -67,6 +69,8 @@ struct taprio_sched {
 	enum tk_offsets tk_offset;
 	int clockid;
 	bool offloaded;
+	bool detected_mqprio;
+	bool broken_mqprio;
 	atomic64_t picos_per_byte; /* Using picoseconds because for 10Gbps+
 				    * speeds it's sub-nanoseconds per byte
 				    */
@@ -78,6 +82,7 @@ struct taprio_sched {
 	struct sched_gate_list __rcu *admin_sched;
 	struct hrtimer advance_timer;
 	struct list_head taprio_list;
+	int cur_txq[TC_MAX_QUEUE];
 	u32 max_frm_len[TC_MAX_QUEUE]; /* for the fast path */
 	u32 max_sdu[TC_MAX_QUEUE]; /* for dump and offloading */
 	u32 txtime_delay;
@@ -566,17 +571,78 @@ static struct sk_buff *taprio_dequeue_from_txq(struct Qdisc *sch, int txq,
 	return skb;
 }
 
+static void taprio_next_tc_txq(struct net_device *dev, int tc, int *txq)
+{
+	int offset = dev->tc_to_txq[tc].offset;
+	int count = dev->tc_to_txq[tc].count;
+
+	(*txq)++;
+	if (*txq == offset + count)
+		*txq = offset;
+}
+
+/* Prioritize higher traffic classes, and select among TXQs belonging to the
+ * same TC using round robin
+ */
+static struct sk_buff *taprio_dequeue_tc_priority(struct Qdisc *sch,
+						  struct sched_entry *entry,
+						  u32 gate_mask)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	int num_tc = netdev_get_num_tc(dev);
+	struct sk_buff *skb;
+	int tc;
+
+	for (tc = num_tc - 1; tc >= 0; tc--) {
+		int first_txq = q->cur_txq[tc];
+
+		if (!(gate_mask & BIT(tc)))
+			continue;
+
+		do {
+			skb = taprio_dequeue_from_txq(sch, q->cur_txq[tc],
+						      entry, gate_mask);
+
+			taprio_next_tc_txq(dev, tc, &q->cur_txq[tc]);
+
+			if (skb)
+				return skb;
+		} while (q->cur_txq[tc] != first_txq);
+	}
+
+	return NULL;
+}
+
+/* Broken way of prioritizing smaller TXQ indices and ignoring the traffic
+ * class other than to determine whether the gate is open or not
+ */
+static struct sk_buff *taprio_dequeue_txq_priority(struct Qdisc *sch,
+						   struct sched_entry *entry,
+						   u32 gate_mask)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct sk_buff *skb;
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		skb = taprio_dequeue_from_txq(sch, i, entry, gate_mask);
+		if (skb)
+			return skb;
+	}
+
+	return NULL;
+}
+
 /* Will not be called in the full offload case, since the TX queues are
  * attached to the Qdisc created using qdisc_create_dflt()
  */
 static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
-	struct net_device *dev = qdisc_dev(sch);
 	struct sk_buff *skb = NULL;
 	struct sched_entry *entry;
 	u32 gate_mask;
-	int i;
 
 	rcu_read_lock();
 	entry = rcu_dereference(q->current_entry);
@@ -586,14 +652,20 @@ static struct sk_buff *taprio_dequeue(struct Qdisc *sch)
 	 * "AdminGateStates"
 	 */
 	gate_mask = entry ? entry->gate_mask : TAPRIO_ALL_GATES_OPEN;
-
 	if (!gate_mask)
 		goto done;
 
-	for (i = 0; i < dev->num_tx_queues; i++) {
-		skb = taprio_dequeue_from_txq(sch, i, entry, gate_mask);
-		if (skb)
-			goto done;
+	if (static_branch_unlikely(&taprio_have_broken_mqprio) &&
+	    !static_branch_likely(&taprio_have_working_mqprio)) {
+		skb = taprio_dequeue_txq_priority(sch, entry, gate_mask);
+	} else if (static_branch_likely(&taprio_have_working_mqprio) &&
+		   !static_branch_unlikely(&taprio_have_broken_mqprio)) {
+		skb = taprio_dequeue_tc_priority(sch, entry, gate_mask);
+	} else {
+		if (q->broken_mqprio)
+			skb = taprio_dequeue_txq_priority(sch, entry, gate_mask);
+		else
+			skb = taprio_dequeue_tc_priority(sch, entry, gate_mask);
 	}
 
 done:
@@ -1223,6 +1295,34 @@ taprio_mqprio_qopt_reconstruct(struct net_device *dev,
 	}
 }
 
+static void taprio_detect_broken_mqprio(struct taprio_sched *q)
+{
+	struct net_device *dev = qdisc_dev(q->root);
+	struct tc_taprio_caps caps;
+
+	qdisc_offload_query_caps(dev, TC_SETUP_QDISC_TAPRIO,
+				 &caps, sizeof(caps));
+
+	q->broken_mqprio = caps.broken_mqprio;
+	if (q->broken_mqprio)
+		static_branch_inc(&taprio_have_broken_mqprio);
+	else
+		static_branch_inc(&taprio_have_working_mqprio);
+
+	q->detected_mqprio = true;
+}
+
+static void taprio_cleanup_broken_mqprio(struct taprio_sched *q)
+{
+	if (!q->detected_mqprio)
+		return;
+
+	if (q->broken_mqprio)
+		static_branch_dec(&taprio_have_broken_mqprio);
+	else
+		static_branch_dec(&taprio_have_working_mqprio);
+}
+
 static int taprio_enable_offload(struct net_device *dev,
 				 struct taprio_sched *q,
 				 struct sched_gate_list *sched,
@@ -1588,10 +1688,12 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		err = netdev_set_num_tc(dev, mqprio->num_tc);
 		if (err)
 			goto free_sched;
-		for (i = 0; i < mqprio->num_tc; i++)
+		for (i = 0; i < mqprio->num_tc; i++) {
 			netdev_set_tc_queue(dev, i,
 					    mqprio->count[i],
 					    mqprio->offset[i]);
+			q->cur_txq[i] = mqprio->offset[i];
+		}
 
 		/* Always use supplied priority mappings */
 		for (i = 0; i <= TC_BITMASK; i++)
@@ -1742,6 +1844,8 @@ static void taprio_destroy(struct Qdisc *sch)
 
 	if (admin)
 		call_rcu(&admin->rcu, taprio_free_sched_cb);
+
+	taprio_cleanup_broken_mqprio(q);
 }
 
 static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
@@ -1806,6 +1910,8 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
 		q->qdiscs[i] = qdisc;
 	}
 
+	taprio_detect_broken_mqprio(q);
+
 	return taprio_change(sch, opt, extack);
 }
 
-- 
2.34.1

> Also Tx Launch Time only works for the lower queues. Hm.

Lower TXQ numbers or higher priorities? Not that it makes a difference,
just curious.

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH net-next 05/15] net/sched: taprio: give higher priority to higher TCs in software dequeue mode
  2023-01-29 13:17     ` Vladimir Oltean
@ 2023-01-31  9:37       ` Kurt Kanzenbach
  0 siblings, 0 replies; 34+ messages in thread
From: Kurt Kanzenbach @ 2023-01-31  9:37 UTC (permalink / raw)
  To: Vladimir Oltean; +Cc: netdev, Vinicius Costa Gomes

On Sun Jan 29 2023, Vladimir Oltean wrote:
> Anyway, I also need to be very realistic about what's possible for me to
> change and how far I'm willing to go, and Vinicius made it pretty clear
> that the existing taprio/mqprio configurations should be kept "working"
> (given an arbitrary definition of "working"). So things like adding UAPI
> for TXQ feature detection, to enable an automated way of constructing
> the mqprio queue configuration, would be interesting but practically
> useless, since existing setups don't use that.
>
> He proposed to cut our losses and use the capability structure to
> conditionally keep that code in place. The resulting taprio code would
> be pretty horrible, but it might be doable.
>
> Here is a completely untested patch below, which is intended to replace
> this one. Feedback for the approach appreciated.

Yeah, I think Vinicius has to ack or nack this.

Thanks,
Kurt

^ permalink raw reply	[flat|nested] 34+ messages in thread
